Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs

arXiv cs.CL Papers

Summary

This paper investigates whether LLMs can infer individual domain knowledge from long-term Slack logs, comparing seven models and finding Gemini 2.5 Flash achieves the lowest error, highlighting feasibility and limits of automated expertise mapping.

arXiv:2605.22971v1 Announce Type: new Abstract: Employees often struggle to identify ``who knows what,'' leading to organizational productivity losses. We investigate whether Large Language Models (LLMs) can infer individual domain knowledge directly from long-term Slack logs. Analyzing 27,188 messages from 43 users, we evaluated seven models (including Gemini, Claude, and GPT families) by comparing their zero-shot estimates against self-reported skill ratings from 27 participants. Gemini 2.5 Flash achieved the lowest error (MAE 21.13%), while GPT models showed significantly larger discrepancies. Notably, estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference. These findings demonstrate the feasibility and current limits of automated expertise mapping, highlighting the need for privacy-preserving deployments and richer, structure-aware representations of human knowledge.

Cached at: 05/25/26, 08:56 AM

# Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs
Source: [https://arxiv.org/html/2605.22971](https://arxiv.org/html/2605.22971)

###### Abstract\.

Employees often struggle to identify “who knows what,” leading to organizational productivity losses\. We investigate whether Large Language Models \(LLMs\) can infer individual domain knowledge directly from long\-term Slack logs\. Analyzing 27,188 messages from 43 users, we evaluated seven models \(including Gemini, Claude, and GPT families\) by comparing their zero\-shot estimates against self\-reported skill ratings from 27 participants\. Gemini 2\.5 Flash achieved the lowest error \(MAE 21\.13%\), while GPT models showed significantly larger discrepancies\. Notably, estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference\. These findings demonstrate the feasibility and current limits of automated expertise mapping, highlighting the need for privacy\-preserving deployments and richer, structure\-aware representations of human knowledge\.

Communication Log, Large Language Models, Cognitive Augmentation, Knowledge Elicitation

![Refer to caption](https://arxiv.org/html/2605.22971v1/src/imgs/teaser.png)

Figure 1\. Concept image of this study\. This study aims to estimate a person’s domain knowledge from their communication logs with LLMs\. We use Slack communication logs as input to estimate and visualize human domain knowledge using LLMs\.

## 1\.Introduction

Imagine you are a new employee at a company\. You join a group full of experienced colleagues\. Some know specific software technologies well, and others know the credentials for a project\. Because you are new, you are not sure who the best person to ask is\. In the end, you may have to ask various people until you find the right answer\. The same can happen to a new student at a university or in a laboratory\. This may not sound like a major problem, but the costs reported below suggest that it is\.

Internal knowledge sharing in a company, so\-called intranet sharing, has a high failure rate\. International Data Corporation \(IDC\) reported in a 2017 white paper that Fortune 500 companies lose at least 31\.5 billion USD a year \(Zohuri and Mossavar\-Rahmani, [2019](https://arxiv.org/html/2605.22971#bib.bib1); Trippe, [2022](https://arxiv.org/html/2605.22971#bib.bib2); West, [2018](https://arxiv.org/html/2605.22971#bib.bib3)\)\. McKinsey also reported that 1\.8 hours \(19%\) of total work hours are spent searching for information or for a person to ask for help \(Bughin et al\., [2012](https://arxiv.org/html/2605.22971#bib.bib4)\)\. For the onboarding phase in particular, Panopto and YouGov \([2018](https://arxiv.org/html/2605.22971#bib.bib5)\) reported that new employees can fully catch up and work independently within 6 months, at an estimated cost of 253 thousand USD\. As these reports suggest, knowledge hidden inside an organization incurs high economic costs worldwide\.

LLMs leverage their strengths in processing and generating complex language, rapidly becoming versatile tools across various fields \(Rekimoto, [2025](https://arxiv.org/html/2605.22971#bib.bib27); Zhou et al\., [2025](https://arxiv.org/html/2605.22971#bib.bib28); Salminen et al\., [2025](https://arxiv.org/html/2605.22971#bib.bib29); Oomori et al\., [2024](https://arxiv.org/html/2605.22971#bib.bib30); Suzawa et al\., [2025](https://arxiv.org/html/2605.22971#bib.bib39)\)\. In fields such as healthcare \(Liu et al\., [2025](https://arxiv.org/html/2605.22971#bib.bib36); Takita et al\., [2025](https://arxiv.org/html/2605.22971#bib.bib37)\) and education \(Morita et al\., [2025](https://arxiv.org/html/2605.22971#bib.bib26); Chen et al\., [2024](https://arxiv.org/html/2605.22971#bib.bib34); Yamaoka et al\., [2025](https://arxiv.org/html/2605.22971#bib.bib38), [2023](https://arxiv.org/html/2605.22971#bib.bib44)\), they contribute to organizing and interpreting vast amounts of information \(Yang et al\., [2024](https://arxiv.org/html/2605.22971#bib.bib31)\), from analyzing clinical records and massive medical datasets to generating human\-understandable summaries and responses with performance often approaching human levels \(Mumtaz et al\., [2023](https://arxiv.org/html/2605.22971#bib.bib6); Clusmann et al\., [2023](https://arxiv.org/html/2605.22971#bib.bib7)\)\.
Within organizations, interest is growing in utilizing LLMs to quantify and share individual expertise \(Zhang et al\., [2024b](https://arxiv.org/html/2605.22971#bib.bib13); Kernan Freire et al\., [2023](https://arxiv.org/html/2605.22971#bib.bib32); Wu et al\., [2025](https://arxiv.org/html/2605.22971#bib.bib33)\)\. LLMs are also being used to transform the content of communications by replacing or refining human utterances \(Gu et al\., [2021](https://arxiv.org/html/2605.22971#bib.bib41); Galimzhanova et al\., [2023](https://arxiv.org/html/2605.22971#bib.bib42); Zhang et al\., [2022](https://arxiv.org/html/2605.22971#bib.bib43)\), such as by summarizing long multi\-participant chat threads into concise highlights \(Kosilova and Birzniece, [2024](https://arxiv.org/html/2605.22971#bib.bib9)\) and rephrasing a user’s statements in real time to better fit the audience or context \(Kumar et al\., [2025](https://arxiv.org/html/2605.22971#bib.bib10)\)\. These emerging applications highlight how LLMs can measure human knowledge by extracting and reformatting information for improved understanding\.

In this study, we investigate whether an individual’s domain knowledge can be estimated from chat logs with LLMs\. Our study uses Slack \([https://slack\.com/](https://slack.com/)\) organizational communication logs as input to estimate and visualize human domain knowledge using LLMs\. To evaluate performance, we conduct a user self\-annotation task after the LLM estimation, asking users to rate their level of understanding of the domain knowledge extracted by the system\. By analyzing the gap between the domain knowledge estimated by various LLMs and the users’ self\-annotated domain knowledge, we verify how well Artificial Intelligence \(AI\) performs in domain knowledge estimation from chat/communication logs\. Our key research questions \(RQs\) are as follows:

- RQ1: How precisely can LLMs estimate human domain knowledge?
- RQ2: Which LLM provides the most accurate knowledge estimation?
- RQ3: How does the amount of communication logs impact the accuracy of LLM domain knowledge estimation?

This study contributes to the development of a semi\-automated ecosystem for estimating human domain knowledge\. We envision organizations managing knowledge through daily activities, such as chat communication, to map team members’ domain knowledge\.

Table 1\. Comparison of related works against our proposed work\. Our work is most closely related to Zhang et al\. \([2024b](https://arxiv.org/html/2605.22971#bib.bib13)\); however, that study is a survey paper that has not conducted any practical data analysis\. Hence, our study is the first to estimate domain knowledge via chat\-based LLMs\.

## 2\.Related Work

[Table 1](https://arxiv.org/html/2605.22971#S1.T1) shows the related work and the position of this study\. This section explains what related work exists in this field and how our work differs from it\.

### 2\.1\.Survey on Organizational Chat Conversation Analysis

Kosilova and Birzniece \([2024](https://arxiv.org/html/2605.22971#bib.bib9)\) conducted a large\-scale survey of organizational chat conversation analysis\. Because the study is a survey paper, it does not apply an actual dataset to a practical case\. The survey selected 16 papers and concluded that the domain significantly impacts the performance of knowledge elicitation, with medicine and software development often being particularly difficult\. Zhang et al\. \([2024b](https://arxiv.org/html/2605.22971#bib.bib13)\) present a comprehensive survey of conversation analysis in the era of LLMs, formalizing it as a four\-stage process encompassing scene reconstruction, causality analysis, skill enhancement, and conversation generation\. Their work highlights the field’s fragmentation, noting that existing studies predominantly address shallow subtasks such as emotion or intent classification while lacking deeper reasoning about conversational dynamics\. They further identify the need for benchmarks and methods that capture goal\-directed, multi\-turn conversational behavior, underscoring substantial gaps between current research and real\-world applications\.

Communication logs for knowledge elicitation have been discussed in survey papers; however, none of these studies actually uses them in a practical case\.

### 2\.2\.Practical Case of Knowledge Elicitation from Communication Logs

Huang et al\. \([2007](https://arxiv.org/html/2605.22971#bib.bib14)\) propose a cascaded framework for automatically extracting high\-quality <thread\-title, reply\> pairs from online discussion forums as chatbot knowledge\. By combining SVM\-based relevant\-reply identification with ranking SVMs to select informative, concise, and trustworthy responses, the method effectively filters noisy forum content and surfaces reusable conversational knowledge\. Experiments on a large movie forum demonstrate that the approach yields high\-precision chatbot response pairs, substantially outperforming baseline methods\.

Wang and Chen \([2024](https://arxiv.org/html/2605.22971#bib.bib11)\) introduce the concept of Human\-AI mutual learning, where AI and humans learn from each other\. The unique point of this study is that explainable AI is used to provide transparency into how AI acquires new knowledge and to return the knowledge elicitation flow to humans\. Because the paper is a position paper, it does not apply an actual dataset to a practical case\.

Zhang et al\. \([2024a](https://arxiv.org/html/2605.22971#bib.bib12)\) propose Knowledge Elicitation and Retrieval \(KEAR\), an LLM\-enabled knowledge elicitation and retrieval framework for zero\-shot cross\-lingual stance detection, addressing the challenge of transferring stance\-relevant reasoning across languages with no target\-language training data\. The method elicits background, inference, and explanation knowledge from LLM reasoning, verifies them via multi\-agent collaboration, and retrieves the most relevant knowledge through a hierarchical cross\-lingual retriever\. Experiments on multilingual benchmarks show that KEAR significantly outperforms competitive zero\-shot and even supervised Cross\-Lingual Stance Detection \(CLSD\) methods, demonstrating the effectiveness of LLM\-derived inferential knowledge for bridging language gaps\.

Arsovski et al\. \([2019](https://arxiv.org/html/2605.22971#bib.bib16)\) present a methodology for automatically extracting conversational knowledge from existing rule\-based chatbots by sending large\-scale question inputs and identifying the stable set of unique response rules they contain\. The authors demonstrate that chatbot knowledge converges after sufficient probing and further validate this saturation point through K\-means clustering over extracted responses\. Using the obtained knowledge, they train a seq2seq neural conversational agent that reproduces the original chatbot’s behavior, achieving high BLEU similarity and demonstrating effective machine\-to\-machine knowledge transfer\.

To our knowledge, no prior work combines LLMs with chat \(Slack\) logs for knowledge extraction, and that combination is the primary focus of this study\.

![Refer to caption](https://arxiv.org/html/2605.22971v1/src/imgs/architecture.png)

Figure 2\. Overview of the proposed workflow\. The first step is to export Slack communication logs as JSON files\. Then, the backend server reads the JSON files and uses them to generate a prompt for each request to the LLMs\. The LLMs generate an estimate of domain knowledge based on the system prompt and the user’s messages\. The estimated domain knowledge is then stored in the Firebase Cloud Database \(Google, [2025c](https://arxiv.org/html/2605.22971#bib.bib25)\)\. The frontend web application reads the cloud database and displays each user’s domain knowledge\. When a user logs in to the web application, they can see their own domain knowledge\. Finally, the user can add their own annotations for each estimated domain of knowledge\.

## 3\.Methodology

[Figure 2](https://arxiv.org/html/2605.22971#S2.F2) shows the overview of the proposed workflow\. In this section, we explain each component of the system architecture in detail\.

### 3\.1\.Statistics of Communication Log Dataset

In this study, we use the Slack communication logs\. Data was collected from April 30th, 2017, to November 4th, 2024 \(2744 days\)\. The dataset contains communication logs from a company, totaling 27,188 messages\. There were 43 users in the chats and 94 channels\. We will explain the process of selecting specific data from this raw data and provide a detailed introduction to the data format\.

![Refer to caption](https://arxiv.org/html/2605.22971v1/x1.png)

Figure 3\. Number of messages per user \(UID\)\. The quantity varies among users, so we examine the impact of this variation on estimation performance\. The largest number of messages is generated by UID 0, and the smallest by UID 26\.

#### 3\.1\.1\.Selected Data Volume and Statistics

Among the 43 users, we selected 27 as the target participants for the study\. This is mainly due to contact availability: some candidates could not be reached\. [Figure 3](https://arxiv.org/html/2605.22971#S3.F3) shows the number of messages per user \(UID\)\. The largest number of messages is 10,819, by UID 0, and the smallest is 3, by UID 26\. The mean message count is 792, and the median is 208\. Although message volume varies significantly across users, the logs capture a broad range of user information, allowing us to compare message volume with inference accuracy, which is a focus of this research\.

#### 3\.1\.2\.Data Structure

[Listing 1](https://arxiv.org/html/2605.22971#LST1) shows the structure of a Slack message entry\. The data is stored in JSON format\. The message body is stored in the “text” field, and users who reacted to the message are also visible\. [Listing 2](https://arxiv.org/html/2605.22971#LST2) shows the structure of a Slack channel join event message\. This message is not a user message, but a system message\. The user who joined the channel is visible in the “user” field\. We use this message to determine whether the user is in the channel\. This is an important point because we wanted to infer domain knowledge not only from the user’s own utterances, but also from passive information the user observed during the conversation\.

Listing 1: Raw data of a Slack message entry\. Some parts are anonymized\.

    {
      "user": "UID0",
      "type": "message",
      "ts": "1683702597.263009",
      "client_msg_id": "...",
      "text": "... A hint for surveys: If you want to search related work around our research field, please search \"keywords + conference name\". CHI, ETRA, UbiComp, ISWC, UIST, CVPR, AHs, SIGGRAPH, ...",
      "user_profile": {...},
      "thread_ts": "1683702597.263009",
      "reply_count": 2,
      "replies": [{"user": "UID1", "ts": "1683704080.511989"}, ...],
      "reactions": [{"name": "+1", "users": ["UID2", "UID3"], "count": 2}],
      "attachments": [{"from_url": "...", "message_blocks": [...]}],
      "blocks": [{"type": "rich_text", ...}]
    }

Listing 2: Raw data of a Slack channel join event message\.

    {
      "subtype": "channel_join",
      "user": "UID4",
      "text": "<@UID4> has joined the channel",
      "type": "message",
      "ts": "1493555632.223680"
    }
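The distinction between ordinary messages and channel join events can be sketched as follows\. This is a minimal illustration, not the authors’ code: the function name `split_events` and the two sample records are assumptions, and the only fields used are those shown in the listings \(`type`, `subtype`, `user`, `text`\)\.

```python
import json


def split_events(messages):
    """Separate channel_join system events from ordinary user messages.

    Returns (joined_users, user_messages). Records with any other subtype
    are ignored in this sketch.
    """
    joined_users, user_messages = set(), []
    for msg in messages:
        if msg.get("type") != "message":
            continue
        if msg.get("subtype") == "channel_join":
            joined_users.add(msg["user"])   # user is present in the channel
        elif "subtype" not in msg:
            user_messages.append(msg)       # ordinary message written by a user
    return joined_users, user_messages


# Two illustrative records modelled on the listings (not real data).
raw = json.loads("""[
  {"subtype": "channel_join", "user": "UID4",
   "text": "<@UID4> has joined the channel",
   "type": "message", "ts": "1493555632.223680"},
  {"user": "UID0", "type": "message", "ts": "1683702597.263009",
   "text": "A hint for surveys"}
]""")

members, messages = split_events(raw)
print(sorted(members), len(messages))
```

Keeping the join events in a separate set makes it possible to treat a user as an observer of a channel even when they never posted there, which matches the stated goal of using passively observed information\.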

### 3\.2\.Selection of Large Language Models

In this study, we selected the following LLMs for the evaluation: Claude Haiku 4\.5 \(Anthropic, [2025a](https://arxiv.org/html/2605.22971#bib.bib18)\), Claude Sonnet 4\.5 \(Anthropic, [2025b](https://arxiv.org/html/2605.22971#bib.bib19)\), Gemini 2\.5 Flash \(Google, [2025a](https://arxiv.org/html/2605.22971#bib.bib20)\), Gemini 2\.5 Pro \(Google, [2025b](https://arxiv.org/html/2605.22971#bib.bib21)\), GPT 4o \(OpenAI, [2024](https://arxiv.org/html/2605.22971#bib.bib22)\), GPT o3 \(OpenAI, [2025b](https://arxiv.org/html/2605.22971#bib.bib23)\), and GPT 5 \(OpenAI, [2025a](https://arxiv.org/html/2605.22971#bib.bib24)\)\. We selected models that were accessible through an API as of October 2025\.

##### Claude Haiku 4\.5:

The model is Anthropic’s small, fast hybrid\-reasoning large language model, designed to deliver near\-frontier performance for coding, tool use, and computer control with much higher cost\- and latency\-efficiency than larger Claude models\. It is trained on a filtered mixture of public web data up to February 2025, licensed and partner datasets, opt\-in user data, and synthetic data\. It supports extended\-thinking mode and a 200k\-token context window for complex, multi\-agent workflows\. Extensive internal and third\-party evaluations indicate substantially improved alignment and robustness over Claude Haiku 3\.5, strong safeguards against misuse in agentic scenarios, and CBRN capabilities below AI Safety Level\-3 thresholds, leading to deployment under Anthropic’s ASL\-2 standard\.

##### Claude Sonnet 4\.5:

The model is Anthropic’s latest hybrid\-reasoning large language model, with an extended\-thinking mode and state\-of\-the\-art performance on software engineering, long\-horizon agentic workflows, and real\-world computer\-use tasks, while also improving general reasoning and mathematics\. Extensive pre\-deployment evaluations covering safeguards, agentic safety, cybersecurity, reward hacking, alignment, and model welfare show substantially improved safety and honesty, leading to deployment under Anthropic’s AI Safety Level 3 standard\. Taken together, these results position Claude Sonnet 4\.5 as Anthropic’s primary high\-intelligence model, combining cutting\-edge coding and agent capabilities with conservative, policy\-driven safety controls suitable for safety\-critical and scientific applications\.

##### Gemini 2\.5 Flash:

The model is a next\-generation lightweight Gemini model that lets users control its “thinking budget” to trade off reasoning depth against latency and cost for high\-throughput applications\. It is natively multimodal, jointly processing text, audio, images, and video, with a 1\-million\-token context window that enables exploration of huge datasets and long\-horizon interactions in a single session\. The model also supports native audio outputs and seamless switching among 24 languages with the same voice, making it suitable for expressive, interactive, and globally deployed AI systems\.

##### Gemini 2\.5 Pro:

The model is a reasoning model, designed to solve complex problems and to understand vast datasets spanning text, code, audio, images, video, and even entire code repositories\. It is a multimodal model deployed on Vertex AI with a 1,048,576\-token input window and 65,535\-token output capacity, supporting a wide range of enterprise and research workloads\. The model exposes advanced capabilities, such as tool use \(e\.g\., code execution and RAG with the Vertex AI RAG engine\), system instructions, function calling, and structured outputs, making it suitable as a general\-purpose backbone for sophisticated AI agents\.

##### GPT 4o:

The model is an end\-to\-end multimodal model capable of processing and generating text, audio, and images within a single unified neural architecture\. It delivers human\-like response latency in speech interactions and achieves GPT\-4\-level performance while improving speed, multilingual ability, and vision/audio understanding\. The model incorporates extensive safety evaluations and mitigations, including red\-teaming, content filtering, and controls for unauthorized voice generation\.

##### GPT o3:

The model is a reasoning\-focused model that combines advanced chain\-of\-thought reinforcement learning with full tool capabilities, enabling strong performance in math, coding, scientific analysis, and multimodal tasks\. It enhances safety through deliberative alignment and an instruction hierarchy that prioritizes system\-level constraints\. Extensive evaluations show improved robustness against harmful content, jailbreaks, and ungrounded inferences, while maintaining state\-of\-the\-art reasoning and tool\-use efficiency\.

##### GPT 5:

The model is a unified AI system that combines a fast, high\-throughput model, a deeper GPT\-5 thinking reasoning model, and a real\-time router that chooses between them based on task complexity, tool needs, and user intent\. It delivers state\-of\-the\-art performance across coding, math, writing, health, and visual perception, while reducing hallucinations and improving instruction\-following and overall usefulness in real\-world ChatGPT queries\. GPT\-5 further introduces output\-centric “safe\-completions” training, extensive red\-teaming, and biological/cybersecurity preparedness safeguards to limit harmful use, decrease sycophancy and deception, and better handle dual\-use and safety\-critical scenarios\.

### 3\.3\.Domain Knowledge Extraction Workflow

In this study, we provide a Command Line Interface \(CLI\) pipeline that extracts user domain knowledge from Slack archives\. It starts by loading environment variables for three LLM providers \(OpenAI, Claude, Gemini\) and by exposing helpers that ingest Slack member metadata, allowing filtering by billing status or activity\.

The CLI relies on argparse to capture user IDs, filter settings, dataset roots, the output path, and the target model\. Before any heavy computation, it performs lightweight connection checks against the configured LLM APIs \(OpenAI, Claude, Gemini\) to fail fast if credentials are missing or invalid\. The script then enumerates Slack channel subdirectories\. It iterates over the target users, skips those without contributions \(channels where the user is absent\) or those with already materialized outputs, and dispatches the collected text to the appropriate chunked API helper\. Outputs are emitted on a per\-user, per\-channel basis, accompanied by console\-level progress cues\.
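An argparse front end of this kind might look like the sketch below\. The paper names only the categories of arguments, so every flag name and default here \(`--user-ids`, `--data-root`, `--output`, `--model`, `--active-only`\) is a hypothetical choice for illustration, not the authors’ interface\.

```python
import argparse


def build_parser():
    """Build a parser covering the argument categories named in the text.

    All flag names and defaults are hypothetical.
    """
    parser = argparse.ArgumentParser(
        description="Estimate domain knowledge from a Slack export")
    parser.add_argument("--user-ids", nargs="+", required=True,
                        help="target Slack user IDs")
    parser.add_argument("--data-root", default="./slack_export",
                        help="directory with one subdirectory per channel")
    parser.add_argument("--output", default="./results",
                        help="where per-user, per-channel results are written")
    parser.add_argument("--model", default="gpt-4o",
                        help="name of the target model")
    parser.add_argument("--active-only", action="store_true",
                        help="example member filter (hypothetical)")
    return parser


# Parse an explicit argument list so the sketch runs without a shell.
args = build_parser().parse_args(
    ["--user-ids", "UID0", "UID1", "--model", "o3"])
print(args.user_ids, args.model, args.active_only)
```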

The domain knowledge estimated for each user in each channel was then averaged across channels to visualize the user’s knowledge level\. We now describe these processes in greater detail\.
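One way to read the averaging step is sketched below\. The paper does not specify how terms from different channels are matched or weighted, so this sketch assumes exact string matching on the extracted term and an unweighted mean of the 0/1/2 levels; the function name `average_levels` and the sample data are illustrative\.

```python
from collections import defaultdict
from statistics import mean


def average_levels(per_channel):
    """Average the 0/1/2 levels of each term across channels for one user.

    per_channel maps a channel name to a list of {"text", "level"} entries.
    Terms are matched by exact string equality (an assumption).
    """
    buckets = defaultdict(list)
    for entries in per_channel.values():
        for entry in entries:
            buckets[entry["text"]].append(entry["level"])
    return {term: mean(levels) for term, levels in buckets.items()}


result = average_levels({
    "general": [{"text": "Python", "level": 2},
                {"text": "Firebase", "level": 1}],
    "dev": [{"text": "Python", "level": 1}],
})
print(result)
```

In this toy input, “Python” appears in two channels with levels 2 and 1 and therefore averages to 1\.5, while “Firebase” appears once and keeps its level of 1\.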

#### 3\.3\.1\.Model\-dependent token parameters\.

Different OpenAI models expose different API parameters for controlling output length\. We therefore explicitly distinguish between models that use a traditional `max_tokens` parameter and models that instead use `max_completion_tokens`\. In particular, we treat “o\-series” models \(whose names start with `o` followed by a digit, e\.g\., `o3`, `o3-mini`, `o4-mini`\) and the GPT\-5 family \(e\.g\., `gpt-5`, `gpt-5-pro`\) as using `max_completion_tokens`, whereas all remaining OpenAI chat models are treated as using `max_tokens`\. A helper function automatically checks the normalized model name and chooses the appropriate parameter\. For o\-series and GPT\-5 models, we also acknowledge that some variants effectively support only the default sampling behaviour, and therefore avoid setting a temperature parameter when it is not meaningful \(e\.g\., for purely deterministic configurations\)\. For Anthropic Claude, we call the messages endpoint with an explicit upper bound on generated tokens \(`max_tokens = 4096`\), regardless of the context window\. For Google Gemini, we control sampling via a generation configuration that includes the temperature but leaves the maximum number of generated tokens implicit\. At the same time, we explicitly constrain the total prompt size to stay within the model\-specific context window\.
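The naming rule for OpenAI models can be sketched as a small helper\. This is an illustration of the rule as described, not the authors’ helper: the function names and the normalization \(trim and lowercase\) are assumptions\.

```python
import re


def uses_max_completion_tokens(model: str) -> bool:
    """True for o-series names ('o' followed by a digit) and the GPT-5 family."""
    name = model.strip().lower()          # assumed normalization
    return bool(re.match(r"o\d", name)) or name.startswith("gpt-5")


def length_kwargs(model: str, limit: int) -> dict:
    """Return the output-length keyword argument appropriate for the model."""
    if uses_max_completion_tokens(model):
        return {"max_completion_tokens": limit}
    return {"max_tokens": limit}


for name in ["o3", "o4-mini", "gpt-5-pro", "gpt-4o"]:
    print(name, length_kwargs(name, 1024))
```

Returning a dictionary lets the caller merge the result into the request arguments without branching on the model name at each call site\.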

Figure 4\. LLM prompt template used to estimate domain knowledge from Slack logs during evaluation\. The prompt outlines the context, task instructions, and JSON output format provided to the large language models\.

Prompt Template:

    You are an expert in analyzing and estimating a user's domain knowledge based on
    log data. Focus specifically on the "TARGETUSER" to analyze their knowledge level
    based on "INPUTDATA". The "TARGETUSER" corresponds to "user" in the "INPUTDATA".

    Instructions:
    - Extract domain knowledge by analyzing the "text" fields for the target user.
    - In the output, list proper nouns related to skills, domains, or key terms
      (e.g., technology, methods, or concepts).
    - For each extracted item, classify the knowledge level:
      - 2 (Known): Strong evidence the user knows this.
      - 1 (Maybe known): Some evidence, moderate confidence.
      - 0 (Unknown): Insufficient evidence of knowledge.
    - For each item, give a brief reason for your classification based on INPUTDATA.

    INPUTDATA:
    {chunk}

Example Output JSON:

    "{target_user_id}": {{
        "text": "Extracted proper noun or verb from the text in INPUTDATA",
        "level": 2 (Known), 1 (Maybe known), or 0 (Unknown),
        "reason": "Brief explanation for why this knowledge level was assigned based
                   on the INPUTDATA"
    }}

#### 3\.3\.2\.Model\-specific context windows\.

To handle long JSON logs, we approximate each model’s maximum context length and derive a per\-chunk budget of input tokens\. For OpenAI models, we maintain a table of approximate maximum context sizes \(in tokens\) for representative models \(e\.g\., `gpt-4o`, `gpt-5`, and `o3`: approximately 128,000 tokens\)\. If the user does not specify a context window, we infer it by partially matching the normalized model name against this table; if no entry matches, a default of 4,096 tokens is used\. For Claude models, we assume a large context window of 200,000 tokens for several recent variants \(e\.g\., `claude-sonnet-4-5` and `claude-haiku-4-5`\)\. For Gemini, we similarly maintain approximate context windows of 32,768 tokens for both the `gemini-2-5-pro` and `gemini-2-5-flash` models\. As with OpenAI, if no explicit limit is given, the code infers a maximum from this table based on partial matches of the model name\.
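The lookup described above can be sketched as follows\. The table values are the ones stated in the text; the function name, the normalization \(lowercasing and replacing dots with hyphens\), the longest\-key\-first matching order, and applying the 4,096\-token default to every provider are assumptions made for this illustration\.

```python
# Approximate context windows (tokens) as stated in the text.
CONTEXT_WINDOWS = {
    "gpt-4o": 128_000,
    "gpt-5": 128_000,
    "o3": 128_000,
    "claude-sonnet-4-5": 200_000,
    "claude-haiku-4-5": 200_000,
    "gemini-2-5-pro": 32_768,
    "gemini-2-5-flash": 32_768,
}
DEFAULT_WINDOW = 4_096


def infer_context_window(model, explicit=None):
    """Return the explicit limit if given, else a partial-match table lookup."""
    if explicit is not None:
        return explicit
    name = model.strip().lower().replace(".", "-")   # assumed normalization
    # Try longer keys first so the most specific entry wins.
    for key in sorted(CONTEXT_WINDOWS, key=len, reverse=True):
        if key in name:
            return CONTEXT_WINDOWS[key]
    return DEFAULT_WINDOW


print(infer_context_window("gpt-4o-2024-08-06"))
print(infer_context_window("Gemini-2.5-Flash"))
print(infer_context_window("some-unlisted-model"))
```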

#### 3\.3\.3\.Token counting and chunking strategy\.

To split large JSON logs into model\-compatible pieces, we approximate token counts using thecl100k\_basetokenizer\. Although this tokenizer is native to OpenAI models, we also use it as a conservative approximation for Claude and Gemini\. LetTmaxT\_\{\\text\{max\}\}denote the assumed maximum context length for a given model,TsysT\_\{\\text\{sys\}\}the number of tokens occupied by the fixed system prompt, andTresT\_\{\\text\{res\}\}a reserved budget of tokens left free for the model’s generated output \(defaultTres=500T\_\{\\text\{res\}\}=500\)\. We then define an effective upper bound on the total tokens we are willing to use for each request by applying a safety factors∈\(0,1\)s\\in\(0,1\)to the context window:

Teff=⌊s⋅Tmax⌋\.T\_\{\\text\{eff\}\}=\\lfloor s\\cdot T\_\{\\text\{max\}\}\\rfloor\.The per\-chunk budget for the user content is then\.

Tchunk=Teff−Tsys−Ttmpl−Tres,T\_\{\\text\{chunk\}\}=T\_\{\\text\{eff\}\}\-T\_\{\\text\{sys\}\}\-T\_\{\\text\{tmpl\}\}\-T\_\{\\text\{res\}\},whereTtmplT\_\{\\text\{tmpl\}\}accounts for any fixed tokens in the user message template \(e\.g\., headers such as “TARGETUSER” and “INPUTDATA”\)\. For OpenAI models, we typically uses=0\.75s=0\.75; for Claude, we use a more conservative safety factor \(e\.g\.,s≈0\.65s\\approx 0\.65\) to compensate for tokenizer differences and message overhead; for Gemini, we again uses=0\.75s=0\.75\.

Given an input log serialized as a JSON string, we encode it into tokens with the approximate tokenizer, compute the total number of tokens $T_{\text{input}}$, and split it into

$$N_{\text{chunks}} = \left\lceil \frac{T_{\text{input}}}{T_{\text{chunk}}} \right\rceil$$

contiguous segments. Each segment is then decoded back into text and passed as the “INPUTDATA” for a separate model call. If the user additionally specifies a hard cap $N_{\text{max}}$ on the number of chunks, we process at most $\min(N_{\text{chunks}}, N_{\text{max}})$ segments and log a warning if the cap is reached.
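The splitting step can be sketched on a plain list of token IDs. This is an illustrative sketch, not the study's code: the function name is hypothetical, and emitting the warning only when segments are actually dropped is an assumption about what "the cap is reached" means. In practice the token list would come from encoding the JSON string with the tokenizer, and each segment would be decoded back into text before being sent.

```python
import math
import warnings


def split_tokens(tokens, t_chunk, n_max=None):
    """Split a token list into ceil(len(tokens) / t_chunk) contiguous segments.

    If n_max is given, at most n_max segments are returned and a warning
    is issued when later segments are dropped.
    """
    if t_chunk <= 0:
        raise ValueError("t_chunk must be positive")
    n_chunks = math.ceil(len(tokens) / t_chunk)
    if n_max is not None and n_chunks > n_max:
        warnings.warn(
            f"chunk cap reached: processing {n_max} of {n_chunks} segments"
        )
        n_chunks = n_max
    return [tokens[i * t_chunk:(i + 1) * t_chunk] for i in range(n_chunks)]
```

Note that splitting on token boundaries can cut a JSON record in the middle, so each decoded segment is not guaranteed to be valid JSON on its own.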

#### 3\.3\.4\.Per\-provider prompting and execution\.

Across providers, we reuse a common semantic task: estimating a target user’s domain knowledge from log data. [Figure 4](https://arxiv.org/html/2605.22971#S3.F4) shows the system prompt used in this study. We send this prompt together with each token chunk of the Slack communication log. The model is instructed to return a single JSON object with entries of the form `{"text": "...", "level": 0|1|2, "reason": "..."}`, where `level` encodes whether the knowledge is unknown, maybe known, or clearly known, respectively. For OpenAI, we also enable JSON response formats when supported by the underlying model (e.g., `gpt-4o` with `type="json"`, GPT-5 with `type="json_object"`). In contrast, o-series models are queried without a structured `response_format` due to current API constraints. For Claude and Gemini, we pass the combined prompt (system plus user) in the provider-specific format, while maintaining the token budgeting strategy described above.
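Because not every model is queried with a structured response format, the returned text may need validation before use. The sketch below checks entries against the `text` / `level` / `reason` shape described above. The wrapper key `"items"`, the function name, and the policy of silently skipping malformed entries are assumptions made for illustration; the text does not specify how entries are nested inside the JSON object.

```python
import json

# Meaning of each level as described in the text.
LEVELS = {0: "unknown", 1: "maybe known", 2: "clearly known"}


def parse_entries(raw: str, key: str = "items"):
    """Parse a model response and keep only well-formed knowledge entries.

    `key` is a hypothetical name for the list holding the entries.
    """
    data = json.loads(raw)
    entries = data.get(key, []) if isinstance(data, dict) else []
    valid = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        text, level, reason = entry.get("text"), entry.get("level"), entry.get("reason")
        # type(...) is int rejects booleans, which would otherwise pass as 0 or 1.
        if (isinstance(text, str) and isinstance(reason, str)
                and type(level) is int and level in LEVELS):
            valid.append({"text": text, "level": level, "reason": reason})
    return valid
```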

![Refer to caption](https://arxiv.org/html/2605.22971v1/x2.png) Figure 5. Web application user interface (UI) for self-annotation. Users were instructed to self-rate their skills, which were displayed on their personal page. The domain knowledge (terminology) is shown according to the LLMs’ extraction.

### 3\.4\.Performance Evaluation of Domain Knowledge Estimation Through User Self\-Annotation

Self-annotation is the final key component of our method. We invited the Slack users who had been active at least once to self-annotate their skills. The service exposes three GET endpoints that retrieve member profiles and their skills, and a POST endpoint that updates the self-reported skill levels in the Firebase cloud database. Each endpoint loads group metadata from disk, initializes Firebase credentials lazily, and interacts with Firestore collections to aggregate skills, compute the top-five averages, and merge self-assessments. Any failures propagate as HTTP exceptions with explicit status codes, keeping client feedback actionable. [Figure 5](https://arxiv.org/html/2605.22971#S3.F5) shows the user interface (UI) for the self-annotation application. As shown in the figure, participants were instructed to rate their own skills on a scale of 0 to 100, in increments of 5.
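The rating constraint (0 to 100 in steps of 5) and the merge of self-assessments can be sketched independently of any web framework or database. The function names and the dictionary-based profile are hypothetical; the actual service stores these values in Firestore, which is not reproduced here.

```python
def validate_self_rating(value):
    """Return the rating if it is an integer in 0..100 in steps of 5, else raise."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("rating must be an integer")
    if not 0 <= value <= 100:
        raise ValueError("rating must be between 0 and 100")
    if value % 5 != 0:
        raise ValueError("rating must be a multiple of 5")
    return value


def merge_self_assessment(profile, ratings):
    """Return a copy of `profile` with validated self-ratings merged in."""
    merged = dict(profile)
    for skill, value in ratings.items():
        merged[skill] = validate_self_rating(value)
    return merged
```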

We evaluate the performance of the LLMs using the mean absolute error (MAE), the standard deviation of the absolute errors (MAE\_STD), the root mean square error (RMSE), and the median absolute error (Median AE). These metrics are defined as follows.

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{1}$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.

$$\text{MAE}_{\text{STD}} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(|y_i - \hat{y}_i| - \text{MAE}\right)^{2}} \tag{2}$$

where $n$ is the number of samples.

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}} \tag{3}$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.

$$\text{Median AE} = \text{median}\left(|y_1 - \hat{y}_1|, |y_2 - \hat{y}_2|, \ldots, |y_n - \hat{y}_n|\right) \tag{4}$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
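The four metrics can be computed with the Python standard library as sketched below. The function name and the example ratings are illustrative and are not data from the study; `statistics.stdev` uses the $n-1$ denominator, matching Equation (2).

```python
import math
import statistics


def evaluation_metrics(y_true, y_pred):
    """Compute MAE, MAE_STD, RMSE, and Median AE for paired ratings."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(abs_err)
    return {
        "mae": sum(abs_err) / n,
        # Sample standard deviation of the absolute errors (n - 1 denominator).
        "mae_std": statistics.stdev(abs_err) if n > 1 else 0.0,
        "rmse": math.sqrt(sum(e * e for e in abs_err) / n),
        "median_ae": statistics.median(abs_err),
    }


# Made-up self-ratings and model estimates on the 0-100 scale.
example = evaluation_metrics([50, 80, 20, 100], [40, 100, 20, 70])
```

In this made-up example the absolute errors are 10, 20, 0, and 30, so the MAE and the Median AE both equal 15, while the RMSE is larger because it weights the 30-point error more heavily.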

## 4\.Data Collection

In this section, we explain the data collection process in detail. All participants are based in Germany, so we considered the General Data Protection Regulation (GDPR) in the ethical review process. We have also obtained approval from the ethics committee of the [anonymized for the review process]; this text will be updated once the approval is obtained.

### 4\.1\.Participants

In this study, the target participants are restricted to Slack users who were active at least once. Because some Slack accounts belonged to former members, we reached 27 participants through follow-up email invitations. To understand the demographics, we asked all participants to complete a pre-survey. The pre-survey results showed that the participants’ mean age was 27.96 years (SD = 3.30). Of these, 22 identified as male (81%), four as female (15%), and one preferred not to disclose their gender (4%). Regarding nationality, 12 participants were from India and 11 from Japan, with one participant each from Chile, Germany, Iran, and Russia. Regarding occupation, 18 participants (67%) were employed, and 9 (33%) were students.

![Refer to caption](https://arxiv.org/html/2605.22971v1/src/imgs/experiment_procedure.png) Figure 6. Experiment procedure. Participants first listen to the experiment instructions from the experiment conductor. Participants then confirm and submit the signed consent form and complete the online survey. Once those are completed, the experiment conductor shares the login information with each participant. Then, participants log in to the web application and self-rate their skills. Finally, participants complete the post-survey.
### 4\.2\.Experiment Procedure

[Figure 6](https://arxiv.org/html/2605.22971#S4.F6) shows the experiment procedure. Participants first received instructions from the experiment conductor, including the provision that they may opt out at any time. Once participants agreed to take part in the experiment, they filled out the consent form and the survey (Google Form: [https://forms.gle/DjXDPAxGKzDyXFeQ6](https://forms.gle/DjXDPAxGKzDyXFeQ6)). The ethics committee evaluates the consent form, and the survey is used only for demographic information ([subsection 4.1](https://arxiv.org/html/2605.22971#S4.SS1)).

Once the preparatory phase was complete, participants were asked to log in to the web application (Web Application: [https://member-skill-search.netlify.app/](https://member-skill-search.netlify.app/)) on their own laptop. The experiment conductor created each user’s account (email and password) in advance and distributed it to each participant. The email address is the one used for Slack login, which allows us to link each participant’s self-ratings to the domain knowledge estimated from their Slack logs. Once participants logged in to the web application, they were asked to self-rate their skills. The domain knowledge (terminology) is shown according to the LLMs’ extraction. To avoid biasing the self-ratings, the application did not display the skill levels estimated by the LLMs. After the self-annotation task was complete, participants were asked to complete the post-survey (Google Form: [https://forms.gle/vcWx2VrCLfqRej5z9](https://forms.gle/vcWx2VrCLfqRej5z9)). The post-survey collects participants’ feedback on their general thoughts about this study.

![Refer to caption](https://arxiv.org/html/2605.22971v1/src/imgs/extracted_word_count_per_user.png) Figure 7. The number of words extracted from the communication logs for each participant. The bar chart shows the number of words extracted from each participant’s communication logs. The number of words is the sum of the number of words in the message content and the number of words in the message metadata.

## 5\.Result and Discussion

This section presents the study’s findings, organized by research questions. [Figure 7](https://arxiv.org/html/2605.22971#S4.F7) shows the amount of domain knowledge extracted from the communication logs for each participant, sorted by the number of user messages. In general, participants who sent more messages had more domain knowledge extracted from their logs. However, this relationship is not strictly linear, as it also depends on the number of messages across the channels each participant accesses.

### 5\.1\.Performance Evaluation of LLMs

We first address RQ1: “How precise can LLMs estimate the human domain knowledge?”. In this analysis, we treat each participant’s self-annotated skill level on the 0–100 scale as ground truth and compare it with the corresponding estimates produced by each model. Following the evaluation protocol described in [subsection 3.4](https://arxiv.org/html/2605.22971#S3.SS4), we compute the mean absolute error (MAE), its standard deviation (MAE\_STD), the root mean square error (RMSE), and the median absolute error (Median AE) over all skills and all participants. These metrics quantify different aspects of the discrepancy between LLM-estimated domain knowledge and self-reported domain knowledge: MAE and Median AE capture typical deviations on the original rating scale, RMSE emphasizes larger errors, and MAE\_STD reflects how consistently a model performs across individual skill items.

[Table 2](https://arxiv.org/html/2605.22971#S5.T2) summarizes the overall performance of all evaluated models. Across the seven LLMs, we observe that all models exhibit a non-trivial ability to approximate participants’ domain knowledge, but the estimates remain far from perfect. Gemini 2.5 Flash achieves the best performance, with an MAE of $21.13 \pm 19.14$, an RMSE of $28.48$, and a Median AE of $15.00$. On our 0–100 rating scale, this means that, on average, the model’s predictions differ from users’ self-ratings by about 21 points, and half of the predictions fall within 15 points of the self-annotated values. Gemini 2.5 Pro shows slightly larger errors (MAE $= 22.68$, RMSE $= 31.23$, Median AE $= 16.50$), but remains close to Gemini 2.5 Flash, suggesting that both Gemini models capture broadly similar patterns in the communication logs.

The Claude models occupy the middle of the performance spectrum: Claude Haiku 4.5 yields an MAE of $26.74$ and Claude Sonnet 4.5 an MAE of $27.95$, with RMSE and Median AE values that are likewise intermediate between the Gemini and GPT families. In contrast, the GPT models show the largest discrepancies, with MAE values around $32$–$33$ and RMSE values above $41$, indicating noticeably less accurate alignment with participants’ self-annotated knowledge levels. The relatively large MAE\_STD values across all models (approximately $19$–$27$) further highlight substantial variability at the level of individual skills and participants, implying that some domains are consistently easier for the models to estimate than others.

These findings also answer RQ2: “Which LLM model provides the most accurate knowledge estimation?”. The results reveal a clear performance ranking in our setting: Gemini models perform best overall, followed by Claude models, and finally GPT models. While the absolute error levels indicate that reliable, fine-grained estimation of human domain knowledge remains challenging, the consistent advantage of the Gemini models suggests that architectural or training differences between LLM families can have a meaningful impact on this type of estimation task.

Table 2. Performance Comparison of the LLMs for Domain Knowledge Estimation.

![Refer to caption](https://arxiv.org/html/2605.22971v1/src/imgs/gemini_flash_mae_per_user.png) Figure 8. Performance comparison of LLMs’ domain knowledge estimation. The bar chart shows the mean absolute error (↓, lower is better) calculated between the LLM estimates and the human self-annotations for all participants’ data. As shown in the result, Gemini 2.5 Flash achieved the lowest mean absolute error against the self-annotated values. The highest error was observed for GPT-4o.
### 5\.2\.Performance Comparison Among Individuals

We now address RQ3: “How does the amount of communication logs impact the accuracy of LLMs’ domain knowledge estimation?” As described in [subsection 3.1](https://arxiv.org/html/2605.22971#S3.SS1), the number of Slack messages per participant varies substantially, from only a handful of posts to more than ten thousand messages, providing a natural testbed for analyzing how data volume relates to estimation performance. To investigate this relationship, we focus on Gemini 2.5 Flash, the best-performing model in the aggregate evaluation, and compute the MAE between its estimated skill scores and each participant’s self-annotated scores. [Figure 8](https://arxiv.org/html/2605.22971#S5.F8) visualizes these per-user MAE values, with participants ordered by their total number of user messages.
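The per-user analysis can be sketched as a grouping step followed by a sort. The record layout, the function name, the example values, and the ascending sort order are assumptions for illustration; the text states only that participants are ordered by their total number of messages.

```python
from collections import defaultdict


def per_user_mae(records, message_counts):
    """Return (uid, MAE) pairs ordered by each user's message count (ascending).

    `records` holds (uid, self_rating, estimate) triples, one per skill item.
    """
    errors = defaultdict(list)
    for uid, y_true, y_pred in records:
        errors[uid].append(abs(y_true - y_pred))
    maes = {uid: sum(errs) / len(errs) for uid, errs in errors.items()}
    return sorted(maes.items(), key=lambda item: message_counts.get(item[0], 0))


# Made-up records; not data from the study.
records = [("UID 0", 80, 60), ("UID 0", 50, 55), ("UID 1", 20, 50)]
counts = {"UID 0": 1200, "UID 1": 15}
ordered = per_user_mae(records, counts)
```

Plotting the resulting pairs in order gives a per-user bar chart of the kind described here.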

Overall, the figure reveals considerable variability in MAE across individuals, largely independent of their message volume. Participants with very few messages tend to exhibit relatively large errors, suggesting that when the available communication history is limited, LLMs lack sufficient evidence to infer domain knowledge reliably. Beyond this low-activity regime, however, the MAE fluctuates within a similar range even as the number of messages increases by an order of magnitude or more. In other words, we do not observe a systematic improvement in estimation accuracy for high-activity users; some participants with many messages still show MAE values comparable to, or even higher than, those of participants with moderate message counts.

This pattern indicates that, under our current zero-shot setup, simply accumulating more messages is not sufficient to guarantee more accurate domain-knowledge estimation. One plausible explanation is that not all messages are equally informative: a substantial fraction of Slack communication consists of brief acknowledgements, social talk, or coordination messages that convey little about a user’s technical expertise. In addition, the Slack logs capture only part of each participant’s work and study activities, whereas the self-annotated skill scores may include knowledge rarely verbalized in chat. Finally, because we deliberately did not fine-tune or condition the LLMs on any task-specific examples, the models had to infer expertise patterns purely from their general prior knowledge and the raw logs. Taken together, these factors likely limit the benefits of additional messages and suggest that future work should explore more targeted prompting, lightweight personalization, or task-specific adaptation to better exploit the available communication data.

## 6\.Limitation and Future Work

### 6\.1\.Data Security and Privacy

A central concern in this study is the security and privacy of the communication data used for analysis. As described in [subsection 3.1](https://arxiv.org/html/2605.22971#S3.SS1), our dataset consists of Slack communication logs collected over 2,744 days, comprising 27,188 messages from 43 users across 94 channels within a single group. Because these logs may contain both corporate confidential information and personal data, they must be treated as highly sensitive. All participants were based in Germany, and the data collection and analysis procedures were designed in accordance with the General Data Protection Regulation (GDPR) and an institutional ethics review process.

In the present work, these logs are used solely for research purposes to investigate whether LLMs can estimate individual domain knowledge from past communication. Only users who could be contacted and who later provided informed consent (27 out of the 43 Slack accounts) are included as participants in our evaluation. Their self-annotated skill scores are linked to the LLM-estimated scores, and all analyses are conducted using pseudonymized user identifiers (UIDs) rather than real names. User information is represented by IDs such as UID 0, and message examples shown in the paper are partially anonymized. These design choices aim to minimize the exposure of identifiable or sensitive content while still allowing us to study the behavior of different models on realistic organizational data.

At the same time, our experimental setup relies on cloud-based APIs for the evaluated LLM families (OpenAI, Claude, and Gemini), meaning that segments of the Slack logs are sent to external providers during inference. Under our current research protocol and consent procedure, this is acceptable for assessing the feasibility and relative performance of different LLMs for domain-knowledge estimation. However, many organizations operate under stricter data-governance policies or data-residency requirements, where sending internal communication logs to third-party cloud services would be unacceptable. Consequently, the present study should be interpreted as a proof-of-concept demonstration rather than a ready-to-deploy solution for highly sensitive environments.

As an important direction for future work, we plan to explore deployment strategies that keep all communication data within the organization’s own security boundary. One promising approach is to employ local or self-hosted LLMs, for example, models deployed on on-premises servers or in tenant-isolated environments, so that raw communication logs never leave corporate infrastructure. Complementary to this, we aim to investigate privacy-preserving preprocessing techniques, such as stronger anonymization or message aggregation, as well as the use of intermediate user representations that can be processed by LLMs without exposing the original message content. Developing such privacy-aware variants of our method will be essential for making LLM-based domain-knowledge estimation practically applicable in real-world organizations that must balance knowledge sharing with rigorous security and privacy constraints.

### 6\.2\.Towards Practical Application

In this study, we took a first step toward practical use by investigating whether individual domain knowledge can be estimated from everyday organizational communication, specifically Slack logs, using LLMs. Our experimental setup focused on how well models can infer users’ expertise from organically occurring messages, without requiring additional structured inputs or explicit self-descriptions. To make the process manageable for participants, we limited the target of estimation to word-level skill terms (e.g., technology names or key concepts). We asked users to provide self-assessed familiarity scores on a 0–100 scale via a dedicated web application. This design enabled a systematic, quantitative evaluation across models while keeping the annotation burden relatively low.

At the same time, our findings highlight several gaps that need to be addressed before such an approach can be deployed in real\-world workflows\. Domain knowledge is inherently multifaceted and context\-dependent; it involves not only familiarity with isolated concepts but also the depth of understanding, the ability to combine skills across domains, and experience with specific tools, projects, or roles\. Representing expertise purely as a flat list of word\-level skills therefore overlooks meaningful relationships among skills, differences in seniority or specialization, and the temporal evolution of knowledge\. Moreover, organizational communication channels may be noisy and incomplete, with substantial variation in how individuals express their expertise, leading to uneven estimation performance across users and domains\.

As future work, we plan to design a more general extraction pipeline that captures richer structures of competence while preserving interpretability and ease of use. For example, we envision methods that derive skill clusters, topic hierarchies, or role-specific profiles from communication logs, combined with other data sources such as project repositories or internal documentation. Such representations would support more nuanced reasoning about who knows what, including the identification of complementary expertise and emerging specialists. Building on these representations, our long-term goal is to develop a practical mechanism in which, when an employee submits a question or task, the backend system automatically leverages inferred domain-knowledge profiles to identify and recommend suitable experts within the organization. In this way, LLM-based estimation of domain knowledge could serve as a foundation for expert-finding tools and knowledge-support systems that augment existing communication channels, analogous to how current LLM prompts route user queries to appropriate capabilities.

## 7\.Conclusion

In this study, we investigated whether LLMs can estimate domain knowledge from organizational chat logs\. Using 27,188 Slack messages from 43 users over 2,744 days, we built a pipeline that extracts skill terms via LLMs and aggregates them into per\-user knowledge profiles, which 27 participants then self\-rated\. Our results show that contemporary LLMs can approximate human domain knowledge meaningfully but remain unreliable for fine\-grained profiling\. Across seven models, mean absolute errors ranged from 21 to 33 points on a 0–100 scale\. Gemini 2\.5 Flash achieved the best performance \(MAE 21\.13\), followed by Gemini 2\.5 Pro, with Claude models in the middle and GPT models trailing\. Per\-user analysis revealed substantial individual variation and only weak dependence on message volume: sparse histories hurt performance, but beyond a minimal threshold, additional messages did not systematically reduce errors\. These findings highlight both the promise and current limitations of using LLMs to infer “who knows what” from communication logs, providing a foundation for future research on AI\-supported organizational knowledge sharing\.

## References

- Anthropic \(2025a\)Claude haiku 4\.5Note:AI language modelExternal Links:[Link](https://www.anthropic.com/news/claude-haiku-4-5)Cited by:[§3\.2](https://arxiv.org/html/2605.22971#S3.SS2.p1.1)\.
- Anthropic \(2025b\)Claude sonnet 4\.5Note:AI language modelExternal Links:[Link](https://www.anthropic.com/news/claude-sonnet-4-5)Cited by:[§3\.2](https://arxiv.org/html/2605.22971#S3.SS2.p1.1)\.
- S\. Arsovski, H\. Osipyan, M\. I\. Oladele, and A\. D\. Cheok \(2019\)Automatic knowledge extraction of any chatbot from conversation\.Expert Systems with Applications137,pp\. 343–348\.Cited by:[Table 1](https://arxiv.org/html/2605.22971#S1.T1.1.8.8.1),[§2\.2](https://arxiv.org/html/2605.22971#S2.SS2.p4.1)\.
- J\. Bughin, M\. Chui, and J\. Manyika \(2012\)Capturing business value with social technologies\.McKinsey Quarterly4\(1\),pp\. 72–80\.Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p2.1)\.
- L\. Chen, Z\. Jiang, D\. Xia, Z\. Cai, L\. Sun, P\. Childs, and H\. Zuo \(2024\)BIDTrainer: an llms\-driven education tool for enhancing the understanding and reasoning in bio\-inspired design\.InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems,CHI ’24,New York, NY, USA\.External Links:ISBN 9798400703300,[Link](https://doi.org/10.1145/3613904.3642887),[Document](https://dx.doi.org/10.1145/3613904.3642887)Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p3.1)\.
- J\. Clusmann, F\. R\. Kolbinger, H\. S\. Muti, Z\. I\. Carrero, J\. Eckardt, N\. G\. Laleh, C\. M\. L\. Löffler, S\. Schwarzkopf, M\. Unger, G\. P\. Veldhuizen,et al\.\(2023\)The future landscape of large language models in medicine\.Communications medicine3\(1\),pp\. 141\.Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p3.1)\.
- E\. Galimzhanova, C\. I\. Muntean, F\. M\. Nardini, R\. Perego, and G\. Rocchietti \(2023\)Rewriting conversational utterances with instructed large language models\.In2023 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology \(WI\-IAT\),pp\. 56–63\.Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p3.1)\.
- D\. &\. Google \(2025a\)Gemini 2\.5 flashNote:Multimodal large\-language modelExternal Links:[Link](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash)Cited by:[§3\.2](https://arxiv.org/html/2605.22971#S3.SS2.p1.1)\.
- D\. &\. Google \(2025b\)Gemini 2\.5 proNote:Multimodal large\-language modelExternal Links:[Link](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro)Cited by:[§3\.2](https://arxiv.org/html/2605.22971#S3.SS2.p1.1)\.
- Google \(2025c\)Firebase\.Technical reportGoogle\.Note:Firebase is a backend\-as\-a\-service platform for building web and mobile applications\.External Links:[Link](https://firebase.google.com/)Cited by:[Figure 2](https://arxiv.org/html/2605.22971#S2.F2)\.
- X\. Gu, K\. M\. Yoo, and J\. Ha \(2021\)Dialogbert: discourse\-aware response generation via learning to recover and rank utterances\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.35,pp\. 12911–12919\.Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p3.1)\.
- J\. Huang, M\. Zhou, and D\. Yang \(2007\)Extracting chatbot knowledge from online discussion forums\.InProceedings of the 20th International Joint Conference on Artifical Intelligence,IJCAI’07,San Francisco, CA, USA,pp\. 423–428\.Cited by:[Table 1](https://arxiv.org/html/2605.22971#S1.T1.1.4.4.1),[§2\.2](https://arxiv.org/html/2605.22971#S2.SS2.p1.1)\.
- S\. Kernan Freire, C\. Wang, S\. Ruiz\-Arenas, and E\. Niforatos \(2023\)Tacit knowledge elicitation for shop\-floor workers with an intelligent assistant\.InExtended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems,CHI EA ’23,New York, NY, USA\.External Links:ISBN 9781450394222,[Link](https://doi.org/10.1145/3544549.3585755),[Document](https://dx.doi.org/10.1145/3544549.3585755)Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p3.1)\.
- K\. Kosilova and I\. Birzniece \(2024\)Survey on organizational chat conversation analysis: exploring dialogue summarization from a knowledge discovery perspective\.Complex Systems Informatics and Modeling Quarterly\(39\),pp\. 86–104\.Cited by:[Table 1](https://arxiv.org/html/2605.22971#S1.T1.1.2.2.1),[§1](https://arxiv.org/html/2605.22971#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.22971#S2.SS1.p1.1)\.
- R\. Kumar, H\. Kumar, and K\. Shalini \(2025\)Leveraging knowledge graphs and llms for context\-aware messaging\.arXiv preprint arXiv:2503\.13499\.Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p3.1)\.
- Y\. Liu, A\. Shah, J\. Ackerman, and M\. Saha \(2025\)Exploring the design space of real\-time llm knowledge support systems: a case study of jargon explanations\.InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems,CHI ’25,New York, NY, USA\.External Links:ISBN 9798400713941,[Link](https://doi.org/10.1145/3706598.3714262),[Document](https://dx.doi.org/10.1145/3706598.3714262)Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p3.1)\.
- R\. Morita, K\. Watanabe, J\. Zhou, A\. Dengel, and S\. Ishimaru \(2025\)GenAIReading: augmenting human cognition with interactive digital textbooks using large language models and image generation models\.InProceedings of the Augmented Humans International Conference 2025,AHs ’25,New York, NY, USA,pp\. 289–301\.External Links:ISBN 9798400715662,[Link](https://doi.org/10.1145/3745900.3746066),[Document](https://dx.doi.org/10.1145/3745900.3746066)Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p3.1)\.
- U\. Mumtaz, A\. Ahmed, and S\. Mumtaz \(2023\)LLMs\-healthcare: current applications and challenges of large language models in various medical specialties\.arXiv preprint arXiv:2311\.12882\.Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p3.1)\.
- K\. Oomori, Y\. Ishiguro, and J\. Rekimoto \(2024\)SkillsInterpreter: a case study of automatic annotation of flowcharts to support browsing instructional videos in modern martial arts using large language models\.InProceedings of the Augmented Humans International Conference 2024,AHs ’24,New York, NY, USA,pp\. 217–225\.External Links:ISBN 9798400709807,[Link](https://doi.org/10.1145/3652920.3652942),[Document](https://dx.doi.org/10.1145/3652920.3652942)Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p3.1)\.
- OpenAI \(2024\)GPT\-4o system card\.Technical reportOpenAI\.Note:System card describing the GPT\-4o omni modelExternal Links:[Link](https://openai.com/ja-JP/index/hello-gpt-4o/)Cited by:[§3\.2](https://arxiv.org/html/2605.22971#S3.SS2.p1.1)\.
- OpenAI \(2025a\)GPT\-5 system card\.Technical reportOpenAI\.Note:System card describing the GPT\-5 model familyExternal Links:[Link](https://openai.com/ja-JP/index/introducing-gpt-5/)Cited by:[§3\.2](https://arxiv.org/html/2605.22971#S3.SS2.p1.1)\.
- OpenAI \(2025b\)OpenAI o3 and o4\-mini system card\.Technical reportOpenAI\.Note:System card describing the OpenAI o3 and o4\-mini reasoning modelsExternal Links:[Link](https://openai.com/ja-JP/index/o3-o4-mini-system-card/)Cited by:[§3\.2](https://arxiv.org/html/2605.22971#S3.SS2.p1.1)\.
- Panopto and YouGov \(2018\)Workplace knowledge and productivity report\.Technical reportPanopto\.Note:Survey of 1 001 US employees in organisations with 200\+ employees; average large US business loses US$47 million annually from inefficient knowledge sharing\.External Links:[Link](https://www.panopto.com/resource/ebook/valuing-workplace-knowledge/)Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p2.1)\.
- J\. Rekimoto \(2025\)GazeLLM: multimodal llms incorporating human visual attention\.InProceedings of the Augmented Humans International Conference 2025,AHs ’25,New York, NY, USA,pp\. 302–311\.External Links:ISBN 9798400715662,[Link](https://doi.org/10.1145/3745900.3746075),[Document](https://dx.doi.org/10.1145/3745900.3746075)Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p3.1)\.
- J\. Salminen, D\. Amin, S\. Jung, and B\. Jansen \(2025\)The use of large language models in hci: a critical analysis of synthetic users\.InProceedings of the Augmented Humans International Conference 2025,AHs ’25,New York, NY, USA,pp\. 413–417\.External Links:ISBN 9798400715662,[Link](https://doi.org/10.1145/3745900.3746108),[Document](https://dx.doi.org/10.1145/3745900.3746108)Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p3.1)\.
- H\. Suzawa, K\. Watanabe, A\. Dengel, and S\. Ishimaru \(2025\)Augmenting online meetings with context\-aware real\-time music generation\.InProceedings of the Augmented Humans International Conference 2025,AHs ’25,New York, NY, USA,pp\. 487–490\.External Links:ISBN 9798400715662,[Link](https://doi.org/10.1145/3745900.3746116),[Document](https://dx.doi.org/10.1145/3745900.3746116)Cited by:[§1](https://arxiv.org/html/2605.22971#S1.p3.1)\.
- H. Takita, S. L. Walston, Y. Mitsuyama, K. Watanabe, S. Ishimaru, and D. Ueda (2025) Comparative performance of large language models in structuring head CT radiology reports: multi-institutional validation study in Japan. Japanese Journal of Radiology, pp. 1–11. Cited by: [§1](https://arxiv.org/html/2605.22971#S1.p3.1).
- A. Tigunova (2020) Extracting personal information from conversations. In Companion Proceedings of the Web Conference 2020, WWW ’20, New York, NY, USA, pp. 284–288. External Links: ISBN 9781450370240, [Link](https://doi.org/10.1145/3366424.3382089), [Document](https://dx.doi.org/10.1145/3366424.3382089). Cited by: [Table 1](https://arxiv.org/html/2605.22971#S1.T1.1.7.7.1).
- A. J. Trippe (2022) Knowledge management with patents: a systematic method for capturing and sharing unique technical information for corporate guidance. World Patent Information 69, pp. 102110. Cited by: [§1](https://arxiv.org/html/2605.22971#S1.p2.1).
- X. Wang and X. Chen (2024) Towards human-AI mutual learning: a new research paradigm. arXiv preprint arXiv:2405.04687. Cited by: [Table 1](https://arxiv.org/html/2605.22971#S1.T1.1.5.5.1), [§2.2](https://arxiv.org/html/2605.22971#S2.SS2.p2.1).
- D. M. West (2018) The future of work: robots, AI, and automation. Bloomsbury Publishing USA. Cited by: [§1](https://arxiv.org/html/2605.22971#S1.p2.1).
- T. Wu, H. Zhu, M. Albayrak, A. Axon, A. Bertsch, W. Deng, Z. Ding, B. Guo, S. Gururaja, T. Kuo, J. T. Liang, R. Liu, I. Mandal, J. Milbauer, X. Ni, N. Padmanabhan, S. Ramkumar, A. Sudjianto, J. Taylor, Y. Tseng, P. Vaidos, Z. Wu, W. Wu, and C. Yang (2025) LLMs as workers in human-computational algorithms? Replicating crowdsourcing pipelines with LLMs. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA ’25, New York, NY, USA. External Links: ISBN 9798400713958, [Link](https://doi.org/10.1145/3706599.3706690), [Document](https://dx.doi.org/10.1145/3706599.3706690). Cited by: [§1](https://arxiv.org/html/2605.22971#S1.p3.1).
- K. Yamaoka, K. Watanabe, K. Kise, A. Dengel, and S. Ishimaru (2023) Experience is the best teacher: personalized vocabulary building within the context of Instagram posts and sentences from GPT-3. In Adjunct Proceedings of the 2022 ACM International Joint Conference on Pervasive and Ubiquitous Computing and the 2022 ACM International Symposium on Wearable Computers, UbiComp/ISWC ’22 Adjunct, New York, NY, USA, pp. 313–316. External Links: ISBN 9781450394239, [Link](https://doi.org/10.1145/3544793.3560382), [Document](https://dx.doi.org/10.1145/3544793.3560382). Cited by: [§1](https://arxiv.org/html/2605.22971#S1.p3.1).
- K. Yamaoka, K. Watanabe, K. Kise, A. Dengel, and S. Ishimaru (2025) Img2Vocab: explore words tied to your life with LLMs and social media images. IEEE Access 13, pp. 20456–20471. External Links: [Document](https://dx.doi.org/10.1109/ACCESS.2025.3533076). Cited by: [§1](https://arxiv.org/html/2605.22971#S1.p3.1).
- J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu (2024) Harnessing the power of LLMs in practice: a survey on ChatGPT and beyond. ACM Transactions on Knowledge Discovery from Data 18(6), pp. 1–32. Cited by: [§1](https://arxiv.org/html/2605.22971#S1.p3.1).
- R. Zhang, Y. Tian, P. Wei, D. D. Zeng, and W. Mao (2024a) An LLM-enabled knowledge elicitation and retrieval framework for zero-shot cross-lingual stance identification. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 12253–12266. Cited by: [Table 1](https://arxiv.org/html/2605.22971#S1.T1.1.6.6.1), [§2.2](https://arxiv.org/html/2605.22971#S2.SS2.p3.1).
- S. Zhang, M. Wang, and K. Balog (2022) Analyzing and simulating user utterance reformulation in conversational recommender systems. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 133–143. Cited by: [§1](https://arxiv.org/html/2605.22971#S1.p3.1).
- X. Zhang, H. Yu, Y. Li, M. Wang, L. Chen, and F. Huang (2024b) The imperative of conversation analysis in the era of LLMs: a survey of tasks, techniques, and trends. arXiv preprint arXiv:2409.14195. Cited by: [Table 1](https://arxiv.org/html/2605.22971#S1.T1), [Table 1](https://arxiv.org/html/2605.22971#S1.T1.1.3.3.1), [§1](https://arxiv.org/html/2605.22971#S1.p3.1), [§2.1](https://arxiv.org/html/2605.22971#S2.SS1.p1.1).
- S. Zhou, M. Armstrong, G. Barbareschi, T. Ajioka, Z. Hu, M. Muto, A. Ryoichi, K. Yoshifuji, and K. Minamizawa (2025) Augmented body communicator: enhancing daily body expression for people with upper limb limitations through LLM and a robotic arm. In Proceedings of the Augmented Humans International Conference 2025, AHs ’25, New York, NY, USA, pp. 174–186. External Links: ISBN 9798400715662, [Link](https://doi.org/10.1145/3745900.3746089), [Document](https://dx.doi.org/10.1145/3745900.3746089). Cited by: [§1](https://arxiv.org/html/2605.22971#S1.p3.1).
- B. Zohuri and F. Mossavar-Rahmani (2019) A model to forecast future paradigms: volume 1: introduction to knowledge is power in four dimensions. Apple Academic Press. Cited by: [§1](https://arxiv.org/html/2605.22971#S1.p2.1).
