Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest

arXiv cs.CL 04/22/26, 04:00 AM Papers
Summary
Researchers from Utah State and Vanderbilt benchmark GPT-4, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2 and BERT on three social-media tasks—authorship verification, post generation, and user attribute inference—introducing new sampling protocols and taxonomies to reduce bias and enable reproducible benchmarks.
arXiv:2604.18955v1 Announce Type: new Abstract: In this study, we present the first comprehensive evaluation of modern LLMs - including GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT - across three core social media analytics tasks on a Twitter (X) dataset: (I) Social Media Authorship Verification, (II) Social Media Post Generation, and (III) User Attribute Inference. For the authorship verification, we introduce a systematic sampling framework over diverse user and post selection strategies and evaluate generalization on newly collected tweets from January 2024 onward to mitigate "seen-data" bias. For post generation, we assess the ability of LLMs to produce authentic, user-like content using comprehensive evaluation metrics. Bridging Tasks I and II, we conduct a user study to measure real users' perceptions of LLM-generated posts conditioned on their own writing. For attribute inference, we annotate occupations and interests using two standardized taxonomies (IAB Tech Lab 2023 and 2018 U.S. SOC) and benchmark LLMs against existing baselines. Overall, our unified evaluation provides new insights and establishes reproducible benchmarks for LLM-driven social media analytics. The code and data are provided in the supplementary material and will also be made publicly available upon publication.
Cached at: 04/22/26, 08:29 AM
# Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest
Source: [https://arxiv.org/html/2604.18955](https://arxiv.org/html/2604.18955)
Ramtin Davoudi \(Utah State University, ramtin\.davoudi@usu\.edu\), Kartik Thakkar \(Utah State University, kartik\.thakkar@usu\.edu\), Nazanin Donyapour \(Independent Researcher, nazanin\.donyapour@gmail\.com\), Tyler Derr \(Vanderbilt University, tyler\.derr@vanderbilt\.edu\), Hamid Karimi \(Utah State University, hamid\.karimi@usu\.edu\)

###### Abstract

In this study, we present the first comprehensive evaluation of modern LLMs—including GPT\-4, GPT\-4o, GPT\-3\.5\-Turbo, Gemini 1\.5 Pro, DeepSeek\-V3, Llama 3\.2, and BERT—across three core social media analytics tasks on a Twitter \(X\) dataset: \(I\) Social Media Authorship Verification, \(II\) Social Media Post Generation, and \(III\) User Attribute Inference\. For the authorship verification, we introduce a systematic sampling framework over diverse user and post selection strategies and evaluate generalization on newly collected tweets from January 2024 onward to mitigate “seen\-data” bias\. For post generation, we assess the ability of LLMs to produce authentic, user\-like content using comprehensive evaluation metrics\. Bridging Tasks I and II, we conduct a user study to measure real users’ perceptions of LLM\-generated posts conditioned on their own writing\. For attribute inference, we annotate occupations and interests using two standardized taxonomies \(IAB Tech Lab 2023 and 2018 U\.S\. SOC\) and benchmark LLMs against existing baselines\. Overall, our unified evaluation provides new insights and establishes reproducible benchmarks for LLM\-driven social media analytics\. The code and data are provided in the supplementary material and will also be made publicly available upon publication\.


## 1 Introduction

Online social media platforms have become integral to modern society, generating vast amounts of user\-generated content that offers unique insights into domains such as marketing Singh and Singh \([2018](https://arxiv.org/html/2604.18955#bib.bib6)\), public health Schillinger et al\. \([2020](https://arxiv.org/html/2604.18955#bib.bib7)\), and crisis management Saroj and Pal \([2020](https://arxiv.org/html/2604.18955#bib.bib8)\)\. This has led to a vibrant research field known as social media analytics \(or social media mining\)\. While there has been significant progress in social media analytics since the emergence of popular social networking platforms such as Facebook and Twitter \(now X\), the field still faces critical challenges in fully leveraging the insights and potential of social media data\. One of the main challenges is the complexity of user\-generated content, especially text\. Social media content is often informal, ambiguous, or irrelevant \(e\.g\., spam and memes\), which complicates content analysis\. In addition, social media language changes rapidly \(e\.g\., new slang, memes\), so models trained on past data may quickly become outdated\.

Recent advances in large language models \(LLMs\), including GPT, DeepSeek, and Gemini, offer new opportunities to address these challenges\. Trained on large and diverse corpora, LLMs demonstrate strong capabilities in understanding and generating language in dynamic and noisy environments\. They have shown competitive performance across tasks such as sentiment analysis Zhang et al\. \([2023a](https://arxiv.org/html/2604.18955#bib.bib13)\), stance detection Gambini et al\. \([2024](https://arxiv.org/html/2604.18955#bib.bib40)\), misinformation classification Hu et al\. \([2024a](https://arxiv.org/html/2604.18955#bib.bib10)\), topic extraction Mu et al\. \([2024](https://arxiv.org/html/2604.18955#bib.bib11)\), and summarization Zhang et al\. \([2024a](https://arxiv.org/html/2604.18955#bib.bib20)\), often with minimal task\-specific supervision\.

In this work, we conduct a comprehensive empirical evaluation of LLMs across three core social media analytics tasks: \(I\) Social Media Authorship Verification, \(II\) Social Media Post Generation, and \(III\) User Attribute Inference\. Task I aims to determine whether a post was authored by a specific user\. While authorship verification has applications in digital forensics and plagiarism detection Tyo et al\. \([2022](https://arxiv.org/html/2604.18955#bib.bib17)\), social media settings pose additional challenges due to short, low\-quality content Huang et al\. \([2025](https://arxiv.org/html/2604.18955#bib.bib31)\)\. Task II focuses on generating posts that reflect a user’s style and content preferences\. Although automated text generation has been widely studied Celikyilmaz et al\. \([2020](https://arxiv.org/html/2604.18955#bib.bib4)\), producing realistic and personalized social media content remains difficult Perez\-Castro et al\. \([2023](https://arxiv.org/html/2604.18955#bib.bib5)\); Du et al\. \([2023](https://arxiv.org/html/2604.18955#bib.bib59)\), and LLMs are increasingly being leveraged for personalized content generation Kumar et al\. \([2024](https://arxiv.org/html/2604.18955#bib.bib1)\); Salemi et al\. \([2024](https://arxiv.org/html/2604.18955#bib.bib2)\); Zhang et al\. \([2024b](https://arxiv.org/html/2604.18955#bib.bib3)\)\. Task III addresses the prediction of user attributes \(e\.g\., occupation, interests\) from textual content\. Such inference supports personalization and large\-scale behavioral analysis Hu et al\. \([2007](https://arxiv.org/html/2604.18955#bib.bib71)\); De Bock and Van den Poel \([2010](https://arxiv.org/html/2604.18955#bib.bib72)\); Goel and Goldstein \([2014](https://arxiv.org/html/2604.18955#bib.bib73)\), yet remains challenging due to label ambiguity and fine\-grained classification requirements\. For a detailed discussion of work related to these three tasks, we refer readers to Appendix [G](https://arxiv.org/html/2604.18955#A7)\.

Why these three tasks together? Social media characterizes user identity through observable traces in user\-generated content Gündüz \([2017](https://arxiv.org/html/2604.18955#bib.bib93)\); Shulman \([2022](https://arxiv.org/html/2604.18955#bib.bib92)\)\. Similar to previous studies, we conceptualize this identity at three levels: the user’s unique identity through their social media posts \(Task I\) Yadav and Li \([2017](https://arxiv.org/html/2604.18955#bib.bib94)\), the user’s unique social media signature, style, and tone \(Task II\) Abbasi and Chen \([2008](https://arxiv.org/html/2604.18955#bib.bib95)\), and latent demographic attributes \(Task III\) Tigunova et al\. \([2020](https://arxiv.org/html/2604.18955#bib.bib96)\)\. Together, these tasks enable a coherent assessment of how LLMs infer and recognize user identity\.

In this paper, we evaluate several state\-of\-the\-art LLMs—including GPT, Gemini, DeepSeek, Llama, and BERT—alongside traditional ML baselines on a large Twitter \(X\) dataset\. Our contributions are:

- To the best of our knowledge, this is the first study to jointly and systematically assess multiple modern LLMs across three fundamental social media analytics tasks under a unified evaluation framework\.
- For Social Media Authorship Verification \(Task I\), we design diverse user and post sampling strategies to enable controlled and realistic benchmarking\. We further address “seen data” \(data leakage\) bias by evaluating models on newly collected tweets from January 2024 onward, testing generalization beyond training cut\-off periods\.
- For Social Media Post Generation \(Task II\), we introduce a structured evaluation protocol for user\-conditioned generation based on a curated set of verified, active, and profile\-rich users, combining lexical, semantic, and diversity\-based metrics to characterize trade\-offs between authenticity and fluency\.
- Bridging Tasks I and II, we conduct the first user study measuring real users’ perceptions of authenticity in LLM\-generated tweets, enabling direct comparison between automatic metrics and human judgments\.
- For User Attribute Inference \(Task III\), we predict users’ occupations and interests using two standardized taxonomies: the IAB Tech Lab Content Taxonomy v3\.1 and the 2018 U\.S\. Standard Occupational Classification \(SOC\)\. Grounding evaluation in these multi\-level ontologies enables reproducible labeling, hierarchical analysis from coarse to fine granularity, and alignment with established real\-world classification standards\.

Collectively, this work establishes a unified and reproducible benchmarking framework for analyzing the capabilities and limitations of modern LLMs in social media analytics\.

Remark: In this study, we use Gemini 1\.5 Pro, DeepSeek\-V3, Llama 3\.2, and three versions of GPT \(4, 4o, and 3\.5\-Turbo\)\. For brevity, we drop the versions from Gemini, DeepSeek, and Llama\.

## 2 Methodology

![Refer to caption](https://arxiv.org/html/2604.18955v1/x1.png)

Figure 1: An overview of the proposed methodology across three social media analytic tasks

Figure [1](https://arxiv.org/html/2604.18955#S2.F1) gives an overview of the proposed methodology for investigating the capabilities of LLMs across three social media analytic tasks\. All three tasks draw on a social media network dataset consisting of the temporal friendship graph and the user\-generated posts acquired from Kheiri et al\. \([2023](https://arxiv.org/html/2604.18955#bib.bib41)\) \(see the Dataset in Appendix [A](https://arxiv.org/html/2604.18955#A1) for more information\)\. We focus on Twitter \(X\) due to its text\-centric nature and well\-defined relational signals \(e\.g\., follower networks and various interactions\), which make it suitable for controlled benchmarking\. The dataset includes temporal network structure and user\-generated posts, enabling evaluation of authorship verification, post generation, and attribute inference\. Although our experiments use Twitter data, the methodology is platform\-agnostic and can be adapted to other platforms by redefining relational and interaction signals\. Next, we explain each component of this method\.

### 2\.1 Social Media Authorship Verification

In this task, we frame a binary classification task in which an LLM must distinguish between social media posts genuinely authored by a target user and those authored by other users\. Next, we explain each part of this task\.

Positive User \(and Post\) Sampling\. Positive \(or target\) users are users whose posts are assessed by an LLM\. The purpose of this sampling is twofold: \(1\) the large number of social media users in our dataset \(around 120K\) makes it infeasible to assess all of them, mainly due to cost; and \(2\) sampling different users lets us investigate LLMs’ performance across diverse groups\. We sample three sets of target \(positive\) users:

- Random \(Rnd\): A simple random selection of users\.
- Recent Active \(Rec\): Users active within the last three weeks in our dataset\.
- Top Active Users \(Top\): Users with the highest overall number of posts \(tweets\)\.

Once a positive user is selected \(shown in green in Figure [1](https://arxiv.org/html/2604.18955#S2.F1)\), we sort their posts chronologically \(posts shown in green in Figure [1](https://arxiv.org/html/2604.18955#S2.F1)\)\. Then, we sample two sets of posts: \(1\) Few\-shot Example Posts: $k$ older posts, and \(2\) Positive Evaluation Posts: $m_1$ newer posts \(see Figure [1](https://arxiv.org/html/2604.18955#S2.F1)\)\. The former serve as examples that let the LLM “become familiar” with the positive user’s content \(i\.e\., few\-shot samples\), while the latter are used to form the binary classification described below\.
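The chronological split can be sketched as follows. This is a minimal illustration, not the authors' code: the `Post` record and the `split_user_posts` helper are hypothetical names, and the choice to take the $m_1$ most recent posts for evaluation and the $k$ posts immediately preceding them as few\-shot examples is an assumption, since the text only states that the few\-shot posts are older than the evaluation posts\.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Post:
    """Hypothetical post record: text plus creation timestamp."""
    text: str
    created_at: datetime


def split_user_posts(posts, k, m1):
    """Return (few_shot, positive_eval) for one positive user.

    Assumption: the m1 most recent posts are the evaluation posts and the
    k posts immediately before them are the few-shot examples.
    """
    if k <= 0 or m1 <= 0:
        raise ValueError("k and m1 must be positive")
    if len(posts) < k + m1:
        raise ValueError("user needs at least k + m1 posts")
    ordered = sorted(posts, key=lambda p: p.created_at)  # oldest first
    positive_eval = ordered[-m1:]
    few_shot = ordered[-(k + m1):-m1]
    return few_shot, positive_eval
```

Keeping the few\-shot posts strictly older than the evaluation posts means the model never sees examples that postdate the posts it is asked to judge\.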

Negative Social Media Post Sampling\. To form the binary classification task for an LLM, for each positive user, we also sample $m_2$ posts from other users, called Negative Evaluation Posts \(shown in orange in Figure [1](https://arxiv.org/html/2604.18955#S2.F1)\)\. To make a fair comparison, we sample these posts from the same time interval as the Positive Evaluation Posts\. More specifically, let the target user’s $m_1$ Positive Evaluation Posts correspond to a time interval $[\tau_{\text{start}}, \tau_{\text{end}}]$, where $\tau_{\text{start}}$ and $\tau_{\text{end}}$ are the earliest and latest timestamps of these posts, respectively \(see Figure [1](https://arxiv.org/html/2604.18955#S2.F1)\)\. Then, we employ one of the following strategies to draw Negative Evaluation Posts within the same time interval $[\tau_{\text{start}}, \tau_{\text{end}}]$:

- Random Sampling: A simple random draw of $m_2$ posts from a pool of other users’ posts\.
- Similar Topics Sampling: We embed each of the $m_1$ Positive Evaluation Posts into vector representations \(e\.g\., using a text embedding model\)\. Similarly, we embed candidate tweets from other users\. Then, we calculate each candidate post’s average similarity to the user’s Positive Evaluation Posts and select the top $m_2$ most similar posts to form the Negative Evaluation Posts\. We call this sampling strategy Topic\-similar\.
- Social Graph\-based Sampling: We leverage the social graph to draw Negative Evaluation Posts\. For each positive user $u_i$, we randomly select $m_2$ posts from other users who are followees \($\{\forall v \mid v \leftarrow u_i\}$\), followers \($\{\forall v \mid v \rightarrow u_i\}$\), or reciprocal connections \($\{\forall v \mid v \leftrightarrow u_i\}$\) of $u_i$\. This yields three sample sets, referred to as Followers\-only, Followees\-only, and Reciprocal\.
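The Topic\-similar strategy above can be sketched in a few lines\. This is an illustrative sketch under stated assumptions: embeddings are taken as plain lists of floats that were already computed by some text embedding model, the candidate pool is assumed to be pre\-filtered to the interval $[\tau_{\text{start}}, \tau_{\text{end}}]$, cosine similarity is assumed as the similarity measure, and the function names are hypothetical\.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_similar_negatives(positive_vecs, candidate_vecs, m2):
    """Indices of the m2 candidates with the highest mean similarity
    to the positive evaluation posts (ties broken by lower index)."""
    scored = []
    for idx, cand in enumerate(candidate_vecs):
        mean_sim = sum(cosine(cand, pos) for pos in positive_vecs) / len(positive_vecs)
        scored.append((mean_sim, idx))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [idx for _, idx in scored[:m2]]
```

Selecting the most similar candidates makes the negatives topically close to the positives, so a model cannot separate the classes by topic alone\.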

LLM’s Prompt\. After selecting a target \(positive\) user and their $k$ Few\-shot Example Posts, and forming both the $m_1$ Positive Evaluation Posts and the $m_2$ Negative Evaluation Posts, we present each classification instance to the LLM\. The exact prompt is in Appendix [B](https://arxiv.org/html/2604.18955#A2) \(Prompt 1\)\.

Unbiased \(Unseen\) Data Investigation\. To evaluate authorship verification without bias from potentially memorized data, we also examine the issue of “seen data” by using social media posts \(tweets\) posted from Jan\. 2024 onward, well beyond the LLMs’ training cutoff dates \(see Appendix [C](https://arxiv.org/html/2604.18955#A3), Table [11](https://arxiv.org/html/2604.18955#A3.T11) for exact cutoff dates\)\. Specifically, from an initial pool of systematically filtered active users \(VAPOR users, described in Section [2\.2](https://arxiv.org/html/2604.18955#S2.SS2)\), we select 50 users who each authored at least $k + m_1$ original tweets \(excluding retweets\) since 2024\. Note that the dataset extends only up to 2020; therefore, we used the X API to collect more recent tweets\. This ensures a realistic, unbiased evaluation of model generalization performance on definitively unseen data\. The prompt used is identical to Prompt 1 \(Appendix [B](https://arxiv.org/html/2604.18955#A2)\)\.

##### Evaluation\.

For this task, we use weighted F1\-score as the evaluation metric\.
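For readers unfamiliar with the metric, weighted F1 averages the per\-class F1 scores, weighting each class by its share of the true labels\. The sketch below is a plain\-Python illustration of that standard definition; the function name is hypothetical and this is not the authors' evaluation code\.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true instances of this class
        score += f1 * support / total
    return score
```

One consequence of this definition: on a balanced binary task, a classifier that always predicts the same class obtains a weighted F1 of 1/3, since one class has F1 of 2/3 and the other has F1 of 0\.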

### 2\.2 Social Media Post Generation

In this task, we assess each LLM’s capability to create plausible social media posts on behalf of real users\. Specifically, the objective is to generate a set of synthetic posts \(tweets\) that reflect the user’s writing style, topical preferences, and social context, drawing on the user’s historical posts as guiding examples\.

##### User Sampling\.

We first restrict our pool to users who posted at least 200 *original* tweets \(excluding retweets\) during the 2018–2020 window\. Additionally, we retain only users who have more than 500 followers, more than 100 followees \(friends\), a non\-empty bio with a description longer than 20 characters, and a verified status\. These constraints help ensure each selected user is sufficiently active, has meaningful profile information, and demonstrates personal tweeting behavior\. Applying the above filters yields a total of 383 users, each of whom is included in our evaluations for the post generation task\. We refer to this set of users as VAPOR users \(Verified, Active, Profile\-rich, Original\)\. Selecting VAPOR users may bias the data toward public\-facing accounts; this was intentional for Tasks II–III to ensure high\-quality ground truth\. In contrast, Task I and the user study \(Section [2\.3](https://arxiv.org/html/2604.18955#S2.SS3) and Appendix [E](https://arxiv.org/html/2604.18955#A5)\) include less\-active users and casual tweets, ensuring coverage of broader and informal language use\.
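The filtering criteria can be expressed as a single predicate\. The sketch below is illustrative only: the `UserProfile` record and its field names are hypothetical and do not correspond to the dataset's actual schema; only the thresholds come from the text above\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserProfile:
    """Hypothetical per-user summary; field names are assumptions."""
    user_id: str
    original_tweets_2018_2020: int  # original tweets, retweets excluded
    followers: int
    followees: int
    description: str
    verified: bool


def is_vapor(user):
    """True if the user is Verified, Active, Profile-rich, and Original."""
    return (
        user.original_tweets_2018_2020 >= 200
        and user.followers > 500
        and user.followees > 100
        and len(user.description.strip()) > 20
        and user.verified
    )
```

Applied to the full user pool, for example with `[u for u in users if is_vapor(u)]`, a predicate of this kind would produce the candidate set that the paper reports as 383 users\.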

##### Prompt Posts\.

To generate $n$ new posts for each VAPOR user, we sample $k$ user\-authored posts and combine them into a single prompt, along with the user’s description and follower/following statistics \(shown as Other Information in Figure [1](https://arxiv.org/html/2604.18955#S2.F1)\)\. Each model is instructed to produce a fixed number of posts that resemble the user’s style\. We explicitly request short posts not exceeding 280 characters \(the tweet character limit\), formatted consecutively with minimal additional text\.

##### LLM’s Prompt\.

The prompt is included in Appendix [B](https://arxiv.org/html/2604.18955#A2) \(Prompt 2\)\.

##### Evaluation\.

We evaluate the quality of generated posts against two reference sets: Prompt Posts and Non\-prompt Posts\. Prompt Posts consist of real user tweets that are included in the input prompt and thus visible to the LLM during generation\. In contrast, Non\-prompt Posts are a separate set of real tweets authored by the same user but withheld from the prompt, serving as an unseen reference set for evaluation\. The purpose is to assess LLMs’ capability on “Example” and “Evaluation” sets, respectively, similar to what is usually done in machine learning modeling\. As for evaluation measures, we compute standard natural language generation metrics, including BLEU Papineni et al\. \([2002](https://arxiv.org/html/2604.18955#bib.bib76)\) and ROUGE Lin \([2004](https://arxiv.org/html/2604.18955#bib.bib77)\) \(specifically ROUGE\-1 and ROUGE\-L\), to quantify the lexical overlap with reference tweets\. Additionally, we calculate Perplexity Chen et al\. \([1998](https://arxiv.org/html/2604.18955#bib.bib78)\) using GPT\-2 Radford et al\. \([2019](https://arxiv.org/html/2604.18955#bib.bib45)\) to assess the fluency and naturalness of the generated content\.
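For reference, perplexity under a scoring model such as GPT\-2 is conventionally defined as the exponentiated average negative log\-likelihood per token\. The notation below is introduced here for illustration and is not taken from the paper: $x_1, \ldots, x_T$ are the tokens of a generated post and $p_\theta$ is the scoring model\.

$$
\mathrm{PPL}(x_1, \ldots, x_T) = \exp\left(-\frac{1}{T}\sum_{t=1}^{T}\log p_\theta\left(x_t \mid x_{<t}\right)\right)
$$

Lower values mean the scoring model finds the text more predictable, which is why perplexity is read as a fluency proxy rather than a direct measure of authenticity\.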

![Refer to caption](https://arxiv.org/html/2604.18955v1/x2.png)

Figure 2: An example of metrics for post generation evaluation

Moreover, we numerically represent each of the $n$ LLM\-generated posts and each of the $k$ posts in our evaluation sets using the SBERT sentence transformer model \(all\-MiniLM\-L6\-v2\) Reimers and Gurevych \([2019](https://arxiv.org/html/2604.18955#bib.bib44)\)\. Then, we form an $n \times k$ matrix $\mathcal{M}$ where entry $\mathcal{M}_{ij}$ stores the cosine similarity between the LLM\-generated post $i$ and the real user’s post $j$\. Using this matrix, we propose the following metrics \(Figure [2](https://arxiv.org/html/2604.18955#S2.F2) demonstrates an example\):

Average Gen vs\. Real \(AGR\): For each generated post \(row\), we retrieve the maximum similarity to any real post\. We then take the mean of these maxima across all generated posts: $\mathrm{AGR} = \frac{1}{n}\sum_{i=1}^{n}\max_{j}\mathcal{M}_{ij}$\.

Average Real vs\. Gen \(ARG\): For each real post \(column\), we retrieve the maximum similarity to any generated post\. We then take the mean of these maxima across all real posts: $\mathrm{ARG} = \frac{1}{k}\sum_{j=1}^{k}\max_{i}\mathcal{M}_{ij}$\.

Average Overall Similarity \(AOS\): The mean of all pairwise cosine similarities between generated and real posts: $\mathrm{AOS} = \frac{1}{n \times k}\sum_{i}\sum_{j}\mathcal{M}_{ij}$\.

Gen Dispersion Ratio \(GDR\): The fraction of real posts uniquely identified as the most similar match across generated posts \(row ratio\)\.

$$
\mathrm{GDR} = \frac{\left|\operatorname{unique}\left(\left\{\underset{j}{\arg\max}~\mathcal{M}_{ij} \,\middle|\, i = 1, \ldots, n\right\}\right)\right|}{k}
$$

Real Dispersion Ratio \(RDR\): The fraction of generated posts uniquely identified as the most similar match across real posts \(column ratio\)\.

$$
\mathrm{RDR} = \frac{\left|\operatorname{unique}\left(\left\{\underset{i}{\arg\max}~\mathcal{M}_{ij} \,\middle|\, j = 1, \ldots, k\right\}\right)\right|}{n}
$$

The last two metrics \(GDR and RDR\) assess the breadth of coverage between generated and real posts: how broadly or narrowly generated posts cover different real posts, and vice versa\.
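The five matrix\-based metrics can be computed directly from $\mathcal{M}$\. The sketch below follows the definitions above, taking the similarity matrix as a list of $n$ rows with $k$ entries each; the function name is hypothetical, and breaking arg\-max ties by the lowest index is an assumption the definitions leave open\.

```python
def similarity_metrics(M):
    """Compute AGR, ARG, AOS, GDR, RDR from an n x k similarity matrix.

    M[i][j] is the cosine similarity between generated post i and
    real post j. Ties in the arg max go to the lowest index (assumption).
    """
    n, k = len(M), len(M[0])
    row_max = [max(row) for row in M]
    col_max = [max(M[i][j] for i in range(n)) for j in range(k)]
    # Real post matched by each generated post, and vice versa.
    row_argmax = {max(range(k), key=lambda j: M[i][j]) for i in range(n)}
    col_argmax = {max(range(n), key=lambda i: M[i][j]) for j in range(k)}
    return {
        "AGR": sum(row_max) / n,
        "ARG": sum(col_max) / k,
        "AOS": sum(sum(row) for row in M) / (n * k),
        "GDR": len(row_argmax) / k,
        "RDR": len(col_argmax) / n,
    }
```

Because at most $n$ distinct real posts can be matched by $n$ generated posts, GDR is bounded above by $\min(n, k)/k$, while RDR is bounded above by $\min(n, k)/n$\.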

### 2\.3 User Study

Bridging Task I and Task II, we conducted a user study to evaluate the perceived authenticity of LLM\-generated posts\. The study was approved by our university’s IRB \(Institutional Review Board\), and participants were recruited via an open call\. Eligibility required being at least 18 years old, owning a public Twitter/X account, and having at least 50 original tweets\. Nineteen users met these criteria and completed the study\. Each participant evaluated a personalized set of 20 LLM\-generated tweets \(five per model from DeepSeek, Gemini, GPT, and Llama\) and two authentic tweets randomly sampled from their own timeline, which served as attention checks\. Generated tweets were conditioned on the user’s bio, follower/followee counts, and 50 sampled tweets\. The exact prompt is provided in Appendix [B](https://arxiv.org/html/2604.18955#A2) \(Prompt 2\)\. Participants were instructed as follows:

“The following tweets were generated by AI \(LLM\) using your publicly available tweets\. For each of them, rank how likely it is that you would write it\.”

Participants rated each tweet using a five\-point scale: Definitely not me, Probably not me, Unsure, Probably me, and Definitely me\. If a participant failed to select Probably me or Definitely me for either of their two real tweets, their responses were deemed unreliable and excluded\. This left us with 12 valid participants\. Surveys were administered individually via the Qualtrics platform, with each participant receiving a $15 Amazon gift card\. We also collected basic demographic and Twitter \(X\) usage information of the participants, summarized in Appendix [E](https://arxiv.org/html/2604.18955#A5), Table [15](https://arxiv.org/html/2604.18955#A5.T15)\.
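The attention\-check rule amounts to a simple filter over participants\. The sketch below is a hypothetical illustration: the input mapping from participant id to the two ratings given to that participant's real tweets, and the function name, are assumptions about how survey responses might be stored\.

```python
ACCEPTED = {"Probably me", "Definitely me"}


def valid_participants(real_tweet_ratings):
    """Keep participants who recognized both of their real tweets.

    real_tweet_ratings maps a participant id to the ratings given to
    that participant's two authentic tweets (the attention checks).
    """
    return [
        pid
        for pid, ratings in real_tweet_ratings.items()
        if all(rating in ACCEPTED for rating in ratings)
    ]
```

A participant is dropped as soon as one of the two real tweets is rated Unsure or lower, which matches the exclusion criterion stated above\.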

### 2\.4 User Attribute Inference

To categorize user attributes in our study, we utilize two formal taxonomies\. First, we draw on the IAB Tech Lab Content Taxonomy v3\.1 IAB Tech Lab \([2023](https://arxiv.org/html/2604.18955#bib.bib42)\), a widely adopted standard in digital advertising that provides a hierarchical classification of online content across diverse topics \(e\.g\., news, sports, business\)\. Specifically, we use the top\-level categories of this taxonomy\. Second, we use the 2018 Standard Occupational Classification \(SOC\) System U\.S\. Bureau of Labor Statistics \([2018](https://arxiv.org/html/2604.18955#bib.bib43)\), which is maintained by the U\.S\. Bureau of Labor Statistics to systematically classify occupations in the United States\. Using these categories, we annotated and labeled the occupations and interests/hobbies of our VAPOR users \(described in Section [2\.2](https://arxiv.org/html/2604.18955#S2.SS2)\)\.

Two authors of this paper collaboratively annotated each user’s occupation and interests by closely examining profile descriptions and historical tweets, with disagreements resolved by a third author \(a senior researcher\)\. For interests, we initially considered 39 categories from the IAB Tech Lab Content Taxonomy, of which 25 were represented among the selected users\. For occupations, we relied on the SOC 2018 hierarchy and annotated at two levels, Level 1 \(L1\) and Level 2 \(L2\), comprising 18 and 38 occupational groups, respectively, to balance granularity and coverage\. The full category lists are provided in Appendix [F](https://arxiv.org/html/2604.18955#A6) \(see Tables [18](https://arxiv.org/html/2604.18955#A6.T18)–[20](https://arxiv.org/html/2604.18955#A6.T20)\), and the Task III prompt is included in Appendix [B](https://arxiv.org/html/2604.18955#A2) \(Prompt 3\)\.

##### Evaluation\.

For each user, an LLM yields a category \(for both occupations and interests/hobbies\)\. Then, we evaluate the performance against the ground truth user attributes using accuracy and weighted F1\-score metrics\.

## 3 Experiments

In this section, we describe our experimental design and present detailed evaluations of the LLMs on three main tasks\. The experiments were conducted on a system with an AMD EPYC 7513 CPU, 4 NVIDIA RTX A4000 GPUs, and 1 TB of RAM\.

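Table 1: Experimental results for social media authorship verification \(Task I\); metric: weighted F1\-score, multiplied by 100 for clarity\. Each of the 15 setting columns pairs a negative social media post sampling strategy with a positive user sampling scheme \(Rnd, Rec, Top\)\.

| Model | Reciprocal Rnd | Reciprocal Rec | Reciprocal Top | Followees-only Rnd | Followees-only Rec | Followees-only Top | Followers-only Rnd | Followers-only Rec | Followers-only Top | Random Rnd | Random Rec | Random Top | Topic-similar Rnd | Topic-similar Rec | Topic-similar Top | Avg | Unseen Data (Knowledge Cut-off) | Avg Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4 | 85 | 83 | 84 | 78 | 86 | 83 | 74 | 87 | 86 | 94 | 93 | 95 | 76 | 83 | 80 | 84.5 | 85 | 1.07 |
| Gemini | 75 | 81 | 80 | 59 | 78 | 79 | 69 | 78 | 80 | 84 | 85 | 87 | 57 | 72 | 69 | 75.5 | 79 | 2.87 |
| DeepSeek | 73 | 77 | 77 | 68 | 77 | 76 | 72 | 76 | 75 | 86 | 87 | 87 | 56 | 70 | 70 | 75.1 | 80 | 3.47 |
| RF | 70 | 80 | 82 | 63 | 78 | 75 | 64 | 76 | 77 | 59 | 67 | 72 | 78 | 65 | 65 | 71.4 | - | 4.33 |
| TF-IDF | 60 | 60 | 60 | 60 | 60 | 60 | 57 | 60 | 60 | 71 | 81 | 81 | 57 | 75 | 75 | 65.1 | - | 5.80 |
| Compression-NCD | 59 | 63 | 62 | 59 | 62 | 62 | 57 | 61 | 61 | 65 | 78 | 77 | 55 | 71 | 71 | 64.2 | - | 6.20 |
| SIAMESE + SBERT | 57 | 60 | 59 | 59 | 60 | 60 | 56 | 60 | 60 | 69 | 74 | 74 | 60 | 34 | 34 | 58.0 | - | 7.67 |
| SIAMESE + GloVe | 19757740706865776606262333449 | | | | | | | | | | | | | | | 53.1 | - | 7.87 |
| Llama | 61 | 74 | 76 | 50 | 70 | 71 | 55 | 72 | 70 | 51 | 51 | 51 | 43 | 47 | 44 | 59.1 | 56 | 7.87 |
| GPT-3.5-Turbo | 58 | 61 | 59 | 57 | 56 | 56 | 55 | 57 | 55 | 62 | 60 | 59 | 57 | 57 | 57 | 57.7 | 71 | 8.47 |
| BERT | 5577427674151773413336433433 | | | | | | | | | | | | | | | 41.5 | - | 10.13 |
| USE | 28 | 43 | 43 | 41 | 44 | 44 | 28 | 43 | 44 | 49 | 60 | 59 | 42 | 53 | 53 | 44.9 | - | 10.47 |

For SIAMESE \+ GloVe and BERT, the 15 per\-setting values are run together in the source and do not split unambiguously, so they are shown unsplit\.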

### 3\.1 Social Media Authorship Verification

For this task, we benchmark five LLMs \(GPT\-4, GPT\-3\.5\-Turbo, Gemini, Llama, and DeepSeek\) alongside BERT Devlin et al\. \([2019](https://arxiv.org/html/2604.18955#bib.bib84)\) and a Random Forest classifier \(training details in Appendix [C](https://arxiv.org/html/2604.18955#A3)\)\. We further include established authorship verification baselines: Universal Sentence Encoder \(USE\) cosine similarity Cer et al\. \([2018](https://arxiv.org/html/2604.18955#bib.bib80)\), TF\-IDF cosine similarity Stamatatos \([2009](https://arxiv.org/html/2604.18955#bib.bib81)\), a compression\-based impostors method using normalized compression distance \(NCD\) Potha and Stamatatos \([2017](https://arxiv.org/html/2604.18955#bib.bib82)\), and two Siamese models based on GloVe\-LSTM Boenninghoff et al\. \([2019](https://arxiv.org/html/2604.18955#bib.bib15)\) and SBERT embeddings Reimers and Gurevych \([2019](https://arxiv.org/html/2604.18955#bib.bib44)\) \(baseline descriptions in Appendix [C](https://arxiv.org/html/2604.18955#A3)\)\. Performance is measured using weighted F1\-score, pooling predictions across users\. We evaluate 15 settings formed by the Cartesian product of three positive user sampling schemes \(Random, Recent Active, and Top Active\) and five negative post sampling strategies \(Random, Topic\-similar, Followers\-only, Followees\-only, and Reciprocal; see Section [2\.1](https://arxiv.org/html/2604.18955#S2.SS1)\)\. The number of Few\-shot Example Posts and Positive/Negative Evaluation Posts is fixed \($k = m_1 = m_2 = 20$\), and each setting includes 50 users\. Table [1](https://arxiv.org/html/2604.18955#S3.T1) reports results, including evaluation on unseen data collected after model knowledge cut\-off dates\.
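The experimental grid can be enumerated explicitly\. The sketch below is illustrative: the constant names are hypothetical, and the per\-setting instance count assumes that each evaluation post forms one classification instance, which the text suggests but does not state outright\.

```python
from itertools import product

USER_SAMPLING = ("Random", "Recent Active", "Top Active")
NEGATIVE_SAMPLING = (
    "Random",
    "Topic-similar",
    "Followers-only",
    "Followees-only",
    "Reciprocal",
)
USERS_PER_SETTING = 50
M1 = M2 = 20  # positive and negative evaluation posts per user

# Every (user sampling, negative sampling) combination is one setting.
settings = list(product(USER_SAMPLING, NEGATIVE_SAMPLING))

# Assumption: one classification instance per evaluation post.
instances_per_setting = USERS_PER_SETTING * (M1 + M2)

if __name__ == "__main__":
    print(len(settings), "settings,", instances_per_setting, "instances each")
```

Under that assumption each model is queried on 2,000 instances per setting, which gives a sense of the API cost that motivates sampling users instead of evaluating all of them\.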

As shown in Table [1](https://arxiv.org/html/2604.18955#S3.T1), GPT\-4 consistently achieves the strongest performance, with an average F1\-score of 0\.845 and the top rank \(1\.07\)\. Gemini and DeepSeek form a strong second tier, with average F1\-scores of approximately 0\.755 and 0\.751, respectively\. Traditional baselines \(e\.g\., Random Forest, TF\-IDF, and Compression\-NCD\) achieve intermediate performance, while Siamese models \(SBERT and GloVe\) show moderate capability and sensitivity to sampling conditions\. In contrast, Llama, GPT\-3\.5\-Turbo, BERT, and USE exhibit lower and more variable performance, highlighting the robustness of GPT\-4, Gemini, and DeepSeek for authorship verification\. On unseen data, GPT\-4 again leads \(0\.85\), followed by DeepSeek \(0\.80\) and Gemini \(0\.79\), demonstrating strong generalization beyond the training period\.

Table 2: Impact of positive user sampling and negative post sampling strategies on model accuracy\.

| Model | User Effect (pp) | Tweet Effect (pp) |
|---|---|---|
| DeepSeek | 4.37 | 20.17 |
| Gemini | 6.62 | 18.26 |
| GPT-3.5-Turbo | 4.53 | 10.62 |
| GPT-4 | 5.53 | 14.65 |
| Llama | 9.47 | 21.13 |
| Average | 6.10 | 16.97 |

Table [2](https://arxiv.org/html/2604.18955#S3.T2) further shows that negative post sampling has a substantially larger impact on accuracy \(≈17 percentage points\) than positive user sampling \(≈6 points\)\. Llama and DeepSeek are most sensitive to negative sampling variations, while GPT\-3\.5\-Turbo shows the lowest variability\. Overall, these results underscore GPT\-4’s robustness and the critical role of negative post sampling in authorship verification\. Appendix [C](https://arxiv.org/html/2604.18955#A3) provides supplementary materials for Task I, including details on the computation of the User and Tweet Effects, qualitative analysis \(Table [9](https://arxiv.org/html/2604.18955#A3.T9)\), class\-wise results, and implementation details such as model access and hyperparameters\.

### 3\.2 Social Media Post Generation

The evaluated models for this task include GPT\-4o, Gemini, DeepSeek, and Llama, along with traditional baselines such as the Markov Chain Shannon \([1948](https://arxiv.org/html/2604.18955#bib.bib83)\); Freitas et al\. \([2015](https://arxiv.org/html/2604.18955#bib.bib85)\), BART\-large Lewis et al\. \([2019](https://arxiv.org/html/2604.18955#bib.bib86)\), and T5\-large Raffel et al\. \([2020](https://arxiv.org/html/2604.18955#bib.bib87)\)\. For BART and T5, we used the ‘large’ pretrained variants\. We used $k = 50$, $n = 10$, and 50 Prompt and 50 Nonprompt posts \(Section [2\.2](https://arxiv.org/html/2604.18955#S2.SS2)\)\.

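Table 3: Experimental results for social media post generation \(Task II\)

| Set | Model | AOS | AGR | ARG | GDR | RDR | BLEU | ROUGE\-1 | ROUGE\-L |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prompt | GPT\-4o | 0\.196 | 0\.466 | 0\.333 | 0\.157 | 0\.919 | 0\.049 | 0\.233 | 0\.187 |
| Prompt | Gemini | 0\.199 | 0\.442 | 0\.319 | 0\.134 | 0\.900 | 0\.041 | 0\.209 | 0\.178 |
| Prompt | DeepSeek | 0\.214 | 0\.493 | 0\.363 | 0\.163 | 0\.926 | 0\.065 | 0\.264 | 0\.226 |
| Prompt | Llama | 0\.188 | 0\.489 | 0\.341 | 0\.156 | 0\.908 | 0\.135 | 0\.292 | 0\.253 |
| Prompt | Markov Chain | 0\.260 | 0\.750 | 0\.424 | 0\.117 | 0\.695 | 0\.957 | 0\.670 | 0\.668 |
| Prompt | BART\-large | 0\.257 | 0\.516 | 0\.386 | 0\.140 | 0\.800 | 0\.099 | 0\.241 | 0\.179 |
| Prompt | T5\-large | 0\.228 | 0\.542 | 0\.372 | 0\.060 | 0\.800 | 0\.029 | 0\.234 | 0\.189 |
| Nonprompt | GPT\-4o | 0\.191 | 0\.417 | 0\.315 | 0\.144 | 0\.891 | 0\.029 | 0\.211 | 0\.168 |
| Nonprompt | Gemini | 0\.195 | 0\.424 | 0\.312 | 0\.133 | 0\.893 | 0\.037 | 0\.201 | 0\.171 |
| Nonprompt | DeepSeek | 0\.209 | 0\.434 | 0\.341 | 0\.151 | 0\.885 | 0\.042 | 0\.236 | 0\.204 |
| Nonprompt | Llama | 0\.183 | 0\.408 | 0\.317 | 0\.146 | 0\.875 | 0\.035 | 0\.220 | 0\.181 |
| Nonprompt | Markov Chain | 0\.245 | 0\.504 | 0\.374 | 0\.112 | 0\.658 | 0\.123 | 0\.345 | 0\.320 |
| Nonprompt | BART\-large | 0\.215 | 0\.438 | 0\.330 | 0\.120 | 1\.000 | 0\.006 | 0\.183 | 0\.130 |
| Nonprompt | T5\-large | 0\.219 | 0\.456 | 0\.351 | 0\.140 | 0\.900 | 0\.017 | 0\.206 | 0\.170 |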

Table 4: Perplexity Scores \(Task II\)

| Model | Perplexity |
| --- | --- |
| GPT\-4o | 73\.43 |
| Gemini | 129\.31 |
| DeepSeek | 141\.83 |
| Llama | 76\.78 |
| Markov Chain | 263\.26 |
| BART\-large | 1\.123 |
| T5\-large | 3\.00 |

Tables [3](https://arxiv.org/html/2604.18955#S3.T3) and [4](https://arxiv.org/html/2604.18955#S3.T4) summarize model behavior in post generation\. The Markov Chain baseline shows strong replication \(high AOS and BLEU\) but very poor fluency \(Perplexity: 263\.26\)\. Among LLMs, DeepSeek attains the highest overall semantic similarity in the Prompt setting, indicating closer semantic alignment with real posts, yet its high perplexity suggests reduced fluency\. Llama achieves the highest BLEU and ROUGE scores among LLMs, reflecting stronger lexical reuse, but records the lowest overall semantic similarity\. GPT\-4o offers the most balanced profile, combining strong semantic similarity, broad coverage \(high RDR\), and the lowest perplexity among LLMs\. Gemini emphasizes paraphrasing, yielding moderate semantic similarity with comparatively higher perplexity\. Transformer baselines \(BART\-large, T5\-large\) achieve extremely low perplexity, indicating highly predictable outputs, but show narrower coverage under Prompt conditioning\. Overall, the results reveal trade\-offs among semantic similarity, fluency, lexical reuse, and coverage breadth\.
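To make the fluency comparison concrete, the following is a minimal sketch of the standard perplexity definition \(the exponential of the mean negative log\-likelihood per token\)\. It assumes per\-token natural\-log probabilities are already available from some scoring language model; the function name `perplexity` and the example values are hypothetical, and the scoring setup actually used in the paper is not reproduced here\.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities, one per token, assumed to
    come from whatever scoring language model is in use.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical post of four tokens, each assigned probability 0.25.
uniform_post = [math.log(0.25)] * 4
# Hypothetical post whose tokens are much less predictable.
surprising_post = [math.log(0.01)] * 4

print(perplexity(uniform_post))
print(perplexity(surprising_post))
```

Lower values mean the scoring model finds the text more predictable, which is why very low perplexity can signal generic output as well as fluent output\.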

![Refer to caption](https://arxiv.org/html/2604.18955v1/x3.png)

Figure 3: User survey responses categorized by generated tweets from different LLMs

### 3\.3User Study Results

Table 5: Experimental results for user attribute inference \(Task III\)

| Model | Interests/Hobbies Accuracy | Interests/Hobbies Weighted F1 | Occupations \(L1\) Accuracy | Occupations \(L1\) Weighted F1 | Occupations \(L2\) Accuracy | Occupations \(L2\) Weighted F1 |
| --- | --- | --- | --- | --- | --- | --- |
| GPT\-4o | 71\.54 | 75\.07 | 67\.62 | 66\.01 | 56\.13 | 51\.50 |
| Gemini | 76\.24 | 76\.49 | 78\.32 | 76\.61 | 61\.87 | 60\.83 |
| DeepSeek | 69\.19 | 69\.53 | 71\.27 | 68\.19 | 57\.18 | 51\.53 |
| Llama | 30\.54 | 36\.28 | 35\.77 | 43\.29 | 2\.61 | 1\.35 |
| Preoţiuc\-Pietro et al\. \([2015](https://arxiv.org/html/2604.18955#bib.bib88)\) | 2\.60 | 2\.33 | 22\.08 | 29\.27 | 13\.04 | 10\.32 |
| Lewis et al\. \([2019](https://arxiv.org/html/2604.18955#bib.bib86)\) \(Few\-shot\) | 46\.56 | 31\.27 | 70\.13 | 63\.08 | 50\.65 | 39\.47 |
| Michelson and Macskassy \([2010](https://arxiv.org/html/2604.18955#bib.bib89)\) | 15\.61 | 8\.11 | 27\.42 | 3\.84 | 12\.99 | 5\.56 |
| Pennacchiotti and Popescu \([2011](https://arxiv.org/html/2604.18955#bib.bib90)\) | 41\.56 | 37\.75 | 55\.84 | 49\.58 | 32\.47 | 32\.12 |

To assess perceived authenticity, we conducted the user study described in Section [2\.3](https://arxiv.org/html/2604.18955#S2.SS3)\. Figure [3](https://arxiv.org/html/2604.18955#S3.F3) shows the distribution of ratings across five categories\. Gemini and Llama receive the highest concentration of positive judgments \(*Definitely/Probably me*; 44/60 each\), followed by GPT\-4o \(41/60\) and DeepSeek \(39/60\)\. GPT\-4o also receives the most *Definitely not me* responses, while DeepSeek shows the highest number of *Probably not me* ratings\. The neutral option \(*Unsure*\) remains below 15% across models, suggesting that participants generally formed clear opinions\. Mean authenticity scores further confirm this pattern: Gemini and Llama achieve the highest ratings \(3\.95/5\), with DeepSeek and GPT\-4o slightly lower \(3\.67–3\.68\)\. Additional statistical and qualitative analyses are provided in Appendix [E](https://arxiv.org/html/2604.18955#A5) \(Tables [14](https://arxiv.org/html/2604.18955#A5.T14) and [16](https://arxiv.org/html/2604.18955#A5.T16)\)\.

Note:All users consented to having both their own tweets and LLM\-generated tweets made publicly available\.

We further compare human ratings \(Appendix [E](https://arxiv.org/html/2604.18955#A5), Table [14](https://arxiv.org/html/2604.18955#A5.T14)\) with automatic metrics \(Tables [3](https://arxiv.org/html/2604.18955#S3.T3) and [4](https://arxiv.org/html/2604.18955#S3.T4)\) to examine their alignment with perceived authenticity\. Llama shows the strongest consistency, combining high author\-likeness ratings with strong BLEU and ROUGE scores, suggesting that lexical reuse enhances perceived authenticity\. Gemini achieves similarly high human ratings despite lower lexical overlap and higher perplexity, indicating that paraphrastic imitation can also be effective\. DeepSeek attains high semantic similarity but lower authenticity ratings, implying that topical alignment alone is insufficient\. GPT\-4o demonstrates moderate performance across both human and automatic measures, reinforcing that no single metric reliably predicts perceived author\-likeness\. Overall, the results reveal trade\-offs across lexical, stylistic, and semantic dimensions\.

### 3\.4User Attribute Inference

We evaluate GPT\-4o, Gemini, DeepSeek, and Llama on user attribute inference by predicting the occupations and interests of VAPOR users \(Section [2\.2](https://arxiv.org/html/2604.18955#S2.SS2)\)\. For each user, 50 sampled tweets are provided as input to the prompt described in Section [2\.4](https://arxiv.org/html/2604.18955#S2.SS4)\. We compare LLMs’ performance against several baselines, including a Gaussian Process classifier with Word2Vec embeddings Preoţiuc\-Pietro et al\. \([2015](https://arxiv.org/html/2604.18955#bib.bib88)\), few\-shot BART Lewis et al\. \([2019](https://arxiv.org/html/2604.18955#bib.bib86)\), an entity\-based DBpedia classifier Michelson and Macskassy \([2010](https://arxiv.org/html/2604.18955#bib.bib89)\), and a TF\-IDF gradient boosting model Pennacchiotti and Popescu \([2011](https://arxiv.org/html/2604.18955#bib.bib90)\)\. Table [5](https://arxiv.org/html/2604.18955#S3.T5) shows the results\. Performance is measured using accuracy and the weighted F1\-score\.
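For readers unfamiliar with the two metrics, here is a minimal pure\-Python sketch of accuracy and support\-weighted F1 for single\-label classification\. The function names and the toy labels are hypothetical and are not taken from the paper's code; the weighted F1 follows the usual definition, in which each class's F1 is weighted by its number of true instances\.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of users whose predicted category equals the gold category."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


# Hypothetical gold and predicted interest categories for four users.
gold = ["Tech", "Tech", "Sports", "Food"]
pred = ["Tech", "Sports", "Sports", "Tech"]

print(accuracy(gold, pred))
print(weighted_f1(gold, pred))
```

Because the weights follow the gold label distribution, weighted F1 can sit well below accuracy when a model does poorly on frequent classes, and above it when errors concentrate in rare ones\.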

Gemini achieves the strongest results across interests and both occupational levels \(L1 and L2\)\. GPT\-4o and DeepSeek perform comparably at broader levels \(L1\) but decline at finer granularity \(L2\), while Llama consistently underperforms\. These findings indicate that increasing classification granularity significantly affects LLM accuracy, with Gemini demonstrating the most robust overall performance\.

Error patterns are further analyzed via confusion matrices \(Table [17](https://arxiv.org/html/2604.18955#A6.T17), Appendix [F](https://arxiv.org/html/2604.18955#A6)\)\. Appendix [F](https://arxiv.org/html/2604.18955#A6) also provides implementation details, baseline descriptions, category lists, and qualitative analysis\.

## 4Conclusion

In this paper, we presented a comprehensive evaluation of modern LLMs across three core social media analytics tasks: social media authorship verification, social media post generation, and user attribute inference\. GPT\-4 achieved the strongest results in authorship verification, particularly under the unseen\-data regime\. In post generation, DeepSeek showed strong semantic alignment, while Gemini and Llama received the highest human authenticity ratings\. For attribute inference, Gemini performed best, especially at finer\-grained hierarchical levels\. Overall, our findings highlight the importance of controlled sampling and multifaceted evaluation when assessing LLMs in social media contexts\.

Our study has some limitations\. Although no other datasets meeting our criteria were identified, future work could extend our methodology to additional datasets and platforms \(e\.g\., Reddit\)\. Moreover, our user study is not large\-scale\. Future studies should use larger, more diverse samples\.

Future work may extend this framework by incorporating multimodal signals \(e\.g\., images, videos, and social graphs\) to enrich stylistic and demographic analyses\. Modeling reposting and diffusion dynamics could further clarify content virality\. More broadly, advancing LLM\-based modeling of social network evolution—potentially via link prediction that integrates textual, temporal, and structural signals—would strengthen comprehensive benchmarks in social media analytics\.

Finally, we note ethical risks in modeling user identity, including impersonation and privacy\-sensitive profiling\. This work is intended for benchmarking, not unsafeguarded deployment\.

## References

- Writeprints: a stylometric approach to identity\-level identification and similarity detection in cyberspace\.ACM Transactions on Information Systems \(TOIS\)26\(2\),pp\. 1–29\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p4.1)\.
- T\. Alsanoosy, B\. Shalbi, and A\. Noor \(2024\)Authorship attribution for english short texts\.Engineering, Technology & Applied Science Research14\(5\),pp\. 16419–16426\.Cited by:[Appendix G](https://arxiv.org/html/2604.18955#A7.p2.1)\.
- B\. Boenninghoff, R\. M\. Nickel, S\. Zeiler, and D\. Kolossa \(2019\)Similarity learning for authorship verification in social media\.InICASSP 2019\-2019 IEEE international conference on acoustics, speech and signal processing \(ICASSP\),pp\. 2457–2461\.Cited by:[4th item](https://arxiv.org/html/2604.18955#A3.I2.i4.p1.1),[§3\.1](https://arxiv.org/html/2604.18955#S3.SS1.p1.1)\.
- A\. Celikyilmaz, E\. Clark, and J\. Gao \(2020\)Evaluation of text generation: a survey\.arXiv preprint arXiv:2006\.14799\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p3.1)\.
- D\. Cer, Y\. Yang, S\. Kong, N\. Hua, N\. Limtiaco, R\. S\. John, N\. Constant, M\. Guajardo\-Cespedes, S\. Yuan, C\. Tar,et al\.\(2018\)Universal sentence encoder for english\.InProceedings of the 2018 conference on empirical methods in natural language processing: system demonstrations,pp\. 169–174\.Cited by:[1st item](https://arxiv.org/html/2604.18955#A3.I2.i1.p1.1),[§3\.1](https://arxiv.org/html/2604.18955#S3.SS1.p1.1)\.
- N\. V\. Chawla, K\. W\. Bowyer, L\. O\. Hall, and W\. P\. Kegelmeyer \(2002\)SMOTE: synthetic minority over\-sampling technique\.Journal of artificial intelligence research16,pp\. 321–357\.Cited by:[Appendix C](https://arxiv.org/html/2604.18955#A3.SSx2.p1.8)\.
- S\. F\. Chen, D\. Beeferman, and R\. Rosenfeld \(1998\)Evaluation metrics for language models\.Cited by:[§2\.2](https://arxiv.org/html/2604.18955#S2.SS2.SSS0.Px4.p1.1)\.
- K\. De Bock and D\. Van den Poel \(2010\)Predicting website audience demographics for web advertising targeting using multi\-website clickstream data\.Fundamenta Informaticae98\(1\),pp\. 49–70\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p3.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)Bert: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 \(long and short papers\),pp\. 4171–4186\.Cited by:[§3\.1](https://arxiv.org/html/2604.18955#S3.SS1.p1.1)\.
- H\. Du, W\. Xing, and B\. Pei \(2023\)Automatic text generation using deep learning: providing large\-scale support for online learning communities\.Interactive Learning Environments31\(8\),pp\. 5021–5036\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p3.1)\.
- C\. Freitas, F\. Benevenuto, S\. Ghosh, and A\. Veloso \(2015\)Reverse engineering socialbot infiltration strategies in twitter\.InProceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015,pp\. 25–32\.Cited by:[1st item](https://arxiv.org/html/2604.18955#A4.I1.i1.p1.1),[§3\.2](https://arxiv.org/html/2604.18955#S3.SS2.p1.4)\.
- M\. Gambini, C\. Senette, T\. Fagni, and M\. Tesconi \(2024\)Evaluating large language models for user stance detection on x \(twitter\)\.Machine Learning113\(10\),pp\. 7243–7266\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p2.1)\.
- S\. Goel and D\. G\. Goldstein \(2014\)Predicting individual behavior with social networks\.Marketing Science33\(1\),pp\. 82–93\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p3.1)\.
- U\. Gündüz \(2017\)The effect of social media on identity construction\.Mediterranean journal of social sciences8\(5\)\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p4.1)\.
- T\. Hong, J\. Choi, K\. Lim, and P\. Kim \(2021\)Enhancing personalized ads using interest category classification of sns users based on deep neural networks\.Sensors21\(1\),pp\. 199\.Cited by:[Appendix G](https://arxiv.org/html/2604.18955#A7.p4.1)\.
- B\. Hu, Q\. Sheng, J\. Cao, Y\. Shi, Y\. Li, D\. Wang, and P\. Qi \(2024a\)Bad actor, good advisor: exploring the role of large language models in fake news detection\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 22105–22113\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p2.1)\.
- J\. Hu, H\. Zeng, H\. Li, C\. Niu, and Z\. Chen \(2007\)Demographic prediction based on user’s browsing behavior\.InProceedings of the 16th international conference on World Wide Web,pp\. 151–160\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p3.1)\.
- Y\. Hu, Z\. Hu, C\. Seah, and R\. K\. Lee \(2024b\)InstructAV: instruction fine\-tuning large language models for authorship verification\.arXiv preprint arXiv:2407\.12882\.Cited by:[Appendix G](https://arxiv.org/html/2604.18955#A7.p2.1)\.
- B\. Huang, C\. Chen, and K\. Shu \(2024\)Can large language models identify authorship?\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Cited by:[Appendix G](https://arxiv.org/html/2604.18955#A7.p2.1)\.
- B\. Huang, C\. Chen, and K\. Shu \(2025\)Authorship attribution in the era of llms: problems, methodologies, and challenges\.ACM SIGKDD Explorations Newsletter26\(2\),pp\. 21–43\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p3.1)\.
- J\. Huertas\-Tato, A\. Martín, and D\. Camacho \(2024\)Understanding writing style in social media with a supervised contrastively pre\-trained transformer\.Knowledge\-Based Systems296,pp\. 111867\.Cited by:[Appendix G](https://arxiv.org/html/2604.18955#A7.p2.1)\.
- IAB Tech Lab \(2023\)Content taxonomy: v3\.1\.Note:[https://iabtechlab\.com/standards/content\-taxonomy/](https://iabtechlab.com/standards/content-taxonomy/)Accessed: 2025\-05\-07Cited by:[§2\.4](https://arxiv.org/html/2604.18955#S2.SS4.p1.1)\.
- M\. Injadat, F\. Salo, and A\. B\. Nassif \(2016\)Data mining techniques in social media: a survey\.Neurocomputing214,pp\. 654–670\.Cited by:[Appendix G](https://arxiv.org/html/2604.18955#A7.p1.1)\.
- K\. Kheiri, M\. F\. A\. Khan, T\. Derr, and H\. Karimi \(2023\)An analysis of the dynamics of ties on twitter\.In2023 IEEE International Conference on Big Data \(BigData\),pp\. 5809–5817\.Cited by:[Table 6](https://arxiv.org/html/2604.18955#A1.T6),[Appendix A](https://arxiv.org/html/2604.18955#A1.p1.1),[§2](https://arxiv.org/html/2604.18955#S2.p1.1)\.
- I\. Kumar, S\. Viswanathan, S\. Yerra, A\. Salemi, R\. A\. Rossi, F\. Dernoncourt, H\. Deilamsalehy, X\. Chen, R\. Zhang, S\. Agarwal,et al\.\(2024\)Longlamp: a benchmark for personalized long\-form text generation\.arXiv preprint arXiv:2407\.11016\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p3.1)\.
- M\. Lewis, Y\. Liu, N\. Goyal, M\. Ghazvininejad, A\. Mohamed, O\. Levy, V\. Stoyanov, and L\. Zettlemoyer \(2019\)BART: denoising sequence\-to\-sequence pre\-training for natural language generation, translation, and comprehension\.arXiv preprint arXiv:1910\.13461\.Cited by:[2nd item](https://arxiv.org/html/2604.18955#A4.I1.i2.p1.1),[2nd item](https://arxiv.org/html/2604.18955#A6.I1.i2.p1.1),[§3\.2](https://arxiv.org/html/2604.18955#S3.SS2.p1.4),[§3\.4](https://arxiv.org/html/2604.18955#S3.SS4.p1.1),[Table 5](https://arxiv.org/html/2604.18955#S3.T5.2.9.1)\.
- C\. Lin \(2004\)Rouge: a package for automatic evaluation of summaries\.InText summarization branches out,pp\. 74–81\.Cited by:[§2\.2](https://arxiv.org/html/2604.18955#S2.SS2.SSS0.Px4.p1.1)\.
- X\. Liu, B\. Peng, M\. Wu, M\. Wang, H\. Cai, and Q\. Huang \(2024\)Occupation prediction with multimodal learning from tweet messages and google street view images\.AGILE: GIScience Series5,pp\. 36\.Cited by:[Appendix G](https://arxiv.org/html/2604.18955#A7.p4.1)\.
- M\. Michelson and S\. A\. Macskassy \(2010\)Discovering users’ topics of interest on twitter: a first look\.InProceedings of the fourth workshop on Analytics for noisy unstructured text data,pp\. 73–80\.Cited by:[3rd item](https://arxiv.org/html/2604.18955#A6.I1.i3.p1.1),[§3\.4](https://arxiv.org/html/2604.18955#S3.SS4.p1.1),[Table 5](https://arxiv.org/html/2604.18955#S3.T5.2.10.1)\.
- Y\. Mu, C\. Dong, K\. Bontcheva, and X\. Song \(2024\)Large language models offer an alternative to the traditional approach of topic modelling\.arXiv preprint arXiv:2403\.16248\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p2.1)\.
- K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu \(2002\)Bleu: a method for automatic evaluation of machine translation\.InProceedings of the 40th annual meeting of the Association for Computational Linguistics,pp\. 311–318\.Cited by:[§2\.2](https://arxiv.org/html/2604.18955#S2.SS2.SSS0.Px4.p1.1)\.
- M\. Pennacchiotti and A\. Popescu \(2011\)A machine learning approach to twitter user classification\.InProceedings of the international AAAI conference on web and social media,Vol\.5,pp\. 281–288\.Cited by:[4th item](https://arxiv.org/html/2604.18955#A6.I1.i4.p1.1),[§3\.4](https://arxiv.org/html/2604.18955#S3.SS4.p1.1),[Table 5](https://arxiv.org/html/2604.18955#S3.T5.2.11.1)\.
- A\. Perez\-Castro, M\.R\. Martínez\-Torres, and S\.L\. Toral \(2023\)Efficiency of automatic text generators for online review content generation\.Technological Forecasting and Social Change189,pp\. 122380\.External Links:ISSN 0040\-1625,[Document](https://doi.org/10.1016/j.techfore.2023.122380),[Link](https://www.sciencedirect.com/science/article/pii/S0040162523000653)Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p3.1)\.
- R\. G\. Pillai, A\. Fokkens, and W\. van Atteveldt \(2025\)Engagement\-driven persona prompting for rewriting news tweets\.InProceedings of the 31st International Conference on Computational Linguistics,pp\. 8612–8622\.Cited by:[Appendix G](https://arxiv.org/html/2604.18955#A7.p3.1)\.
- N\. Potha and E\. Stamatatos \(2017\)An improved impostors method for authorship verification\.InInternational conference of the cross\-language evaluation forum for European languages,pp\. 138–144\.Cited by:[3rd item](https://arxiv.org/html/2604.18955#A3.I2.i3.p1.1),[§3\.1](https://arxiv.org/html/2604.18955#S3.SS1.p1.1)\.
- D\. Preoţiuc\-Pietro, V\. Lampos, and N\. Aletras \(2015\)An analysis of the user occupational class through twitter content\.InProceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),pp\. 1754–1764\.Cited by:[§3\.4](https://arxiv.org/html/2604.18955#S3.SS4.p1.1),[Table 5](https://arxiv.org/html/2604.18955#S3.T5.2.8.1)\.
- Z\. Qiu, H\. Lyu, W\. Xiong, and J\. Luo \(2025\)Can llms simulate social media engagement? a study on action\-guided response generation\.arXiv preprint arXiv:2502\.12073\.Cited by:[Appendix G](https://arxiv.org/html/2604.18955#A7.p3.1)\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, I\. Sutskever,et al\.\(2019\)Language models are unsupervised multitask learners\.OpenAI blog1\(8\),pp\. 9\.Cited by:[§2\.2](https://arxiv.org/html/2604.18955#S2.SS2.SSS0.Px4.p1.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of machine learning research21\(140\),pp\. 1–67\.Cited by:[3rd item](https://arxiv.org/html/2604.18955#A4.I1.i3.p1.1),[§3\.2](https://arxiv.org/html/2604.18955#S3.SS2.p1.4)\.
- N\. Reimers and I\. Gurevych \(2019\)Sentence\-bert: sentence embeddings using siamese bert\-networks\.arXiv preprint arXiv:1908\.10084\.Cited by:[5th item](https://arxiv.org/html/2604.18955#A3.I2.i5.p1.1),[§2\.2](https://arxiv.org/html/2604.18955#S2.SS2.SSS0.Px4.p2.7),[§3\.1](https://arxiv.org/html/2604.18955#S3.SS1.p1.1)\.
- A\. Salemi, S\. Mysore, M\. Bendersky, and H\. Zamani \(2024\)Lamp: when large language models meet personalization\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 7370–7392\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p3.1)\.
- A\. Saroj and S\. Pal \(2020\)Use of social media in crisis management: a survey\.International Journal of Disaster Risk Reduction48,pp\. 101584\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p1.1)\.
- D\. Schillinger, D\. Chittamuru, and A\. S\. Ramírez \(2020\)From “infodemics” to health promotion: a novel framework for the role of social media in public health\.American journal of public health110\(9\),pp\. 1393–1396\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p1.1)\.
- C\. E\. Shannon \(1948\)A mathematical theory of communication\.The Bell system technical journal27\(3\),pp\. 379–423\.Cited by:[1st item](https://arxiv.org/html/2604.18955#A4.I1.i1.p1.1),[§3\.2](https://arxiv.org/html/2604.18955#S3.SS2.p1.4)\.
- D\. Shulman \(2022\)Self\-presentation: impression management in the digital age\.InThe Routledge international handbook of Goffman studies,pp\. 26–37\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p4.1)\.
- M\. Singh and G\. Singh \(2018\)Impact of social media on e\-commerce\.International Journal of Engineering & Technology7\(2\.30\),pp\. 21–26\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p1.1)\.
- E\. Stamatatos \(2009\)A survey of modern authorship attribution methods\.Journal of the American Society for information Science and Technology60\(3\),pp\. 538–556\.Cited by:[2nd item](https://arxiv.org/html/2604.18955#A3.I2.i2.p1.1),[§3\.1](https://arxiv.org/html/2604.18955#S3.SS1.p1.1)\.
- A\. Tigunova, P\. Mirza, A\. Yates, and G\. Weikum \(2020\)RedDust: a large reusable dataset of reddit user traits\.InProceedings of the Twelfth Language Resources and Evaluation Conference,pp\. 6118–6126\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p4.1)\.
- J\. Tyo, B\. Dhingra, and Z\. C\. Lipton \(2022\)On the state of the art in authorship attribution and authorship verification\.arXiv preprint arXiv:2209\.06869\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p3.1)\.
- U\.S\. Bureau of Labor Statistics \(2018\)Standard occupational classification \(soc\) system\.Note:[https://www\.bls\.gov/soc/](https://www.bls.gov/soc/)Accessed: 2025\-05\-07Cited by:[§2\.4](https://arxiv.org/html/2604.18955#S2.SS4.p1.1)\.
- H\. Wen, Z\. Xiao, E\. Hovy, and A\. G\. Hauptmann \(2023\)Towards open\-domain twitter user profile inference\.InFindings of the Association for Computational Linguistics: ACL 2023,pp\. 3172–3188\.Cited by:[Appendix G](https://arxiv.org/html/2604.18955#A7.p4.1)\.
- H\. Yadav and J\. Li \(2017\)Social media writing style fingerprint\.arXiv preprint arXiv:1712\.04762\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p4.1)\.
- E\. Yu, J\. Li, and C\. Xu \(2024\)RePALM: popular quote tweet generation via auto\-response augmentation\.InFindings of the Association for Computational Linguistics ACL 2024,pp\. 9566–9579\.Cited by:[Appendix G](https://arxiv.org/html/2604.18955#A7.p3.1)\.
- T\. Zhang, F\. Ladhak, E\. Durmus, P\. Liang, K\. McKeown, and T\. B\. Hashimoto \(2024a\)Benchmarking large language models for news summarization\.Transactions of the Association for Computational Linguistics12,pp\. 39–57\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p2.1)\.
- W\. Zhang, Y\. Deng, B\. Liu, S\. J\. Pan, and L\. Bing \(2023a\)Sentiment analysis in the era of large language models: a reality check\.arXiv preprint arXiv:2305\.15005\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p2.1)\.
- X\. Zhang, Y\. Malkov, O\. Florez, S\. Park, B\. McWilliams, J\. Han, and A\. El\-Kishky \(2023b\)Twhin\-bert: a socially\-enriched pre\-trained language model for multilingual tweet representations at twitter\.InProceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining,pp\. 5597–5607\.Cited by:[Appendix G](https://arxiv.org/html/2604.18955#A7.p3.1)\.
- Z\. Zhang, R\. A\. Rossi, B\. Kveton, Y\. Shao, D\. Yang, H\. Zamani, F\. Dernoncourt, J\. Barrow, T\. Yu, S\. Kim,et al\.\(2024b\)Personalization of large language models: a survey\.arXiv preprint arXiv:2411\.00027\.Cited by:[§1](https://arxiv.org/html/2604.18955#S1.p3.1)\.
- Y\. Zhao, Y\. Wang, X\. Cheng, A\. M\. Tumlin, Y\. Liu, D\. Xia, M\. Jiang, and T\. Derr \(2025\)Amplifying your social media presence: personalized influential content generation with llms\.arXiv preprint arXiv:2505\.01698\.Cited by:[Appendix G](https://arxiv.org/html/2604.18955#A7.p3.1)\.

## Appendices

## Appendix ADataset

We performed our evaluations using the Twitter dataset introduced by Kheiri et al\. \([2023](https://arxiv.org/html/2604.18955#bib.bib41)\), comprising data from over 120,000 users tracked across a 15\-week period\. This dataset is particularly suitable due to its large scale, diversity, and comprehensive metadata and content coverage\. It includes weekly snapshots of user networks, capturing the dynamic formation and dissolution of follower connections\. Additionally, it provides detailed user\-generated content \(tweets\), interactions \(such as mentions\), and extensive user profile attributes, including verification status and user activity metrics\. The dataset captures an average of 1,175,846 new tweets weekly and approximately 2\.9 million total social ties, making it ideal for our comprehensive analysis\. We have explicitly obtained permission from the original authors to utilize this dataset in our study\. The dataset statistics are summarized in Table [6](https://arxiv.org/html/2604.18955#A1.T6)\.

Table 6: Twitter dataset statistics acquired from Kheiri et al\. \([2023](https://arxiv.org/html/2604.18955#bib.bib41)\)

| Network Property | Value |
| --- | --- |
| Total users | 123,829 |
| Total ties | 2,922,732 |
| \# Verified accounts | 3,829 |
| Avg\. weekly new followers | 10,855 |
| Avg\. weekly new unfollowers | 465 |
| Avg\. weekly new Tweets | 1,175,846 |
| Percentage verified users | 1\.687 |
| Avg\. followees count per user | 205 |
| Avg\. followers count per user | 150 |
| Diameter \(longest shortest path\) | 8 |
| Avg\. new tweets \(w/mentions\) | 2,021 |

## Appendix BLLMs’ Prompts and Expenses

User Description: \{⟨description⟩\}

Previous Tweets: \{⟨previous\_tweets⟩\}

Tweet to Classify: \{⟨tweet\_to\_classify⟩\}

Task: Based on the description and the previous tweets of the user, output ‘0’ if the above tweet was *not* generated by the user or ‘1’ if it *was*\. Even if the tweet is a URL, output only ‘0’ or ‘1’ as your response\.

Prompt 1: Social media authorship verification prompt \(Task I\)

Here, \{description\} and \{previous\_tweets\} correspond to the user’s bio and a few\-shot list of the user’s previously authored tweets \(excluding retweets\)\. Meanwhile, \{tweet\_to\_classify\} is drawn from either the target user’s own future tweets \(Positive Evaluation Posts\) or from other users’ tweets \(Negative Evaluation Posts\)\. The LLM outputs either 0 or 1, indicating whether it believes the specified tweet belongs to the same user\.
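As an illustration of how Prompt 1 might be filled in and its reply interpreted, the sketch below formats the three placeholders and maps the model's text reply to a binary label\. The helper names `build_verification_prompt` and `parse_label`, the bullet formatting of previous tweets, and the handling of malformed replies are assumptions for this example, not the authors' implementation; no API call is made\.

```python
PROMPT_1 = (
    "User Description: {description}\n"
    "Previous Tweets: {previous_tweets}\n"
    "Tweet to Classify: {tweet_to_classify}\n"
    "Task: Based on the description and the previous tweets of the user, "
    "output '0' if the above tweet was not generated by the user or '1' if it was. "
    "Even if the tweet is a URL, output only '0' or '1' as your response."
)


def build_verification_prompt(description, previous_tweets, tweet_to_classify):
    """Fill Prompt 1; previous tweets are joined one per line (an assumption)."""
    joined = "\n".join(f"- {tweet}" for tweet in previous_tweets)
    return PROMPT_1.format(
        description=description,
        previous_tweets=joined,
        tweet_to_classify=tweet_to_classify,
    )


def parse_label(response):
    """Return 0 or 1, or None when the reply is not a bare label."""
    text = response.strip().strip("'\"")
    return int(text) if text in {"0", "1"} else None


# Hypothetical inputs, for illustration only.
prompt = build_verification_prompt(
    "Math tutor sharing short lessons",
    ["New video on fractions", "Quick tip: factoring quadratics"],
    "Watch our latest algebra lesson",
)
print(prompt)
print(parse_label(" 1\n"))
```

Returning `None` for anything other than a bare `0` or `1` keeps malformed replies visible instead of silently counting them as one of the classes\.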

We have collected K tweets from one user on Twitter, included below:

<tweets go here\>

Below, you can also find the user\-written bio \(description\) as well as the number of followings and followers of the user:

User Bio: ⟨user\_description⟩

Number of Followings: ⟨following\_count⟩

Number of Followers: ⟨followers\_count⟩

Now, generate exactly ⟨NUM\_GENERATED\_TWEETS⟩ tweets that the user could have posted on their Twitter account\. Each tweet should not exceed 280 characters and must be formatted as:

1\. <tweet\_text\>

2\. <tweet\_text\>

…

Ensure all ⟨NUM\_GENERATED\_TWEETS⟩ tweets are included, without extra spacing or missing tweets\.

Prompt 2: Social media post generation prompt \(Task II\)
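Since Prompt 2 asks for a numbered list with a 280\-character limit, a reply can be checked mechanically before evaluation\. The following is a minimal sketch under that assumption; the function name `parse_generated_tweets` and the choice to raise errors on malformed replies are hypothetical and not described in the paper\.

```python
import re

MAX_TWEET_CHARS = 280
# Matches lines such as "1. some tweet text".
_ITEM = re.compile(r"^\s*(\d+)\.\s+(.*\S)\s*$")


def parse_generated_tweets(response, expected):
    """Extract numbered tweets and check the count and length constraints."""
    tweets = []
    for line in response.splitlines():
        match = _ITEM.match(line)
        if match:
            tweets.append(match.group(2))
    if len(tweets) != expected:
        raise ValueError(f"expected {expected} tweets, got {len(tweets)}")
    too_long = [t for t in tweets if len(t) > MAX_TWEET_CHARS]
    if too_long:
        raise ValueError(f"{len(too_long)} tweet(s) exceed {MAX_TWEET_CHARS} characters")
    return tweets


# Hypothetical model reply, for illustration only.
reply = "1. Morning run done, coffee next\n2. Anyone else watching the game tonight?\n"
print(parse_generated_tweets(reply, expected=2))
```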

For inferring the users’ interests and hobbies, we provide the LLMs with the following structured prompt:

Your task is to infer the primary ⟨ATTRIBUTE⟩ of a Twitter/X user based on their tweets\.

You will be provided with exactly ⟨\#SAMPLED\_TWEETS⟩ randomly selected tweets posted by this user\.

Assign the user to exactly one of the following ⟨ATTRIBUTE⟩ categories, based on the content of their tweets:

⟨CATEGORY\_LIST⟩

Here are the user’s sampled tweets:

⟨SAMPLED\_TWEETS⟩

Respond strictly and exclusively in the following JSON format:

\{ "⟨JSON\_KEY⟩": "⟨one category from the above list⟩" \}

Do *not* include any other text, explanation, or formatting\.

Prompt 3: User attribute inference prompt \(Task III\)

We instantiate Prompt 3 for \(i\) interests/hobbies, \(ii\) SOC 2018 Level 1 occupations, and \(iii\) SOC 2018 Level 2 occupations by substituting the placeholders in Table[7](https://arxiv.org/html/2604.18955#A2.T7)\.
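The substitution can be sketched as follows, using the ATTRIBUTE and JSON\_KEY values from Table 7\. The task keys \(`interests`, `occupation_l1`, `occupation_l2`\), the helper names, the bullet formatting, and the toy category list are hypothetical; the full category lists are given in Appendix F and are not reproduced here\.

```python
import json

# ATTRIBUTE and JSON_KEY values per task, following Table 7.
PLACEHOLDERS = {
    "interests": {"ATTRIBUTE": "interest category", "JSON_KEY": "interest_category"},
    "occupation_l1": {"ATTRIBUTE": "occupational category (Level-1)", "JSON_KEY": "occupation_category"},
    "occupation_l2": {"ATTRIBUTE": "occupational category (Level-2)", "JSON_KEY": "occupation_category"},
}


def build_attribute_prompt(task, categories, tweets):
    """Instantiate Prompt 3 for one task, one category list, and one user's tweets."""
    ph = PLACEHOLDERS[task]
    category_list = "\n".join(f"- {c}" for c in categories)
    tweet_list = "\n".join(f"- {t}" for t in tweets)
    return (
        f"Your task is to infer the primary {ph['ATTRIBUTE']} of a Twitter/X user based on their tweets.\n"
        f"You will be provided with exactly {len(tweets)} randomly selected tweets posted by this user.\n"
        f"Assign the user to exactly one of the following {ph['ATTRIBUTE']} categories, "
        "based on the content of their tweets:\n"
        f"{category_list}\n"
        f"Here are the user's sampled tweets:\n{tweet_list}\n"
        "Respond strictly and exclusively in the following JSON format:\n"
        f'{{ "{ph["JSON_KEY"]}": "<one category from the above list>" }}\n'
        "Do not include any other text, explanation, or formatting."
    )


def parse_category(response, task, categories):
    """Return the predicted category, or None if the reply is malformed or off-list."""
    key = PLACEHOLDERS[task]["JSON_KEY"]
    try:
        value = json.loads(response).get(key)
    except (json.JSONDecodeError, AttributeError):
        return None
    return value if value in categories else None


# Hypothetical category list and tweets, for illustration only.
cats = ["Sports", "Technology"]
prompt = build_attribute_prompt("interests", cats, ["Great match today", "Training at 6am"])
print(prompt)
print(parse_category('{"interest_category": "Sports"}', "interests", cats))
```

Rejecting off\-list categories in `parse_category` is one way to keep free\-form answers from being scored as valid predictions\.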

Table 7: Placeholders for interest and occupation prompt\.

| Task | ATTRIBUTE | CATEGORY\_LIST | JSON\_KEY |
| --- | --- | --- | --- |
| Interests / hobbies | interest category | IAB categories present in our data | interest\_category |
| Occupation L1 | occupational category \(Level\-1\) | 18 SOC major groups | occupation\_category |
| Occupation L2 | occupational category \(Level\-2\) | 38 SOC minor groups | occupation\_category |

### LLMs’ Costs

Table [8](https://arxiv.org/html/2604.18955#A2.T8) provides estimated expenses incurred from using LLMs in our experiments\. The total estimated cost for all model experiments across all three tasks is \$220\.

Table 8: Estimated cost of using LLMs in our experiments\.

| Model | Cost \(USD\) |
| --- | --- |
| GPT\-4 | \$150 |
| GPT\-4o | \$20 |
| GPT\-3\.5\-Turbo | \$20 |
| Gemini | \$20 |
| DeepSeek | \$10 |
| Llama | Free |
| Total | \$220 |

## Appendix CSocial Media Authorship Verification

### Model Access and Hyperparameters

We accessed LLMs through their respective APIs: OpenAI’s `ChatCompletion` endpoint for GPT\-4o, Google’s `GenerativeModel` API for Gemini, and the `ollama.chat` interface for Llama models\. For consistency across models, we fix the temperature parameter to 0 in all LLM calls, ensuring deterministic outputs in the binary classification setting\.

#### Qualitative Analysis for Task I \(Case Studies\)

As shown in Table [9](https://arxiv.org/html/2604.18955#A3.T9), the user’s previous tweets reveal a clear and consistent pattern of sharing educational YouTube content explicitly linked to the brand “Letstute\.” Gemini, GPT\-4o, and DeepSeek effectively recognize this pattern, correctly identifying tweets associated with “Letstute” educational videos as authentic\. GPT\-3\.5 struggles because it relies heavily on exact wording matches and misses broader topical connections\. Llama mistakenly accepts short conversational tweets unrelated to the user’s known educational content, indicating difficulty in grasping the user’s overall topical theme\.

Table 9: Authorship verification example with model predictions\. \(GT: Ground Truth, ✓: correct, ✗: incorrect\)

### Random Forest Baseline Implementation

For the authorship verification task, we trained a Random Forest \(RF\) classifier separately for each dataset configuration, using the same user and tweet splits as in the LLM experiments\. Specifically, for each user, we constructed a roughly balanced, pairwise training set by \(i\) pairing each of the user’s $k=20$ Few\-shot Example Posts with every other Few\-shot Example Post from the same user, yielding $k\times(k-1)=380$ positive pairs per user \(labeled as 1\), and \(ii\) pairing each of these $k$ Few\-shot Example Posts with $m_{2}=20$ Negative Evaluation Posts sampled from other users within the same timeframe, resulting in an additional $k\times m_{2}=400$ negative pairs per user \(labeled as 0\)\. For the test set, we paired each user’s $k=20$ Few\-shot Example Posts with their $m_{1}=20$ Positive Evaluation Posts \(labeled as 1\) and another set of $m_{2}=20$ Negative Evaluation Posts from other users \(labeled as 0\), ensuring no overlap and preventing train–test leakage\. Each tweet was embedded once using SBERT all\-MiniLM\-L6\-v2, and embeddings of each tweet pair were concatenated into a single 768\-dimensional feature vector\. For dataset configurations involving graph\-based sampling \(Followers\-only, Followees\-only, Reciprocal\), potential class imbalance in the training set was addressed using the Synthetic Minority Over\-Sampling Technique \(SMOTE\) Chawla et al\. \([2002](https://arxiv.org/html/2604.18955#bib.bib97)\)\. Each RF model was tuned independently via 3\-fold grid search, optimizing the hyperparameters detailed in Table [10](https://arxiv.org/html/2604.18955#A3.T10)\.
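The pair counts above can be checked with a short sketch\. It uses integer IDs as stand\-ins for tweets and zero vectors as stand\-ins for embeddings; the function names are hypothetical, and the 384\-dimensional size is the output dimension of all\-MiniLM\-L6\-v2, so two concatenated embeddings give 768 features\. This is an illustration of the construction, not the authors' code\.

```python
from itertools import permutations, product


def build_training_pairs(few_shot, negatives):
    """Return (tweet_a, tweet_b, label) triples for one user.

    Positive pairs: every ordered pair of distinct few-shot posts (label 1).
    Negative pairs: every few-shot post paired with every negative post (label 0).
    """
    positives = [(a, b, 1) for a, b in permutations(few_shot, 2)]
    negs = [(a, n, 0) for a, n in product(few_shot, negatives)]
    return positives + negs


def pair_features(emb_a, emb_b):
    """Concatenate two tweet embeddings into one feature vector."""
    return list(emb_a) + list(emb_b)


# Stand-in IDs: 20 few-shot posts and 20 negative posts for one user.
few_shot_ids = list(range(20))
negative_ids = list(range(100, 120))

pairs = build_training_pairs(few_shot_ids, negative_ids)
n_pos = sum(label == 1 for _, _, label in pairs)
n_neg = sum(label == 0 for _, _, label in pairs)
print(n_pos, n_neg)

# Two 384-dimensional stand-in embeddings yield a 768-dimensional feature vector.
features = pair_features([0.0] * 384, [0.0] * 384)
print(len(features))
```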

At test time, given a candidate tweet $f$, we formed $k=20$ pairs by individually pairing it with each prior tweet: $(t_1,f),(t_2,f),\dots,(t_k,f)$. The RF predicted labels independently for each pair, resulting in a set of pairwise predictions $P=\{p_1,p_2,\dots,p_k\}$, where each $p_i\in\{0,1\}$. The final classification for the candidate tweet was determined by majority voting:

$$\hat{y}_f=\begin{cases}1 & \text{if}\quad \sum_{i=1}^{k}p_i>\frac{k}{2}\\[6pt] 0 & \text{otherwise}\end{cases}$$

The predicted label $\hat{y}_f$ indicates whether the candidate tweet $f$ is considered authored by the target user ($\hat{y}_f=1$) or by a different user ($\hat{y}_f=0$). Similar to the LLM experiments, we evaluated this model's performance using accuracy and weighted F1-scores, ensuring direct comparability with the results obtained from the LLMs.
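
As an illustration of the voting rule, the following sketch aggregates a list of pairwise 0/1 predictions. The function name `majority_vote` is hypothetical; the only behavior taken from the text is the strict inequality, which means that a tie (exactly $k/2$ positive votes) is resolved as "different user".

```python
def majority_vote(pair_predictions):
    """Return 1 if strictly more than half of the pairwise votes are 1, else 0."""
    k = len(pair_predictions)
    return 1 if sum(pair_predictions) > k / 2 else 0


# With k = 20: 11 positive votes accept the tweet, 10 (a tie) do not.
accepted = majority_vote([1] * 11 + [0] * 9)
tied = majority_vote([1] * 10 + [0] * 10)
print(accepted, tied)
```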

Table 10: Search space for the 3-fold RF hyper-parameter tuning.

| Hyperparameter | Grid values |
|---|---|
| $n_{\mathrm{estimators}}$ | {100, 200} |
| max_depth | {None, 10, 30} |
| min_samples_split | {2, 5} |
| min_samples_leaf | {1, 3} |

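
To give a sense of the size of this search space, the sketch below enumerates the grid with the standard library only. The dictionary keys follow scikit-learn's `RandomForestClassifier` parameter naming, which is an assumption here: the text does not state which library was used.

```python
from itertools import product

# Grid values copied from Table 10; key names are assumed, not confirmed by the source.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 30],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 3],
}

names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 3
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

The grid contains $2\times3\times2\times2=24$ candidate settings, so a 3-fold search fits 72 models per dataset configuration, before any final refit.
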
### Descriptions of Additional Baseline Models

To complement our LLM\-based evaluation, we implemented a diverse set of established authorship verification baselines from the literature\. Below is a summary of each:

- Universal Sentence Encoder (USE) + Cosine Similarity: This semantic similarity approach uses Google's pretrained Universal Sentence Encoder to embed tweets. An author's representation is formed by averaging the embeddings of their prior tweets. Cosine similarity is then computed between this profile and each candidate tweet, and a similarity threshold determines the classification Cer et al. ([2018](https://arxiv.org/html/2604.18955#bib.bib80)).
- TF-IDF + Cosine Similarity: A traditional lexical matching method that encodes tweets using TF-IDF vectors with unigrams, bigrams, and trigrams. The averaged TF-IDF vector of a user's previous tweets forms the author profile, and cosine similarity is used to compare it to test tweets Stamatatos ([2009](https://arxiv.org/html/2604.18955#bib.bib81)).
- GZip Compression Distance (NCD): This compression-based baseline constructs a profile by concatenating all of a user's previous tweets. It computes the Normalized Compression Distance (NCD) between the profile and each test tweet using GZip. A dynamic threshold based on median distances is used for classification Potha and Stamatatos ([2017](https://arxiv.org/html/2604.18955#bib.bib82)).
- Siamese Network with GloVe + LSTM: This neural architecture uses pretrained GloVe embeddings and an LSTM encoder to represent tweets. A Siamese network computes the cosine similarity between a pair of encoded tweets, and the result is passed through a sigmoid output layer to determine authorship likelihood Boenninghoff et al. ([2019](https://arxiv.org/html/2604.18955#bib.bib15)).
- Siamese Network with SBERT Embeddings: Each tweet is encoded using a pretrained SBERT model to capture semantic meaning. The Siamese architecture computes cosine similarity between tweet embeddings and uses a dense layer to classify whether the tweets belong to the same author Reimers and Gurevych ([2019](https://arxiv.org/html/2604.18955#bib.bib44)).
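
To make the compression-based baseline more concrete, here is a minimal sketch of the Normalized Compression Distance using Python's standard `gzip` module. It uses the common definition $\mathrm{NCD}(x,y)=\frac{C(xy)-\min(C(x),C(y))}{\max(C(x),C(y))}$, where $C(\cdot)$ is the compressed size; the example texts are invented, and the median-based thresholding step is not shown because its details are not given in the text.

```python
import gzip


def compressed_size(text: str) -> int:
    """Size in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two strings (smaller = more similar)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Invented example texts: a user "profile" and an unrelated candidate.
profile = (
    "New video lesson on quadratic equations is now live on our channel. "
    "Watch our latest tutorial on trigonometry basics for grade ten students. "
    "Revise chemical bonding with our short animated explainer before your exam. "
    "Practice worksheets for algebra and geometry are available with every lesson."
)
unrelated = (
    "Stuck in traffic again, honestly the worst commute this whole month. "
    "Anyone up for pizza tonight? Thinking pepperoni with extra jalapenos."
)

self_distance = ncd(profile, profile)
cross_distance = ncd(profile, unrelated)
print(round(self_distance, 3), round(cross_distance, 3))
```

Because gzip adds a fixed header and works on byte patterns, distances on very short texts such as single tweets are noisy, which is one plausible reason for comparing each tweet against a concatenated profile rather than against individual tweets.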

### LLMs Knowledge Cut\-off Dates

The knowledge cut-off dates of the evaluated LLMs, used for assessing their generalization performance on “unseen data” (Sections [2.1](https://arxiv.org/html/2604.18955#S2.SS1) and [3.1](https://arxiv.org/html/2604.18955#S3.SS1)), are presented in Table [11](https://arxiv.org/html/2604.18955#A3.T11). As can be observed, all cut-off dates precede January 2024, the start of the period from which we collected the new (“unseen”) data.

Table 11: LLMs and their knowledge cut-off dates.

| Model | Cut-off Date |
|---|---|
| GPT-4 | Oct. 2023 |
| GPT-4o | Sep. 2023 |
| GPT-3.5-Turbo | Aug. 2021 |
| Gemini 1.5 Pro | Nov. 2023 |
| DeepSeek-V3 | Oct. 2023 |
| Llama 3.2 | Dec. 2023 |

### Impact of Sampling Strategies

The values reported in Table[2](https://arxiv.org/html/2604.18955#S3.T2)for the User Effect and Tweet Effect were computed as follows:

- User Effect: For each negative post sampling strategy, we calculated the difference between the maximum and minimum accuracy across the three positive user sampling strategies (Rnd, Rec, Top), and averaged these differences across all negative post sampling strategies:

  $$U_{\text{effect}}=\frac{1}{|T|}\sum_{t\in T}\left(\max_{u\in U}(A_{t,u})-\min_{u\in U}(A_{t,u})\right)$$

- Tweet Effect: For each positive user sampling strategy, we calculated the difference between the maximum and minimum accuracy across the five negative post sampling strategies (Reciprocal, Followees-only, Followers-only, Random, Topic-similar), and averaged these differences across all positive user sampling strategies:

  $$T_{\text{effect}}=\frac{1}{|U|}\sum_{u\in U}\left(\max_{t\in T}(A_{t,u})-\min_{t\in T}(A_{t,u})\right)$$

Where:

- $T$ represents the set of negative post sampling strategies: Reciprocal, Followees-only, Followers-only, Random, Topic-similar.
- $U$ represents the set of positive user sampling strategies: Random (Rnd), Recently active (Rec), Top-active (Top).
- $A_{t,u}$ denotes the accuracy obtained for a specific negative post sampling strategy $t$ and positive user sampling strategy $u$.
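
The two definitions above can be sketched in a few lines of Python. The accuracy values below are invented toy numbers over a reduced grid (two negative post strategies, three user strategies) chosen only to make the arithmetic easy to follow; they are not results from the paper, and the function name `sampling_effects` is hypothetical.

```python
def sampling_effects(acc):
    """Compute (user_effect, tweet_effect) from acc[t][u].

    acc maps each negative post sampling strategy t to a dict that maps
    each positive user sampling strategy u to an accuracy value.
    """
    tweet_strategies = list(acc)
    user_strategies = list(next(iter(acc.values())))

    # User Effect: range over u for each fixed t, averaged over t.
    user_effect = sum(
        max(acc[t].values()) - min(acc[t].values()) for t in tweet_strategies
    ) / len(tweet_strategies)

    # Tweet Effect: range over t for each fixed u, averaged over u.
    tweet_effect = sum(
        max(acc[t][u] for t in tweet_strategies)
        - min(acc[t][u] for t in tweet_strategies)
        for u in user_strategies
    ) / len(user_strategies)

    return user_effect, tweet_effect


# Toy accuracies in percent (hypothetical, for illustration only).
toy_acc = {
    "Random": {"Rnd": 80, "Rec": 84, "Top": 86},
    "Topic-similar": {"Rnd": 70, "Rec": 72, "Top": 78},
}

user_effect, tweet_effect = sampling_effects(toy_acc)
print(user_effect, tweet_effect)
```

For these toy values the per-row ranges are 6 and 8 (User Effect 7.0) and the per-column ranges are 10, 12, and 8 (Tweet Effect 10.0).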

Average performance analysis (Figure [4](https://arxiv.org/html/2604.18955#A3.F4) and Table [12](https://arxiv.org/html/2604.18955#A3.T12)) indicates that GPT-4 consistently leads, delivering high and balanced performance across both the Negative Evaluation Posts and Positive Evaluation Posts (genuine user's tweets) classes. Gemini and DeepSeek represent a robust second tier, achieving strong accuracy but with a clearer bias toward the Positive class. While GPT-3.5-Turbo stands out as the most balanced model, it does so at a noticeably lower overall performance level. Traditional models like Random Forest exhibit substantial Positive-class bias, and other baselines such as BERT and Llama highlight significant performance gaps and limitations. Similarly, classical baseline methods (e.g., USE, TF-IDF, Gzip, and SIAMESE variants) show varying degrees of accuracy and imbalance.

![Refer to caption](https://arxiv.org/html/2604.18955v1/x4.png)Figure 4: Mean class-wise weighted F1-scores for each model. Points above the dashed line ($y=x$) indicate better performance on Positive Evaluation Posts; points below indicate better performance on Negative Evaluation Posts.

Table 12: Mean weighted F1-scores per class and absolute balance gap.

| Model | F1 (Neg.) | F1 (Pos.) | Gap |
|---|---|---|---|
| GPT-4 | 71 | 89 | 18 |
| Gemini 1.5 Pro | 58 | 82 | 24 |
| DeepSeek | 55 | 82 | 27 |
| GPT-3.5-Turbo | 54 | 56 | 2 |
| RF | 48 | 79 | 31 |
| Llama 3.2 | 23 | 76 | 53 |
| BERT | 29 | 44 | 15 |
| USE | 54 | 36 | 18 |
| TF-IDF | 57 | 66 | 9 |
| Gzip | 52 | 68 | 16 |
| SIAMESE + SBERT | 44 | 65 | 21 |
| SIAMESE + GloVe | 33 | 60 | 27 |

Figure [5](https://arxiv.org/html/2604.18955#A3.F5) details the class-specific precision and recall for the authorship verification task. GPT-4 maintains a high precision (0.803) for Class 0, effectively reducing false positives, alongside an excellent recall (0.939) for Class 1, correctly identifying most user-authored tweets. Gemini and DeepSeek show strong balanced recall (approximately 0.87 and 0.86) for Class 1, but lower precision (0.66 and 0.62) for Class 0, reflecting moderate false-positive rates. While GPT-3.5-Turbo achieves high recall (0.930) for Class 0, its precision is notably low (0.391), indicating aggressive labeling with many false positives; inversely, it shows high precision (0.931) but low recall (0.402) for Class 1. Traditional methods such as Random Forest, TF-IDF, Gzip, SIAMESE+SBERT, SIAMESE+GloVe, and USE exhibit varied levels of performance, with notable biases and moderate precision-recall trade-offs. In contrast, BERT and Llama demonstrate significantly limited performance across both precision and recall metrics. Overall, these precision-recall analyses reinforce GPT-4's robustness and superior capability in accurately verifying authorship with balanced performance.

![Refer to caption](https://arxiv.org/html/2604.18955v1/x5.png)Figure 5: Precision and recall comparison across models for the authorship verification task.

Figure [6](https://arxiv.org/html/2604.18955#A3.F6) summarizes the mean evaluation metrics (accuracy, precision, recall, and F1-score) across evaluated models for the authorship verification task. GPT-4 achieves the highest overall performance, with accuracy (85.08%) and F1-score (0.8898), reflecting a robust balance between precision (85.16%) and recall (93.82%). Gemini and DeepSeek exhibit strong and balanced results, with Gemini slightly outperforming DeepSeek in accuracy (76.25%) and F1-score (0.8240). Conversely, GPT-3.5-Turbo demonstrates an imbalanced performance, with exceptional precision (93.17%) but notably low recall (40.11%), indicative of conservative classification behavior. Traditional methods like Random Forest provide moderate and balanced performance (accuracy: 73.25%), whereas baseline methods such as Llama 3.2 (accuracy: 64.41%, high recall but lower precision) and BERT (accuracy: 51.44%) show clear limitations. Additional classical baselines (USE, TF-IDF, Gzip, SIAMESE+SBERT, and SIAMESE+GloVe) also exhibit varied levels of accuracy and balance, underscoring GPT-4's superior performance and robustness in authorship verification.

![Refer to caption](https://arxiv.org/html/2604.18955v1/x6.png)Figure 6: Heatmap illustrating mean evaluation metrics (accuracy, precision, recall, and F1-score) across different models evaluated on the authorship verification task. Higher scores (darker colors) indicate stronger performance.

## Appendix D Social Media Post Generation

### Model Access and Hyperparameters

For tweet generation, we accessed OpenAI’s GPT\-4o using a batch API to reduce cost by submitting prompts in bulk\. In contrast, Gemini, DeepSeek, and Llama \(via Ollama\) were accessed through individual non\-batched calls\. The generation temperature was uniformly set to 0\.7 across all models to promote content diversity while avoiding degenerate or repetitive outputs\.

### Descriptions of Post Generation Baselines

To complement our evaluation of LLMs on the social media post generation task, we implemented a set of traditional baseline models that reflect diverse approaches to content generation\. Below, we provide a brief description of each method\.

- Markov Chain: A classical probabilistic model that generates new tweets by modeling transition probabilities between word sequences. We train a second-order Markov chain using a user's prompt tweets and sample new sentences constrained to tweet length. This method captures frequent local patterns but lacks global coherence or semantic understanding. Shannon ([1948](https://arxiv.org/html/2604.18955#bib.bib83)); Freitas et al. ([2015](https://arxiv.org/html/2604.18955#bib.bib85))
- BART-large: A denoising autoencoder for sequence-to-sequence generation that uses a bidirectional encoder and a left-to-right decoder. Pretrained on large-scale corpora with text corruption tasks, BART is well-suited for conditional generation tasks, including style imitation and text rewriting. Lewis et al. ([2019](https://arxiv.org/html/2604.18955#bib.bib86))
- T5-large: A unified text-to-text transformer that reformulates all NLP tasks, including generation, as text-to-text problems. The large variant is pretrained on the C4 corpus with a multi-task objective and fine-tuned for conditional generation. Raffel et al. ([2020](https://arxiv.org/html/2604.18955#bib.bib87))

#### Qualitative Analysis for Task II (Case Studies)

Table [13](https://arxiv.org/html/2604.18955#A4.T13) compares tweets originally posted by one user to tweets generated by four LLMs (DeepSeek, GPT-4o, Gemini, and Llama). The table highlights two main topics from the user's tweets: appreciation for volunteers and recognition of alumni scholarship achievements. Among the LLMs, DeepSeek and GPT-4o deliver the closest matches. Both accurately reflect the volunteer appreciation theme with similar enthusiasm, although GPT-4o slightly lacks specifics regarding the alumni scholarship. Gemini effectively praises volunteers but becomes too general when mentioning scholarships. Lastly, Llama maintains relevance to Science Olympiad but often overlooks specific details, shifting attention to general team accomplishments and different scholarships. Overall, DeepSeek and GPT-4o provide tweets most faithful to the user's original content, while Gemini and Llama are thematically relevant but less precise.

Table 13: Comparison of one user's tweets and LLM-generated tweets.

Theme A – Volunteer appreciation

| Source | Tweet | Comment |
|---|---|---|
| Real | @MDSciOly Congrats to Russ for being recognized for his incredible Science Olympiad volunteer work! We are so proud! @SOAlumniNetwork | |
| Gen (DeepSeek) | Shoutout to all the incredible coaches and volunteers who make Science Olympiad possible. Your passion fuels the next generation of innovators! #ThankYou | Matches closely; same focus on volunteers with similar enthusiasm. |
| Gen (GPT-4o) | Shoutout to all the incredible volunteers who make Science Olympiad possible! Your dedication inspires future scientists every day. Thank you! #VolunteerAppreciation | Excellent match; closely mirrors topic, appreciation tone, and enthusiasm. |
| Gen (Gemini) | Volunteers are the heart of Science Olympiad! Thank you for your dedication. #volunteerappreciation | Good match on volunteers, but a bit general compared to the original. |
| Gen (Llama) | Meet our incredible teams of 7,800+ students from across 50 states gearing up for the ultimate STEM competition - Science Olympiad! | Stays related to Science Olympiad but misses the volunteer focus. |

Theme B – Alumni success & scholarships

| Source | Tweet | Comment |
|---|---|---|
| Real | This Science Olympiad alum and valedictorian candidate just won the prestigious \$40,000 Edison Scholarship! Tanya is MIT-bound and plans to major in chemical engineering. Way to go! #STEM #scienceolympiad | |
| Gen (DeepSeek) | Meet Tanya, a Science Olympiad alum and valedictorian who just won the \$40K Edison Scholarship! Your future starts here. #STEMSuccess | Good match; includes the same name, scholarship amount, and excitement. |
| Gen (GPT-4o) | Big congrats to our alumni who are making waves in the STEM world! Your achievements showcase the lasting impact of Science Olympiad. Keep shining! #ScienceOlympiadAlumni | Good topical alignment about alumni success, though it omits specific scholarship details. |
| Gen (Llama) | Congratulations to our Division C winners who each won \$1,000 college scholarships from DuPont! Your hard work pays off! | On-topic with scholarships but uses different details and lacks personal specifics. |
| Gen (Gemini) | Science Olympiad alumni, where are you now? Share your success stories! #sciolyalumni | Mentions alumni generally but misses details about the specific scholarship and winner. |

## Appendix E User Study

Table [14](https://arxiv.org/html/2604.18955#A5.T14) summarises the 60 ratings collected for each model. For every system we report three descriptive statistics: the mean score on the five-point *“definitely not me” (1) – “definitely me” (5)* scale, its associated 95% confidence interval (CI), and the *Top-2-box* proportion, i.e., the share of responses falling in the two most positive categories, an intuitive acceptance rate.

Gemini and Llama obtain the strongest reception: almost three-quarters of their tweets are marked *probably me* or *definitely me*, and both achieve a mean rating of 3.95/5. DeepSeek and GPT-4o lag by roughly 0.3 points in mean score and by about 5–8 percentage points in the Top-2-box proportion (the “Positive Bins %” column of Table 14), implying a modest, but not dramatic, loss of perceived author likeness. Because the 95% CIs overlap across all four systems, these differences should be viewed as suggestive rather than conclusive.

To test whether the full five-category rating distributions differ by model, we conducted a Pearson $\chi^2$ test on the complete $4\times 5$ contingency table (four models, five response categories). The result, $\chi^2=7.29$ with 12 degrees of freedom ($p=0.84$), fails to reject the null hypothesis of identical distributions. In practical terms, the available 60 judgments per model do not provide sufficient power to claim a statistically significant winner, even though Gemini and Llama trend higher on both descriptive metrics.
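
The descriptive statistics and the test statistic can be sketched as follows. The rating counts are invented toy data (each row sums to 60) and do not reproduce the study's responses; the normal-approximation interval ($1.96\,s/\sqrt{n}$) is also an assumption, since the text does not state how the CIs were computed. The p-value is omitted because it requires a chi-square distribution function that the standard library does not provide.

```python
import math
from statistics import mean, stdev


def describe(ratings):
    """Return (mean, 95% CI half-width, Top-2-box share) for 1-5 ratings."""
    n = len(ratings)
    half_width = 1.96 * stdev(ratings) / math.sqrt(n)  # normal approximation (assumed)
    top2 = sum(r >= 4 for r in ratings) / n
    return mean(ratings), half_width, top2


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts per rating category (1..5) for four models; NOT the study data.
toy_counts = [
    [5, 5, 10, 20, 20],
    [3, 5, 8, 22, 22],
    [6, 6, 8, 20, 20],
    [2, 4, 10, 24, 20],
]

# Expand the first row of counts into individual ratings.
first_model_ratings = [score for score, c in zip(range(1, 6), toy_counts[0]) for _ in range(c)]
m, hw, top2 = describe(first_model_ratings)
stat, dof = chi_square(toy_counts)
print(round(m, 2), round(hw, 2), round(top2, 3), round(stat, 2), dof)
```

For any $4\times5$ table the degrees of freedom are $(4-1)\times(5-1)=12$, which agrees with the value reported above.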

Table 14: Participant-likeness ratings on a 1–5 scale (higher = more authentic).

| Model | Mean $\pm$ 95% CI | Positive Bins % |
|---|---|---|
| DeepSeek | $3.68\pm 0.34$ | 65.0 |
| Gemini | $\mathbf{3.95\pm 0.32}$ | 73.3 |
| GPT-4o | $3.67\pm 0.36$ | 68.3 |
| Llama | $\mathbf{3.95\pm 0.29}$ | 73.3 |

Table [15](https://arxiv.org/html/2604.18955#A5.T15) presents detailed demographic characteristics and additional information about the user study participants.

Table 15: Demographics of User Study Participants.

| Characteristic | Distribution |
|---|---|
| Age Range | [22–26): 63.64%; [26–30]: 36.36% |
| Gender | Male: 91.67%; Female: 8.33% |
| Tweet Activity | Daily: 66.67%; A few times/week: 25.00%; Weekly: 8.33% |
| Main Tweet Topics | Entertainment: 91.67%; Technology/AI: 75.00%; Sports: 58.33%; Science: 50.00%; Politics: 41.67%; Personal updates: 33.33%; E-commerce: 33.33%; Religion: 8.33% |

### Qualitative Analysis for User Study (Case Studies)

Table [16](https://arxiv.org/html/2604.18955#A5.T16) compares original tweets from one participant with tweets generated by four different LLMs (GPT-4o, DeepSeek, Gemini, and Llama), focusing on two specific topics: fintech skepticism and car purchasing decisions. Among these models, GPT-4o best captures the user's style, accurately reflecting the original tweets' topics and conversational tone, though it sometimes sounds slightly more formal. DeepSeek does very well in fintech discussions, matching the user's viewpoint closely, but struggles with other topics like cars, where it drifts into unrelated general tech content. Gemini often misses the user's specific interests entirely, defaulting instead to generic technology-related statements. Lastly, Llama picks up on contextual details (e.g., Canadian outdoors) but frequently misses the user's main points about fintech and specific vehicle choices, shifting instead toward broader, less relevant subjects.

Table 16: Comparison of one participant's original tweets and LLM-generated tweets.

An overview of the user: Male, 26 years old, Software Developer (USA). Tweets weekly, mainly about Technology/AI and Entertainment.

Theme A – Fintech skepticism

| Source | Tweet | Comment | User's survey response |
|---|---|---|---|
| Real sample tweet | Are there any cons to @Wealthsimple? So far I don't see why anyone would use a traditional bank. | | |
| Gen (DeepSeek) | @Wealthsimple Your app is a game-changer. Traditional banks should be worried. | Closely matches topic and tone; clearly shares the user's viewpoint. | Definitely me |
| Gen (GPT-4o) | Let's talk smart investments. With the rise of fintech platforms, do you think traditional banks will become obsolete? #Fintech #Investing | Keeps the casual question-and-answer style and fintech focus, but slightly more formal. | Definitely me |
| Gen (Gemini) | Is it just me, or is AI getting smarter every day? Kind of exciting, kind of terrifying. #AI | Completely off-topic; shifts to general AI instead of finance. | Definitely me |
| Gen (Llama) | Thinking of switching to a Linux distribution for my next project – any recommendations? #Linux | Misses the financial topic completely; instead emphasizes a general tech persona. | Definitely not me |

Theme B – Car decisions

| Source | Tweet | Comment | User's survey response |
|---|---|---|---|
| Real | First time Subaru is stuck. Hmm should I get a Lexus GX or a Land Cruiser? | | |
| Gen (GPT-4o) | Thinking about switching from Subaru to Lexus GX. Need something more rugged for those off-road adventures. Any thoughts? #CarTalk | Excellent match; closely follows user's original question and specific vehicle brands. | Probably me |
| Gen (Llama) | Planning a road trip to explore the Canadian wilderness – any recommendations for off-road gear? #offroadadventures #Canada | Captures outdoor theme but doesn't reflect user's original car-brand comparison. | Unsure |
| Gen (Gemini) | Electric cars are the future. Just sayin'. #EV #electriccar | Briefly touches on cars, but too general; misses user's interest in specific brands and off-road vehicles. | Probably me |
| Gen (DeepSeek) | Just pre-ordered the new Framework laptop. Modular design is the future. #RightToRepair | Completely unrelated topic; shifts to a generic tech subject rather than cars. | Probably not me |

## Appendix F User Attribute Inference

### Misclassification Patterns

Table[17](https://arxiv.org/html/2604.18955#A6.T17)summarizes the most frequent true–predicted label confusions in Task III, along with their occurrence counts\. Across both interest and occupation inference, most errors occur between semantically adjacent categories with overlapping topical or stylistic signals, rather than between unrelated domains\. This suggests that misclassifications largely reflect boundary ambiguity in social media language rather than systematic model failure\.

Table 17: Most Frequent Confusion Pairs in User Attribute Inference (Task III).

| Category Type | True Category | Predicted Category | Count |
|---|---|---|---|
| Interest (IAB) | Entertainment | Pop Culture | 45 |
| Interest (IAB) | Pop Culture | Entertainment | 18 |
| Occupation ($L_1$) | Education, Training, and Library | Arts, Design, Entertainment, Sports, and Media | 33 |
| Occupation ($L_1$) | Management | Arts, Design, Entertainment, Sports, and Media | 22 |
| Occupation ($L_2$) | Entertainers and Performers | Media and Communication Workers | 43 |
| Occupation ($L_2$) | Media and Communication Workers | Entertainers and Performers | 41 |

### Model Access and Hyperparameters

For the user attribute inference task, we set the generation temperature to 0 for all models to ensure consistent outputs and reproducibility. GPT-4o was accessed via OpenAI's `ChatCompletion` endpoint, Gemini through Google's `GenerativeModel` API, DeepSeek using its REST interface, and Llama via local inference through Ollama. All models were queried individually in non-batch mode, and responses were parsed to extract structured JSON-formatted predictions.
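
One way such response parsing might look is sketched below. The helper name `extract_prediction`, the example response, and the JSON field names (`interest`, `occupation_l1`) are all hypothetical; the actual prompt and output schema are not described in this section.

```python
import json
import re


def extract_prediction(response_text):
    """Parse the outermost brace-delimited span of a response as JSON.

    Returns the decoded object, or None if no valid JSON object is found.
    """
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


# Hypothetical model response with extra text around the JSON payload.
raw = (
    "Here is the result:\n"
    '{"interest": "Technology & Computing", '
    '"occupation_l1": "Computer and Mathematical Occupations"}\n'
    "Let me know if you need anything else."
)

prediction = extract_prediction(raw)
print(prediction)
```

Returning `None` instead of raising lets a caller count unparseable responses separately from wrong predictions, which is one reasonable design choice among several.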

Table 18: Interest Categories Based on IAB Content Taxonomy v3.1.

| ID | Interest Category |
|---|---|
| I1 | Attractions |
| I2 | Automotive |
| I3 | Books and Literature |
| I4 | Business and Finance |
| I5 | Careers |
| I6 | Communication |
| I7 | Crime |
| I8 | Disasters |
| I9 | Education |
| I10 | Entertainment |
| I11 | Fine Art |
| I12 | Food & Drink |
| I13 | Hobbies & Interests |
| I14 | Home & Garden |
| I15 | Law |
| I16 | Medical Health |
| I17 | Pets |
| I18 | Politics |
| I19 | Pop Culture |
| I20 | Science |
| I21 | Sports |
| I22 | Style & Fashion |
| I23 | Technology & Computing |
| I24 | Travel |
| I25 | Video Gaming |

Table 19: Occupational Categories (Level 1) Based on SOC 2018.

| ID | Occupation Category (Level 1) |
|---|---|
| L1-1 | Accommodation and Food Services |
| L1-2 | Arts, Design, Entertainment, Sports, and Media Occupations |
| L1-3 | Community and Social Service Occupations |
| L1-4 | Computer and Mathematical Occupations |
| L1-5 | Education, Training, and Library Occupations |
| L1-6 | Healthcare Practitioners and Technical Occupations |
| L1-7 | Legal Occupations |
| L1-8 | Life, Physical, and Social Science Occupations |
| L1-9 | Management Occupations |
| L1-10 | Management, Business, Science, and Arts Occupations |
| L1-11 | Management, Business, and Financial Occupations |
| L1-12 | Office and Administrative Support Occupations |
| L1-13 | Production Occupations |
| L1-14 | Professional and Related Occupations |
| L1-15 | Professional, Scientific, and Technical Services |
| L1-16 | Protective Service Occupations |
| L1-17 | Sales and Office Occupations |
| L1-18 | Sales and Related Occupations |

Table 20: Occupational Categories (Level 2) Based on SOC 2018.

| ID | Occupation Category (Level 2) |
|---|---|
| L2-1 | Accommodation |
| L2-2 | Advertising, Marketing, Promotions, Public Relations, and Sales Managers |
| L2-3 | Arts and Design Workers |
| L2-4 | Assemblers and Fabricators |
| L2-5 | Business and Financial Operations Occupations |
| L2-6 | Computer Occupations |
| L2-7 | Computer and Information Systems Managers |
| L2-8 | Computer and Mathematical Occupations |
| L2-9 | Counselors, Social Workers, and Other Community and Social Service Specialists |
| L2-10 | Educational Instruction and Library Occupations |
| L2-11 | Entertainers and Performers, Sports and Related Workers |
| L2-12 | Financial Clerks |
| L2-13 | Firefighting and Prevention Workers |
| L2-14 | Health Technologists and Technicians |
| L2-15 | Information and Record Clerks |
| L2-16 | Law Enforcement Workers |
| L2-17 | Lawyers, Judges, and Related Workers |
| L2-18 | Legal Occupations |
| L2-19 | Librarians, Curators, and Archivists |
| L2-20 | Life Scientists |
| L2-21 | Life, Physical, and Social Science Occupations |
| L2-22 | Management Occupations |
| L2-23 | Mathematical Science Occupations |
| L2-24 | Media and Communication Occupations |
| L2-25 | Media and Communication Workers |
| L2-26 | Office and Administrative Support Occupations |
| L2-27 | Other Management Occupations |
| L2-28 | Other Protective Service Workers |
| L2-29 | Physical Scientists |
| L2-30 | Postsecondary Teachers |
| L2-31 | Professional, Scientific, and Technical Services |
| L2-32 | Retail Sales Workers |
| L2-33 | Sales Representatives, Services |
| L2-34 | Sales Representatives, Wholesale and Manufacturing |
| L2-35 | Scientific Research and Development Services |
| L2-36 | Social Science Occupations |
| L2-37 | Social Scientists and Related Workers |
| L2-38 | Top Executives |

### Descriptions of User Attribute Inference Baselines

To complement our evaluation of LLMs on the user attribute inference task, we implemented a set of traditional baseline models\. Below, we briefly describe each method along with their references\.

- Preoţiuc-Pietro et al. (2015): A probabilistic classifier that employs Word2Vec embeddings to represent tweets and spectral clustering to group semantically related words into clusters. It handles class imbalance through random oversampling, followed by classification using a Gaussian Process with an Automatic Relevance Determination (ARD) kernel.
- Lewis et al. ([2019](https://arxiv.org/html/2604.18955#bib.bib86)): A fine-tuning approach leveraging a pre-trained BART model (`bart-large-mnli`) on a small subset of labeled user tweets (20%). It employs random oversampling to balance classes and evaluates performance on an independent test set.
- Michelson and Macskassy ([2010](https://arxiv.org/html/2604.18955#bib.bib89)): This method extracts named entities from tweets and maps them to DBpedia categories to build user interest profiles. It infers user attributes based on the frequency of categories associated with these entities, capturing topical interests rather than textual semantics.
- Pennacchiotti and Popescu ([2011](https://arxiv.org/html/2604.18955#bib.bib90)): A gradient boosting classifier that utilizes textual features extracted via TF-IDF vectorization of user tweets. The classifier addresses class imbalance using sample weighting to optimize predictive performance.

### Qualitative Analysis for Task III \(Case Studies\)

Table[21](https://arxiv.org/html/2604.18955#A6.T21)shows occupation predictions by four LLMs \(Gemini, GPT\-4o, DeepSeek, and Llama\) based solely on tweets from a user involved in emergency management\. Among the models, Gemini correctly identifies the occupation as “Other Protective Service Workers," accurately capturing specialized professional cues such as “Certified Emergency Manager," mentions of FEMA, and emergency response drills\. In contrast, GPT\-4o incorrectly emphasizes general community support themes and hashtags, leading to a prediction in social and community services\. DeepSeek misinterprets hazard\-related language as indicating frontline firefighting roles, failing to distinguish between emergency response coordination and direct hazard management\. Finally, Llama provides an overly broad classification \(“Public Service"\), overlooking precise professional indicators and domain\-specific terminology found clearly within the user’s tweets\.

Table 21: Occupation (L2) inference from tweets.

User Bio (not shown to models): Disaster technologist, inclement weather enthusiast, tender-hearted public servant. Just trying to make the world a safer place. Views expressed are my own.

Sample Tweets:

- “I am an *internationally Certified Emergency Manager* with a Bachelor of Science in Emergency Management.”
- “I'm wrapping up the last bit of work-fun for the week by reviewing the new *@FEMA ICS/NIMS courses*.”
- “Taught an *Incident Command System (ICS) course* with the brilliant @IDICworld today!”
- “Just wrapping-up a fun day helping test our statewide response capability to a *complex coordinated cyber attack*.”
- “I can not say enough good things about the *Emergency Management Accreditation Program*! Urban or rural, private or public …”
- “Those are some pretty serious polygons! Stay safe friends! #KSWX”
- “#Preach! Everyone is entitled to #SelfCare in any form, as long as it isn't hurting anyone else.”

Ground-Truth Occupation (SOC 2018): Other Protective Service Workers (33-9099)

Model Predictions and Diagnostic Analysis:

| Model | Prediction | Comment |
|---|---|---|
| Gemini | *Other Protective Service Workers* | Correct. The model accurately identified clear evidence, such as “Certified Emergency Manager,” “ICS training,” mentions of “FEMA,” and disaster-response drills, matching closely with the protective-service occupation. |
| GPT-4o | *Counselors, Social Workers, and Other Community & Social Service Specialists* | Incorrect. The model was influenced heavily by general expressions of well-being and hashtags like #SelfCare, overlooking specific professional terms related to emergency management. |
| DeepSeek | *Firefighting and Prevention Workers* | Incorrect. The model overly focused on tweets about hazards and severe weather (e.g., “polygons” in weather warnings), mistakenly interpreting coordination and training roles as frontline firefighting. |
| Llama | *Public Service* | Incorrect. Too broad and vague, this prediction missed the specific professional cues (e.g., certifications and specialized acronyms such as “ICS,” “NIMS,” and “CEM”) clearly indicating emergency management work. |

### Interest and Occupation Category Taxonomies

Tables[18](https://arxiv.org/html/2604.18955#A6.T18),[19](https://arxiv.org/html/2604.18955#A6.T19), and[20](https://arxiv.org/html/2604.18955#A6.T20)list the interest categories and the Level 1 and Level 2 occupation categories, respectively, as defined by the IAB Content Taxonomy and the 2018 SOC classification\. These tables document the label space used for user attribute inference and support reproducibility of our experiments\.

## Appendix G Related Work

Research in social media analytics has been conducted using conventional machine learning and statistical methods Injadat et al. ([2016](https://arxiv.org/html/2604.18955#bib.bib75)). The introduction of transformer-based LLMs, especially versatile ones such as Gemini and GPT, has marked a significant advancement in social media analytics. Next, we briefly review recent studies that utilize LLMs in the three tasks of interest.

Recent authorship attribution and verification studies increasingly use LLMs to capture distinctive writing styles from text alone Huertas\-Tato et al\. \([2024](https://arxiv.org/html/2604.18955#bib.bib66)\)\. Fine\-tuned transformer encoders \(e\.g\., BERT, RoBERTa\) applied to authorship tasks have achieved state\-of\-the\-art accuracy, surpassing traditional stylometric approaches that rely on handcrafted features Hu et al\. \([2024b](https://arxiv.org/html/2604.18955#bib.bib63)\)\. More recent works explore both fine\-tuning and prompting strategies with advanced LLMs\. For example, Huang et al\. \([2024](https://arxiv.org/html/2604.18955#bib.bib30)\) demonstrated that GPT\-based LLMs can accurately verify authorship \(and even attribute texts to the correct author among many candidates\) in a zero\-shot setting without task\-specific training, establishing new performance benchmarks\. Other researchers have proposed prompt\-based techniques to harness LLMs’ knowledge; for instance, the “PromptAV” method uses step\-by\-step stylometric cues to improve GPT\-3\.5’s verification accuracy and explainability, and a linguistically informed prompting approach similarly guides GPT\-3\.5/4 models to strong authorship verification results even without fine\-tuning Hu et al\. \([2024b](https://arxiv.org/html/2604.18955#bib.bib63)\)\. However, these LLM\-driven approaches typically consider only textual content, omitting valuable contextual signals such as user profile bios or social network features\. In the social media domain, ignoring such metadata can be limiting: posts are short and rife with slang, which often makes it difficult to identify the author from text alone Alsanoosy et al\. \([2024](https://arxiv.org/html/2604.18955#bib.bib67)\)\. In contrast to prior studies, our work integrates rich contextual metadata \(e\.g\., profile descriptions and network\-derived features\) into the authorship verification pipeline\. We introduce systematic and robust user/post sampling strategies to construct a diverse evaluation set, and we mitigate potential data leakage biases by using tweet content posted after the LLMs’ knowledge cut\-off\.
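The leakage mitigation mentioned above amounts to a temporal filter on the evaluation tweets. The following is a minimal sketch, not the authors' pipeline: the `created_at` field name, the record layout, and the exact cutoff date are assumptions chosen for illustration (the abstract only states that newly collected tweets from January 2024 onward are used).

```python
from datetime import datetime, timezone

# Assumed cutoff for illustration; the paper's abstract mentions
# tweets collected from January 2024 onward.
CUTOFF = datetime(2024, 1, 1, tzinfo=timezone.utc)

def posted_after_cutoff(tweets, cutoff=CUTOFF):
    """Keep only tweets whose timestamp is on or after the cutoff.

    Each tweet is a dict with a hypothetical ISO-8601 'created_at' field.
    """
    return [
        t for t in tweets
        if datetime.fromisoformat(t["created_at"]) >= cutoff
    ]

# Toy records, invented purely to exercise the filter.
tweets = [
    {"id": 1, "created_at": "2023-11-30T12:00:00+00:00", "text": "old post"},
    {"id": 2, "created_at": "2024-02-15T08:30:00+00:00", "text": "new post"},
]
fresh = posted_after_cutoff(tweets)
```

In practice the cutoff would need to be set per model, since different LLMs have different knowledge cut-off dates.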

A growing body of work uses LLMs to generate social media posts, yet each study tackles a very specific goal\. RePALM fine‑tuned GPT‑3\.5 with a reinforcement‑learning reward that predicts likes and retweets, so it generates quote‑tweets optimized purely for popularity Yu et al\. \([2024](https://arxiv.org/html/2604.18955#bib.bib32)\)\. Pillai et al\. prompted GPT‑4 to rewrite news headlines into tweets in three fixed persona styles \(formal, casual, factual\) to boost engagement Pillai et al\. \([2025](https://arxiv.org/html/2604.18955#bib.bib38)\)\. Qiu et al\. first predicted whether a user will retweet, quote, or reply to a trending post and then let GPT‑4 craft the corresponding response, but the model still handled one interaction type at a time Qiu et al\. \([2025](https://arxiv.org/html/2604.18955#bib.bib69)\)\. Across these efforts, the LLM sees little more than the source post \(plus an optional style tag\); richer cues such as the author’s bio, follower network, or recent tweets are ignored, even though adding social signals during pre‑training is known to improve tweet representations Zhang et al\. \([2023b](https://arxiv.org/html/2604.18955#bib.bib70)\); Zhao et al\. \([2025](https://arxiv.org/html/2604.18955#bib.bib37)\)\. Our study fills this gap by conditioning generation on a compact user‑context block \(bio, follower/followee counts, and representative past tweets\) and by rating the outputs on four complementary dimensions: \(i\) semantic fidelity \(how closely each tweet’s meaning matches the user’s real posts\), \(ii\) output diversity \(coverage of topics and phrasings\), \(iii\) stylistic congruence \(faithfulness to the user’s voice or brand tone\), and \(iv\) perceived authenticity \(how natural and human‑like the tweets sound\)\. More importantly, we assess LLMs’ post\-generation capability by asking real users to rate tweets that the models create from their own timelines\.
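To make the idea of a compact user-context block concrete, here is a minimal sketch of how such a block might be assembled into prompt text. The function name `build_user_context`, the field labels, and the three-tweet limit are hypothetical; the paper's actual prompt format is not given in this section.

```python
def build_user_context(bio, followers, followees, past_tweets, max_tweets=3):
    """Assemble a plain-text context block from profile fields.

    All labels and the layout are illustrative assumptions, not the
    authors' prompt template.
    """
    lines = [
        f"Bio: {bio}",
        f"Followers: {followers} | Following: {followees}",
        "Representative tweets:",
    ]
    # Cap the number of example tweets to keep the block compact.
    lines += [f"- {tweet}" for tweet in past_tweets[:max_tweets]]
    return "\n".join(lines)

# Invented example profile, used only to show the output shape.
context = build_user_context(
    bio="Coffee, code, and cycling.",
    followers=120,
    followees=80,
    past_tweets=["Morning ride done.", "Shipping a fix today.",
                 "New beans arrived!", "Weekend plans: none."],
)
```

The resulting string could be prepended to a generation instruction, so that the model is conditioned on profile and history rather than on a source post alone.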

Recent LLM\-based approaches to social media user profiling typically target a single user attribute at a time, for example classifying only a person’s occupation or their political leaning Liu et al\. \([2024](https://arxiv.org/html/2604.18955#bib.bib62)\); Wen et al\. \([2023](https://arxiv.org/html/2604.18955#bib.bib35)\)\. To improve predictive power, many studies leverage auxiliary user data beyond the social media posts themselves\. It is common to incorporate profile descriptions, full timelines, or social network cues along with the post text Hong et al\. \([2021](https://arxiv.org/html/2604.18955#bib.bib61)\); Wen et al\. \([2023](https://arxiv.org/html/2604.18955#bib.bib35)\)\. For instance, users often self\-report their job roles or hobbies in their bios Hong et al\. \([2021](https://arxiv.org/html/2604.18955#bib.bib61)\), and some profiling methods feed such metadata \(and even friendship information\) into models alongside tweet content Wen et al\. \([2023](https://arxiv.org/html/2604.18955#bib.bib35)\)\. Moreover, prior work seldom applies standard taxonomies for labeling user attributes\. Instead, researchers usually define task\-specific or coarse\-grained categories, e\.g\., grouping occupations into a few broad classes Liu et al\. \([2024](https://arxiv.org/html/2604.18955#bib.bib62)\) or using ad\-hoc sets of interest topics, rather than mapping to established schemas\. In contrast, our approach infers both occupation and personal\-interest profiles simultaneously using only each user’s tweet content, without relying on any self\-description or network features\. We further constrain and explain the model’s outputs by grounding them in official taxonomies \(the SOC for occupations and the IAB Tech Lab content taxonomy for interests\), which enables more standardized, interpretable predictions than previous methods\.
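Grounding predictions in a fixed taxonomy can be sketched as a validation step that maps a model's free-text label onto a known category code and rejects anything outside the schema. The snippet below is an illustrative assumption, not the authors' method: `ground_occupation` is a hypothetical helper, and the dictionary holds only a small subset of 2018 SOC major groups.

```python
# Small illustrative subset of 2018 SOC major groups (not the full taxonomy).
SOC_SUBSET = {
    "15-0000": "Computer and Mathematical Occupations",
    "27-0000": "Arts, Design, Entertainment, Sports, and Media Occupations",
    "29-0000": "Healthcare Practitioners and Technical Occupations",
}

def ground_occupation(raw_label, taxonomy=SOC_SUBSET):
    """Map a free-text model output to a taxonomy code, or None if unmatched.

    Matching is exact after trimming and lower-casing; a real system
    would likely need fuzzier matching (an assumption on our part).
    """
    wanted = raw_label.strip().lower()
    for code, title in taxonomy.items():
        if wanted in (code, title.lower()):
            return code
    return None

matched = ground_occupation("  computer and mathematical occupations ")
unmatched = ground_occupation("influencer")
```

Returning `None` for out-of-schema labels makes it easy to count how often a model drifts outside the taxonomy, which is one way standardized labels aid interpretability.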