Curation and Extraction of Drug-Related Entities from Reddit Platform
Summary
Introduces ReDose, a dataset of 6,435 Reddit posts annotated for drug, dose, and effect entities, and benchmarks various models including BiomedBERT, Llama-3 70B, and GPT-4 for extraction.
Cached at: 05/27/26, 09:06 AM
# Curation and Extraction of Drug\-Related Entities from Reddit Platform

Source: [https://arxiv.org/html/2605.26445](https://arxiv.org/html/2605.26445)

Zihan Xu: Population Health Sciences, Weill Cornell Medicine, New York City, USA; School of Computing and Information Systems, University of Melbourne, Melbourne, Australia

Yishu Wei: Population Health Sciences, Weill Cornell Medicine, New York City, USA

Michael Chary\*: Emergency Medicine, Weill Cornell Medicine, New York City, USA

Yifan Peng\*: Population Health Sciences, Weill Cornell Medicine, New York City, USA

\*Corresponding author\(s\)\. Email\(s\): mic9189@med\.cornell\.edu, yip4002@med\.cornell\.edu

###### Abstract

Physicians learn primarily about illicit drugs from clinical overdose cases, limiting their understanding of real\-world usage\. Meanwhile, drug users share first\-hand experiences online, offering insights into dosage and effects of drugs\. To bridge this gap, we introduce ReDose \(REddit Drug DOSe and Effect\), a dataset of 6,435 Reddit posts on substance use\. A board\-certified toxicologist primarily annotated both the training and test sets, while two medical science students contributed to the test set, labeling DRUG, DOSE, and EFFECT entities\. We benchmarked 6,267 annotations using BERT\-based, large language model \(LLM\)\-based, and Retrieval\-Augmented Generation \(RAG\) models\. BiomedBERT achieved an F1\-score of 0\.843 for DRUG, while Llama\-3 70B outperformed GPT\-4 \(F1 = 0\.79 vs\. 0\.72\)\. EFFECT extraction remains challenging, with GPT\-4 achieving a recall of 0\.41\. ReDose captures patient\-curated narratives to advance medical data extraction from social media\.
###### keywords:

Natural Language Processing, Named Entity Recognition, Drug Abuse, Large Language Models

## 1 Introduction

The epidemiology of substance use has transformed from single\-substance use to polysubstance use, and the inventory of possible drugs has expanded from a handful to a dizzying pantheon of novel psychoactive substances \(NPS\)\. NPS include novel synthetic opioids, hallucinogenic stimulants, and designer benzodiazepines\. According to the United Nations Office on Drugs and Crime, the number of known NPS increased from 251 in 2012 to 780 by 2016\[[29](https://arxiv.org/html/2605.26445#bib.bib23)\]\. These substances emerge and disappear too rapidly for federal surveys to track or routine toxicology screens to detect, necessitating the development of improved detection methods\.

People frequently discuss substance use on social media\[[13](https://arxiv.org/html/2605.26445#bib.bib19),[14](https://arxiv.org/html/2605.26445#bib.bib6),[27](https://arxiv.org/html/2605.26445#bib.bib20)\], where a large proportion of the discussions concerns NPS use\[[6](https://arxiv.org/html/2605.26445#bib.bib22),[31](https://arxiv.org/html/2605.26445#bib.bib21)\]\. Online discussions on Twitter/X about opioid use have been shown to predict real\-world use over the next 30 days\[[7](https://arxiv.org/html/2605.26445#bib.bib26)\]\. In addition to previous work on dose\-response information extraction from online platforms such as YouTube comments\[[8](https://arxiv.org/html/2605.26445#bib.bib25)\] and online bulletin boards\[[1](https://arxiv.org/html/2605.26445#bib.bib24)\], Reddit is also frequently cited as a rich source of information on various medical topics\[[28](https://arxiv.org/html/2605.26445#bib.bib7),[14](https://arxiv.org/html/2605.26445#bib.bib6)\]\. Previous studies have shown that using social media text is a valid method for tracking how people use specific substances\.
However, less research has focused on identifying which new substances are emerging\. This gap creates an opportunity to determine whether analyzing online commentaries can help identify problematic new substances before they cause severe public health issues\. Natural Language Processing \(NLP\) techniques are well\-suited to processing large volumes of unstructured text\. The current key challenge is the lack of standardized, curated datasets for benchmarking extraction methods\. Existing Named Entity Recognition \(NER\) datasets from social media are annotated for only a few substance entities, and none of them covers effects or doses\[[16](https://arxiv.org/html/2605.26445#bib.bib9),[14](https://arxiv.org/html/2605.26445#bib.bib6),[28](https://arxiv.org/html/2605.26445#bib.bib7),[24](https://arxiv.org/html/2605.26445#bib.bib8),[21](https://arxiv.org/html/2605.26445#bib.bib17),[12](https://arxiv.org/html/2605.26445#bib.bib15)\]\. Open\-source clinical NLP datasets are annotated with drug dosages, but they are primarily based on structured physician narratives\[[16](https://arxiv.org/html/2605.26445#bib.bib9)\]\. In contrast, online commentaries often avoid strict clinical terminology in order to evade sensitive\-content checks on online platforms\. This poses a significant challenge for developing models that transfer well between physician notes and textual data from social media\.

To address this barrier, we introduce ReDose \(REddit Drug DOSe and Effect\), a dataset of 6,435 unique documents collected from 7 drug\-related subreddits\. Each document within ReDose has been annotated with three entities: the drugs mentioned, their reported doses, and the reported effects\. We chose these three entities because establishing a dose\-effect relationship is a cornerstone of pharmacology\. For many substances described online, there is no other feasible data source for this purpose \(Table[1](https://arxiv.org/html/2605.26445#S1.T1)\)\.
Each entry in ReDose includes an unprocessed document, its annotated version, and a timestamp\. All identifiable and protected health information has been removed\. Our goal with ReDose is to enhance the ability of NLP models to extract clinically relevant information about emerging substances, thereby informing clinical practice and public health guidelines\. To our knowledge, ReDose is the first such dataset collected from online platforms with such detailed annotations, including three attributes\.

Table 1: Summary of Datasets in Related Studies

We present benchmark results using BERT\-based models, one\-shot prompting Large Language Models \(LLMs\), and Retrieval\-Augmented Generation \(RAG\)\-based LLMs\. In traditional one\-shot or few\-shot prompting, the same example is used for all inputs, which limits the impact of examples because their semantics may differ significantly from the input\. To address this limitation, we developed a retrieval\-based method to extract the most similar examples from the training dataset and append them to the prompt\. This approach significantly improved the recall rate for DRUG extraction\. In comparing BERT with LLMs, we found that while LLMs may be easier to implement, their performance does not yet match that of the fine\-tuned BERT\. The performance difference is most significant in the metrics on the EFFECT entity\.

In summary, this study provides several key contributions\. \(1\) We introduce ReDose, a comprehensive dataset of online commentaries about drug use\. This dataset comprises 6,435 documents and 6,267 DRUG, DOSE, or EFFECT entities, with a high inter\-annotator agreement score of 0\.75\. \(2\) We provide benchmark results for a fair comparison between BERT and LLMs\. \(3\) We conduct a deep analysis of the differences across these models, exploring how the inherent large knowledge databases in LLMs may yield performance comparable to the supervised training results of BERT\-based models\.
This analysis helps clarify the strengths and limitations of each approach in handling complex medical NER tasks\.

## 2 Related Work

### 2\.1 Biomedical Named Entity Recognition \(NER\) Dataset

Many datasets have been designed for medication\-related entity recognition\. We collect the most relevant datasets and present their annotated attributes in Table[1](https://arxiv.org/html/2605.26445#S1.T1)\. We highlight some major limitations of the existing datasets in the following discussion\.

First, some datasets extend beyond drugs to general medication\. While most drugs have the potential to be abused, studies have shown that certain drugs, such as depressants, opioids or morphine derivatives, and nerve stimulants, appear to be more addictive than others\[[26](https://arxiv.org/html/2605.26445#bib.bib5)\]\. ReDose is a more focused dataset compared to a general medication\-based dataset and is better suited to research focused on substance abuse\. In addition, since adverse events are prevalent in ReDose due to frequent overdosing, it can also serve as a complementary dataset in research related to adverse drug events\.

Most existing datasets extracted from Reddit are not open\-sourced, which limits their reproducibility and reuse\. Meanwhile, widely used open datasets, such as n2c2, mainly rely on reports written by medical professionals that use standard clinical terminologies\. By comparison, ReDose introduces new terminology used by broader populations, helping physicians to familiarize themselves with the synonyms of the drugs\.

Moreover, some datasets only investigate a single substance\. For example, in the work by Graves et al\.\[[14](https://arxiv.org/html/2605.26445#bib.bib6)\], Reddit posts from the /r/suboxone subreddit were mined to study user discussions around a specific drug \(Suboxone\), with a focus on its symptoms and usage patterns\.
While insightful, that approach was narrowly scoped to one substance and lacks coverage of the broader landscape of multiple substances across diverse user groups\. By contrast, ReDose expanded the source to 7 related subreddits spanning multiple drugs\. This allows medical researchers to gain a wider perspective on commonly abused drugs\.

With regard to NLP techniques and substance abuse, Spadaro et al\.\[[28](https://arxiv.org/html/2605.26445#bib.bib7)\] investigated discussions of precipitated opioid withdrawal \(POW\) in the context of fentanyl and buprenorphine induction by analyzing 267,136 posts from seven opioid\-related subreddits between 2012 and 2021\. Using a combination of keyword searches and NLP filtering, they identified and thematically analyzed several hundred posts specifically referencing POW and microdosing \(Bernese method\)\. While their approach yielded valuable insights into community experiences, it was limited by its reliance on keyword\-based retrieval, which may have excluded relevant discussions in other communities, potentially introducing selection bias and constraining the generalizability of their findings\.

Henry et al\.\[[16](https://arxiv.org/html/2605.26445#bib.bib9)\] focused on the 2018 National NLP Clinical Challenges shared task, which aimed to extract Adverse Drug Events \(ADEs\) from clinical records\. The task evaluated three main areas: concept extraction, relation classification, and end\-to\-end systems\. The study employed deep learning\-based methods, specifically BiLSTM\-CRF models, and achieved high performance across various areas\. However, BiLSTM\-CRF models faced significant challenges in identifying ADEs and reason concepts because they require inference across multiple sentences or paragraphs\. Symptoms or reactions may be implied rather than explicitly stated, causing difficulty for models that excel at local sequence labeling but struggle with long\-range dependencies\.
In a recent dataset, ‘Reddit\-Impacts’, Ge et al\.\[[12](https://arxiv.org/html/2605.26445#bib.bib15)\] analyzed clinical and social impacts of Substance Use Disorders \(SUDs\) using data from Reddit\. They introduced the Reddit\-Impacts dataset, derived from posts in fourteen opioid\-related subreddits, aiming to capture the clinical and social impacts of substance use as reported by individuals discussing their personal experiences\. The researchers adopted NLP techniques, including BERT, RoBERTa, DANN, and GPT\-3\.5, to automatically identify and classify these impacts\. Despite the dataset’s value in highlighting real\-world impacts of SUDs, its limitations included the sparsity of annotated impacts, which instead relied on the LLM’s judgment\. This could lead to unreliable annotations, as language models struggle to accurately extract impacts from SUD\-related text that contains slang\. Potential selection bias may arise from focusing on specific subreddits and failing to accurately represent the broader population\.

Compared with other studies, our study features a higher standard of annotation, including the involvement of a medical toxicology specialist\. Additionally, to ensure accuracy, additional annotators were brought in to annotate the documents in the validation dataset\. Thus, we believe ReDose covers a broader range of drugs with professional annotations, which makes it a stronger candidate for medical research\.

### 2\.2 Models on medical NER tasks

Large Language Models: Since the advent of LLMs, much work has been conducted on how they can be used in the medical field\. Recent work by Li et al\.\[[24](https://arxiv.org/html/2605.26445#bib.bib8)\] investigated the performance of various LLMs on medical NER\. GPT\-4 achieved satisfactory F1 scores, with models like PromptNER and GPT\-NER achieving F1 scores above 90% on the BC5CDR and NCBI datasets\.
Ashok and Lipton\[[4](https://arxiv.org/html/2605.26445#bib.bib30)\] introduced PromptNER, which used a chain\-of\-thought approach to improve named entity recognition by generating a logical sequence of steps to identify entities in text\. Wang et al\.\[[30](https://arxiv.org/html/2605.26445#bib.bib31)\] introduced GPT\-NER by appending special tokens and adding a self\-verification strategy\. Another work by Hu et al\.\[[18](https://arxiv.org/html/2605.26445#bib.bib32)\] evaluated the use of GPT\-3\.5 and GPT\-4 for clinical NER tasks, focusing on datasets from MTSamples and VAERS\. By employing a structured prompt engineering framework, the models demonstrated improved performance in extracting medical problems, treatments, and tests, as well as in identifying adverse events related to nervous system disorders\. Despite these improvements, the GPT models still lag behind BioClinicalBERT, which had superior performance on both datasets\. The study highlighted the potential of GPT models for clinical NER tasks, but also underscored the need for further refinement and the development of better evaluation metrics\. All datasets and code are publicly available, promoting further research and development in this area\.

Small Language Models: Small Language Models \(SLMs\) demonstrated greater sensitivity to the amount of training data, with performance improving significantly as the amount of data increased\. For example, models like W\-PROCER\[[25](https://arxiv.org/html/2605.26445#bib.bib34)\] and MetaNER\[[9](https://arxiv.org/html/2605.26445#bib.bib33)\] performed better on the 5\-shot datasets than on the 1\-shot datasets\. However, SLMs struggled with fewer annotations and lacked the robustness seen in LLMs, particularly in scenarios with limited training data\.

## 3 Materials and Methods

### 3\.1 Data source

The documents in ReDose were collected from seven subreddits detailed in Table[2](https://arxiv.org/html/2605.26445#S3.T2)\.
These subreddits were chosen because previous smaller studies have demonstrated their richness and validity\[[14](https://arxiv.org/html/2605.26445#bib.bib6),[28](https://arxiv.org/html/2605.26445#bib.bib7),[12](https://arxiv.org/html/2605.26445#bib.bib15)\]\. We first wrote custom programs using praw\[[5](https://arxiv.org/html/2605.26445#bib.bib18)\], a widely used Python wrapper for the Reddit API, which allows authenticated programmatic access to posts, comments, and metadata\. This enables the extraction of the text of each post and its timestamp from these subreddits\. To protect user privacy and ensure data integrity, metadata such as explicit usernames, geolocation references, or external links were discarded at the point of collection\. Duplicate posts were also excluded\. Specifically, if the same text was posted in two different subreddits, we removed all but one copy from the amalgamated dataset to prevent overrepresentation of repeated content\.

Table 2: Summary of the 7 Subreddits

### 3\.2 Annotation Process

Each document in ReDose, including training and testing datasets, was annotated with three entities: the drug, its dose, and the reported effect, by a board\-certified and clinically active medical toxicologist\. A sample annotation with three types of entities is presented in Figure[1](https://arxiv.org/html/2605.26445#S3.F1)\. Additionally, two medical science students independently annotated the validation dataset for the same entities\. “DRUG” represents a xenobiotic, the term used in pharmacology and toxicology for a drug, vitamin, or herb, but not a neurotransmitter\. “DOSE” indicates the quantity and units of the substance taken\. In our annotation scheme, an n\-gram of any size could be annotated\. “EFFECT” represents a change in the physical or mental state that the writer attributed to the substance\. For example, “SEA \#4”, “black tar heroin”, and “cocaine” all received the label DRUG\.
We chose this approach to capture the same concept written in different ways\. Online commentaries often use periphrastic or obfuscatory constructions\. Spacing between words is frequently irregular: “SEA\#4” and “SEA \#4” occur with almost equal frequency\. One goal of developing an NER module is to feed its output into a named entity linker\. While “heroin”, “SEA \#4”, and “black tar heroin” may all refer to the substance heroin, they are distinguished as different drug variants during the linking process, since “SEA \#4” \(Southeast Asia \#4\) is more potent and “black tar heroin” is the formulation most associated with wound infections\.

Figure 1: A sample annotation with three types of entities\.

### 3\.3 Validation of annotations

To better quantify the agreement between two annotators, we employed a more suitable inter\-annotator agreement \(IAA\) calculation method\. The traditional IAA uses a binary measurement, which classifies each rater as fully agreeing or fully disagreeing with each document\. This metric is less informative for documents that contain multiple entities\. Only by analyzing the inter\-rater agreement entity by entity can the traditional IAA calculation be effective\. However, it would not assess how well ReDose can train models for the intended use case\. Thus, we adopted the method described by Jarrar et al\.\[[20](https://arxiv.org/html/2605.26445#bib.bib14)\] for our needs:

$$\kappa = \frac{P_o - P_e}{1 - P_e} \tag{1}$$

$$P_e = \frac{1}{N^2} \sum_{T} n_{T,1} \times n_{T,2}. \tag{2}$$

In this formula, $P_o$ refers to the observed agreement between annotators and $P_e$ to the expected agreement\. $N$ represents the total number of annotations in the dataset, $T$ ranges over the labels \(categories\) in the label set, and $n_{T,i}$ is the number of times annotator $i$ assigned the label $T$\.
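To make Equations (1) and (2) concrete, here is a minimal Python sketch. It assumes the two annotators' labels have already been aligned into equally long sequences (for example, one label per token, with "O" for unlabeled tokens); that alignment unit and the function name `kappa` are illustrative assumptions, not details given in the paper.

```python
from collections import Counter


def kappa(labels_1, labels_2):
    """Agreement per Eqs. (1)-(2) for two aligned label sequences.

    Assumption: both sequences label the same N units in the same order.
    """
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two non-empty, equally long label sequences")
    n = len(labels_1)
    # P_o: fraction of units on which the two annotators agree.
    p_o = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # P_e: (1 / N^2) * sum over labels T of n_{T,1} * n_{T,2}.
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    p_e = sum(counts_1[t] * counts_2[t] for t in set(counts_1) | set(counts_2)) / n**2
    if p_e == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


annotator_1 = ["DRUG", "DRUG", "DOSE", "EFFECT", "O", "O"]
annotator_2 = ["DRUG", "DRUG", "DOSE", "O", "O", "O"]
print(round(kappa(annotator_1, annotator_2), 2))  # 0.76
```

In this toy case the annotators agree on 5 of 6 units ($P_o = 5/6$), the expected agreement is $P_e = 11/36$, and $\kappa = 19/25 = 0.76$.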
### 3\.4 Developing the BERT\-based models

We first developed three BERT\-based models: BaseBERT\[[11](https://arxiv.org/html/2605.26445#bib.bib4)\], BioBERT\[[22](https://arxiv.org/html/2605.26445#bib.bib10)\], and BiomedBERT\[[15](https://arxiv.org/html/2605.26445#bib.bib13)\]\. These models were pre\-trained on general corpora, medical knowledge, and biomedical data, respectively, and then fine\-tuned on the ReDose training set\. We excluded sentences that contain no entities from the training set\. Since most articles exceed the token limit of BERT \(512\), we applied sentence tokenization using spaCy\. The experiment was conducted on a single server equipped with two A6000 GPUs, each offering 48 GB of memory\. The hyperparameters used in the experiment include 5 training epochs, a learning rate of 3e\-5, a batch size of 8, and the default AdamW optimizer\. To assess whether additional neural network layers could enhance performance, we incorporated two models that combine a conditional random field \(CRF\) and long short\-term memory \(LSTM\) layers on top of the better\-performing BaseBERT and BioBERT models separately, following the architecture proposed by Huang et al\.\[[19](https://arxiv.org/html/2605.26445#bib.bib16)\]\.

#### Evaluation settings

We employed the span\-level precision, recall, and F1 scores as evaluation metrics\. For a prediction to be considered a true positive, it must satisfy two conditions: \(1\) the predicted span is identical to the true entity span, and \(2\) the predicted and true entities have the same entity type\. To assess the statistical significance of our models, we performed document\-level bootstrapping for five BERT\-based models\. Specifically, for the test set, we sampled 367 documents with replacement and evaluated the model on these documents\. This process was repeated 100 times, yielding a distribution of the performance metric, such as Kappa\. From this distribution, we reported the 95% confidence intervals\.
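The evaluation settings above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: each document is represented as a list of `(start, end, type)` spans, scores are micro-averaged over exact matches, F1 is used as the bootstrapped metric, and the interval uses a simple percentile rule. The function names, the span representation, and the seed are all hypothetical.

```python
import random


def span_prf(gold_docs, pred_docs):
    """Micro precision/recall/F1 over exact (start, end, type) span matches."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)   # same boundaries and same entity type
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def bootstrap_f1_ci(gold_docs, pred_docs, n_rounds=100, alpha=0.05, seed=0):
    """Percentile confidence interval from document-level resampling."""
    rng = random.Random(seed)
    n = len(gold_docs)
    scores = []
    for _ in range(n_rounds):
        # Resample as many documents as the test set holds, with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        _, _, f1 = span_prf([gold_docs[i] for i in idx],
                            [pred_docs[i] for i in idx])
        scores.append(f1)
    scores.sort()
    lower = scores[int((alpha / 2) * (n_rounds - 1))]
    upper = scores[int((1 - alpha / 2) * (n_rounds - 1))]
    return lower, upper


gold = [[(0, 7, "DRUG"), (11, 15, "DOSE")], [(3, 9, "DRUG")]]
pred = [[(0, 7, "DRUG"), (11, 14, "DOSE")], [(3, 9, "EFFECT")]]
print(span_prf(gold, pred))         # one exact match out of three predictions
print(bootstrap_f1_ci(gold, pred))  # (lower, upper) bounds on F1
```

Note how a DOSE span that is off by one character and a span with the wrong type both count as errors, which is what makes the exact-match criterion strict.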
### 3\.5 Developing the Large Language Models

We included results from two LLMs in our study: GPT\-4\[[2](https://arxiv.org/html/2605.26445#bib.bib12)\] and the open\-source Llama 3 model\[[3](https://arxiv.org/html/2605.26445#bib.bib11)\]\. We applied one\-shot prompting \(Box[3\.5](https://arxiv.org/html/2605.26445#S3.SS5)\) and RAG\-based prompting\[[23](https://arxiv.org/html/2605.26445#bib.bib3)\]\. For the Llama 3 model, we tested both the small\-scale 8B and the large\-scale 70B variants\.

```
Carefully read the following sentence. Output any mention that can be classified as DRUG, EFFECT, or DOSE and return the mention in the format of a "token" - label. If there's a duplicate, return one is enough.
A DRUG is defined as a drug, vitamin, or herb, but not a neurotransmitter. Here is an example for DRUG: 'Also, in addition to a saturated solution in terms of dissolved APAP, a cloudy solution has all that plus a shitload of suspended / unsettled APAP as well ... what you see at the bottom is indeed some APAP, but also a shitload of insoluble excipients/fillers/colorants.' Should return an output: "APAP" - DRUG.
An EFFECT is defined as a change in physical or mental state associated with the substance. An example for EFFECT: 'Is there anything anyone can recommend for the RLS ?' should return an output: 'RLS' - EFFECT;
A DOSE is defined as the quantity of a medicine or drug taken or recommended to be taken at a particular time. An example for DOSE: 'I was thinking like 2 mg' should return an output: "2" - DOSE;
If there are multiple mentions of DRUG and EFFECT in the input, output them all. If there is no mention, please output: NO MENTION
```

We constructed a corpus by reshaping the training corpus for BERT\-based models into JSON format, with each item consisting of a sentence and its relevant labels\.
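As a rough illustration of this reshaping step, the sketch below collapses a BIO-tagged sentence into one JSON item. The paper does not specify the on-disk layout, so several things here are assumptions: the BIO input representation, the helper name `bio_to_item`, and the key names, which are borrowed from the "example sentence" / "example label" fields visible in the paper's RAG prompt example.

```python
import json


def bio_to_item(tokens, tags):
    """Collapse BIO tags into (mention, type) pairs and pair them with the sentence."""
    mentions, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if tag == "O" or starts_new:
            # Close the entity collected so far, if any.
            if current:
                mentions.append((" ".join(current), current_type))
            current, current_type = [], None
        if tag != "O":
            current.append(token)
            current_type = tag[2:]
    if current:
        mentions.append((" ".join(current), current_type))
    label = "[" + ", ".join(f"({m}, {t})" for m, t in mentions) + "]"
    return {"example sentence": " ".join(tokens), "example label": label}


item = bio_to_item(
    ["black", "tar", "heroin", "and", "cocaine"],
    ["B-DRUG", "I-DRUG", "I-DRUG", "O", "B-DRUG"],
)
print(json.dumps(item))
```

Keeping the label as a flat string keeps each item easy to paste into a prompt verbatim; a nested list would be equally reasonable if the corpus were consumed programmatically.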
With each retrieval round, the system utilized the BertSimilarity module from\[[33](https://arxiv.org/html/2605.26445#bib.bib2)\] to identify the top 5 examples from the corpus most similar to the current query and append them to the RAG prompt \(Box[3\.5](https://arxiv.org/html/2605.26445#S3.SS5)\)\. The concatenated prompt was then fed into the generation pipeline\. We provide an example of our final prompt, along with the query \(extracted from the test dataset\) and relevant examples, below\.

```
Carefully read the following sentence and output any mention that can be classified as DRUG, EFFECT, or DOSE and return the mention in a list of "token" - label. No justification is needed. If there's no mention, return an empty list. If there's a duplicate, return one is enough. Examples of similar sentences are provided as well.
Examples of similar sentences:
1) ... "example sentence": "Because you are wearing tracks deep into your neural processes by flooding them with cheap dopamine and serotonin frequently .", "example label": "[(dopamine, DRUG), (serotonin, DRUG)]" ....
2) ... "example sentence": "Still on an antidepressant but not one affecting serotonin as much.", "example label": "[(antidepressant, DRUG), (serotonin, DRUG)]" ...
3) ... "example sentence": "I 2nd this ; your depression is reduced by dopamine stimulated by mu - receptor temporarily .", "example label": "[(dopamine, DRUG)]"
4) ... "example sentence": "I definitely get a burst of energy / dopamine from subs I've been on them for many years same dose ( actually less is more for me ) .", "example label": "[(dopamine, DRUG)]"
5) ... "example sentence": "Benzos just do nt touch me antidepressants do not help and stimulants are a no go." ...
Input sentence: But I don't know of any antidepressants that are selective dopamine reuptake inhibitors ...
```

#### Evaluation settings

Given that LLMs typically struggle to report correct token offsets due to their insensitivity to numbers, we used document\-level precision, recall, and F1 scores as our evaluation metrics\. For a prediction to be classified as a true positive, it must meet two criteria: \(1\) the predicted entity must exactly match the true entity within each document, and \(2\) the predicted entity and the true entity must share the same entity type\. If an entity appears multiple times, it was counted only once per document\. Note that although our prompt included an instruction to extract DOSE, the subsequent scoring process became highly complex and subjective\. The main reason is that DOSE entities are typically composed of multiple tokens, and it is common for LLMs to extract parts of them or return them as separate pairs, which makes evaluation difficult\. Thus, we did not include the DOSE in the LLM report, and F1 scores were calculated without considering the DOSE\.

## 4 Results

### 4\.1 Dataset Records

ReDose contains 6,435 documents and 15,469 sentences with 6,267 drug\-related entities\. There are 4,784 DRUG, 750 EFFECT, and 733 DOSE entities\. The dataset is split into 6,068 documents for training and 367 documents for testing \(Figure[3](https://arxiv.org/html/2605.26445#S4.F3)\)\. At the sentence level, it is split into 14,260 sentences for training and 1,209 for testing\. Table[3](https://arxiv.org/html/2605.26445#S4.T3) outlines the entity distribution: the training set includes 4,004 DRUG, 647 DOSE, and 674 EFFECT entities, while the test set contains 833 DRUG, 98 DOSE, and 128 EFFECT entities\. While annotations can be represented in various formats, we used the BioC XML format due to several considerations\[[10](https://arxiv.org/html/2605.26445#bib.bib27)\], especially because the format is simple and easy to modify, allowing analysis tools to be applied rapidly\.
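As a hedged illustration of how a BioC XML file of this kind might be read with the Python standard library, consider the sketch below. The XML string is a made-up miniature, not an excerpt from ReDose; the `infon` key `type` and the convention that annotation offsets are counted from the start of the document (so the passage offset is subtracted before slicing) are assumptions based on common BioC usage.

```python
import xml.etree.ElementTree as ET

# Illustrative miniature collection; not taken from the dataset.
SAMPLE = """<collection>
  <document>
    <id>doc-1</id>
    <passage>
      <offset>0</offset>
      <text>black tar heroin and cocaine</text>
      <annotation id="T1">
        <infon key="type">DRUG</infon>
        <location offset="0" length="16"/>
        <text>black tar heroin</text>
      </annotation>
      <annotation id="T2">
        <infon key="type">DRUG</infon>
        <location offset="21" length="7"/>
        <text>cocaine</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_string):
    """Return (document id, entity type, surface text) for every annotation."""
    root = ET.fromstring(xml_string)
    rows = []
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for passage in doc.iter("passage"):
            base = int(passage.findtext("offset"))
            text = passage.findtext("text")  # the passage's own text child
            for ann in passage.iter("annotation"):
                loc = ann.find("location")
                # Assumed convention: offsets are document-level.
                start = int(loc.get("offset")) - base
                end = start + int(loc.get("length"))
                entity_type = ann.find("infon[@key='type']").text
                rows.append((doc_id, entity_type, text[start:end]))
    return rows


for row in read_annotations(SAMPLE):
    print(row)
```

Slicing the passage text by offset and length, rather than trusting the annotation's own `text` child, doubles as a consistency check on the stored locations.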
A sample XML is shown in Figure[2](https://arxiv.org/html/2605.26445#S4.F2); the starting point for the BioC format is a collection of documents\.

Figure 2: BioC format\.

Each document consists of a series of passages\. Every passage includes an offset, which gives the character offset within the parent document, and text, which stores the actual text of the passage\. In each passage, annotations, which identify entities of DRUG, DOSE, and EFFECT, are applied directly to the surface text, which means the exact words as they appear rather than the transformed or normalized clinical concepts\. The annotation contains location; it specifies the starting offset and the length of the annotated text within the passage\. The IAA between our two annotators on the validation dataset is 75\.1%\. The labels from both annotators are all included in our dataset in the testing folder\.

Figure 3: Creation of the ReDose dataset\.

Table 3: Description of ReDose\.

### 4\.2 Performance of the BERT\-based models

Table[4](https://arxiv.org/html/2605.26445#S4.T4) shows the performance for different BERT\-based models in entity extraction tasks\. BaseBERT achieved the highest recall for DRUG extraction \(0\.873\), compared with BioBERT \(0\.838\) and BiomedBERT \(0\.787\)\. Additionally, it achieved a competitive F1\-score of 0\.882\. In comparison, BioBERT achieved the highest recall for DOSE extraction \(0\.788\), compared with BaseBERT \(0\.752\) and BiomedBERT \(0\.714\)\. Notably, BiomedBERT achieved the highest precision in DRUG extraction \(0\.907\), compared with BaseBERT \(0\.891\) and BioBERT \(0\.892\)\. However, the analysis of these metrics also highlighted significant variability in EFFECT extraction\. Specifically, BiomedBERT showed a considerably low recall of 0\.025 and an F1\-score of 0\.043, suggesting that while this model is robust for DRUG extraction, it struggles to detect more nuanced entities, such as EFFECT\.
BaseBERT and BioBERT are not competitive on EFFECT either; they show a precision of 0\.424 and 0\.380, a recall of 0\.231 and 0\.167, and an F1\-score of 0\.298 and 0\.230, respectively\. To determine if better performance could be achieved, we added CRF\+LSTM layers to the BaseBERT and BioBERT models\.

Table 4: Performance Comparison for BERT\-based models\.

When comparing BERT models with the CRF\+LSTM architecture, we found that the BioBERT\+CRF\+LSTM and BaseBERT\+CRF\+LSTM models performed better in identifying EFFECT than BiomedBERT, achieving notably higher F1\-scores of 0\.231 and 0\.291, respectively, compared to 0\.043\. However, their performance in extracting DRUG and DOSE was comparable to that of other BERT\-based models\.

### 4\.3 Performance of the Large Language Models

Table[5](https://arxiv.org/html/2605.26445#S4.T5) compares the performance of four types of LLMs in DRUG and EFFECT extraction\. Among these, Llama\-3 70B achieved the highest performance in DRUG extraction with a precision of 0\.74, a recall of 0\.84, and an F1\-score of 0\.79\. In contrast, GPT\-4 exhibited more balanced performance across both entity types\. Although its DRUG F1\-score \(0\.72\) was slightly lower than that of Llama\-3 70B, GPT\-4 achieved a much higher recall on EFFECT extraction \(0\.41\), resulting in the best overall F1\-score for EFFECT\. Nevertheless, GPT\-4’s micro averages \(precision: 0\.46, recall: 0\.69, F1\-score: 0\.55\) fell behind those of Llama\-3 70B \(precision: 0\.64, recall: 0\.76, F1\-score: 0\.70\)\. Further experiments showed that the RAG approach significantly improved Llama\-3 8B’s performance in DRUG extraction, with a recall exceeding 80% vs\. 44% without RAG and an F1\-score of 75% vs\. 55% without RAG\.
However, gains in EFFECT extraction were modest, with the F1\-score increasing only from 0\.05 to 0\.07\.

Table 5: Performance Comparison for LLMs\.

## 5 Discussion

ReDose is a novel dataset that not only can be integrated with other datasets to build a comprehensive medical NER dataset but also provides valuable insights into substances used by the public and the vernacular terms they use, which often differ markedly from formal medical terminologies\. For example, the drug “fentanyl” may be referred to as “fent” or “f” on online forums\. Epidemiologists unfamiliar with these terms may miss emerging hotspots, delaying intervention\. Physicians unfamiliar with informal terms might misunderstand a patient’s pattern of substance use, or have difficulty building rapport\. This highlights another crucial aspect of ReDose: it bridges the gap between the formal medical language used in healthcare settings and the colloquial terms used in everyday discussions about drugs\.

When testing both the small\-scale 8B and the large\-scale 70B variants for Llama 3, we observed that the base model demonstrated a significantly weaker ability to follow the instructions in the prompts\. This observation is consistent with prior work demonstrating that instruction\-tuned models outperform base models in the medical field\[[17](https://arxiv.org/html/2605.26445#bib.bib28),[32](https://arxiv.org/html/2605.26445#bib.bib29)\]\. These findings underscore the importance of instruction\-tuning in instruction\-intensive tasks, particularly in the medical field\.

Regarding BERT\-based models, several factors may contribute to observed variations in performance\. First, discussions on platforms like Reddit about drugs, dosages, and effects often utilize language more akin to everyday speech\. This linguistic alignment might account for the similar performance levels observed in BaseBERT and BiomedBERT\.
Secondly, the relative simplicity of the task may limit the advantages of employing more complex models, such as the CRF\+LSTM architecture, which has not shown significant improvements over simpler models\. Additionally, tokenization plays a crucial role in the performance of NER systems\. For instance, both BaseBERT and BioBERT split the word “chloride” into tokens “ch\-lo\-ride”, treating “lo” and “ride” as special tokens and leaving “ch” for prediction by the model\. This approach to tokenization could significantly affect the performance\. Moreover, there appears to be confusion in how BERT\-based models manage certain terminology\. For instance, the term “Straight” is labeled as O \(Outside\) in the training set but is labeled as DRUG in the testing set due to different meanings represented in the sentence, which suggests inconsistencies in dataset annotations\. Similarly, the term “Oxy” is annotated as B\-DRUG \(Beginning of DRUG\) in the training data but changes to I\-DRUG \(Inside of DRUG\) in the testing set\. Finally, the models struggle particularly with the EFFECT category\. The annotation of EFFECT often requires recognizing spans of multiple words, such as in “depress your respiration”\. While the models may accurately identify “depress” as B\-EFFECT \(Beginning of EFFECT\), they frequently fail to recognize “your” as part of the same entity \(I\-EFFECT\), leading to incorrect or incomplete entity predictions\.

In contrast, the performance differences among LLMs can be attributed to several other factors\. Firstly, unlike BERT\-based models, LLMs operate without supervision and therefore lack access to label distribution\. This limitation leads to LLMs having a poor estimate of entity frequency in the dataset\. Consequently, we observe more false\-negative extractions from LLMs than from the gold\-standard labels\.
Secondly, our evaluation applies a strict criterion: extracted entities are deemed incorrect unless every word in an entity is output exactly as annotated\. This high standard particularly challenges EFFECT extraction, since users describe their feelings in diverse language\. In our dataset, EFFECT spans range from canonical affective terms \(e\.g\., “euphoria”\) to abbreviations \(e\.g\., “WD” for withdrawal, as well as shorthand such as “PWD”\), and even metaphorical expressions \(e\.g\., “kill your hormones”\)\. Such variability increases boundary ambiguity and often necessitates co\-reference resolution to determine the referent of an effect mention\. Moreover, under our strict criteria, predictions that deviate by even a single token are counted as incorrect, thereby disproportionately penalizing EFFECT relative to more lexically stable entity types\. A related challenge arises for DOSE extraction, as dosage information is frequently expressed in relative or ranged forms \(e\.g\., proportions or intervals\) rather than as standardized absolute quantities\. Future work could improve EFFECT extraction by augmenting training data with abbreviation and slang normalization \(e\.g\., mapping WD to withdrawal\) and lexicon\-assisted weak supervision, and by incorporating context modeling beyond sentence\-level NER, such as co\-reference resolution, to better link effects to the triggering drug\-use context\.

This study has several limitations\. One limitation of ReDose is the potential for selection bias\. Not all substance users share their personal experiences in public forums or honestly represent their full experience\. ReDose focuses on the dose\-effect relationship, a relationship essential to understanding toxicity and problematic usage\. However, dose\-effect relationships do not tell the whole story of substance use\. Medical records provide a more appropriate data source for analyzing fatalities\.
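To make the strict matching criterion described above concrete, the following is a minimal sketch of entity\-level scoring under exact match\. It is an illustration only, not the evaluation script used for the benchmarks: the function name `strict_prf`, the `(entity_type, text)` tuple format, and case\-sensitive comparison are all assumptions, and the two\-entity example is hypothetical \(it reuses the “fent” and “depress your respiration” spans mentioned in this section\)\.

```python
from collections import Counter


def strict_prf(gold, pred):
    """Entity-level precision, recall, and F1 under exact string match.

    gold and pred are lists of (entity_type, text) tuples (an assumed format).
    A prediction is correct only if its type and every word match a gold entity.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    # Multiset intersection: each gold entity can be matched at most once.
    tp = sum((gold_counts & pred_counts).values())
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: the EFFECT prediction stops after the first word,
# so it receives no credit even though it overlaps the gold span.
gold = [("DRUG", "fent"), ("EFFECT", "depress your respiration")]
pred = [("DRUG", "fent"), ("EFFECT", "depress")]

precision, recall, f1 = strict_prf(gold, pred)
print(precision, recall, f1)
```

In this sketch the truncated EFFECT prediction counts as both a false positive and a false negative, which is why boundary errors lower precision and recall at the same time\. A relaxed \(overlap\-based\) criterion would score the same prediction as partially correct\.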
Our benchmarks evaluated the original large language models with a one\-shot trial\. Given the size of our dataset, we did not perform fine\-tuning on ReDose, which might improve performance\. Finally, while annotations were curated by domain\-informed annotators, future extensions of ReDose may benefit from broader engagement with experts\. We hope that future efforts will address these limitations and contribute to the development of improved models and benchmarks\.

Notably, BERT\-based models outperformed state\-of\-the\-art large language models in terms of recall and F1\-score\. Among the BERT\-based models, BaseBERT achieved higher recall and F1\-scores than the domain\-specific models\. This may be attributed to the nature of Reddit content, where user\-generated text tends to align more closely with general language corpora than with strictly medical jargon\. Additionally, attempts to enhance BERT\-based models with CRF\+LSTM layers did not yield better performance\. This may be due to the increased complexity of these layers, which may have introduced noise rather than improving the model’s ability to extract relevant information from unstructured text\. These results suggest that simpler models may be better suited to user\-generated content, particularly when the text deviates from domain\-specific language\.

Several directions remain for future exploration\. First, expanding the annotation schema would improve coverage of colloquial expressions, rare substances, and nuanced descriptions of effects, which are currently underrepresented\. Such schema expansion may also facilitate the identification of emerging or previously unseen drugs by capturing recurring patterns of drug use discourse, even when specific substance names are not yet standardized or widely known\. Second, more advanced fine\-tuning approaches, including parameter\-efficient tuning and domain\-adaptive pretraining, could further enhance model robustness across entity types\.
Beyond the current English\-only dataset, cross\-lingual extensions of ReDose would enable monitoring of substance use discourse in diverse communities worldwide\. Another promising direction is multimodal integration: incorporating non\-textual cues such as emojis, images, or memes may enrich the contextual interpretation of user posts\.

## 6 Conclusion

To bridge the gap between formal medical terminology and the colloquial language commonly used in online discussions about drug use, we proposed ReDose\. By curating a dataset of 6,435 Reddit posts with annotated drug names, doses, and effects, ReDose enabled the development and benchmarking of machine learning models to extract clinically relevant information from unstructured social media text\. The benchmarking results demonstrated that while LLMs such as GPT\-4 and Llama\-3 showed promise, domain\-specific, fine\-tuned models, such as BioBERT and BiomedBERT, also achieved high precision, particularly for drug extraction\. The results suggest that RAG may be a better alternative to manually selecting the most representative examples under this setting\. ReDose not only provides a valuable resource for advancing natural language processing in the substance use domain but also emphasizes the importance of integrating user\-generated data into public health research to detect emerging substance use trends and guide interventions\. Future work should address dataset biases, refine annotations, and explore fine\-tuning methods to further improve model performance\.

## Acknowledgment

This research was supported by the National Library of Medicine under grant number R01LM014306\.

## References

- \[1\]A\. Abdelati, M\. M\. Burns, and M\. Chary\(2023\)Sublethal toxicities of 2, 4\-dinitrophenol as inferred from online self\-reports\.PLoS one18\(9\),pp\. e0290630\.Cited by:[§1](https://arxiv.org/html/2605.26445#S1.p1.1)\. - \[2\]J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\.
Altenschmidt, S\. Altman, S\. Anadkat,et al\.\(2024\)GPT\-4 technical report\.arXiv preprint arXiv:2303\.08774\.External Links:2303\.08774Cited by:[§3\.5](https://arxiv.org/html/2605.26445#S3.SS5.p1.1)\. - \[3\]AI@Meta\(2024\)Llama 3 model card\.External Links:[Link](https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md)Cited by:[§3\.5](https://arxiv.org/html/2605.26445#S3.SS5.p1.1)\. - \[4\]D\. Ashok and Z\. C\. Lipton\(2023\)PromptNER: prompting for named entity recognition\.arXiv preprint arXiv:2305\.15444\.Cited by:[§2\.2](https://arxiv.org/html/2605.26445#S2.SS2.p1.1)\. - \[5\]B\. Boe\(2023\)PRAW: the python reddit api wrapper\.External Links:[Link](https://arxiv.org/html/2605.26445v1/praw.readthedocs.io/en/stable/)Cited by:[§3\.1](https://arxiv.org/html/2605.26445#S3.SS1.p1.1)\. - \[6\]D\. A\. Bowen, J\. O’Donnell, and S\. A\. Sumner\(2019\)Increases in online posts about synthetic opioids preceding increases in synthetic opioid death rates: a retrospective observational study\.Journal of general internal medicine34,pp\. 2702–2704\.Cited by:[§1](https://arxiv.org/html/2605.26445#S1.p1.1)\. - \[7\]M\. Chary, N\. Genes, C\. Giraud\-Carrier, C\. Hanson, L\. S\. Nelson, and A\. F\. Manini\(2017\)Epidemiology from tweets: estimating misuse of prescription opioids in the usa from social media\.Journal of Medical Toxicology13\(4\),pp\. 278–286\.Cited by:[§1](https://arxiv.org/html/2605.26445#S1.p1.1)\. - \[8\]M\. Chary, E\. H\. Park, A\. McKenzie, J\. Sun, A\. F\. Manini, and N\. Genes\(2014\)Signs & symptoms of dextromethorphan exposure from youtube\.PloS one9\(2\),pp\. e82452\.Cited by:[§1](https://arxiv.org/html/2605.26445#S1.p1.1)\. - \[9\]J\. Chen, Y\. Lu, H\. Lin, J\. Lou, W\. Jia, D\. Dai, H\. Wu, B\. Cao, X\. Han, and L\. Sun\(2023\-07\)Learning in\-context learning for named entity recognition\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. 
Okazaki \(Eds\.\),Toronto, Canada,pp\. 13661–13675\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.764)Cited by:[§2\.2](https://arxiv.org/html/2605.26445#S2.SS2.p3.1)\. - \[10\]D\. C\. Comeau, R\. Islamaj Doğan, P\. Ciccarese, K\. B\. Cohen, M\. Krallinger, F\. Leitner, Z\. Lu, Y\. Peng, F\. Rinaldi, M\. Torii, A\. Valencia, K\. Verspoor, T\. C\. Wiegers, C\. H\. Wu, and W\. J\. Wilbur\(2013\)BioC: a minimalist approach to interoperability for biomedical text processing\.Database2013,pp\. bat064\.External Links:[Document](https://dx.doi.org/10.1093/database/bat064),ISSN 1758\-0463Cited by:[§4\.1](https://arxiv.org/html/2605.26445#S4.SS1.p1.1)\. - \[11\]J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova\(2019\-06\)BERT: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),Minneapolis, Minnesota,pp\. 4171–4186\.External Links:[Document](https://dx.doi.org/10.18653/v1/N19-1423)Cited by:[§3\.4](https://arxiv.org/html/2605.26445#S3.SS4.p1.1)\. - \[12\]Y\. Ge, S\. Das, K\. O’Connor, M\. A\. Al\-Garadi, G\. Gonzalez\-Hernandez, and A\. Sarker\(2024\)Reddit\-impacts: a named entity recognition dataset for analyzing clinical and social effects of substance use derived from social media\.arXiv preprint arXiv:2405\.06145\.External Links:2405\.06145Cited by:[Table 1](https://arxiv.org/html/2605.26445#S1.T1.12.12.3),[§1](https://arxiv.org/html/2605.26445#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.26445#S2.SS1.p6.1),[§3\.1](https://arxiv.org/html/2605.26445#S3.SS1.p1.1)\. - \[13\]R\. L\. Graves, C\. Tufts, Z\. F\. Meisel, D\. Polsky, L\. Ungar, and R\. M\. Merchant\(2018\)Opioid discussion in the twittersphere\.Substance use & misuse53,pp\. 2132–2139\.Cited by:[§1](https://arxiv.org/html/2605.26445#S1.p1.1)\. - \[14\]R\. L\. Graves, J\. Perrone, M\. A\. 
Al\-Garadi, Y\. Yang, J\. S\. Love, K\. O’Connor, G\. Gonzalez\-Hernandez, and A\. Sarker\(2022\-Jul\-Aug\)Thematic analysis of reddit content about buprenorphine\-naloxone using manual annotation and natural language processing techniques\.J Addict Med16\(4\),pp\. 454–460\.Cited by:[Table 1](https://arxiv.org/html/2605.26445#S1.T1.2.2.3),[§1](https://arxiv.org/html/2605.26445#S1.p1.1),[§1](https://arxiv.org/html/2605.26445#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.26445#S2.SS1.p4.1),[§3\.1](https://arxiv.org/html/2605.26445#S3.SS1.p1.1)\. - \[15\]Y\. Gu, R\. Tinn, H\. Cheng, M\. Lucas, N\. Usuyama, X\. Liu, T\. Naumann, J\. Gao, and H\. Poon\(2021\)Domain\-specific language model pretraining for biomedical natural language processing\.ACM Transactions on Computing for Healthcare \(HEALTH\)3\(1\),pp\. 1–23\.Cited by:[§3\.4](https://arxiv.org/html/2605.26445#S3.SS4.p1.1)\. - \[16\]S\. Henry, K\. Buchan, M\. Filannino, A\. Stubbs, and Ö\. Uzuner\(2020\)2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records\.Journal of the American Medical Informatics Association27\(1\),pp\. 3–12\.Cited by:[Table 1](https://arxiv.org/html/2605.26445#S1.T1.10.10.3),[§1](https://arxiv.org/html/2605.26445#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.26445#S2.SS1.p5.1)\. - \[17\]Y\. Hou, C\. Bert, A\. Gomaa, G\. Lahmer, D\. Höfler, T\. Weissmann, R\. Voigt, P\. Schubert, C\. Schmitter, A\. Depardon, S\. Semrau, A\. Maier, R\. Fietkau, Y\. Huang, and F\. Putz\(2024\-01\)Fine\-tuning a local llama\-3 large language model for automated privacy\-preserving physician letter generation in radiation oncology\.Frontiers in Artificial Intelligence7,pp\. 1493716\.External Links:[Document](https://dx.doi.org/10.3389/frai.2024.1493716)Cited by:[§5](https://arxiv.org/html/2605.26445#S5.p2.1)\. - \[18\]Y\. Hu, Q\. Chen, J\. Du, X\. Peng, V\. K\. Keloth, X\. Zuo, Y\. Zhou, Z\. Li, X\. Jiang, and Z\. 
Lu\(2024\-09\)Improving large language models for clinical named entity recognition via prompt engineering\.Journal of the American Medical Informatics Association31\(9\),pp\. 1812–1820\.External Links:[Document](https://dx.doi.org/10.1093/jamia/ocad259)Cited by:[§2\.2](https://arxiv.org/html/2605.26445#S2.SS2.p2.1)\. - \[19\]Z\. Huang, W\. Xu, and K\. Yu\(2015\)Bidirectional lstm\-crf models for sequence tagging\.arXiv preprint arXiv:1508\.01991\.External Links:1508\.01991Cited by:[§3\.4](https://arxiv.org/html/2605.26445#S3.SS4.p2.1)\. - \[20\]M\. Jarrar, M\. Khalilia, and S\. Ghanem\(2022\)Wojood: nested arabic named entity corpus and recognition using bert\.InProceedings of the Thirteenth Language Resources and Evaluation Conference,pp\. 3626–3636\.Cited by:[§3\.3](https://arxiv.org/html/2605.26445#S3.SS3.p1.8)\. - \[21\]E\. C\. Leas, E\. M\. Hendrickson, A\. L\. Nobles, R\. Todd, D\. M\. Smith, M\. Dredze, and J\. W\. Ayers\(2020\)Self\-reported cannabidiol \(cbd\) use for conditions with proven therapies\.JAMA Network Open3\(10\),pp\. e2020977\.External Links:[Document](https://dx.doi.org/10.1001/jamanetworkopen.2020.20977)Cited by:[Table 1](https://arxiv.org/html/2605.26445#S1.T1.4.4.3),[§1](https://arxiv.org/html/2605.26445#S1.p2.1)\. - \[22\]J\. Lee, W\. Yoon, S\. Kim, D\. Kim, S\. Kim, C\. H\. So, and J\. Kang\(2019\-09\)BioBERT: a pre\-trained biomedical language representation model for biomedical text mining\.Bioinformatics36\(4\),pp\. 1234–1240\.Cited by:[§3\.4](https://arxiv.org/html/2605.26445#S3.SS4.p1.1)\. - \[23\]P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel,et al\.\(2020\)Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.Advances in neural information processing systems33,pp\. 9459–9474\.Cited by:[§3\.5](https://arxiv.org/html/2605.26445#S3.SS5.p1.1)\. - \[24\]J\. Li, Y\. Sun, R\. J\. Johnson, D\. Sciaky, C\. Wei, R\. Leaman, A\. P\. Davis, C\. J\. 
Mattingly, T\. C\. Wiegers, and Z\. Lu\(2016\)BioCreative v cdr task corpus: a resource for chemical disease relation extraction\.Database \(Oxford\)2016\.Cited by:[Table 1](https://arxiv.org/html/2605.26445#S1.T1.8.8.3),[§1](https://arxiv.org/html/2605.26445#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.26445#S2.SS2.p1.1)\. - \[25\]M\. Li, Y\. Ye, J\. Yeung, H\. Zhou, H\. Chu, and R\. Zhang\(2023\)W\-procer: weighted prototypical contrastive learning for medical few\-shot named entity recognition\.arXiv preprint arXiv:2305\.18624\.Cited by:[§2\.2](https://arxiv.org/html/2605.26445#S2.SS2.p3.1)\. - \[26\]N\. I\. of Health U\.S\. Department of Health and H\. Services\(2011\-11\)Commonly abused prescription drugs\.Cited by:[§2\.1](https://arxiv.org/html/2605.26445#S2.SS1.p2.1)\. - \[27\]S\. Pandrekar, X\. Chen, G\. Gopalkrishna, A\. Srivastava, M\. Saltz, J\. Saltz, and F\. Wang\(2018\)Social media based analysis of opioid epidemic using reddit\.AMIA Annual Symposium Proceedings2018,pp\. 867\.Cited by:[§1](https://arxiv.org/html/2605.26445#S1.p1.1)\. - \[28\]A\. Spadaro, A\. Sarker, W\. Hogg\-Bremer, J\. S\. Love, N\. O’Donnell, L\. S\. Nelson, and J\. Perrone\(2022\-06\)Reddit discussions about buprenorphine associated precipitated withdrawal in the era of fentanyl\.Clin Toxicol \(Phila\)60\(6\),pp\. 694–701\.Cited by:[Table 1](https://arxiv.org/html/2605.26445#S1.T1.6.6.3),[§1](https://arxiv.org/html/2605.26445#S1.p1.1),[§1](https://arxiv.org/html/2605.26445#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.26445#S2.SS1.p5.1),[§3\.1](https://arxiv.org/html/2605.26445#S3.SS1.p1.1)\. - \[29\]J\. N\. Tettey and S\. Levissianos\(2017\)The global emergence of nps: an analysis of a new drug trend\.InNovel Psychoactive Substances: Policy, Economics and Drug Regulation,pp\. 1–12\.External Links:[Document](https://dx.doi.org/10.1007/978-3-319-60600-2%5F1)Cited by:[§1](https://arxiv.org/html/2605.26445#S1.p1.1)\. - \[30\]S\. Wang, X\. Sun, X\. Li, R\. Ouyang, F\. Wu, T\. 
Zhang, J\. Li, G\. Wang, and C\. Guo\(2025\-04\)GPT\-NER: named entity recognition via large language models\.InFindings of the Association for Computational Linguistics: NAACL 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 4257–4275\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.239)Cited by:[§2\.2](https://arxiv.org/html/2605.26445#S2.SS2.p1.1)\. - \[31\]A\. P\. Wright, C\. M\. Jones, D\. H\. Chau, R\. M\. Gladden, and S\. A\. Sumner\(2021\)Detection of emerging drugs involved in overdose via diachronic word embeddings of substances discussed on social media\.Journal of Biomedical Informatics119,pp\. 103824\.Cited by:[§1](https://arxiv.org/html/2605.26445#S1.p1.1)\. - \[32\]C\. Wu, P\. Qiu, J\. Liu, H\. Gu, N\. Li, Y\. Zhang, Y\. Wang, and W\. Xie\(2025\)Towards evaluating and building versatile large language models for medicine\.npj Digital Medicine8\(1\),pp\. 58\.Cited by:[§5](https://arxiv.org/html/2605.26445#S5.p2.1)\. - \[33\]M\. Xu\(2022\-04\)Similarity: Text similarity calculation toolkit for Java\.External Links:[Link](https://github.com/shibing624/similarity)Cited by:[§3\.5](https://arxiv.org/html/2605.26445#S3.SS5.p3.1)\.