LLM-based Models for Detecting Emerging Topics in Service Feedback

arXiv cs.AI Papers

Summary

This paper presents a novel methodology integrating LLMs, statistical techniques, and human-AI teaming to detect emerging topics in multilingual service feedback, aiming to improve service quality and fairness in public sector organizations.

arXiv:2606.26595v1 Announce Type: new Abstract: Enhancing the analysis of service feedback is essential for public sector organizations, particularly tax administrations, where trust and compliance depend on fair and effective service delivery. As feedback volumes grow, identifying emerging service quality issues and potential disparities across diverse populations becomes increasingly challenging. Traditional approaches often rely on manual review or static expert-defined indicators, limiting scalability and the ability to capture complex patterns in textual feedback. This paper presents a novel methodology that integrates large language models (LLMs), statistical techniques, and human-AI collaboration to improve multilingual customer feedback analysis. The primary objective is to detect emerging service quality topics that may also reveal potential inequities in service delivery. Our framework combines fine-tuned, quantized LLMs with expert oversight to produce accurate, computationally efficient, and context-aware analyses. The proposed approach was evaluated using similarity analysis and assessments from experienced tax officers, demonstrating stronger alignment with expert judgments than baseline models. By incorporating a human-in-the-loop framework, the methodology reduces LLM fabrication while improving the reliability and relevance of generated insights. The results demonstrate the practicality of combining LLMs with human expertise to support scalable, evidence-based decision-making in public sector organizations. This work contributes to the development of responsible AI systems that enhance service quality, responsiveness, fairness, and public trust through more effective analysis of multilingual customer feedback.
Original Article
View Cached Full Text

Cached at: 06/26/26, 05:14 AM

# LLM-based Models for Detecting Emerging Topics in Service Feedback
Source: [https://arxiv.org/html/2606.26595](https://arxiv.org/html/2606.26595)
Ruth BankeyCristián BravoDepartment of Statistical and Actuarial Sciences, The University of Western Ontario, London, Ontario, Canada\. \{mtavako5, cbravoro\}@uwo\.caCanada Revenue Agency\. Ruth\.Bankey@cra\-arc\.gc\.ca

###### Abstract

Enhancing the analysis of service feedback data quality is crucial for public sector organizations, particularly tax administrations, where trust and compliance depend on fair and effective service delivery\. A key challenge in this domain is to ensure that service quality remains consistent and inclusive across a diverse population, particularly as feedback volumes increase, a dynamic that can reveal disparities in how services are experienced, a topic of growing importance in public service analytics aimed at promoting equitable delivery\.

Traditional methods for analyzing service feedback and supporting the detection of potential disparities or bias often rely on manual processes or static expert\-driven indicators, which are not easily scalable and may struggle to capture the nuanced patterns present in textual data\. In contrast, data\-informed approaches that take advantage of actual feedback offer a more robust and dynamic means of identifying emerging service quality issues and promoting equitable service delivery\.

This paper presents a novel methodology that integrates LLMs, statistical techniques, and human\-AI teaming to improve multilingual customer feedback analysis, with the primary goal of detecting emerging topics in service quality, areas that can also indicate potential biases\. Using AI\-powered tools alongside expert oversight, our aim is to identify service delivery trends and support equitable, responsive outcomes across diverse demographic groups\. Our approach incorporates fine\-tuned and quantized LLMs optimized for accuracy while minimizing computational demands\. Validation was performed using similarity analysis and an evaluation survey by tax officers with direct experience and expertise in tax service feedback operations, demonstrating improved alignment with expert assessments compared to baseline models\.

This study addresses the challenges of adapting LLMs to specific organizational contexts through a targeted fine\-tuning approach\. By integrating a human\-in\-the\-loop methodology, we mitigate LLM fabrication and ensure reliable, context\-aware outputs\. The results highlight the practicality and effectiveness of this methodology in improving service quality, responsiveness, and decision\-making in public sector operations\. This research contributes to the development of responsible AI systems that prioritize fairness, inclusivity, and public trust in automated service delivery\.

###### keywords:

Multilingual Feedback Analysis, Customer Service, Large Language Models \(LLMs\), Trend Detection, Fine\-tuning, Quantization, Fairness, Human\-in\-the\-Loop Evaluation, Topic Categorization\.

## 1Introduction

Enhancing customer service quality has become a cornerstone of modern organizational success, as it directly influences customer satisfaction, loyalty, and trust\[[45](https://arxiv.org/html/2606.26595#bib.bib1)\]\. In public sector services, such as tax administrations, the stakes are even higher\. An efficient and responsive service ensures compliance and fosters public satisfaction, which is critical to maintaining trust in government institutions\. However, public satisfaction with tax services in Canada and the United States has been moderate, with room for improvement\. In Canada, the overall Client Satisfaction Index \(CSI\) score was 63 out of 100, reflecting a moderately high level of satisfaction\[[25](https://arxiv.org/html/2606.26595#bib.bib2)\]\. In the United States, citizen satisfaction with federal government services reached 69\.7 out of 100 in 2024, the highest since 2017\[[11](https://arxiv.org/html/2606.26595#bib.bib3)\]\. These figures underscore the need for public organizations to ensure fairness in treating diverse demographic groups and responding promptly and effectively to customer needs\. One way to assess fairness is to analyze the topics that emerge more frequently in feedback across different demographic segments\. Significant differences in topic frequency can indicate disparities in service experience, an important signal of possible bias\. Although this paper does not aim to detect bias directly, identifying such patterns supports fairness objectives and aligns with the broader definition of bias as uneven treatment or outcomes among groups\. Addressing these disparities, whether they stem from unconscious individual biases or systemic issues in automated systems, is essential to maintain public trust and ensure equitable service delivery\.Despite this critical need, existing approaches in public service organizations often rely on manual processes or quantitative indicators that are not data\-driven, which are insufficient to capture the complexity of taxpayer concerns or address the nuanced bias and inequality patterns that arise in diverse demographic groups\[[32](https://arxiv.org/html/2606.26595#bib.bib17)\]\. Manual methods lack the efficiency and scope required to systematically detect and address biases across diverse demographic groups, especially as the volume and diversity of feedback increases\. These processes are often slow, resource intensive and prone to human error and intrinsic variability, making it difficult to identify patterns or trends in a timely manner\. This issue can be exacerbated in countries with more than one official language\. This limitation becomes particularly critical when analyzing feedback at scale, where nuanced insights, such as disparities in service delivery between demographic segments, may be missed\. In addition, automating real\-time service quality evaluations based on customer behavior has demonstrated notable improvements in user engagement, evaluation accuracy, and overall efficiency of service operations\[[29](https://arxiv.org/html/2606.26595#bib.bib72),[7](https://arxiv.org/html/2606.26595#bib.bib66)\]\. These advances ensure that organizations not only address underlying biases, but also streamline processes, offering personalized and equitable services at scale, which further enhances customer satisfaction and operational performance\.

To achieve this goal, the integration of advanced tools like artificial intelligence, particularly when paired with statistical techniques, can be highly effective\. These tools can analyze large datasets, detect inequality patterns, and ensure that decision\-making processes are fair and inclusive on a large scale\. Recognizing this need, we aim to develop a methodology that integrates LLMs, statistical techniques, and human expertise\.

Previous research has increasingly emphasized the importance of identifying disparities in service experience and bias, given its profound implications for fairness and equality in various social sectors\[[19](https://arxiv.org/html/2606.26595#bib.bib7)\]\. This growing focus stems from the need to develop more equitable systems and data\-informed decision\-making processes\. Numerous studies have underscored the importance of addressing bias\[[18](https://arxiv.org/html/2606.26595#bib.bib117),[60](https://arxiv.org/html/2606.26595#bib.bib5)\], demonstrating how targeted interventions can foster inclusivity and improve the overall effectiveness of systems that influence both daily life and institutional operations\. By engaging in a comprehensive investigation of disparities, scholars contribute to the development of a more just and equitable society\[[19](https://arxiv.org/html/2606.26595#bib.bib7)\]\.

Recent research increasingly emphasizes detecting emerging trends in customer feedback across demographic groups as a foundation for assessing fairness\. While our study does not directly detect bias, identifying consistent group\-level differences in concerns can signal potential inequalities\. Common methodologies include statistical techniques such as factor analysis, ANOVA, and regression to examine disparities that may reflect systemic service gaps or unintended biases\. For instance, Linzmajer et al\.\[[30](https://arxiv.org/html/2606.26595#bib.bib9)\]employed lab and field experiments to assess how identity factors influence customer perceptions in service encounters\. With advances in artificial intelligence \(AI\) and machine learning \(ML\), researchers have increasingly turned to more sophisticated computational techniques to support the detection of disparities and promote fairness\.

In addition to general language processing tasks, researchers have increasingly turned their attention to analyzing customer feedback and reviews to uncover disparities in how service quality is experienced in different demographic groups\[[54](https://arxiv.org/html/2606.26595#bib.bib53)\]\. Since customer reviews play a vital role in shaping perceptions of service performance, identifying patterns in topic prevalence by demographic segment can support fairer evaluations and help organizations address unequal service outcomes\[[20](https://arxiv.org/html/2606.26595#bib.bib12)\]\. These emerging topic trends offer an indirect but meaningful way to surface potential disparities that may reflect underlying biases\. Studies have confirmed that feedback and reviews significantly influence operational decisions, making them a valuable area for equity\-focused analysis\[[6](https://arxiv.org/html/2606.26595#bib.bib13)\]\.

With the growing use of textual data in customer experience evaluation, NLP techniques have become crucial for large\-scale analysis\.More recently, LLMs have been employed to model topic distributions and explore contextual bias, as demonstrated by Kumar et al\.\[[27](https://arxiv.org/html/2606.26595#bib.bib16)\]\. While prior work emphasizes explicit bias detection, our study focuses on emerging multilingual feedback patterns that, when analyzed by demographic group, may highlight disparities in public service delivery\. While prior research has advanced automated bias detection in customer service, aligning methodologies with an organization’s specific goals and context remains a challenge\. Existing statistical approaches often overlook the nuanced realities of service delivery\. Detecting emerging feedback topics by demographic group offers a more scalable and context\-sensitive method for uncovering disparities\. Although not explicitly labeled as bias, such patterns can highlight unequal experiences or unmet needs—key insights for enhancing fairness and service quality in public institutions\. Much of the research on text\-based analysis remains monolingual, overlooking the multilingual needs of organizations serving diverse populations\. Public sector entities, in particular, require tools that can detect biases and trends across languages to enhance accessibility and inclusivity\[[42](https://arxiv.org/html/2606.26595#bib.bib84)\]\.

Although advanced NLP and LLM\-based methods offer advantages in speed and accuracy, their application to bias detection remains limited\. Many studies rely on general\-purpose pre\-trained models, which often fail to capture domain\-specific nuances\[[21](https://arxiv.org/html/2606.26595#bib.bib74)\]\. Fine\-tuned models tailored to specific sectors, such as customer service and finance, have shown improved accuracy, yet their adoption is constrained by computational demands\[[58](https://arxiv.org/html/2606.26595#bib.bib86)\]\. To address this, developing quantized, resource\-efficient models is vital for broader accessibility\. Moreover, integrating human expertise into the evaluation process is essential, as automated systems may miss subtle contextual cues\. A hybrid human\-AI approach improves fairness and reliability by combining large\-scale automation with expert oversight\[[41](https://arxiv.org/html/2606.26595#bib.bib121)\]\.

AI\-powered tools, particularly NLP, hold great promise for analyzing large\-scale customer feedback and identifying emerging trends across demographic groups\. These trends can reveal disparities in service perception or delivery\[[1](https://arxiv.org/html/2606.26595#bib.bib87),[46](https://arxiv.org/html/2606.26595#bib.bib88)\]\. Miraz et al\.\[[33](https://arxiv.org/html/2606.26595#bib.bib133)\]emphasized the effectiveness of AI chatbots in enhancing customer service through improved communication, usability, and user trust, with broad impacts on operations and engagement\. However, the deployment of such technologies must address algorithmic bias—systemic unfairness arising from biased data or design\. To mitigate these risks, ethically aligned AI principles are essential, promoting fairness, transparency, and accountability in AI\-driven applications\[[35](https://arxiv.org/html/2606.26595#bib.bib81)\]\.

Ethically aligned AI principles serve as a foundation for responsible AI development, emphasizing key attributes such as transparency, accountability, explainability, and fairness\[[5](https://arxiv.org/html/2606.26595#bib.bib122)\]\.Fairnessensures that AI systems provide equitable treatment to all demographic groups, preventing discrimination, and fostering inclusivity\.Accountabilityintroduces human oversight into the AI decision making process, allowing for the validation and refinement of AI\-generated outcomes\.Transparencyplays a vital role in ensuring that AI processes remain open and comprehensible, allowing users to understand how decisions are made\. Additionally,explainabilityenhances trust by providing clear, justifiable reasoning behind AI decisions, ensuring that users and stakeholders can interpret AI\-driven conclusions with confidence\. By integrating these principles, organizations can develop AI systems that perform effectively and uphold ethical standards\. This approach helps mitigate biases, fosters public trust, and ensures that AI technologies serve society in a fair and responsible manner\.

To address existing gaps, our methodology uses fine\-tuned LLMs to classify customer feedback into predefined categories \(Service Quality Elements, SQEs\) and analyze trends—emerging, persistent, or disappearing—across demographic groups\. By focusing on Equity, Diversity, and Inclusion \(EDI\) and organizational goals, the tool captures issues often missed in traditional feedback systems\. The approach blends statistical modeling with quantized LLMs for conceptual relevance and resource efficiency\. Compared to baseline models, our LLMs showed stronger alignment with expert judgments, as confirmed through similarity analysis and expert surveys\.

Based on the objectives described, this study seeks to explore the following key research questions\.

1. 1\.How can we develop an automated human\-in\-the\-loop system for thematic analysis of taxpayer feedback that ensures transparency and explainability while enabling fair and inclusive service evaluation across diverse demographic groups?
2. 2\.How can we design a model that accurately identifies emerging patterns across different demographic groups within the feedback?
3. 3\.How can we implement automated topic modeling for feedback analysis to categorize responses into specific predefined categories?

Our methodology is designed to address three critical challenges that often arise when deploying LLMs for operational improvements\. One of the primary concerns is adapting LLMs to specific organizational contexts\. Many organizations use domain\-specific language or specialized terminology that differs significantly from the everyday language\. Standard pre\-trained LLMs may struggle to interpret or process this specialized vocabulary accurately, leading to potential miscommunication or inefficiencies\. To overcome this, our approach incorporates a fine\-tuning process in which the LLM is trained on organizational datasets\. By doing so, the model gains a more profound understanding of industry\-specific terminology, ensuring that it can effectively navigate the nuances of the organization’s language\.

Another major challenge involves mitigating confabulation, instances where LLMs generate inaccurate or fabricated outputs\. Given the potential risks of misinformation, we adopt a human\-in\-the\-loop approach to enhance reliability\. In this framework, specialized officers actively review, validate, and refine the LLM’s results before they are applied in decision\-making\. This oversight ensures that the output generated is not only accurate, but also contextually relevant and actionable, thereby strengthening trust in the recommendations of the AI system\. Finally, our methodology uses the efficiencies of LLMs to enhance process improvements, particularly in the area of feedback management\. Traditional methods of categorizing and analyzing feedback can be time\-consuming and resource\-intensive\. By integrating AI\-driven automation, we streamline the feedback categorization process and facilitate trend detection with greater speed and precision\. This shift reduces the manual workload, accelerates the generation of insights, and ultimately improves decision making throughout the organization\. Through these strategies, our approach ensures that LLMs are not only effectively integrated into operational workflows but also optimized to meet the specific needs and challenges of the organization\.

The paper is methodically structured to explore and validate the proposed multilingual LLM\-based model to analyze service feedback in the Tax Service111The term “Tax Service” is used to anonymize the name of the tax authority and its country of origin\.\. It begins with an introduction and a comprehensive review of the literature, covering existing customer service improvement systems, the background on topic modeling, and identifying current research gaps\. The methodology section details the processes of data collection and preprocessing, the general framework of the proposed system, and the specific techniques used for topic modeling and trend analysis\. The Results section presents an in\-depth analysis, including text visualizations, findings from topic modeling, model evaluation, and trend detection\. Finally, the article concludes with a discussion of the key findings, acknowledges the limitations of the study, and suggests directions for future research to improve the scalability and adaptability of the model\.

## 2Literature Review

### 2\.1Background on Topic Modeling

As we use topic modeling in our model as part of automated service feedback analysis and trend detection, our aim is to provide a literature review on well\-known topic modeling techniques, including LDA\-like methods\[[61](https://arxiv.org/html/2606.26595#bib.bib77)\], BERT\-based models and LLM\-based frameworks\[[28](https://arxiv.org/html/2606.26595#bib.bib51)\]\. Topic modeling has experienced significant advances over the years, incorporating diverse methodologies to address evolving analytical challenges\. Topic modeling has experienced significant advances over the years, incorporating diverse methodologies to address evolving analytical challenges\. As reviewed by Murshed et al\.\[[37](https://arxiv.org/html/2606.26595#bib.bib18)\]foundational approaches such as Latent Dirichlet Allocation \(LDA\), and Latent Semantic Analysis \(LSA\) have been pivotal in text analysis\. LLMs have progressively augmented these traditional techniques to tackle challenges such as short text processing, event detection, and sentiment analysis, reflecting their growing relevance in contemporary applications\.

Recent progress in topic modeling has also seen the integration of transformer\-based models such as BERT, which enhances topic coherence and interpretability\. Li et al\. introduced a hybrid model that combines BERT with probabilistic topic modeling techniques to refine topic representations by using semantic word embeddings and multimodal supervision through labels and visual features\[[28](https://arxiv.org/html/2606.26595#bib.bib51)\]\. Rogers et al\. provided an exhaustive review of the BERT architecture, underscoring its adaptability and effectiveness in various NLP tasks\[[43](https://arxiv.org/html/2606.26595#bib.bib93)\]\. Mishra et al\. demonstrated the efficacy of BERTopic in capturing evolving themes within computational economics, further highlighting the superiority of modern methods over traditional techniques\[[34](https://arxiv.org/html/2606.26595#bib.bib94)\]\. Recent advances in the application of LLMs across diverse fields offered valuable insight into improving topic\-modeling methodologies\. Zhao et al\. investigated the integration of LLMs into complex design tasks, addressing challenges such as fine\-tuning and adapting models to specific real\-world contexts, which were critical to effective topic modeling\[[59](https://arxiv.org/html/2606.26595#bib.bib25)\]\. Sufi et al\. focused on abstractive summarization and examined how LLM addressed semantic inconsistencies, offering implications for improving coherence and accuracy in topic detection\[[51](https://arxiv.org/html/2606.26595#bib.bib135)\]\. Applications beyond traditional domains also provided transferable insights\. Tzelves et al\. highlighted the role of LLM in surgical innovation, emphasizing methodological refinements that could be applied to topical modeling frameworks\[[53](https://arxiv.org/html/2606.26595#bib.bib28)\]\. Similarly, educational applications of LLM focus on improving semantic precision and contextual relevance crucial for interpreting complex feedback in educational settings, enhancing both understanding and participation in topic modeling\[[15](https://arxiv.org/html/2606.26595#bib.bib124)\]\. Together, these studies underscored the versatility and evolving capabilities of LLMs, demonstrating their transformative potential to refine topic modeling techniques in various domains\.

Identified Gaps Emerging from the Literature: Existing topic modeling studies often focus on broad topic extraction rather than expert\-defined themes, limiting their applicability in domains requiring precise categorization\. This misalignment hinders integration with operational processes\. Moreover, many approaches rely on generic LLMs without fine\-tuning them for domain\-specific feedback, reducing their effectiveness in interpreting nuanced comments\. Our research addresses these gaps by exploring the benefits of fine\-tuning LLMs on service manuals and applying quantization to enhance deployment efficiency in resource\-constrained environments\. These advancements support a transition from statistical to deep learning approaches and inform our development of a human\-in\-the\-loop pipeline for accurate, fair, and scalable multilingual topic categorization in public service feedback\.

### 2\.2Leveraging Machine Learning for Customer Feedback and Service Optimization

As part of our methodology for analyzing taxpayer feedback at the Tax Service, we draw from the broader literature on machine learning and natural language processing applications in customer feedback analysis\.

ML has emerged as a transformative tool for improving customer service\[[40](https://arxiv.org/html/2606.26595#bib.bib52)\], predicting customer behaviors\[[44](https://arxiv.org/html/2606.26595#bib.bib67)\], and leveraging customer feedback\[[39](https://arxiv.org/html/2606.26595#bib.bib75)\]in various industries\. Recent studies demonstrate the growing importance of ML models over traditional statistical methods in addressing customer\-centric problems such as customer satisfaction, return behavior, and targeted marketing\[[8](https://arxiv.org/html/2606.26595#bib.bib65)\]\. Yi and Liu \(2020\)\[[56](https://arxiv.org/html/2606.26595#bib.bib98)\]applied ML algorithms for customer sentiment analysis to recommend products and stores based on customer reviews\. The authors demonstrated that ML techniques significantly outperform existing approaches, achieving high accuracy \(98%\) in predicting product recommendations\.Hwang et al\. \(2020\)\[[23](https://arxiv.org/html/2606.26595#bib.bib101)\]extended this line of inquiry by predicting customer return visits in airline services through ML classifiers\. Their results achieved an accuracy of 83\.42%, highlighting the role of sentimental features in improving predictive performance\. The study also identified that higher word counts in feedback enhanced prediction accuracy, showcasing the importance of review content quality in analyzing customer loyalty, and highlighting the growing reliance on ML to automate feedback processing\. In addition, while ML models are widely used, studies have compared their performance with traditional statistical methods\[[31](https://arxiv.org/html/2606.26595#bib.bib76)\]\.

However, ML models face challenges when applied to real\-world data\. Simester et al\. \(2020\)\[[49](https://arxiv.org/html/2606.26595#bib.bib100)\]examined the robustness of ML methods based on model\-driven and classification in customer targeting\. Although model\-driven methods excelled under ideal conditions, their performance diminished under data challenges, such as covariate shift and imbalanced data\. The classification methods performed poorly, prompting the need for further optimization in the handling of degraded datasets\. These findings stress the importance of data quality in ML applications for customer service\. In contrast, Zaghloul et al\. \(2024\)\[[57](https://arxiv.org/html/2606.26595#bib.bib99)\]demonstrated that traditional ML models, such as Random Forest, outperformed deep learning approaches in predicting customer satisfaction in e\-Commerce, achieving 92% accuracy\. The study identified critical satisfaction drivers, such as delivery time and order accuracy, and emphasized the practical value of simpler, interpretable ML methods over complex deep learning models\. The integration of ML with optimization\-based models has shown substantial success in enhancing customer outcomes\. Feldman et al\. \(2022\)\[[12](https://arxiv.org/html/2606.26595#bib.bib103)\]compared a multinomial logit \(MNL\)\-based model with ML algorithms to determine optimal product displays on Alibaba’s marketplaces\. The MNL\-based approach outperformed the Alibaba ML model, generating a 28% revenue increase and highlighting the potential of combining choice models with ML features for revenue optimization\.

NLP has become a crucial tool to improve customer service by automating feedback analysis, query resolution, and sentiment evaluation\. Recent research highlights diverse applications of NLP techniques, offering insights into improving customer satisfaction, loyalty, and decision\-making across various domains\.

Focusing on sentiment analysis \(SA\) in underserved languages, Islam et al\.\[[24](https://arxiv.org/html/2606.26595#bib.bib112)\]analyzed YouTube comments to explore public opinion about the war\. They used a sentiment analysis tool along with an unsupervised BERT model to uncover key topics associated with the war\. Nair et al\.\[[38](https://arxiv.org/html/2606.26595#bib.bib131)\]proposed an NLP\-driven approach to improve chatbot\-based customer service by enabling natural and human\-like interactions\. Their model leverages sentiment analysis, entity extraction, and intent detection to improve response accuracy and customer satisfaction\. Despite scalability and efficiency, challenges such as ambiguous language and the need for large training datasets persist\. This aligns with our work in that both approaches aim to improve automated service interactions using NLP techniques\.

For social network applications, sentiment analysis plays a key role in opinion mining\.\[[26](https://arxiv.org/html/2606.26595#bib.bib130)\]introduced a hybrid model that integrates ResNeXt with a recurrent neural framework to improve multiclass classification\. This method improves accuracy by removing noise through unsupervised processing and minimizing annotation efforts compared to traditional techniques\. The model was tested on Amazon and Twitter datasets\. To improve customer loyalty, Tarnowska and Ras \(2023\)\[[52](https://arxiv.org/html/2606.26595#bib.bib108)\]developed CLIRS2, an NLP\-powered recommender system to extract actionable insights from unstructured text data\.

Shahin et al\. \(2024\)\[[47](https://arxiv.org/html/2606.26595#bib.bib109)\]showcased GPT\-3\.5 Turbo’s effectiveness in extracting nuanced multilingual insights for Voice of Customer \(VoC\) analysis, integrating NLP with Lean Six Sigma 4\.0 to support real\-time service strategies\. Shu et al\. \(2024\)\[[48](https://arxiv.org/html/2606.26595#bib.bib110)\]applied NLP to analyze social media feedback on electric vehicles, identifying key perceived risks—such as Performance and Time Risk—with over 40% negative sentiment in all categories\. Huang et al\. \(2022\)\[[22](https://arxiv.org/html/2606.26595#bib.bib134)\]highlighted the potential of NLP\-powered chatbots in improving communication and engagement, particularly within customer service and language learning applications\.

Recent advances in machine learning \(ML\) and natural language processing \(NLP\) have transformed customer feedback analysis, enabling organizations to extract actionable insights for personalized service delivery\. Yang et al\. \(2023\)\[[55](https://arxiv.org/html/2606.26595#bib.bib113)\]demonstrated how Siamese TextCNN and attention mechanisms help to analyze customer sentiments to increase engagement\. Bauer et al\. \(2023\)\[[4](https://arxiv.org/html/2606.26595#bib.bib114)\]stressed the importance of explainable AI, showing that feature\-based explanations improve user understanding but also risk biases and decision\-making inconsistencies\.

Despite extensive research on customer feedback in commercial sectors, there remains a notable gap in understanding and modeling service feedback within public tax administrations\. This is especially relevant for organizations like the Canada Revenue Agency, which operate in complex, multilingual, and policy\-driven environments\. The existing literature rarely addresses the specific nature of taxpayer feedback or the administrative challenges faced by public sector institutions, making this a critical and underexplored area\. Our research directly responds to this gap by developing a feedback analysis system tailored to the operational context of the Tax Service and of tax services in general\.

In addition, several critical gaps remain, underscoring the need for our proposed solution\. One of the primary limitations of existing research is the insufficient focus on multilingual feedback analysis\. Although numerous studies have used NLP and machine learning techniques, they often do not address the complexities involved in processing multilingual feedback\. Current approaches struggle with the simultaneous analysis of feedback in multiple languages, limiting their applicability in diverse linguistic environments\. Another limitation in current research is the absence of automated trend detection and analysis within feedback systems\. Although there are various methods for topic and sentiment analysis, they rarely incorporate automated mechanisms to identify emerging, persistent, and disappearing trends\. An end\-to\-end system capable of categorizing feedback while simultaneously tracking trends over time is still lacking in the literature\. Lastly, the integration of domain experts’ experience with LLM\-generated output remains underexplored, particularly in multilingual feedback analysis within public sector applications\. Although some studies emphasize the value of combining expert judgment with machine learning models this approach has not been extensively applied in public service feedback analysis\. Our research addresses this gap by evaluating topic categorization using two complementary methods: a similarity score and a statistical comparison of human versus model\-assigned scores\. This dual validation approach ensures a more rigorous assessment of system performance and enhances the reliability of AI\-assisted feedback analysis\.

## 3Methodology

Our methodology includes human oversight during both data de\-identification and model validation, ensuring fairness, trust, and accountability in line with ethical AI standards\.

### 3\.1Data Collection & Preprocessing

The dataset for this project was sourced from multiple decentralized systems within the Tax Service, all of which were approved for use from both privacy and legal standpoints\. Although this presented substantial challenges in terms of data integration and consistency, rigorous care was taken to uphold data protection standards\. Beyond these compliance considerations, it is essential to emphasize that organizations like the Tax Service must also systematically understand service feedback to drive continuous quality improvement and enhance the responsiveness of public service delivery\. The collected dataset comprises both structured numerical features, such as categorical data and demographic attributes, and unstructured free\-form text entries, mainly consisting of taxpayer feedback\. This diverse mix of data types required customized preprocessing strategies to maximize the utility of each feature for analysis\.

#### 3\.1\.1Text Data

As illustrated in Figure[1](https://arxiv.org/html/2606.26595#S3.F1), preprocessing steps were necessary to prepare the raw taxpayer feedback for analysis\. A crucial component of this process was to ensure data security and privacy compliance, particularly in the handling of sensitive information\. Due to the sensitive nature of taxpayer feedback, which frequently contains personally identifiable information \(PII\), the first and foremost step was a thorough de\-identification procedure\. This was essential not only to comply with stringent data protection regulations, but also to facilitate secure data transfer to computational resources with greater processing power \(e\.g\., enhanced GPU capabilities\) for further analysis\. The de\-identification process was comprehensive, involving several key actions: detecting and masking personal information such as Social Insurance Numbers and phone numbers; anonymizing individual names to prevent identity disclosure; masking or removing references to specific organizations and locations while retaining generalized geographic data \(e\.g\., provinces\); altering structured data like agent numbers and specific monetary amounts to prevent reidentification risks; and identifying and masking email addresses, web links, and other personal identifiers\. In addition, certain numerical data were masked, although information such as percentages was retained to ensure the integrity of the statistical analysis\. This rigorous process balanced the need for privacy with the preservation of data’s analytical value\. This process was also conducted to ensure that the model processing the feedback is less prone to bias, as no mention of a taxpayer’s PII will be in the feedback that is fed to the LLM\.

![Refer to caption](https://arxiv.org/html/2606.26595v1/pre.jpg)Figure 1:Overview of Text Preprocessing Pipeline: De\-identification, Named Entity Recognition \(NER\), and Custom Functions for Ensuring PrivacyWe employed two approaches and tools, specifically utilizing the following methods for de\-identification purposes\.

Advanced Entity Recognition Using Transformers:To de\-identify sensitive personal information within taxpayer feedback, we employed a methodology based on Transformer models, leveraging the powerful capabilities of BERT\. Specifically, we utilized theXLM\-Roberta\-Large\-Finetuned\-Conll03\-Englishmodel, a variant of RoBERTa \(Robustly Optimized BERT Pretraining Approach\) that has been fine\-tuned over the CoNLL\-03 dataset for Named Entity Recognition \(NER\)\. This model is particularly well suited for multilingual NER tasks, making it highly effective in processing both English and French texts, an essential capability given the bilingual nature of the feedback\. However, recognizing that NER models alone may not capture all forms of personally identifiable information \(PII\), especially in noisy, informal, or domain\-specific text, we complemented the transformer\-based approach with additional rule\-based pattern matching techniques\. These included customized regular expressions to detect structured entities such as Social Insurance Number \(SIN\) formats, postal codes, and monetary values\. Finally, we performed an expert review of the flagged examples to ensure a robust removal of sensitive information while preserving the contextual integrity of the feedback\.

Creating Functions for Specific Name Structures:In situations where our standard model did not fully obscure private information or when we required a more tailored approach to anonymization, we developed a custom function to enhance privacy protection based on the organization’s specific needs\. This function relies primarily on regular expressions to identify and anonymize patterns associated with financial figures, phone numbers, specific year formats, addresses, cities of employment, websites, email addresses, and certain personal names that follow identifiable patterns, along with various other personally identifiable information \(PII\)\. Although the function is designed to mask sensitive details such as monetary amounts and years, it selectively retains nonsensitive data such as monetary percentages and month names to align with specific data retention requirements\. Additionally, it effectively anonymizes names that may not be easily recognized by conventional models, as well as city names, while allowing country names to remain visible\. This customized approach is usually necessary and industry\-dependent\. It ensures a more precise and adaptable de\-identification process, meeting stringent data privacy requirements while preserving essential contextual information\.

#### 3\.1\.2Numeric Data

Service Quality Elements:Each feedback entry is assigned at least one Service Quality Element topic, SQE, categorizing the feedback into specific service quality dimensions\. This categorization has been performed manually\. Tax Service Feedback Program Officers\. The various SQE topics are illustrated in Figure[2](https://arxiv.org/html/2606.26595#S3.F2)\.

Access toescalationAccessibilityAvailabilityClarityCompletenessConsistencyConvenienceFindabilityInformationAccuracyInformationFormatOfficiallanguagesProfessionalismTimelinessFigure 2:The 13 Service Quality Elements in which each feedback is classified into\. As opposed to the current Tax Service process, each piece of feedback can be associated with more than one SQE\.Demographic Features:For each feedback category, we also capture demographic information, including ’Gender’, ’Preferred Language’ and ’Age\.’ These are the demographic features for which we will track emerging topics within the feedback texts\.

### 3\.2General framework

This study introduces a tool to analyze taxpayer feedback by classifying it into Service Quality Elements \(SQEs\) and detecting topic trends across demographic groups\. Figure[3](https://arxiv.org/html/2606.26595#S3.F3)outlines the overall methodology, while Sections[3\.3](https://arxiv.org/html/2606.26595#S3.SS3)and[3\.4](https://arxiv.org/html/2606.26595#S3.SS4)detail technical components such as model tuning and trend analysis\.

1. 1\.Preprocessing:Raw feedback is de\-identified to protect privacy and eliminate bias linked to personal identifiers, supporting ethical AI practices\.
2. 2\.Text Processing via LLM Classifiers:A fine\-tuned and quantized LLM classifies feedback into SQEs with high accuracy and efficiency, capturing nuanced content in both English and French\.
3. 3\.Agent Validation:Tax Service representatives validate model outputs, enhancing accountability and transparency, and helping refine the model for greater alignment with human judgment\.
4. 4\.Trend Detection & Pattern Analysis:Regression models identify whether SQE\-related topics are emerging, persistent, or fading across demographics, revealing potential disparities and improving explainability\.
5. 5\.Ethical Standard:The system emphasizes fairness, transparency, and human oversight\. It detects unequal topic trends across groups using demographic stratification and regression\. Explainability is ensured through justifications for each SQE assignment, enabling trust and accountability in public service applications\. To reinforce accountability, expert oversight remains integral to the process, with experts retaining final responsibility for decisions—ensuring that a person, not just an algorithm, is accountable for the model’s outcomes\. Transparency is equally critical; users are clearly informed about the AI’s role in processing and analyzing feedback, enabling informed engagement with system recommendations\. Explainability, as defined in the ethical AI literature, refers to the ability of external users—such as stakeholders, or auditors—to understand and trust the model’s output\. This contrasts with interpretability, which typically emphasizes internal teams understanding the model’s architecture or parameters\[[17](https://arxiv.org/html/2606.26595#bib.bib136)\]\. In our system, explainability is enhanced by generating explicit justifications for each topic and relevance score, allowing stakeholders to trace how feedback is categorized\. We align our framework with responsible AI design principles that emphasize fairness, transparency, and human agency, particularly in public sector applications\[[36](https://arxiv.org/html/2606.26595#bib.bib137)\]\.

![Refer to caption](https://arxiv.org/html/2606.26595v1/general.jpg)Figure 3:Flowchart showing the process of analyzing de\-identified text feedback to detect trends across demographics\. Steps include de\-identification, LLM classification, expert validation, pattern detection, and trend analysis\. This process adheres to Ethically aligned AI principles, ensuring fairness, accountability, transparency, and explainability\.
### 3\.3Topic Modeling

The methodology aims to categorize taxpayer feedback into 13 predefined elements of service quality \(Figure[2](https://arxiv.org/html/2606.26595#S3.F2)\)\.To classify feedback into 13 categories, we moved beyond traditional models like BERT and LDA because they often produced overlapping and indistinct topics, making it difficult to clearly assign feedback to specific service elements\. Instead, we adopted a refined LLM\-based approach, evaluating both prompt engineering and fine\-tuning methods on small transformer\-decoder models \(5B–13B\), selected based on organizational constraints and deployment timing\.

#### 3\.3\.1Prompt Engineering

For prompt engineering, we use Zephyr\-7B\-β\\beta\[[10](https://arxiv.org/html/2606.26595#bib.bib127)\]\. Zephyr is a 7 billion parameter language model, similar in architecture to GPT models, and fine\-tuned for natural language processing tasks\. It is based on Mistral\-7B\-v0\.1\[[3](https://arxiv.org/html/2606.26595#bib.bib128)\], an efficient and powerful Transformer\-based model known for its strong language understanding capabilities\. Zephyr has been specifically fine\-tuned on a diverse mix of publicly available and synthetic datasets, enabling it to generate high\-quality text while maintaining broad generalizability across various applications\. One of Zephyr’s key advantages is its MIT license, which provides flexibility for both research and commercial use\. This makes it an attractive option for developers and organizations seeking to integrate an advanced NLP model into their workflows without restrictive licensing constraints\.

#### 3\.3\.2Fine\-Tuned LLM

Mistral\-7B\-Instruct\-v0\.2\[[2](https://arxiv.org/html/2606.26595#bib.bib129)\]is an updated version with a greatly expanded context window size, from 8,000 to 32,000 tokens, which enhances its ability to handle longer text sequences\. This model has moved away from the sliding\-window attention mechanism to a more efficient design, which reduces computational costs and increases processing speed\. Importantly, the Mistral model has been trained multilingually, making it capable of understanding and processing inputs in both English and French, an essential feature for analyzing Canadian taxpayer feedback\. We chose to fine\-tune this model to compare its performance against the prompt engineering approach\. Our goal is to adapt the model to better handle the task using its enhanced architecture, which includes a larger context window and optimized attention mechanisms\. These improvements make it particularly well suited for analyzing complex and lengthy texts, ensuring more accurate and context\-aware processing\. By refining it with authority\-specific data and expert input, we improve its ability to detect subtle patterns and specific nuances in taxpayer feedback, which generic models might overlook\. This customization not only ensures a more in\-depth understanding of the diverse feedback but also aligns the model more closely with operational requirements and standards, thus improving the accuracy and efficiency of our service quality evaluations\.

#### 3\.3\.3Quantize fine\-tuned model using GPTQ:

After fine\-tuning the Mistral\-7B\-Instruct\-v0\.2 we decided to quantize it using Gradient\-Preserving Quantization \(GPTQ\)\[[13](https://arxiv.org/html/2606.26595#bib.bib132)\]\. Gradient\-Preserving Quantization \(GPTQ\) is a technique designed to reduce the computational complexity of large neural networks without significantly affecting their performance\. By quantizing the weights of the model to lower precision, such as reducing 32\-bit floating point numbers to 16\-bit or 8\-bit, GPTQ helps decrease the model’s memory footprint and speeds up processing\. Unlike traditional quantization methods that can lead to a loss in precision, GPTQ maintains the effectiveness of the model by carefully adjusting weight reductions in a way that minimizes the impact on gradient flows during training\[[14](https://arxiv.org/html/2606.26595#bib.bib59),[50](https://arxiv.org/html/2606.26595#bib.bib60)\]\. This makes GPTQ particularly useful for deploying complex models on devices with limited computational resources, maintaining a balance between efficiency and performance\.

This approach was chosen to enhance the model’s resource efficiency without significantly affecting its performance\. Quantization minimizes the computational load of the model by reducing the precision of the data it processes, which decreases memory requirements and speeds up response times\. Implementing GPTQ ensures that these resource reductions do not undermine the model’s ability to recognize subtle feedback differences, which is critical for the detailed analysis necessary in our work\. This strategy enables Mistral\-7B\-Instruct\-v0\.2 to operate more efficiently on the organization’s existing infrastructure, facilitating a more sustainable and cost\-effective implementation while maintaining robust performance\. The overall workflow, from preprocessing to topic modeling, is depicted in Figure[4](https://arxiv.org/html/2606.26595#S3.F4)\.

Preprocessed TextMistral FinetuneGPTQ QuantizationTopic CategoriesFigure 4:Flowchart illustrating the processing of the preprocessed text through the model, which is then fine\-tuned and subsequently quantized using GPTQ before being utilized for Topic Modeling purposes\.

### 3\.4Trend Detection Analysis

After extracting topics and digitizing relevance scores, the methodology progresses to its final stage: analyzing emerging trends using logistic regression\. This approach allows for a more in\-depth understanding of how specific topics have evolved across feedback received while considering key demographic variables\. The analysis focuses on categorical factors that were selected as the most critical to the organization’s operation: age group, gender, and preferred language to uncover meaningful patterns within the feedback\. Through this method, topics are classified into three different trend categories based on their prominence and evolution\.Emerging trendsrepresent new topics that have recently surfaced in the feedback data, indicating evolving concerns\.Persistent trendshighlight topics that remain consistently significant in both time periods, demonstrating their ongoing relevance\. In contrast,disappearing trendsreflect topics that were initially prominent but have shown a decrease in frequency, suggesting a reduced level of concern or interest\.

![Refer to caption](https://arxiv.org/html/2606.26595v1/trend.jpg)Figure 5:Workflow of trend analysis across two time periods\. The figure depicts the process of data splitting, feature extraction, logistic regression modeling, coefficient comparison, and the classification of trends into emerging, persistent, or disappearing categories\.#### 3\.4\.1Timeframes for the Trend Analysis Process

Our trend analysis begins with establishing a split date to divide the dataset into two distinct segments\. Given that the feedback spans an entire year, we employ two different splitting strategies to examine the evolving dynamics of taxpayer concerns\. These approaches help uncover significant shifts in feedback trends while balancing sensitivity to short\-term changes and broader patterns over time\.

##### Entire Year Versus New Quarter

An approach involves using the entire year as a baseline period and comparing it with the most recent quarter\. This method allows us to detect notable changes in feedback themes, revealing whether new concerns have emerged or if existing ones have gained or decreased in prominence versus the operation during the previous year\. By analyzing a full year of data as context, we ensure that trends are assessed comprehensively\. However, this approach may obscure subtler or emerging signals, as comparing a short recent period against a long baseline can dilute the visibility of weaker patterns that are only beginning to manifest\.

##### Semester Versus Last Semester

Another approach divides the year into two equal periods of six months\. This split facilitates a direct comparison between the first and second halves of the year, enabling a more precise analysis of changes in feedback patterns\. However, this method may introduce additional noise, as tax declarations and related interactions may vary across different times of the year due to seasonal effects\. An alternative strategy, if more data were available, would be to compare equivalent semesters over consecutive years\. This would help determine whether issues from the last semester persist into the following year, offering a more reliable way to assess long\-term trends while mitigating seasonal distortions\.

By applying these trend analysis techniques, we gain a clearer understanding of how feedback evolves, ensuring that emerging patterns are detected with both short\-term responsiveness and long\-term stability\.

#### 3\.4\.2Logistic Regression Model

The deidentified and privacy\-compliant dataset is then prepared for logistic regression analysis\.

Two sets of logistic regression models are trained separately on feedback from earlier and later periods to assess how demographic influences on topic relevance change over time\. For each SQE topic, the dependent variableYi∈\{0,1\}Y\_\{i\}\\in\\\{0,1\\\}indicates whether the topic was present in feedbackii\. Demographic featuresXi​jX\_\{ij\}serve as predictors in estimating the coefficientsβj\\beta\_\{j\}, representing their effect on topic occurrence:

log⁡\(P​\(Yi=1\)1−P​\(Yi=1\)\)=β0\+∑j=1kβj​Xi​j\\log\\left\(\\frac\{P\(Y\_\{i\}=1\)\}\{1\-P\(Y\_\{i\}=1\)\}\\right\)=\\beta\_\{0\}\+\\sum\_\{j=1\}^\{k\}\\beta\_\{j\}X\_\{ij\}\(1\)
To detect evolving patterns, coefficients from the two time periods are compared:

Δ​βj=βj\(2\)−βj\(1\)\\Delta\\beta\_\{j\}=\\beta\_\{j\}^\{\(2\)\}\-\\beta\_\{j\}^\{\(1\)\}
This difference captures the change in influence of each demographic feature over time, helping identify emerging, persistent, or fading topic trends\.

#### 3\.4\.3Trend Detection

To assess changes in demographic influence on topic trends, we apply a bootstrapping method by resampling the dataset thousands of times and recalculating topic coefficients\. This process yields distributions of coefficient estimates, from which we compute mean values and 95% confidence intervals\. A topic is consideredemergingif a feature becomes statistically significant in the current period,disappearingif it loses significance, andpersistentif it remains significant across both periods\. Features that are insignificant in both timeframes are excluded\. This method enables a robust, statistically grounded analysis of evolving feedback patterns\.

## 4Result

### 4\.1Text visualization

We have collected a dataset consisting of 6,515 feedback records in English and 1,646 records in French\. To gain a better understanding of the patterns, themes, and key topics present in both languages, we employed a unigram and bigram\[[16](https://arxiv.org/html/2606.26595#bib.bib126)\]approach for text analysis\. Unigrams refer to individual words in the feedback, and analyzing them helps us capture the most frequent words or terms that occur across all records\. Bigrams, on the other hand, consider pairs of consecutive words, allowing us to capture common phrases or word associations that convey more context or meaning than single words\. In Figure[6](https://arxiv.org/html/2606.26595#S4.F6), we observe four graphs \(labeled a, b, c, and d\) that illustrate the unigram \(graphs a and c\) and bigram \(plots b and d\) analyses from the English and French feedback datasets\. Plots \(a\) and \(c\) show unigram frequencies with the most common words that appear on the text\. For instance, terms like ‘tax’ and ‘account’ are dominant in the English feedback, while words like ‘*jai*’, ‘*demande*’ and ‘*dossier*’ are prominent in the French feedback\. Plots \(b\) and \(d\) show the most frequent bigrams, revealing common\-word pairs\. In English, phrases such as “tax return” and “Tax Service account” reflect key user concerns, whereas in French, expressions like “jai fait” suggest frequent references to receiving documents or services\. This variation between the English and French datasets may suggest a potential bias in the concerns and feedback provided by English\-speaking and French\-speaking taxpayers\. This raises the need for further investigation in subsequent sections to explore whether differences in feedback reflect differing experiences or expectations\.

![Refer to caption](https://arxiv.org/html/2606.26595v1/x1.png)\(a\)Plot 1: Top 10 Most Frequent Unigrams in English Feedback
![Refer to caption](https://arxiv.org/html/2606.26595v1/x2.png)\(b\)Plot 2: Top 10 Most Frequent Bigrams in English Feedback
![Refer to caption](https://arxiv.org/html/2606.26595v1/x3.png)\(c\)Plot 3: Top 10 Most Frequent Unigrams in French Feedback
![Refer to caption](https://arxiv.org/html/2606.26595v1/x4.png)\(d\)Plot 4: Top 10 Most Frequent Bigrams in French Feedback

Figure 6:Unigram and Bigram Frequency Analysis of English and French Feedback
### 4\.2Topic Modeling Result

As discussed in the methodology section, one step in the research process involves assigning topics \(Figure[2](https://arxiv.org/html/2606.26595#S3.F2)\) to the text\. For this, we applied two approaches: prompt engineering using the Zephyr\-7B\-beta model and a quantized, fine\-tuned LLM based on Mistral\-7B\-Instruct\-v0\.2\. The following sections detail the implementation and results of these methods\.

To effectively leverage a pre\-trained model, we designed specific prompts in both English and French to accommodate the bilingual nature of the feedback\. We used the Zephyr\-7B\-beta model, instructing it to categorize feedback based on key topics \(SQEs\) using relevance scores derived from its responses\. For each feedback entry, we structured an instruction set that combined an initial “system” directive with the user\-provided content \(i\.e\., the actual feedback\)\. This was formatted in a conversational manner to improve the model’s comprehension and response accuracy\. The model was configured with a token limit of 2048 and parameter settings designed to balance response diversity while maintaining relevance\. After generating responses, we post\-processed the output to extract only the relevant sections, compiling a categorized collection of feedback for further analysis\.

#### 4\.2\.1LLM: Fine\-Tuned & Quantized for Efficiency

To enhance the precision of topic modeling in taxpayer feedback, we fine\-tuned the Mistral model specifically for the Tax Service’s context\. For this purpose, initially, a detailed prompt is fed into the model along with the feedback, outlining the SQEs and instructing the LLM to assign relevance scores and provide justification for each category\. The LLM then processes this information to output a relevance score ranging from 1 to 5 for each SQE, accompanied by a word or sentence from the taxpayer feedback that justifies the score, thereby establishing a direct correlation between the feedback and specific service elements\. Subsequently, the output text containing the categorized feedback and relevance scores is parsed and structured into a data frame for further analysis\. This process is repeated for each piece of feedback in the dataset, accumulating a comprehensive output that categorically assigns relevance scores to the feedback according to the predefined SQEs\. In extending our methodology to embrace Canada’s bilingual landscape, the same approach is applied to French taxpayer feedback, utilizing a corresponding French prompt\.The flow of the process is visualized in Figure[7](https://arxiv.org/html/2606.26595#S4.F7), which divides the methodology into three primary phases: Tokenization, Fine\-tuning, and Digitalization\.

![Refer to caption](https://arxiv.org/html/2606.26595v1/topic.jpg)Figure 7:Three\-phase methodology for taxpayer feedback categorization: Tokenization prepares text input for processing; Fine\-tuning adapts the model to Tax Service\-specific SQEs; Digitalization organizes generated outputs into structured data for trend analysis and insights extraction\.
#### Phase 1: Tokenization

This study employed the Mistral\-7B\-Instruct\-v0\.2 model, utilizing Hugging Face’sAutoTokenizerto convert textual input into tokens for model processing\. Padding was applied to the right\-hand side of sequences to standardize input lengths for batch processing\. The end\-of\-sequence \(EOS\) token served as the padding token to prevent the model from learning from padded content\. Additionally, special tokens—begin\-of\-sequence \(BOS\) and EOS—were added to mark sequence boundaries, aiding the model in identifying input structure during training\.

#### Phase 2: Fine\-Tuning

The fine\-tuning of the Mistral model involved several key steps aimed at optimizing both performance and resource efficiency\. To achieve this, the Low\-Rank Adaptation \(LoRA\) technique was used in conjunction with parameter\-efficient fine\-tuning methods\. These configurations were designed to strike a balance between maintaining model accuracy and reducing computational demands, particularly by applying k\-bit training to the base model and selectively fine\-tuning only certain parameters\.

Parameter\-Efficient Fine\-Tuning \(PEFT\) was adopted to minimize computational complexity and memory usage\. By fine\-tuning only a subset of the model’s layers, the LoRA method significantly reduces training overhead while preserving task\-specific performance\. This allows the model to efficiently adapt to new tasks without requiring full model retraining\. As illustrated in Figure[7](https://arxiv.org/html/2606.26595#S4.F7), the middle section focuses on fine\-tuning\. The fine\-tuning process consists of several key phases\. The Model Preparation phase involved configuring the base model for k\-bit training, reducing both memory usage and computational overhead\.

For parameter setup, we used a numerical precision format known as nf4 \(Normal Float 4\), which represents a 4\-bit floating point format optimized for neural network training\. Unlike standard quantization formats, nf4 preserves more dynamic range and numerical stability, allowing efficient training while minimizing information loss\[[9](https://arxiv.org/html/2606.26595#bib.bib138)\]\. This format has been shown to perform well in low\-precision training scenarios, especially when combined with techniques like LoRA\. We also processed the data using a 16 bit brain floating point format \(bfloat16\) to process the activations and gradients during training\. This 16\-bit format strikes a balance between computational efficiency and numerical accuracy, especially in mixed\-precision training setups\. This hybrid use of nf4 and bfloat16 allows us to reduce memory consumption without significantly compromising model performance\. During the Training Setup phase, the environment was optimized with critical hyperparameters, including the learning rate, batch size, and optimization strategy\. Then, after the training setup, we designed a prompt to guide the model in generating structured responses aligned with Tax Service\-specific SQEs\.

Quantization:To enhance the efficiency of our Large Language Model, we employed quantization, which reduces the model’s memory footprint and accelerates inference without substantially compromising accuracy\. Specifically, we used a 4\-bit quantization configuration utilizing GPTQ \(General Purpose Quantization\) as the method\. The bits parameter determines the bit width used for representing each weight in the neural network\. This drastically reduces the memory requirements and model size, as compared to the standard 16\-bit \(FP16\) or 32\-bit \(FP32\) floating point representations\. A 4\-bit quantization is a trade\-off between accuracy and computational efficiency; while it introduces some approximation error, the reduction in memory usage allows the model to run faster and be loaded on devices with lower memory capacity, such as consumer GPUs\.

#### Phase 3: Digitization

After extracting topics and assigning relevance scores using our Large Language Model \(LLM\), the next step is to convert these scores into a structured format\. This involves organizing the output into a dataframe for further analysis\. Since LLM generates text, we need to separate each piece of feedback into its key components: the assigned topic, the relevance score, and the reasoning behind the assignment\. At the core of this process is a function that uses regular expressions \(regex\) to identify patterns in the text\. This post\-processing is applied to all feedback, resulting in a structured dataset with distinct columns for scores, topics, and justifications\. This structured data enables further analysis, including trend detection\.

#### 4\.2\.2Model Evaluation

#### Topics Comparison

In this section, we compare the topic categorization by Tax Service experts with those identified by our LLMs, including both a prompt engineering\-based approach \(Zephyr\) and a fine\-tuned quantized model \(Mistral\)\. As outlined earlier, the organization’s experts played a critical role in validating the model output through a structured evaluation process\. This human\-in\-the\-loop step ensures that the automated classifications of the model align with the expert understanding of the Service Quality Elements and maintain the interpretability and trustworthiness of the feedback analysis system\. The distribution of the topics identified by the Tax Service experts, shown in Figure[8](https://arxiv.org/html/2606.26595#S4.F8), is heavily skewed toward the category “Timeliness”\. This distribution likely reflects the prioritization of the timely response by experts, highlighting issues they deem most urgent within their capacity to assign only one Service Quality Element per feedback item\. In contrast, Figure[9](https://arxiv.org/html/2606.26595#S4.F9), which shows the topic distribution of the prompt\-engineering model, indicates a more balanced allocation among different SQEs\. This suggests that the LLM captures a wider range of issues, offering a more diverse perspective on feedback topics\. Although timeliness remains the most frequent category, its dominance is less pronounced compared to the human\-assigned distribution, making the overall spread more comparable\. The concentration of human\-labeled data on “Timeliness” may reflect the organizational emphasis on meeting predefined service standards within the Tax Service’s Service Feedback Program, which prioritizes resolving taxpayer concerns within a specific timeframe\. As a result, the reviewers may have focused more heavily on feedback related to response time and deadlines, aligning with operational goals\. Although this focus ensures accountability in turnaround times, it may also inadvertently overshadow other, less obvious but still important service dimensions\. In contrast, the broader distribution of the LLM output captures a wider variety of topics, including subtler concerns that may not be as immediately actionable, but are nevertheless critical to improving the overall taxpayer experience\. Finally, Figure[10](https://arxiv.org/html/2606.26595#S4.F10)shows how the fine\-tuned and quantized model approaches topic categorization\. Similarly to the expert collaborator distribution, this model also emphasizes “Timeliness”, but does so in a way that maintains better balance across other service quality elements\. This approach reflects a closer alignment with human review patterns, showing only a modest difference between “Timeliness” and other categories\. By balancing attention between various elements, the fine\-tuned model ensures that while timeliness is prioritized, other essential service quality dimensions are not overlooked, ultimately fostering a comprehensive and efficient approach to improving service quality\.

![Refer to caption](https://arxiv.org/html/2606.26595v1/x5.png)Figure 8:Topic distribution across Service Quality Elements as categorized by Tax Service experts\. The distribution shows a significant emphasis on “Timeliness”, indicating the prioritization of promptly actionable issues\.![Refer to caption](https://arxiv.org/html/2606.26595v1/x6.png)Figure 9:Topic distribution across SQEs identified by the pretrained LLM \(Zephyr\)\. The results display a more balanced spread, highlighting the model’s capability to recognize a wider range of issues in the feedback data\.![Refer to caption](https://arxiv.org/html/2606.26595v1/x7.png)Figure 10:Topic distribution across SQEs identified by the fine\-tuned and quantized model \(Mistral\)\. The model closely mirrors the expert\-labeled distribution, emphasizing “Timeliness” while maintaining a balanced representation across other categories\.
#### Similarity Measurement

In this subsection, we evaluate the alignment between topics categorized by experts and those predicted by our models\. To assess similarity, we matched topics assigned by the Tax Service experts to those generated by two variations of our Large Language Models\. The similarity score is calculated based on the proportion of unique topic groups where the expert\-labeled Service Quality Element aligns with the predicted topics from the models\. Instead of simply counting exact matches across the entire dataset, the measure considers matches within each group of related entries, making it more robust and representative of the true alignment\.

S​i​m​i​l​a​r​i​t​yT​o​p​i​c=\(Matched Feedback InstancesTotal Feedback Instances\)×100Similarity\_\{Topic\}=\\left\(\\frac\{\\text\{Matched Feedback Instances\}\}\{\\text\{Total Feedback Instances\}\}\\right\)\\times 100where: Matched Feedback Instancesrepresents the number of individual feedback texts where the topic predicted by the model matches the expert labeled “Service Quality Element”\.Total Feedback Instancesis the total number of feedback texts in the dataset\.

The comparison reveals a distinct difference in performance between the two models\. The pretrained model achieves a lower similarity score of approximately 24\.27%, indicating limited alignment with expert\-labeled topics at the individual feedback level\. This lower score may reflect the broader generalization tendencies of the pre\-trained model, which capture diverse topics but do not necessarily align with the specific categorizations made by experts\. On the other hand, the fine\-tuned quantized model achieves a higher similarity score of a much higher 66\.64%\. This improvement suggests that the model, after fine\-tuning and quantization, becomes more adept at recognizing and aligning with the feedback topics deemed important by the experts\. The fine\-tuning process likely helps the model focus on the nuanced context\-specific themes and language present in the feedback texts, leading to a better alignment with expert judgment\. In summary, while the pre\-trained model offers a wider scope of topic identification, the fine\-tuned quantized model demonstrates a stronger capability to reflect the expert\-labeled categorizations at the individual feedback level, resulting in higher similarity scores and a more accurate representation of expert\-defined topics\.

#### Expert Evaluation of Model Applicability

In addition to measuring similarity between model outputs and Tax Service\-labeled SQEs, we conducted a survey with Tax Service Feedback Officers to assess the practical applicability of the model with the lowest similarity score\. This evaluation offered a complementary perspective based on expert judgment rather than strict label matching\. While similarity scores provide a quantitative benchmark, they may not fully capture the model’s real\-world utility, given the subjective nature of language and topic classification\. To address this, we implemented a human\-in\-the\-loop evaluation using bilingual feedback samples randomly selected to avoid bias\. The model assigned SQEs to each sample, and experts were asked to review the assignments, indicate agreement or disagreement, and propose alternatives where necessary\. We then conducted pairwise t\-tests on 51 SQE instances derived from 10 randomly selected feedback texts \(in both English and French\)\. This sample size satisfies the Central Limit Theorem’s requirement \(n ¿ 30\), allowing for valid inference\. The goal was to test whether differences between model and expert categorizations were statistically significant\. Results, summarized in Table[1](https://arxiv.org/html/2606.26595#S4.T1), showed no significant differences at the 95% confidence level\. This indicates that the model’s classifications are statistically indistinguishable from those of Tax Service experts\. Moreover, alignment was observed across all five reviewers, suggesting that the model effectively mirrors human scoring patterns\.

ComparisonT\-valueP\-valueSignificanceModel vs\. Officers\-1\.9680\.085No significant differenceTable 1:T\-test results comparing the model’s scores against the most divergent expert evaluatorAdditional comparisons among Service Feedback Program evaluators yielded similar results, without significant differences\. These findings support the conclusion that the LLM scoring is comparable to the expert scoring, with no statistically significant deviation in the comparisons\. The close alignment between the model and the expert evaluators suggests that the model can reliably approximate human assessment, making it a valuable tool for scaling evaluations and maintaining accuracy in feedback analysis\. This evaluation, which combines similarity measures with subjective expert judgment and statistical validation, ensures that the LLM is not only aligned with expert categorizations but also applicable and effective in practical, user\-centered contexts\.

### 4\.3Trend Detection Result

As described in the methodology, trend analysis begins with segmenting the dataset into two time periods using a flexible quantile cutoff\. A cutoff of 0\.5, for instance, splits the data into two 6\-month intervals, while other values can isolate specific events or seasonal effects\. This segmentation enables targeted analysis of temporal patterns\. Categorical feedback topics were then label\-encoded for modeling\. We trained separate multinomial logistic regression models for each period, fitting 26 binary regressions—two for each of the 13 SQE categories\. The sample size per model reflected the frequency of each topic label, as shown in Figure[10](https://arxiv.org/html/2606.26595#S4.F10)\. To detect shifts over time, we compared categorical variable coefficients across periods using 95% bootstrap confidence intervals\. Trends were classified as emerging, persistent, or disappearing based on coefficient changes\. An illustrative example \(Table[2](https://arxiv.org/html/2606.26595#S4.T2)\) shows how logistic regression identifies temporal topic dynamics while respecting data privacy\.

Table 2:Illustrative Example of Trend Detection Based on Service Quality ElementsThe table highlights several notable trends in feedback patterns over time\. One key observation is a disappearing trend in accessibility concerns among users under 19\. In the second period, reports of accessibility issues from this group decreased, suggesting either an improvement in services or a reduced perception of barriers\. Using this tool, the organization can now identify whether specific measures taken to improve the taxpayer experience have become effective\.

In contrast, emerging trends indicate that clarity issues have become more prominent among older adults \(60 and older\) as well as the middle\-aged group \(19 to 60 years\)\. This shift suggests a growing concern or a change in how these age groups perceive communication and information delivery\. A similar upward trend is observed in timeliness\-related feedback, which has increased among both male and female users, as well as those who prefer English\. This pattern may point to a broader issue with service delays or responsiveness, which requires further attention\. These insights illustrate how logistic regression modeling and coefficient comparison can reveal meaningful shifts in user feedback\. By identifying disappearing, emerging, and persistent trends, this approach supports data\-driven decision making and helps improve service delivery\.

## 5Discussion

This study introduced a multilingual human\-in\-the\-loop system that integrates fine\-tuned and quantized LLMs with statistical trend detection to categorize taxpayer feedback into service quality elements and uncover evolving concerns across demographic groups\. By combining advanced natural language processing, resource\-efficient model deployment, and expert validation, the framework not only enhances the scalability of public service feedback analysis but also promotes equity, transparency, and operational alignment in government service delivery\. The model demonstrates alignment with expert\-labeled categorizations and effectively identifies emerging, persistent, and disappearing concerns in multilingual feedback\.

### 5\.1Limitations

Despite the promising results of this study, there are some limitations that must be recognized\. First, and due to system availability, the dataset used for analysis was limited to a single year of feedback, which may not fully capture long\-term or seasonal trends\. Finally, the models used, although fine\-tuned, were constrained by the available computational resources, limiting the extent of fine\-tuning and the depth of analysis that could be achieved by using more powerful models \(36B parameter\+\)\. These constraints may affect the generalizability and scalability of the findings across different contexts or larger datasets\.

### 5\.2Future Work

Future research should focus on addressing these limitations by expanding the scope of the analysis to include more extensive multi\-year datasets\. Incorporating feedback in additional languages could enhance the inclusivity and accuracy of the system, providing a more comprehensive understanding of diverse taxpayer experiences\. Also, future work could explore the use of advanced non\-linear modeling techniques such as neural network\-based classifiers or ensemble learning approaches, which may offer improved performance in trend detection\. Integrating continuous learning mechanisms, where the model updates and refines its analysis in real\-time, could further enhance adaptability and responsiveness\. Furthermore, implementing a more interactive human\-in\-the\-loop system could provide ongoing validation and refinement of the model output, ensuring alignment with real\-world expectations and needs\. By addressing these areas, the proposed framework can evolve into a robust tool for dynamic, multilingual feedback analysis, and trend monitoring at scale\.

### 5\.3Theoretical Implications

This research contributes to the growing literature on AI\-assisted public service analysis by demonstrating how domain\-specific fine\-tuning and quantization of LLMs can be systematically applied to detect trends in structured feedback\. Unlike traditional topic modeling approaches that prioritize unsupervised or general\-purpose tasks, this study operationalizes topic modeling based on predefined expert taxonomies\. This bridges a key theoretical gap between thematic modeling and institutional interpretability\. Furthermore, the integration of stratified logistic regression with bootstrapped confidence intervals introduces a novel fairness\-aware approach for detecting demographic disparities in topic prevalence\. This supports broader theoretical efforts in explainable and equitable NLP, showing how statistical reasoning can complement LLM predictions for rigorous, demographically contextualized trend analysis\. The research also contributes to the theory of human\-AI teaming by embedding expert validation at multiple stages—preprocessing, model fine\-tuning, and output review\. This addresses confabulation risk in LLMs and responds to ongoing academic calls for ethically aligned, human\-supervised AI in sensitive domains like taxation\.

### 5\.4Practical Implications

The practical contributions of this study lie in its ability to translate advanced AI techniques into actionable tools for public service institutions\. By developing a multilingual feedback analysis system that incorporates fine\-tuned and quantized LLMs, the research offers a scalable solution for organizations to manage increasing volumes of unstructured feedback efficiently\. This approach reduces the dependency on manual review processes, allowing institutions to process and categorize service feedback with greater speed and consistency\. Importantly, the framework is designed with resource constraints in mind\. The quantization of the LLM significantly lowers the computational burden, making it feasible to deploy the system even within infrastructure\-limited environments\. This ensures that smaller agencies or departments, which may lack access to high\-end servers or GPUs, can still benefit from the capabilities of modern NLP tools\. Beyond technical efficiency, the system enhances the quality of public service by enabling real\-time identification of emerging concerns across different demographic groups\. By analyzing topic trends through demographic lenses—such as age, language, and gender—the system allows decision\-makers to detect systemic disparities that might otherwise go unnoticed\. This supports more equitable service delivery by highlighting the specific needs of underserved populations and enabling timely organizational response\. Moreover, the integration of explainable outputs and expert validation strengthens trust in the system’s recommendations\. The ability to trace the model’s rationale for each topic assignment ensures transparency and allows human reviewers to engage meaningfully with the results\. Overall, the framework provides a practical path toward responsible AI adoption in the public sector, offering both operational improvements and enhanced fairness in citizen engagement\.

## 6Conclusions

In this research, we presented a unique approach to analysis of service feedback data that integrates state\-of\-the\-art techniques for identifying feedback patterns, aimed at reducing biases and enhancing service quality\. Our methodology combines statistical modeling with natural language processing techniques, integrating state\-of\-the\-art large language models for topic modeling with traditional techniques for trend analysis, while also incorporating human\-in\-the\-loop methods that incorporate the subject matter expertise of Service Feedback officers\. This hybrid approach ensures that the system is not only effective and scalable to detect trends within multilingual feedback data, but also contextually grounded and aligned with operational realities\. We also explored how fine\-tuning and quantization of the LLM results in alignment with the specific idiosynchratic evaluation processes within the organization, while simultaneously optimizing resource usage\. Our suggestion is that whenever an LLM is considered to be deployed within an organization that has developed its own culture and language, this fine\-tuning is performed\. Otherwise, the model will most likely fail to understand these nuances and miss important details, which will translate into reduced performance\. Our results demonstrate that the fine\-tuned model, customized with Tax Service\-specific text data, was more closely aligned with expert opinion in terms of the categorization of topics\. This alignment was evident in the similarity analysis and was further supported by an evaluation survey, where the Tax Service experts ranked the models according to their output\. The fine\-tuned model showed significant improvements, capturing topics and nuances that better reflected expert assessments\. The second stage of our proposal involved the use of these nuanced outputs for trend analysis\. Our findings indicate that our methodology effectively identified emerging, persistent, and disappearing trends in diverse demographic groups, focusing on specific themes that require targeted improvements\. By employing a multiphase approach, including text tokenization, de\-identification, model fine\-tuning, and trend detection, we ensured a thorough analysis of feedback data within specific demographics\. Our framework adheres to key principles of fairness, transparency, explainability, and accountability, providing a powerful tool to improve the quality of feedback services while addressing potential biases in feedback analysis\.

## References

- \[1\]M\. Adam, M\. Wessel, and A\. Benlian\(2021\)AI\-based chatbots in customer service and their effects on user compliance\.Electronic Markets31,pp\. 427–445\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p8.1)\.
- \[2\]M\. AI\(2023\)Mistral\-7b\-instruct\-v0\.2: a high\-performance instruction\-tuned language model\.External Links:[Link](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)Cited by:[§3\.3\.2](https://arxiv.org/html/2606.26595#S3.SS3.SSS2.p1.1)\.
- \[3\]M\. AI\(2023\)Mistral\-7b\-instruct\-v0\.2: a high\-performance language model\.External Links:[Link](https://huggingface.co/mistralai/Mistral-7B-v0.2)Cited by:[§3\.3\.1](https://arxiv.org/html/2606.26595#S3.SS3.SSS1.p1.1)\.
- \[4\]K\. Bauer, M\. von Zahn, and O\. Hinz\(2023\)Expl\(ai\)ned: the impact of explainable artificial intelligence on users’ information processing\.Information Systems Research34\(4\),pp\. 1582–1602\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p8.1)\.
- \[5\]M\. A\. Camilleri\(2024\)Artificial intelligence governance: ethical considerations and implications for social responsibility\.Expert systems41\(7\),pp\. e13406\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p9.1)\.
- \[6\]N\. Chen, A\. Li, and K\. Talluri\(2021\)Reviews and self\-selection bias with operational implications\.Management Science67\(12\),pp\. 7472–7492\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p5.1)\.
- \[7\]X\. Chen, Y\. Chen, and G\. Yin\(2025\)Exploring the motivations behind behavior: a theory\-driven deep\-learning framework for cyberviolence behavior detection\.Decision Support Systems,pp\. 114409\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p1.1)\.
- \[8\]A\. De Caigny, K\. W\. De Bock, and S\. Verboven\(2024\)Hybrid black\-box classification for customer churn prediction with segmented interpretability analysis\.Decision Support Systems181,pp\. 114217\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p2.1)\.
- \[9\]T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer\(2023\)Qlora: efficient finetuning of quantized llms\.Advances in neural information processing systems36,pp\. 10088–10115\.Cited by:[§4\.2](https://arxiv.org/html/2606.26595#S4.SS2.SSSx2.p3.1)\.
- \[10\]H\. Face\(2023\)Zephyr\-7b\-beta: a fine\-tuned 7b parameter language model\.External Links:[Link](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)Cited by:[§3\.3\.1](https://arxiv.org/html/2606.26595#S3.SS3.SSS1.p1.1)\.
- \[11\]FedScoop\(2024\)Federal government websites public satisfaction\.Note:Accessed: 2025\-02\-22External Links:[Link](https://fedscoop.com/federal-government-websites-public-satisfaction/)Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p1.1)\.
- \[12\]J\. Feldman, D\. J\. Zhang, X\. Liu, and N\. Zhang\(2022\)Customer choice models vs\. machine learning: finding optimal product displays on alibaba\.Operations Research70\(1\),pp\. 309–328\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p3.1)\.
- \[13\]E\. Frantar, S\. Ashkboos, T\. Hoefler, and D\. Alistarh\(2023\)GPTQ: accurate post\-training quantization for generative pre\-trained transformers\.arXiv\.Org\.Cited by:[§3\.3\.3](https://arxiv.org/html/2606.26595#S3.SS3.SSS3.p1.1)\.
- \[14\]E\. Frantaret al\.\(2022\)Gradient\-preserving quantization for efficient large model training and inference\.Journal of Neural Network Research\.Cited by:[§3\.3\.3](https://arxiv.org/html/2606.26595#S3.SS3.SSS3.p1.1)\.
- \[15\]O\. Friha, M\. A\. Ferrag, B\. Kantarci, B\. Cakmak, A\. Ozgun, and N\. Ghoualmi\-Zine\(2024\)LLM\-based edge intelligence: a comprehensive survey on architectures, applications, security and trustworthiness\.IEEE Open Journal of the Communications Society5,pp\. 5799–5856\.External Links:[Document](https://dx.doi.org/10.1109/OJCOMS.2024.3456549)Cited by:[§2\.1](https://arxiv.org/html/2606.26595#S2.SS1.p2.1)\.
- \[16\]M\. Garg\(2022\)UBIS: unigram bigram importance score for feature selection from short text\.Expert Systems with Applications195,pp\. 116563\.Cited by:[§4\.1](https://arxiv.org/html/2606.26595#S4.SS1.p1.1)\.
- \[17\]R\. Guidotti, A\. Monreale, S\. Ruggieri, F\. Turini, F\. Giannotti, and D\. Pedreschi\(2018\)A survey of methods for explaining black box models\.ACM computing surveys \(CSUR\)51\(5\),pp\. 1–42\.Cited by:[item 5](https://arxiv.org/html/2606.26595#S3.I1.i5.p1.1)\.
- \[18\]D\. Guilbeault, S\. Delecourt, T\. Hull, B\. S\. Desikan, M\. Chu, and E\. Nadler\(2024\)Online images amplify gender bias\.Nature626\(8001\),pp\. 1049–1055\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p3.1)\.
- \[19\]P\. Gunarathne, H\. Rui, and A\. Seidmann\(2022\)Racial bias in customer service: evidence from twitter\.Information Systems Research33\(1\),pp\. 43–54\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p3.1)\.
- \[20\]J\. Guo, X\. Wang, and Y\. Wu\(2020\)Positive emotion bias: role of emotional content from online customer reviews in purchase decisions\.Journal of Retailing and Consumer Services52,pp\. 101891\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p5.1)\.
- \[21\]Z\. Hasan, D\. Vaz, V\. S\. Athota, S\. S\. M\. Désiré, and V\. Pereira\(2022\)Can artificial intelligence \(ai\) manage behavioural biases among financial planners?\.Journal of Global Information Management \(JGIM\)31\(2\),pp\. 1–18\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p7.1)\.
- \[22\]W\. Huang, K\. F\. Hew, and L\. K\. Fryer\(2022\)Chatbots for language learning—are they really useful? a systematic review of chatbot‐supported language learning\.Journal of Computer Assisted Learning38\(1\),pp\. 237–257\.External Links:[Document](https://dx.doi.org/10.1111/jcal.12610)Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p7.1)\.
- \[23\]S\. Hwang, J\. Kim, E\. Park, and S\. J\. Kwon\(2020\)Who will be your next customer: a machine learning approach to customer return visits in airline services\.Journal of Business Research121,pp\. 121–126\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p2.1)\.
- \[24\]M\. S\. Islam, M\. Ferdusi, and T\. T\. Aurpa\(2025\)Words of war: a hybrid bert\-cnn approach for topic\-wise sentiment analysis on the russia\-ukraine war\.Expert Systems with Applications,pp\. 127759\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p5.1)\.
- \[25\]Joint councils’ executive report february 2020\.Note:Accessed: 2025\-02\-22External Links:[Link](https://citizenfirst.ca/assets/uploads/research-repository/Joint-Councils-Executive-Report-February-2020.pdf)Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p1.1)\.
- \[26\]L\. R\. Krosuri and R\. S\. Aravapalli\(2023\)Novel heuristic\-based hybrid resnext with recurrent neural network to handle multi class classification of sentiment analysis\.Machine Learning: Science and Technology4\(1\),pp\. 015033\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p6.1)\.
- \[27\]Y\. Kumar, K\. Huang, A\. Perez, G\. Yang, J\. J\. Li, P\. Morreale, D\. Kruger, and R\. Jiang\(2024\)Bias and cyberbullying detection and data generation using transformer artificial intelligence models and top large language models\.Electronics13\(17\),pp\. 3431\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p6.1)\.
- \[28\]H\. Li, Y\. Qian, Y\. Jiang, Y\. Liu, and F\. Zhou\(2023\)A novel label\-based multimodal topic model for social media analysis\.Decision Support Systems164,pp\. 113863\.Cited by:[§2\.1](https://arxiv.org/html/2606.26595#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.26595#S2.SS1.p2.1)\.
- \[29\]N\. Li, X\. Yang, I\. A\. Wong, R\. Law, and J\. Y\. Xu\(2023\)Automating tourism online reviews: a neural network based aspect\-oriented sentiment classification\.Journal of Hospitality and Tourism Technology14\(1\),pp\. 1–20\.External Links:[Document](https://dx.doi.org/10.1108/JHTT-03-2021-0099)Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p1.1)\.
- \[30\]M\. Linzmajer, S\. Brach, G\. Walsh, and T\. Wagner\(2020\)Customer ethnic bias in service encounters\.Journal of Service Research23\(2\),pp\. 194–210\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p4.1)\.
- \[31\]F\. Maibaum, J\. Kriebel, and J\. N\. Foege\(2024\)Selecting textual analysis tools to classify sustainability information in corporate reporting\.Decision Support Systems183,pp\. 114269\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p2.1)\.
- \[32\]K\. Michael\(2024\)In this special section: algorithmic bias—australia’s robodebt and its human rights aftermath\.IEEE Transactions on Technology and Society5\(3\),pp\. 254–263\.External Links:[Document](https://dx.doi.org/10.1109/TTS.2024.1234567)Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p1.1)\.
- \[33\]M\. H\. Miraz, A\. Ya’u, S\. Adeyinka\-Ojo, J\. B\. Sarkar, M\. T\. Hasan, K\. Hoque, and H\. H\. Jin\(2024\)Intention to use determinants of ai chatbots to improve customer relationship management efficiency\.Cogent Business & Management11\(1\)\.External Links:[Document](https://dx.doi.org/10.1080/23311975.2024.2411445)Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p8.1)\.
- \[34\]M\. Mishraet al\.\(2024\)Temporal analysis of computational economics: a topic modeling approach\.International Journal of Data Science and Analytics,pp\. 1–15\.Cited by:[§2\.1](https://arxiv.org/html/2606.26595#S2.SS1.p2.1)\.
- \[35\]E\. Mogaji, J\. Farquhar, P\. van Esch, C\. Durodié, and R\. Perez\-Vega\(2022\)Guest editorial: artificial intelligence in financial services marketing\.International Journal of Bank Marketing\.External Links:[Document](https://dx.doi.org/10.1108/ijbm-09-2022-617)Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p8.1)\.
- \[36\]J\. Morley, L\. Floridi, L\. Kinsey, and A\. Elhalal\(2020\)From what to how: an initial review of publicly available ai ethics tools, methods and research to translate principles into practices\.Science and engineering ethics26\(4\),pp\. 2141–2168\.Cited by:[item 5](https://arxiv.org/html/2606.26595#S3.I1.i5.p1.1)\.
- \[37\]B\. A\. H\. Murshed, S\. Mallappa, and J\. Abawajy\(2023\)Short text topic modelling approaches in the context of big data: taxonomy, survey, and analysis\.Artificial Intelligence Review\.External Links:[Link](https://link.springer.com/article/10.1007/s10462-023-10345-9)Cited by:[§2\.1](https://arxiv.org/html/2606.26595#S2.SS1.p1.1)\.
- \[38\]A\. R\. Nair\(2025\)Natural language processing \(nlp\) in chatbot customer service\.International Journal for Research in Applied Science and Engineering Technology13\(3\),pp\. 715–721\.External Links:[Document](https://dx.doi.org/10.22214/ijraset.2025.67353)Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p5.1)\.
- \[39\]A\. Ojo, N\. Rizun, G\. Walsh, M\. I\. Mashinchi, M\. Venosa, and M\. N\. Rao\(2024\)Prioritising national healthcare service issues from free text feedback–a computational text analysis & predictive modelling approach\.Decision Support Systems181,pp\. 114215\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p2.1)\.
- \[40\]A\. M\. Pereira, J\. A\. B\. Moura, E\. D\. B\. Costa, T\. Vieira, A\. R\. Landim, E\. Bazaki, and V\. Wanick\(2022\)Customer models for artificial intelligence\-based decision support in fashion online retail supply chains\.Decision Support Systems158,pp\. 113795\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p2.1)\.
- \[41\]R\. Pillai, Y\. Ghanghorkar, B\. Sivathanu, R\. Algharabat, and N\. P\. Rana\(2024\)Adoption of artificial intelligence \(ai\) based employee experience \(eex\) chatbots\.Information Technology & People37\(1\),pp\. 449–478\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p7.1)\.
- \[42\]S\. Ravfogelet al\.\(2024\)Bias and fairness in large language models: a survey\.Computational Linguistics50\(3\),pp\. 1097–1130\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p6.1)\.
- \[43\]A\. Rogers, O\. Kovaleva, and A\. Rumshisky\(2020\)A primer in bertology: what we know about how bert works\.Transactions of the Association for Computational Linguistics8,pp\. 842–866\.Cited by:[§2\.1](https://arxiv.org/html/2606.26595#S2.SS1.p2.1)\.
- \[44\]L\. Schetgen, M\. Bogaert, and D\. Van den Poel\(2021\)Predicting donation behavior: acquisition modeling in the nonprofit sector using facebook data\.Decision Support Systems141,pp\. 113446\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p2.1)\.
- \[45\]M\. L\. Scott, S\. A\. Bone, G\. L\. Christensen, A\. Lederer, M\. Mende, B\. G\. Christensen, and M\. Cozac\(2024\)Revealing and mitigating racial bias and discrimination in financial services\.Journal of Marketing Research61\(4\),pp\. 598–618\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p1.1)\.
- \[46\]S\. Shahet al\.\(2023\)A review of natural language processing in contact centre automation\.Pattern Analysis and Applications26,pp\. 823–846\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p8.1)\.
- \[47\]M\. Shahin, F\. F\. Chen, A\. Hosseinzadeh, M\. Maghanaki, and A\. Eghbalian\(2024\)A novel approach to voice of customer extraction using gpt\-3\.5 turbo: linking advanced nlp and lean six sigma 4\.0\.The International Journal of Advanced Manufacturing Technology131\(7\),pp\. 3615–3630\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p7.1)\.
- \[48\]T\. Shu, Z\. Wang, L\. Lin, H\. Jia, and J\. Zhou\(2022\)Customer perceived risk measurement with nlp method in electric vehicles consumption market: empirical study from china\.Energies15\(5\)\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p7.1)\.
- \[49\]D\. Simester, A\. Timoshenko, and S\. I\. Zoumpoulis\(2020\)Targeting prospective customers: robustness of machine\-learning methods to typical data challenges\.Management Science66\(6\),pp\. 2495–2522\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p3.1)\.
- \[50\]J\. Smith and A\. Doe\(2022\)Advanced techniques in model quantization: preserving accuracy during training\.IEEE Transactions on Neural Networks and Learning Systems\.Cited by:[§3\.3\.3](https://arxiv.org/html/2606.26595#S3.SS3.SSS3.p1.1)\.
- \[51\]F\. Sufi\(2024\)An innovative way of analyzing covid topics with llm\.Journal of Economy and Technology\.External Links:[Document](https://dx.doi.org/10.1016/j.ject.2024.11.004)Cited by:[§2\.1](https://arxiv.org/html/2606.26595#S2.SS1.p2.1)\.
- \[52\]K\. A\. Tarnowska and Z\. Ras\(2021\)NLP\-based customer loyalty improvement recommender system \(clirs2\)\.Big Data and Cognitive Computing5\(1\)\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p6.1)\.
- \[53\]L\. Tzelves, P\. Juliebø\-Jones, and B\. K\. Somani\(2024\)The evolution of minimally invasive urologic surgery: innovations, challenges, and opportunities\.Frontiers in Surgery11,pp\. 1525713\.Cited by:[§2\.1](https://arxiv.org/html/2606.26595#S2.SS1.p2.1)\.
- \[54\]Y\. Xie, W\. Yeoh, and J\. Wang\(2024\)How self\-selection bias in online reviews affects buyer satisfaction: a product type perspective\.Decision Support Systems181,pp\. 114199\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p5.1)\.
- \[55\]K\. Yang, R\. Y\. Lau, and A\. Abbasi\(2023\)Getting personal: a deep learning artifact for text\-based measurement of personality\.Information Systems Research34\(1\),pp\. 194–222\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p8.1)\.
- \[56\]S\. Yi and X\. Liu\(2020\)Machine learning\-based customer sentiment analysis for recommending shoppers, shops based on customers’ review\.Complex & Intelligent Systems6\(3\),pp\. 621–634\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p2.1)\.
- \[57\]M\. Zaghloul, S\. Barakat, and A\. Rezk\(2024\)Predicting e\-commerce customer satisfaction: traditional machine learning vs\. deep learning approaches\.Journal of Retailing and Consumer Services79\.Cited by:[§2\.2](https://arxiv.org/html/2606.26595#S2.SS2.p3.1)\.
- \[58\]Y\. Zhanget al\.\(2024\)From bias to fairness: the role of domain\-specific knowledge and efficient fine\-tuning in large language models\.Journal of Artificial Intelligence Research58,pp\. 201–225\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p7.1)\.
- \[59\]Y\. F\. Zhao, E\. Niforatos, T\. Custis, Y\. Lu, and J\. Luo\(2024\)Large language models in design and manufacturing\.Journal of Computing and Information Science in Engineering,pp\. 1–6\.Cited by:[§2\.1](https://arxiv.org/html/2606.26595#S2.SS1.p2.1)\.
- \[60\]J\. Zheng, G\. Yin, Y\. Tan, and J\. Ding\(2024\)Does help help? an empirical analysis of social desirability bias in ratings\.Information Systems Research35\(3\),pp\. 1052–1073\.Cited by:[§1](https://arxiv.org/html/2606.26595#S1.p3.1)\.
- \[61\]J\. Zimmermann, L\. E\. Champagne, J\. M\. Dickens, and B\. T\. Hazen\(2024\)Approaches to improve preprocessing for latent dirichlet allocation topic modeling\.Decision Support Systems185,pp\. 114310\.Cited by:[§2\.1](https://arxiv.org/html/2606.26595#S2.SS1.p1.1)\.

Similar Articles

WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback

arXiv cs.CL

WildFeedback is a novel framework that leverages in-situ user feedback from actual LLM conversations to automatically create preference datasets for aligning language models with human preferences, addressing scalability and bias issues in traditional annotation-based alignment methods.

LLMs Can Better Capture Human Judgments--With the Right Prompts

arXiv cs.CL

This paper presents simple prompting strategies that help large language models better capture the full distribution of human judgments, improving alignment on moral scenarios and beliefs. The authors show that asking models to report standard deviations and response proportions, along with ensuring scenario clarity, yields better agreement with human responses.