Fault of Our Stars: Behavioral Drivers of Rating-Sentiment Incongruence

arXiv cs.CL Papers

Summary

This paper investigates the behavioral drivers of incongruence between star ratings and textual sentiment in Sri Lankan tourism reviews, finding that 18.6% of reviews show mismatch with six directional patterns, and identifying venue type, reviewer expertise, and temporal factors as contributors.

arXiv:2606.25518v1 Announce Type: new Abstract: When people share experiences online, they often express thoughts in two ways: a star rating and a written review. In sentiment analysis, ratings are widely used as convenient weak labels for textual sentiment, yet whether the two actually agree is rarely questioned. This study investigates sentiment-rating incongruence, where the sentiment expressed in review text differs from the sentiment implied by the assigned star rating, in Sri Lankan tourism attraction reviews. A dataset of 16,156 reviews from 2010 to 2023 is analyzed using a transformer-based sentiment pipeline that derives textual sentiment independently of assigned ratings. Incongruence occurs in 18.6% of reviews and falls into six directional patterns, with Conservative Rater and Obligatory 5-Star behaviors accounting for the majority of mismatches. Prevalence also varies across venue types, with museums showing the highest rates. Statistical tests, logistic regression, Random Forest, and SHAP analysis identify venue type, reviewer expertise, review length, and temporal factors as contributors to rating-text divergence. Overall, this study demonstrates that star ratings are not interchangeable with textual sentiment and should be validated before being treated as ground-truth labels in NLP.

# Fault of Our Stars: Behavioral Drivers of Rating–Sentiment Incongruence
Source: [https://arxiv.org/html/2606.25518](https://arxiv.org/html/2606.25518)
Ramanaish Abaiyan, Ruththiragayan Sutharsan, Kusal Amantha, Anusan Krishnathas, Asma Rauff, Kovindarajah Sriyathurshan, Patalee Narasinghe, Nirasha Munasinghe, Nisansa de Silva, Sandareka Wickramanayake

###### Abstract

When people share experiences online, they often express thoughts in two ways: a star rating and a written review\. In sentiment analysis, ratings are widely used as convenient weak labels for textual sentiment, yet whether the two actually agree is rarely questioned\. This study investigates sentiment–rating incongruence, where the sentiment expressed in review text differs from the sentiment implied by the assigned star rating, in Sri Lankan tourism attraction reviews\. A dataset of 16,156 reviews from 2010 to 2023 is analyzed using a transformer\-based sentiment pipeline that derives textual sentiment independently of assigned ratings\. Incongruence occurs in 18\.6% of reviews and falls into six directional patterns, with Conservative Rater and Obligatory 5\-Star behaviors accounting for the majority of mismatches\. Prevalence also varies across venue types, with museums showing the highest rates\. Statistical tests, logistic regression, Random Forest, and SHAP analysis identify venue type, reviewer expertise, review length, and temporal factors as contributors to rating–text divergence\. Overall, this study demonstrates that star ratings are not interchangeable with textual sentiment and should be validated before being treated as ground\-truth labels in NLP\.

## I Introduction

Online tourism reviews are an important source of user\-generated content for understanding visitor experiences\. Most review platforms allow users to express their experience through both a star rating and a written review\. In sentiment analysis and review mining, star ratings are often treated as convenient weak labels for textual sentiment\[[16](https://arxiv.org/html/2606.25518#bib.bib1),[3](https://arxiv.org/html/2606.25518#bib.bib2)\]\. However, this assumption is not always reliable\. A high rating does not necessarily mean that the review text is fully positive, and a moderate rating may still contain strongly positive language\. This creates an NLP problem: ratings may introduce noisy or context\-biased labels when used as ground truth for sentiment analysis\.

The growth of platforms such as TripAdvisor has expanded tourism review data\[[11](https://arxiv.org/html/2606.25518#bib.bib3)\]\. Many studies use these reviews to analyze destination image, tourist satisfaction, and consumer behavior\. However, the relationship between rating\-derived and text\-derived sentiment remains insufficiently examined\. In many sentiment analysis pipelines, star ratings are used as proxy labels without validating whether the written review expresses the same polarity\. From an NLP perspective, this becomes a weak\-supervision problem, where models trained or evaluated using ratings may learn distorted patterns rather than the actual sentiment expressed in language\.

Previous studies suggest that rating–text inconsistency is a recurring issue in online reviews\[[6](https://arxiv.org/html/2606.25518#bib.bib4),[14](https://arxiv.org/html/2606.25518#bib.bib5)\]\. However, much of the tourism sentiment analysis literature still relies on ratings as sentiment labels, particularly in hotel and restaurant contexts\[[3](https://arxiv.org/html/2606.25518#bib.bib2),[5](https://arxiv.org/html/2606.25518#bib.bib8)\]\. Although aspect\-based sentiment analysis and topic modeling have improved the extraction of fine\-grained information from review text\[[7](https://arxiv.org/html/2606.25518#bib.bib9),[4](https://arxiv.org/html/2606.25518#bib.bib10)\], the broader question of whether ratings reliably represent textual sentiment has received less attention\. This gap is especially important in underrepresented tourism contexts, where review behavior may be shaped by local cultural expectations, attraction types, and reviewer experience\.

Recent transformer\-based language models provide an opportunity to study sentiment–rating incongruence more effectively\. Models such as BERT\[[10](https://arxiv.org/html/2606.25518#bib.bib14),[19](https://arxiv.org/html/2606.25518#bib.bib15)\] and RoBERTa capture contextual meaning more effectively than traditional lexicon\-based approaches\[[22](https://arxiv.org/html/2606.25518#bib.bib6),[17](https://arxiv.org/html/2606.25518#bib.bib7)\]\. In this study, transformer\-based sentiment inference is used to derive textual sentiment independently from assigned ratings, allowing review text to be analyzed as a separate linguistic signal\.

Using 16,156 Sri Lankan tourism attraction reviews collected between 2010 and 2023\[[18](https://arxiv.org/html/2606.25518#bib.bib17)\], this study investigates how often rating\-derived sentiment and NLP\-derived textual sentiment diverge, what directional forms these mismatches take, and which contextual and reviewer\-level factors are associated with them\. Star ratings are grouped into negative, neutral, and positive classes, while textual sentiment is inferred using a transformer\-based sentiment pipeline selected through comparative model evaluation\. The resulting mismatches are organized into six directional incongruence patterns, moving beyond a simple matched/mismatched classification\.

After deriving textual sentiment through transformer\-based NLP inference, statistical and machine learning models are used as secondary explanatory tools to examine factors associated with rating–text divergence\. This study makes four contributions: it evaluates star ratings as weak labels for textual sentiment, applies transformer\-based sentiment inference independently from ratings, introduces six directional sentiment–rating incongruence patterns, and identifies contextual and reviewer\-level drivers of NLP\-derived mismatches\.

Overall, this paper argues that sentiment–rating incongruence is not random noise but a systematic and context\-dependent signal\. For NLP research, the findings highlight the risk of treating star ratings as ground\-truth sentiment labels without validation\. For tourism review analytics, they show that ratings and written reviews capture different aspects of visitor experience, supporting the need for context\-aware sentiment analysis approaches\.

[Data](https://huggingface.co/datasets/Abaiyan/Sri-lankan-tourism-review-incongruence) and [code](https://github.com/Abaiyan-27/Group-J---Research-Paper.git) for this work are publicly available\.

## II Related Work

The foundational survey by Pang and Lee \[[16](https://arxiv.org/html/2606.25518#bib.bib1)\] established sentiment analysis as a major research area and reinforced the common assumption that star ratings broadly reflect the sentiment expressed in review text\. Despite recognized limitations, this convention remains widely used in tourism research as a practical weak\-labeling strategy\. Alaei et al\. \[[3](https://arxiv.org/html/2606.25518#bib.bib2)\] note that ratings are often treated as weak labels without explicit validation, and that the literature has been heavily concentrated on hotels and restaurants\. This imbalance is further highlighted by Ameur et al\. \[[5](https://arxiv.org/html/2606.25518#bib.bib8)\], who report limited venue diversity and restricted geographic coverage in existing studies, especially for emerging tourism destinations\.

Methodological advances have substantially improved sentiment analysis in tourism\. Wen et al\. \[[22](https://arxiv.org/html/2606.25518#bib.bib6)\] demonstrate the effectiveness of transformer\-based models such as BERT\[[10](https://arxiv.org/html/2606.25518#bib.bib14)\] and ERNIE\[[19](https://arxiv.org/html/2606.25518#bib.bib15)\], while Puh and Babac \[[17](https://arxiv.org/html/2606.25518#bib.bib7)\] show that jointly analyzing sentiment and ratings can provide more detailed insight\. Multilingual approaches and aspect\-based methods further improve interpretability by linking sentiment to specific components of the tourism experience\[[7](https://arxiv.org/html/2606.25518#bib.bib9),[4](https://arxiv.org/html/2606.25518#bib.bib10)\]\. More recently, zero\-shot approaches have expanded the feasibility of analyzing under\-studied datasets with limited labeled data\[[15](https://arxiv.org/html/2606.25518#bib.bib16)\]\.

Rating–text inconsistency shows up in regional literature as well\. Abeysinghe and Walgampaya \[[1](https://arxiv.org/html/2606.25518#bib.bib11)\] document rating–text incompatibility in hotel reviews in Anuradhapura, while Abeysinghe and Bandara \[[2](https://arxiv.org/html/2606.25518#bib.bib12)\] extend this finding across five Sri Lankan cities and propose a self\-learning approach to resolve it\. However, both depend on lexicon\-based methods and define the problem primarily as one requiring correction rather than explanation\. In contrast, this study uses transformer\-based sentiment analysis and interprets incongruence as a context\-dependent NLP weak\-label reliability issue\.

In low\-resource settings where a language does not have an adequate amount of tagged text sentiment data\[[9](https://arxiv.org/html/2606.25518#bib.bib22)\], there have been attempts to derive the text sentiment using star ratings\[[12](https://arxiv.org/html/2606.25518#bib.bib18),[13](https://arxiv.org/html/2606.25518#bib.bib19)\] or Facebook reactions\[[21](https://arxiv.org/html/2606.25518#bib.bib21),[20](https://arxiv.org/html/2606.25518#bib.bib20)\]\. However, empirical findings on the relationship between ratings and review text remain mixed\. Bigne et al\. \[[6](https://arxiv.org/html/2606.25518#bib.bib4)\] report general alignment between the two, but also identify variation across contexts\. George and Ramos \[[11](https://arxiv.org/html/2606.25518#bib.bib3)\] show that ratings may exceed text\-based sentiment in destination\-related reviews, while Kwon et al\. \[[14](https://arxiv.org/html/2606.25518#bib.bib5)\] demonstrate that rating–text inconsistency varies by context and influences perceived review usefulness\. These findings suggest that ratings and text do not always capture the same dimension of experience\.

Reviewer characteristics also appear to matter\. Chua and Banerjee \[[8](https://arxiv.org/html/2606.25518#bib.bib13)\] show that reviewer expertise influences the relationship between ratings and textual content, while related work links sentiment polarity and review depth to perceived usefulness\[[8](https://arxiv.org/html/2606.25518#bib.bib13),[5](https://arxiv.org/html/2606.25518#bib.bib8)\]\. Taken together, these studies indicate that ratings and text may encode different aspects of user experience and that inconsistency may be partly shaped by reviewer\-level behavior\.

Overall, sentiment–rating incongruence remains insufficiently understood, particularly in tourism attraction contexts and emerging destinations\. Although recent methods enable large\-scale and fine\-grained analysis\[[22](https://arxiv.org/html/2606.25518#bib.bib6),[17](https://arxiv.org/html/2606.25518#bib.bib7),[7](https://arxiv.org/html/2606.25518#bib.bib9),[4](https://arxiv.org/html/2606.25518#bib.bib10),[15](https://arxiv.org/html/2606.25518#bib.bib16)\], there is still limited evidence on the structure of directional mismatch patterns and their drivers in multi\-venue, longitudinal datasets\. This study addresses this gap by analyzing rating–text mismatch as a weak\-label reliability problem in NLP\.

## III Methodology

TABLE I: Methodology Overview

### III\-A Dataset and Preprocessing

The framework begins by transforming a large, venue\-diverse collection of reviews into an analytical dataset suitable for analyzing sentiment–rating incongruence across location, time, and reviewer behavior\. The study used the “Tourism and Travel Reviews: Sri Lankan Destinations” dataset from Mendeley Data\[[18](https://arxiv.org/html/2606.25518#bib.bib17)\], which contains 16,156 reviews from 2010 to 2023 across 11 attraction types in Sri Lanka\. Missing values and duplicates were checked during preprocessing\. Date fields were used to create travel year and review delay, with negative delay values set to zero\. Raw location text was processed using rule\-based parsing and manual mapping to identify province and district\. Review\_Length was calculated as the character count of the review text\. Star ratings were grouped into three classes:

- Negative \(1–2$\star$\)
- Neutral \(3$\star$\)
- Positive \(4–5$\star$\)

This grouping follows common practice in sentiment analysis\[[16](https://arxiv.org/html/2606.25518#bib.bib1),[3](https://arxiv.org/html/2606.25518#bib.bib2)\] and makes the rating scale directly comparable with the three\-class sentiment output\. Table [II](https://arxiv.org/html/2606.25518#S3.T2) lists the source columns retained for analysis\. This preprocessing ensured that the sentiment labels, rating classes, temporal variables, and reviewer\-level features were constructed consistently before modeling rating–text divergence\.
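The grouping above can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: the function names `rating_class` and `is_incongruent` and the lowercase label strings are assumptions made for the example.

```python
def rating_class(stars: int) -> str:
    """Map a 1-5 star rating to the three rating classes described above."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a star rating in 1..5, got {stars!r}")
    if stars <= 2:
        return "negative"   # 1-2 stars
    if stars == 3:
        return "neutral"    # 3 stars
    return "positive"       # 4-5 stars


def is_incongruent(stars: int, text_sentiment: str) -> bool:
    """True when the text-derived label disagrees with the rating-derived class."""
    return rating_class(stars) != text_sentiment


# A 3-star review whose text reads as positive counts as a mismatch.
example = is_incongruent(3, "positive")
```

Because both signals share the same three labels, the mismatch check reduces to a plain inequality, which is what makes the three-class grouping convenient.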

TABLE II: Source Columns Used in Analysis

### III\-B Sentiment Model Selection

To derive textual sentiment independently from ratings, four transformer\-based models were tested on a manually labeled set of 1,000 reviews\. The dataset was split into 700 training and 300 testing instances, where the training portion was used to fine\-tune selected models and the test set was used for comparative evaluation\. Review titles and texts were combined as input, and model performance was assessed using Macro F1, accuracy, and weighted F1\. Macro F1 was included because sentiment classes may be imbalanced\. As shown in Table [III](https://arxiv.org/html/2606.25518#S3.T3), the pretrained cardiffnlp/twitter\-roberta\-base\-sentiment model performed best overall and was selected to label the full dataset\[[10](https://arxiv.org/html/2606.25518#bib.bib14),[19](https://arxiv.org/html/2606.25518#bib.bib15)\]\. This model achieved a strong balance between classification performance and generalization, outperforming fine\-tuned variants while avoiding potential overfitting given the limited size of the labeled dataset\. This approach measured textual sentiment independently from star ratings, reducing circularity and enabling mismatch detection between two signals: rating\-derived sentiment and NLP\-derived textual sentiment\.
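To illustrate why Macro F1 matters under class imbalance, the following standard-library sketch computes accuracy, Macro F1, and weighted F1 from two label lists. It is not the evaluation code used in the study; the function name `f1_scores` and the toy labels are assumptions for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return accuracy, macro F1, and support-weighted F1 for two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"accuracy": accuracy, "macro_f1": macro, "weighted_f1": weighted}


# Toy case: a classifier that always predicts the majority class.
y_true = ["positive"] * 8 + ["negative"] * 2
y_pred = ["positive"] * 10
scores = f1_scores(y_true, y_pred)
```

In this toy case accuracy and weighted F1 remain fairly high while Macro F1 drops sharply, because the minority class receives an F1 of zero and counts equally in the unweighted average.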

TABLE III: Sentiment Model Performance \(Test Set, $n=300$\)

### III\-C Variable Construction

After sentiment labeling, raw title and text were removed from further analysis\. Incongruent was defined as a mismatch between Sentiment and Rating\_Class, while Pattern recorded the six mismatch types\. Reviewer\_Tier grouped reviewers as follows: Novice \(0–5\), Casual \(6–20\), Active \(21–100\), Expert \(101\+\)\.

Reviewer tiers were defined by analyzing the distribution of contributions\. The data show strong positive skew \(median = 54, max = 9010\): most reviewers contribute 1–5 reviews, while few exceed 100\. Based on this distribution, thresholds were set at 0–5, 6–20, 21–100, and 101\+ to capture distinct engagement levels\. Each tier corresponds to measurable differences in behavior\. Reviewers with 0–5 reviews exhibit minimal platform familiarity, whereas the 6–20 range reflects casual engagement\. The 21–100 tier identifies active users with sustained participation, and 101\+ represents highly engaged expert reviewers\. This tiering is further supported by rating behavior: the Conservative Rater pattern increases from 27\.4% among novice reviewers to 40\.4% among experts, indicating experience\-dependent rating practices\. These tiers therefore capture both contribution intensity and observable differences in rating–text alignment\[[8](https://arxiv.org/html/2606.25518#bib.bib13)\]\.
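The tier thresholds translate directly into a small binning function. This is a minimal sketch; the function name `reviewer_tier` is hypothetical, and only the cut points come from the text above.

```python
def reviewer_tier(contributions: int) -> str:
    """Bin a reviewer's contribution count into the four tiers defined above."""
    if contributions < 0:
        raise ValueError("contribution count cannot be negative")
    if contributions <= 5:
        return "Novice"   # 0-5 contributions
    if contributions <= 20:
        return "Casual"   # 6-20 contributions
    if contributions <= 100:
        return "Active"   # 21-100 contributions
    return "Expert"       # 101 or more contributions


tiers = [reviewer_tier(c) for c in (0, 5, 6, 20, 21, 100, 101, 9010)]
```

Checking the boundary values (5 and 6, 20 and 21, 100 and 101) is a cheap way to confirm that each threshold is inclusive on the intended side.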

The constructed analytical variables used in this study comprise both target\-defining and explanatory features\. Sentiment is represented as a three\-class label derived from model output, while Rating\_Class is a three\-class label obtained by grouping star ratings \(1–5\); together, these variables define sentiment–rating incongruence, from which the binary variable Incongruent \(0/1\) is derived as the target outcome\. Pattern is a six\-category variable formed from the interaction between Sentiment and Rating\_Class, capturing distinct mismatch typologies, and Reviewer\_Tier is a four\-level ordinal variable based on grouped contribution levels, used as a predictor\. Continuous predictors include log\_review\_length, computed as the logarithm of review length in characters using $\log(1+x)$, and log\_review\_delay, defined as the logarithm of the time gap between visit and posting using $\log(1+x)$\. Temporal effects are modeled using Travel\_Year\_c, a centered travel year variable, and its squared term Travel\_Year\_c2, which captures potential nonlinear time trends\. Together, these variables provide a structured representation of textual, behavioral, and temporal factors relevant to the analysis\.
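The derived predictors can be sketched as follows. Several details here are assumptions rather than facts from the paper: the helper names `build_features` and `mismatch_pattern`, centering on the mean travel year, and measuring the delay in days. The pattern is returned as a (text sentiment, rating class) pair, because the mapping from each pair to a named pattern is not fully specified in this section.

```python
import math
from itertools import product


def build_features(review_length, review_delay, travel_year, mean_travel_year):
    """Compute the continuous and temporal predictors described above."""
    delay = max(review_delay, 0)              # negative delays are set to zero
    year_c = travel_year - mean_travel_year   # assumption: centered on the mean year
    return {
        "log_review_length": math.log1p(review_length),  # log(1 + x)
        "log_review_delay": math.log1p(delay),           # log(1 + x)
        "Travel_Year_c": year_c,
        "Travel_Year_c2": year_c ** 2,
    }


def mismatch_pattern(sentiment, rating_class):
    """Return the directional mismatch as a pair, or None when the labels agree."""
    if sentiment == rating_class:
        return None
    return (sentiment, rating_class)


classes = ("negative", "neutral", "positive")
patterns = {mismatch_pattern(s, r) for s, r in product(classes, classes)} - {None}
features = build_features(279, -3, 2018, 2016.5)
```

Three sentiment classes crossed with three rating classes give nine combinations, of which three are congruent, so `patterns` holds exactly six directional pairs, matching the six-category Pattern variable.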

TABLE IV: Modeling and Interpretation Framework

### III\-D Statistical Testing and Predictive Modeling

After deriving textual sentiment through transformer\-based NLP inference, statistical and machine learning models were used as secondary explanatory tools to examine factors associated with rating–text mismatch\. Incongruent was used as the binary outcome variable, while variables used to define it were excluded from the predictors to avoid data leakage\. The final predictor set included venue type, province, reviewer tier, review length, review delay, and travel year terms, with low multicollinearity \(max VIF = 3\.663\)\.

Chi\-square and Mann–Whitney U tests were used for bivariate analysis, with Benjamini–Hochberg correction applied for multiple testing\. Logistic regression and logit models were used to examine linear effects and adjusted odds ratios, while Random Forest was used to assess nonlinear relationships\. SHAP was used only as a post hoc interpretation layer for the Random Forest model, supporting the explanation of NLP\-derived incongruence rather than replacing the main sentiment analysis framework\[[17](https://arxiv.org/html/2606.25518#bib.bib7)\]\. Model performance was evaluated using AUC\-ROC\.
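For readers unfamiliar with the Benjamini–Hochberg procedure, the sketch below computes adjusted q-values from raw p-values using only the standard library. It shows the standard step-up adjustment and is not the authors' implementation; the function name and the toy p-values are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Toy p-values (not from the paper).
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

A predictor is retained when its q-value falls below the chosen false discovery rate, which is how a result such as a non-significant review delay would be read off a screening table.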

## IV Results

### IV\-A Prevalence and Six\-Pattern Typology

Each incongruence type reflects a directional mismatch between rating and sentiment polarity\. The pattern names, as shown in Fig\.[1](https://arxiv.org/html/2606.25518#S4.F1), are original to this study, derived from the behavioral characteristic each mismatch most plausibly reflects\.

![Refer to caption](https://arxiv.org/html/2606.25518v1/x1.png) Figure 1: Distribution of the six directional incongruence patterns\.

Among the reviews analyzed, 3,005 were identified as incongruent, giving an overall prevalence of 18\.6%, or roughly one in five reviews\. Fig\.[1](https://arxiv.org/html/2606.25518#S4.F1) further shows that incongruence follows a six\-pattern typology\. The two most common patterns, Conservative Rater \(38\.4%\) and Obligatory 5\-Star \(28\.3%\), together account for 66\.7% of all incongruent reviews\. This indicates that rating–text mismatch is directionally structured rather than random\. In addition, Frustrated Neutral and Polite Inflator account for a further 24\.2% of incongruent cases, showing that negative sentiment is often paired with non\-negative ratings\. This directional structure suggests that rating\-derived labels introduce systematic rather than random noise into sentiment analysis tasks\. The concentration of mismatches in a few recurring patterns also makes the typology useful for interpreting how numerical ratings and written sentiment diverge in review\-mining datasets\.

### IV\-B Variation Across Venue Types

Incongruence rates were compared across 11 attraction categories to assess contextual variation\. The Chi\-square test showed a statistically significant association between venue type and incongruence, with a small but meaningful effect size \($\chi^{2}(10)=125.85$, $p<0.001$; Cramer’s V = 0\.088\)\. As shown in Fig\.[2](https://arxiv.org/html/2606.25518#S4.F2), National Parks had the lowest incongruence rate \(12\.8%\), whereas Museums had the highest \(26\.3%\)\. Overall, the variation across venue types indicates that rating–text mismatch is not evenly distributed, but differs by review context\.
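The reported effect size can be checked from the chi-square statistic. The sketch below assumes that the test was run on all 16,156 reviews and that the contingency table is 11 venue types by 2 outcomes, which agrees with the 10 degrees of freedom. The function name `cramers_v` is hypothetical.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V for an n_rows x n_cols contingency table with n observations."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# 11 venue types x 2 outcomes (incongruent or not); n assumed to be the full dataset.
v = cramers_v(chi2=125.85, n=16156, n_rows=11, n_cols=2)
```

Under these assumptions the result rounds to the reported 0.088, so a highly significant chi-square statistic here corresponds to a small effect, as expected with a sample of this size.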

![Refer to caption](https://arxiv.org/html/2606.25518v1/fig_venue_incongruence_rate.png) Figure 2: Incongruence rate by venue type\.

### IV\-C Predictors of Incongruence

Bivariate screening with Benjamini–Hochberg correction was used to identify predictors associated with incongruence\. As shown in Table [V](https://arxiv.org/html/2606.25518#S4.T5), reviewer tier, province, travel year, and review length remained significant after correction, while review delay was not significant \($q=0.7503$\)\. Expert reviewers were 1\.97 times more likely than novices to produce incongruent reviews\[[8](https://arxiv.org/html/2606.25518#bib.bib13)\]\. In addition, incongruent reviews were longer at the median than congruent reviews \(296 vs\. 279 characters\)\. These findings indicate that both reviewer characteristics and review content are associated with sentiment–rating mismatch, although multivariable modeling is needed to test their independent effects\.

Fig\.[3](https://arxiv.org/html/2606.25518#S5.F3)further illustrates the reviewer expertise effect for selected mismatch patterns, showing that Conservative Rater becomes more common among expert reviewers, while Harsh Deflator becomes less common\.

TABLE V: Bivariate Predictor Screening with Benjamini–Hochberg FDR Correction

### IV\-D Model\-Based Analysis and Interpretation

1\) Model 1A – Logistic Regression: A class\-balanced logistic regression model was used as a linear baseline\. It achieved a mean cross\-validated AUC of $0.5890 \pm 0.0093$ and a test AUC of 0\.5840, indicating modest but stable predictive performance\. This suggests that incongruence is only partly explained by the observed variables\.

2\) Model 1B – Explanatory Logit: The explanatory logit model was used to identify independent predictors of incongruence\. Based on 95% confidence intervals, 19 predictors were statistically significant\. Venue type, reviewer expertise, review length, and travel year showed important effects, while review delay had only a weak negative association\. Overall, the model shows that incongruence is shaped by structural, behavioral, and temporal factors\.

3\) Model 2 – Random Forest \(Nonlinear Structure Test\): A Random Forest model was applied to capture nonlinear relationships and interactions\. It achieved a test AUC of 0\.6095, which was slightly higher than logistic regression\. This indicates that nonlinear effects are present, although their contribution is modest\.

4\) SHAP Analysis: SHAP was used as a post hoc interpretation layer for the Random Forest model to identify influential features behind NLP\-derived incongruence\. The results were broadly consistent with the logit model, highlighting review length, reviewer expertise, travel year, review delay, and venue type as influential factors\. SHAP therefore supports interpretation of nonlinear patterns rather than replacing the main sentiment analysis framework\.
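To make the AUC values reported for the logistic regression and Random Forest models easier to interpret, the sketch below computes AUC-ROC directly from its pairwise definition: the probability that a randomly chosen incongruent review receives a higher score than a randomly chosen congruent one. This is a pedagogical standard-library version with toy data, not the evaluation code used in the study.

```python
def auc_roc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores (not from the paper): 1 = incongruent, 0 = congruent.
auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

On this reading, an AUC near 0.59 to 0.61 means the models rank an incongruent review above a congruent one about six times in ten, which is better than chance (0.5) but far from a reliable detector.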

Temporal effects indicate a modest but consistent decline in sentiment–rating incongruence over time\. The linear travel year term shows a significant negative association \(Travel\_Year\_c: OR = 0\.946, CI < 1\), suggesting that more recent reviews are less likely to be incongruent\. In contrast, the quadratic term \(Travel\_Year\_c²: OR = 0\.995\) is not statistically significant, providing no evidence of nonlinear temporal effects and indicating an approximately linear trend\. SHAP analysis further supports this finding, identifying Travel\_Year\_c as an influential predictor \(mean \|SHAP\| = 0\.0219\) and confirming the overall direction of the effect\. This suggests that, over time, reviewers are expressing their opinions more consistently, with written sentiment aligning more closely with their ratings\.
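As a rough illustration of what a per-year odds ratio of 0.946 implies, the sketch below compounds it over several years. It considers the linear term only and ignores the quadratic term, which is a simplifying assumption, so the result is an approximation rather than a model prediction. The function name `compounded_odds_ratio` is hypothetical.

```python
import math


def compounded_odds_ratio(or_per_unit, units):
    """Odds ratio implied by a shift of `units` when the log-odds effect is linear."""
    return math.exp(units * math.log(or_per_unit))


# Linear travel-year term only; the non-significant quadratic term is ignored.
or_ten_years = compounded_odds_ratio(0.946, 10)
```

Under this simplification, a ten-year difference in travel year corresponds to an odds ratio of roughly 0.57, meaning the odds of incongruence are a little over 40% lower, holding the other predictors fixed.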

## V Discussion

### V\-A Structured Incongruence

The incongruence rate of 18\.6% shows that rating–text mismatch is systematic rather than random\. Ratings capture overall judgments, while review text captures more specific details of the visitor experience\. For sentiment analysis, a numerical score may compress a complex experience into a single label, while the written review can express mixed or context\-dependent sentiment\.

The six\-pattern structure further shows that incongruence is directional rather than accidental\. Conservative Rater and Obligatory 5\-Star patterns dominate the mismatched cases, indicating that reviewers do not simply make random rating errors\. For NLP pipelines, this means that star ratings do not function equally across contexts as weak sentiment labels\. Using them without validation can introduce systematic label noise into sentiment analysis models\. Ratings and textual sentiment therefore capture different dimensions of experience, and treating them as interchangeable signals creates modeling risk\[[6](https://arxiv.org/html/2606.25518#bib.bib4)\]\. This supports the use of review text as an independent sentiment signal when building or evaluating review\-mining systems\. It also suggests that rating\-based labels should be checked for domain\-specific bias before being used for supervised sentiment classification\.

### V\-B Conservative Rating Behavior

The Conservative Rater pattern further reinforces structural incongruence\. Positive textual sentiment paired with moderate ratings accounts for 38\.4% of incongruent reviews, showing that reviewers often temper numerical scores relative to their written sentiment\. This pattern rises from 27\.4% among novice reviewers to 40\.4% among experts, suggesting that rating behavior becomes more calibrated with experience\. Experienced reviewers may reserve high ratings for exceptional cases while still expressing positive textual sentiment\. For NLP, this creates structured label noise when ratings are used as ground\-truth sentiment labels, making reviewer expertise important for interpreting rating–text relationships\.

### V\-C Location Type as a Structural Moderator of Label Reliability

Location type affects rating–text reliability\. Museum reviews have more than twice the odds of being incongruent compared with national park reviews \(adjusted OR = 2\.386\), with similar patterns for beaches, inland waterbodies, zoological gardens, and waterfalls\. This suggests that some attraction types are harder to evaluate using a single numerical score\. Museums and cultural sites often involve layered experiences, where visitors may describe positive exhibits, heritage value, or emotional significance while also mentioning issues such as crowding, pricing, accessibility, or facilities\.

These findings indicate that star\-rating reliability is context\-dependent rather than uniform across all review types\. Rating\-derived labels may therefore introduce systematic bias when treated as equally reliable sentiment labels across different tourism contexts\. For NLP, this supports the need for context\-aware sentiment modeling that validates whether rating\-derived sentiment and textual sentiment are aligned before using ratings as ground\-truth labels\. In practical NLP applications, this means that attraction categories may require different levels of label validation before ratings are reused as sentiment labels\.

### V\-D Complementary Roles of Linear and Nonlinear Models

The results show that linear and nonlinear models offer complementary insights into sentiment–rating incongruence rather than competing explanations\. Logistic regression \($\mathrm{AUC}=0.584$\) provides a stable and interpretable baseline, identifying 19 significant predictors, while the Random Forest achieves a modest improvement \($\mathrm{AUC}=0.6095$\), confirming the presence of nonlinear and interaction effects\. However, the limited performance gain suggests that these nonlinearities are not dominant\. The overall predictive range \($\mathrm{AUC}\approx 0.58$–$0.61$\) indicates that incongruence is structured but only partially observable, with substantial variation driven by latent behavioral and contextual factors\. Importantly, this moderate predictive performance should not be interpreted as a weakness; rather, it reflects the inherent complexity and subjectivity of human judgment in review behavior, where not all influencing factors are directly measurable\. SHAP analysis reinforces this interpretation by showing that variables such as reviewer expertise, review length, and venue type contribute in nonlinear and context\-dependent ways, underscoring the need for richer features to further capture the phenomenon\.

### V\-E Review Delay and Reviewer Expertise Effects

Review delay plays a limited role in sentiment–rating incongruence, showing no significance in bivariate analysis and only a weak negative association \(OR = 0\.960\)\. In contrast, reviewer expertise is more influential, with experts exhibiting more conservative and fewer harsh rating behaviors, indicating that incongruence is driven more by reviewer behavior than timing\.

![Refer to caption](https://arxiv.org/html/2606.25518v1/fig_reviewer_expertise_patterns.png) Figure 3: Selected incongruence patterns by reviewer expertise, showing higher Conservative Rater prevalence and lower Harsh Deflator prevalence among expert reviewers\.

## VI Conclusion

Sentiment–rating incongruence in tourism reviews is systematic rather than random\. Using transformer\-based sentiment inference, this study showed that 18\.6% of reviews contain rating–text mismatch forming six directional patterns\. The findings show that reviewer expertise, review length, venue type, and temporal factors influence rating–text divergence\. More importantly, star ratings and textual sentiment are not interchangeable signals\[[6](https://arxiv.org/html/2606.25518#bib.bib4)\]\. This distinction is especially important for datasets where ratings are used automatically as training labels\.

For NLP research, the main implication is that star ratings should not be treated as ground\-truth sentiment labels without validation\. Rating\-derived sentiment labels may introduce systematic label noise when the written review expresses a different sentiment polarity from the assigned score\. This study therefore supports the need for context\-aware sentiment analysis methods that evaluate weak\-label reliability before model training or evaluation\. It also calls into question the premise and validity of some prior low\-resource sentiment analysis work \[[12](https://arxiv.org/html/2606.25518#bib.bib18),[13](https://arxiv.org/html/2606.25518#bib.bib19)\]\.
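One way to act on this recommendation is to audit rating-derived labels against an independent text-sentiment signal before training. The sketch below is a minimal illustration under stated assumptions: the star-to-polarity mapping (1 to 2 negative, 3 neutral, 4 to 5 positive), the function names, and the sample data are all hypothetical, and the text polarity is assumed to come from some separate classifier. It is not the paper's pipeline.

```python
from collections import Counter


def rating_to_polarity(stars):
    """Map a 1-5 star rating to a polarity label (assumed mapping)."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def audit_weak_labels(reviews):
    """Compare rating-implied polarity with text polarity.

    reviews: iterable of (stars, text_polarity) pairs.
    Returns (mismatch_rate, Counter keyed by (rating_polarity, text_polarity)).
    """
    total = 0
    mismatches = Counter()
    for stars, text_polarity in reviews:
        total += 1
        rating_polarity = rating_to_polarity(stars)
        if rating_polarity != text_polarity:
            mismatches[(rating_polarity, text_polarity)] += 1
    if total == 0:
        raise ValueError("no reviews to audit")
    return sum(mismatches.values()) / total, mismatches


# Hypothetical sample: (stars, polarity predicted from the review text).
sample = [
    (5, "positive"),
    (5, "neutral"),
    (4, "positive"),
    (3, "positive"),
    (1, "negative"),
    (2, "positive"),
]

rate, patterns = audit_weak_labels(sample)
print(f"mismatch rate = {rate:.2f}")
for (rating_pol, text_pol), count in patterns.most_common():
    print(f"rating={rating_pol}, text={text_pol}: {count}")
```

With three polarity classes there are six possible off-diagonal (rating, text) pairs, so a breakdown of this kind separates mismatches by direction. The paper's own pattern definitions are not reproduced here.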

The study also contributes to tourism review analytics by showing that visitor evaluations are expressed differently through ratings and written text\. The framework can be applied to other review platforms containing both numerical ratings and textual feedback\. Future work may compare RoBERTa\-based sentiment inference with lexicon\-based baselines \[[7](https://arxiv.org/html/2606.25518#bib.bib9)\] such as VADER or TextBlob and traditional TF\-IDF classifiers, and test whether these patterns generalize across other domains\.

## References

- \[1\] \(2021\-11\) Sentiment analysis in user reviews: a study of incompatibility in hotel reviews in city of Anuradhapura, Sri Lanka\. In Proceedings of iPURSE, Vol\. 23, Peradeniya, Sri Lanka\. Cited by: [§II](https://arxiv.org/html/2606.25518#S2.p3.1)\.
- \[2\] P\. Abeysinghe and T\. Bandara \(2022\) A novel self\-learning approach to overcome incompatibility on Tripadvisor reviews\. Data Science and Management 5, pp\. 1–10\. Cited by: [§II](https://arxiv.org/html/2606.25518#S2.p3.1)\.
- \[3\] A\. Alaei, S\. Becken, and B\. Stantic \(2019\) Sentiment analysis in tourism: capitalising on big data\. Journal of Travel Research 58\(2\), pp\. 175–191\. Cited by: [§I](https://arxiv.org/html/2606.25518#S1.p1.1), [§I](https://arxiv.org/html/2606.25518#S1.p3.1), [§II](https://arxiv.org/html/2606.25518#S2.p1.1), [§III\-A](https://arxiv.org/html/2606.25518#S3.SS1.p1.2)\.
- \[4\] T\. Ali, B\. Omar, and K\. Soulaimane \(2022\-11\) Analyzing tourism reviews using an LDA topic\-based sentiment analysis approach\. MethodsX 9, pp\. 101894\. Cited by: [§I](https://arxiv.org/html/2606.25518#S1.p3.1), [§II](https://arxiv.org/html/2606.25518#S2.p2.1), [§II](https://arxiv.org/html/2606.25518#S2.p6.1)\.
- \[5\] A\. Ameur, S\. Hamdi, and S\. B\. Yahia \(2023\-09\) Sentiment analysis for hotel reviews: a systematic literature review\. ACM Computing Surveys 56\(2\), pp\. Article 51\. Cited by: [§I](https://arxiv.org/html/2606.25518#S1.p3.1), [§II](https://arxiv.org/html/2606.25518#S2.p1.1), [§II](https://arxiv.org/html/2606.25518#S2.p5.1)\.
- \[6\] E\. Bigne, C\. Ruiz, C\. Perez\-Cabanero, and A\. Cuenca \(2023\) Are customer star ratings and sentiments aligned? a deep learning study of the customer service experience in tourism destinations\. Service Business 17, pp\. 281–314\. Cited by: [§I](https://arxiv.org/html/2606.25518#S1.p3.1), [§II](https://arxiv.org/html/2606.25518#S2.p4.1), [§V\-A](https://arxiv.org/html/2606.25518#S5.SS1.p2.1), [§VI](https://arxiv.org/html/2606.25518#S6.p1.1)\.
- \[7\] M\. Chu, Y\. Chen, L\. Yang, and J\. Wang \(2022\-10\) Language interpretation in travel guidance platform: text mining and sentiment analysis of Tripadvisor reviews\. Frontiers in Psychology\. Cited by: [§I](https://arxiv.org/html/2606.25518#S1.p3.1), [§II](https://arxiv.org/html/2606.25518#S2.p2.1), [§II](https://arxiv.org/html/2606.25518#S2.p6.1), [§VI](https://arxiv.org/html/2606.25518#S6.p3.1)\.
- \[8\] A\. Y\. K\. Chua and S\. Banerjee \(2015\) Understanding review helpfulness as a function of reviewer reputation, review rating, and review depth\. Journal of the Association for Information Science and Technology 66\(2\), pp\. 354–362\. Cited by: [§II](https://arxiv.org/html/2606.25518#S2.p5.1), [§III\-C](https://arxiv.org/html/2606.25518#S3.SS3.p2.21), [§IV\-C](https://arxiv.org/html/2606.25518#S4.SS3.p1.1)\.
- \[9\] N\. de Silva \(2026\) Survey on Publicly Available Sinhala Natural Language Processing Tools and Research\. arXiv preprint arXiv:1906\.02358v26\. Cited by: [§II](https://arxiv.org/html/2606.25518#S2.p4.1)\.
- \[10\] J\. Devlin, M\. W\. Chang, K\. Lee, and K\. Toutanova \(2019\-06\) BERT: pre\-training of deep bidirectional transformers for language understanding\. In NAACL, Minneapolis, MN, USA, pp\. 4171–4186\. Cited by: [§I](https://arxiv.org/html/2606.25518#S1.p4.1), [§II](https://arxiv.org/html/2606.25518#S2.p2.1), [§III\-B](https://arxiv.org/html/2606.25518#S3.SS2.p1.1)\.
- \[11\] O\. A\. George and C\. M\. Q\. Ramos \(2024\) Sentiment analysis applied to tourism: exploring tourist\-generated content in the case of a wellness tourism destination\. International Journal of Spa and Wellness 7\(2\), pp\. 139–161\. Cited by: [§I](https://arxiv.org/html/2606.25518#S1.p2.1), [§II](https://arxiv.org/html/2606.25518#S2.p4.1)\.
- \[12\] V\. Jayawickrama, G\. Weeraprameshwara, N\. de Silva, and Y\. Wijeratne \(2021\) Seeking Sinhala sentiment: predicting Facebook reactions of Sinhala posts\. In International Conference on Advances in ICT for Emerging Regions, pp\. 177–182\. External Links: [Document](https://dx.doi.org/10.1109/ICter53630.2021.9774796)\. Cited by: [§II](https://arxiv.org/html/2606.25518#S2.p4.1), [§VI](https://arxiv.org/html/2606.25518#S6.p2.1)\.
- \[13\] V\. Jayawickrama, G\. Weeraprameshwara, N\. de Silva, and Y\. Wijeratne \(2022\) Facebook for sentiment analysis: baseline models to predict Facebook reactions of Sinhala posts\. The International Journal on Advances in ICT for Emerging Regions 15\(2\)\. External Links: [Document](https://dx.doi.org/10.4038/icter.v15i2.7248)\. Cited by: [§II](https://arxiv.org/html/2606.25518#S2.p4.1), [§VI](https://arxiv.org/html/2606.25518#S6.p2.1)\.
- \[14\] B\. Kwon, J\. Lee, J\. Min, C\. Kwak, and H\. B\. S\. Choi \(2025\) Beyond the stars: the impact of rating\-text inconsistency on perceived review usefulness\. Asia Pacific Journal of Information Systems 35\(1\), pp\. 49–72\. Cited by: [§I](https://arxiv.org/html/2606.25518#S1.p3.1), [§II](https://arxiv.org/html/2606.25518#S2.p4.1)\.
- \[15\] I\. Nawawi, K\. F\. Ilmawan, M\. F\. Maarif, and M\. Syafrudin \(2024\-08\) Exploring tourist experience through online reviews using aspect\-based sentiment analysis with zero\-shot learning for hospitality service enhancement\. Information 15\(8\), pp\. 499\. Cited by: [§II](https://arxiv.org/html/2606.25518#S2.p2.1), [§II](https://arxiv.org/html/2606.25518#S2.p6.1)\.
- \[16\] B\. Pang and L\. Lee \(2008\) Opinion mining and sentiment analysis\. Foundations and Trends in Information Retrieval 2\(1–2\), pp\. 1–135\. Cited by: [§I](https://arxiv.org/html/2606.25518#S1.p1.1), [§II](https://arxiv.org/html/2606.25518#S2.p1.1), [§III\-A](https://arxiv.org/html/2606.25518#S3.SS1.p1.2)\.
- \[17\] K\. Puh and M\. B\. Babac \(2023\) Predicting sentiment and rating of tourist reviews using machine learning\. Journal of Hospitality and Tourism Insights 6\(3\), pp\. 1188–1204\. Cited by: [§I](https://arxiv.org/html/2606.25518#S1.p4.1), [§II](https://arxiv.org/html/2606.25518#S2.p2.1), [§II](https://arxiv.org/html/2606.25518#S2.p6.1), [§III\-D](https://arxiv.org/html/2606.25518#S3.SS4.p2.1)\.
- \[18\] T\. Sewwandi \(2023\) Tourism and travel reviews: Sri Lankan destinations\. Note: Mendeley Data, V1\. External Links: [Document](https://dx.doi.org/10.17632/2nbvx5m4hs.1)\. Cited by: [§I](https://arxiv.org/html/2606.25518#S1.p5.1), [§III\-A](https://arxiv.org/html/2606.25518#S3.SS1.p1.1)\.
- \[19\] Y\. Sun, S\. Wang, Y\. Li, S\. Feng, X\. Chen, H\. Zhang, X\. Tian, D\. Zhu, H\. Tian, and H\. Wu \(2019\) ERNIE: enhanced representation through knowledge integration\. arXiv preprint arXiv:1904\.09223\. Cited by: [§I](https://arxiv.org/html/2606.25518#S1.p4.1), [§II](https://arxiv.org/html/2606.25518#S2.p2.1), [§III\-B](https://arxiv.org/html/2606.25518#S3.SS2.p1.1)\.
- \[20\] G\. Weeraprameshwara, V\. Jayawickrama, N\. de Silva, and Y\. Wijeratne \(2022\) Sentiment Analysis with Deep Learning Models: A Comparative Study on a Decade of Sinhala Language Facebook Data\. In 2022 The 3rd International Conference on Artificial Intelligence in Electronics Engineering, pp\. 16–22\. External Links: [Document](https://dx.doi.org/10.1145/3512826.3512829)\. Cited by: [§II](https://arxiv.org/html/2606.25518#S2.p4.1)\.
- \[21\] G\. Weeraprameshwara, V\. Jayawickrama, N\. de Silva, and Y\. Wijeratne \(2022\) Sinhala Sentence Embedding: A Two\-Tiered Structure for Low\-Resource Languages\. In Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation, pp\. 325–336\. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2210.14472)\. Cited by: [§II](https://arxiv.org/html/2606.25518#S2.p4.1)\.
- \[22\] Y\. Wen, Y\. Liang, and X\. Zhu \(2023\-03\) Sentiment analysis of hotel online reviews using the BERT model and ERNIE model—data from China\. PLOS ONE 18\(3\), pp\. e0275382\. Cited by: [§I](https://arxiv.org/html/2606.25518#S1.p4.1), [§II](https://arxiv.org/html/2606.25518#S2.p2.1), [§II](https://arxiv.org/html/2606.25518#S2.p6.1)\.
