BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts
Summary
Researchers introduce BenSyc, the first benchmark for evaluating conversational sycophancy in Bengali social contexts, finding that LLMs struggle to distinguish empathetic support from validation and escalation, achieving only ~61% Macro-F1.
Source: https://huggingface.co/papers/2606.10061
Abstract
Researchers create BenSyc, a benchmark for evaluating conversational sycophancy in Bengali contexts, revealing challenges in distinguishing empathetic support from validation and escalation in emotionally sensitive dialogues.
Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research primarily focuses on factual agreement and instruction-following settings, leaving culturally grounded conversational sycophancy underexplored. We introduce BenSyc, the first benchmark for studying conversational sycophancy in Bengali social contexts. Starting from 11,840 Reddit posts and 170k comments collected from communities across Bangladesh and West Bengal, we construct a human-validated benchmark with binary labels and a fine-grained five-level taxonomy spanning Invalidation, Neutral, Support, Validation, and Escalation. We evaluate more than 15 open and proprietary LLMs on conversational alignment classification and response generation tasks. Results show that distinguishing empathetic support from reinforcement-oriented validation remains challenging even for frontier instruction-tuned models: the best system achieves only 61.8 Macro-F1 on binary detection and 61.7 Macro-F1 on five-class classification. In generation settings, several models frequently produce strongly validating or escalatory responses in emotionally charged situations. Our findings highlight substantial variation across model families and conversational behaviors, underscoring the importance of culturally grounded multilingual benchmarks for evaluating socially aligned conversational AI systems.
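Since the headline results are reported as Macro-F1, a minimal sketch of how that metric behaves over the five-level taxonomy may help. This is not the authors' evaluation code: the function name `macro_f1`, the toy label lists, and the choice to score a class with no gold or predicted instances as 0.0 are all assumptions made for illustration. Only the five label names come from the abstract.

```python
from collections import Counter

# The five label names are taken from the abstract's taxonomy.
LABELS = ["Invalidation", "Neutral", "Support", "Validation", "Escalation"]


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores over `labels`.

    Assumption: a class that never appears in gold or pred scores 0.0.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data, not drawn from BenSyc.
toy_gold = ["Support", "Support", "Validation", "Escalation", "Neutral", "Invalidation"]
toy_pred = ["Support", "Validation", "Validation", "Validation", "Neutral", "Invalidation"]

if __name__ == "__main__":
    print(f"Macro-F1: {macro_f1(toy_gold, toy_pred):.3f}")
```

In this toy example four of six predictions are correct, yet the macro average falls below that accuracy because the missed Escalation class contributes an F1 of 0 with the same weight as every other class. That equal weighting is why Macro-F1 is sensitive to confusions between neighbouring categories such as Support and Validation.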
Datasets citing this paper: 1
Sajib-006/bensyc
Similar Articles
When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models
This position paper analyzes sycophancy in LLMs as a boundary failure between social alignment and epistemic integrity, proposing a new framework and taxonomy to classify and mitigate these behaviors.
Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models
This paper introduces MIST, a benchmark for evaluating sycophancy in memory-augmented LLMs, demonstrating that memory systems amplify sycophantic behavior by up to 25x and proposing lightweight mitigations that reduce sycophancy while maintaining factual recall.
MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media
This paper introduces MultiSoc-4D, a benchmark for diagnosing instruction-induced label collapse in LLMs annotating Bengali social media. It reveals that LLMs systematically prefer fallback labels, leading to under-detection of minority categories like hate speech and sarcasm.
When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models
This paper introduces CulturalNB, a dataset of Bengali cultural question-answer pairs, and evaluates nine LLMs for cross-lingual cultural bias. Findings show that English prompting increases global narrative substitution and reduces local perspectives, revealing that cultural failures in LLMs are grounding and prioritization issues, not just missing knowledge.
How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models
This paper investigates asymmetries in LLMs' pragmatic competence by comparing their performance as judges of linguistic appropriateness versus as generators of pragmatically appropriate language. The study finds that many models perform substantially better as pragmatic listeners than as speakers, suggesting misalignment between evaluation and generation capabilities.