Tag
Introduces LoFa, a comprehensive benchmark to evaluate LLM robustness against logical fallacies in persuasive contexts, featuring a multi-agent pipeline and a multi-round debate framework.
Introduces LPDS, a framework that systematically evaluates LLM robustness by scaling the difficulty of logic-preserving variations. It finds that performance drops up to 5x compared to random sampling and that training on harder variations improves robustness.
This paper investigates the 'Text Uncanny Valley,' a phenomenon in which LLM performance on information retrieval tasks degrades non-monotonically as word-boundary corruption increases. The authors propose a mode transition hypothesis to explain this U-shaped performance curve and demonstrate its relevance to real-world noisy text inputs.
This paper introduces WARDEN, a distributionally robust adversarial training framework for large language models that uses f-divergence to dynamically reweight adversarial examples, significantly reducing attack success rates while maintaining computational efficiency.
This paper presents a comprehensive empirical evaluation of how large language models handle corruptions in chain-of-thought reasoning steps, testing 13 models across 5 perturbation types (MathError, UnitConversion, Sycophancy, SkippedSteps, ExtraSteps) on mathematical reasoning tasks. The findings reveal heterogeneous vulnerability patterns with implications for deploying LLMs in multi-stage reasoning pipelines.
This paper investigates how supervised fine-tuning (SFT) increases hallucinations in LLMs by causing knowledge degradation and proposes a self-distillation-based method to mitigate this issue while preserving pre-existing factual knowledge. The authors identify semantic interference among overlapping representations as the primary mechanism behind SFT-induced hallucinations and demonstrate solutions including parameter freezing and self-distillation.