Tag
The paper introduces Chimera Training, a method for logical anomaly detection that constructs counterfactuals at the feature level to train neural rule evaluators without real anomalous images. It improves rule-level anomaly detection on benchmarks such as CLEVRER, OpenImages, and VidOR.
ChaosBench-Logic v2 is a large-scale benchmark of 40,886 questions over 165 dynamical systems that evaluates LLMs' logical reasoning abilities, revealing near-random performance on regime transition reasoning and systematic failure modes even in frontier models.
This paper introduces methods for generating embeddings for Horn logic reasoning with a triplet loss, including balanced generation of training examples and added emphasis on hard examples. The methods improve the efficiency of downstream logical reasoning.
LLMEval-Logic is a new Chinese benchmark for evaluating logical reasoning in LLMs, featuring solver-verified answers and adversarial hardening. The benchmark reveals significant gaps in current models, with the best reaching only 37.5% accuracy on hard items.
This paper introduces ScaleLogic, a framework demonstrating that RL training compute scales as a power law with reasoning depth in LLMs. It highlights that logical expressiveness is key to improving downstream transfer and training efficiency.
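For readers unfamiliar with the term, the power-law relation in the ScaleLogic summary can be written out generically. This is only a sketch of the functional form: the symbols C (RL training compute), d (reasoning depth), a, and \alpha are assumptions introduced here for illustration, and the summary gives no fitted values.

```latex
% Generic power law; a > 0 and \alpha are unspecified constants (assumed notation).
C(d) = a \, d^{\alpha}
\quad\Longleftrightarrow\quad
\log C(d) = \log a + \alpha \log d
```

Under this form, compute plotted against depth on log-log axes falls on a straight line with slope \alpha.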