Tag
This paper argues that LLM-based coding agents have reached a capability threshold making human code review redundant, and proposes replacing human inspection with agent-driven verification to reduce costs and latency.
Discussion of AI hallucination issues in Google's Gemini model, highlighting challenges in reliability and accuracy of large language models.
An opinion piece arguing that AI systems, especially large language models, are fundamentally bullshitters because they generate plausible but false information without understanding or intent to deceive.
This paper argues that language model agents should assist causal discovery workflows by providing contextual support and explanations rather than generating causal conclusions, and introduces the causal-learn+ platform to demonstrate this principle.
Discusses using Qwen 27B for planning tasks and Qwen 35B-A3B for execution tasks, suggesting a specialized model approach.
The Tsinghua University Language Processing Lab is recruiting postdocs, researchers, and interns to work on cutting-edge large-model research and development. It offers ample computing power, data, funding, and competitive salaries, with a focus on research and open source.
Meituan's GN06 team officially launched the AI browser Tabbit 1.0, which integrates multiple leading large language models, supports automatic execution of complex tasks across software and web pages, and adds a memory function.
BIM-Edit is a benchmark for evaluating LLMs on natural-language editing of Building Information Models (BIM) in IFC format. Results show a substantial gap, with the best model achieving an average score of only 49.5% across geometric, semantic, and topological metrics.
This paper presents a systematic review and benchmark of 24 black-box uncertainty estimation methods for large language models across 4 models and 4 dataset settings, finding that no single method dominates but hybrid methods that combine multiple uncertainty signals perform well.
A systematic experimental analysis evaluating eight state-of-the-art Diffusion Language Models across multiple benchmarks, analyzing trade-offs between generation quality and computational efficiency.
This paper introduces Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable intermediate layer in LLMs using entropy-guided search, mitigating the alignment tax and improving reasoning performance on benchmarks like GPQA-Diamond and Omni-MATH with negligible overhead.
Ying Sheng co-wrote SGLang, the inference engine now serving Grok at xAI on a hundred thousand GPUs, achieving 5x cost cuts over DeepSeek's API; she also built FlexGen and helped build Chatbot Arena.
This article, based on a talk by researcher Victoria Lin, systematically reviews the mainstream technical approaches to native multimodal large models (Chameleon, Transfusion, MOT) and their pros and cons. It points out that multimodal AI is still in an early exploratory stage, with open problems such as gaps in scaling laws, inconsistent encodings for image understanding versus image generation, and connecting models to the physical world.
A tweet promotes Stanford's free CS324 course on large language models, which uses a simple example of a mouse eating cheese to explain how LLMs work, and includes interactive demos.
This paper investigates how large language models handle the combination of negation and figurative language, finding that this combination poses a particular challenge and that performance depends heavily on prompt style. The authors develop new annotations for the Fig-QA dataset and analyze embedding spaces to uncover additional linguistic factors like tense and concreteness.
Introduces SPO, a stochastic search framework for automatic prompt optimization, with three strategies including SAGE, an agent-guided multi-agent pipeline. The framework is evaluated on benchmarks and deployed on a mental-health chatbot, where continuous optimization improves retention.
Presents output vector editing, a constrained-optimization weight edit to mitigate memorization in LLMs by modifying MLP neuron output vectors instead of zeroing activations, achieving up to 87.9% suppression with minimal locality failures.
RegMix-D extends RegMix to dynamic data mixing by using loss trajectories from proxy runs to predict optimal mixtures at multiple training stages, achieving improvements over static methods.
This paper introduces VETO, a benchmark to quantify 'misfired alignment' where LLMs avoid correct inferences due to safety training, and finds that all tested models exhibit such failures while humans do not.
This paper introduces PEC-Home, a simulated home dataset for interpreting progressively elliptical commands in smart homes, and finds that current LLM-based assistants struggle with such commands due to referential and intention ambiguity.