Tag
Pramaana Labs raised $27M in seed funding led by Khosla Ventures to apply formal verification (using the Lean programming language) to improve AI reliability in high-stakes domains like law, drug discovery, and tax preparation.
A user reports that Google AI repeatedly gave the wrong answer to the query 'slimmest laptop ever' and failed to learn from its mistakes even after acknowledging them.
The author reports that Google's Gemini consistently fabricates technical answers, inventing features and instructions rather than admitting uncertainty, posing risks for technical guidance.
A user shares their experience using ChatGPT for complex medical caregiving and proposes the idea of aggregating multiple AI models to improve reliability by seeking consensus among different LLMs.
An analysis finds that 28.9% of GPT 5.5's failures on SWE-Bench Pro are due to broken or incorrect test cases, and that similar issues affect other major AI benchmarks, raising concerns about the accuracy of current evaluation methods.
Explains what large language models actually do (next-token prediction) and why they sound confident even when wrong. Offers a mental model and verification checklist for using LLMs safely.
The article discusses the industry consensus that AI is becoming extremely capable but remains unreliable for high-stakes tasks. It emphasizes that current systems optimize for plausibility rather than guaranteed truth, and argues that the path forward is layered verification rather than a single perfect model.
This article discusses the importance of faithfulness in LLM optimization, introducing a Structural Fidelity Score that measures drift across word overlap, constraint survival, and task-type match to ensure prompt optimization does not sacrifice intent.
An audit by the Office of the Auditor General of Ontario found that AI note-taking systems approved for healthcare routinely fabricate information, insert incorrect drug details, and miss critical patient data, and that accuracy accounted for only 4% of the score used to evaluate the systems.
GigaAI announces a new hallucination correction feature that reduces the model's hallucination rate to approximately 1%, claiming superior reliability compared to frontier models.
A paper from Stanford and Harvard researchers argues that agentic AI systems fail in real-world deployment not because they lack intelligence, but due to fundamental issues that cause demo performance to collapse in practice.
A preprint analyzes why computer-use agents succeed once but fail on repeated executions, attributing the unreliability to execution stochasticity, task ambiguity, and behavioral variability, and advocating repeated evaluation and stable strategies.
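
The repeated evaluation advocated in the last item can be illustrated with a minimal sketch. It is not the preprint's method: `repeated_eval` and `run_task` are hypothetical names introduced here, and the sketch assumes each run of a task can be reduced to a simple pass or fail. It reports how often the task succeeded and whether it succeeded on every run, which is the gap a single successful demo hides.

```python
from typing import Callable, Dict, Union


def repeated_eval(run_task: Callable[[], bool], trials: int) -> Dict[str, Union[int, float, bool]]:
    """Run the same task `trials` times and summarize how stable the outcome is.

    `run_task` is a hypothetical callable that executes one attempt and
    returns True on success, False on failure.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")

    # Execute every trial; a single success says little about reliability.
    outcomes = [bool(run_task()) for _ in range(trials)]
    successes = sum(outcomes)

    return {
        "trials": trials,
        "successes": successes,
        "success_rate": successes / trials,
        # True if at least one run passed (what a one-off demo shows).
        "succeeded_once": successes > 0,
        # True only if every run passed (what deployment needs).
        "succeeded_every_time": successes == trials,
    }


if __name__ == "__main__":
    # Stand-in for an agent: a fixed sequence of outcomes, for illustration only.
    recorded = iter([True, False, True, True])
    print(repeated_eval(lambda: next(recorded), trials=4))
```

Comparing `succeeded_once` with `succeeded_every_time` makes the difference between a one-off success and consistent behavior explicit; a real evaluation would replace the recorded outcomes with actual agent runs.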