reliability

#reliability

Gemini and AI Hallucination

Reddit r/artificial · 5h ago

Discussion of AI hallucination issues in Google's Gemini model, highlighting challenges in reliability and accuracy of large language models.

#reliability

I got tired of AI agents silently failing in production, so I built a runtime control layer for them

Reddit r/AI_Agents · yesterday

The author built a runtime control layer to address the problem of AI agents failing silently in production environments.

#reliability

Your automation "expert" built you a time bomb, and they'll ghost the second it goes off.

Reddit r/AI_Agents · 4d ago

A critique of poorly built automation systems created by so-called experts who ignore error handling, documentation, and governance, leaving clients with fragile workflows that fail in production.

#reliability

@xingyaow_: People have been asking why OpenHands V1 goes the opposite direction from Claude Managed Agents. I finally found the ti…

X AI KOLs Following · 5d ago

Blog post by Xingyao Wang explaining why OpenHands V1 chose a different architecture from Claude Managed Agents, arguing that reliability comes from implementation details rather than topology.

#reliability

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

arXiv cs.LG · 5d ago

This paper demonstrates that interventions on Sparse Autoencoder (SAE) features can be unreliable because suppressed behavior can recover through residual-space optimization, even while the intervention remains active. It reveals a critical gap between feature-level control and actual behavioral completeness in language models.

#reliability

Your best model probably isn't your best tool caller

Reddit r/AI_Agents · 6d ago

The article argues that tool-calling reliability often does not scale with model capability; smaller models can outperform larger ones in schema adherence and format discipline, suggesting that raw capability is not the sole factor in choosing a model for tool use.
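To make "schema adherence" concrete, here is a minimal sketch of how it could be measured across a batch of tool calls. It is not taken from the linked post: the `adheres` and `adherence_rate` helpers, the flat schema layout, and the `get_weather` tool are all hypothetical and chosen only for illustration.

```python
def adheres(call, schema):
    """Return True if a tool call matches a flat schema (hypothetical layout)."""
    args = call.get("arguments")
    # The tool name must match and the arguments must be a JSON object.
    if call.get("name") != schema["name"] or not isinstance(args, dict):
        return False
    required = schema["required"]
    # Exactly the required arguments must be present: none missing, none extra.
    if set(args) != set(required):
        return False
    # Each argument must have the expected Python type.
    return all(isinstance(args[key], typ) for key, typ in required.items())


def adherence_rate(calls, schema):
    """Fraction of calls that adhere to the schema."""
    return sum(adheres(call, schema) for call in calls) / len(calls)


# Hypothetical tool schema and model outputs, for illustration only.
schema = {"name": "get_weather", "required": {"city": str, "days": int}}
calls = [
    {"name": "get_weather", "arguments": {"city": "Oslo", "days": 3}},
    {"name": "get_weather", "arguments": {"city": "Oslo", "days": "3"}},  # wrong type
    {"name": "get_weather", "arguments": {"city": "Oslo"}},               # missing argument
    {"name": "weather", "arguments": {"city": "Oslo", "days": 3}},        # wrong tool name
]
rate = adherence_rate(calls, schema)  # 1 of 4 calls adheres
```

Running the same check over outputs from several models would give a per-model adherence rate that can be compared separately from general capability scores.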

#reliability

@RayFernando1337: Talking about fast and reliable agents. 3x faster start times. 99.99% error free turns (higher reliability)

X AI KOLs Following · 2026-06-16

Claims significant improvements in agent performance: 3x faster start times and 99.99% error-free turns.

#reliability

Probably raises $9M to build a more reliable kind of AI

TechCrunch AI · 2026-06-16

Probably raises a $9M seed round from Andreessen Horowitz to build a more reliable AI system. Its deterministic validator harness catches LLM hallucinations, which enables smaller models to run on local hardware.

#reliability

ToolMenuBench: Benchmarking Tool-Menu Filtering Strategies for Reliable and Efficient LLM Agents

arXiv cs.AI · 2026-06-16

ToolMenuBench is a benchmark for evaluating tool-menu filtering strategies in multi-step LLM agents. It shows that causal minimal tool filtering significantly improves task success and reduces token usage compared to unfiltered exposure.

#reliability

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

arXiv cs.AI · 2026-06-16

This paper introduces Metric Match, a method for selecting a subset of samples for human annotation to estimate LLM judge reliability more efficiently, reducing annotation costs by 32.5% and achieving a win-rate of 0.838 against random selection.

#reliability

Agent checkpointing is far from production-grade resiliency

Reddit r/AI_Agents · 2026-06-15

A blog post argues that current agent checkpointing is insufficient for production-grade resiliency, highlighting gaps like failure detection, automatic retries, and high availability, and suggests building agents on a highly-available orchestration layer.

#reliability

@populartourist: Having worked consistently with Qwen3.6 27B NVFP4 on repos - it's clear that this quant is not reliable, at least for c…

X AI KOLs Timeline · 2026-06-15

The user reports that the Qwen3.6 27B NVFP4 quantization is unreliable for coding, with inconsistent quality despite high throughput, and suggests that Q4_K_M may be more consistent.

#reliability

Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge

arXiv cs.CL · 2026-06-15

This paper proposes Judge-LS, a protocol to evaluate whether LLM-as-a-judge models are invariant to language switching between English and Chinese. It finds that switching languages causes 10.7-14.4% preference flips and that judges achieve their highest accuracy in English.

#reliability

@rohanpaul_ai: Univ of Texas paper shows AI agents can slowly become less reliable after deployment, even when the model itself does n…

X AI KOLs Following · 2026-06-14

A University of Texas paper introduces AgingBench, a benchmark that reveals AI agents can become less reliable after deployment due to memory and maintenance decay, even when the underlying model remains unchanged.

#reliability

Companies are learning that trying to force non-deterministic math into a zero-error business environment creates more work, not less.

Reddit r/ArtificialInteligence · 2026-06-13

Companies are realizing that forcing non-deterministic AI into zero-error business environments is counterproductive, leading to budget cuts and failed pilot programs as ROI remains elusive.

#reliability

My AI agent keeps failing the same QA task 10+ times. How do I fix the workflow?

Reddit r/AI_Agents · 2026-06-12

A user reports repeated failures when using an AI agent (Hermes + Claude Code) for exploratory QA on a web app, citing DB errors, cache staleness, and infrastructure debugging. They seek advice on creating a reliable workflow with pre-checks, cache clearing, and limiting agent scope.

#reliability

(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable

arXiv cs.AI · 2026-06-12

This paper proposes that reliability in AI-assisted social science research depends on decision architecture—how cognitive labor is divided between humans and machines. Through a pre-specified factorial experiment, the authors show that an unconstrained multi-agent baseline fails in 72% of runs, while one organized with three architectural commitments (LLMs restricted to reasoning, deterministic data/estimation, and three human decision gates) fails in only 16%.

#reliability

I think long context agents are failing in a very boring way

Reddit r/artificial · 2026-06-12

An opinion piece arguing that long context windows don't equate to memory and that agent failures are often mundane, like forgetting constraints or rereading files, emphasizing that reliability depends on context architecture decisions.

#reliability

We hit the retry problem hard enough that we open-sourced a fix

Reddit r/AI_Agents · 2026-06-11

Replaysafe is an open-source npm library that ensures idempotent retries by fingerprinting operations, preventing duplicate side effects in AI agent workflows. It integrates with popular frameworks like LangGraph and CrewAI.
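The general idea of fingerprint-based idempotent retries can be sketched as follows. This is not Replaysafe's API (which is an npm library and is not shown in this listing); the `IdempotentRunner` class, the in-memory cache, and the `send_email` operation are hypothetical and only illustrate the technique in Python.

```python
import hashlib
import json


class IdempotentRunner:
    """Runs a side-effecting operation at most once per fingerprint (hypothetical helper)."""

    def __init__(self):
        self._results = {}  # fingerprint -> result of the first successful run

    @staticmethod
    def fingerprint(name, args):
        # Canonical JSON (sorted keys) so argument order does not change the hash.
        payload = json.dumps({"op": name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, name, args, fn):
        key = self.fingerprint(name, args)
        if key in self._results:
            # Retry of an operation that already succeeded: skip the side effect.
            return self._results[key]
        result = fn(**args)          # if this raises, nothing is cached
        self._results[key] = result  # record only after success
        return result


sent = []  # stands in for an external side effect


def send_email(to, body):
    sent.append((to, body))
    return len(sent)


runner = IdempotentRunner()
first = runner.run("send_email", {"to": "a@example.com", "body": "hi"}, send_email)
# A retry with the same logical arguments (different key order) does not send again.
retry = runner.run("send_email", {"body": "hi", "to": "a@example.com"}, send_email)
```

In this sketch a failed call leaves no cache entry, so a later retry executes the operation again. A production version would need a durable store rather than an in-memory dictionary so that fingerprints survive process restarts.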

#reliability

Most AI agents don't fail because the model is bad.

Reddit r/AI_Agents · 2026-06-10

AI agents often fail due to messy environments rather than bad models; improving environment stability makes simple agents perform well.
