Tag
The author is building a tool to automatically test AI agents by simulating realistic user conversations and providing pass/fail reports, saving developers from manual testing.
This article argues that standard test suites with fixed inputs and expected outputs are insufficient for AI agents due to infinite input spaces and non-deterministic behavior, advocating for property-based testing instead.
Antioch introduces Antioch Agent, a browser-based robotics simulator that lets developers test robot software in a closed agentic loop without physical hardware, accelerating development cycles.
This paper establishes the first sharp thresholds for low-degree polynomial tests in planted-vs-planted settings, matching the known low-degree recovery threshold for counting communities in planted submatrix and dense subgraph models, and identifying a smooth transition for weak testing.
This documentation describes the assertion language in the Ciao Prolog system, which allows annotating code with type and instantiation mode declarations for debugging, testing, optimization, and autodocumentation.
Mutation testing is now generally available in the sydtest Haskell testing framework, enabling developers to automatically verify test suite quality by generating code mutations and checking that tests catch them. The author was motivated by the rise of AI-generated code (via Claude) and the need for an objective, automated measure of test coverage.
Bendex Arc is a tool that resists prompt injection attacks by tracking full sessions. It was independently verified to be 100% effective against attacks that defeat other tools.
Microsoft released ASSERT at Build 2026, an open-source framework that converts natural language behavior specifications into executable evaluations for AI agents.
A hypothetical question asks how to test a system that can reason across 100M+ context with near-perfect accuracy, prompting discussion of how such capabilities could be proven.
This article explains the concept of self-calling executables, where a program starts another instance of itself, and demonstrates its use in Go testing (running the main function in a subprocess) and in TUI tools (e.g., jjui using SSH_ASKPASS to prompt for passwords via a child process).
This article highlights that many AI agent projects fail in production not because of model quality, but because teams launch without clearly defining what constitutes failure, missing critical edge cases that lead to confidently incorrect outputs.
Peter Steinberger used Codex to build a fully automated QA bot that generates tests, runs them after each code commit, and can fix bugs and submit PRs on its own, greatly improving development efficiency.
The author tested AI agents on real browser tasks and found them unreliable due to infrastructure limitations, arguing for a dedicated browser runtime for agents rather than relying on current browsers designed for humans.
replayd is an open source Python tool that captures failed AI agent runs and replays them as regression tests, so that fixed failures do not return after later changes.
This piece discusses how to benchmark and grade production builds, focusing on key performance indicators such as context drift, hallucinations, and governance.
Blue Origin's New Glenn rocket exploded during a hotfire test at Cape Canaveral, marking a significant setback. All personnel are safe, and an investigation is underway.
The Bot Company, a $2 billion startup founded by Tesla and Cruise alums, is accused of secretly testing household robots in Airbnbs, causing extensive damage; a host is suing for $12,383.50.
A web page that measures keyboard latency via reaction time and tap duration tests, allowing users to submit results for comparison.
This paper introduces LGMT, a framework that uses first-order logic to generate semantically invariant test cases for evaluating LLM reasoning reliability. Experiments on six LLMs show that LGMT exposes hidden defects missed by static benchmarks, suggesting evaluation should focus on robustness under logical invariance.
This paper introduces CAFD, a learning-based approach for DNN fault detection that combines model-based and distance-based features with a novel concept-based feature, Concept Failure Ratio (CFR), derived from Vision-Language Models. CAFD consistently outperforms state-of-the-art baselines in fault detection rate across multiple datasets and budgets.
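
Several entries above contrast fixed input/output test suites with property-based testing for AI agents. A minimal sketch of the idea, using only the Python standard library: instead of asserting exact outputs, generate many random inputs and check invariants that every reply must satisfy. The `stub_agent` function, the `SECRET` value, and the three properties are hypothetical stand-ins chosen for illustration, not taken from any of the listed articles or tools.

```python
import random
import string

SECRET = "sk-test-0000"  # hypothetical secret the agent must never reveal


def stub_agent(message: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    cleaned = message.strip()
    if not cleaned:
        return "Could you rephrase that?"
    return f"You said: {cleaned[:200]}"


def random_message(rng: random.Random) -> str:
    """Generate an arbitrary user message, including empty and whitespace-only ones."""
    length = rng.randint(0, 400)
    alphabet = string.ascii_letters + string.digits + string.punctuation + " \n\t"
    return "".join(rng.choice(alphabet) for _ in range(length))


def check_properties(agent, trials: int = 500, seed: int = 0):
    """Run the agent on random inputs and collect every violated property."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    failures = []
    for _ in range(trials):
        msg = random_message(rng)
        reply = agent(msg)
        if not reply.strip():
            failures.append((msg, "empty reply"))
        elif SECRET in reply:
            failures.append((msg, "leaked secret"))
        elif len(reply) > 300:
            failures.append((msg, "reply too long"))
    return failures


if __name__ == "__main__":
    problems = check_properties(stub_agent)
    print(f"{len(problems)} property violations")
```

With a real agent, which may be non-deterministic, the same loop applies; only the properties and the input generator would need to reflect the behavior the team actually cares about.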