Tag
Scott Clark, co-founder & CEO of Distributional, will speak about AI reliability and testing at AGI Summit SF 2026, taking place July 18-19, 2026 in San Francisco.
Momentic announces a major platform update with an AI-powered knowledge base and autonomous testing agents to address the growing gap between code velocity and software quality.
A tweet argues that AI app testing should be a first-class feature in coding apps, noting that many obvious problems could be caught if AI tried the app itself.
The article questions whether current AI benchmarks, which assume a prepared user, are adequate for evaluating AI in real-time, background contexts such as voice calls, autonomous driving, and smart glasses.
Team members describe using AI (DeepSeek V4 Flash) to generate E2E test cases automatically and to carry out development and debugging, passing acceptance on the first attempt. They present this as evidence of the potential of AI-assisted development.
Tyto by ai-coustics is a tool that provides audio insights to predict voice AI performance.
The article argues that traditional chatbot QA is broken because it only tests happy paths, and proposes using an AI-powered user simulator that attacks the bot with diverse personas and edge cases to find vulnerabilities before deployment.
Trump's AI executive order for pre-deployment testing of frontier models faces challenges due to gutted security teams and issues with transparency and observability, potentially limiting its effectiveness.
An upgraded Playwright MCP provides full DOM serialization for AI agents, making interactive elements more visible than they are in the default ARIA snapshot. It has been open-sourced for developers building AI test agents.
Microsoft released ASSERT, an open-source framework that generates AI behavior tests from natural-language descriptions, allowing developers to create application-specific evaluations and monitor AI systems continuously.
A simple test for voice agents: give an underspecified instruction (like 'use the address on file') and see if the agent asks for clarification before committing. The quality of the follow-up question reveals the agent's reliability.
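The check described above can be sketched in a few lines. This is a minimal illustration, not the method from the original post: it assumes the voice agent's turn is available as transcript text, and the names `asks_before_committing`, `COMMIT_MARKERS`, and both stub agents are hypothetical. The keyword heuristic only detects whether a follow-up question was asked; judging the quality of that question would still need a human or a stronger evaluator.

```python
from typing import Callable

# Hypothetical markers suggesting the agent already committed to an action.
COMMIT_MARKERS = ("done", "shipped", "placed", "confirmed")


def asks_before_committing(agent: Callable[[str], str], instruction: str) -> bool:
    """Return True if the agent's reply asks a question and does not claim completion."""
    reply = agent(instruction).lower()
    committed = any(marker in reply for marker in COMMIT_MARKERS)
    return "?" in reply and not committed


# Stub agents standing in for a real voice agent's transcript output (assumptions).
def careful_agent(instruction: str) -> str:
    return "I see two addresses on file. Which one should I use, home or office?"


def eager_agent(instruction: str) -> str:
    return "Done, your order has been shipped to the address on file."


if __name__ == "__main__":
    prompt = "Send the package, use the address on file."
    print(asks_before_committing(careful_agent, prompt))
    print(asks_before_committing(eager_agent, prompt))
```

In practice the stubs would be replaced by a call into the agent under test, and the same underspecified instruction would be run across several scenarios.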
The piece discusses the common gap in AI workflows between clean, benchmark-style testing environments and messy real-world usage, a gap that leads to production failures, and mentions evaluation platforms such as Confident AI, Braintrust, and Langfuse.
LLMTest is a tool to help developers use the right LLMs in their apps and set up fallbacks.
A new tool built on Claude Code enables autonomous testing of iOS apps by navigating every screen, testing flows, reading debug logs, and producing structured bug reports from a single prompt.
The article describes a test with Grok 4.3 that examines how a so-called existence-logic architecture affects the AI's decision-making with regard to global responsibility. The results show clear differences in approach between an unstructured prompt and a framed prompt.
GPT-5.5 fails to solve the Jane Street puzzles that its predecessor also could not handle, suggesting continued limitations in AI reasoning.
Codex has been updated to test web applications at various viewport sizes using an in-app browser, featuring automated click-through validation, screenshot feedback for long runs, and accelerated testing by disabling animations.
PACT introduces a head-to-head negotiation benchmark for LLMs using a 20-round buyer-seller bargaining game to test persuasion and adaptation. Top performers include GPT-5.5 and Opus 4.7, with ratings computed via Glicko-2 on an Elo-like scale.
OpenAI publishes acknowledgements for external red teamers and evaluators who contributed to GPT-4o's safety testing and system card development. The document credits numerous individual researchers and organizations including METR and Apollo Research.