Tag
A new tool built on Claude Code enables autonomous testing of iOS apps by navigating every screen, testing flows, reading debug logs, and producing structured bug reports from a single prompt.
The article describes a test with Grok 4.3 that examines how a so-called existence-logic architecture affects the AI's decision-making on questions of global responsibility. The results show clear differences in approach between an unstructured prompt and a framed prompt.
GPT 5.5 fails to solve the Jane Street puzzles that its predecessor also could not solve, suggesting continued limitations in AI reasoning.
Codex has been updated to test web applications at various viewport sizes using an in-app browser, featuring automated click-through validation, screenshot feedback for long runs, and accelerated testing by disabling animations.
PACT introduces a head-to-head negotiation benchmark for LLMs using a 20-round buyer-seller bargaining game to test persuasion and adaptation. Top performers include GPT-5.5 and Opus 4.7, with ratings computed via Glicko-2 on an Elo-like scale.
OpenAI publishes acknowledgements for external red teamers and evaluators who contributed to GPT-4o's safety testing and system card development. The document credits numerous individual researchers and organizations including METR and Apollo Research.