ai-testing

#ai-testing

@HowToAI_: Someone built a tool that lets Claude Code autonomously test your entire iOS app. It navigates your entire app, opens eve…

X AI KOLs Timeline ↗ · 2d ago Cached

A new tool built on Claude Code enables autonomous testing of iOS apps by navigating every screen, testing flows, reading debug logs, and producing structured bug reports from a single prompt.


#ai-testing

What happens when an AI has to take on global responsibility? 🌏⚠️ We tested a new existence-logic architecture with Grok 4.3 on one of the hardest scenarios imaginable.

Reddit r/ArtificialInteligence ↗ · 3d ago

The article describes a test with Grok 4.3 that examines how a so-called existence-logic architecture affects the AI's decision-making about global responsibility. The results show clear differences in approach between an unstructured prompt and a framed prompt.


#ai-testing

GPT 5.5 Cannot Do These Puzzles

Reddit r/singularity ↗ · 3d ago

GPT 5.5 fails to solve Jane Street puzzles that its predecessor also could not solve, suggesting continued limitations in AI reasoning.


#ai-testing

@JamesZmSun: Codex can now use the in-app browser to test your app at different viewport sizes! It will control the device tool bar …

X AI KOLs Following ↗ · 4d ago

Codex has been updated to test web applications at various viewport sizes using an in-app browser, featuring automated click-through validation, screenshot feedback for long runs, and accelerated testing by disabling animations.


#ai-testing

PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups.

Reddit r/singularity ↗ · 6d ago

PACT introduces a head-to-head negotiation benchmark for LLMs using a 20-round buyer-seller bargaining game to test persuasion and adaptation. Top performers include GPT-5.5 and Opus 4.7, with ratings computed via Glicko-2 on an Elo-like scale.
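The clearing rule described above (a trade happens when bid ≥ ask, at the midpoint of the two) can be sketched in a few lines. This is only an illustrative sketch, not PACT's actual code: the function names `clear_round` and `play` are hypothetical, and the assumption that a game stops at the first cleared trade is ours, since the post does not say what happens after a trade clears.

```python
from typing import Iterable, Optional, Tuple


def clear_round(bid: float, ask: float) -> Optional[float]:
    """Return the trade price if bid >= ask (midpoint rule), else None."""
    if bid >= ask:
        return (bid + ask) / 2
    return None


def play(
    bids: Iterable[float],
    asks: Iterable[float],
    max_rounds: int = 20,  # 20 rounds, as stated in the post
) -> Optional[Tuple[int, float]]:
    """Walk through paired bids and asks, one pair per round.

    Assumption (not from the source): the game ends at the first
    round in which a trade clears. Returns (round_number, price),
    or None if no trade clears within max_rounds.
    """
    for round_no, (bid, ask) in enumerate(zip(bids, asks), start=1):
        if round_no > max_rounds:
            break
        price = clear_round(bid, ask)
        if price is not None:
            return round_no, price
    return None


if __name__ == "__main__":
    # Buyer raises the bid, seller lowers the ask, until they cross.
    print(play([40, 45, 52], [60, 55, 50]))
```

The messaging between the two models and the Glicko-2 rating step are left out here; the sketch covers only the per-round clearing condition.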


#ai-testing

GPT-4o System Card External Testers Acknowledgements

OpenAI Blog ↗ · 2024-08-08 Cached

OpenAI publishes acknowledgements for external red teamers and evaluators who contributed to GPT-4o's safety testing and system card development. The document credits numerous individual researchers and organizations including METR and Apollo Research.

