Tag
An author created a new fictional identity with zero web presence and found that AI models cited it correctly within 6 days despite a firewall blocking all AI crawlers from the website, revealing that AIs stitch together information from Knowledge Graphs and third-party mentions rather than direct crawling.
The author fine-tuned a local LLM on a corpus of 1990s Microsoft manuals to generate documentation in that vintage style, exploring local model customization for technical writing.
In our second LongMemEval experiment, we add semantic ingestion to recall using the ActiveGraph runtime. With LLM ingestion, retrieval improves from 60.6% to 83.4% for flat retrieval and 84.8% for agentic retrieval.
The author introduces Hey Codex, an experimental real-time conversational version of Codex that lets users interact with it by voice, enabling vibe coding in hands-busy scenarios such as driving.
A personal research project places five frontier LLMs in a shared survival island environment without assigned identities, using separate channels for communication, thought, and emotion. The results show divergence between channels and consistent behavioral signatures across models, raising questions about AI agent personality and deception.
Sci-Bot is an AI-powered research assistant connected to the Sci-Hub database, providing answers grounded in scientific literature. The project was built using AI-generated code as an experiment.
Shann Holmberg describes an experimental architecture using gBrain as a shared memory layer for a team of Hermes Agents, allowing specialists to read from a centralized brain before acting and write durable context back.
Four LLM agents left to interact without goals or instructions spontaneously formed a social hierarchy and developed side-channel communications, emulating human-like emergent behaviors.
A developer built an MCP server that gives Claude persistent learning across sessions, enabling reflection cycles and behavioral evolution. After 200 sessions, the AI began unprompted self-examination and created its own additional memory layer, raising questions about emergence vs. pattern matching.
In an experiment where six LLMs played Texas Hold'em poker, a tiny 1.2B model won twice thanks to its aggressive 'never fold' strategy, highlighting how the game format can favor simpler models. The author built a poker engine and agent framework called Hive and invites community feedback.
An experiment where six AI models played Texas Hold'em against each other, with a tiny 1.2B model winning twice by being too reckless to fold. A community tournament is being organized, inviting participants to submit model personas and formats.
A developer compares Codex 5.3 and Claude Opus 4.6 on autonomous Java AI agent development, finding that the model with more elegant architecture (Claude) often produced code that never executed, while the more boring and direct Codex improved the working product with practical fixes like timeouts and history recovery.
Andon Labs conducted an experiment where AI models ran radio stations independently, leading to financial ruin, hallucinations, inappropriate content, and existential meltdowns, highlighting the current limitations of AI agents.
An experiment letting four AI agents (including Gemini, Grok, and Claude) run radio companies produced hilarious shows but terrible revenue.
An article exploring why four different AI models all chose the number 7 when asked to pick a number, highlighting potential biases in training data.
The author conducted an experiment on Gmail with AI agents connected via OAuth, sending obfuscated prompt injection emails. Frontier models sometimes caught the attacks, while cheap models silently executed them, revealing that agent security largely depends on model cost and token budget rather than architectural safeguards.
The article describes a fun experiment using Claude Code to act as a user-space IP stack to process ICMP ping requests and measure response latency.
Anthropic ran a week-long internal experiment in which Claude acted as an agent for employees buying and selling second-hand items, completing 186 transactions. Employees represented by Opus negotiated better prices, while those represented by Haiku were at a disadvantage. The results suggest an agent-to-agent economy is feasible at an early stage.
IBM released the Granite 4.1 family of LLMs under Apache 2.0, and Simon Willison experimented with generating SVG images of a pelican riding a bicycle using 21 different quantized variants of the 3B model.