Tag
A discussion question about whether to evaluate a machine learning harness as a whole or to evaluate its components separately.
This article compares the traditional Software Development Life Cycle (SDLC) with the emerging 'agentic SDLC' approach, which incorporates AI agents into the software development process.
This article explains how to build a knowledge graph in Obsidian using Claude as an AI engine to find connections, arguing that note-taking systems become more valuable over time when notes are linked rather than isolated.
A Twitter thread shares the seven stages of AI-driven development from the Skills For Real Engineers project by @mattpocockuk, including alignment techniques like grill sessions and tooling for coding agents.
A submission was desk-rejected from NeurIPS based on an uncalibrated AI detector (Pangram), raising concerns about circularity in the review process and unvalidated false-positive rates on the target distribution.
This paper presents an evaluation methodology for LLM security detectors that addresses systematic weaknesses like per-dataset threshold tuning and undisclosed operating points. The framework uses cross-validation across 16 benchmarks, selects a single global operating point, and includes multiple diagnostics for generalization.
The article proposes that software engineering methodology should shift from a state perspective to a dynamical-systems perspective, emphasizing that attractor logic takes precedence over governance tools. It argues that, in the AI era, state space, attractors, trajectories, and controls must be modeled explicitly to address architectural drift caused by AI acting as a high-frequency perturbation source.
A Princeton study found data leakage in nearly 300 AI papers across 17 fields, causing overoptimistic results. The author highlights how easy it is to accidentally leak data and cautions against trusting impressive AI claims without checking for leakage.
A methodology for autonomously training transformer language models on a single consumer GPU, structured in six stages with verification gates and AGENTS.md specs for orchestration frameworks like OpenClaw.
Neuronal proteins from the brain drain to the dura, skull, and nose, whereas injected CSF-tracer accumulates in neck lymph nodes. The study highlights that the act of injection may perturb the system under investigation.
Science Superpowers is an open-source computational-science methodology for AI research agents, enforcing pre-registration and reproducible workflows to prevent p-hacking and HARKing.
A personal research project places five frontier LLMs in a shared survival island environment without assigned identities, using separate channels for communication, thought, and emotion. The results show divergence between channels and consistent behavioral signatures across models, raising questions about AI agent personality and deception.
The author presents LQS v3.1, an open methodology for rating AI training data using multi-oracle consensus and signed certificates, with a published paper and public index. The approach aims to solve the bottleneck of independent quality evaluation in the AI training data market.
Introduces 'personality engineering,' a methodology using AI agents to parameterize, manipulate, and evaluate negotiator personality based on the interpersonal circumplex, enabling controlled experiments in negotiation theory.
This article critiques common flawed methods for evaluating AI-assisted coding tools, such as counting lines of code, timing artificial tasks, and relying on developer self-reports, arguing for more rigorous research methods.
METR evaluated an early version of Claude Mythos Preview in March 2026 using its time-horizons task suite and estimated a 50%-time-horizon of at least 16 hours. The estimate places the model at the upper end of what current benchmarks can measure, with caveats about stability at longer time ranges.
A personal reflection on first-principles thinking versus reasoning by analogy, using examples from Elon Musk's approach to reducing rocket costs at SpaceX and from the author's own startup failure.
OpenAI publishes a white paper detailing their approach to external red teaming for AI models, outlining methods for selecting diverse red team members, determining model access levels, providing testing infrastructure, and synthesizing feedback to improve AI safety and policy coverage.