Tag
This paper introduces TerminalWorld, a benchmark for evaluating AI agents on real-world terminal tasks, derived from 80,870 terminal recordings. Current systems achieve at most a 62.5% pass rate, highlighting the challenges of authentic terminal workflows.
A guide on structuring Generative AI projects for scalability and efficiency, covering directory organization, configuration, data management, and code structure.
This paper introduces projection agents for graph combinatorial optimization using reinforcement learning and graph neural networks, operating in a continuous action embedding space to improve generalization and scalability, and releases the LaGCO-RL library.
This paper presents a microservice architecture for production document AI pipelines that combine classification, OCR, and LLM extraction, sharing design decisions and batch profiling insights that reveal OCR, not LLM parsing, dominates latency.
The article argues that the real challenge in AI isn't just building smarter models but making them cost-efficient at scale, highlighting the importance of reducing token usage, improving speed, and optimizing infrastructure.
This paper presents the first large-scale empirical analysis of 118 transformer models, revealing critical performance walls where success rates drop from 88.1% at 512 tokens to 0% at 2048 tokens, challenging prevailing scaling assumptions.
This paper presents NSPI, a neuro-symbolic framework that combines LLMs and symbolic computation to prove polynomial inequalities. It uses LLM-generated sum-of-squares conjectures, refines them symbolically, and formally verifies the proofs in Lean, demonstrating scalability on polynomials with up to 10 variables.
The author created a repository called agent-automation-creator, a framework for building and evaluating reliable, scalable AI automations, and is seeking community feedback.
Teams scaling OpenAI usage face challenges in understanding cost drivers per feature, team, and customer, often relying on manual logging or tools like Finout for cost allocation and anomaly detection.
Browser Use describes two patterns for isolating AI agents that execute code: isolating the tool versus isolating the agent. The team implemented the agent isolation pattern using Unikraft micro-VMs on AWS, achieving secure, scalable, and disposable sandboxes.
This article covers a new paper on Elastic Attention Cores for Vision Transformers, which proposes a core-periphery block-sparse attention structure that improves scalability and accuracy compared to dense self-attention models like DINOv3.
This paper introduces LC-MAPF, a pre-trained model with a learnable communication module for multi-agent pathfinding that improves coordination and outperforms existing learning-based solvers while maintaining scalability.
Interfaze introduces a hybrid AI model architecture combining CNN/DNN specialization with transformer capabilities, achieving superior accuracy on deterministic tasks like OCR and translation while maintaining cost efficiency at scale.
The author notes that the Every team is highly focused on AGI and identifies infrastructure as a critical bottleneck, predicting it will become even more severe as models like Claude advance.
Ex-Google engineers published a map of Google's internal tools and their open-source equivalents, providing a cheat code for building scalable infrastructure.
This paper proposes Node-Edge Policy Factorization (NEPF) to address scalability issues in solving Vehicle Routing Problems on multigraphs. It combines pre-encoding edge aggregation with a hierarchical reinforcement learning method to achieve state-of-the-art solution quality with faster training and inference.
The article analyzes the scalability limitations of using PostgreSQL as a job queue, specifically highlighting performance bottlenecks caused by MultiXact SLRU contention under high concurrency. It explains why this architecture fails in production despite working well in development and suggests considering alternatives.
Ben Dicken emphasizes that sharding is essential for building scalable databases and architecting data-intensive applications.
The article discusses the growing importance of reliability, security, and user protections as AI models become more capable and personalized.
OpenAI shares how it reimagined its support operations using AI to handle millions of requests annually by creating an operating model where every interaction improves the next. The approach combines chat/email/phone surfaces, continuously improving knowledge bases, and human-AI evaluation loops that empower support reps to act as builders and inform product improvements.