Tag
Step 3.7 Flash is a compact model that combines vision, live data retrieval, and code generation to autonomously build a working dashboard from a screenshot in minutes, at a cost of about 50 cents per session.
A comparison claiming that Google's Gemini outperforms Anthropic's Claude in vision and world knowledge tasks.
Mistral OCR 4 has been released with bounding boxes, a highly requested feature. The poster tested it on form filling and found that it works well, though not perfectly.
This post presents the second update of a benchmark for local vision language models, comparing 23 models across 30 images with revised settings, and provides performance recommendations for different VRAM tiers. Key findings include that thinking mode hurts vision performance and that MoE models underperform dense models for perception tasks.
DeepSeek announces a new vision capability, likely a vision-language model, expanding its AI offerings.
Introduces Temporal Difference in Vision (TDV), a new paradigm for representation learning that relies solely on causality, eliminating the need for augmentations, masking, or cropping, and matches state-of-the-art methods like DINO and iBOT on dense spatial tasks.
Introduces Temporal Difference in Vision (TDV), a novel visual representation learning paradigm that learns useful representations without augmentations, masking, cropping, or reconstruction, and matches state-of-the-art methods on dense spatial tasks.
Claude Fable has matched GPT's performance on the challenging ZeroBench vision benchmark, with comparable pass@5 and pass^5 scores.
Anthropic releases Fable 5, claiming it is state-of-the-art on key benchmarks in software engineering, science, knowledge work, and vision, exceeding all previously available models.
The author announces a personal project to build a parallel internet called The Thinnernet, drawing inspiration from Steve Jobs and previous work on knowledge bases and low-power operating systems.
OpenAI outlines its plan to make AI broadly beneficial, drawing parallels to the transformative impact of electricity. The company emphasizes building AI that empowers people, distributes power, and remains aligned with human intent.
MaskAlign proposes a token-subset representation alignment method that improves diffusion transformer training by reducing reliance on complete token sets and maintaining stable alignment under perturbations.
A recap of an extraordinary week in open AI, featuring over 25 open-weight model releases across LLMs, image generation, audio/speech, vision, and video/3D, with notable contributions from NVIDIA, Google, and others.
A merged PR in llama.cpp implements a new 'Gemma 4 Unified' model type, which suggests that Google's upcoming release will use a transformer-less vision tower.
NEPA is a new method for visual self-supervised learning and generative pretraining that predicts the next embedding autoregressively, and has been added to a benchmark for evaluation.
llama.cpp release b9406 fixes a crash (a GGML_ASSERT failure) that occurred when using MTP with MoE vision models such as Qwen3.6-35B-A3B.
ChainzRule introduces a neural architecture with learnable polynomial layers and differential regularization. It reports sample-efficient, robust performance across tabular, NLP, and vision tasks, with results on Pima Diabetes, SST-5, Yelp Full, and CIFAR-10-C.
The author praises Lu Qi for insights on sandbox/container security that he shared a year ago and that have since been validated, and emphasizes the core role of sandboxes in observing reward hacking.
Elon Musk's original goal for SpaceX was to increase NASA's budget, not to start a launch company.
Tesla states that the legacy of Model S and Model X will continue in its autonomy vision.