caching

#caching

Cache-Aware Prompt Compression: A Two-Tier Cost Model for LLM API Caching

arXiv cs.AI · yesterday

Proposes Cache-Aware Prompt Compression (CAPC), a method that combines query-agnostic compression with caching to reduce LLM API costs, demonstrating significant savings over existing approaches on Anthropic's Sonnet API and production workloads.

#caching

Model Routing Is Simple. Until It Isn’t.

Hugging Face Blog · 5d ago

IBM Research explains why model routing in agentic systems is more complex than a simple classification problem, highlighting how caching and hidden factors like actual workload cost and task difficulty estimation make routing a systems optimization challenge.

#caching

Using uvx in GitHub Actions in a cache-friendly way

Simon Willison's Blog · 2026-07-14

Simon Willison shares a cache-friendly approach to using uvx in GitHub Actions by setting an environment variable and incorporating it into the cache key to avoid repeated PyPI downloads.

#caching

Claude Code sends 33k tokens before reading the prompt; OpenCode sends 7k

Hacker News Top · 2026-07-12

A study comparing Claude Code and OpenCode reveals that Claude Code sends 33k tokens before reading the prompt while OpenCode sends only 7k, highlighting significant inefficiency in Claude Code's cache strategy and token usage.

#caching

llama.cpp b9966 for sm-tensor

Reddit r/LocalLLaMA · 2026-07-11

llama.cpp b9966 introduces a fix for the -sm tensor mode that caches regex patterns, eliminating 29 recompilations per tensor per token on the decode thread, resulting in significantly reduced CPU overhead.
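
The general idea of caching compiled patterns can be sketched as follows. This is a Python analogy only, not llama.cpp's C++ code: the pattern strings, `cached_pattern`, and `matches_any` are hypothetical names invented for illustration, and the counter exists only to make the saved compilations visible.

```python
import re
from functools import lru_cache

compile_calls = 0  # counts how many times a pattern is actually compiled


@lru_cache(maxsize=None)
def cached_pattern(pattern: str) -> re.Pattern:
    """Compile `pattern` on first use; later calls return the cached object."""
    global compile_calls
    compile_calls += 1
    return re.compile(pattern)


# Hypothetical tensor-name patterns (assumptions, not taken from llama.cpp).
PATTERNS = [r"blk\.\d+\.attn_q\.weight", r"blk\.\d+\.ffn_up\.weight"]


def matches_any(tensor_name: str) -> bool:
    """Return True if any pattern matches the whole tensor name."""
    return any(cached_pattern(p).fullmatch(tensor_name) for p in PATTERNS)


# Simulate a hot loop: many lookups, but each pattern is compiled only once.
for _ in range(1000):
    matches_any("blk.3.ffn_up.weight")
```

After the loop, `compile_calls` equals the number of distinct patterns rather than the number of lookups, which is the effect the fix is described as achieving on the decode thread.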

#caching

Show HN: Reame – a CPU inference server that gets faster as it runs

Hacker News Top · 2026-07-11

Reame is an LLM inference server built on llama.cpp that optimizes for CPU hardware by caching prompt prefixes and generated n-grams, becoming faster with repeated use. It is designed for cheap hardware like shared vCPUs and free tiers, targeting repetitive AI workloads such as document extraction and batch pipelines.

#caching

Speculative cache warming: warms your cache while you type your prompt, save 10-20s of wait time

Reddit r/LocalLLaMA · 2026-07-10

Speculative cache warming pre-processes the system prompt and tools array while the user types their prompt, saving 10-20 seconds of wait time on local LLM inference. This feature is part of the open-source OpenFox harness for local AI, improving interactivity without breaking cache consistency.

#caching

@CycleDecoded: Unbelievable! Sohu just open-sourced its prized Redis cloud platform. This level of automation is practically taking over backend engineers' jobs. The project is called CacheCloud, a monster that has handled 80 billion requests per day and 18TB of memory within Sohu Video. Now it's up on GitHub, racking up nearly...

X AI KOLs Timeline · 2026-07-09

Sohu has open-sourced its internal Redis cloud management platform, CacheCloud, which supports standalone, sentinel, and cluster modes. It offers one-click setup, monitoring alerts, elastic scaling, and more. The project has garnered nearly 9,000 stars on GitHub and is licensed under Apache-2.0.

#caching

@akshay_pachaar: https://x.com/akshay_pachaar/status/2074502882812952666

X AI KOLs Timeline · 2026-07-07

A practitioner's guide to KV cache management, introducing the open-source LMCache architecture that cuts input token costs by 90% and speeds up LLM inference by up to 14x by eliminating redundant context processing in agentic workflows.

#caching

Why false sharing alignment should be 128 bytes on x64

Lobsters Hottest · 2026-07-07

The article explains why data should be aligned to 128 bytes, rather than the usual 64, to avoid false sharing on x64: the spatial prefetcher on Intel CPUs since Sandy Bridge fetches cache lines in adjacent pairs. It gives the reasoning and a benchmark demonstrating the improvement.

#caching

Training Hybrid Block Diffusion Language Models with Partial Bidirectionality

arXiv cs.LG · 2026-07-07

This paper proposes a hybrid Mamba-attention architecture for block diffusion language models that restricts reverse Mamba scans to the active denoising block, enabling exact caching across blocks and achieving high throughput for long-context generation.

#caching

Do you need separate systems when you already have Postgres?

Hacker News Top · 2026-07-06

A comprehensive argument that PostgreSQL alone is sufficient for most application needs, including caching, search, job queues, and document storage, before reaching for additional specialized systems.

#caching

@no_stp_on_snek: ok read it over coffee, solid work. a few things that match what i keep hitting: diversity is independence times compet…

X AI KOLs Following · 2026-07-06

A detailed technical report on Hermes Mixture-of-Agents (MoA) findings, including benchmark results, caching economics, GPU topology studies, and a roadmap for future development.

#caching

@dangerm00se: The main thing I had fable doing was routing moa and rlm experiments spanning local api and cerebras. Get your agent to…

X AI KOLs Following · 2026-07-06

The author shares findings from Hermes Mixture-of-Agents experiments, including voter upgrades, GPU topology, and caching economics, showing that local prefix caching can make long agent sessions nearly free and that two independent GPU instances outperform a single partitioned one.

#caching

FreeBSD Ate My RAM

Hacker News Top · 2026-07-03

An article explaining why FreeBSD appears to use a lot of RAM, attributing it to disk caching and virtual memory management, similar to Linux's 'ate my RAM' phenomenon.

#caching

Improving token efficiency for GitHub Copilot in VS Code

Lobsters Hottest · 2026-07-02

The VS Code team details recent optimizations to GitHub Copilot's agentic harness, such as prompt caching and tool search, to improve token efficiency and reduce costs under usage-based billing.

#caching

The agent failure mode no eval catches: acting on a fact that was true when it was cached and wrong when it was used

Reddit r/AI_Agents · 2026-07-01

Discusses a blind spot in AI agent reliability: cached facts that were true when ingested but become stale by the time they are used, leading to coherent but incorrect actions. Proposes separating consistency (match with source) from currency (source still true now), and asks how the community handles this.
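
One way to picture the consistency/currency split is a cached fact that records when it was read and is re-checked before use. The sketch below is an assumption-laden illustration, not the approach from the post: `CachedFact`, `read_for_action`, and the `max_age` time-to-live are hypothetical, and a fixed age limit only bounds staleness rather than proving the source is still true.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CachedFact:
    value: str
    fetched_at: float  # seconds; when the value was read from the source
    max_age: float     # seconds the fact is assumed to remain true

    def is_fresh(self, now: float) -> bool:
        """Currency check: is the fact still within its assumed lifetime?"""
        return now - self.fetched_at <= self.max_age


def read_for_action(fact: CachedFact, fetch: Callable[[], str], now: float) -> CachedFact:
    """Return a fact safe to act on: reuse it if fresh, else re-read the source."""
    if fact.is_fresh(now):
        return fact
    return CachedFact(fetch(), fetched_at=now, max_age=fact.max_age)


# The cached value matched the source at ingest (consistent at t=0) ...
fact = CachedFact("in stock", fetched_at=0.0, max_age=60.0)

# ... but the source has changed by the time the agent wants to act.
def current_source() -> str:
    return "sold out"

early = read_for_action(fact, current_source, now=30.0)   # reuses the cached value
late = read_for_action(fact, current_source, now=120.0)   # re-reads the source
```

Here `early.value` is still the cached "in stock" while `late.value` reflects the refreshed source. Passing `now` explicitly keeps the check deterministic and easy to test.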

#caching

What’s your actual agentic web research stack? (fully local, no cloud APIs)

Reddit r/LocalLLaMA · 2026-07-01

The author details a fully local, no-cloud-API web research stack for AI agents, using self-hosted SearXNG, a persistent cache, TLS-fingerprinted fetching, headless browser fallback, and a local reranker, inviting community discussion on similar setups.

#caching

OTCache: Optimal Transport for Geometry-Aware Caching in Diffusion Models

arXiv cs.LG · 2026-07-01

OTCache is a training-free framework that uses optimal transport to predict caching schedules for diffusion models, achieving up to 4.7x acceleration on FLUX.1, Qwen-Image, and HunyuanVideo while improving generation fidelity.

#caching

@divaagurlxw: Inference optimizations I’d study if I wanted sub-second LLM responses: 1.KV-Caching 2.Speculative Decoding 3.FlashAtte…

X AI KOLs Timeline · 2026-06-29

A tweet listing 16 inference optimization techniques for achieving sub-second LLM responses, including KV-caching, speculative decoding, FlashAttention, and various parallelism methods.
