latency-reduction


LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

arXiv cs.CL · yesterday

LatentRAG is a novel framework that shifts reasoning and retrieval for agentic RAG into continuous latent space, reducing inference latency by approximately 90% while maintaining performance comparable to explicit methods.


Speeding up agentic workflows with WebSockets in the Responses API

OpenAI Blog · 2026-04-22

OpenAI details how WebSockets and API optimizations reduced latency by 40% for agentic workflows, enabling GPT-5.3-Codex-Spark to approach 1,000 tokens per second.


Prompt Caching in the API

OpenAI Blog · 2024-10-01

OpenAI introduces Prompt Caching, an automatic feature that reuses recently cached input tokens to cut the cost of those tokens by 50% and improve latency on GPT-4o, GPT-4o mini, o1-preview, and o1-mini. It applies automatically to prompts longer than 1,024 tokens and requires no integration changes from developers.
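As a rough illustration of the cost effect, the sketch below estimates input cost for one request, assuming the 50% reduction applies only to the cached portion of the input tokens. The function name `estimate_input_cost` and the price of 2.00 per million input tokens are hypothetical, chosen for the arithmetic and not taken from the article.

```python
def estimate_input_cost(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_million: float,
    cached_discount: float = 0.5,  # assumption: 50% off cached input tokens
) -> float:
    """Estimate the input-token cost of one request with a partial cache hit."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    # Cached tokens are billed at the discounted rate, the rest at full price.
    billable = uncached_tokens + cached_tokens * (1.0 - cached_discount)
    return billable * price_per_million / 1_000_000


# Hypothetical request: 10,000 input tokens, 8,192 of them served from cache.
with_cache = estimate_input_cost(10_000, 8_192, price_per_million=2.00)
without_cache = estimate_input_cost(10_000, 0, price_per_million=2.00)
print(f"with cache: {with_cache:.6f}, without cache: {without_cache:.6f}")
```

Under these assumptions the saving scales with the share of the prompt that hits the cache, so a request reaches the full 50% only when nearly all of its input tokens are cached.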

