FastContext, a trending paper from Microsoft, introduces a small 4B-parameter model for efficient code retrieval. Paired with coding agents, it rivals closed-source systems on SWE-Bench Multilingual.
Semble is an agent-oriented code search tool that accepts natural-language queries and accurately returns semantically complete code snippets. It uses 98% fewer tokens than traditional grep+read methods and features intelligent chunking, dual-path retrieval, and code-aware re-ranking.
The authors detail their experience building a code indexing system, concluding that graph-based retrieval with LLM-generated semantics outperforms vector embeddings and pure AST parsing. They open-sourced the system, Bytebell, which uses Neo4j to store semantic context for efficient and precise code retrieval.
This research paper investigates text rewriting strategies for code retrieval, finding that full natural language rewriting offers the greatest performance gains. It introduces entropy-based diagnostics to help determine when costly LLM rewrites are beneficial.
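The summary above does not say how the entropy-based diagnostic is computed. As a rough illustration only, one plausible reading is the Shannon entropy of the retriever's score distribution: when the scores over the top candidates are nearly flat, the retriever is uncertain and a rewrite may be worth its cost. The sketch below assumes that reading; the names `score_entropy` and `should_rewrite`, the softmax normalization, and the threshold are all hypothetical and are not taken from the paper.

```python
import math


def score_entropy(scores, temperature=1.0):
    """Shannon entropy (in nats) of the softmax distribution over retrieval scores.

    Assumption: the diagnostic is computed over the top-k candidate scores.
    The paper may define it differently.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_rewrite(scores, threshold):
    """Hypothetical gate: request an LLM rewrite only when entropy exceeds the threshold."""
    return score_entropy(scores) > threshold
```

Under this reading, four equally scored candidates give the maximum entropy for four items, ln 4, while a single dominant score gives an entropy near zero, so only the first case would trigger a rewrite. In practice the threshold would have to be tuned on held-out queries.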