How LLMs decide which pages to cite — and how to optimize for it

Reddit r/artificial News

Summary

The article explains how LLMs like ChatGPT and Perplexity select sources to cite, highlighting that schema markup (JSON-LD) can dramatically improve precise information extraction — from 16% to 54% — and thereby raise citation rates.

When ChatGPT or Perplexity answers a question, it runs retrieval-augmented generation (RAG): it retrieves top candidates from a crawled index, then scores them. The scoring criteria are public knowledge from the Princeton GEO paper (arxiv.org/abs/2311.09735). Key signals: answer directness, cited statistics, structured data (JSON-LD), crawl access, and content freshness. What surprised me most in the research: schema markup alone shifts precise information extraction from 16% to 54%. That's not a marginal gain — that's the difference between being cited and being invisible. Anyone else experimenting with this? Curious what's working for people here.
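As a rough illustration of the kind of schema markup the post describes, here is a minimal sketch that builds a schema.org `Article` JSON-LD block and wraps it in the `<script type="application/ld+json">` tag pages embed in their `<head>`. The `@context`/`@type` vocabulary is real schema.org; the headline, author, and date values are hypothetical placeholders.

```python
import json

def article_jsonld(headline, author, date_published, description):
    """Build a minimal schema.org Article JSON-LD block — the kind of
    structured data the post credits with improving extraction rates."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_published,
        "description": description,
    }

block = article_jsonld(
    headline="How LLMs decide which pages to cite",
    author="Jane Doe",            # hypothetical author
    date_published="2024-01-15",  # hypothetical date
    description="How retrieval-augmented answer engines select sources.",
)

# Embed as a <script type="application/ld+json"> tag in the page <head>.
snippet = '<script type="application/ld+json">\n%s\n</script>' % json.dumps(block, indent=2)
print(snippet)
```

Crawlers and extraction pipelines can parse this block without any HTML heuristics, which is the plausible mechanism behind the extraction gains the post reports.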

Similar Articles

Whose Facts Win? LLM Source Preferences under Knowledge Conflicts

arXiv cs.CL

This paper investigates how LLMs handle knowledge conflicts in retrieval-augmented generation by studying their preferences among competing information sources. The authors find that LLMs prefer institutionally corroborated sources but that these preferences can be reversed by mere repetition; they propose a method that reduces repetition bias while maintaining consistent source preferences.

Stop letting LLMs edit your .bib [D]

Reddit r/MachineLearning

The post criticizes reliance on large language models for generating bibliographic entries, highlighting hallucinated citations and incorrect author lists that end up in academic papers.