How LLMs decide which pages to cite — and how to optimize for it

Reddit r/artificial News

Summary

The article explains how LLMs like ChatGPT and Perplexity select sources to cite, highlighting that schema markup (JSON-LD) can dramatically improve precise information extraction — from 16% to 54% — and thereby raise citation rates.

When ChatGPT or Perplexity answers a question, it runs retrieval-augmented generation (RAG): it retrieves top candidates from a crawled index, then scores them. The scoring criteria are public knowledge from the Princeton GEO paper (arxiv.org/abs/2311.09735). Key signals: answer directness, cited statistics, structured data (JSON-LD), crawl access, and content freshness. What surprised me most in the research: schema markup alone shifts precise information extraction from 16% to 54%. That's not a marginal gain — that's the difference between being cited and being invisible. Anyone else experimenting with this? Curious what's working for people here.
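As a rough illustration of the kind of schema markup the post describes, here is a minimal sketch that builds a schema.org `Article` JSON-LD block and wraps it in the `<script type="application/ld+json">` tag pages embed in their `<head>`. The `@context`/`@type` vocabulary is real schema.org; the headline, author, and date values are hypothetical placeholders.

```python
import json

def article_jsonld(headline, author, date_published, description):
    """Build a minimal schema.org Article JSON-LD block — the kind of
    structured data the post credits with improving extraction rates."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_published,
        "description": description,
    }

block = article_jsonld(
    headline="How LLMs decide which pages to cite",
    author="Jane Doe",            # hypothetical author
    date_published="2024-01-15",  # hypothetical date
    description="How retrieval-augmented answer engines select sources.",
)

# Embed as a <script type="application/ld+json"> tag in the page <head>.
snippet = '<script type="application/ld+json">\n%s\n</script>' % json.dumps(block, indent=2)
print(snippet)
```

Crawlers and extraction pipelines can parse this block without any HTML heuristics, which is the plausible mechanism behind the extraction gains the post reports.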

Similar Articles

Whose Facts Win? LLM Source Preferences under Knowledge Conflicts

arXiv cs.CL

This paper investigates how LLMs handle knowledge conflicts in retrieval-augmented generation by studying their preferences among competing information sources. The authors find that LLMs prefer institutionally corroborated sources but that these preferences can be reversed by mere repetition; they propose a method that reduces repetition bias while maintaining consistent source preferences.

Stop letting LLMs edit your .bib [D]

Reddit r/MachineLearning

The post criticizes reliance on large language models for generating bibliographic entries, highlighting hallucinated citations and incorrect author lists that end up in academic papers.