Tag
This paper tests the assumption that LLMs judge better than they generate in in-context QA. It finds that generation accuracy exceeds self-evaluation accuracy on most benchmarks, and that the model attends less to the context when evaluating than when generating. The findings challenge a core assumption behind self-evaluation pipelines.
This paper investigates the cause of cross-lingual retrieval asymmetry in multilingual embedding models. The authors propose and test the hub-mediation hypothesis, find that hubness, not anisotropy, is the dominant cause, and recommend ranking with cross-domain similarity local scaling (CSLS) instead of cosine similarity.
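For readers unfamiliar with CSLS, the following is a minimal NumPy sketch of the standard definition, CSLS(x, y) = 2 cos(x, y) - r(x) - r(y), where r(.) is the mean cosine similarity of a vector to its k nearest neighbours on the other side. It is an illustration only, not the authors' implementation: the function name `csls_scores`, the argument names, and the default `k=10` are assumptions made here, and the sketch assumes no embedding row is the zero vector.

```python
import numpy as np


def csls_scores(queries, docs, k=10):
    """Return an (n_queries, n_docs) matrix of CSLS scores (illustrative sketch)."""
    # L2-normalise rows so that dot products equal cosine similarities.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    cos = q @ d.T

    # Cap k at the number of neighbours actually available on each side.
    k_q = min(k, cos.shape[1])
    k_d = min(k, cos.shape[0])

    # r_q[i]: mean cosine of query i to its k nearest documents.
    r_q = np.sort(cos, axis=1)[:, -k_q:].mean(axis=1)
    # r_d[j]: mean cosine of document j to its k nearest queries.
    # A hub document is close to many queries, so its r_d is large
    # and its scores are pushed down.
    r_d = np.sort(cos, axis=0)[-k_d:, :].mean(axis=0)

    return 2 * cos - r_q[:, None] - r_d[None, :]
```

Ranking then uses these scores in place of raw cosine similarity, for example `csls_scores(query_emb, doc_emb).argmax(axis=1)` for top-1 retrieval, where `query_emb` and `doc_emb` are hypothetical 2-D embedding arrays.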