MemoryDocDataSet is a new synthetic benchmark of 50 micro-worlds and 1,000 QA pairs that evaluates AI systems on conversational memory and long-document reasoning as a single joint task. The best baseline, RAG-Both, reaches an overall F1 of only 0.358, which points to a substantial gap in current systems' ability to combine conversational memory with long-document navigation.
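For readers unfamiliar with how an overall F1 over QA pairs is typically obtained, the following is a minimal sketch. It assumes a token-overlap F1 with lowercased whitespace tokenization, averaged over all pairs, which is a common convention in QA evaluation. The summary above does not say which F1 variant MemoryDocDataSet uses, so the function names `token_f1` and `mean_f1` and the tokenization are illustrative assumptions, not the benchmark's actual scoring code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercased whitespace tokens; the benchmark's real
    normalization may differ.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(pairs):
    """Average token F1 over (prediction, reference) pairs."""
    scores = [token_f1(pred, ref) for pred, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this convention, an overall score such as 0.358 would be the mean of per-question scores, so it reflects partial credit for partially correct answers rather than the fraction of questions answered exactly.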