SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research
Summary
SciAtlas is a large-scale, multi-disciplinary academic knowledge graph containing over 43 million papers and 3 billion triplets, designed to provide structured knowledge for AI-driven automated scientific research with a neuro-symbolic retrieval algorithm.
# SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research
Source: [https://arxiv.org/html/2605.22878](https://arxiv.org/html/2605.22878)
Preprint

Correspondence: shuofei@zju.edu.cn, zhangningyu@zju.edu.cn, huajunsir@zju.edu.cn

∗ Equal Contribution. † Corresponding Author.

GitHub: https://github.com/zjunlp/SciAtlas
Yunxiang Wei¹∗, Jiazheng Fan¹, Bin Wu², Busheng Zhang¹, Mengru Wang¹, Yuqi Zhu¹, Ningyu Zhang¹†, Keyan Ding¹, Qiang Zhang¹, Huajun Chen¹†

¹Zhejiang University, ²University College London
###### Abstract
The exponential growth of global academic output has confronted researchers and AI agents with an unprecedented “information explosion,” in which fragmented and unstructured knowledge organization impedes deep interdisciplinary integration. Current academic retrieval tools predominantly rely on superficial keyword matching or vector-space semantic retrieval, which lack the topological reasoning capabilities required to navigate complex logical connections. Agentic deep-research frameworks are often prone to logical hallucinations and incur high inference costs. To bridge this gap, in this report we introduce SciAtlas, a large-scale, multi-disciplinary, heterogeneous academic resource knowledge graph designed as a panoramic scientific evolution network. By integrating over 43M papers from 26 disciplines, with a total of 157M entities and 3B triplets, SciAtlas provides a structured topological cognitive substrate that dismantles disciplinary barriers and furnishes AI agents with a global perspective. Furthermore, we develop a neuro-symbolic retrieval algorithm featuring tri-path collaborative recall and graph reranking, achieving a seamless transition from simple semantic matching to deterministic association discovery. We also present key application directions of SciAtlas, including literature review, automated research trend synthesis, idea positioning, and academic trajectory exploration, to demonstrate that SciAtlas can serve as an effective “cognitive map” to empower the full loop of automated scientific research while significantly reducing reasoning costs. We have released the interfaces for KG retrieval and various downstream tasks in our GitHub repo.

Figure 1: Discipline Distribution in SciAtlas. SciAtlas is a large-scale scientific knowledge graph containing 26 disciplines with over 43M academic papers and other heterogeneous entities.

## 1 Introduction
Automated Scientific Research driven by Large Language Models \(LLMs\) has emerged as one of the most cutting\-edge focal points in the field of artificial intelligence\[ai4research\-survey,ai\-scientist,omniscientist\]\. With the exponential growth of global academic output, researchers and AI agents are jointly confronted with an unprecedented “information explosion” challenge\. Precise literature retrieval and effective knowledge integration not only constitute the logical starting point of the research loop but also serve as the core cornerstone determining the success of subsequent innovation generation and experimental design\[innoeval,scholareval,opennovelty,ai\-researcher\]\. However, current academic retrieval tools are generally plagued by two major issues\.
First is the organizational form of academic knowledge. Currently, vast amounts of research achievements are scattered across the internet in unstructured textual formats, lacking unified organizational paradigms and association mechanisms. This “knowledge island” phenomenon not only impedes deep interdisciplinary integration but also renders the intrinsic logical connections between entities latent and inaccessible. Novice researchers and AI agents struggle to transcend disciplinary barriers to perceive the global topological structure of scientific knowledge, resulting in cognitive dimensional deficits when addressing cutting-edge interdisciplinary topics [scikg].

Second is the retrieval paradigm of academic knowledge. Existing retrieval tools primarily rely on superficial keyword matching or vector-space-based semantic retrieval [scholareval,innoeval,ai-researcher,automind], both of which are essentially flattened feature comparisons and cannot support genuine topological reasoning. Some deep-research-based agentic frameworks attempt to compensate for the lack of structured information through iterative knowledge search and integration [wispaper,deepxiv,alphaxiv,opensholar]. However, this approach not only incurs high computational costs and response latency but also, because LLMs lack deterministic cognitive maps to anchor them, leaves them highly susceptible to logical hallucinations within complex exploratory trajectories.

We introduce SciAtlas, a large-scale, multi-disciplinary, heterogeneous academic resource knowledge graph designed to provide a topological cognitive substrate for accelerating scientific discovery. (This project is part of the SciGraph project ([http://scigraph.openkg.cn/](http://scigraph.openkg.cn/)) under SciGraph-Scholar.) In terms of organizational structure, SciAtlas features a schema (see Fig. [2](https://arxiv.org/html/2605.22878#S2.F2)) encompassing 9 categories of entity nodes, including papers, authors, institutions, keywords, and research fields. Each node type is endowed with comprehensive attribute information (e.g., paper abstracts and PDF URLs, author citations). The schema also defines 12 categories of relational edges, including citations, authorship, co-authorship, and keyword co-occurrence. This organizational paradigm weaves fragmented knowledge into a self-explanatory, panoramic scientific evolution network. Such structured formalization can dismantle disciplinary barriers, elevating scientific research into an interconnected logical topology that furnishes AI agents with a global cognitive perspective for observing scientific advancement.

Building on SciAtlas, we develop a neuro-symbolic retrieval algorithm that achieves the transition from semantic matching to topological reasoning. By integrating lexical matching, vector retrieval, and well-developed graph propagation algorithms [rwr], we establish a tri-path collaborative recall and graph reranking mechanism. It fuses the semantic relevance of papers, graph topological support, and importance metrics based on global citations, thereby providing deterministic deep association discovery without frequent LLM iterations or high reasoning costs. Furthermore, we propose several potential downstream application directions of SciAtlas for automated scientific research, including literature review, differentiated positioning and similarity detection of research ideas, idea generation, automated research trend prediction, retrieval of highly relevant academic authors, and academic trajectory exploration for researchers.
Our main contributions are as follows:
- We introduce SciAtlas, a large-scale, multi-disciplinary knowledge graph that organizes fragmented academic resources into a structured logical topology. It serves as a comprehensive, panoramic scientific network that provides AI agents with a global cognitive perspective.
- We develop an efficient neuro-symbolic retrieval algorithm featuring tri-path collaborative recall and graph reranking, achieving the transition from surface-level semantic matching to deterministic topological reasoning.
- We propose application directions for SciAtlas, including research trend synthesis, idea positioning, and academic trajectory exploration. These applications demonstrate SciAtlas’s capability as a “cognitive map” to empower the entire loop of automated scientific research.

## 2 SciAtlas

Figure 2: Schema of SciAtlas. By integrating 9 kinds of entity nodes and 12 kinds of relational edges, SciAtlas provides a structured topological cognitive substrate that dismantles disciplinary barriers and furnishes AI agents with a global perspective. The complete schema (including entities, relations, and attributes) of SciAtlas can be found in Appx. [A](https://arxiv.org/html/2605.22878#A1).

### 2.1 Overview of SciAtlas
##### Schema\.
In Fig. [2](https://arxiv.org/html/2605.22878#S2.F2), we present the complete schema of SciAtlas. SciAtlas is constructed with academic literature as its core, encompassing entities such as `Author`, `Institution`, `Keyword`, `Source`, `Topic`, `Field`, `Subfield`, and `Domain`, all centered around the `Paper` entity. With the help of these entities, papers are organized directly or indirectly at four levels:

- Semantic level. The citation relationship (`CITES`) and relevance relationship (`RELATED_TO`) establish direct semantic connections between papers.
- Conceptual level. Each paper is associated with its most salient keywords, and the `COOCCUR` relationships among keywords within papers indirectly link papers at the conceptual level.
- Direction level. Different domains, fields, subfields, and topics organize papers into hierarchical structures at the disciplinary and research direction levels.
- Social level. `COAUTHOR` relationships among authors and `AUTHORED` relationships between authors and papers, together with the `AFFILIATED_WITH` relationships between authors and institutions, form indirect relationships between papers at the social organizational level.

These multi-level organizational structures constitute a complex paper relationship network, providing a robust structural foundation for deep retrieval and reasoning over SciAtlas.

Table 1: Statistics of SciAtlas. SciAtlas comprises 157M nodes in total, with the aggregate number of edges reaching 3B.

| Entity type (Total: 157M) | Num |
|---|---|
| Paper | 43.30M |
| Author | 109.70M |
| Keyword | 3.76M |
| Institution | 0.12M |
| Topic | 4.52K |
| Subfield | 252 |
| Field | 26 |
| Source | 0.28M |
| Domain | 4 |

| Relation type (Total: 3B) | Num |
|---|---|
| (Paper, CITES, Paper) | 213.88M |
| (Paper, HAS_KEYWORD, Keyword) | 101.38M |
| (Paper, HAS_TOPIC, Topic) | 105.89M |
| (Author, AFFILIATED_WITH, Instit) | 195.94M |
| (Author, AUTHORED, Paper) | 149.00M |
| (Author, COAUTHOR, Author) | 2.06B |
| (Keyword, COOCCUR, Keyword) | 60.37M |
| (Field, DOMAIN_OF, Domain) | 26 |
| (Subfield, FIELD_OF, Field) | 252 |
| (Paper, RELATED_TO, Paper) | 68.38M |
| (Topic, SUBFILED_OF, Subfield) | 4.52K |
| (Paper, PUBLISH_IN, Source) | 40.90M |
##### Statistics\.
SciAtlas covers 26 academic disciplines (see Fig. [1](https://arxiv.org/html/2605.22878#S0.F1)) with a total of 43.30 million papers. Medicine holds the largest share (18.56%), followed by Social Sciences (10.70%), Engineering (9.43%), Biochemistry, Genetics and Molecular Biology (6.44%), and Computer Science (6.29%). The five disciplines above collectively account for 51.43% of the total paper volume, reflecting the concentration of core disciplines. The remaining fields range from Arts and Humanities (3.33%) to Veterinary (0.16%), ensuring broad disciplinary representation. In terms of scale, as shown in Tab. [1](https://arxiv.org/html/2605.22878#S2.T1), SciAtlas contains 109.70 million authors, 3.76 million keywords, and 0.12 million institutions, connected by billions of relational edges across 12 relationship types. This combination of comprehensive disciplinary coverage and massive entity volume positions SciAtlas as a large-scale, multi-disciplinary knowledge graph for topological scientific search.

### 2.2 SciAtlas Construction

The primary data source for our knowledge graph is OpenAlex ([https://openalex.org/](https://openalex.org/)), a fully open-source library of scholarly resources encompassing over 480 million academic publications. Each paper contains rich metadata, including authors, abstracts, institutions, publication dates, venues, references, citation counts, topics, open-access status, and PDF URL. Building upon this foundation, we construct our knowledge graph through the following primary steps:
##### Data Restructuring and Filtering\.
First, we extract different entity types from OpenAlex and preserve only key attributes for each entity. Since OpenAlex data is itself sourced from the internet and contains substantial noise, we then standardize the names of various entities (e.g., paper titles, institution names) and deduplicate them. Notably, we do not deduplicate authors due to the prevalence of name duplication and ambiguity. We also discard entities lacking critical attributes (e.g., paper PDF URLs). We then filter out non-English papers and papers with very short abstracts to ensure high quality. Next, we establish edges based on the inter-entity information stored within each entity (e.g., authors and references contained in papers). Since OpenAlex assigns a unique ID to each entity, we directly use these IDs to match corresponding entities and construct relationships.
##### Keyword Extraction\.
Although OpenAlex includes a `Concept` entity type intended to capture the core concepts of papers, it is excessively sparse (only 65K entries against a corpus of 480M papers) and, more critically, its concepts remain at a macroscopic and superficial level (e.g., “artificial intelligence”), failing to represent the core concepts and terms of individual papers. Such concepts are insufficient for complex academic relational reasoning in a KG, motivating us to construct denser and more useful keywords. Specifically, we employ a lightweight open-source LLM (Qwen3-30B-A3B-Instruct-2507 [qwen3]) as an extractor to identify keywords from paper abstracts. Many contemporary papers emphasize narrative packaging, which often obscures their academic essence, and the same concept may be expressed differently across distinct domains. We therefore deliberately instruct the LLM to avoid paper-specific terminology or system names, as well as highly customized or marketing-style expressions, and to prioritize fundamental phrases that are reusable across numerous papers. For each paper, we extract 3 to 8 core keywords to constitute the `Keyword` entity. The LLM also assigns an importance score to each keyword, which serves as the attribute of the `HAS_KEYWORD` edge. Please see Appx. [B.1](https://arxiv.org/html/2605.22878#A2.SS1) for the detailed prompt of keyword extraction. To capture associations among keywords, we establish `COOCCUR` relations between keywords appearing in the same paper, with co-occurrence frequency serving as the edge weight that indicates the strength of association between keywords.

Examples of Good and Bad Keywords

- Good Keywords: protein structure prediction, idea evaluation, wireless communication, energy optimization, fault detection, monte carlo simulation
- Bad Keywords: hierarchical dual-path adaptive learning framework, multi-stage cross-modal feature fusion architecture, novel high-performance prototype system, AlphaEvolve
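As a rough illustration of how co-occurrence frequencies could be turned into `COOCCUR` edge weights, the sketch below counts, for every unordered keyword pair, the number of papers in which both keywords appear. The function name `cooccurrence_weights`, the input layout, and the lowercase normalization are assumptions made for this example, not the released pipeline.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(paper_keywords):
    """Count how many papers each unordered keyword pair shares.

    paper_keywords: dict mapping a paper id to its extracted keywords
    (hypothetical layout chosen for this sketch).
    """
    counts = Counter()
    for keywords in paper_keywords.values():
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        unique = sorted({k.strip().lower() for k in keywords})
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1
    return counts


papers = {
    "p1": ["fault detection", "energy optimization", "wireless communication"],
    "p2": ["fault detection", "energy optimization"],
    "p3": ["protein structure prediction", "monte carlo simulation"],
}
weights = cooccurrence_weights(papers)
```

Here the pair ("energy optimization", "fault detection") appears in two papers, so its edge would carry weight 2, while pairs seen once carry weight 1.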
##### Semantic Embedding\.
To support hybrid and efficient KG retrieval, we incorporate pre-computed semantic vectors into SciAtlas in addition to plain text. Specifically, we select the three most semantically rich fields: paper title, paper abstract, and keyword. We first normalize each field (format and case), then employ bge-large-en-v1.5 [bge] as the embedding model. The semantic vectors derived from the titles and abstracts are stored as paper attributes, while those derived from the keywords are stored as keyword attributes.

Finally, we organize all entities, attributes, and edges together and deploy SciAtlas using Neo4j ([https://neo4j.com/](https://neo4j.com/)).

### 2.3 SciAtlas Update

To accommodate rapid knowledge iteration, we propose several approaches for SciAtlas updates:

##### Using Online Resources.

OpenAlex provides API endpoints ([https://developers.openalex.org/api-reference/introduction](https://developers.openalex.org/api-reference/introduction)) that are updated daily for entities such as papers, authors, and institutions. Users can retrieve information for desired papers directly through the API, follow the pipeline described in §[2.2](https://arxiv.org/html/2605.22878#S2.SS2) to extract keywords, compute semantic embeddings, and extract inter-entity relationships aligned with the SciAtlas schema, and finally import them into the database via the Neo4j Cypher language. Although OpenAlex encompasses the vast majority of literature available on the internet, rare cases of absent papers may occur. For such scenarios, we recommend GROBID ([https://github.com/grobidOrg/grobid](https://github.com/grobidOrg/grobid)), a very lightweight information extraction tool specifically designed for technical and scientific publications. It can rapidly extract metadata, including titles, authors, abstracts, and references, from a paper's PDF file, serving as an efficient alternative to the OpenAlex API. We will open-source our KG construction code to support this evolution.
##### Periodic Update\.
Every two months, OpenAlex compiles changefiles ([https://developers.openalex.org/download/changefiles](https://developers.openalex.org/download/changefiles)) containing the latest updates relative to the previous version. Our team will periodically update our knowledge graph based on these releases. Users who have already deployed the system locally can also maintain their knowledge graph periodically. Our pipeline supports one-click import from downloaded OpenAlex files into SciAtlas.

## 3 Neuro-Symbolic Retrieval

In this section, we introduce a neuro-symbolic retrieval algorithm that features tri-path collaborative entity recall and achieves deep topological reasoning through graph traversal. It can also serve as a fundamental retrieval algorithm adaptable to the various downstream tasks in §[4](https://arxiv.org/html/2605.22878#S4).

### 3.1 Node Matching

Our retrieval system supports arbitrary query formats, including keywords, scientific questions, abstracts, idea texts, and even complete papers. Given a query $q$, we map it to KG nodes in three distinct ways.
##### Keyword Matching\.
We use an LLM to extract keywords from $q$ and assign each keyword an importance score, forming a keyword list $\mathcal{K}=\{(k_i,s_i^{\text{llm}})\}_{i=1}^{m}$, where $k_i$ is the $i$-th extracted keyword after text normalization and $s_i^{\text{llm}}\in[0,1]$ represents its normalized importance score. The maximum number of keywords extracted by the LLM is $m$. First, we perform exact text matching of $k_i$ in the KG. For each matched keyword node $g$, we assign it an exact match score:

$$\text{score}_{exact}(k_i,g)=s_i^{\text{llm}}. \tag{1}$$

Second, we perform vector matching. After encoding each $k_i$ into a semantic vector, we compute semantic similarity based on the pre-calculated keyword text embeddings in the KG. Nodes with similarity scores exceeding the threshold $\theta_{kw}$ (default $0.7$) are retained, with their scores as:

$$\text{score}_{vec}(k_i,g)=s_i^{\text{llm}}\cdot\text{sim}(k_i,g). \tag{2}$$

If multiple nodes surpass the threshold, we select only the top-3 nodes for each $k_i$. The same keyword node $g$ may be matched by multiple input keywords, or simultaneously by both exact and vector matching. We take the maximum of all its scores as the node's final weight:

$$w_g^{kw}=\max_i\left(\mathbb{1}[k_i=g]\cdot s_i^{\text{llm}},\;\mathbb{1}[\text{sim}(k_i,g)\geq\theta_{kw}]\cdot s_i^{\text{llm}}\,\text{sim}(k_i,g)\right) \tag{3}$$

The final set of keyword-matching nodes is denoted as $\mathcal{K}_{seed}=\{(g,w_g^{kw})\}$.
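The keyword pathway of Eqs. (1) to (3) can be sketched as follows. This is a minimal illustration, assuming the exact-match lookup and the embedding similarities are already available as plain Python structures; `keyword_seed_weights`, `kg_nodes`, and `similarity` are hypothetical names, not part of the released interface.

```python
def keyword_seed_weights(query_keywords, kg_nodes, similarity,
                         theta_kw=0.7, top_k=3):
    """Return {keyword node: w_g^kw} following Eqs. (1)-(3).

    query_keywords: list of (keyword, llm_importance) pairs.
    kg_nodes: set of keyword strings present in the KG (assumed lookup).
    similarity: {query keyword: {kg node: similarity}} (assumed precomputed).
    """
    weights = {}
    for kw, s_llm in query_keywords:
        candidates = {}
        if kw in kg_nodes:  # Eq. (1): exact match keeps the LLM score
            candidates[kw] = s_llm
        above = [(g, sim) for g, sim in similarity.get(kw, {}).items()
                 if sim >= theta_kw]
        above.sort(key=lambda item: item[1], reverse=True)
        for g, sim in above[:top_k]:  # Eq. (2): top-3 nodes above threshold
            candidates[g] = max(candidates.get(g, 0.0), s_llm * sim)
        for g, score in candidates.items():  # Eq. (3): max over all matches
            weights[g] = max(weights.get(g, 0.0), score)
    return weights


query = [("fault detection", 0.9), ("anomaly detection", 0.6)]
kg_nodes = {"fault detection", "fault diagnosis", "outlier detection"}
similarity = {
    "fault detection": {"fault detection": 1.0, "fault diagnosis": 0.8,
                        "outlier detection": 0.5},
    "anomaly detection": {"outlier detection": 0.9, "fault detection": 0.75},
}
seeds = keyword_seed_weights(query, kg_nodes, similarity)
```

In this toy input, "fault detection" is hit both exactly and by vector matching from two query keywords, and the maximum (the exact-match score 0.9) is kept.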
##### Semantic Matching\.
We embed the query $q$ to obtain a vector $\mathbf{e}_q$ (if the input is an entire paper, we embed only its abstract), which is then used to retrieve the top-$60$ papers from the KG based on title embeddings and abstract embeddings, respectively. We then employ a reranker (bge-reranker-large [bge]) to re-rank the retrieved papers, retaining the top-$15$ papers each for title and abstract. Given a retrieved paper $p$, we define $s_p^{title}$ and $s_p^{abs}$ as its retrieval scores through title and abstract matching, and compute a weighted combination of the two scores:

$$s_p^{emb}=\frac{0.4\cdot s_p^{title}+0.6\cdot s_p^{abs}}{0.4\cdot\mathbb{1}[\exists s_p^{title}]+0.6\cdot\mathbb{1}[\exists s_p^{abs}]}. \tag{4}$$

Here, $s_p^{title}$ or $s_p^{abs}$ is set to $0$ if it does not exist. The final candidate paper nodes from semantic matching are denoted as $\mathcal{P}^{emb}=\{(p,s_p^{emb})\}$.
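A small sketch of Eq. (4) may help show how the denominator renormalizes the weights when a paper was retrieved through only one of the two pathways. The function name and the use of `None` to mark a missing score are assumptions of this example.

```python
def combined_embedding_score(s_title=None, s_abs=None,
                             w_title=0.4, w_abs=0.6):
    """Weighted combination of title and abstract scores as in Eq. (4).

    A missing score (None) contributes 0 to the numerator and its weight
    is dropped from the denominator.
    """
    numerator = 0.0
    denominator = 0.0
    if s_title is not None:
        numerator += w_title * s_title
        denominator += w_title
    if s_abs is not None:
        numerator += w_abs * s_abs
        denominator += w_abs
    if denominator == 0.0:
        raise ValueError("at least one of s_title or s_abs is required")
    return numerator / denominator


both = combined_embedding_score(s_title=0.8, s_abs=0.5)
title_only = combined_embedding_score(s_title=0.8)
abs_only = combined_embedding_score(s_abs=0.5)
```

With both scores present the result is the 0.4/0.6 weighted mean; with only one present, the result equals that single score.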
##### Title Matching\.
Since titles encapsulate the most critical information of papers and are highly beneficial for paper retrieval, we specifically perform title matching for queries $q$ that contain titles. We use GROBID to extract all titles (including the paper's title and its references' titles) from the idea or paper and employ an LLM to assign a confidence score $c_j$ to each title $t_j$. We retain the top-$10$ titles with the highest confidence scores and normalize them (removing non-alphabetic characters and converting to lowercase) to obtain the title set $\mathcal{T}=\{(t_j,c_j)\}_{j=1}^{n}$. We then perform text matching of titles in the KG. If an exact match is found, a matching score of $m(t_j,p)=1.0$ is assigned; otherwise, we compute the fuzzy similarity between the two titles based on the following formula:

$$m(t_j,p)=0.65\cdot\text{seq}(t_j,p)+0.35\cdot\text{token\_overlap}(t_j,p), \tag{5}$$

where $\text{seq}(a,b)$ is based on the Longest Common Subsequence (LCS) of $a$ and $b$, and $\text{token\_overlap}$ computes the Jaccard overlap ratio of the token sets of $a$ and $b$. Candidates with similarity below $\theta_{title}$ (default $0.88$) are directly discarded. For paper $p$ matched by title $t_j$, we assign it a score:

$$s_{j,p}^{title}=c_j\cdot m(t_j,p). \tag{6}$$

If the same paper is matched by multiple titles, we take the maximum score $s_p^{title}=\max_j s_{j,p}^{title}$. Each input title retains at most the top-5 papers, and all papers constitute $\mathcal{P}^{title}=\{(p,s_p^{title})\}$.
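The title score of Eqs. (5) and (6) can be illustrated with the sketch below. Two details are assumptions, since the text only says that seq is "based on" the LCS: here seq is taken to be $2\cdot\text{LCS}/(|a|+|b|)$ over characters, and normalization keeps spaces so that tokens survive. All function names are hypothetical.

```python
import re


def normalize_title(title):
    # Lowercase and drop non-alphabetic characters; spaces are kept
    # (an assumption) so that token overlap can still be computed.
    kept = re.sub(r"[^a-z ]+", " ", title.lower())
    return " ".join(kept.split())


def lcs_length(a, b):
    # Classic dynamic programme for the longest common subsequence length.
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j in range(1, len(b) + 1):
            if ch == b[j - 1]:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def seq_ratio(a, b):
    # Assumed LCS-based ratio in [0, 1].
    if not a and not b:
        return 1.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))


def token_overlap(a, b):
    # Jaccard overlap of the two token sets.
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def title_match_score(query_title, kg_title, confidence, theta_title=0.88):
    a, b = normalize_title(query_title), normalize_title(kg_title)
    if a == b:
        m = 1.0  # exact match after normalization
    else:
        m = 0.65 * seq_ratio(a, b) + 0.35 * token_overlap(a, b)  # Eq. (5)
    if m < theta_title:
        return None  # discarded candidate
    return confidence * m  # Eq. (6)


exact = title_match_score("SciAtlas: A Knowledge Graph!",
                          "sciatlas a knowledge graph", 0.9)
miss = title_match_score("graph neural networks", "protein folding", 1.0)
```

Because the Jaccard term penalizes any differing token heavily, the 0.88 threshold is strict under these assumptions: even a single changed word in a short title can push a candidate below it.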
##### Node Merging\.
We obtain two candidate paper node sets through the semantic and title pathways. We then merge them into $\mathcal{P}_{seed}$ and unify their weights. For each candidate paper $p$, we compute the dot products between its title and abstract embeddings and the query vector $\mathbf{e}_q$, and apply weighting according to the ratio specified in Eq. [4](https://arxiv.org/html/2605.22878#S3.E4):

$$\bar{s}_p^{emb}=\text{combine}(\text{sim}_p^{title},\text{sim}_p^{abs}),\quad \text{sim}_p^{title}=\mathbf{e}_q^{\top}\mathbf{e}_p^{title},\quad \text{sim}_p^{abs}=\mathbf{e}_q^{\top}\mathbf{e}_p^{abs}. \tag{7}$$

We then perform MinMax normalization:

$$\widetilde{s}_p^{emb}=\text{MinMaxNorm}(\bar{s}_p^{emb}),\quad \widetilde{s}_p^{title}=\text{MinMaxNorm}(s_p^{title}),\quad \text{MinMaxNorm}(x_p)=\frac{x_p-x_{min}}{x_{max}-x_{min}} \tag{8}$$

Finally, we define the unified paper weight:

$$s_p^{pre}=\lambda_{emb}\widetilde{s}_p^{emb}+\lambda_{title}\widetilde{s}_p^{title}+b_p^{pre},\quad b_p^{pre}=\begin{cases}0.35,&\text{exact title hit}\\ 0.10,&\text{fuzzy title hit}\\ 0,&\text{otherwise}\end{cases}, \tag{9}$$

where $b_p^{pre}$ denotes the title bonus, and $\lambda_{emb}$ (default $0.3$) and $\lambda_{title}$ (default $0.8$) represent the importance weights for the semantic and title pathways, respectively.
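The merging step of Eqs. (8) and (9) can be sketched as below, taking the combined similarities $\bar{s}_p^{emb}$ and the title scores as given. Two behaviours are assumptions of this sketch because the text does not specify them: a paper absent from one pathway gets a normalized score of 0 for that pathway, and a degenerate range (all scores equal) normalizes to 1.

```python
TITLE_BONUS = {"exact": 0.35, "fuzzy": 0.10}


def min_max_norm(scores):
    """MinMax-normalize a {paper: score} dict as in Eq. (8)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: a degenerate range maps every score to 1.0.
        return {p: 1.0 for p in scores}
    return {p: (s - lo) / (hi - lo) for p, s in scores.items()}


def unified_paper_weights(emb_scores, title_scores, title_hits,
                          lam_emb=0.3, lam_title=0.8):
    """Compute s_p^pre of Eq. (9) for every candidate paper.

    title_hits maps a paper to "exact" or "fuzzy"; papers without a
    title hit receive no bonus.
    """
    emb_n = min_max_norm(emb_scores)
    title_n = min_max_norm(title_scores)
    merged = {}
    for p in set(emb_scores) | set(title_scores):
        bonus = TITLE_BONUS.get(title_hits.get(p), 0.0)
        merged[p] = (lam_emb * emb_n.get(p, 0.0)
                     + lam_title * title_n.get(p, 0.0)
                     + bonus)
    return merged


pre = unified_paper_weights(
    emb_scores={"p1": 0.9, "p2": 0.6, "p3": 0.3},
    title_scores={"p1": 0.95, "p3": 0.90},
    title_hits={"p1": "exact", "p3": "fuzzy"},
)
```

Note that with the default weights a paper can exceed 1 before any later normalization: `p1` above scores 0.3 + 0.8 + 0.35 = 1.45.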
### 3.2 Weight Setting

Taking $\mathcal{K}_{seed}$ and $\mathcal{P}_{seed}$ as starting points, we perform a 2-hop subgraph propagation, where all edges are treated as undirected during the propagation process. To prevent subgraph explosion, we select at most $500$ nodes per hop for each entity type. For each paper $p$ in the local subgraph, we compute its importance based on its citation count $c_p$. Let $C$ denote the total citation count of all papers in the subgraph. The paper's importance is defined as:

$$\text{imp}(p)=\min\left(1,\frac{\log(1+c_p)}{\log(1+\max(1,C))}\right). \tag{10}$$

Here, the importance can be tailored to the downstream task: if the task emphasizes paper quality, it can be computed according to Eq. [10](https://arxiv.org/html/2605.22878#S3.E10); if the focus is solely on relevance, all papers can be forced to $\text{imp}(p)=1$. For each seed paper $p$, we define its unnormalized weight as:

$$w_p^{seed}=s_p^{pre}\cdot(1+\gamma\cdot\text{imp}(p)), \tag{11}$$

where $\gamma$ is the control factor for importance (default $0.5$). For each seed keyword $g$, we define its unnormalized weight as $w_g^{seed}=w_g^{kw}$. We define the distribution $\mathbf{s}$ over all nodes in the graph as:

$$s_v=\begin{cases}\dfrac{w_v^{seed}}{Z},&v\in S\\ 0,&v\notin S\end{cases},\quad Z=\sum_{v\in S}w_v^{seed},\quad S=\mathcal{P}_{seed}\cup\mathcal{K}_{seed}. \tag{12}$$
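Eqs. (10) to (12) amount to a small amount of arithmetic, sketched below. The natural logarithm is assumed (the base cancels in the ratio of Eq. (10) anyway), and the function names and the `("paper", id)` / `("keyword", id)` node keys are illustrative choices rather than the released data model.

```python
import math


def importance(citations, total_citations):
    """Citation-based importance of Eq. (10), clipped to [0, 1]."""
    return min(1.0, math.log1p(citations)
               / math.log1p(max(1, total_citations)))


def seed_distribution(paper_pre, paper_citations, keyword_weights,
                      total_citations, gamma=0.5):
    """Normalized restart distribution s over seed nodes (Eqs. 11-12)."""
    raw = {}
    for p, s_pre in paper_pre.items():
        imp = importance(paper_citations.get(p, 0), total_citations)
        raw[("paper", p)] = s_pre * (1.0 + gamma * imp)  # Eq. (11)
    for g, w in keyword_weights.items():
        raw[("keyword", g)] = w  # seed keywords keep w_g^kw
    z = sum(raw.values())
    if z <= 0:
        raise ValueError("seed weights must have positive total mass")
    return {v: w / z for v, w in raw.items()}  # Eq. (12)


dist = seed_distribution(
    paper_pre={"p1": 1.0, "p2": 0.5},
    paper_citations={"p1": 99, "p2": 0},
    keyword_weights={"k1": 0.5},
    total_citations=99,
)
```

In this toy case `p1` holds all citations, so its importance is 1 and its raw weight is 1.5; after normalization by $Z=2.5$ the distribution is 0.6, 0.2, and 0.2.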
Table 2: Definitions of Unnormalized Edge Weights ($\omega(u,v)$).

| Edge Type | Weight Formula(s) | Parameter Description |
|---|---|---|
| `HAS_KEYWORD` | $\omega_{\text{HK}}(p,g)=\beta_{hk}\cdot\kappa(g)\cdot\text{rel}_{p,g}$, where $\kappa(g)=w_g^{kw}$ if $g$ is a seed and $\kappa(g)=\epsilon_{kw}$ otherwise | $\beta_{hk}$: base weight for keyword association (default $1.20$). $\text{rel}_{p,g}$: importance score from $(p,g)$. $\kappa(g)$: prior weight modulator for the keyword node. $w_g^{kw}$: initial matching score for seed keywords. $\epsilon_{kw}$: smoothing factor for non-seed keywords (default $0.25$). |
| `CITES` | $\omega_{\text{CITES}}(u,v)=\beta_{cite}$ | $\beta_{cite}$: base weight for paper citation relation (default $1.00$). |
| `RELATED_TO` | $\omega_{\text{RELATED}}(u,v)=\beta_{rel}$ | $\beta_{rel}$: base weight for paper relatedness (default $0.90$). |
| `AUTHORED` | $\omega_{\text{AUTHORED}}(u,v)=\beta_{auth}$ | $\beta_{auth}$: base weight for authorship relation (default $0.80$). |
| `COAUTHOR` | $\omega_{\text{COA}}(u,v)=\beta_{coauth}\cdot\max(1,\phi(n_{uv}))$, where $\phi(n_{uv})=\min(c_{max},\log(1+n_{uv}))$ | $\beta_{coauth}$: base weight for co-authorship (default $0.60$). $n_{uv}$: co-authoring frequency. $\phi(\cdot)$: frequency smoothing function. $c_{max}$: logarithmic cap to prevent infinite weight magnification (default $2.0$). |
| `COOCCUR` | $\omega_{\text{COO}}(u,v)=\beta_{cooc}\cdot\max(1,\phi(m_{uv}))$, where $\phi(m_{uv})=\min(c_{max},\log(1+m_{uv}))$ | $\beta_{cooc}$: base weight for keyword co-occurrence (default $0.60$). $m_{uv}$: co-occurrence frequency. $\phi(\cdot),c_{max}$: same smoothing function and cap definition as `COAUTHOR`. |

For an edge $e=(u,v,r)$ in the graph, we define its unnormalized weight $\omega(u,v)$ based on the edge type, as specified in Tab. [2](https://arxiv.org/html/2605.22878#S3.T2).
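The weight rules of Table 2 can be written as one small dispatch function. This is a sketch using the stated defaults; the natural logarithm is an assumption (the table does not name a base), and `edge_weight` with its keyword arguments is a hypothetical interface.

```python
import math

# Default base weights from Table 2.
BETA = {
    "HAS_KEYWORD": 1.20,
    "CITES": 1.00,
    "RELATED_TO": 0.90,
    "AUTHORED": 0.80,
    "COAUTHOR": 0.60,
    "COOCCUR": 0.60,
}
EPS_KW = 0.25  # smoothing factor for non-seed keywords
C_MAX = 2.0    # logarithmic cap


def smoothed_frequency(n):
    # phi(n) = min(c_max, log(1 + n)); natural log assumed.
    return min(C_MAX, math.log1p(n))


def edge_weight(edge_type, *, frequency=0, relevance=1.0,
                seed_keyword_weight=None):
    """Unnormalized weight omega(u, v) for one edge."""
    beta = BETA[edge_type]
    if edge_type == "HAS_KEYWORD":
        # kappa(g) is the seed score for seed keywords, else epsilon_kw.
        kappa = EPS_KW if seed_keyword_weight is None else seed_keyword_weight
        return beta * kappa * relevance
    if edge_type in ("COAUTHOR", "COOCCUR"):
        return beta * max(1.0, smoothed_frequency(frequency))
    return beta


w_cite = edge_weight("CITES")
w_nonseed = edge_weight("HAS_KEYWORD", relevance=0.8)
w_seed = edge_weight("HAS_KEYWORD", relevance=0.5, seed_keyword_weight=0.9)
w_rare = edge_weight("COAUTHOR", frequency=1)
w_frequent = edge_weight("COAUTHOR", frequency=1000)
```

Under these definitions the frequency multiplier for `COAUTHOR` and `COOCCUR` always lies between 1 and $c_{max}=2$, so such an edge weighs between 0.60 and 1.20.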
### 3.3 Random Walk with Restart

To explore the topological relationships between nodes more deeply and enable deep reasoning within the graph, we perform random walks on the graph based on seed nodes and edge weights. For any node $u$, let its neighbor set be $N(u)$. The transition probability from $u$ to its neighbor $v$ is defined as:

$$P(v\mid u)=\frac{\omega(u,v)}{\sum_{x\in N(u)}\omega(u,x)}. \tag{13}$$

Assuming the node score vector at iteration $t$ is $\mathbf{r}^{(t)}$, we initialize $\mathbf{r}^{(0)}=\mathbf{s}$. For any node $v$, its score in the next iteration is:

$$r_v^{(t+1)}=\alpha s_v+(1-\alpha)\sum_u r_u^{(t)}P(v\mid u), \tag{14}$$

where $\alpha$ denotes the restart probability. If a node $u$ has no neighbors, we preserve its own mass by directly adding $(1-\alpha)r_u^{(t)}$ back to $u$ itself. The iteration terminates when:

$$\|\mathbf{r}^{(t+1)}-\mathbf{r}^{(t)}\|_1<\varepsilon, \tag{15}$$

where $\varepsilon=10^{-6}$, or when the maximum number of iterations $T_{\max}=50$ is reached. The final graph score of node $v$ is given by $r_v=r_v^{(t^{\star})}$, where $t^{\star}$ denotes the stopping iteration.
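The iteration of Eqs. (13) to (15) can be sketched in plain Python on a dictionary-based graph. The restart probability is not given a default in the text, so `alpha=0.15` below is purely an assumed value, and the edge-dictionary layout is likewise an illustration.

```python
def random_walk_with_restart(edges, seed, alpha=0.15, eps=1e-6, max_iter=50):
    """Random walk with restart on an undirected weighted graph.

    edges: {(u, v): omega(u, v)}, treated as undirected.
    seed: restart distribution s (should sum to 1).
    alpha: restart probability (0.15 is an assumed default).
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    nodes = set(adj) | set(seed)
    r = {v: seed.get(v, 0.0) for v in nodes}  # r^(0) = s
    for _ in range(max_iter):
        nxt = {v: alpha * seed.get(v, 0.0) for v in nodes}
        for u in nodes:
            nbrs = adj.get(u)
            if not nbrs:
                # Isolated node: keep its own mass, as described above.
                nxt[u] += (1.0 - alpha) * r[u]
                continue
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += (1.0 - alpha) * r[u] * w / total  # Eqs. (13)-(14)
        delta = sum(abs(nxt[v] - r[v]) for v in nodes)
        r = nxt
        if delta < eps:  # Eq. (15): L1 convergence check
            break
    return r


edges = {("p1", "k1"): 1.2, ("k1", "p2"): 0.3, ("p2", "p3"): 1.0}
scores = random_walk_with_restart(edges, seed={"p1": 0.7, "k1": 0.3})
isolated = random_walk_with_restart({}, seed={"x": 1.0})
```

In this toy graph `p3` is not a seed and is two hops from the nearest one, yet it still receives a positive score through propagation, which is how graph-discovered papers enter the ranking.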
### 3\.4Final Ranking
Upon completing the graph propagation, the system derives a set of global node scores\{rv\}v∈V′\\\{r\_\{v\}\\\}\_\{v\\in V^\{\\prime\}\}across the local subgraph\. For the purpose of paper retrieval, we isolate the scores of paper nodes:
spgraph=rp,p∈V′∩Paper\\displaystyle s\_\{p\}^\{graph\}=r\_\{p\},\\quad p\\in V^\{\\prime\}\\cap\\texttt\{Paper\}\(16\)Crucially, this stage allows for the inclusion of newly discovered paper nodes that are not part of the initial𝒫seed\\mathcal\{P\}\_\{seed\}sets but are reached during graph expansion\.
To account for the academic impact of candidates within the retrieved context, we re-calculate the paper importance $\text{imp}_{final}(p)$ based on the citation distribution of the final candidate set. We utilize the logarithmic scaling defined in Eq. [10](https://arxiv.org/html/2605.22878#S3.E10), using total citations within the current pool to ensure a robust relative metric. To prevent the graph diffusion from over-promoting distant nodes, we introduce a graph support factor $g_p$, which acts as a gating mechanism based on the initial retrieval strength:
$$g_p = \max(0.25, \tilde{s}_p^{pre}), \tag{17}$$
where $\tilde{s}_p^{pre}$ is the MinMax-normalized pre-graph score $s_p^{pre}$ in Eq. [9](https://arxiv.org/html/2605.22878#S3.E9). This ensures that while graph-discovered papers can achieve high ranks, those with zero initial semantic relevance must demonstrate exceptionally strong topological support to surpass primary candidates. We then normalize the graph score $s_p^{graph}$ with MinMax to obtain $\tilde{s}_p^{graph}$.
The comprehensive final score $s_p^{final}$ is defined as a weighted linear combination of three normalized components and a title-matching bonus:
$$s_p^{final} = \min\left(1, \lambda_{pre}\tilde{s}_p^{pre} + \lambda_{graph}\tilde{s}_p^{graph} g_p + \lambda_{imp}\,\text{imp}_{final}(p)\right), \tag{18}$$
where $\lambda_{pre}$ (default 0.35) weights the initial relevance, $\lambda_{graph}$ (default 0.45) weights the topological support from the graph, and $\lambda_{imp}$ (default 0.20) weights the citation importance. We finally return the top-20 papers, accompanied by a detailed score breakdown and path-based explanations, to provide researchers with a transparent and deterministic “cognitive map” of the retrieval results. The entire retrieval process can be completed within 2 minutes, considerably faster than LLM-based deep research frameworks, while still delivering high-relevance results with in-depth topological reasoning.
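The gating and combination steps in Eqs. 17-18 can be illustrated with a short Python sketch. It is written under stated assumptions rather than taken from the released code: the function names are hypothetical, papers discovered only through graph expansion are given a normalized pre-graph score of 0, constant score sets normalize to 0, and $\text{imp}_{final}(p)$ is passed in as a precomputed value in $[0, 1]$ because Eq. 10 is defined outside this section.

```python
def minmax(scores):
    """MinMax-normalize a dict of scores into [0, 1]; constant inputs map to 0 (an assumption)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def final_scores(pre, graph, imp, lam_pre=0.35, lam_graph=0.45, lam_imp=0.20, floor=0.25):
    """Combine pre-graph, graph, and importance scores following Eqs. 17-18.

    pre:   paper -> pre-graph score (seed papers only)
    graph: paper -> graph score r_p for every paper node in the subgraph
    imp:   paper -> importance score, assumed already scaled to [0, 1]
    """
    pre_n = minmax(pre)
    graph_n = minmax(graph)
    out = {}
    for p in graph:
        s_pre = pre_n.get(p, 0.0)   # graph-discovered papers: no initial relevance
        g = max(floor, s_pre)       # Eq. 17: graph support factor
        raw = lam_pre * s_pre + lam_graph * graph_n[p] * g + lam_imp * imp.get(p, 0.0)
        out[p] = min(1.0, raw)      # Eq. 18: clip at 1
    return out


def top_k(scores, k=20):
    """Return the k highest-scoring papers, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In this sketch, a paper reached only through the graph has its graph term scaled by the floor of 0.25, so it needs a high normalized graph score or a high importance score to outrank a seed paper. Lowering `floor` tightens that gate, and raising it makes the search more exploratory.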
## 4 Downstream Application of SciAtlas
Building upon SciAtlas and our search algorithms, in this section we propose several potential downstream applications of SciAtlas to facilitate researchers’ scientific endeavors and accelerate automated scientific research. The detailed prompts used in this section can be found in Appx. [B.2](https://arxiv.org/html/2605.22878#A2.SS2).
### 4.1 Literature Review
One of the most fundamental applications of scientific search is literature review, which essentially involves retrieving papers relevant to a given research direction and synthesizing a review report. We present a basic retrieval pipeline in §[3](https://arxiv.org/html/2605.22878#S3), where users can customize retrieval based on their specific requirements for the retrieved literature. For instance, 1) if the retrieved papers are required to be published in top-tier conferences or journals, venue information can be incorporated into the importance score calculation of papers; 2) if author authority is emphasized, the citation count of authors can be reflected in the weight of `AUTHORED` edges; 3) if institutional authority is emphasized, corresponding weights can be assigned to `AFFILIATED_WITH` edges based on the reputation of institutions. Our algorithm provides flexible hyperparameter selection and functional adaptation to accommodate diverse retrieval focus requirements. We will progressively open configuration permissions for various hyperparameters of the retrieval algorithm to support user-customized retrieval. The retrieved paper collection can then be passed to various LLM-based automated literature review methods [autosurvey, deepreview, surveyforge, surveyx].
### 4.2 Idea Grounding and Evaluation
##### Idea Grounding.
By using the idea or paper as the query, we can retrieve a set of highly relevant papers from the KG and segment the full texts of these papers into finer-grained paragraphs. Subsequently, we employ an LLM to extract more refined queries or claims from the idea across multiple dimensions, including motivation, methodology, and experimental design, and use these refined queries to retrieve relevant paragraphs. Then, through LLM-based analysis, we identify the similarities and differences between the idea and the retrieved paragraphs. Through this entire pipeline, we can determine whether prior similar work exists for the idea, find evidence to support it, or identify its real innovative aspects. Since grounding may prioritize paper relevance, we can relax the emphasis on paper citations in Eqs. [11](https://arxiv.org/html/2605.22878#S3.E11) and [18](https://arxiv.org/html/2605.22878#S3.E18). We use [innoeval] as a running example in the following:
An Example of Idea Grounding
- Target Idea or Paper: InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem.
- Target Query or Claim from the Idea: Mainstream approaches directly using LLM-as-a-Judge fossilize the models’ inherent biases into de facto evaluation criteria, failing to emulate the deliberation among distinct perspectives needed for fair scientific evaluation.
- Evidence Paper: Evaluating LLMs’ Divergent Thinking Capabilities for Scientific Idea Generation with Minimal Context [divergent].
- Evidence Paragraph: Furthermore, we recognize a fundamental challenge in the reliability of LLM-as-a-Judge approaches. When evaluating scientific ideas containing concepts outside the judge models’ knowledge boundaries, these models might misunderstand novel concepts and consequently misjudge their originality or feasibility. While our use of a dynamic panel of state-of-the-art judge models likely provides broader…
- Matching Aspect: Limitations of LLM-as-a-Judge approaches.
- Similar Point: Both identify a fundamental challenge or limitation with LLM-as-a-Judge approaches for evaluating scientific ideas. Both imply that LLM judges may produce unreliable or biased evaluations due to inherent model limitations.
- Different Point: Target idea emphasizes failure to emulate multi-perspective deliberation for fair scientific evaluation; evidence focuses on bias and score caused by limited knowledge without addressing deliberation or multi-perspective aspects.
##### Idea Evaluation.
With the grounding results, we can evaluate the idea by assessing its novelty based on the existence of prior similar work, its feasibility based on theoretical evidence, and its soundness based on the experimental designs of related studies. The focus of the grounding stage can be adjusted according to the criteria of downstream evaluation. This process can serve as a decision-making reference for human experts or be replaced by LLM-as-a-judge, becoming a critical evaluation signal for idea iteration in automated scientific discovery.
### 4.3 Idea Generation
We can use a research direction as the query, or an idea or a paper as the anchor, where retrieval in the KG functions as a knowledge collection process. The collected papers can be utilized for a literature review to identify gaps and propose new ideas, or to synthesize concepts from different domains and generate interdisciplinary ideas. Noting that the emergence of novel ideas typically stems from the fusion and refinement of two relatively distant concepts, we can relax the constraints on distant nodes in Eq. [17](https://arxiv.org/html/2605.22878#S3.E17) during the search process to make the search more exploratory and the retrieved papers more diverse, thereby enhancing the novelty of generated ideas. Here we show an example by using “Knowledge Editing” as the query:
An Example of Idea Generation
- Idea: Federated and Privacy-Preserving Knowledge Editing
- Description: Design a knowledge editing framework suitable for a federated learning setting, where edits (e.g., corrections from user feedback) are computed locally on client devices and then aggregated to update a central model without exposing raw user data or the specific edits requested by individuals.
- Novelty: All existing editing methods assume centralized access to the full model and edit dataset. This idea introduces the constraints of federated learning (data decentralization, privacy, and communication efficiency) to the knowledge editing problem, a combination not yet explored.
- Significance: Enables large-scale, privacy-respecting model updates from distributed user interactions. This is crucial for applications like personal AI assistants on mobile devices, where users want to correct model errors without compromising their private data or queries.
- Key References:
  - [2024] Knowledge Editing on Black-box Large Language Models
  - [2023] EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models
  - [2025] Massive Editing for Large Language Models Based on Dynamic Weight Generation
### 4.4 Research Trend Predicting
For trend prediction in a specific research direction, the most critical aspect is understanding the current development status of that direction, which aligns with the objective of idea generation. The distinction lies in the fact that trend prediction emphasizes paper influence, as more impactful papers typically signify greater evolution in the research direction. Therefore, in this task, we can increase the importance weight of paper citations in Eqs. [11](https://arxiv.org/html/2605.22878#S3.E11) and [18](https://arxiv.org/html/2605.22878#S3.E18) during the search. Furthermore, to achieve a more comprehensive understanding of the field, we can relax the constraints on the number of papers retained during the search process and in the final results, making the retrieval outcomes more general. We can sort the retrieved papers chronologically and employ an LLM to summarize the developmental trajectory of the research direction, focusing on the discussion or limitation sections of papers to identify critical problems that need to be addressed and propose potential research directions for the future. Here is an example of research trend predicting by LLM:
An Example of Research Trend Predicting
- Research Direction: Biologically plausible learning in spiking neural networks.
- Stage Summary:
  1. 2006-2014: Foundational Mechanisms. Early exploration of biologically plausible learning rules for spiking networks, focusing on gradient estimation through dynamic perturbation, unsupervised learning via STDP, and basic cognitive function implementation.
  2. 2015-2019: Cognitive and Sequence Learning. Application of biologically plausible rules to more complex tasks, including goal-directed decision making, sequence learning, and pattern recognition, with growing emphasis on combining multiple plasticity mechanisms.
  3. 2020-2022: Systematic Framework Development. Concerted effort to develop learning frameworks as alternatives to backpropagation, addressing credit assignment problems and creating unified approaches that maintain biological plausibility while improving performance.
  4. 2023-2025: Integration and Efficiency. Focus on optimizing for efficiency and scalability, incorporating event-driven computation, exploring different neuron models, and developing novel mechanisms like bidirectional distillation for competitive performance.
- Future Directions:
  1. Development of fully event-driven, large-scale learning systems.
  2. Integration of neuromodulation and attention mechanisms into learning frameworks.
  3. Co-design of algorithms and neuromorphic hardware for optimal efficiency.
  4. Exploration of meta-learning and continual learning in biologically plausible SNNs.
  5. Bridging the gap between computational models and experimental neuroscience findings.
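The chronological ordering step that precedes the LLM call can be sketched as follows. This is only an illustration: the `title` and `year` field names and the line-based text layout are assumptions, and the exact format expected by the `{papers_by_year}` placeholder of the trend prompt in Appendix B.2.4 is not specified in the report.

```python
from itertools import groupby


def papers_by_year(papers):
    """Sort papers chronologically and render them as a year-grouped text block.

    papers: list of dicts with hypothetical keys "title" and "year".
    """
    ordered = sorted(papers, key=lambda p: (p["year"], p["title"]))
    lines = []
    for year, group in groupby(ordered, key=lambda p: p["year"]):
        lines.append(f"{year}:")
        lines.extend(f"- {p['title']}" for p in group)
    return "\n".join(lines)
```

Grouping by year keeps the prompt compact and makes stage boundaries easier for the model to identify; sorting by title within a year only serves to make the output deterministic.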
### 4.5 Related Author Retrieval
Given a research direction, retrieving relevant authors in that field can be as straightforward as simply replacing Eq. [16](https://arxiv.org/html/2605.22878#S3.E16) with:
$$s_a^{graph} = r_a, \quad a \in V' \cap \texttt{Author} \tag{19}$$
Subsequently, factors such as the authors’ citation counts can serve as critical references for final ranking and filtering. During the retrieval process, to emphasize the contribution of authors to papers, we can adjust the weights of `AUTHORED` edges based on author order (e.g., increasing the transition probabilities for first and last authors relative to other authors) to retrieve the most relevant authors.
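Both adjustments described above are small in code. The sketch below is hypothetical: the function names, the `boost` factor of 2.0, and the assumption that the `position` property of `AUTHORED` edges is 1-indexed are illustrative choices, not values taken from the report.

```python
def author_scores(r, node_type):
    """Eq. 19: keep only the propagated scores of Author nodes.

    r:         node -> graph score after propagation
    node_type: node -> label string such as "Author" or "Paper"
    """
    return {v: score for v, score in r.items() if node_type.get(v) == "Author"}


def authored_edge_weight(position, n_authors, base=1.0, boost=2.0):
    """Weight an AUTHORED edge by author order.

    First and last authors get `base * boost`; all others get `base`.
    `position` is assumed to be 1-indexed; `boost` is an illustrative value.
    """
    is_first = position == 1
    is_last = position == n_authors
    return base * boost if (is_first or is_last) else base
```

Because transition probabilities are obtained by normalizing edge weights over a node's neighbors, boosting the first and last authors' edges shifts relatively more propagated mass between a paper and those authors.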
### 4.6 Researcher Background Review
Given an author, we can directly match the author’s name to the corresponding graph node, collect all of the author’s published papers from the graph, and summarize the author’s academic background using an LLM. Since an author may simultaneously work on multiple research directions, we can first cluster the collected papers and have the LLM summarize within each cluster, then integrate them into a unified report. An LLM-generated researcher profile is shown below:
Researcher Background Review
\*\*\* is a prolific researcher with an academic trajectory spanning Natural Language Processing, Artificial Intelligence, and Large Language Models. His work demonstrates a strong emphasis on bridging symbolic knowledge (knowledge graphs) with statistical learning (large language models), particularly in the areas of information extraction, reasoning, and agentic systems. A significant recent pivot involves developing methods to understand, control, and align the internal mechanisms and behaviors of large-scale AI models, moving from application-focused to fundamental model analysis and intervention.
Research Trajectory:
1. Knowledge-Enhanced Language Models & Information Extraction (2018-2023): Early and sustained focus on integrating structured knowledge (ontologies, knowledge graphs) into NLP models. This includes pioneering work on prompt-tuning for relation extraction (KnowPrompt), generative models for knowledge graph completion (GenKGC), and multimodal knowledge graphs. The goal is to make models more data-efficient, interpretable, and grounded in factual knowledge.
2. Reasoning, Planning, and Agentic AI Systems (2023-2026): A major shift towards enabling LLMs to perform complex, multi-step reasoning and act as autonomous agents. Research focuses on augmenting agents with actionable knowledge (KnowAgent), refining their planning capabilities, and developing benchmarks for evaluation. This theme explores how to equip LLMs with procedural knowledge and reliable world models for task execution.
3. Model Analysis, Control, and Alignment (2023-2026): A cutting-edge direction focused on diagnosing and steering the internal dynamics of LLMs. Work includes developing unified frameworks for understanding parameter dynamics from fine-tuning to activation steering, diagnosing truthfulness via consistency under perturbation, and predicting unintended behaviors from data. This represents foundational research into model interpretability and safety.
## 5 Limitations and Future Work
Our SciAtlas is under continuous maintenance and updates. To further facilitate automated scientific discovery, we enumerate several important directions for future work.
##### CLI and Skills\.
Currently, our KG is primarily accessed through the Neo4j interface\. Although we provide usage guidelines, users are still required to write Neo4j queries if conducting secondary development\. To facilitate user adoption, particularly for integration with AI agents, we will encapsulate various KG retrieval and invocation functionalities into Command Line Interfaces \(CLI\)\. Additionally, for downstream tasks, we will distill the best practices identified during our experimental process into agentic skills, enabling one\-click loading when utilizing agents for automated scientific discovery\.
##### Integrating More Knowledge Forms\.
Currently, the scientific knowledge in our KG primarily encompasses papers, keywords, authors, and other paper-centric entities. However, the complete research workflow extends beyond these elements to include atomic knowledge, theorems and standards, experimental experiences, datasets and code, among others. How to acquire such knowledge and link it to papers, forming a more extensive and well-organized knowledge network that facilitates agentic utilization and reasoning, is a crucial research direction for our future work. We argue that KG is an indispensable knowledge organization form for scientific discovery because, although LLMs have achieved remarkable advancements in semantic understanding, they still exhibit substantial deficiencies in capturing logical relationships among knowledge entities, a capability of paramount importance for scientific research that transcends mere semantic associations.
##### Benchmark and Evaluation\.
Benchmarks serve as a critical engine driving scientific progress. Although automated scientific research has gained considerable popularity, numerous stages within this domain still lack high-quality benchmarks and evaluation metrics that faithfully simulate real-world research scenarios. Furthermore, many scientific tasks involve long-form responses, and the evaluation of such outputs is often ambiguous, making it difficult to establish definitive verifiers. KGs, as symbolic knowledge repositories, can provide essential reference points for such verification processes. Additionally, the knowledge stored within KGs can serve as valuable data sources for benchmark construction. In this paper, we merely present running examples of downstream tasks, remaining at the qualitative analysis level. In future work, we will develop dedicated benchmarks based on SciAtlas to quantitatively assess the downstream application capabilities of agent scientists.
##### Dynamic Update\.
Currently, our KG updates primarily rely on periodic manual execution of fixed scripts\. Although we support user\-initiated updates, automated real\-time updates are essential to keep pace with the rapidly evolving knowledge landscape\. In future work, we will systematize the real\-time update strategies mentioned in §[2\.3](https://arxiv.org/html/2605.22878#S2.SS3.SSS0.Px1)to support daily KG update mechanisms\.
## 6 Related Work
### 6.1 Automated Scientific Research
Recent breakthroughs in LLMs [reasoning-survey, long-cot-survey] have propelled them into a central position within the domain of Automated Scientific Discovery [dsgym, agent-laboratory, ai-scientist, how-far-ai-sci, datamind]. The complete workflow of automated scientific discovery comprises five consecutive phases: i) Literature Reviewing, during which LLMs search for papers on designated topics across the internet or specialized collections and consolidate them into organized surveys [autosurvey, surveyx, opensholar, surveyforge, litllms]; ii) Hypothesis Generation, where LLMs leverage both their inherent parametric knowledge and acquired external information to formulate feasible research concepts [chain-of-ideas, virsci, deep-ideation, researchagent, scipip]; iii) Method Implementation, wherein LLMs convert the generated hypotheses into functional code, verify them via rigorous experimentation, and conduct statistical evaluation [alphaevolve, automind, ml-master, alpharesearch, aide]; iv) Manuscript Writing, in which LLMs document the research rationale, technical approach, and experimental outcomes in the form of academic papers or reports [overleafcoplilot, xtragpt]; and v) Peer Reviewing, where LLMs assume the responsibilities of reviewers to assess manuscripts from multiple perspectives [cycleresearcher, agentreview, reviewer2, deepreview]. The entire workflow of automated scientific discovery is an extremely knowledge-intensive process, in which literature review serves as the primary source of external knowledge beyond model parametric knowledge. Consequently, a precise scientific search is of paramount importance for the whole workflow.
### 6.2 Scientific Search and Discovery
Human scientists typically conduct scientific retrieval through general-purpose academic search platforms such as Google Scholar and Semantic Scholar, domain-specific preprint servers including arXiv, ChemRxiv, and PubMed, or official publisher platforms for journals and conferences. In the domain of automated scientific research, early efforts primarily relied on keyword or vector-based retrieval within local paper collections [researchagent, virsci, rnd, scipip]. With the agentic advancement of LLMs, web-based literature resources have become accessible through API calling [can-llm-gen-novel, chain-of-ideas, deep-ideation, ai-researcher, internagent, innoeval]. Deep research agent frameworks can further leverage the semantic understanding and reasoning capabilities of LLMs to enable in-depth literature retrieval [deepxiv, wispaper, opennovelty]. However, these approaches not only incur high computational costs and response latency but, lacking deterministic cognitive maps as anchors for LLMs, are also highly susceptible to logical hallucinations within complex exploratory trajectories. We therefore argue that KG is an indispensable knowledge organization form for scientific discovery because, although LLMs have achieved remarkable advancements in semantic understanding, they still exhibit substantial deficiencies in capturing logical relationships among knowledge entities, a capability of paramount importance for scientific research that transcends mere semantic associations. A recent related work, OmniScientist [omniscientist], has also proposed a research knowledge base. However, it lacks the integration of core keywords for paper interconnection and semantic vectors. Furthermore, its Elasticsearch-based search algorithm merely relies on simple propagation through citation and reference relationships, without performing structured traversal and deep topological reasoning over heterogeneous subgraphs to uncover potentially relevant literature.
## 7 Conclusion
In this report, we introduce SciAtlas, a large-scale, multi-disciplinary, heterogeneous academic knowledge graph designed as a panoramic scientific evolution network. By integrating 9 categories of entity nodes, 12 categories of relational edges, and over 43M papers, SciAtlas provides a structured topological cognitive substrate that dismantles disciplinary barriers and furnishes AI agents with a global perspective. Furthermore, we develop a neuro-symbolic retrieval algorithm featuring tri-path collaborative recall and graph reranking, achieving a seamless transition from simple semantic matching to deterministic association discovery. We also present key application directions of SciAtlas, including automated research trend synthesis, idea positioning, and academic trajectory exploration, to demonstrate that SciAtlas can serve as an effective “cognitive map” to empower the full loop of automated scientific research while reducing reasoning costs.
## Appendix A Full Schema of SciAtlas
Table 3: Node types and attributes in the Neo4j schema.
| Node Type | Attribute | Type |
| --- | --- | --- |
| Author | `id` | ID |
| Author | `label` | string |
| Author | `display_name` | string |
| Author | `orcid` | string |
| Author | `works_count` | int |
| Author | `cited_by_count` | int |
| Author | `h_index` | int |
| Author | `i10_index` | int |
| Author | `mean_citedness_2y` | float |
| Author | `created_date` | string |
| Author | `updated_date` | string |
| Domain | `id` | ID |
| Domain | `label` | string |
| Domain | `display_name` | string |
| Domain | `description` | string |
| Domain | `works_count` | int |
| Domain | `cited_by_count` | int |
| Domain | `created_date` | string |
| Domain | `updated_date` | string |
| Field | `id` | ID |
| Field | `label` | string |
| Field | `display_name` | string |
| Field | `description` | string |
| Field | `works_count` | int |
| Field | `cited_by_count` | int |
| Field | `created_date` | string |
| Field | `updated_date` | string |
| Institution | `id` | ID |
| Institution | `label` | string |
| Institution | `display_name` | string |
| Institution | `ror` | string |
| Institution | `country_code` | string |
| Institution | `country` | string |
| Institution | `city` | string |
| Institution | `type` | string |
| Institution | `works_count` | int |
| Institution | `cited_by_count` | int |
| Institution | `h_index` | int |
| Institution | `homepage_url` | string |
| Institution | `created_date` | string |
| Institution | `updated_date` | string |
| Keyword | `id` | ID |
| Keyword | `label` | string |
| Keyword | `text` | string |
| Keyword | `text_normalized` | string |
| Keyword | `frequency` | int |
| Keyword | `text_embedding` | float[] |
| Paper | `created_date` | string |
| Paper | `updated_date` | string |
| Paper | `pdf_url` | string |
| Paper | `pdf_source_id` | string |
| Paper | `pdf_source_display_name` | string |
| Paper | `pdf_source_type` | string |
| Paper | `pdf_is_oa` | boolean |
| Paper | `pdf_is_published` | boolean |
| Paper | `pdf_version` | string |
| Paper | `venue_source_id` | string |
| Paper | `venue_source_display_name` | string |
| Paper | `venue_source_type` | string |
| Paper | `venue_raw_source_name` | string |
| Source | `id` | ID |
| Source | `label` | string |
| Source | `display_name` | string |
| Source | `type` | string |
| Source | `issn_l` | string |
| Source | `is_oa` | boolean |
| Source | `is_core` | boolean |
| Source | `works_count` | int |
| Source | `cited_by_count` | int |
| Source | `created_date` | string |
| Source | `updated_date` | string |
| Subfield | `id` | ID |
| Subfield | `label` | string |
| Subfield | `display_name` | string |
| Subfield | `description` | string |
| Subfield | `works_count` | int |
| Subfield | `cited_by_count` | int |
| Subfield | `created_date` | string |
| Subfield | `updated_date` | string |
| Topic | `id` | ID |
| Topic | `label` | string |
| Topic | `display_name` | string |
| Topic | `description` | string |
| Topic | `keywords` | string[] |
| Topic | `works_count` | int |
| Topic | `cited_by_count` | int |
| Topic | `domain_id` | string |
| Topic | `field_id` | string |
| Topic | `subfield_id` | string |
| Topic | `created_date` | string |
| Topic | `updated_date` | string |

Table 4: Relationship types in the Neo4j schema.
| Type | Source | Target | Properties |
| --- | --- | --- | --- |
| `AFFILIATED_WITH` | Author | Institution | `is_current` (boolean) |
| `AUTHORED` | Author | Paper | `position` (int), `is_corresponding` (boolean), `raw_name` (string) |
| `CITES` | Paper | Paper | none |
| `COAUTHOR` | Author | Author | `count` (int) |
| `COOCCUR` | Keyword | Keyword | `count` (int) |
| `DOMAIN_OF` | Field | Domain | none |
| `FIELD_OF` | Subfield | Field | none |
| `HAS_KEYWORD` | Paper | Keyword | `relevance_score` (float) |
| `HAS_TOPIC` | Paper | Topic | `score` (float), `is_primary` (boolean) |
| `RELATED_TO` | Paper | Paper | none |
| `SUBFIELD_OF` | Topic | Subfield | none |

Table 5: Indexes in the Neo4j schema.
| Index Name | Type | Definition |
| --- | --- | --- |
| `paper_title_normalized_idx` | RANGE | `:Paper(title_normalized)` |
| `paper_text_ft` | FULLTEXT | `:Paper(title, abstract)` |
| `paper_title_ft` | FULLTEXT | `:Paper(title)` |
| `paper_abstract_ft` | FULLTEXT | `:Paper(abstract)` |
| `keyword_text_ft` | FULLTEXT | `:Keyword(text, text_normalized)` |

Table 6: Vector indexes in the Neo4j schema.
| Index Name | Node | Configuration |
| --- | --- | --- |
| `paper_title_embedding_idx` | Paper | dimensions=1024, similarity=COSINE |
| `paper_abstract_embedding_idx` | Paper | dimensions=1024, similarity=COSINE |
| `keyword_text_embedding_idx` | Keyword | dimensions=1024, similarity=COSINE |
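As a rough illustration of how a vector index such as `keyword_text_embedding_idx` might be declared and queried, the Python sketch below only builds Cypher strings. It assumes Neo4j 5.x vector-index syntax and the `db.index.vector.queryNodes` procedure; the helper name is hypothetical, and the statements actually used to build SciAtlas are not shown in the report. The `text_embedding` property comes from the Keyword attributes listed in the node table above.

```python
def create_vector_index_cypher(name, label, prop, dimensions=1024, similarity="cosine"):
    """Build a CREATE VECTOR INDEX statement (Neo4j 5.x syntax is assumed)."""
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS\n"
        f"FOR (n:{label}) ON n.{prop}\n"
        "OPTIONS {indexConfig: {\n"
        f"  `vector.dimensions`: {dimensions},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )


# Nearest-neighbour lookup against a vector index; $index_name, $k and $embedding
# are query parameters supplied by the caller at execution time.
KEYWORD_KNN_QUERY = (
    "CALL db.index.vector.queryNodes($index_name, $k, $embedding) "
    "YIELD node, score "
    "RETURN node.text AS keyword, score"
)

# Statement for the keyword index listed in the vector-index table.
KEYWORD_INDEX_DDL = create_vector_index_cypher(
    "keyword_text_embedding_idx", "Keyword", "text_embedding"
)
```

Both strings can be passed to any Neo4j client; keeping the embedding as a query parameter avoids inlining a 1024-dimensional vector into the query text.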
## Appendix B Prompts Used in this Report
### B.1 Keyword Extraction
Keyword Extraction
System Prompt: You are an expert assistant that extracts high-level academic keywords for knowledge graph construction.
User Prompt:
Goal: From an academic paper abstract, extract a small set of canonical, high-level keywords that represent the main research topics, tasks, or method categories of the paper. These keywords will be used as entities in a knowledge graph to connect related papers across many scientific and engineering fields.
Important: The goal is NOT to capture detailed phrases from the abstract. The goal is to identify general, reusable concepts that are likely to appear in many different papers.
Requirements:
- Extract only 3-8 keywords.
- Prefer high-level research areas, problem types, method families, or evaluation paradigms.
- Normalize detailed wording into broader concepts when possible.
- Keywords should be reusable across many papers and suitable as knowledge graph entities.
- Use concise noun phrases (typically 1-4 words).
- If the abstract does not contain many strong high-level concepts, return fewer keywords.
- Also score each keyword’s relevance to the abstract on a 1-10 integer scale.

Avoid:
- Long descriptive phrases copied from the abstract
- Paper-specific terminology or system names
- Highly customized or marketing-style expressions
- Implementation details or narrow technical descriptions

Good keyword examples (general, reusable concepts):
- machine learning
- wireless communication
- computer vision
- protein structure prediction
- finite element method
- Monte Carlo simulation
- graph neural networks
- energy optimization
- fault detection
- numerical simulation

Bad keyword examples (paper-specific phrasing):
- hierarchical dual-path adaptive learning framework
- multi-stage cross-modal feature fusion architecture
- lightweight energy-aware dynamic routing mechanism
- novel high-performance prototype system
- end-to-end task-specific optimization pipeline

Abstract: {abstract}
Output format: Return ONLY a JSON object:
```
{"keywords": ["keyword1", "keyword2"], "scores": [8, 7]}
```
scores must be integers from 1 to 10, aligned with keywords, with no extra text.
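Because the prompt asks for a strict JSON object, the model's reply can be checked before keywords are written to the graph. The following is a minimal validation sketch, not the authors' ingestion code: the function name is hypothetical, and the checks simply mirror the constraints stated in the prompt (aligned lists, at most 8 keywords, integer scores from 1 to 10).

```python
import json


def parse_keyword_response(raw):
    """Parse and validate a keyword-extraction reply; return (keyword, score) pairs.

    Raises ValueError when the reply violates the constraints stated in the prompt.
    """
    data = json.loads(raw)
    keywords = data["keywords"]
    scores = data["scores"]
    if len(keywords) != len(scores):
        raise ValueError("keywords and scores must be aligned")
    if len(keywords) > 8:
        raise ValueError("at most 8 keywords are allowed")
    for score in scores:
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
            raise ValueError(f"score out of range: {score!r}")
    return list(zip(keywords, scores))
```

No lower bound on the number of keywords is enforced, since the prompt allows fewer than three when the abstract contains few strong high-level concepts.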
### B.2 Downstream Tasks
#### B.2.1 Idea Grounding – Query Generation
Idea Grounding – Query Generation: Generate dense-retrieval queries from structured scientific extraction results.
System Prompt: You are a scientific grounding agent. Generate retrieval-oriented paragraph search queries from structured scientific extraction results. Return strict JSON only.
User Prompt: Generate dense-retrieval queries from the structured extraction below.
Requirements:
- Use only the motivation and method sections as query sources.
- Consider all provided sentences from those two sections before deciding which queries to emit.
- Select the most retrieval-useful items yourself. Do not mirror every input sentence if some are redundant.
- Produce at most max_queries total queries across both sections combined.
- You may allocate the total freely across motivation and method.
- Each output item must contain:
  - section: either motivation or method
  - sentence: the source sentence you selected
  - query: the final retrieval-oriented rewrite
- Keep the selected sentence meaning exactly.
- Write concise academic retrieval phrases or sentences likely to match paper paragraphs.
- Preserve task, object, method, mechanism, training objective, dataset, baseline, metric, or analysis anchors when present.
- Avoid vague wording such as “the framework”, “this method”, “evaluation”, “performance improvement”, or “how it works”.
- Do not introduce unsupported facts.

Return JSON with this schema only:
```
{
  "items": [
    {
      "section": "motivation | method",
      "sentence": "selected source sentence",
      "query": "retrieval-oriented rewrite optimized for paragraph retrieval"
    }
  ]
}
```
Structured extraction: {items_json}
#### B.2.2 Idea Grounding – Grounding
Idea Grounding – Grounding: Analyze how retrieved evidence aligns with a research idea unit.
System Prompt: You are a scientific grounding alignment agent. Given a research idea unit, a retrieved paragraph, and local paper context, analyze how the retrieved evidence aligns with the target idea. Use only the provided evidence. Return strict JSON only.
User Prompt: Analyze how the retrieved evidence aligns with a specific research idea unit.
Research idea: {idea_text}
Target query section: {query_section}
Target query sentence: {query_sentence}
Retrieval query: {query_text}
Paper title: {paper_title}
Paper abstract: {paper_abstract}
Section path: {section_path_text}
Retrieved paragraph: {paragraph_text}
Previous paragraphs: {previous_paragraphs}
Next paragraphs: {next_paragraphs}
Requirements:
- Judge whether the retrieved evidence supports the target query unit.
- Use the paragraph as the primary evidence and use the surrounding context only to clarify scope, subject, or omitted details.
- Do not invent claims that are not grounded in the provided paragraph or context.
- If the paragraph is weakly related, say so explicitly.
- focus_aspect should state which specific aspect of the target query is actually addressed by the evidence.
- grounded_passage should be a concise evidence-focused passage, usually 1-3 sentences, that is more context-aware and better aligned to the target query than the raw paragraph alone.
- evidence_span should quote or closely paraphrase the most directly relevant span from the retrieved paragraph or immediate context.
- shared_points should list concrete aspects that are aligned between the target idea unit and the evidence.
- different_points should list concrete mismatches, missing parts, narrower scope, or different emphasis between the evidence and the target idea unit.
- coverage_label must be one of: high, partial, limited, none.
  - Use high only when the evidence covers most of the target idea unit.
  - Use partial when there is clear overlap but also important missing coverage.
  - Use limited when the evidence is only weakly related or covers a small sub-aspect.
  - Use none when it is essentially irrelevant.
- why_this_matches should briefly explain the overall judgment.

Return JSON with this schema only:
```
{
  "status": "supported | partially_supported | weak_match | irrelevant",
  "focus_aspect": "which specific aspect of the target query is grounded",
  "grounded_passage": "context-aware grounding passage",
  "evidence_span": "most relevant evidence span",
  "shared_points": ["..."],
  "different_points": ["..."],
  "coverage_label": "high | partial | limited | none",
  "why_this_matches": "brief explanation of the match quality"
}
```
#### B\.2\.3 Idea Generation
Idea Generation: Propose novel research ideas from a collection of papers\.

System Prompt: You are an expert research idea generator and only return valid JSON\.

User Prompt: You are an expert research idea generator\. Given the papers below, propose novel research ideas that extend, combine, or contrast this work\. Return strict JSON only with this schema:

```
{
  "ideas": [
    {
      "title": "concise idea title",
      "description": "2-3 sentence core description of the idea",
      "novelty": "why this is novel compared to existing work",
      "significance": "potential impact and importance",
      "key_references": ["paper title 1", "paper title 2"]
    }
  ]
}
```

Generate exactly \{idea\_count\} ideas\.

Papers: \{paper\_context\}
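The prompt fixes both the number of ideas and the keys of each idea, so both can be checked mechanically. The sketch below shows one way to build the `paper_context` value and verify a reply. The helper names and the numbered title-plus-abstract layout are assumptions for illustration; the source does not say how `paper_context` is actually rendered.

```python
import json

# Keys required for every idea, copied from the schema above.
IDEA_KEYS = {"title", "description", "novelty", "significance", "key_references"}


def format_paper_context(papers):
    """Render papers as numbered 'title, then abstract' entries.

    Hypothetical layout: the real paper_context format is not given in the text.
    """
    return "\n\n".join(
        f"[{i}] {p['title']}\n{p['abstract']}" for i, p in enumerate(papers, 1)
    )


def check_ideas(raw, idea_count):
    """Return the parsed ideas, or raise ValueError if the reply breaks the schema."""
    data = json.loads(raw)
    ideas = data.get("ideas") if isinstance(data, dict) else None
    if not isinstance(ideas, list) or len(ideas) != idea_count:
        raise ValueError(f"expected exactly {idea_count} ideas")
    for idea in ideas:
        if not isinstance(idea, dict):
            raise ValueError("each idea must be a JSON object")
        missing = IDEA_KEYS - idea.keys()
        if missing:
            raise ValueError(f"idea is missing keys: {sorted(missing)}")
    return ideas
```

Checking the count explicitly matters because the instruction "Generate exactly {idea_count} ideas" is only a request to the model, not a guarantee.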
#### B\.2\.4 Research Trend Predicting
Research Trend Predicting: Summarize research trends from chronologically ordered papers\.

System Prompt: You analyze academic topic evolution and only return valid JSON objects\.

User Prompt: You are an expert research trend analyst\. Given chronologically ordered papers from one topic, summarize the research trend\. Return strict JSON only with this schema:

```
{
  "one_sentence_summary": "...",
  "trend_summary": "...",
  "stage_summary": [
    {"period": "...", "theme": "...", "description": "..."}
  ],
  "methodological_shifts": ["..."],
  "emerging_topics": ["..."],
  "open_gaps": ["..."],
  "future_directions": ["..."],
  "representative_papers": [
    {"title": "...", "year": 2024, "why_representative": "..."}
  ]
}
```

Papers: \{papers\_by\_year\}
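The prompt expects its papers in chronological order. As an illustration of how a `papers_by_year` value could be assembled, the sketch below groups titles by year and emits them in ascending order. The function name, the input record shape (`title`, `year`), and the text layout are all assumptions; the source only gives the placeholder name.

```python
from collections import defaultdict


def format_papers_by_year(papers):
    """Group paper titles by year and list the years in ascending order.

    Hypothetical helper: the real papers_by_year layout is not given in the text.
    """
    by_year = defaultdict(list)
    for paper in papers:
        by_year[paper["year"]].append(paper["title"])

    lines = []
    for year in sorted(by_year):  # chronological order, as the prompt expects
        lines.append(f"{year}:")
        lines.extend(f"- {title}" for title in by_year[year])
    return "\n".join(lines)


papers = [
    {"title": "B", "year": 2023},
    {"title": "A", "year": 2021},
    {"title": "C", "year": 2023},
]
print(format_papers_by_year(papers))
```

Within one year, titles keep their input order, so any secondary ordering (for example by citation count) would have to be applied before calling the helper.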
#### B\.2\.5 Author Research Profile
Author Research Profile: Summarize a researcher’s major directions from their publication list\.

System Prompt: You summarize an author’s research trajectory and only return valid JSON objects\.

User Prompt: You are an expert academic intelligence analyst\. Summarize one researcher’s major directions from their publication list\. Return strict JSON only with this schema:

```
{
  "author_name": "{author_name}",
  "overall_academic_profile": "...",
  "main_research_directions": [
    {"theme": "...", "active_years": "...", "description": "..."}
  ],
  "technical_arsenal": ["..."],
  "representative_papers": [
    {"title": "...", "year": 2024, "why_representative": "..."}
  ]
}
```

Papers: \{paper\_lines\}
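These templates mix `{placeholder}` slots with literal JSON braces, so Python's `str.format` would raise on the schema itself. One possible workaround, sketched below, substitutes only brace-wrapped identifiers that appear in a supplied mapping and leaves every other brace untouched. The function name `fill_template` and the placeholder pattern are assumptions, not the authors' code.

```python
import re

# Matches {name} where name is lowercase letters and underscores only,
# so JSON braces followed by quotes or whitespace are never touched.
PLACEHOLDER = re.compile(r"\{([a-z_]+)\}")


def fill_template(template: str, values: dict) -> str:
    """Replace known {name} placeholders; leave unknown ones and JSON braces as-is.

    Hypothetical helper written for illustration.
    """
    def substitute(match):
        key = match.group(1)
        return str(values[key]) if key in values else match.group(0)

    return PLACEHOLDER.sub(substitute, template)


template = '{"author_name": "{author_name}"}\nPapers: {paper_lines}'
filled = fill_template(
    template,
    {"author_name": "Ada Lovelace", "paper_lines": "- Notes (1843)"},
)
print(filled)
```

Leaving unknown placeholders in place, rather than raising, makes a missing value visible in the rendered prompt; a stricter variant could raise `KeyError` instead.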