Verifiable Knowledge Expansion through Retrieval-Grounded Formal Concept Analysis
Summary
This paper proposes a retrieval-grounded small language model framework that uses formal concept analysis as a symbolic verification loop for ontology construction, demonstrating its effectiveness in a rare ataxia setting.
# Verifiable Knowledge Expansion through Retrieval-Grounded Formal Concept Analysis
Source: [https://arxiv.org/html/2607.01773](https://arxiv.org/html/2607.01773)
\(2026\)
###### Abstract\.
Ontology construction requires deciding which objects, attributes, and structural relations should be accepted as valid knowledge\. Language models can propose such structures from text, but their outputs can still be unsupported or inconsistent\. This paper proposes a retrieval\-augmented small language model \(SLM\) framework that uses formal concept analysis \(FCA\) as a symbolic verification loop for knowledge expansion\. Starting from seed attributes, FCA proposes implications over a growing formal context\. A retrieval\-grounded SLM oracle then validates each implication or returns a counterexample\. The oracle also supports incidence judgments, consistency checks, and attribute proposals, making accepted implications, counterexamples, contradictions, and corrections inspectable\. In a rare ataxia setting constructed from Orphadata resources, retrieval\-grounded 10\-seed runs obtain relation F1 of 0\.29–0\.52 and closure\-based implication F1 of 0\.22–0\.30\. Larger seed sets increase the number of evaluated implications and often improve implication F1\. The lower implication scores reflect a stricter evaluation of derived implications, where one missed or extra relation can affect several implication judgments\. Ablations show that incidence judgments in a fixed object–attribute setting can improve closure\-based implication scores\. However, identifying positive object–attribute pairs remains difficult even when the candidate objects and attributes are fixed\.
Keywords: Formal concept analysis, Retrieval\-augmented generation, Ontology construction, Small language models, Rare disease phenotyping
Conference: 8th epiDAMIK ACM SIGKDD Workshop on Data\-driven Decision Making for Public and Population Health; August 10, 2026; Jeju Island, Republic of Korea\. Copyright: rights retained\.

CCS concepts: Computing methodologies → Knowledge representation and reasoning; Information systems → Information retrieval; Computing methodologies → Natural language processing; Applied computing → Life and medical sciences → Health informatics\.

## 1\. Introduction
Ontologies turn domain knowledge into a structure that can be shared, queried, checked, and reused \(Gruber, [1993](https://arxiv.org/html/2607.01773#bib.bib1); Uschold and Gruninger, [1996](https://arxiv.org/html/2607.01773#bib.bib2); Khadir et al\., [2021](https://arxiv.org/html/2607.01773#bib.bib14)\)\. In biomedical domains, for example, an ontology\-like representation can connect rare disease objects to phenotype attributes and expose regularities among those phenotypes \(Robinson et al\., [2008](https://arxiv.org/html/2607.01773#bib.bib21); Köhler et al\., [2021](https://arxiv.org/html/2607.01773#bib.bib22)\)\. However, building such structures manually is costly\. Domain experts must inspect source documents, encode object–attribute relations, and revise the structure when new evidence reveals missing or inconsistent knowledge \(Uschold and Gruninger, [1996](https://arxiv.org/html/2607.01773#bib.bib2); Al\-Aswadi et al\., [2020](https://arxiv.org/html/2607.01773#bib.bib15); Khadir et al\., [2021](https://arxiv.org/html/2607.01773#bib.bib14)\)\. This makes ontology and knowledge graph construction a natural setting for language model assistance, but errors in domain\-specific knowledge make unverifiable generation risky \(Pan et al\., [2024](https://arxiv.org/html/2607.01773#bib.bib16); Lo et al\., [2024](https://arxiv.org/html/2607.01773#bib.bib20); Huang et al\., [2025](https://arxiv.org/html/2607.01773#bib.bib7)\)\. The key technical challenge is therefore to verify proposed knowledge before it becomes a structural commitment\. For example, the system may propose that a rare ataxia disease with *Ataxia*, *Cerebellar atrophy*, and *Tremor* should also have *Dysarthria*\. This rule is rejected if even one disease has the first three phenotypes but lacks *Dysarthria*\. Ontology construction therefore needs a procedure that asks targeted structural questions and incorporates counterexamples when a proposed regularity fails \(Ganter et al\., [2016](https://arxiv.org/html/2607.01773#bib.bib4)\)\.
Formal concept analysis \(FCA\) provides such a procedure\. Attribute exploration, its interactive algorithm, proposes implications over a formal context and requires a counterexample when an implication is invalid \(Ganter et al\., [1999](https://arxiv.org/html/2607.01773#bib.bib3), [2016](https://arxiv.org/html/2607.01773#bib.bib4)\)\.
The proposed framework combines FCA with retrieval\-grounded SLM judgments for rare ataxia diseases\. The task is to build a disease–phenotype formal context over standardized HPO labels such as *Ataxia*, *Tremor*, and *Dysarthria*\. The practical question is whether a small seed attribute set can activate useful parts of the controlled phenotype attribute set while preserving verifiable incidence judgments and implication checks\. Across three small language models \(SLMs\), the system expands the context over 20 rounds, keeping accepted implications, counterexamples, contradictions, and corrections inspectable\.
This paper makes three contributions\.
- An FCA\-based verification loop that tests object–attribute structures through implications and counterexamples\.
- A retrieval\-grounded SLM oracle for lower\-cost, evidence\-based local judgments\.
- A symbolic–subsymbolic hybrid in which FCA controls construction and the language model makes evidence\-based local decisions\.
Together, the experiments position the method as a verifiable construction procedure rather than an unchecked ontology generator\. The ablations further separate the difficulty of discovering object and attribute sets from the difficulty of judging incidences within those sets\.
## 2\. Background
FCA supplies the symbolic side by representing the current construction state as a formal context and testing implications with counterexamples\. RAG supplies the evidence\-grounding side by retrieving disease\-level text for the SLM oracle, turning ontology construction into a verifiable loop of implication queries, local judgments, counterexamples, and context updates\.
### 2\.1\. Formal concept analysis and attribute exploration
Formal concept analysis \(FCA\) represents object–attribute knowledge as a binary table called a formal context \(Ganter et al\., [1999](https://arxiv.org/html/2607.01773#bib.bib3)\)\. In this paper, the objects are rare ataxia diseases and the attributes are phenotype labels\. A cell is marked when a disease is associated with a phenotype\. For example, one disease may have *Ataxia*, *Tremor*, and *Dysarthria*, while another disease may have *Ataxia* and *Tremor* but not *Dysarthria*\. A formal concept consists of an extent, the set of objects in the concept, and an intent, the attributes shared by those objects\. An implication describes a regularity in the table\. The implication $A \rightarrow B$ means that every object having all attributes in $A$ also has all attributes in $B$\. For example, $\{\textit{Ataxia}, \textit{Tremor}\} \rightarrow \{\textit{Dysarthria}\}$ states that every disease with *Ataxia* and *Tremor* also has *Dysarthria*\. Such a rule is useful only if it survives a counterexample check\. A disease with *Ataxia* and *Tremor* but without *Dysarthria* would reject it\.
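The definitions above can be made concrete with a small sketch\. The Python snippet below is illustrative only: the two diseases are placeholder objects, and the helper names `extent`, `intent`, and `find_counterexample` are assumptions introduced here, not part of the paper's implementation\.

```python
# Toy formal context: each object (disease) maps to its set of attributes.
# The disease names are placeholders, not entries from the paper's dataset.
context = {
    "disease_1": {"Ataxia", "Tremor", "Dysarthria"},
    "disease_2": {"Ataxia", "Tremor"},
}


def extent(context, attrs):
    """Objects that have every attribute in `attrs`."""
    return {obj for obj, have in context.items() if attrs <= have}


def intent(context, objs):
    """Attributes shared by every object in `objs` (non-empty `objs` assumed)."""
    return set.intersection(*(context[obj] for obj in objs))


def find_counterexample(context, premise, conclusion):
    """Return an object with all premise attributes that lacks part of the
    conclusion, or None when the implication holds in this context."""
    for obj in sorted(extent(context, premise)):
        if not conclusion <= context[obj]:
            return obj
    return None


# Attributes shared by all diseases that have Tremor.
shared = intent(context, extent(context, {"Tremor"}))
# Object that rejects {Ataxia, Tremor} -> {Dysarthria}, if any.
rejecting = find_counterexample(context, {"Ataxia", "Tremor"}, {"Dysarthria"})
```

In this toy table, `disease_2` is the object that rejects the example implication, because it has both premise attributes but lacks *Dysarthria*\.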
Attribute exploration turns these regularities into oracle questions\. Following Algorithm 19 in *Conceptual Exploration* \(Ganter et al\., [2016](https://arxiv.org/html/2607.01773#bib.bib4)\), the intended domain context $(G, M, I)$ is distinguished from the observed partial context $(E_{\tau}, M_{\tau}, J_{\tau})$ at round $\tau$\. Here, $G$ is the domain object set, $M$ is the finite attribute set, $I$ is the target incidence relation, $E_{\tau}$ is the observed object set, $M_{\tau}$ is the active attribute set, and $J_{\tau}$ records checked incidences\. The closure $A^{J_{\tau}J_{\tau}}$ is the set of attributes shared by the currently observed objects that have all attributes in $A$\. The exploration query asks whether the observed implication $A \rightarrow A^{J_{\tau}J_{\tau}}$ also holds in the intended domain\. If the oracle accepts the query, the implication is added to the implication base\. If the oracle rejects it, the oracle must return a counterexample object $g$ such that $A \subseteq g^{I}$ but $A^{J_{\tau}J_{\tau}} \nsubseteq g^{I}$\. The counterexample is added to the observed context, preventing the same overgeneral rule from being proposed again\. Classical attribute exploration assumes that $M$ is fixed from the beginning; under a reliable expert and finite fixed $M$, it can return a canonical implication basis\. Following the FCA usage in *Conceptual Exploration*, this term refers to a complete and non\-redundant implication basis of a fixed formal context\. Every valid implication of the context is derivable from the basis, and no accepted implication is treated as a free\-text rule outside the context\. In this paper, FCA and attribute exploration provide the symbolic mechanism for proposing implications, checking counterexamples, and updating the disease–phenotype context\.
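As a minimal sketch of how the closure drives the exploration query, the snippet below computes the closure of a premise in an observed context and shows how one counterexample weakens the next query\. The function name `closure`, the object name `disease_2`, and the three active attributes are assumptions chosen for illustration\.

```python
def closure(observed, active_attrs, premise):
    """Attributes of `active_attrs` shared by all observed objects that have
    every attribute in `premise`. With no matching object, the closure is the
    whole active attribute set."""
    closed = set(active_attrs)
    for attrs in observed.values():
        if premise <= attrs:
            closed &= attrs
    return closed


active = {"Ataxia", "Tremor", "Dysarthria"}
observed = {}  # no observed objects yet

# With an empty context, the query {Ataxia} -> closure claims every attribute.
first_query = closure(observed, active, {"Ataxia"})

# Suppose the oracle rejects that query with a counterexample (hypothetical).
observed["disease_2"] = {"Ataxia", "Tremor"}

# The refined query no longer claims Dysarthria.
second_query = closure(observed, active, {"Ataxia"})
```

The first query is maximally strong because no observed object constrains it, which is why early rejections are what populate the context with counterexamples\.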
### 2\.2\. Retrieval\-augmented generation
Retrieval\-augmented generation \(RAG\) addresses a central limitation of purely parametric language models: factual decisions should be grounded in external evidence that can be inspected or updated\. Lewis et al\. \(Lewis et al\., [2020](https://arxiv.org/html/2607.01773#bib.bib5)\) introduced RAG as a way to combine parametric generation with non\-parametric retrieval for knowledge\-intensive NLP tasks\. In ontology construction, retrieval is useful because object–attribute relations and implications require evidence\-sensitive judgments\. Without retrieved evidence, a language model must answer from parametric memory alone, which can produce unsupported or incorrect relations\. Retrieved evidence gives the model task\-relevant information to consult before making each local judgment\. Providing retrieved evidence can improve performance over relying on parametric generation alone, especially in knowledge\-grounded dialogue and other knowledge\-intensive tasks \(Shuster et al\., [2021](https://arxiv.org/html/2607.01773#bib.bib6); Huang et al\., [2025](https://arxiv.org/html/2607.01773#bib.bib7)\)\. In this paper, retrieved evidence is used to ground local object–attribute and implication validity questions\.
### 2\.3\. Small language models
Large language models remain advantageous for open\-ended generation and multi\-step reasoning\. However, the decisions required in the FCA loop are narrower than open\-ended generation\. They are repeated, evidence\-conditioned YES/NO judgments over local object–attribute assignments or candidate implications\. For this setting, maximum accuracy is not the only consideration\. Inference cost, latency, memory footprint, and local deployability also matter\. This task\-dependent view of model size is supported by prior work on text classification, where specialized smaller models can reach or exceed the performance of general large models with a limited number of labeled examples \(Pecher et al\., [2025](https://arxiv.org/html/2607.01773#bib.bib12)\)\. It is also supported by industrial studies showing that smaller transformer models can handle practical classification workloads while offering better deployment efficiency \(Li et al\., [2025](https://arxiv.org/html/2607.01773#bib.bib13)\)\. These findings match the use of SLMs for constrained, classification\-style judgments over retrieved evidence rather than for long\-form reasoning or free\-form ontology generation\. Thus, SLMs are not a replacement for LLMs in complex reasoning, but they are a cost\-efficient option when the task can be reduced to repeated, evidence\-conditioned YES/NO decisions\.
## 3\. Method
This section describes how the symbolic and retrieval\-grounded components are combined into a single ontology construction loop\. The framework starts from seed attributes and uses FCA attribute exploration to generate implication queries\. For each query, the system first retrieves disease\-level evidence relevant to the premise and conclusion\. The SLM oracle then uses this retrieved evidence to accept the implication or search for a counterexample\. The formal context is updated with accepted implications or counterexample objects\. Unlike classical attribute exploration, this setting keeps a finite controlled phenotype attribute set but activates it progressively\. The loop starts from seed attributes $M_{0}$, explores only the active attribute set $M_{\tau}$, and then adds a selected set of new attributes $\Delta M_{\tau}$ for the next round\. Each round keeps the FCA\-based verification step\. The overall process tests whether a small seed set can expand into useful parts of $M$, while avoiding unchecked free\-text ontology generation\. The following paragraphs define the round state, context updates, oracle decisions, and logged artifacts\.
Figure 1\. Overview of the RAG\-grounded SLM\-FCA framework: a flow diagram showing the FCA exploration loop, retrieval\-grounded oracle decisions, counterexample validation, and implication\-guided attribute discovery updates\.

### 3\.1\. RAG\-grounded SLM\-FCA framework
The proposed method combines the FCA and RAG components introduced above into a single verifiable construction loop\. Figure [1](https://arxiv.org/html/2607.01773#S3.F1) summarizes the loop as six stages\. The paragraphs below define the main components of this loop, from controlled attribute selection to attribute screening for the next round\.
Controlled attribute set\. The method starts from a finite domain attribute set $M$ supplied before exploration and activates only a seed subset $M_{0}$ in the first round\. In Figure [1](https://arxiv.org/html/2607.01773#S3.F1), the term *controlled vocabulary* denotes the finite domain attribute set $M$, i\.e\., the full controlled set of phenotype attributes considered by the method\. The seed set is chosen to cover diverse, interpretable regions of the attribute set rather than a single narrow attribute cluster\. Good seeds should be evidence\-accessible, non\-redundant, and frequent enough to produce early counterexamples, while still specific enough to avoid accepting overly broad implications\. The rare ataxia experiment in Section 4 uses this criterion to choose clinically interpretable seed attributes\.
Activate formal context\. The initial context contains seed attributes but no seed objects, and rejected initial queries introduce the first counterexamples\. At round $\tau$, the active state is the formal context $(E_{\tau}, M_{\tau}, J_{\tau})$, where $E_{\tau}$ is the observed object set, $M_{\tau}$ is the active attribute set, and $J_{\tau}$ records checked incidences\.
FCA exploration\. At round $\tau$, FCA attribute exploration runs over the current attribute set $M_{\tau}$ and example context $(E_{\tau}, M_{\tau}, J_{\tau})$, producing candidate implications $A \rightarrow A^{J_{\tau}J_{\tau}}$\. Here, $J_{\tau}$ is the currently observed incidence relation, and $A^{J_{\tau}J_{\tau}}$ is the current closure of $A$, i\.e\., the attributes shared by all observed objects that have every attribute in $A$\. For readability, Algorithm [1](https://arxiv.org/html/2607.01773#alg1) writes the same closure as $\operatorname{cl}_{K_{\tau}}(A)$, where $K_{\tau} = (E_{\tau}, M_{\tau}, J_{\tau})$\. When an implication is not already closed in the current context, the retrieval\-grounded oracle either returns a verified counterexample object or accepts the implication\. After attribute exploration finishes for the fixed $M_{\tau}$, the method selects a small set of new attributes for the next round\. Because $M_{\tau}$ grows from a seed, the output is an explored implication set over the activated context, not a claim of a complete canonical basis for the full controlled phenotype attribute set\.
RAG\-SLM oracle\. The oracle uses short prompts for binary YES/NO decisions over retrieved evidence\. These decisions cover object–attribute judgments, counterexample checks, and corrections\. Retrieval queries are built from the labels in the current FCA question\. For an object–attribute query, the query combines the disease name with the phenotype label being checked\. For an implication query, the query combines premise and conclusion labels, such as *Ataxia*, *Cerebellar atrophy*, and *Dysarthria*\. The retrieved snippets are then placed in the oracle prompt as evidence for accepting the implication or searching for a counterexample\. When the loop needs additional attributes, the SLM proposes reusable property phrases from retrieved snippets, which are then filtered, validated, and checked before entering the context\. This prompt design follows prior findings that retrieval\-augmented outputs are sensitive to context placement, filtering, and distracting evidence \(Ram et al\., [2023](https://arxiv.org/html/2607.01773#bib.bib8); Wang et al\., [2023](https://arxiv.org/html/2607.01773#bib.bib9); Liu et al\., [2024](https://arxiv.org/html/2607.01773#bib.bib10)\)\. The prompt therefore shows only the most relevant snippets near the YES/NO question before constrained output generation\.
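The query and prompt construction described above might look like the following sketch\. Everything here is hypothetical: the function names, the space\-joined query format, the prompt wording, and the choice to place the highest\-scoring snippet closest to the question are assumptions for illustration, since the paper does not give its prompt templates\.

```python
def object_attribute_query(disease_name, phenotype_label):
    """Retrieval query for one incidence check: disease name plus phenotype."""
    return f"{disease_name} {phenotype_label}"


def implication_query(premise, conclusion):
    """Retrieval query for an implication: premise labels, then the
    conclusion labels that are not already in the premise."""
    labels = sorted(premise) + sorted(set(conclusion) - set(premise))
    return " ".join(labels)


def build_prompt(scored_snippets, question, k=3):
    """Keep the k highest-scoring snippets and put the best one nearest
    the YES/NO question (assumed layout)."""
    top = sorted(scored_snippets, key=lambda item: item[0], reverse=True)[:k]
    evidence = "\n".join(f"- {text}" for _, text in reversed(top))
    return f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer YES or NO:"


query = implication_query({"Ataxia", "Cerebellar atrophy"}, {"Dysarthria"})
prompt = build_prompt(
    [(0.2, "snippet-low"), (0.9, "snippet-high"), (0.5, "snippet-mid")],
    "Does every disease with Ataxia and Cerebellar atrophy also have Dysarthria?",
    k=2,
)
```

Keeping the evidence list short reflects the stated sensitivity of retrieval\-augmented outputs to distracting context; the value of `k` is a free parameter in this sketch\.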
Context and implication update\. The symbol $\Delta M_{\tau}$ denotes the selected attributes appended after round $\tau$, not the entire attribute set\. Accepted implications are reused in later rounds as provisional background knowledge for query avoidance and contradiction detection\. When a later counterexample candidate conflicts with a previously accepted implication, the implementation re\-checks the relevant object–attribute incidence and logs the correction event\. It does not perform full implication retraction or belief revision\. Thus, the method is an inspectable construction procedure, not a complete expert\-verified ontology editor\.
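Contradiction detection of this kind can be sketched as a scan of the context against the accepted implications\. The record fields \(`object`, `premise`, `missing`\) and the function name `find_violations` are assumptions made for this example; the re\-check against retrieved evidence is left out\.

```python
def find_violations(context, accepted):
    """List every (object, implication) pair where the object has all
    premise attributes but lacks some conclusion attribute."""
    events = []
    for obj, attrs in sorted(context.items()):
        for premise, conclusion in accepted:
            missing = conclusion - attrs
            if premise <= attrs and missing:
                events.append({
                    "object": obj,
                    "premise": sorted(premise),
                    "missing": sorted(missing),
                })
    return events


# Hypothetical state: one accepted implication and two observed objects.
accepted = [(frozenset({"Ataxia", "Tremor"}), frozenset({"Dysarthria"}))]
context = {
    "disease_1": {"Ataxia", "Tremor"},
    "disease_2": {"Ataxia", "Tremor", "Dysarthria"},
}
events = find_violations(context, accepted)
```

Each event only flags a conflict\. Consistent with the text, a follow\-up step would re\-check the flagged incidences and log a correction rather than retract the implication\.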
Attribute screening\. Attribute proposals are guided by the current implication structure\. For an accepted implication $A \rightarrow \operatorname{cl}_{K_{\tau}}(A)$, the closure difference $\operatorname{cl}_{K_{\tau}}(A) \setminus A$ supplies inferred attributes that can seed retrieval queries\. The system retrieves evidence from the premise, from the premise plus one inferred attribute, and from premise\-satisfying object anchors; if these queries fail, it visits unused vector database chunks one at a time\. Only short, grounded, non\-duplicate property phrases that can be judged as YES/NO attributes are mapped to canonical attribute labels, selected, and appended to $\Delta M_{\tau}$\.
Oracle decisions\. The retrieval\-grounded SLM oracle is used only for constrained local decisions\. For object–attribute queries, it returns a binary YES/NO incidence judgment from retrieved evidence, with unsupported or ambiguous evidence treated as NO\. For implication queries, it searches for a counterexample object that satisfies the premise but misses at least one conclusion attribute\. Accepted implications enter the implication base, while rejected implications add verified counterexample objects and checked incidences to the context\. Later consistency conflicts are recorded as logged contradictions and corrections rather than full implication retractions\.
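The two decision types can be sketched as follows\. This is an assumption\-laden illustration: `parse_yes_no`, `search_counterexample`, the raw answer strings, and the default of ten candidates \(borrowed from the experimental budget in Section 4\) are not taken from the paper's code\.

```python
def parse_yes_no(raw_answer):
    """Map raw oracle text to a boolean; anything but a clear YES is NO."""
    return raw_answer.strip().upper().rstrip(".!") == "YES"


def search_counterexample(candidates, premise, conclusion, judge, max_candidates=10):
    """Return (object, missing attributes) for the first candidate that
    satisfies the premise but misses a conclusion attribute, else None."""
    for obj in candidates[:max_candidates]:
        if not all(judge(obj, attr) for attr in sorted(premise)):
            continue  # premise not supported, so not a counterexample
        missing = {attr for attr in conclusion if not judge(obj, attr)}
        if missing:
            return obj, missing
    return None


# Hypothetical raw oracle answers keyed by (object, attribute).
raw_answers = {
    ("disease_1", "Ataxia"): "YES",
    ("disease_1", "Dysarthria"): "yes.",
    ("disease_2", "Ataxia"): "YES",
    ("disease_2", "Dysarthria"): "Unclear from the evidence",
}


def judge(obj, attr):
    # Missing answers count as unsupported evidence, hence NO.
    return parse_yes_no(raw_answers.get((obj, attr), ""))


result = search_counterexample(
    ["disease_1", "disease_2"], {"Ataxia"}, {"Dysarthria"}, judge
)
```

One consequence of treating ambiguous evidence as NO is visible here: `disease_2` becomes a counterexample because its *Dysarthria* evidence is unclear, not because the phenotype is known to be absent\.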
Algorithm [1](https://arxiv.org/html/2607.01773#alg1) summarizes this round\-level procedure in pseudocode\.
Algorithm 1\. RAG\-grounded SLM\-FCA ontology exploration

1. Input: controlled attribute set $M$, seed attributes $M_{0} \subseteq M$, retrieval index, SLM oracle
2. Initialize formal context $K_{0} = (E_{0}, M_{0}, J_{0})$ with $E_{0} = \varnothing$
3. for round $\tau = 1, \dots, T$ do
4. Update object incidences for new attributes, then run FCA on $K_{\tau}$
5. for each candidate implication $A \rightarrow \operatorname{cl}_{K_{\tau}}(A)$ do
6. Retrieve evidence relevant to $A$ and $\operatorname{cl}_{K_{\tau}}(A)$
7. Ask the SLM oracle to search for and verify a counterexample
8. if the implication is rejected then
9. Add the counterexample object and checked incidences to $K_{\tau}$
10. else
11. Add $A \rightarrow \operatorname{cl}_{K_{\tau}}(A)$ to the implication base
12. end if
13. end for
14. Use implications and object anchors to retrieve evidence and propose attributes
15. if no attribute is selected from the pending pool then
16. Visit unswept vector database chunks sequentially, one chunk at a time, and ask the SLM to extract candidate attributes
17. end if
18. Validate, map to labels in $M$, and select $\Delta M_{\tau}$ for the next round
19. if $\Delta M_{\tau} = \varnothing$ then
20. break
21. end if
22. Set $M_{\tau+1} = M_{\tau} \cup \Delta M_{\tau}$ and carry objects forward
23. end for
24. return final context and implication base
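The round structure of Algorithm 1 can be sketched in Python under strong simplifications\. The sketch below is not the paper's implementation: retrieval and the SLM are replaced by a toy ground\-truth table `DOMAIN`, attribute proposal activates one pending label per round, and candidate premises are enumerated up to a fixed size instead of using the lectic\-order enumeration of classical attribute exploration\. All names are hypothetical\.

```python
from itertools import combinations

# Toy "intended domain" standing in for retrieved evidence plus SLM judgments.
DOMAIN = {
    "d1": {"Ataxia", "Tremor", "Dysarthria"},
    "d2": {"Ataxia", "Tremor"},
    "d3": {"Ataxia", "Nystagmus"},
}
CONTROLLED = ["Ataxia", "Tremor", "Dysarthria", "Nystagmus"]  # stand-in for M


def closure(context, active, premise):
    """Active attributes shared by all observed objects having the premise."""
    closed = set(active)
    for attrs in context.values():
        if premise <= attrs:
            closed &= attrs
    return closed


def stub_counterexample(premise, conclusion):
    """Stub oracle: a domain object with the premise but not the conclusion."""
    for name, attrs in DOMAIN.items():
        if premise <= attrs and not conclusion <= attrs:
            return name
    return None


def stub_propose(active):
    """Stub attribute discovery: activate one pending controlled label."""
    pending = [attr for attr in CONTROLLED if attr not in active]
    return set(pending[:1])


def explore(seed, max_rounds=20, max_premise=2):
    context, active, accepted = {}, set(seed), set()
    for _ in range(max_rounds):
        # Step 4: refresh incidences of carried objects over the active set.
        for name in context:
            context[name] = DOMAIN[name] & active
        # Steps 5-13: query each candidate until it is accepted or trivial.
        for size in range(max_premise + 1):
            for combo in combinations(sorted(active), size):
                premise = frozenset(combo)
                while True:
                    conclusion = closure(context, active, premise)
                    if conclusion <= premise:
                        break  # nothing new is claimed
                    counterexample = stub_counterexample(premise, conclusion)
                    if counterexample is None:
                        accepted.add((premise, frozenset(conclusion - premise)))
                        break
                    context[counterexample] = DOMAIN[counterexample] & active
        # Steps 14-22: select new attributes or stop.
        new_attrs = stub_propose(active)
        if not new_attrs:
            break
        active |= new_attrs
    return context, active, accepted


context, active, accepted = explore({"Ataxia", "Tremor"})
```

Because the stub oracle is truthful, every accepted implication in this toy run holds in `DOMAIN`\. With a fallible SLM oracle that guarantee disappears, which is why the actual method logs contradictions and corrections\.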
### 3\.2\. Verification Artifacts
Each round produces artifacts that make the construction process inspectable\. The system records the evolving object–attribute matrix, accepted implications, counterexamples, and logged contradiction/correction events from consistency checks\. It also records final evaluation outputs for disease–phenotype relation metrics and closure\-based implication metrics\. These artifacts are the basis for the paper’s claim of verifiable ontology construction\. The output is therefore not only a final context, but also a trace of why structural commitments were accepted, rejected, or revised\.
## 4\. Experiment
This section evaluates whether the proposed framework can construct and verify a rare ataxia disease–phenotype context from Orphadata\-derived resources\. The setting separates disease\-level retrieval text used by the oracle from curated disease–HPO annotations used only for evaluation\.
The experiments are run for 20 rounds with temperature 0\. The default 10\-seed condition starts from frequent, clinically interpretable attributes covering speech, gait, cerebellar, eye\-movement, pyramidal, and seizure\-related findings\. The controlled exploration attribute set contains the 160 evidence\-accessible HPO attributes described below, and the retrieval index embeds disease\-aligned Orphanet records with EmbeddingGemma\-300M \([https://huggingface\.co/google/embeddinggemma\-300m](https://huggingface.co/google/embeddinggemma-300m)\)\. The same budgets are used across models, with up to ten counterexample candidates per implication test and up to three attribute discovery retrieval queries per round\. Gold disease–HPO relations are never used by the runtime oracle, and generated attributes are aligned to the gold attribute set only after exploration for evaluation\.
Evaluation has two parts\. Predicted object–attribute relations are compared with the run\-specific projection of the evidence\-accessible gold context\. FCA implications induced by the output context are evaluated against the canonical basis of the evidence\-accessible gold context, so implication scores measure closure behavior rather than recovery of all implications in the full curated disease–phenotype annotation context\.
### 4\.1\. Dataset
The framework is evaluated on rare ataxia diseases using Orphadata \([https://www\.orphadata\.com/](https://www.orphadata.com/)\) and HPO \(Robinson et al\., [2008](https://arxiv.org/html/2607.01773#bib.bib21); Köhler et al\., [2021](https://arxiv.org/html/2607.01773#bib.bib22)\)\. Orphadata provides disease identifiers, names, classification records, and disease definition text, while HPO provides phenotype labels and curated disease–phenotype relations\. In the resulting formal context, objects are rare ataxia diseases, attributes are HPO phenotype labels, and a positive cell means that the disease is curated as having that phenotype\.
The disease objects are selected from the Rare ataxia branch of the Orphadata rare neurological disease classification\. From 189 classification nodes, 139 disease candidates are retained, 124 remain after requiring curated HPO annotations, and 122 remain after requiring aligned Orphanet disease definition text\. Thus, every object has curated phenotype annotations for evaluation and disease definition text for retrieval\.
The phenotype attribute set is derived from the curated disease–HPO annotations of these 122 diseases, which contain 850 HPO labels and 2,716 positive relations\. Because the retrieval corpus contains disease definition text rather than the curated annotation table, the evidence\-accessible gold context is constructed by exact keyword matching\. Only HPO labels whose canonical names appear in the disease\-aligned retrieval chunks are retained, yielding 122 disease objects, 160 HPO phenotype attributes, and 1,398 positive relations\. This attribute\-level filtering means that some retained curated relations are not explicitly stated in the corresponding disease definition text and may not be recoverable from retrieval evidence alone\.
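The attribute\-level filtering can be illustrated with a short sketch\. The data below are invented placeholders, the function name is hypothetical, and matching is done case\-insensitively on the pooled text, which is an assumption about how "exact keyword matching" is applied\.

```python
def evidence_accessible_context(gold_annotations, retrieval_chunks):
    """Keep only labels whose name appears somewhere in the retrieval text,
    then project every disease's curated annotations onto those labels."""
    corpus = " ".join(retrieval_chunks.values()).lower()
    all_labels = set().union(*gold_annotations.values())
    kept = {label for label in all_labels if label.lower() in corpus}
    projected = {
        disease: labels & kept for disease, labels in gold_annotations.items()
    }
    return projected, kept


# Invented toy data, not taken from Orphadata or HPO.
gold_annotations = {
    "disease_1": {"Ataxia", "Tremor"},
    "disease_2": {"Ataxia", "Hyperreflexia"},
}
retrieval_chunks = {
    "disease_1": "Progressive ataxia with tremor.",
    "disease_2": "Late-onset gait disturbance.",
}
projected, kept = evidence_accessible_context(gold_annotations, retrieval_chunks)
```

In this toy case `disease_2` keeps the relation to *Ataxia* even though its own text never mentions it\. This is the situation the last sentence above warns about: the label is evidence\-accessible at the attribute level, but that particular relation may not be recoverable from retrieval\.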
The runtime oracle sees only disease definition text, and evaluation uses curated HPO disease–phenotype annotations\.
### 4\.2\. Main Results
The evaluation reports relation scores over disease–phenotype incidences and closure\-based implication scores over predicted implications\. Table [1](https://arxiv.org/html/2607.01773#S4.T1) reports how far each 10\-seed RAG\-grounded run expands the formal context over 20 rounds\. The reported objects are discovered counterexamples rather than a fixed input disease list\. Thus, larger object counts indicate that the exploration loop found more diseases that challenged candidate implications\. In Tables [1](https://arxiv.org/html/2607.01773#S4.T1) and [3](https://arxiv.org/html/2607.01773#S4.T3), Obj\., Attr\., Impl\., Eval\., and Qry\. denote objects, attributes, accepted implications, evaluated implications, and oracle queries\. Eval\. counts accepted implications whose attributes all fall inside the evidence\-accessible gold context\. Accepted implications outside this context are not scored by the gold projection\.
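The Eval\. filter amounts to a subset test on each accepted implication\. The sketch below uses a hypothetical function name and made\-up attribute labels, including one generated label that is assumed to fall outside the gold attribute set\.

```python
def evaluable_implications(accepted, gold_attrs):
    """Keep implications whose premise and conclusion use only gold attributes."""
    return [(a, b) for a, b in accepted if (a | b) <= gold_attrs]


gold_attrs = {"Ataxia", "Tremor", "Dysarthria"}
accepted = [
    (frozenset({"Tremor"}), frozenset({"Ataxia"})),
    # "Slow saccadic drift" is an invented label standing in for a generated
    # attribute that cannot be projected back to the gold attribute set.
    (frozenset({"Ataxia"}), frozenset({"Slow saccadic drift"})),
]
scored = evaluable_implications(accepted, gold_attrs)
```

Only the first implication would be counted under Eval\.; the second is accepted by the run but left unscored\.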
Table 1\. 10\-seed exploration profile\.

The models use these objects and newly selected attributes to move beyond the seed attribute set\. Table [1](https://arxiv.org/html/2607.01773#S4.T1) therefore describes the exploration behavior, not only the final matrix size\. Llama expands most aggressively, reaching 44 objects, 68 attributes, 298 accepted implications, and 444 oracle queries\. Qwen adds almost as many counterexample objects as Llama but keeps a smaller attribute set, while Gemma remains the smallest explored object context\. The small Eval\. counts show that many accepted implications include generated attributes that cannot be projected back to the gold attribute set\. Table [2](https://arxiv.org/html/2607.01773#S4.T2) evaluates the final contexts from the same runs against the evidence\-accessible gold context\. Rel\. denotes relation scores, Impl\. denotes closure\-based implication scores, and bold and underline mark the best and second\-best values within a table\.
Table 2\. 10\-seed RAG\-grounded SLM\-FCA evaluation\.

Table [2](https://arxiv.org/html/2607.01773#S4.T2) shows complementary metric tradeoffs\. Llama has the best relation F1 \(0\.52\) and relation precision \(0\.58\)\. Qwen has the best implication F1 \(0\.30\), while Gemma has the highest relation recall \(0\.87\)\. Overall, the runs construct verifiable partial contexts, while relation agreement and implication quality diverge sharply\.
To examine the effect of a larger seed set, Tables [3](https://arxiv.org/html/2607.01773#S4.T3) and [4](https://arxiv.org/html/2607.01773#S4.T4) report the corresponding 20\-seed runs under the same 20\-round budget\. The 10\-seed runs in Tables [1](https://arxiv.org/html/2607.01773#S4.T1) and [2](https://arxiv.org/html/2607.01773#S4.T2) provide the comparison point\. The 20\-seed setting expands contexts and increases evaluable implications\.
Table 3\. 20\-seed exploration profile\.

The 20\-seed setting increases accepted implications for all three models, from 99 to 172 for Gemma, from 298 to 404 for Llama, and from 263 to 544 for Qwen\. It also increases evaluable implications from 22 to 57 for Gemma, from 43 to 98 for Llama, and from 11 to 67 for Qwen\. Table [4](https://arxiv.org/html/2607.01773#S4.T4) then evaluates whether this larger explored space improves relation and implication quality\.
Table 4\. 20\-seed RAG\-grounded SLM\-FCA evaluation\.

The larger seed set improves implication F1 for all three models\. The largest absolute gain is Gemma, rising from 0\.22 to 0\.36, while Qwen reaches the highest final implication F1 at 0\.41\. The gap between accepted and evaluable implications in Table [3](https://arxiv.org/html/2607.01773#S4.T3) remains large\. Many generated attributes still fall outside the matched gold attribute set\. Thus, the larger seed set improves reach but does not remove the evaluability bottleneck\.
Figure 2\. Round\-level context growth and oracle activity for the RAG\-grounded SLM\-FCA runs and the LLM\-only FCA runs of Experiment 1\. The left panel shows object–attribute expansion across rounds; the right panel shows accepted implications and oracle questions per round\.

For closure\-based implication evaluation, a canonical implication basis is first computed from the evidence\-accessible gold context\. This basis is stored as the reference implication basis\. In the sense of classical FCA, this basis is a complete implication representation of that fixed formal context\. Valid implications of the context are derivable from the basis by closure\. The same closure\-based implication scoring protocol is then applied to every run\. For implication precision, a predicted implication $A \rightarrow B$ is counted as correct when $B$ is contained in the closure of $A$\. This closure is computed under the canonical basis of the evidence\-accessible gold context\. Equivalently, the implication is entailed if every gold disease with the premise phenotypes in $A$ also has the predicted conclusion phenotypes in $B$\. For example, $A \rightarrow \{\text{Dysarthria}\}$ is correct when Dysarthria belongs to the gold closure of $A$, even if that closure contains additional phenotypes\. Thus, a prediction may state only part of the gold closure and still be correct; it is not required that the predicted implication exactly match a stored gold implication string\. For implication recall, each gold basis implication $C \rightarrow D$ is counted as recovered when $D$ is contained in the closure of $C$ under the predicted implication set\. This closure\-based implication protocol is used for the proposed framework and for Experiments 1–3 in the ablation study\.
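A minimal sketch of this scoring protocol is given below\. The function names and the toy implication sets are assumptions, and F1 is taken to be the usual harmonic mean of precision and recall, which the text does not spell out\.

```python
def implication_closure(attrs, implications):
    """Smallest superset of `attrs` closed under the given implications."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            if premise <= closed and not conclusion <= closed:
                closed |= conclusion
                changed = True
    return closed


def closure_scores(predicted, gold_basis):
    """Closure-based implication precision, recall, and F1."""
    correct = sum(
        1 for a, b in predicted if b <= implication_closure(a, gold_basis)
    )
    recovered = sum(
        1 for c, d in gold_basis if d <= implication_closure(c, predicted)
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = recovered / len(gold_basis) if gold_basis else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy implication sets, not taken from the paper's runs.
gold_basis = [
    (frozenset({"Tremor"}), frozenset({"Ataxia"})),
    (frozenset({"Ataxia", "Tremor"}), frozenset({"Dysarthria"})),
]
predicted = [
    (frozenset({"Tremor"}), frozenset({"Dysarthria"})),  # entailed via chaining
    (frozenset({"Ataxia"}), frozenset({"Tremor"})),      # not entailed
]
scores = closure_scores(predicted, gold_basis)
```

The first predicted implication matches no gold implication literally but is still counted as correct, because *Dysarthria* lies in the gold closure of \{*Tremor*\}\. This is the sense in which scoring is closure\-based rather than string\-based\.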
Table [5](https://arxiv.org/html/2607.01773#S4.T5) gives concrete implication examples, both taken from the final 20\-seed RAG\-grounded run outputs\. The table keeps only the judgment type and implication so the examples remain readable\.
Table 5\. Example implication judgments\.

The core FCA output in this seeded setting is an explored implication set\. The table contrasts predicted implications that are entailed by the evidence\-accessible gold context with those that are not\. In the Case column, *Entailed \(TP\)* means that the predicted implication is true under the gold closure\. *Non\-entailed \(FP\)* means that the run predicted the implication, but the gold closure does not license it\. These non\-entailed examples are not necessarily clinically implausible\. They are simply not supported by the evidence\-accessible gold context under closure\-based evaluation\.
The same logs also show how the system makes verification failures inspectable\. During exploration, a failed implication introduces a disease level counterexample rather than silently accepting a plausible rule\. During consistency checking, an existing object can violate an accepted implication\. The system records the violated premise, missing conclusion attributes, and any corrected disease–phenotype relations proposed from the same Orphanet nomenclature evidence bundle\.
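A consistency check of this kind can be sketched as follows\. The record fields \(`object`, `premise`, `missing`\), the function name `find_violations`, and the disease names are hypothetical and chosen for illustration; the paper's actual log schema is not shown in this section\.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Implication = Tuple[FrozenSet[str], FrozenSet[str]]


def find_violations(context: Dict[str, Set[str]],
                    accepted: List[Implication]) -> List[dict]:
    """Return one record per (object, implication) pair that breaks an accepted rule."""
    records = []
    for obj in sorted(context):
        attrs = context[obj]
        for premise, conclusion in accepted:
            if premise <= attrs:
                missing = conclusion - attrs
                if missing:
                    # The object has the premise but lacks part of the conclusion.
                    records.append({
                        "object": obj,
                        "premise": sorted(premise),
                        "missing": sorted(missing),
                    })
    return records


# Toy context (illustrative only): object -> set of phenotype attributes.
context = {
    "Disease A": {"Ataxia", "Dysarthria"},
    "Disease B": {"Ataxia"},
    "Disease C": {"Tremor"},
}
accepted = [(frozenset({"Ataxia"}), frozenset({"Dysarthria"}))]

violations = find_violations(context, accepted)
print(violations)
```

Each record names the object, the premise it satisfies, and the conclusion attributes it lacks, so a reviewer can decide whether the implication or the incidence should be corrected\.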
#### Trace examples\.
In the 20\-seed runs, Gemma2:2B records 21 disease level counterexamples, 57 logged contradictions, and 6 corrections\. Llama3\.2:3B records 55 counterexamples, 132 logged contradictions, and no accepted corrections\. The corresponding Qwen2\.5:1\.5B run records 58 counterexamples, 105 logged contradictions, and 8 corrections\. These traces show that failed implications are not hidden\. They become explicit counterexample objects, violation records, and candidate corrections that can be inspected after the run\. The contrast also explains the metric pattern in Table [4](https://arxiv.org/html/2607.01773#S4.T4)\. Llama expands the context most aggressively, while Qwen produces stronger implication F1 with moderate relation recall and lower precision\.
### 4\.3\.Ablation Study
Tables [6](https://arxiv.org/html/2607.01773#S4.T6)–[8](https://arxiv.org/html/2607.01773#S4.T8) separate three GPT\-based ablations from the local SLM oracle used in the proposed framework\.
Experiment 1\. Retrieval is removed from the exploration loop\. GPT\-4o\-mini or GPT\-4o is used as the oracle\. Counterexample validation, consistency checking, and attribute expansion are retained, with GPT providing attribute proposals\. The runs stop when the exploration loop reaches its termination condition rather than a fixed round limit\. Table [6](https://arxiv.org/html/2607.01773#S4.T6) tests whether a GPT\-based oracle can compensate for removing retrieved disease evidence\.
Table 6\. LLM\-only FCA loop\.
These results show that GPT\-based LLM\-only runs can achieve high implication quality in smaller explored contexts, but they do not replace retrieval driven context growth\. Because the LLM\-only runs terminate early, their high implication F1 is diagnostic rather than evidence that retrieval is unnecessary or directly comparable to larger retrieved contexts\. This separates oracle strength from the evidence supply that helps the loop find counterexamples and attributes\.
Figure [2](https://arxiv.org/html/2607.01773#S4.F2) adds a round level view of the difference between retrieval\-grounded exploration and the LLM\-only ablation\. The left panel shows that the RAG\-grounded runs continue to add objects and attributes across rounds, whereas the LLM\-only runs saturate after fewer rounds\. The right panel shows that oracle questions and accepted implications are not the same signal\. Many oracle calls are used for verification, incidence checks, or counterexample search rather than becoming accepted implications\. Together, these trends show that retrieved disease evidence sustains exploration by supplying new object and attribute evidence, while a GPT\-based oracle without retrieval tends to operate within a narrower explored context\.
Experiment 2\. This experiment uses the disease objects and phenotype attributes from each 10\-seed, 20\-round RAG\-grounded source run in Table [1](https://arxiv.org/html/2607.01773#S4.T1)\. For Gemma2, Llama3\.2, and Qwen2\.5, these discovered objects and attributes are reused as the candidate matrix\. No new diseases or phenotype labels are added\. GPT\-4o\-mini then rejudges every disease–phenotype cell in each source model’s matrix\. The completed matrix is evaluated with the same projected implication protocol\. Thus, Table [7](https://arxiv.org/html/2607.01773#S4.T7) isolates incidence quality after object and attribute discovery has already happened\.
Table 7\. Incidence judgment on discovered object–attribute matrices\.
The rejudged matrices indicate that incidence judgment remains a bottleneck even after the source model has found the candidate objects and attributes\. Relation F1 stays below 0\.50 for all three matrices, with Llama highest and Qwen close behind\. The induced implication scores improve over the original SLM oracle judgments for all three 10\-seed matrices, with the largest gain for Qwen\. Thus, local incidence decisions affect structural quality, but rejudged cell decisions do not uniformly improve every relation metric\.
Experiment 3\. The candidate diseases and phenotype labels are taken from the full evidence accessible gold context\. This fixed object–attribute setting is kept unchanged throughout the experiment\. GPT\-4o\-mini then judges every disease–phenotype pair within that fixed evaluation setting\. The completed incidence table is evaluated for both relation quality and induced implications\. Table [8](https://arxiv.org/html/2607.01773#S4.T8) reports this favorable fixed object–attribute setting over 122 diseases and 160 phenotype labels, yielding 19,520 disease–phenotype pairs\.
Table 8\. Incidence judgment on the fixed gold object–attribute matrices\.
The fixed object–attribute setting shows that even the complete evaluation scope does not make incidence recovery easy\. The relation F1 remains only 0\.31, so many disease–phenotype cells are still hard to judge from the available textual evidence\. This shows that the limitation is not only missing objects or missing phenotype labels\. The evidence text itself can be too indirect for reliable cell level recovery\. Because the discovery problem is removed, the model only has to judge whether each known pair is supported\. At the implication level, the resulting context achieves high precision but lower recall\. This precision–recall gap suggests conservative output: fewer gold implications are recovered, but many accepted ones remain valid\. Thus, the result is a favorable fixed object–attribute baseline rather than evidence that open world exploration has been solved\. The remaining errors show that relation recovery is still difficult even when all candidate diseases and phenotype attributes are already fixed\.
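For reference, relation\-level scoring over a fixed object–attribute grid can be sketched as below\. The disease–phenotype pairs are toy data and `relation_scores` is a hypothetical name; treating relation F1 as the harmonic mean of pair\-level precision and recall is an assumption, since the exact scoring code is not shown in this section\. Only the grid size \(122 diseases, 160 phenotype labels\) comes from the text\.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (disease, phenotype)


def relation_scores(gold: Set[Pair], predicted: Set[Pair]) -> Tuple[float, float, float]:
    """Pair-level precision, recall, and F1 over positive incidences."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Grid size stated for the fixed setting: every cell needs one oracle judgment.
n_diseases, n_phenotypes = 122, 160
n_cells = n_diseases * n_phenotypes

# Toy incidences (illustrative only).
gold = {("D1", "Ataxia"), ("D1", "Dysarthria"), ("D2", "Ataxia"), ("D2", "Tremor")}
predicted = {("D1", "Ataxia"), ("D2", "Ataxia"), ("D2", "Nystagmus")}

precision, recall, f1 = relation_scores(gold, predicted)
print(n_cells, round(precision, 3), round(recall, 3), round(f1, 3))
```

Only positive cells enter the score, so a conservative judge that rejects most of the grid can keep precision high while recall stays low\.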
Cost\. Table [9](https://arxiv.org/html/2607.01773#S4.T9) reports API calls and cost for the GPT\-4o and GPT\-4o\-mini ablations\. The table separates model price from the number of oracle judgments\. Exp1 uses relatively few calls because the explored contexts are small, but its GPT\-4o condition remains expensive because each call has a higher unit price\. Exp3 shows the opposite pattern because it uses GPT\-4o\-mini while evaluating every disease–phenotype pair in the evidence\-accessible gold context\. Thus, local incidence judgments become the main driver of API usage\.
Table 9\. OpenAI API usage\.
The cost table shows that API usage depends on both oracle strength and evaluation scope\. Exp2 calls follow the size of each discovered object–attribute matrix, while Exp3 grows to 19,520 calls because every fixed gold context pair is judged\.
### 4\.4\.Evaluation and Limitations
The results characterize the framework as verifiable partial construction, not full ontology reconstruction\. Its output is an explored implication set rather than a canonical basis of the full controlled phenotype attribute set\. For the completed 10\-seed RAG\-grounded runs, relation F1 ranges from 0\.29 to 0\.52 and closure based implication F1 ranges from 0\.22 to 0\.30\. Experiment 1 reaches 0\.56 relation F1 and 0\.67 implication F1 in a smaller explored context\. These results demonstrate a working verification loop, but not high\-precision ontology completion\. Part of this gap comes from the larger explored object–attribute contexts, where more discovered objects, selected attributes, and incidence decisions create more opportunities for cell level errors\. Because the retrieval corpus uses disease definition text rather than the curated annotation table, many HPO associations are not explicitly verbalized in the raw evidence\. This mismatch especially affects relation recall, since a curated phenotype can be valid even when its exact label is absent from the retrieved disease description\. It also affects implication scores indirectly, because a small number of missed or extra incidences can change several closure judgments\. The appropriate claim is therefore an inspectable construction process whose scores measure recovered incidences and closure behavior in the evidence accessible gold context\.
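The sensitivity of implication judgments to single incidences can be illustrated with a toy context\. The sketch below is not drawn from the paper's data: the diseases, phenotypes, and the helper `holds` are hypothetical\. It checks an implication directly against a context, which is the object\-level reading of entailment used in the evaluation protocol\.

```python
from typing import Dict, List, Set, Tuple

Context = Dict[str, Set[str]]  # object -> attributes


def holds(context: Context, premise: Set[str], conclusion: Set[str]) -> bool:
    """An implication holds when every object with the premise has the conclusion."""
    return all(conclusion <= attrs
               for attrs in context.values()
               if premise <= attrs)


# Toy gold context (illustrative only).
gold_context: Context = {
    "D1": {"Ataxia", "Dysarthria", "Nystagmus"},
    "D2": {"Ataxia", "Dysarthria"},
    "D3": {"Tremor"},
}

# Same context with a single missed incidence: (D1, Dysarthria).
noisy_context: Context = {name: set(attrs) for name, attrs in gold_context.items()}
noisy_context["D1"].discard("Dysarthria")

candidates: List[Tuple[Set[str], Set[str]]] = [
    ({"Ataxia"}, {"Dysarthria"}),
    ({"Nystagmus"}, {"Dysarthria"}),
    ({"Nystagmus"}, {"Ataxia"}),
]

gold_verdicts = [holds(gold_context, p, c) for p, c in candidates]
noisy_verdicts = [holds(noisy_context, p, c) for p, c in candidates]
flipped = sum(g != n for g, n in zip(gold_verdicts, noisy_verdicts))
print(gold_verdicts, noisy_verdicts, flipped)
```

Here one missing cell turns D1 into a counterexample for two of the three candidate implications, which is the kind of amplification described above\.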
## 5\.Discussion
The results clarify why ontology construction should be treated as verified structure building, not unconstrained text generation\. FCA is the structural control mechanism because it derives implications from the current object–attribute context and tests them through counterexamples\. Retrieval supplies disease level evidence, and the SLM oracle provides a lower cost interface for repeated evidence conditioned decisions\. The GPT\-based LLM\-only ablation reaches high evaluation scores in smaller explored contexts, but Figure [2](https://arxiv.org/html/2607.01773#S4.F2) shows that retrieval\-grounded runs keep expanding objects and attributes\. Thus, final evaluation scores and evidence\-driven context growth capture different aspects of the construction process\.
The main limitation is not only the model\. The available disease definition text can also be too sparse for cell level incidence recovery\. Experiment 3 fixes the candidate diseases and phenotype attributes, but relation F1 remains low\. This suggests that some curated HPO relations cannot be recovered from disease definition text alone\. For this reason, the output should be read as an auditable pre curation artifact\. The trace records accepted implications, rejected implications, counterexample objects, contradiction logs, and corrections\. These records show where the construction succeeded or failed\. An expert can inspect them before accepting the resulting object–attribute context as ontology content\. This keeps the claim centered on inspectable construction rather than automatic ontology completion\.
## 6\.Conclusion
This paper presented a RAG\-grounded SLM\-FCA framework for verifiable ontology construction in rare ataxia phenotyping\. The framework starts from seed attributes and expands the context step by step\. FCA proposes structural commitments, retrieval provides disease evidence, and SLMs make repeated local judgments over that evidence\. Experiments on Orphadata\-derived disease records show that the loop can add objects and attributes while recording counterexamples and corrections\. The logs also connect implications to object level evidence, which makes the construction process inspectable\. The ablations show that high scores in a fixed or smaller explored context do not describe the whole open world task\. In open world construction, the system must also discover useful objects and attributes\. The output should therefore be read as an explored implication set over an evidence accessible context\. It is not a finished clinical ontology or a canonical basis for the full controlled phenotype attribute set\. The central takeaway is that FCA can make retrieval\-grounded language model construction inspectable\. Reliable ontology completion will still require stronger retrieval coverage, better synonym and entity matching for HPO labels, improved seed selection, stricter implication acceptance criteria, and expert facing review of logged oracle questions, counterexamples, and corrections\.
## References
- F\. N\. Al\-Aswadi, C\. H\. Yong, and K\. H\. Gan \(2020\) Automatic ontology construction from text: a review from shallow to deep learning trend\. The Artificial Intelligence Review 53\(6\), pp\. 3901–3928\. Cited by: [§1](https://arxiv.org/html/2607.01773#S1.p1.1)\.
- B\. Ganter, S\. Obiedkov, S\. Rudolph, and G\. Stumme \(2016\) Conceptual exploration\. Springer\. Cited by: [§1](https://arxiv.org/html/2607.01773#S1.p1.1), [§2\.1](https://arxiv.org/html/2607.01773#S2.SS1.p2.17)\.
- B\. Ganter and R\. Wille \(1999\) Formal concept analysis\. Vol\. 150, Springer\. Cited by: [§1](https://arxiv.org/html/2607.01773#S1.p1.1), [§2\.1](https://arxiv.org/html/2607.01773#S2.SS1.p1.4)\.
- T\. R\. Gruber \(1993\) A translation approach to portable ontology specifications\. Knowledge acquisition 5\(2\), pp\. 199–220\. Cited by: [§1](https://arxiv.org/html/2607.01773#S1.p1.1)\.
- L\. Huang, W\. Yu, W\. Ma, W\. Zhong, Z\. Feng, H\. Wang, Q\. Chen, W\. Peng, X\. Feng, B\. Qin, et al\. \(2025\) A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions\. ACM Transactions on Information Systems 43\(2\), pp\. 1–55\. Cited by: [§1](https://arxiv.org/html/2607.01773#S1.p1.1), [§2\.2](https://arxiv.org/html/2607.01773#S2.SS2.p1.1)\.
- A\. C\. Khadir, H\. Aliane, and A\. Guessoum \(2021\) Ontology learning: grand tour and challenges\. Computer Science Review 39, pp\. 100339\. Cited by: [§1](https://arxiv.org/html/2607.01773#S1.p1.1)\.
- S\. Köhler, M\. Gargano, N\. Matentzoglu, L\. C\. Carmody, D\. Lewis\-Smith, N\. A\. Vasilevsky, D\. Danis, G\. Balagura, G\. Baynam, A\. M\. Brower, et al\. \(2021\) The human phenotype ontology in 2021\. Nucleic acids research 49\(D1\), pp\. D1207–D1217\. Cited by: [§1](https://arxiv.org/html/2607.01773#S1.p1.1), [§4\.1](https://arxiv.org/html/2607.01773#S4.SS1.p1.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, et al\. \(2020\) Retrieval\-augmented generation for knowledge\-intensive nlp tasks\. Vol\. 33, pp\. 9459–9474\. Cited by: [§2\.2](https://arxiv.org/html/2607.01773#S2.SS2.p1.1)\.
- L\. Li, L\. Sleem, G\. Nichil, et al\. \(2025\) Small language models in the real world: insights from industrial text classification\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 6: Industry Track\), pp\. 971–982\. Cited by: [§2\.3](https://arxiv.org/html/2607.01773#S2.SS3.p1.1)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024\) Lost in the middle: how language models use long contexts\. Transactions of the association for computational linguistics 12, pp\. 157–173\. Cited by: [§3\.1](https://arxiv.org/html/2607.01773#S3.SS1.p5.1)\.
- A\. Lo, A\. Q\. Jiang, W\. Li, and M\. Jamnik \(2024\) End\-to\-end ontology learning with large language models\. Vol\. 37, pp\. 87184–87225\. Cited by: [§1](https://arxiv.org/html/2607.01773#S1.p1.1)\.
- S\. Pan, L\. Luo, Y\. Wang, C\. Chen, J\. Wang, and X\. Wu \(2024\) Unifying large language models and knowledge graphs: a roadmap\. IEEE Transactions on Knowledge and Data Engineering 36\(7\), pp\. 3580–3599\. Cited by: [§1](https://arxiv.org/html/2607.01773#S1.p1.1)\.
- B\. Pecher, I\. Srba, and M\. Bielikova \(2025\) Comparing specialised small and general large language models on text classification: 100 labelled samples to achieve break\-even performance\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 165–184\. Cited by: [§2\.3](https://arxiv.org/html/2607.01773#S2.SS3.p1.1)\.
- O\. Ram, Y\. Levine, I\. Dalmedigos, D\. Muhlgay, A\. Shashua, K\. Leyton\-Brown, and Y\. Shoham \(2023\) In\-context retrieval\-augmented language models\. Transactions of the Association for Computational Linguistics 11, pp\. 1316–1331\. Cited by: [§3\.1](https://arxiv.org/html/2607.01773#S3.SS1.p5.1)\.
- P\. N\. Robinson, S\. Köhler, S\. Bauer, D\. Seelow, D\. Horn, and S\. Mundlos \(2008\) The human phenotype ontology: a tool for annotating and analyzing human hereditary disease\. The American Journal of Human Genetics 83\(5\), pp\. 610–615\. Cited by: [§1](https://arxiv.org/html/2607.01773#S1.p1.1), [§4\.1](https://arxiv.org/html/2607.01773#S4.SS1.p1.1)\.
- K\. Shuster, S\. Poff, M\. Chen, D\. Kiela, and J\. Weston \(2021\) Retrieval augmentation reduces hallucination in conversation\. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp\. 3784–3803\. Cited by: [§2\.2](https://arxiv.org/html/2607.01773#S2.SS2.p1.1)\.
- M\. Uschold and M\. Gruninger \(1996\) Ontologies: principles, methods and applications\. The knowledge engineering review 11\(2\), pp\. 93–136\. Cited by: [§1](https://arxiv.org/html/2607.01773#S1.p1.1)\.
- Z\. Wang, J\. Araki, Z\. Jiang, M\. R\. Parvez, and G\. Neubig \(2023\) Learning to filter context for retrieval\-augmented generation\. arXiv preprint arXiv:2311\.08377\. Cited by: [§3\.1](https://arxiv.org/html/2607.01773#S3.SS1.p5.1)\.