KG2Cypher: Data-Centric Pipeline for Building Enterprise Text-to-Cypher Systems

arXiv cs.CL Papers

Summary

KG2Cypher presents a data-centric pipeline for building enterprise text-to-Cypher systems from existing knowledge graphs. It uses LLMs to generate natural language question-Cypher pairs, validated by an LLM judge and human review, and achieves significant performance improvements on Korean enterprise datasets with LoRA-based fine-tuning.

arXiv:2606.27742v1 Announce Type: new Abstract: Enterprise Knowledge Graphs (KGs) are increasingly used for internal search, analytics, and question answering, but building natural-language interfaces for private enterprise graphs remains costly. We present KG2Cypher, a data-centric pipeline for building enterprise text-to-Cypher systems from existing KGs. KG2Cypher first constructs an executable Cypher query from observed graph facts and then uses LLMs to generate its associated natural-language question. The resulting Text-Cypher pairs are validated with an LLM judge and human validation, and are converted into candidate-aware SFT data. The trained generator is served with class-conditioned schema prompting, entity retrieval, and LoRA-based inference. We evaluate KG2Cypher in Korean enterprise settings, where short search-style queries and schema paraphrases make language grounding difficult. LoRA SFT improves execution-result F1 from 0.806 to 0.950 on broadcast-program queries and from 0.70 to 0.92 on company queries. In an 11-class setting, KG2Cypher achieves 95.2% exact match, 99.9% execution rate, and 0.964 execution-result F1.
Original Article
View Cached Full Text

Cached at: 06/29/26, 05:24 AM

# KG2Cypher: Data-Centric Pipeline for Building Enterprise Text-to-Cypher Systems
Source: [https://arxiv.org/html/2606.27742](https://arxiv.org/html/2606.27742)
Minjun Choi1,†Yerin Kim2,†Junghyuk Seo2Sujin Mo2Hyemin Lee2 Youngjoong Ko1 1Sungkyunkwan University2NAVER \{alswns078, lovekyll0\}@gmail\.com, yjko@skku\.edu \{junghyuk\.seo, sujin\.mo, hmin\.lee\}@navercorp\.com

###### Abstract

Enterprise Knowledge Graphs \(KGs\) are increasingly used for internal search, analytics, and question answering, but building natural\-language interfaces for private enterprise graphs remains costly\. We present KG2Cypher, a data\-centric pipeline for building enterprise text\-to\-Cypher systems from existing KGs\. KG2Cypher first constructs an executable Cypher query from observed graph facts and then uses LLMs to generate its associated natural\-language question\. The resulting Text\-Cypher pairs are validated with an LLM judge and human validation, and are converted into candidate\-aware SFT data\. The trained generator is served with class\-conditioned schema prompting, entity retrieval, and LoRA\-based inference\. We evaluate KG2Cypher in Korean enterprise settings, where short search\-style queries and schema paraphrases make language grounding difficult\. LoRA SFT improves execution\-result F1 from 0\.806 to 0\.950 on broadcast\-program queries and from 0\.70 to 0\.92 on company queries\. In an 11\-class setting, KG2Cypher achieves 95\.2% exact match, 99\.9% execution rate, and 0\.964 execution\-result F1\.

KG2Cypher: Data\-Centric Pipeline for Building Enterprise Text\-to\-Cypher Systems

Minjun Choi1,†Yerin Kim2,†Junghyuk Seo2Sujin Mo2Hyemin Lee2Youngjoong Ko1††thanks:Corresponding author\.1Sungkyunkwan University2NAVER\{alswns078, lovekyll0\}@gmail\.com, yjko@skku\.edu\{junghyuk\.seo, sujin\.mo, hmin\.lee\}@navercorp\.com

$\\dagger$$\\dagger$footnotetext:This work was done while Minjun Choi and Yerin Kim were Research Interns at NAVER\.## 1Introduction

Enterprise Knowledge Graphs \(KGs\) store structured business knowledge for internal search, analytics, question answering, and so on\. For example, a media KG may connect a program to its broadcaster, genre, cast, and episode count and a company KG may connect an organization to its industry, founders, listing exchange, and financial attributes\. These graphs are very useful, but most users do not know which node types and relation types exist, how to write Cypher queries \(hereafter, Cypher\), or which internal IDs identify the entities in the graph\. This creates a need for a natural\-language interface that lets users outside expert teams use enterprise KGs as well\.

Table 1:Illustrative Korean text\-to\-Cypher examples\. English translations are shown in parentheses\. Cypher outputs are anonymized because the underlying enterprise KG and entity identifiers are private\.Text\-to\-Cypher transformation is a constrained structured\-query generation task\. A generated query has to choose valid schema relations, construct graph patterns, bind entity URIs, use the correct literal sub\-fields, and execute against a live graph database\. Table[1](https://arxiv.org/html/2606.27742#S1.T1)illustrates these requirements with broadcast and company queries from our Korean enterprise settings\. These examples show how user expressions must be mapped to entity URIs, schema relations, literal conditions, and executable Cypher\. Users may write short search\-style phrases, omit arguments, vary spacing, mix Korean text with transliterated or foreign names, and use Korean paraphrases that do not match schema relation names\.

These constraints make data construction a central challenge\. The most straightforward solution is manual annotation, but this lacks scalability in an enterprise knowledge graph \(KG\) environment because each domain has its own node types, relation names, entity identifiers, and literal conventions\. For instance, a dataset for broadcast programs does not cover companies, sports teams, or festivals\. As a result, the manual construction of natural\-language and Cypher pairs requires a separate annotation task for each new domain\. Furthermore, in\-context learning with a strong LLM can be another solution\. However, a prompt\-only gpt\-oss\-120B model often produced syntactically executable Cypher, but it still returned wrong graph results because it selected the wrong relation, hallucinated an entity identifier, or used the wrong literal format in our experiments\.

Our key idea is simple; an enterprise that already has a KG should be able to reuse the KG itself as a supervision source\. We present KG2Cypher, a data\-centric industry pipeline that implements this idea for building enterprise text\-to\-Cypher systems\. In the data construction stage, KG2Cypher samples relation patterns that appear in the graph, executes those patterns to obtain real subgraphs, and builds executable Cypher using the returned entities and literals\. LLMs are then used only for language\-side operations including paraphrasing, compressed query generation, and quality judging\. This automates most of the symbol manipulation, and it can reduce the need to manually construct Text\-Cypher pairs from scratch\. As a result, human efforts are focused more on verification and revision than on initial data creation\.

During the training and serving stages, KG2Cypher converts verified pairs into candidate\-aware SFT examples\. The prompt contains the question, candidate relations from the class schema, and entity candidates with retrieval distractors, so the model learns to select the needed relations and URIs\. A LoRA adapter is then trained and served with the same prompt structure in a production\-oriented inference pipeline\.

We evaluate KG2Cypher on proprietary Korean enterprise KG domains including broadcast and company settings\. Although our experiments use Korean queries, the pipeline is not designed only for Korean because relation sampling, subgraph fetching, and canonical Cypher construction operate only on graph structure and graph values\. To apply KG2Cypher to another enterprise language, the language\-facing components need adaptation including question diversification prompts, judge prompts, and the domain classifier\. Overall, the experiments show that execution validity alone is not sufficient, because prompt\-only models can run but return wrong results\. KG\-grounded SFT improves enterprise\-specific grounding, and class\-conditioned schema prompting avoids relation\-first retrieval in our service setting\.

![Refer to caption](https://arxiv.org/html/2606.27742v1/figure/framework_figure_edit.png)Figure 1:Overview of KG2Cypher\. Left: KG\-grounded data construction from graph facts to validated Text\-Cypher pairs\. Right: candidate\-aware SFT and class\-conditioned serving\. Examples are translated, and Cypher is anonymized\.
## 2Related Work

#### Structured query generation\.

Natural\-language interfaces to databases have long been studied as structured query generation\. Text\-to\-SQL benchmarks such as WikiSQL and Spider define tasks that map user questions to executable SQL queries\(Zhonget al\.,[2017](https://arxiv.org/html/2606.27742#bib.bib1); Yuet al\.,[2018](https://arxiv.org/html/2606.27742#bib.bib2)\)\. Later methods study schema linking, constrained decoding, and LLM prompting for more reliable query generation\(Wanget al\.,[2020](https://arxiv.org/html/2606.27742#bib.bib4); Scholaket al\.,[2021](https://arxiv.org/html/2606.27742#bib.bib3); Gaoet al\.,[2024](https://arxiv.org/html/2606.27742#bib.bib5)\)\. KGQA datasets such as WebQuestionsSP, LC\-QuAD, and GrailQA also map natural\-language questions to logical forms or graph queries\(Yihet al\.,[2016](https://arxiv.org/html/2606.27742#bib.bib7); Dubeyet al\.,[2019](https://arxiv.org/html/2606.27742#bib.bib8); Guet al\.,[2021](https://arxiv.org/html/2606.27742#bib.bib9)\)\. These works establish evaluation practices for executable structured queries\. KG2Cypher follows this tradition, but the target is Cypher over a property graph\. The system must also bind private entity URIs and use enterprise\-specific relation names\.

#### Text\-to\-Cypher and enterprise graph settings\.

Cypher is a property\-graph query language for expressive graph pattern matching in industrial graph databases\(Franciset al\.,[2018](https://arxiv.org/html/2606.27742#bib.bib10)\)\. Recent Text\-to\-Cypher work addresses the lack of public data and evaluation resources\. The Neo4j Text2Cypher dataset combines public examples into a large benchmark\(Ozsoyet al\.,[2024](https://arxiv.org/html/2606.27742#bib.bib17)\)\. Auto\-Cypher/SynthCypher uses LLM\-supervised generation and verification for synthetic Cypher data\(Tiwariet al\.,[2025](https://arxiv.org/html/2606.27742#bib.bib16)\)\. Mind the Query emphasizes execution\-grounded benchmarking with graph databases and validation checks\(Chauhanet al\.,[2025](https://arxiv.org/html/2606.27742#bib.bib18)\)\. Recent multilingual Text\-to\-Cypher work also reports performance gaps across languages\(Ozsoy and Tai,[2025](https://arxiv.org/html/2606.27742#bib.bib19)\)\. These studies make Text\-to\-Cypher more measurable on public resources\. KG2Cypher addresses a different industry problem\. It constructs data, trains a model, and supports deployment for private enterprise KGs whose data, identifiers, schemas, and retrieval APIs cannot be released\.

## 3Methodology

### 3\.1Task Formulation

Given a natural\-language questionqq, relation candidatesRR, and entity candidatesEE, the generator modelfθf\_\{\\theta\}produces an executable Cypher queryyy:

y=fθ​\(q,R,E\)\.y=f\_\{\\theta\}\(q,R,E\)\.\(1\)This formulation aligns supervised fine\-tuning \(SFT\) with deployment\. The generator must select valid schema elements from retrieved candidates rather than generate them from scratch\. Each relation candidater∈Rr\\in Rcontains subject and object classes, a predicate identifier, and linguistic hints\. Each entity candidatee∈Ee\\in Econtains an internal URI, display name, and class label\.

### 3\.2System Overview

Figure[1](https://arxiv.org/html/2606.27742#S1.F1)outlines the full workflow of KG2Cypher: KG\-grounded data construction, candidate\-aware SFT, and class\-conditioned serving\. The key design choice is to separate symbolic query construction from language generation\. KG2Cypher builds Cypher targets from graph values with deterministic code and uses LLMs for paraphrasing and validation\. This reduces failures such as nonexistent relations, hallucinated entity identifiers, and literal conditions that do not execute\.

### 3\.3Predicate Collection and Filtering

Instead of relying on static schema specifications, the pipeline inspects graph instances to collect Subject\-Predicate\-Object \(SPO\) patterns that connect subject and object nodes\. For the broadcast\-program class, this step identifies active relations with predicate identifiers such as “broadcast\_by”, “genre”, and “number\_of\_episodes” and it also records whether each object is an entity or a literal\. This ensures that subsequent queries are based on the observed graph facts\. Rule\-based filtering removes metadata and non\-searchable attributes such as geocoordinates, media URLs, social\-media IDs, and system fields\. Objects with the same relation identifiers are merged if their query semantics are equivalent\.

### 3\.4Skeleton Sampling and Subgraph Fetching

The filtered relations are combined into multi\-condition query skeletons with the bucket distributions 40/30/20/10 for one\-, two\-, three\-, and four\-relation structures, respectively\. Skeletons are discarded if a domain lacks sufficient relations, and a limit on the number of attempts prevents redundant sampling\. Each skeleton is validated against the Memgraph graph database via aLIMIT 1query\. For each valid skeleton, the pipeline samples up to 50 matching subgraphs\. This limit prevents high\-frequency graph patterns from biasing the dataset and collects real entity URIs, literals, and relation attributes for grounded data generation\.

### 3\.5Canonical Cypher Construction

KG2Cypher deterministically constructs a canonical targetCgoldC\_\{\\text\{gold\}\}for each subgraph\. Entity nodes are bound by unique graph identifiers in theWHEREclause, and literals are mapped to schema attributes with valid comparison operatorsθ∈\{=,\>,<,≥,≤\}\\theta\\in\\\{=,\>,<,\\geq,\\leq\\\}\. This stage also creates an analyzed intermediate formNLanalyzed\\text\{NL\}\_\{\\text\{analyzed\}\}and a template\-derived naive statementNLnaive\\text\{NL\}\_\{\\text\{naive\}\}alongsideCgoldC\_\{\\text\{gold\}\}\. These synchronized views expose the same query semantics and keep the next LLM rewriting step anchored to verified graph structure and literal constraints\. Appendix[H](https://arxiv.org/html/2606.27742#A8)gives a concrete example of these representations\.

### 3\.6LLM\-Based Language Diversification

This stage uses the synchronized representations \(CgoldC\_\{\\text\{gold\}\},NLanalyzed\\text\{NL\}\_\{\\text\{analyzed\}\},NLnaive\\text\{NL\}\_\{\\text\{naive\}\}\), ontology constraints, and target\-language synonym maps as input to gpt\-oss\-120B\. KG2Cypher keeps the symbolic Cypher target fixed and uses the LLM only to rewrite the language side\. This design reduces unsupported graph structures and hallucinated literal constraints\.

This language\-side expansion is related to self\-instruction\(Wanget al\.,[2023](https://arxiv.org/html/2606.27742#bib.bib11)\), but KG2Cypher fixes the symbolic Cypher target before rewriting\. The LLM generates three types of questions: a term\-preserving question, five paraphrases, and compressed search\-style queries for shallow skeletons \(≤3\\leq 3joins\)\. For numeric and date relations, deterministic checks verify that units and comparison words match the Cypher condition, such as mapping “at least” to≥\\geq\.

### 3\.7LLM Judge and Human Validation

To detect semantic drift, KG2Cypher uses gpt\-oss\-120B to score each instance on a 0/1/2 scale across faithfulness toCgoldC\_\{\\text\{gold\}\}, target\-language fluency, and completeness of schema constraints\. The scale is a simple ordinal rubric used to align LLM scores with human validation labels\. Instances that pass all dimensions receive apassstatus\. Imperfect rows are marked asneeds\_reviewand routed to a human validation interface for verification \(keep\) or correction \(edit\)\.

The judge is calibrated based on human\-assigned scores and brief comments that explain the reason for score deductions from 200 sampled instances\. KG2Cypher automatically revises the scoring guide prompt with gpt\-oss\-120B using these comments\. Because the validated synthetic data has high\-score skew and low variance, we use Mean Absolute Error \(MAE\), adjacent agreement, and deduction catch rate instead of variance\-dependent metrics, in line with concerns from LLM\-as\-a\-judge work\(Zhenget al\.,[2023](https://arxiv.org/html/2606.27742#bib.bib20)\)\. Appendix[A](https://arxiv.org/html/2606.27742#A1)shows why no single metric is sufficient\.

### 3\.8Candidate\-Aware SFT Construction

After the construction and validation stages, KG2Cypher converts validated Text\-Cypher pairs into instruction\-following SFT examples\. Each input contains the question, candidate relations, and candidate entities, and the output is the gold Cypher\. Candidate relations include subject and object classes, predicate identifiers, and linguistic hints\. Candidate entities include URI, name, class, and retrieval distractors from the inference\-time entity API\. This matches inference\-time prompts, exposes the model to retrieval noise, and forces it to select the relations and entity URIs needed for the question\. Appendix[I](https://arxiv.org/html/2606.27742#A9)gives a full anonymized example\.

### 3\.9Class\-Conditioned Schema Prompting

Prior KBQA systems often retrieve candidate relations before logical\-form generation\. For example, SG\-KBQA ranks question\-relation pairs with a BERT\-based cross\-encoder\(Gaoet al\.,[2025](https://arxiv.org/html/2606.27742#bib.bib6)\)\. This relation\-first design is effective in benchmarks, but it is difficult to use as\-is in a low\-latency enterprise service\.

In practice, relation retrieval is not reliable enough in our low\-latency enterprise service setting\. If the gold relation is absent from a prompt, the generator usually cannot recover\. In contrast, entity retrieval is reliable enough to provide URI candidates\. Thus KG2Cypher uses class\-conditioned schema prompting\. A domain classifier first selects the target graph class and then the prompt includes the full relation schema for that class, while entity candidates still come from the entity API\.

The KLUE\-BERT\-base model\(Parket al\.,[2021](https://arxiv.org/html/2606.27742#bib.bib12)\)is trained for 11\-way single\-label classification as a domain classifier\. This design shifts retrieval from relation\-level ranking to class routing and entity URI binding, and it keeps the prompt bounded by the predicted class schema\.

### 3\.10Training and Serving

The generator is based on the Llama\-3\.1\-8B\-Instruct model\(Grattafioriet al\.,[2024](https://arxiv.org/html/2606.27742#bib.bib14)\)\. We adapt it with LoRA\(Huet al\.,[2021](https://arxiv.org/html/2606.27742#bib.bib13)\)\. The final adapter trains approximately 41M parameters, about 0\.52% of the base model\. Prompt tokens are masked, and loss is computed only on the assistant Cypher response\. Appendix[G](https://arxiv.org/html/2606.27742#A7)provides the full training configuration\.

At inference time, the system classifies the domain, loads the class schema, performs NER, retrieves entity candidates, builds the prompt, and generates Cypher through vLLM\(Kwonet al\.,[2023](https://arxiv.org/html/2606.27742#bib.bib15)\)\. The backbone stays loaded in vLLM; NER uses it without LoRA, and Cypher generation attaches the LoRA adapter at request time to avoid separate large\-model endpoints\. For entity retrieval, KG2Cypher merges two API result sets, one from the full question with detected entities and the other from detected entities alone\.

## 4Experiments

### 4\.1Data and Metrics

We evaluate KG2Cypher with internal enterprise KG domains\. The final class\-conditioned setting uses 11 graph classes with 20,745/2,590/2,621 train/dev/test examples; Appendix[D](https://arxiv.org/html/2606.27742#A4)lists the classes\. We also report broadcast and company diagnostics for prompting, retrieval, SFT, and transfer\. We report EM, execution rate, and execution\-result F1\. EM is exact string match; execution rate checks whether Cypher runs on Memgraph; and execution\-result F1 compares predicted and gold answer sets\. We use F1 as the main user\-facing metric because executable Cypher can return wrong results\. Appendix[B](https://arxiv.org/html/2606.27742#A2)gives an anonymized scoring example\.

### 4\.2LLM Judge Calibration

Before using the LLM judge to validate generated Text\-Cypher pairs, we calibrate its scoring prompt on 200 human\-annotated samples\. The judge scores faithfulness, fluency, and completeness, and calibration checks agreement with human labels\. We mainly track MAE, where lower values mean closer agreement with human scores\. The initial prompt passes fluency and completeness but fails faithfulness with MAE 0\.314, so it is not reliable enough for automatic quality gating\. Appendix[A](https://arxiv.org/html/2606.27742#A1)reports the full initial scores and explains the additional agreement checks\.

KG2Cypher then automatically revises the scoring guide prompt with gpt\-oss\-120B using the human disagreement comments\. As shown in Figure[2](https://arxiv.org/html/2606.27742#S4.F2), the guide grows from 270 to 325 lines and then stabilizes\. Faithfulness MAE decreases from 0\.314 to 0\.251 after three iterations, passing the 0\.27 target\. The calibrated judge accepts high\-confidence pairs and routes lower\-confidence pairs to human validation\.

![Refer to caption](https://arxiv.org/html/2606.27742v1/x1.png)Figure 2:LLM judge calibration\. Human comments guide automatic scoring\-guide revision, and faithfulness MAE falls below the 0\.27 target\. Lower MAE is better\.
### 4\.3Prompt\-Only Diagnostics

We first test whether a strong prompt\-only LLM can generate enterprise Cypher without task\-specific training\. We use gpt\-oss\-120B and compare four settings: rules only, rules with five few\-shot examples, rules with few\-shot examples plus the full broadcast schema, and an oracle prompt with gold relations and gold entities\.

Table 2:Broadcast\-domain prompt\-only diagnostics\. EM and execution are percentages; oracle uses gold relation/entity candidates\.Table[2](https://arxiv.org/html/2606.27742#S4.T2)shows that rules and few\-shot examples are not enough\. The prompts produce executable Cypher, but EM and execution\-result F1 remain near zero, which indicates that the model mostly learns runnable query forms rather than correct graph grounding\. Adding the full broadcast schema helps only slightly because relation selection remains unresolved\. Oracle gold relation/entity candidates sharply improve performance, which shows that candidate selection is critical\. Even with these oracle candidates, prompt\-only generation reaches only 47\.7% EM and 0\.7139 F1\.

We observe three recurring failure modes: hallucinated private entity URIs, plausible but wrong relations, and incorrect literal sub\-fields such asvalue\_numberandvalue\_date\.month\. These failures motivate retrieval\-aware prompting and SFT\. Appendices[C](https://arxiv.org/html/2606.27742#A3)and[E](https://arxiv.org/html/2606.27742#A5)give examples and query\-complexity results\.

### 4\.4Retrieval Diagnostics

Because Table[2](https://arxiv.org/html/2606.27742#S4.T2)shows that candidate selection is critical, we next evaluate whether available service APIs can retrieve those candidates\. Table[3](https://arxiv.org/html/2606.27742#S4.T3)reports candidate retrieval diagnostics\. Entity retrieval is evaluated on searchable entity mentions because literal\-only conditions do not require lookup, and it reaches 95\.0% recall@20\. In contrast, the deployed relation API reaches 40\.6% recall@20; query rewriting with five variants raises this to 49\.8%, still too low when generation requires the gold relation\.

Table 3:Retrieval diagnostics\. Entity retrieval is reliable, but relation retrieval is the bottleneck\.This result explains why KG2Cypher does not rely on relation\-first retrieval at inference time\. In knowledge\-base QA, relation\-first methods can use heavy cross\-encoders to rank question\-relation pairs, but our enterprise service setting requires low\-latency retrieval\. Therefore, KG2Cypher predicts the target class and provides the full relation schema for that class\.

### 4\.5LoRA SFT and Class\-Conditioned Results

Following the prompt\-only and retrieval diagnostics, Table[4](https://arxiv.org/html/2606.27742#S4.T4)reports the main generation results: LoRA SFT on broadcast and company, and the final 11\-class class\-conditioned setting\. Prompt\-only baselines receive few\-shot examples and gold relation/entity candidates\. For company, we also test out\-of\-domain transfer by applying a broadcast\-trained adapter to company queries\.

Table 4:Main generation results\. EM and execution are percentages; Prompt \+ gold uses gold relation/entity candidates\.LoRA SFT substantially improves execution\-result F1 in both broadcast and company domains\. The broadcast\-trained adapter \(LoRA OOD\) outperforms prompt\-only generation on company queries, but it remains below company in\-domain SFT\. This shows that Cypher conventions transfer across domains, but class\-specific grounding differs enough to motivate training on all target classes and class\-conditioned serving\. Appendix[F](https://arxiv.org/html/2606.27742#A6)reports a controlled broadcast ablation with distractor relations\.

For the final KG2Cypher setting, a KLUE\-BERT\-base classifier chooses one of 11 graph classes, and the generator receives the schema of that class\. The classifier reaches 99\.66% accuracy\. The final row of Table[4](https://arxiv.org/html/2606.27742#S4.T4)shows 95\.2% EM, 99\.9% execution rate, and 0\.964 execution\-result F1\. This result supports the class\-conditioned design; it avoids low\-recall relation retrieval before generation while keeping the prompt bounded by the predicted class\. Appendix[D](https://arxiv.org/html/2606.27742#A4)lists the 11 graph classes\.

Overall, the experiments support three conclusions\. First, execution validity alone is not sufficient because prompt\-only models can produce executable Cypher that returns wrong graph results\. Second, KG\-grounded SFT teaches enterprise\-specific conventions, including URI binding, relation selection, and literal sub\-field use\. Third, class\-conditioned schema prompting is more practical than relation\-first retrieval in our service setting\.

## 5Conclusion

We presented KG2Cypher, a data\-centric pipeline that reuses enterprise KGs for Text\-to\-Cypher data construction, model training, and serving\. Its main advantages are that it turns observed graph facts into executable Cypher targets and uses LLMs only for language generation and validation, which reduces manual pair authoring for private enterprise schemas\. KG2Cypher then trains a candidate\-aware LoRA generator for production\-style prompts\. Experiments show strong gains over prompt\-only generation and demonstrate that class\-conditioned schema prompting avoids a major relation\-retrieval bottleneck\.

## 6Limitations

The KG, entity APIs, generated data, and model checkpoints are proprietary, so we report aggregate statistics and anonymized examples rather than released artifacts\. The pipeline also inherits KG coverage limits: missing relations, entities, or values cannot produce supervision\. Although the symbolic stages are language\-agnostic, our current instantiation includes language\-specific prompts, judge calibration data, and classification modules because the enterprise data are Korean\. Applying the system to other languages would require adapting these language\-facing components\.

Deployment still depends on domain classification, NER, entity retrieval, and entity disambiguation\. Outside entities and homonymous entities remain challenging cases\. A practical extension is to treat frequent outside values as name\-value fields, but scaling this solution across all classes requires further engineering\. We evaluate the generator with Llama\-3\.1\-8B\-Instruct, and further experiments are needed to assess transfer to other base models\.

## Ethical Considerations

The system is intended for enterprise KG search and analytics by authorized users\. Because internal KGs may contain proprietary or sensitive business information, generated data, execution logs, and model outputs must follow organizational access\-control and data\-governance policies\. Human validation is used to reduce semantic drift in generated questions\. Examples in this paper are anonymized to avoid exposing private identifiers, and Korean examples are translated for readability\.

## References

- V\. Chauhan, S\. Raj, S\. Mujumdar, A\. Saha, and A\. Jain \(2025\)Mind the query: a benchmark dataset towards Text2Cypher task\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track,Suzhou, China,pp\. 1890–1905\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-industry.133),[Link](https://aclanthology.org/2025.emnlp-industry.133/)Cited by:[§2](https://arxiv.org/html/2606.27742#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Dubey, D\. Banerjee, A\. Abdelkawi, and J\. Lehmann \(2019\)LC\-quad 2\.0: a large dataset for complex question answering over wikidata and dbpedia\.InThe Semantic Web – ISWC 2019,pp\. 69–78\.External Links:[Document](https://dx.doi.org/10.1007/978-3-030-30796-7%5F5),[Link](https://doi.org/10.1007/978-3-030-30796-7_5)Cited by:[§2](https://arxiv.org/html/2606.27742#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Francis, A\. Green, P\. Guagliardo, L\. Libkin, T\. Lindaaker, V\. Marsault, S\. Plantikow, M\. Rydberg, P\. Selmer, and A\. Taylor \(2018\)Cypher: an evolving query language for property graphs\.InProceedings of the 2018 International Conference on Management of Data,pp\. 1433–1445\.External Links:[Document](https://dx.doi.org/10.1145/3183713.3190657),[Link](https://doi.org/10.1145/3183713.3190657)Cited by:[§2](https://arxiv.org/html/2606.27742#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Gao, H\. Wang, Y\. Li, X\. Sun, Y\. Qian, B\. Ding, and J\. Zhou \(2024\)Text\-to\-sql empowered by large language models: a benchmark evaluation\.Proceedings of the VLDB Endowment17\(5\),pp\. 1132–1145\.External Links:[Document](https://dx.doi.org/10.14778/3641204.3641221),[Link](https://www.vldb.org/pvldb/vol17/p1132-gao.pdf)Cited by:[§2](https://arxiv.org/html/2606.27742#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Gao, J\. H\. Lau, and J\. Qi \(2025\)Beyond seen data: improving KBQA generalization through schema\-guided logical form generation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 8753–8772\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.442),[Link](https://aclanthology.org/2025.emnlp-main.442/)Cited by:[§3\.9](https://arxiv.org/html/2606.27742#S3.SS9.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[§3\.10](https://arxiv.org/html/2606.27742#S3.SS10.p1.1)\.
- Y\. Gu, S\. Kase, M\. Vanni, B\. Sadler, P\. Liang, X\. Yan, and Y\. Su \(2021\)Beyond i\.i\.d\.: three levels of generalization for question answering on knowledge bases\.InProceedings of the Web Conference 2021,WWW ’21,pp\. 3477–3488\.External Links:[Link](http://dx.doi.org/10.1145/3442381.3449992),[Document](https://dx.doi.org/10.1145/3442381.3449992)Cited by:[§2](https://arxiv.org/html/2606.27742#S2.SS0.SSS0.Px1.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2021\)LoRA: low\-rank adaptation of large language models\.External Links:2106\.09685,[Link](https://arxiv.org/abs/2106.09685)Cited by:[§3\.10](https://arxiv.org/html/2606.27742#S3.SS10.p1.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles,External Links:[Document](https://dx.doi.org/10.1145/3600006.3613165),[Link](https://doi.org/10.1145/3600006.3613165)Cited by:[§3\.10](https://arxiv.org/html/2606.27742#S3.SS10.p2.1)\.
- M\. G\. Ozsoy, L\. Messallem, J\. Besga, and G\. Minneci \(2024\)Text2Cypher: bridging natural language and graph databases\.External Links:2412\.10064,[Link](https://arxiv.org/abs/2412.10064)Cited by:[§2](https://arxiv.org/html/2606.27742#S2.SS0.SSS0.Px2.p1.1)\.
- M\. G\. Ozsoy and W\. Tai \(2025\)Text2Cypher across languages: evaluating and finetuning llms\.External Links:2506\.21445,[Link](https://arxiv.org/abs/2506.21445)Cited by:[§2](https://arxiv.org/html/2606.27742#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Park, J\. Moon, S\. Kim, W\. I\. Cho, J\. Han, J\. Park, C\. Song, J\. Kim, Y\. Song, T\. Oh, J\. Lee, J\. Oh, S\. Lyu, Y\. Jeong, I\. Lee, S\. Seo, D\. Lee, H\. Kim, M\. Lee, S\. Jang, S\. Do, S\. Kim, K\. Lim, J\. Lee, K\. Park, J\. Shin, S\. Kim, L\. Park, A\. Oh, J\. Ha, and K\. Cho \(2021\)KLUE: korean language understanding evaluation\.External Links:2105\.09680,[Link](https://arxiv.org/abs/2105.09680)Cited by:[§3\.9](https://arxiv.org/html/2606.27742#S3.SS9.p3.1)\.
- T\. Scholak, N\. Schucher, and D\. Bahdanau \(2021\)PICARD: parsing incrementally for constrained auto\-regressive decoding from language models\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,External Links:[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.779),[Link](https://aclanthology.org/2021.emnlp-main.779/)Cited by:[§2](https://arxiv.org/html/2606.27742#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Tiwari, S\. K\. R\. Malay, V\. Yadav, M\. Hashemi, and S\. T\. Madhusudhan \(2025\)Auto\-cypher: improving llms on cypher generation via llm\-supervised generation\-verification framework\.External Links:2412\.12612,[Link](https://arxiv.org/abs/2412.12612)Cited by:[§2](https://arxiv.org/html/2606.27742#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Wang, R\. Shin, X\. Liu, O\. Polozov, and M\. Richardson \(2020\)RAT\-sql: relation\-aware schema encoding and linking for text\-to\-sql parsers\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,pp\. 7567–7578\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.677),[Link](https://aclanthology.org/2020.acl-main.677/)Cited by:[§2](https://arxiv.org/html/2606.27742#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Wang, Y\. Kordi, S\. Mishra, A\. Liu, N\. A\. Smith, D\. Khashabi, and H\. Hajishirzi \(2023\)Self\-instruct: aligning language models with self\-generated instructions\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics,External Links:[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.754),[Link](https://aclanthology.org/2023.acl-long.754/)Cited by:[§3\.6](https://arxiv.org/html/2606.27742#S3.SS6.p2.2)\.
- W\. Yih, M\. Richardson, C\. Meek, M\. Chang, and J\. Suh \(2016\)The value of semantic parse labeling for knowledge base question answering\.InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics,External Links:[Document](https://dx.doi.org/10.18653/v1/P16-2033),[Link](https://aclanthology.org/P16-2033/)Cited by:[§2](https://arxiv.org/html/2606.27742#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Yu, R\. Zhang, K\. Yang, M\. Yasunaga, D\. Wang, Z\. Li, J\. Ma, I\. Li, Q\. Yao, S\. Roman, Z\. Zhang, and D\. Radev \(2018\)Spider: a large\-scale human\-labeled dataset for complex and cross\-domain semantic parsing and text\-to\-sql task\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,External Links:[Document](https://dx.doi.org/10.18653/v1/D18-1425),[Link](https://aclanthology.org/D18-1425/)Cited by:[§2](https://arxiv.org/html/2606.27742#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.External Links:2306\.05685,[Link](https://arxiv.org/abs/2306.05685)Cited by:[§3\.7](https://arxiv.org/html/2606.27742#S3.SS7.p2.1)\.
- V\. Zhong, C\. Xiong, and R\. Socher \(2017\)Seq2SQL: generating structured queries from natural language using reinforcement learning\.InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,External Links:[Document](https://dx.doi.org/10.18653/v1/D17-1088),[Link](https://aclanthology.org/D17-1088/)Cited by:[§2](https://arxiv.org/html/2606.27742#S2.SS0.SSS0.Px1.p1.1)\.

## Appendix ALLM Judge Metric Validation

We use synthetic failure simulations to check why the calibration gate uses MAE, adjacent agreement, and catch rate together\. Each simulation contains 200 virtual samples and represents a common judge failure pattern\.

Single metrics miss different errors\. In aSystemic Bias Scenario, the judge assigns every score one point too low\. Adjacent agreement can still be 100%, but MAE detects the bias\. In aRandom\-Guessing Scenario, the judge may catch some deducted cases by chance, but MAE and adjacent agreement fail\. These examples show why KG2Cypher requires the judge to satisfy all three criteria before the prompt is accepted for data validation\.

Table[5](https://arxiv.org/html/2606.27742#A1.T5)reports the initial judge scores before calibration\.

Table 5:Initial baseline evaluator performance on 200 human\-annotated samples before prompt calibration\.
## Appendix BEvaluation Metric Example

Execution\-result F1 compares the answer set returned by the predicted Cypher with the answer set returned by the gold Cypher\. For example, assume the gold query returns four programs:

\{Program A,Program B,Program C,Program D\}\\\{\\text\{Program A\},\\text\{Program B\},\\text\{Program C\},\\text\{Program D\}\\\}and the predicted query returns three programs:

\{Program A,Program B,Program E\}\.\\\{\\text\{Program A\},\\text\{Program B\},\\text\{Program E\}\\\}\.The true positives are Program A and Program B\. Precision is2/32/3, recall is2/42/4, and F1 is the harmonic mean of precision and recall\.

## Appendix CPrompting Failure Modes

The prompt\-only diagnostics reveal three common failure modes\. First, the model may hallucinate private entity identifiers because the identifiers are not inferable from surface text alone\. Second, the model may choose a plausible but wrong relation even when the schema is present\. For example, a query about an ordered broadcast episode can be confused with an episode\-count relation\. Third, literal nodes require schema\-specific sub\-fields\. Date, number, price, and time expressions must map to fields such asvalue\_date\.month,value\_number, or a value\-unit pair\. These cases motivate candidate\-aware SFT\.

## Appendix DClass Labels

The 11\-class class\-conditioned setting uses the following graph classes: broadcast program, company, person, performance, music song, automobile model, movie, sports team, organization, country, and festival\. These labels are the output space of the domain classifier\. At inference time, the predicted class determines which full relation schema is inserted into the generator prompt\.

## Appendix EOracle Prompting by Query Complexity

Table[6](https://arxiv.org/html/2606.27742#A5.T6)breaks down oracle prompt\-only performance by the number of relations in the target query\.

Table 6:Oracle prompt\-only performance by query complexity\. Failure increases as more relations must be composed\.
## Appendix FCandidate\-Aware SFT Diagnostics

Before the final class\-conditioned setting, we analyze candidate\-aware SFT under controlled broadcast\-domain conditions\. We compare two schema formats\. The CLS format includes the target class of each relation, such as\[D\] BROADCAST\_PROGRAM \[N\] narrator \[R\] PERSON\. The No\-CLS format replaces the target class with a coarse type such asENTITYorLITERAL\. We also compare two candidate conditions\. The oracle condition provides only the gold relations\. The noise condition adds three negative relation candidates for each gold relation\.

Table 7:Broadcast\-domain SFT diagnostics\. EM and execution are percentages\. Noise candidates reduce F1, which supports training and evaluation with retrieval\-like distractors\.Table[7](https://arxiv.org/html/2606.27742#A6.T7)shows that SFT remains robust with noisy relation candidates, but noise clearly reduces execution\-result F1\. This supports the final candidate\-aware training setup, where the model must choose from realistic relation and entity candidates\. Table[10](https://arxiv.org/html/2606.27742#A9.T10)gives an anonymized SFT instance with candidate relations, entity candidates, and distractors\.

## Appendix GTraining Details

Table[8](https://arxiv.org/html/2606.27742#A7.T8)lists the LoRA training configuration used for the generator\. The setup trains only adapter parameters and masks prompt tokens, so the loss is applied to the assistant\-side Cypher response rather than to the input context\.

Table 8:LoRA training configuration\.
## Appendix HData Construction Example

Table[9](https://arxiv.org/html/2606.27742#A8.T9)illustrates how one grounded graph record is transformed before SFT\. The canonical CypherCgoldC\_\{\\text\{gold\}\}is built first, and the language\-side formsNLanalyzed\\text\{NL\}\_\{\\text\{analyzed\}\}andNLnaive\\text\{NL\}\_\{\\text\{naive\}\}are derived from the same verified conditions\. The LLM then rewrites only the question text, while the gold Cypher remains fixed\.

Table 9:An anonymized data construction example before SFT conversion\. The symbolic Cypher target is fixed before LLM rewriting, and validation checks the generated questions against the same target\.
## Appendix IExample SFT Instance

Table[10](https://arxiv.org/html/2606.27742#A9.T10)shows the candidate\-aware SFT format used by the generator\. The example illustrates the three pieces that are present in both training and serving: a natural\-language question, relation and entity candidates, and the gold Cypher target\. Entity identifiers are anonymized placeholders rather than real enterprise URIs\.

Table 10:An anonymized SFT instance in the candidate\-aware instruction format\. The input contains the question, candidate relations, and entity candidates, and the output is the gold Cypher\.

Similar Articles