@pauliusztin_: I spent months optimizing GraphRAG retrieval. But it turned out I was optimizing the wrong thing.... The biggest knowle…


Summary

A detailed guide on optimizing knowledge graph ingestion for AI agents, presenting a five-step pipeline (extraction, resolution, embedding, deduplication, routing) to prevent graph corruption and improve retrieval quality.


Cached at: 06/10/26, 09:58 PM

I spent months optimizing GraphRAG retrieval.

But it turned out I was optimizing the wrong thing.

The biggest knowledge graph problems usually occur during ingestion (even though most conversations focus on retrieval).

Every new document creates a risk of graph corruption.

This is why I now think about knowledge graph ingestion as a 5-step pipeline:

1/ Extraction

Convert raw text into entities and relationships.

For example: Person → WORKS_AT → Organization

The goal is to extract what your ontology cares about.

2/ Resolution

This step standardizes names.

For example:

NYC → New York City

JP Morgan → JPMorgan Chase

Jon Smith → John Smith

Most importantly, nothing has been merged yet.

3/ Embedding

Next, embed the entity’s full context (not just the name).

Think:

Type

Attributes

Metadata

Relevant content

Because identity lives in context.

4/ Deduplication

Many systems fail here.

Because:

Apple the company ≠ Apple the fruit

Paris, France ≠ Paris, Texas

Two people can share the same name

Resolution answers naming. Deduplication answers identity.

Those are two completely different jobs.

5/ Routing

Finally, the system decides:

Merge

Human review

Create new node

And the best systems follow a simple rule:

Evidence strength = permission strength.

Weak evidence → new node

Strong evidence → merge

Uncertain evidence → human review

Because false merges are expensive.

A duplicate node is annoying.

But a corrupted graph can silently poison retrieval quality for months.

My biggest takeaway?

Knowledge graph quality isn’t determined by your retrieval strategy…

It’s determined by the pipeline that creates the graph in the first place.

Get these five steps right… and retrieval becomes much easier.

P.S. I break down the full pipeline, entity resolution, deduplication thresholds, review queues, and production architecture in Decoding AI Magazine

Check it out here: https://decodingai.com/p/keep-knowledge-graph-clean…


How to Keep a Knowledge Graph Clean for AI Agents

Source: https://www.decodingai.com/p/keep-knowledge-graph-clean

Two months ago, I started building unified memory layers on top of knowledge graphs. One question kept coming back from readers. How do you handle entity resolution and deduplication without corrupting the graph?

Rather than guessing, I spent serious time studying how mem0, cognee, and Neo4j actually solve it. The recurring question exposes a confusion almost everyone shares. People treat entity resolution and deduplication as the same step.

That confusion is exactly what corrupts graphs. People collapse naming and identity into 1 fuzzy check.

Also, if the merging step is not properly designed, 2 different real-world entities can silently merge, corrupting your graph.

The result is that you lose the trust in your graph that made it worth building. The graph quietly rots. Nobody trusts it, and the entire memory layer you invested in goes unused.

The failure is invisible until it becomes expensive to undo. The fix is to separate naming from identity.

We will walk through the end-to-end pipeline. This includes LLM extraction, entity resolution for naming, embedding the full node and deduplication for identity. Plus, the 2 safety nets most tutorials skip.

We covered the full memory-system design and the ontology design in prior articles. This piece focuses only on keeping the graph clean. By the end, you will be able to design a graph that stays clean and usable as it grows.

Why We Killed RAG in Production (Product)

This article shows how to keep a graph memory layer clean. In a recent podcast, I covered the decision that comes before it: whether you need retrieval at all.

I explain why we killed RAG for a financial advisor product. All of an advisor’s data summed to 64,000 tokens, so loading the full context beat RAG’s zigzag retrieval loop. The formula I use: your data-to-context-window ratio.

We also get into regretting MCP everywhere, treating vibe-coded output as a compilation step, and why AI evals become the real job once the model writes the code.


In goes a document or a conversation turn. Out comes a set of canonical, deduplicated nodes correctly wired into the existing graph. Everything between is about making sure each new node is named and identified right.

First, an LLM extractor reads the text and emits entities and relationships as (entity, relationship, entity) triplets. It anchors within the POLE+O, Facts, and Preferences ontology. This ensures it only extracts the entity types you actually care about, as we explained in depth in a prior article.

For example, a sentence about a person working at a company becomes a (Person)-[:WORKS_AT]->(Organization) triplet. The ontology told the extractor those are the types that matter.
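As a rough illustration of how an ontology can gate what the extractor is allowed to emit, here is a minimal sketch. The `ALLOWED_PATTERNS` set, the `LOCATED_IN` relationship, and the sample names are hypothetical and not taken from the original pipeline; only the `(Person)-[:WORKS_AT]->(Organization)` pattern comes from the text above.

```python
from dataclasses import dataclass

# Hypothetical slice of an ontology: the (source type, relationship, target type)
# patterns we care about. Only the first pattern appears in the article.
ALLOWED_PATTERNS = {
    ("Person", "WORKS_AT", "Organization"),
    ("Organization", "LOCATED_IN", "Location"),
}


@dataclass(frozen=True)
class Entity:
    name: str
    type: str


@dataclass(frozen=True)
class Triplet:
    source: Entity
    relationship: str
    target: Entity


def keep_in_ontology(triplets):
    """Drop triplets whose type pattern is not listed in the ontology."""
    return [
        t for t in triplets
        if (t.source.type, t.relationship, t.target.type) in ALLOWED_PATTERNS
    ]


# Illustrative extractor output: one triplet fits the ontology, one does not.
raw = [
    Triplet(Entity("Jane Doe", "Person"), "WORKS_AT", Entity("Acme", "Organization")),
    Triplet(Entity("Jane Doe", "Person"), "LIKES", Entity("coffee", "Food")),
]
kept = keep_in_ontology(raw)
```

In practice the same constraint is usually also stated in the extraction prompt, so a filter like this acts as a second line of defense rather than the only one.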

If using only LLMs for extraction becomes too costly, you can use a cost-tiered cascade here, starting with fast statistical models like spaCy for common entities, moving to zero-shot models like GLiNER for domain-specific types, and falling back to an LLM for complex cases.
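The cascade idea can be sketched without committing to any specific model. In the snippet below the three tier functions are stand-ins with made-up behavior (they do not call spaCy, GLiNER, or an LLM), and the escalation rule "move on when a tier returns nothing" is an assumption chosen for simplicity.

```python
from typing import Callable, List, Optional, Tuple

Entity = Tuple[str, str]  # (name, type)
Extractor = Callable[[str], Optional[List[Entity]]]


def run_cascade(text: str, tiers: List[Tuple[str, Extractor]]):
    """Try tiers from cheapest to most expensive; stop at the first that answers."""
    for tier_name, extract in tiers:
        entities = extract(text)
        if entities:  # None or [] means "not confident, escalate"
            return tier_name, entities
    return "none", []


# Stand-ins for real models (hypothetical behavior, no library calls).
def statistical_tier(text):
    return [("New York City", "Location")] if "New York City" in text else None


def zero_shot_tier(text):
    return [("GraphRAG", "Technique")] if "GraphRAG" in text else None


def llm_tier(text):
    # The expensive fallback always returns something in this sketch.
    return [(text.strip(), "Unknown")]


TIERS = [
    ("statistical", statistical_tier),
    ("zero_shot", zero_shot_tier),
    ("llm", llm_tier),
]

tier_used, entities = run_cascade("I moved to New York City", TIERS)
```

A real cascade would likely escalate per entity or per confidence score rather than per document, but the ordering by cost stays the same.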

Before touching the graph, we must decide what this new entity should be called. The system normalizes its name against existing nodes of the same type. This is the finding-the-canonical-name step, and no merges happen yet.

From raw documents to a clean graph node: extraction, resolution, embedding, deduplication, then the merge/flag/add decision.

Next, we compute an embedding over the entity’s full context. This includes its name, type, and attributes. We embed more than just its bare name. This is what later lets deduplication compare identity rather than spelling.

We compare the embedded node against existing nodes. This decides whether it is the same real-world entity as one already in the graph.

Based on the deduplication outcome, the system makes a final routing decision. It either merges into an existing node, flags the pair for human review, or adds a brand-new node.

A new mention of a company gets extracted as a typed entity. The resolution step normalizes it to a canonical name. Then, it gets embedded with its full context so we capture its semantic meaning. It is compared against existing same-type nodes to verify its identity. Finally, it gets added, merged, or flagged for review.

Resolution and deduplication are 2 distinct decisions doing 2 distinct jobs. Let’s zoom in on each one.

During resolution we find the canonical name for each entity. It answers *“what should we call this?”*.

It handles typos, acronyms, and surface-form similarity. These are the noisy ways humans and documents write the same thing. It uses exact, fuzzy, and semantic matching in a short-circuit chain.

The short-circuit chain passes the entity to the next matcher only if no confident match is found. If exact match fails, it tries fuzzy match. If fuzzy match fails, it tries semantic match (using light embeddings only on the name).

But it matches only against the names of existing nodes of the same type. You never compare a PERSON name against an ORGANIZATION name.

“NYC” resolves to “New York City”. “JP Morgan” resolves to “JPMorgan Chase”. The 3 forms "John Smith ", "john smith", and "Jon Smith" all collapse to 1 canonical “John Smith”.

This happens because resolution absorbs whitespace, casing, and typo variations. Fuzzy string matching uses token-based comparison to handle word order and partial matching for abbreviations. At this stage the system only updates the node’s canonical_name property. No graph merges happen yet.

Resolution chains exact → fuzzy → semantic matching against same-type names to assign a canonical name (without ever merging nodes).

Often, you also keep track of a list of aliases for each node. Whenever you find a new hit via fuzzy or semantic match that doesn’t match the current canonical_name, you add it to the list of aliases. That way, future checks can speed up matching by looking at the alias list first.
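To make the short-circuit chain concrete, here is a minimal sketch of exact → fuzzy → semantic matching with type gating and an alias list. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for the token-based fuzzy matcher described above, the `0.85` fuzzy threshold is an illustrative assumption, and `lookup_semantic` is a hypothetical placeholder for a name-only embedding search.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Absorb casing and whitespace variation."""
    return " ".join(name.lower().split())


def fuzzy_score(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def resolve_name(mention, entity_type, nodes, aliases,
                 semantic_match=None, fuzzy_threshold=0.85):
    """Return (canonical_name, matcher_used). Names only; nothing is merged."""
    candidates = nodes.get(entity_type, [])  # type gate: same-type names only
    key = normalize(mention)

    # 1) Exact match on canonical names and previously learned aliases.
    for canonical in candidates:
        if key == normalize(canonical) or key in aliases.get(canonical, set()):
            return canonical, "exact"

    # 2) Fuzzy match, tried only because exact matching failed.
    if candidates:
        best = max(candidates, key=lambda c: fuzzy_score(mention, c))
        if fuzzy_score(mention, best) >= fuzzy_threshold:
            aliases.setdefault(best, set()).add(key)
            return best, "fuzzy"

    # 3) Semantic match on the name, supplied by the caller.
    if semantic_match is not None:
        hit = semantic_match(mention, candidates)
        if hit is not None:
            aliases.setdefault(hit, set()).add(key)
            return hit, "semantic"

    return mention.strip(), "new"


nodes = {"Person": ["John Smith"], "Location": ["New York City"]}
aliases = {}


def lookup_semantic(mention, candidates):
    # Stand-in for a name-only embedding search (hypothetical).
    known = {"nyc": "New York City"}
    hit = known.get(normalize(mention))
    return hit if hit in candidates else None
```

Note that the second time "NYC" shows up, the alias list answers it in the exact step, so the more expensive matchers are never reached.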

Similar names are not strong enough evidence that 2 entities are identical. This is the line most people blur. Blurring it is what causes silent corruption.

Apple the company is not Apple the fruit. They have different types, so type-gating already separates them. A harder example is Jensen Huang the CEO of NVIDIA versus a doctor in Taipei with the same name.

They have the same name and the same type. Yet they are 2 different real-world people. Naming similarity alone would happily fuse them.

Still, canonical names are extremely useful for GROUP BY operations where, during querying and visualizations, we can quickly understand the data. During human review, we can even spot duplicates and resolve them manually.

That is why identity is a separate decision. Resolution has told us what to call the node. It has deliberately not told us whether the node is a duplicate.

That second, riskier question belongs to deduplication.

Deduplication is the identity layer. It answers the harder question: “is this the same real-world entity?”. It is the step where merges actually happen [5].

In goes 1 embedded node. Out comes a single routing decision: merge into an existing node, flag it for review, or create a new node.

The system embeds the full entity context. It compares it against existing nodes using semantic and fuzzy similarity across that full context. The richer signal is what lets it distinguish 2 same-named, same-type entities that resolution could not.

By the context of a node, we mean the entity’s attributes, such as its text, image, or video content, or its metadata properties, such as a person’s email or date of birth, or an object’s model or manufacturer. Still, you don’t want to embed everything. Skip fields such as identifiers, and for each ontology type pick the fields that carry the highest signal.
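One way to picture "embed the full context, but only the high-signal fields" is to build the text that gets embedded from a per-type field list. The `EMBED_FIELDS` mapping, the attribute names, and the `" | "` separator below are all hypothetical choices for illustration; the embedding call itself is left out on purpose.

```python
# Hypothetical per-type field selection: which attributes carry identity signal.
EMBED_FIELDS = {
    "Person": ["role", "employer", "email"],
    "Location": ["country", "region"],
}


def build_embedding_text(entity: dict) -> str:
    """Compose the text to embed: name, type, then the selected fields for that type."""
    parts = [f"name: {entity['name']}", f"type: {entity['type']}"]
    for field in EMBED_FIELDS.get(entity["type"], []):
        value = entity.get("attributes", {}).get(field)
        if value:
            parts.append(f"{field}: {value}")
    return " | ".join(parts)


# Two same-named, same-type nodes. The internal id is deliberately not embedded.
paris_fr = {"id": "n1", "name": "Paris", "type": "Location",
            "attributes": {"country": "France"}}
paris_tx = {"id": "n2", "name": "Paris", "type": "Location",
            "attributes": {"country": "United States", "region": "Texas"}}

text_fr = build_embedding_text(paris_fr)
text_tx = build_embedding_text(paris_tx)
```

The two "Paris" nodes share a name and a type, yet their embedding inputs now differ, which is the signal deduplication needs later.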

The combined deduplication score is an explicit weighted blend. It uses the embedding score multiplied by 0.7 and the fuzzy score multiplied by 0.3. Based on a similarity score from 0 to 1, we have 3 bands.

High confidence (≥0.95) triggers an auto-merge. Medium confidence (0.85–0.95) flags the pair for human review. Low confidence (<0.85) creates a new node.

Near-certain identity is allowed to merge automatically. The uncertain middle is escalated. Weak evidence just becomes a fresh node.
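The weighted blend and the 3 bands translate almost directly into code. The weights and thresholds below are the ones stated above; the function and enum names are illustrative, and the two input similarities are assumed to already be on a 0 to 1 scale.

```python
from enum import Enum

EMBEDDING_WEIGHT = 0.7
FUZZY_WEIGHT = 0.3
MERGE_THRESHOLD = 0.95   # high confidence: auto-merge
REVIEW_THRESHOLD = 0.85  # medium confidence: human review


class Route(Enum):
    MERGE = "merge"
    REVIEW = "human_review"
    NEW_NODE = "new_node"


def combined_score(embedding_sim: float, fuzzy_sim: float) -> float:
    """Weighted blend of full-context embedding similarity and fuzzy similarity."""
    return EMBEDDING_WEIGHT * embedding_sim + FUZZY_WEIGHT * fuzzy_sim


def route(score: float) -> Route:
    """Map a combined score to one of the three routing decisions."""
    if score >= MERGE_THRESHOLD:
        return Route.MERGE
    if score >= REVIEW_THRESHOLD:
        return Route.REVIEW
    return Route.NEW_NODE


# Identical names (fuzzy = 1.0) cannot reach the merge band on their own:
# with a weak context match the blend stays low.
same_name_weak_context = route(combined_score(0.60, 1.0))
```

Because the fuzzy score only contributes 0.3, a perfect name match with a mediocre context match still lands in the new-node band, which is the behavior the Paris example calls for.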

False merges silently corrupt the graph. The corruption is invisible until it is expensive. Take the Paris example: 2 LOCATION nodes both named “Paris”.

One is the capital of France, and the other is Paris, Texas. They have the same name, the same type, and very similar bare-name embeddings. But they are 2 different places.

The dangerous part is the middle band, the gray area. This is where the system is not sure and a human has to step in.

Deduplication scores full-context similarity, then routes to auto-merge, human review, or a new node: stronger evidence earns more irreversible action.

When a deduplication score lands in the medium band (0.85–0.95), the system deliberately does not merge. It flags the pair for a human to decide, as merging is a dangerous operation we should be really deliberate about.

When a merge does happen, the source node gets tombstoned, meaning it is kept queryable for forensics but skipped in future matching. Actually undoing a merge means re-ingesting the source data. That reversibility cost is the whole reason for the gray zone.

Whenever a new entity is flagged for human review, a new node is created and a (:Entity)-[:SAME_AS {status:'pending', confidence}]->(:Entity) edge is added inside the graph itself. The human review step transitions that status to confirmed or rejected. The review queue is just a Cypher query over pending SAME_AS edges, ordered by confidence.
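The review queue described above can be sketched in two forms: one plausible shape for the Cypher query, and an in-memory equivalent that is easy to test. Both are assumptions rather than the author's implementation. In particular, the descending sort order and the `source_id`/`target_id` field names are illustrative choices.

```python
from dataclasses import dataclass

# One plausible shape for the queue query (an assumption, not from the source).
REVIEW_QUEUE_CYPHER = """
MATCH (a:Entity)-[r:SAME_AS {status: 'pending'}]->(b:Entity)
RETURN a, b, r.confidence AS confidence
ORDER BY confidence DESC
"""


@dataclass
class SameAsEdge:
    source_id: str
    target_id: str
    confidence: float
    status: str = "pending"


def review_queue(edges):
    """In-memory equivalent of the query: pending edges, highest confidence first."""
    pending = [e for e in edges if e.status == "pending"]
    return sorted(pending, key=lambda e: e.confidence, reverse=True)


def decide(edge, is_duplicate: bool):
    """Record the reviewer's answer by transitioning the edge status."""
    if edge.status != "pending":
        raise ValueError("edge already reviewed")
    edge.status = "confirmed" if is_duplicate else "rejected"
    return edge


edges = [
    SameAsEdge("codex-model", "codex-cli", 0.88),
    SameAsEdge("jensen-huang-ceo", "jensen-huang-doctor", 0.91),
]
queue = review_queue(edges)
decide(queue[0], is_duplicate=False)  # same name, different people
```

Keeping the decision on the edge means a rejected pair stays recorded in the graph, so the same two nodes do not have to be re-flagged on the next pass.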

For each flagged pair, the reviewer answers 1 question. Is this actually a duplicate, a new node, or neither?

This usually happens to entities that are related but not identical. The Codex model and the Codex CLI are related, but not the same object. The same applies to Jensen Huang the CEO versus a same-named doctor in Taipei.

This is hardest at the start of an entity’s lifecycle. When metadata is scarce, similarity spikes, and you risk polluting 1 node with another’s attributes.

Human review catches the uncertain pairs the live pipeline surfaces. But some duplicates never get surfaced at all. That is the gap the dream pipeline closes.

While the system ingests documents, data often flows through in parallel. If 2 entities are processed at the same time, the resolution and deduplication steps never get to compare them against each other.

The system would never check whether Claude Code from Conversation X and Claude Code from Document Y are the same entity, because neither existed in the graph when the other was written.

You run a dream pass every night. It re-runs the deduplication pass on recently ingested nodes only. Otherwise, you would have to loop through all nodes in the graph, which becomes increasingly expensive as the graph grows.

It does not run the full resolution chain. Because the embeddings were already computed at ingest time, this is a light operation. It is primarily database reads and writes, not fresh model calls. Since it mostly adds I/O pressure, run it when organic traffic is low, which is usually during the night, hence the name: the dream pipeline.
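A dream pass can be sketched as a loop over recently ingested nodes that reuses stored embeddings. To stay short, this sketch scores pairs with cosine similarity alone instead of the full weighted blend, and the node dictionary layout, the 24-hour window, and the node ids are hypothetical.

```python
import math
from datetime import datetime, timedelta


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dream_pass(nodes, now, window=timedelta(hours=24), review_threshold=0.85):
    """Compare recently ingested nodes with same-type nodes via stored embeddings."""
    cutoff = now - window
    recent = [n for n in nodes if n["ingested_at"] >= cutoff]
    flagged, seen = [], set()
    for node in recent:
        for other in nodes:
            if other["id"] == node["id"] or other["type"] != node["type"]:
                continue
            pair = tuple(sorted((node["id"], other["id"])))
            if pair in seen:  # each pair is scored once
                continue
            seen.add(pair)
            score = cosine(node["embedding"], other["embedding"])
            if score >= review_threshold:
                flagged.append((pair, round(score, 3)))
    return flagged


now = datetime(2026, 1, 2, 3, 0)
nodes = [
    # Two nodes ingested in parallel, so neither saw the other at ingest time.
    {"id": "a", "type": "Object", "embedding": [1.0, 0.0],
     "ingested_at": now - timedelta(hours=2)},
    {"id": "b", "type": "Object", "embedding": [0.99, 0.1],
     "ingested_at": now - timedelta(hours=2)},
    # An older, unrelated node of the same type.
    {"id": "c", "type": "Object", "embedding": [0.0, 1.0],
     "ingested_at": now - timedelta(days=30)},
]
flagged = dream_pass(nodes, now)
```

No model is called anywhere in the loop: the work is reads over stored vectors plus a handful of writes for the flagged pairs, which is why it fits a low-traffic window.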

I’ve spent the past 4 months building unified memory layers on top of knowledge graphs, and I learned that keeping them clean is the hardest part. Keeping your knowledge graph clean is the maintenance step that decides whether the graph ever gets used. A graph full of noise, fragments, and false merges will not be trusted or queried.

In case you want to learn more, remember that we also covered the full memory-system architecture via knowledge graphs and the ontology design in prior articles.

But here is what I’m wondering:

What are the core strategies you’ve used to keep your knowledge graph clean and usable? Something close to our approach here, or something completely different?


  1. Iusztin, P. (n.d.). Understanding the Neo4j Graph Agent Memory System. Decoding AI Magazine. https://www.decodingai.com/p/understanding-neo4j-graph-agent-memory-system
  2. Iusztin, P. (n.d.). Ship a Knowledge Graph Ontology in 5 Minutes. Decoding AI Magazine. https://www.decodingai.com/p/ship-a-knowledge-graph-ontology-in-5-minutes
  3. POLE+O Data Model. (n.d.). Neo4j Labs. https://neo4j.com/labs/agent-memory/explanation/poleo-model/
  4. How Entity Extraction Works. (n.d.). Neo4j Labs. https://neo4j.com/labs/agent-memory/explanation/extraction-pipeline/
  5. Entity Resolution and Deduplication. (n.d.). Neo4j Labs. https://neo4j.com/labs/agent-memory/explanation/resolution-deduplication/

If not otherwise stated, all images are created by the author.
