# DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation
Source: https://arxiv.org/html/2604.15593
###### Abstract

Large language models compress human knowledge into unstructured weight vectors where domain boundaries do not exist — a fact about quantum mechanics and a fact about cooking share the same parameter space and can contaminate each other during generation. We propose a domain-algebraic language model (DALM) that generates under exact structural constraints derived from a domain lattice.

DALM shares a core intuition with diffusion language models (dLLMs): generation as progressive denoising from high entropy to low entropy. The difference is structural. Existing dLLMs (LLaDA, Dream, Zhou et al. 2026) denoise by randomly unmasking tokens — there is no semantic ordering, no domain constraint, no algebraic guarantee on the denoising path. DALM denoises along a domain lattice: domain uncertainty is resolved first, then relation uncertainty, then concept uncertainty. Each denoising step is algebraically constrained.

The framework requires three abstract ingredients: (1) a lattice (L, ⊑) of domains with computable meet, join, and implication; (2) a typing function τ that classifies relations as monotone or non-monotone, controlling inheritance across domain boundaries; and (3) a fiber function F that partitions the knowledge base into domain-local subsets. Given any system satisfying these requirements, DALM provides a three-phase encoder-decoder architecture where every generation step is confined to a domain fiber, cross-domain contamination is structurally prohibited in closed-vocabulary mode and auditably bounded in open-vocabulary mode, and a single query produces a domain-indexed multi-perspective answer space.

The architecture is trained on domain-annotated, consistency-verified structured knowledge bases where each training example carries domain annotation, relation type, and validation guarantees — structurally richer signal than raw text. We demonstrate the framework using the CDC (Domain-Contextualized Concept Graph) knowledge representation as a concrete instantiation, and specify an evaluation protocol on medical domain crystal libraries.

A note on reading this paper. The core contribution is *controlled structured denoising along an algebraic lattice*, not a mathematical derivation of Markov diffusion chains. DALM is not a likelihood-based generative model operating on a flat token space; it is a constrained denoising system whose computation space is shaped by domain algebra. Reading it through the lens of representation alignment or maximum-likelihood factorization will produce apparent gaps that do not exist in the framework as formulated.

Keywords. Domain-Algebraic Language Model, DALM, Structured Denoising, Domain Lattice, Algebraic Constraints, Diffusion Language Models, Hallucination Prevention

## 1. Introduction

### 1.1. The Compression Problem

All large language models perform the same fundamental operation: compress a corpus of human knowledge into a parameter vector θ ∈ R^d, then generate text by sampling from p(x_{t+1} | x_{1:t}; θ).

This compression is lossy in a specific and consequential way. The parameter vector θ does not preserve the domain structure of the original knowledge. A fact about quantum mechanics and a fact about cooking share the same parameter space, influence each other during training, and can contaminate each other during generation. There is no structural mechanism in θ that separates "atom is a quantum field excitation" from "atom is an indivisible particle" — both are smeared across the same weight matrices, distinguishable only by the statistical patterns of their surrounding tokens.

This is the structural origin of hallucination. When an LLM generates text about quantum mechanics, it samples from a distribution conditioned on the entire parameter space, including regions shaped by classical physics, chemistry, cooking, and everything else in the training corpus. Cross-domain contamination is not a bug in the sampling algorithm — it is a structural property of unstructured compression.

### 1.2. Structured Compression as an Alternative

We propose a different compression target. Instead of compressing knowledge into an unstructured vector θ, compress it into a domain-indexed structured representation:

Unstructured (LLM): Corpus → θ ∈ R^d (one vector, all domains mixed)

Structured (DALM): Corpus → {h_c(d), h_r(d), h_d} for all (c, r, d) (domain-indexed embeddings)

DALM’s representation preserves domain structure: concept embeddings are indexed by domain, relation embeddings are typed by τ, and domain embeddings form a lattice with algebraic operations (meet, join, implication). Generation from this representation is domain-constrained by construction: not by post-hoc filtering, but by the geometry of the representation space itself.
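As a minimal sketch of what this compression target could look like in code (the `DomainIndexedStore` class and its field layout are our illustration, not an interface from the paper), domain-indexed storage keeps one embedding per (concept, domain) pair, so no two domains ever share a slot:

```python
# Illustrative sketch of a domain-indexed representation store.
# DomainIndexedStore, dim, and the dict layout are our assumptions.
import numpy as np

class DomainIndexedStore:
    def __init__(self, dim: int = 256, seed: int = 0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.concept = {}   # (concept, domain) -> h_c(d)
        self.relation = {}  # (relation, domain) -> h_r(d)
        self.domain = {}    # domain -> h_d

    def h_c(self, c: str, d: str) -> np.ndarray:
        # "Apple"@Biology and "Apple"@Business are distinct vectors:
        # they are never averaged into a single flat-space embedding.
        key = (c, d)
        if key not in self.concept:
            self.concept[key] = self.rng.standard_normal(self.dim)
        return self.concept[key]
```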

### 1.3. Structured Denoising: What Diffusion Language Models Should Have Been

Diffusion language models (Zhou et al., 2026; Nie et al., 2025; Ye et al., 2025) made an important architectural discovery: generation can be formulated as progressive denoising rather than left-to-right autoregressive prediction. A fully masked sequence is iteratively unmasked, with the model predicting which tokens to reveal at each step.

The limitation is that the denoising path has no semantic structure. Token i might be unmasked before token j for purely statistical reasons — there is no principle dictating that domain-level information should be resolved before concept-level information, or that relation types should constrain what concepts can appear. A medical token and a sports token can be unmasked in the same step, in the same attention context, with no structural isolation.

Independent physical evidence that denoising needs structure. Sclocchi, Favero & Wyart (2025a) proved in PNAS that diffusion models operating on hierarchically structured data exhibit a phase transition at a critical noise threshold ε*: below ε*, the denoising process preserves high-level features (e.g., image class); above ε*, high-level features collapse to random while low-level features persist and recombine. Their follow-up work (Sclocchi et al., 2025b, ICLR 2025) further demonstrated that forward-backward diffusion experiments can probe the latent hierarchical structure of data, with correlation lengths diverging at the phase transition.

This is direct empirical evidence — from statistical physics, not from knowledge representation — that hierarchical structure matters for denoising. In the physics idiom of Wyart’s group: above ε*, the hierarchical correlations decouple and the system becomes effectively one-scale — equivalent, in our formulation, to collapsing the domain lattice to a single universal fiber. What Sclocchi et al. observe experimentally as a critical slowing-down and a divergence of correlation length is, structurally, the point at which the algebraic constraint that separates fibers can no longer be sustained by the denoising dynamics. Wyart’s group observed this with physics tools on image data; we formalize it with domain algebra and prevent it architecturally by pinning the denoising schedule to the lattice rather than letting it drift freely.

DALM is the architectural response to this physical observation. Instead of allowing the hierarchy to collapse at ε* (which happens in unstructured diffusion), DALM enforces hierarchical structure at every denoising step through the domain lattice. The three-phase denoising path is not arbitrary — it follows the lattice:

Step 1 (Domain denoising): Resolve which domain(s) the input/query belongs to. This eliminates the largest source of uncertainty: which world are we in?

Step 2 (Relation denoising): Within each activated domain, resolve which relations are active, constrained by τ-typing. This eliminates relational uncertainty: what rules apply in this world?

Step 3 (Concept denoising): Within each domain-relation pair, generate specific concepts from the fiber-local vocabulary. This eliminates conceptual uncertainty: what specific entities populate this world under these rules?

Each step’s output space is constrained by the previous step’s result. This is algebraically guaranteed denoising — not a learned schedule that might drift during training, but exact structural constraints derived from the domain lattice.

The existing dLLM infrastructure — masking mechanisms, KV cache partitioning, parallel decoding — is fully compatible with DALM. The modification is the denoising schedule: replace random unmasking with lattice-structured unmasking. This is architecturally a mask replacement, not a framework replacement.
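A minimal sketch of that mask replacement, assuming masked positions are tagged with their lattice level (the `Slot` type and the level names are our illustration):

```python
import random
from collections import namedtuple

Slot = namedtuple("Slot", ["level"])  # "domain" | "relation" | "concept"
LEVEL = {"domain": 0, "relation": 1, "concept": 2}

def random_unmask_order(slots):
    """Baseline dLLM schedule: no semantic ordering."""
    order = list(range(len(slots)))
    random.shuffle(order)
    return order

def lattice_unmask_order(slots):
    """DALM schedule: all domain slots resolve before any relation slot,
    and all relation slots before any concept slot. Ties within a level
    can still be decoded in parallel, as in standard dLLM infrastructure."""
    return sorted(range(len(slots)), key=lambda i: LEVEL[slots[i].level])
```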

### 1.4. Contributions

1. A general framework for structured denoising over algebraic lattices, applicable to any knowledge system that provides a lattice, a typing function, and a fiber partition (Section 2).
2. Three-phase encoder-decoder architecture where each phase corresponds to a level of the lattice structure (domain, relation, concept) and is constrained by domain algebra (Sections 3–4).
3. Formal analysis of hallucination as cross-domain leakage, with structural prevention in closed vocabulary and auditable bounds in open vocabulary (Section 5).
4. Multi-perspective generation: a single query produces a domain-indexed answer space (Section 5).
5. Graceful degradation: partial success yields useful components — automatic crystallizer, domain-structured embeddings (Section 5).
6. Concrete instantiation using the CDC framework (Section 7), with evaluation protocol on medical domain knowledge bases (Section 8).

## 2. Structured Denoising over Algebraic Lattices

This section defines the abstract algebraic requirements for DALM. Any knowledge representation system satisfying these requirements can serve as DALM’s structural substrate. The definitions are self-contained; a concrete instantiation is given in Section 7.

### 2.1. The Three Ingredients

Ingredient 1: A domain lattice (L, ⊑). A partially ordered set of domains with a specialization order ⊑, equipped with computable meet (⊓), join (⊔), and implication (→) operations. The lattice has a top element ⊤ (the universal domain) and satisfies the Heyting algebra axioms: it is distributive, bounded, and supports a relative pseudo-complement (the implication). Concretely: @Physics@Quantum ⊑ @Physics ⊑ ⊤, the join of two domains is their most specific common generalization, and their meet is their greatest common specialization. The implication operation d1 → d2 is used in the τ-typing mechanism: it determines whether knowledge transfer from d1 to d2 is structurally licensed, and it underlies the inheritance decisions that the decoder’s Phase 2 must respect. The DALM architecture uses meet and join explicitly (in domain selection and lattice loss) and implication implicitly (through the τ-typing mask that governs cross-domain inheritance).
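For concreteness, a toy sketch of the lattice operations over path-named domains (this tree-shaped simplification is our assumption; it covers ⊑, ⊓, and ⊔ but omits the full Heyting implication, which a real instantiation must supply):

```python
TOP = "@"  # the universal domain ⊤

def parts(d: str):
    return [p for p in d.split("@") if p]

def is_sub(d1: str, d2: str) -> bool:
    """d1 ⊑ d2: d1 specializes d2 (d2's path is a prefix of d1's)."""
    p1, p2 = parts(d1), parts(d2)
    return p1[: len(p2)] == p2

def join(d1: str, d2: str) -> str:
    """⊔: most specific common generalization (longest common prefix)."""
    common = []
    for a, b in zip(parts(d1), parts(d2)):
        if a != b:
            break
        common.append(a)
    return "@" + "@".join(common) if common else TOP

def meet(d1: str, d2: str):
    """⊓: greatest common specialization; None (⊥) when incomparable."""
    if is_sub(d1, d2):
        return d1
    if is_sub(d2, d1):
        return d2
    return None

assert is_sub("@Physics@Quantum", "@Physics")
assert join("@Physics@Quantum", "@Physics@Classical") == "@Physics"
```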

Ingredient 2: A typing function τ: R → {monotone, non-monotone}. Each relation predicate r in the system’s vocabulary is classified as either monotone (truth in a parent domain implies truth in child domains) or non-monotone (truth does not propagate downward). This classification controls inheritance: if is_a is monotone, then "Atom is_a Particle" in @Physics inherits into @Physics@Quantum. If contrasts_with is non-monotone, then "Wave contrasts_with Particle" in @Physics does *not* inherit into @Physics@Quantum, because wave-particle duality dissolves this contrast at that level.

The typing function may be global (τ depends only on r) or domain-conditioned (τ depends on both r and d). The global case provides stronger algebraic guarantees; the domain-conditioned case is more expressive. Both are supported. When τ is domain-conditioned and a relation r has τ(r, d1) = monotone but τ(r, d2) = non-monotone for d1 ⊑ d2, the more restrictive classification governs: the relation is treated as non-monotone on the d2 → d1 inheritance path, because allowing monotone propagation out of a domain that classifies the relation as non-monotone would violate that domain’s constraints.
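A hedged sketch of the resulting inheritance gate, reusing `is_sub` from the lattice sketch above (representing τ as a dict and defaulting unlisted pairs to non-monotone are our conservative choices):

```python
def inherits(fact_domain: str, query_domain: str,
             relation: str, tau: dict) -> bool:
    """May a fact asserted at fact_domain propagate down to query_domain?
    Downward inheritance requires query_domain ⊑ fact_domain, and the more
    restrictive classification governs: if either endpoint types the
    relation as non-monotone, propagation is blocked."""
    if not is_sub(query_domain, fact_domain):
        return False  # not a downward path in the lattice
    return all(tau.get((relation, d), "non-monotone") == "monotone"
               for d in (fact_domain, query_domain))

tau = {
    ("is_a", "@Physics"): "monotone",
    ("is_a", "@Physics@Quantum"): "monotone",
    ("contrasts_with", "@Physics"): "non-monotone",
}
assert inherits("@Physics", "@Physics@Quantum", "is_a", tau)
assert not inherits("@Physics", "@Physics@Quantum", "contrasts_with", tau)
```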

Ingredient 3: A fiber function F: L → 2^K. Each domain d ∈ L defines a fiber F(d): the complete set of knowledge units scoped to that domain. Facts in different fibers are semantically independent: is_a(Apple, Fruit, @Biology) and is_a(Apple, Company, @Business) coexist without contradiction because they reside in different fibers. A query scoped to domain d is evaluated only against F(d) — concepts in other fibers do not exist for that query.
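The Apple example, written as a minimal fiber-scoped lookup (the fiber contents and the `query` helper are illustrative, not the paper’s API):

```python
FIBERS = {
    "@Biology":  [("Apple", "is_a", "Fruit")],
    "@Business": [("Apple", "is_a", "Company")],
}

def query(concept: str, relation: str, domain: str):
    """Evaluate only against F(domain): other fibers do not exist here."""
    return [t for t in FIBERS.get(domain, [])
            if t[0] == concept and t[1] == relation]

assert query("Apple", "is_a", "@Biology") == [("Apple", "is_a", "Fruit")]
assert query("Apple", "is_a", "@Business") == [("Apple", "is_a", "Company")]
```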

### 2.2. Knowledge Units and Validation

The fundamental unit of structured knowledge is a tuple ⟨c, r@d, c′⟩ where c and c′ are concepts, r is a relation predicate, and @d is a domain specification. The @d field is not metadata; it is a structural part of the predicate’s signature. Any system that reads the tuple as a four-field unit automatically respects the domain scope.

A validated knowledge unit (which we call a *crystal*) is a tuple that has passed insertion-time validation against its fiber F(d): the new assertion does not create cycles in acyclic relations, does not reverse established causal chains, and does not contradict existing content within the same fiber. Crystals are guaranteed to be fiber-locally consistent.
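A sketch of insertion-time validation within one fiber, under simplifying assumptions: facts are (head, relation, tail) triples, the set of acyclic relations is given, and the contradiction check is reduced to edge reversal (a real instantiation’s consistency check is richer):

```python
def creates_cycle(fiber, head, rel, tail,
                  acyclic_rels=frozenset({"is_a", "causes"})):
    """Would adding head -rel-> tail close a loop in an acyclic relation?"""
    if rel not in acyclic_rels:
        return False
    frontier, seen = [tail], set()
    while frontier:  # walk forward from tail; reaching head closes a loop
        node = frontier.pop()
        if node == head:
            return True
        if node in seen:
            continue
        seen.add(node)
        frontier += [t for (h, r, t) in fiber if h == node and r == rel]
    return False

def validate(fiber, unit) -> bool:
    head, rel, tail = unit
    if creates_cycle(fiber, head, rel, tail):
        return False  # cycle in an acyclic relation
    if (tail, rel, head) in fiber:
        return False  # reverses an established edge (e.g. a causal chain)
    return True       # fiber-locally consistent: admit as a crystal
```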

### 2.3. The Structured Denoising Path

Given the three ingredients, the structured denoising path is defined as:

noise → Phase 1: domain → Phase 2: relation → Phase 3: concept → crystal

Each phase eliminates one dimension of uncertainty, and each phase’s output space is constrained by the previous phase’s result:

- Phase 1 (Domain): Select which domain(s) are relevant. Output: a probability distribution over L.
- Phase 2 (Relation): Within each activated domain d, select which relations are active, subject to the τ-typing constraint. Output: a set of typed relation predictions per domain.
- Phase 3 (Concept): Within each domain-relation pair, generate specific concepts from the fiber-local vocabulary F(d). Output: complete tuples ⟨c, r@d, c′⟩.

The ordering is not arbitrary. Domain uncertainty is the highest-level uncertainty (which world are we in?); concept uncertainty is the lowest-level uncertainty (which specific entity?). Resolving them in this order ensures that each step operates in a progressively more constrained space. This mirrors the hierarchical phase transition structure observed by Sclocchi et al. (2025a): high-level features (domain) must be resolved before low-level features (concepts) for the denoising to be coherent.
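Putting the three phases together as control flow (here `select_domains`, `select_relations`, and `generate_concepts` stand in for the learned components of Sections 3–4; only the constraint flow between phases is taken from the text):

```python
def dalm_generate(query, model, lattice, fibers):
    """Each phase's output space is fixed by the previous phase's result."""
    crystals = []
    for d in model.select_domains(query, lattice):       # Phase 1: domain
        for r in model.select_relations(query, d):       # Phase 2: τ-typed relations
            fiber_vocab = fibers[d]                      # Phase 3 confined to F(d)
            for c, c2 in model.generate_concepts(query, d, r, fiber_vocab):
                crystals.append((c, f"{r}@{d}", c2))     # tuple ⟨c, r@d, c′⟩
    return crystals  # a domain-indexed, multi-perspective answer space
```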

Framing: denoising dynamics, not likelihood factorization. A useful analogy: DALM behaves like a constrained dynamical system whose denoising trajectory is shaped by domain algebra, rather than a likelihood-based generative model that must explicitly align all semantically equivalent outputs. Under the lattice + τ-typing + validation constraints, illegal trajectories are structurally excluded, and legal trajectories are channeled into domain-specific regions of the output space. Semantic equivalence between expressions is expected to manifest as convergence to shared or adjacent regions of that constrained output space, rather than through an explicit alignment objective.
