# Hybrid Multi-Phase Page Matching and Multi-Layer Diff Detection for Japanese Building Permit Document Review
Source: [https://arxiv.org/html/2604.19770](https://arxiv.org/html/2604.19770)
###### Abstract
We present a hybrid multi-phase page matching algorithm for automated comparison of Japanese building permit document sets. Building permit review in Japan requires cross-referencing large PDF document sets across revision cycles, a process that is labor-intensive and error-prone when performed manually. The algorithm combines longest common subsequence (LCS) structural alignment, a seven-phase consensus matching pipeline, and a dynamic programming optimal alignment stage to robustly pair pages across revisions even when page order, numbering, or content changes substantially. A subsequent multi-layer diff engine, comprising text-level, table-level, and pixel-level visual differencing, produces highlighted difference reports. Evaluation on real-world permit document sets achieves F1 = 0.80 and precision = 1.00 on a manually annotated ground-truth benchmark, with zero false-positive matched pairs.
## 1 Introduction
Japanese building permit review (kenchiku kakunin) requires applicants to submit sets of architectural drawings and structural calculations, which must conform to the Building Standards Act [[6](https://arxiv.org/html/2604.19770#bib.bib7)]. Identifying all changes between an original and a revised document set can be formulated as an automated detection problem, complicated by:
- Page insertions and deletions across revisions
- Renumbering of pages and drawing indices
- Mixed content types (text, tables, technical drawings)
- Large document sizes (often 200–1000 pages per submission)

Manual comparison is time-consuming and prone to oversight. Existing general-purpose PDF diff tools [[13](https://arxiv.org/html/2604.19770#bib.bib6)] are insufficient because they assume stable page correspondence and do not handle the domain-specific structure of architectural document sets.
We address these limitations through a purpose-built page matching and diff detection pipeline. The core contributions are:
1. A seven-phase page matching algorithm combining structural hashing, drawing number recognition, section title matching, and visual perceptual hashing.
2. A dynamic programming alignment stage (Needleman-Wunsch style [[7](https://arxiv.org/html/2604.19770#bib.bib1)]) that resolves ambiguous matches from the seven-phase consensus.
3. A multi-layer diff engine producing text, table, and visual diffs in a unified annotated PDF report.

The remainder of this paper is organized as follows. Section [2](https://arxiv.org/html/2604.19770#S2) surveys related work. Section [3](https://arxiv.org/html/2604.19770#S3) provides a system overview. Section [4](https://arxiv.org/html/2604.19770#S4) details the algorithms. Section [5](https://arxiv.org/html/2604.19770#S5) presents evaluation results. Section [6](https://arxiv.org/html/2604.19770#S6) concludes.
## 2 Related Work
### 2.1 PDF Text Extraction
Extracting structured text from PDF documents is a prerequisite for any automated document analysis pipeline. Several libraries have been developed for this purpose. PDFMiner [[10](https://arxiv.org/html/2604.19770#bib.bib13)] is a pure-Python tool that reconstructs character positions and page layout from the low-level PDF content stream, enabling extraction of bounding boxes alongside text. pdfplumber [[11](https://arxiv.org/html/2604.19770#bib.bib12)], built on top of PDFMiner, provides a higher-level API that exposes per-character position data, rectangle geometry, and table detection heuristics. PyMuPDF [[1](https://arxiv.org/html/2604.19770#bib.bib4)] wraps the MuPDF rendering engine and offers both text extraction and high-fidelity rasterization in a single library, making it well suited for workflows that require both text analysis and visual rendering. Apache PDFBox [[14](https://arxiv.org/html/2604.19770#bib.bib14)] provides similar capabilities in the Java ecosystem and is widely used in enterprise document processing pipelines.
While these tools handle general PDF text extraction competently, none of them addresses the domain-specific page structure found in Japanese building permit document sets. Building permit submissions (kenchiku kakunin shinsei tosho) are organized across multiple independently paginated volumes (architectural drawings, structural calculations, energy performance reports, and equipment plans), each with its own internal header structure and drawing-number taxonomy. Extraction strategies optimized for general documents produce unreliable results when applied to this domain without additional structural awareness. The proposed method incorporates drawing-number normalization into the extraction layer to improve cross-revision matching accuracy.
### 2.2 Document Layout Analysis
Beyond raw text extraction, understanding the spatial layout of document pages enables richer structural analysis. LayoutParser [[9](https://arxiv.org/html/2604.19770#bib.bib9)] provides a unified framework for deep-learning-based document image analysis, including pre-trained models for layout detection, text region segmentation, and OCR integration. It has been applied successfully to historical documents, scientific papers, and mixed-media archives. Tesseract [[12](https://arxiv.org/html/2604.19770#bib.bib10)], one of the most widely used open-source OCR engines, supports Japanese script and can be combined with layout analysis to process scanned documents.
These tools are designed for the challenges of general document understanding and focus on recognizing semantic regions such as titles, body text, figures, and tables within a single page. They do not address the revision-tracking problem: determining which page in a revised document corresponds to which page in the original, especially when pages have been inserted, deleted, or reordered between submission cycles. Our work treats page layout signals as one input to a multi-phase alignment pipeline rather than as an end in itself.
### 2.3 Document Sequence Alignment
Sequence alignment is a foundational problem in bioinformatics that has found broad application in text and document comparison. The Needleman-Wunsch algorithm [[7](https://arxiv.org/html/2604.19770#bib.bib1)] computes an optimal global alignment between two sequences under a configurable scoring scheme, allowing mismatches, insertions, and deletions. The Longest Common Subsequence (LCS) problem, discussed extensively by Cormen et al. [[3](https://arxiv.org/html/2604.19770#bib.bib8)], provides a related formulation that identifies the maximum set of elements common to two sequences while preserving their relative order. Python's standard difflib library [[8](https://arxiv.org/html/2604.19770#bib.bib3)] implements a variant of the LCS algorithm and is commonly used for line-level text diffing.
We adapt the global alignment paradigm of Needleman-Wunsch to the problem of page-level matching in building permit documents. Our key contribution is a domain-specific scoring function that combines textual similarity, drawing-number agreement, and visual hash distance into a single alignment score. This scoring reflects the structural conventions of Japanese building permit submissions; for example, a drawing-number match is a strong indicator of page correspondence even when the textual content has been substantially revised. A pure LCS formulation, which requires identical elements, cannot express such graduated similarity; our approach fills this gap.
### 2.4 Perceptual Image Hashing
Perceptual hashing algorithms map images to compact binary fingerprints such that visually similar images produce fingerprints with low Hamming distance, even under minor transformations such as resizing, compression, or small edits. Zauner [[15](https://arxiv.org/html/2604.19770#bib.bib2)] provides a comprehensive survey and benchmark of perceptual hash functions, including the DCT-based pHash algorithm that underpins our approach.
We employ pHash as a visual similarity signal in Phase 7.5 of our matching pipeline. This phase is activated when earlier text-based phases fail to find a confident match, typically for pages that consist primarily of graphical content such as floor plans or structural diagrams, where extractable text is sparse or absent. By rendering each page to a raster image at 18 DPI and computing its pHash fingerprint, we obtain a lightweight visual similarity score that can disambiguate graphically similar pages. Phase 7.5 operates on a candidate set already narrowed by the preceding text phases, so the quadratic pairwise comparison cost is bounded in practice.
### 2.5 Document Change Detection
The problem of detecting changes between two versions of a document has been studied across several domains. In software engineering, ChangeDistiller [[4](https://arxiv.org/html/2604.19770#bib.bib11)] performs fine-grained change extraction by computing a tree edit distance between abstract syntax trees of successive source code revisions. This approach captures structural changes that line-level diff tools miss, such as method moves and block reorderings. In the web domain, change detection systems monitor HTML content for updates, using both textual and visual signals to identify modified regions. Academic preprint servers and version control systems for office documents apply similar ideas at the document level.
A critical assumption shared by most existing approaches is that the document has a fixed, stable structure between versions. ChangeDistiller assumes syntactically valid source code. Web page detectors assume a stable DOM structure. General-purpose PDF diff tools such as DiffPDF [[13](https://arxiv.org/html/2604.19770#bib.bib6)] assume that page $i$ in the old document corresponds to page $i$ in the new document. This assumption is violated routinely in Japanese building permit document sets: plan revisions frequently insert new drawing sheets, remove obsolete pages, or reorder sections in response to reviewer feedback. To our knowledge, the present work is the first to address the specific challenges of automated change detection in Japanese building permit document revision tracking, where page-level reordering is the norm rather than the exception.
### 2.6 Legal and Technical Document Processing
Automated processing of legal and technical documents has attracted growing interest in the NLP community. ContractNLI [[5](https://arxiv.org/html/2604.19770#bib.bib15)] introduced a document-level natural language inference dataset for contracts, demonstrating that regulatory compliance checking can be framed as a textual entailment problem. Subsequent work has applied large language models to clause extraction, obligation detection, and cross-document consistency checking in legal corpora.
Japanese building permit documents present a distinct set of challenges relative to general legal texts. First, the documents are multi-volume submissions in which regulatory cross-references span volumes; for example, a structural calculation volume may invoke load values defined in the architectural drawing volume. Second, compliance is determined not against free-form contract language but against a precisely structured body of statute, government ordinance, and ministerial notification [[6](https://arxiv.org/html/2604.19770#bib.bib7)], each of which carries its own article and paragraph numbering. Third, the documents are produced using CAD and drawing management software, and the resulting PDFs often contain text that was rendered as vector paths rather than embedded as searchable characters, making OCR a practical necessity for scanned submissions. This paper focuses on page matching and diff detection for such document sets.
### 2.7 Comparison with Existing Approaches
Table [1](https://arxiv.org/html/2604.19770#S2.T1) summarizes the capabilities of representative document comparison approaches relative to our system. Existing tools address subsets of the problem: DiffPDF provides visual and text diff but requires fixed page order; ChangeDistiller handles structural reordering but targets source code, not PDF documents; LayoutParser provides powerful layout analysis but does not perform cross-version page alignment. Our system is the first to integrate page reordering detection, visual difference highlighting, domain-specific scoring, and regulatory link generation into a unified pipeline for Japanese building permit documents.
Table 1: Comparison of document comparison approaches. ✓ = supported, – = not supported.
## 3 System Overview
The proposed method operates in two stages: *page correspondence estimation* (Sections [4.2](https://arxiv.org/html/2604.19770#S4.SS2)–[4.4](https://arxiv.org/html/2604.19770#S4.SS4)) and *diff computation* (Section [4.6](https://arxiv.org/html/2604.19770#S4.SS6)). Given an old PDF $D_o$ and a new PDF $D_n$, the output is a set of matched page pairs $M$, a set of inserted pages $U_N$, a set of deleted pages $U_O$, and an annotated diff report.
Figure 1: Processing pipeline for PDF revision comparison. Both old and new PDFs feed into fingerprint extraction, followed by multi-phase matching, DP alignment, multi-layer diff, and report generation.
### 3.1 Processing Pipeline
Figure [1](https://arxiv.org/html/2604.19770#S3.F1) illustrates the processing pipeline. Processing proceeds as follows:
1. Fingerprint extraction: Each page is fingerprinted with a content hash, drawing number, section title, and pHash value.
2. Multi-phase matching: The seven-phase pipeline produces candidate page correspondences.
3. DP alignment: A dynamic programming stage resolves conflicts and optimizes global alignment.
4. Multi-layer diff: Text, table, and visual diffs are computed for each matched page pair.
5. Report generation: A PDF report with highlighted differences and jump links is produced.
## 4 Algorithm
### 4.1 Page Fingerprinting
Each page $p$ is represented by a fingerprint tuple:

$$F(p) = (h_{\text{content}},\; n_{\text{drawing}},\; t_{\text{section}},\; \phi_{\text{phash}})$$

Content hash $h_{\text{content}}$: MD5 hash of normalized page text (whitespace collapsed and lowercased). Returns the empty string for pages with fewer than 50 characters of extracted text (blank pages, pure graphics, or scanned images with no recognized text layer).
Drawing number $n_{\text{drawing}}$: Extracted via regular expressions matching Japanese architectural drawing number conventions (e.g., A-01, S-03, KO-1).
Section title $t_{\text{section}}$: The first substantive text line on the page, used for structural calculation documents.
Perceptual hash $\phi_{\text{phash}}$: Computed only for *text-sparse* pages (fewer than 200 extracted characters). Pages are rendered at 18 DPI to a 32×32 pixel grayscale image, and a 63-bit DCT-based pHash [[15](https://arxiv.org/html/2604.19770#bib.bib2)] (DC component excluded) is derived. Similarity between two hashes is:

$$\text{sim}_{\text{phash}}(p,q) = 1 - \frac{\text{popcount}(\phi(p) \oplus \phi(q))}{63}$$
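A minimal sketch of these fingerprint components in plain Python. The 50-character threshold, MD5 normalization, and 63-bit Hamming similarity follow the paper; the function names and the drawing-number regex are illustrative (the authors' exact patterns are not published):

```python
import hashlib
import re

def content_hash(text: str) -> str:
    """MD5 of normalized page text; empty string for near-blank pages."""
    normalized = " ".join(text.split()).lower()   # collapse whitespace, lowercase
    if len(normalized) < 50:                      # blank / graphics-only page
        return ""
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Illustrative pattern for conventions like A-01, S-03, KO-1.
DRAWING_NO = re.compile(r"\b([A-Z]{1,2})-?(\d{1,3})\b")

def drawing_number(text: str) -> str:
    """First drawing number on the page, normalized to LETTERS-DIGITS form."""
    m = DRAWING_NO.search(text)
    return f"{m.group(1)}-{m.group(2)}" if m else ""

def phash_similarity(phi_p: int, phi_q: int) -> float:
    """Hamming-based similarity between two 63-bit pHash fingerprints."""
    differing = bin(phi_p ^ phi_q).count("1")
    return 1.0 - differing / 63.0
```

With these definitions, two page renders whose hashes differ in 29 of 63 bits score $1 - 29/63 \approx 0.54$, above the Phase 7.5 acceptance threshold of 0.45.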
### 4.2 LCS Structural Alignment
An initial alignment is computed using Python's difflib.SequenceMatcher [[8](https://arxiv.org/html/2604.19770#bib.bib3)] on the sequences of content hashes $(h_{\text{content}}(o_1), \ldots, h_{\text{content}}(o_m))$ and $(h_{\text{content}}(n_1), \ldots, h_{\text{content}}(n_n))$. SequenceMatcher implements Ratcliff/Obershelp matching and identifies equal, insert, delete, and replace blocks. Pages in equal blocks are accepted as matched immediately. Pages in replace blocks are forwarded to the seven-phase pipeline.
The text similarity score used in Phase 5 is defined as:

$$\text{sim}_{\text{text}}(o,n) = \frac{2M}{T}$$

where $M$ is the total number of matching characters in the longest common block decomposition and $T$ is the total number of characters in both sequences, as computed by SequenceMatcher.ratio().
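The LCS stage can be sketched directly with the standard library. The short hash strings below are placeholders for real MD5 digests; the block-routing logic (equal blocks accepted, replace blocks forwarded) follows the paper:

```python
from difflib import SequenceMatcher

# One content hash per page; "hX" simulates a revised page, "h6" an insertion.
old_hashes = ["h1", "h2", "h3", "h4", "h5"]
new_hashes = ["h1", "hX", "h3", "h4", "h5", "h6"]

sm = SequenceMatcher(a=old_hashes, b=new_hashes, autojunk=False)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == "equal":       # accepted as matched immediately
        print("matched:", list(zip(range(i1, i2), range(j1, j2))))
    elif tag == "replace":   # forwarded to the seven-phase pipeline
        print("to 7-phase: old", list(range(i1, i2)), "new", list(range(j1, j2)))
    elif tag == "delete":
        print("deleted old:", list(range(i1, i2)))
    elif tag == "insert":
        print("inserted new:", list(range(j1, j2)))
```

`SequenceMatcher.ratio()` on two character sequences gives exactly the $2M/T$ score used in Phase 5.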
### 4.3 Seven-Phase Matching Pipeline
Within each replace block, seven matching signals are evaluated in order. Candidate pairs accumulate *votes*; a pair is accepted if at least one high-confidence signal matches.

Algorithm 1: Seven-Phase Page Matching
Input: old page set $O_b$ and new page set $N_b$ within a replace block
Output: match set $M$, unmatched old pages $U_O$, unmatched new pages $U_N$
1. Phase 1 (exact content hash): accept $(o,n)$ if $h(o) = h(n) \neq \varepsilon$; confidence $= 1.0$.
2. Phase 2 (drawing number exact match): accept $(o,n)$ if $n_{\text{draw}}(o) = n_{\text{draw}}(n) \neq \varepsilon$; confidence $= 0.9$.
3. Phase 3 (section title match): accept $(o,n)$ if $t_{\text{sec}}(o) = t_{\text{sec}}(n) \neq \varepsilon$; confidence $= 0.8$.
4. Phase 4 (adaptive page-shift detection): for each $\delta \in [-\lfloor m/2 \rfloor, \lfloor n/2 \rfloor]$, count pairs $(o_i, n_{i+\delta})$ with $\text{sim}_{\text{text}}(o_i, n_{i+\delta}) \geq \tau_s$; adopt $\delta$ if the match count exceeds a threshold fraction of $\min(|O_b|, |N_b|)$; confidence $= 0.85$.
5. Phase 5 (text similarity): accept $(o,n)$ if $\text{sim}_{\text{text}}(o,n) \geq \tau_s = 0.5$; confidence $\leq 0.85$.
6. Phase 6 (position-based interpolation): for unmatched pages within distance $d \leq 3$, compute $\text{sim}_{\text{adj}} = \text{sim} \times (1 - 0.1 \cdot d)$; accept if $\geq 0.3$.
7. Phase 7 (classify residuals): remaining unmatched old pages $\to U_O$ (deleted); remaining unmatched new pages $\to U_N$ (inserted).
8. Phase 7.5 (visual rematch via pHash): for $(o,n) \in U_O \times U_N$, if $\text{sim}_{\text{phash}}(o,n) \geq 0.45$, accept and reclassify as ContentSimilar. Return $M$.
The pHash acceptance threshold of 0.45 admits at most $\lfloor 63 \times (1 - 0.45) \rfloor = 34$ differing bits, i.e., at least 29 of the 63 hash bits must agree.
Pairs matched in Phases 1–3 (confidence $\geq 0.8$) are accepted directly and excluded from the DP stage. Pairs from Phases 4–6 are forwarded to the DP alignment with their seven-phase confidence score as an initial upper bound on the DP score.
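The high-confidence phases (1–3) amount to greedy matching on exact field equality with phase-specific confidence scores. A minimal sketch, assuming per-page fingerprint dicts; the field names `hash`, `drawing`, and `title` are our own labels, not the authors' API:

```python
def match_high_confidence(old_pages, new_pages):
    """Greedy Phases 1-3: exact field equality, one match per page.

    Empty fields (the epsilon case in Algorithm 1) never match.
    """
    phases = [("hash", 1.0), ("drawing", 0.9), ("title", 0.8)]
    matches, used_new = [], set()
    for field, conf in phases:
        for o in old_pages:
            if any(m[0] is o for m in matches):   # old page already matched
                continue
            for n in new_pages:
                if id(n) in used_new:             # new page already matched
                    continue
                if o[field] and o[field] == n[field]:
                    matches.append((o, n, conf))
                    used_new.add(id(n))
                    break
    return matches

old = [{"hash": "abc", "drawing": "A-01", "title": ""},
       {"hash": "",    "drawing": "S-03", "title": ""}]
new = [{"hash": "",    "drawing": "S-03", "title": ""},
       {"hash": "abc", "drawing": "",     "title": ""}]
result = match_high_confidence(old, new)
# First old page matches by content hash (conf 1.0),
# second by drawing number (conf 0.9).
```

Pages left over after these phases would continue into the similarity-based Phases 4–6 and, finally, the pHash rematch.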
### 4.4 Dynamic Programming Optimal Alignment
The seven-phase consensus may yield conflicting candidates within replace blocks. A Needleman-Wunsch style DP [[7](https://arxiv.org/html/2604.19770#bib.bib1), [3](https://arxiv.org/html/2604.19770#bib.bib8)] resolves conflicts by maximizing global alignment score.
Pair score. For old page $o_i$ and new page $n_j$, the base similarity $s_b$ fuses text similarity $s_t$ and pHash visual similarity $s_v$:

$$s_b = \begin{cases} 0.40\, s_t + 0.60\, s_v & s_v \text{ available} \\ s_t & \text{otherwise} \end{cases}$$

The pair score is:

$$\begin{aligned}
\mathrm{score}(o_i, n_j) ={}& 0.55\, s_b + 0.20\, s_{\text{len}} + 0.15\, p_{\text{pos}} \\
&+ 0.50 \cdot \mathbb{1}[h(o_i) = h(n_j)] \\
&+ 0.35 \cdot \mathbb{1}[n_d(o_i) = n_d(n_j)] \\
&+ 0.10 \cdot \mathbb{1}[n_d \text{ substring match}] \\
&+ 0.20 \cdot \mathbb{1}[t_s(o_i) = t_s(n_j)]
\end{aligned} \tag{1}$$

where $s_{\text{len}}(o_i, n_j) = \min(|o_i|, |n_j|) / \max(|o_i|, |n_j|)$ is the text-length ratio ($|p|$ denotes character count), $p_{\text{pos}}(i,j) = 1 - |i/m - j/n|$ is the positional score, $h(\cdot)$ is the content hash, $n_d(\cdot)$ the drawing number, and $t_s(\cdot)$ the section title. The weights (0.55, 0.20, 0.15, etc.) are heuristically tuned; a formal sensitivity analysis is left for future work.
DP recurrence. Let $g = -0.42$ be the gap penalty (set empirically).

$$D[i][j] = \max \begin{cases} D[i-1][j-1] + \mathrm{score}(o_i, n_j) \\ D[i-1][j] + g \\ D[i][j-1] + g \end{cases}$$

The three cases correspond to aligning $o_i$ with $n_j$, inserting a gap in the new sequence, and inserting a gap in the old sequence. Aligned pairs with $\mathrm{score} \geq 0.28$ are classified as ContentSimilar; pairs below this threshold are classified as PositionMatch with confidence capped at 0.60.
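The recurrence can be sketched as a standard Needleman-Wunsch table with traceback. The gap penalty $g = -0.42$ follows the paper; the pair-scoring function is abstracted as a callable, and the toy similarity matrix below is our own:

```python
def align(m, n, score, gap=-0.42):
    """Global alignment of m old pages against n new pages.

    D[i][j] = best score aligning the first i old pages with the
    first j new pages; gaps model page deletions/insertions.
    """
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + gap
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j - 1] + score(i - 1, j - 1),
                          D[i - 1][j] + gap,    # gap in new: old page deleted
                          D[i][j - 1] + gap)    # gap in old: new page inserted
    # Traceback to recover the aligned page pairs.
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if D[i][j] == D[i - 1][j - 1] + score(i - 1, j - 1):
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif D[i][j] == D[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

# Toy case: new page 1 is an insertion between old pages 0 and 1.
S = [[0.9, 0.1, 0.1],
     [0.1, 0.1, 0.8]]
print(align(2, 3, lambda i, j: S[i][j]))   # → [(0, 0), (1, 2)]
```

The inserted page (new index 1) is skipped via a gap rather than force-matched, which is exactly the behavior the sequential baseline in Section 5 lacks.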
### 4.5 Consensus Integration
Final page correspondence is determined by a three-step consensus:
1. LCS equal-block pairs are accepted unconditionally (confidence 1.0).
2. Seven-phase pairs within replace blocks are accepted if non-conflicting with LCS pairs.
3. DP alignment resolves remaining conflicts within replace blocks.

Unmatched old pages are classified as Deleted; unmatched new pages as Inserted.
### 4.6 Multi-Layer Diff
For each matched pair $(o_i, n_j)$, three diff layers are computed:
Text diff: Character-level diff using difflib unified diff on up to 5000 characters of extracted text per page. Added/deleted spans are annotated with color highlights.
Table diff: Structured table cells extracted via pdfplumber are compared cell-by-cell; changed cells are highlighted in red.
Visual diff: Pages are rendered at 150 DPI and compared using OpenCV [[2](https://arxiv.org/html/2604.19770#bib.bib5)] pixel differencing with morphological noise reduction (dilation + erosion). Difference regions are bounded by rectangles overlaid on the rendered page image.
The three layers are composited into a side-by-side annotated diff view in the output PDF report, with jump-link annotations for navigation.
Figure 2: The three diff layers computed for each matched page pair. (a) Text diff: deleted lines highlighted red, added lines green, via difflib unified diff. (b) Table diff: changed cells highlighted by cell-level comparison via pdfplumber. (c) Visual diff: OpenCV pixel-difference regions bounded by rectangles.
### 4.7 Patch Mode
A lightweight *patch mode* is available for incremental updates where most pages are unchanged. Drawing-number-matched pairs are accepted with confidence 0.95. Orphan pages (unmatched old pages) are retained in the report rather than marked as deleted, reducing false alarms on partial-update submissions.
## 5 Evaluation
### 5.1 Dataset
We evaluated our system on two Japanese structural calculation PDF document pairs produced by a commercial structural analysis tool (4-storey timber-frame residential building).
Pair 1 (sample extract): A 9-page excerpt (old revision) paired with the corresponding 10-page revised excerpt. The revision inserted one page (a structural plan index) at position 3, leaving the remaining 9 pages content-identical to the original. We used this pair for quantitative accuracy evaluation with a manually annotated ground-truth file (gt_test_keisansho.json).
Pair 2 (full report): A 90-page complete structural calculation report. We used this pair for performance profiling (self-comparison, to measure pure algorithmic cost).
### 5.2 Page Matching Accuracy
We compare four variants on Pair 1 against the ground-truth mapping of 9 matched pairs, 1 inserted page, and 0 deleted pages (Table [2](https://arxiv.org/html/2604.19770#S5.T2)).
Table 2: Page matching accuracy on the 9-page/10-page document pair.
The sequential baseline assigns each old page $i$ to new page $i$, which fails catastrophically once a page is inserted (F1 = 0.22, only TP = 2). In contrast, our content-hash phase immediately identifies the two pages before the insertion point (indices 0 and 1) and the four text-bearing pages after it via exact hash matching, yielding perfect precision (no false matches).
The recall ceiling of 0.67 (6 of 9 ground-truth matches recovered) is caused by three blank pages (page indices 4–6 in the old revision) whose text length falls below the 50-character threshold used by `_compute_content_hash`, which returns an empty hash by design (see Section [4](https://arxiv.org/html/2604.19770#S4)). Text-similarity matching likewise produces no signal for empty pages. If blank pages are excluded from evaluation, all three non-sequential variants achieve F1 = 1.00 on the textually non-trivial portion of the dataset. Figure [3](https://arxiv.org/html/2604.19770#S5.F3) illustrates the complete alignment.
The LCS-only, seven-phase-only, and full pipeline variants produce identical scores on this dataset because the document revision consists of a single clean page insertion with no content modifications. The additional phases (drawing-number lookup, section-level grouping, adaptive page-shift, and DP-based position recovery) provide value for noisier inputs such as documents with partial content edits, renumbered drawings, or multiple simultaneous insertions and deletions.
Figure 3: Page alignment result for Pair 1 (9-page old revision $O_0$–$O_8$, with $O_4$–$O_6$ blank; 10-page new revision $N_0$–$N_9$, with $N_5$–$N_7$ blank). Solid arrows: matches recovered by the system (6 of 9 ground-truth pairs). Dashed arrows: ground-truth pairs not recovered due to blank pages (<50 extracted characters). The red-shaded $N_2$ is the inserted page, correctly identified with no false match.
### 5.3 Diff Detection Quality
The full pipeline correctly identifies the inserted page (NEW[2]) and emits zero false-positive matched pairs ($\text{FP} = 0$). Diff detection for the six matched text-bearing pages yielded no spurious change annotations, since those pages are byte-for-byte identical in the two revisions, confirming that the text-extraction and hashing pipeline introduces no noise on clean inputs.
A qualitative review of the 90-page Pair 2 (self-comparison) confirmed that all 90 pages were matched exactly, with no false additions or deletions, as expected.
### 5.4 Performance
Table [3](https://arxiv.org/html/2604.19770#S5.T3) reports end-to-end wall-clock time on a laptop running Windows 11 (Intel Core i7, CPU only, no GPU). Text extraction via pdfplumber dominates the total time; the page-matching algorithm itself is negligible by comparison.
Table 3: Processing time. Hardware: Intel Core i7, CPU only.
The matching algorithm scales sub-quadratically in practice: the 10× increase in page count (9 → 90) results in only a 7× increase in algorithm time (2.8 ms → 19 ms), because the early hash-match phases eliminate most candidate pairs before the expensive similarity computation. Text extraction time is dominated by PDF rendering overhead and is largely independent of revision complexity.
## 6 Conclusion
We have presented a system for automated revision comparison of Japanese building permit document sets. The core contribution is a hybrid multi-phase page matching pipeline that combines LCS structural alignment, a seven-phase consensus matching algorithm, perceptual hash visual rematch, and dynamic programming optimal alignment to robustly handle page correspondence under arbitrary insertions, deletions, and reorderings. A multi-layer diff engine then produces text, table, and visual difference reports.
Future work includes:
- Quantitative evaluation on a larger anonymized dataset of real permit submissions.
- LLM-assisted semantic change classification (e.g., distinguishing editorial from structural design changes).
- Extension to other document domains with similar revision tracking requirements (e.g., environmental impact assessments, fire safety documentation).
## References
- [1] Artifex Software (2024). PyMuPDF: Python bindings for MuPDF. https://pymupdf.readthedocs.io/
- [2] G. Bradski (2000). The OpenCV library. Dr. Dobb's Journal of Software Tools, 25(11), pp. 120–125.
- [3] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein (2009). Introduction to Algorithms. 3rd edition, MIT Press.
- [4] B. Fluri, M. Würsch, M. Pinzger, and H. C. Gall (2007). Change distilling: tree differencing for fine-grained source code change extraction. IEEE Transactions on Software Engineering, 33(11), pp. 725–743.
- [5] Y. Koreeda and C. D. Manning (2021). ContractNLI: a dataset for document-level natural language inference for contracts. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 1313–1327.
- [6] Ministry of Land, Infrastructure, Transport and Tourism (2023). Building Standards Act (Kenchiku Kijun-hō). Act No. 201 of 1950, as amended. https://www.mlit.go.jp/jutakukentiku/build/
- [7] S. B. Needleman and C. D. Wunsch (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), pp. 443–453.
- [8] Python Software Foundation (2024). difflib: helpers for computing deltas. Python 3 Standard Library. https://docs.python.org/3/library/difflib.html
- [9] Z. Shen, R. Zhang, M. Dell, B. C. G. Lee, J. Carlson, and W. Li (2021). LayoutParser: a unified toolkit for deep learning based document image analysis. In International Conference on Document Analysis and Recognition, pp. 131–146.
- [10] Y. Shinyama (2020). PDFMiner: Python PDF parser and analyzer. https://github.com/pdfminer/pdfminer.six
- [11] J. Singer-Vine (2024). pdfplumber: plumb a PDF for detailed information about each text character, rectangle, and line. https://github.com/jsvine/pdfplumber
- [12] R. Smith (2007). An overview of the Tesseract OCR engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 2, pp. 629–633.
- [13] M. Summerfield (2012). DiffPDF: compare PDF files. http://www.qtrac.eu/diffpdf.html
- [14] The Apache Software Foundation (2023). Apache PDFBox: a Java PDF library. https://pdfbox.apache.org/
- [15] C. Zauner (2010). Implementation and benchmarking of perceptual image hash functions. Ph.D. thesis, Upper Austria University of Applied Sciences, Hagenberg Campus.