@llama_index: Automate a loan underwriting pipeline in just a few lines of code A typical loan file is a stack of pay stubs and broke…
Summary
LlamaIndex demonstrates how to automate a loan underwriting pipeline using LlamaParse to extract structured data from financial PDFs, with cross-document analysis and human-in-the-loop review.
Cached at: 05/27/26, 03:01 AM
Automate a loan underwriting pipeline in just a few lines of code
A typical loan file is a stack of pay stubs and brokerage statements, every one formatted differently, every number re-typed by hand.
Here’s a pipeline that does it automatically with LlamaParse: PDFs to clean markdown, fields into Pydantic models, then cross-document analysis that produces an underwriting summary with discrepancy flags.
Full post and repo: https://llamaindex.ai/blog/building-a-financial-document-pipeline-with-llamaparse…
Build a Loan Underwriting Pipeline with LlamaParse
Source: https://www.llamaindex.ai/blog/building-a-financial-document-pipeline-with-llamaparse
Loan underwriting requires pulling data from multiple financial documents, often pay stubs and brokerage statements, whose complex layouts vary widely across providers. It is a key financial workflow that often involves heavy manual checks and repetitive processes.
Last week I (Logan, Head of OSS at LlamaIndex) ran a hands-on workshop in NYC where developers built a loan underwriting pipeline from scratch using LlamaParse tools. The resulting application was able to take messy financial PDFs (pay stubs, brokerage statements), extract structured data, and run cross-document analysis.
We wanted to build a pipeline that:
- Parses PDFs into clean markdown using LlamaParse’s agentic tier
- Extracts structured fields (employer name, gross pay, holdings, account values) into typed Pydantic models
- Analyzes data across documents to produce an underwriting summary with discrepancy flags
- Reviews the analysis with a human-in-the-loop approval step
This post walks through what we built and how you can try it yourself.
The Workshop Tech Stack
The stack is intentionally simple for a workshop setting. Using a combination of async Python, SQLite, FastAPI, Pydantic, and the LlamaCloud SDK, we built a fully async pipeline with in-memory job queues.
While the tech stack is simple, the architecture is designed to be extensible. You (or your coding agent) can easily swap out components as needed. This means swapping in Celery or Temporal for job queues, Postgres for the database, and S3 instead of local file storage.
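To give a feel for what an in-memory job queue in async Python can look like, here is a minimal sketch built on asyncio.Queue. It is not taken from the workshop repo: the names handle_job, worker, and run_jobs are hypothetical, and handle_job is a stand-in for real work such as parsing or extraction.

```python
import asyncio


async def handle_job(job_id: str) -> str:
    """Hypothetical stand-in for real work such as parsing or extraction."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"{job_id}:done"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Pull job IDs off the queue until the task is cancelled."""
    while True:
        job_id = await queue.get()
        try:
            results[job_id] = await handle_job(job_id)
        finally:
            queue.task_done()  # always mark the job finished so join() can return


async def run_jobs(job_ids: list, concurrency: int = 2) -> dict:
    """Process all job IDs with a fixed number of concurrent workers."""
    queue: asyncio.Queue = asyncio.Queue()
    results: dict = {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    for job_id in job_ids:
        queue.put_nowait(job_id)
    await queue.join()  # wait until every queued job has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Because the queue lives in process memory, jobs are lost on restart, which is one reason a durable queue such as Celery or Temporal is the natural swap for production.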
LlamaParse Three Ways
The workshop had attendees implement three service files, each using LlamaParse in different ways.
1. Parsing: PDF to Markdown
The first service uploads a PDF and gets back clean markdown. This is LlamaParse’s core capability: its agentic parsing tier handles the messy table layouts and formatting inconsistencies across payroll providers and brokerages.
import asyncio

from llama_cloud import AsyncLlamaCloud

# settings is defined elsewhere in the app and holds the API key
client = AsyncLlamaCloud(api_key=settings.llama_cloud_api_key)

# Upload the PDF, then start an agentic-tier parse job
file_obj = await client.files.create(file=file_path, purpose="parse")
job = await client.parsing.create(file_id=file_obj.id, tier="agentic", version="latest")

# Poll until complete
result = await client.parsing.get(job.id, expand=["markdown_full"])
while result.job.status not in ("COMPLETED", "FAILED", "CANCELLED"):
    await asyncio.sleep(3)
    result = await client.parsing.get(job.id, expand=["markdown_full"])

parsed_markdown = result.markdown_full
Three API calls: upload the file, create a job, poll for the result. The markdown that comes back preserves table structure, which is critical for the next step.
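The polling loop above runs until the job reaches a terminal state, but it has no timeout and does not distinguish success from failure. One way to handle both is a small reusable helper, sketched below. The helper name poll_until_done and its parameters are hypothetical, and it is written against injected callables rather than the LlamaCloud client so that it makes no assumptions about the SDK beyond the three status strings shown above.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable

# Terminal statuses as they appear in the polling loop above
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED"}


async def poll_until_done(
    fetch: Callable[[], Awaitable[Any]],
    get_status: Callable[[Any], str],
    interval: float = 3.0,
    timeout: float = 300.0,
) -> Any:
    """Call `fetch` until the job completes; raise on failure or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        result = await fetch()
        status = get_status(result)
        if status == "COMPLETED":
            return result
        if status in TERMINAL_STATUSES:
            # FAILED or CANCELLED: surface it instead of returning a bad result
            raise RuntimeError(f"job ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout}s")
        await asyncio.sleep(interval)
```

With the client from the snippet above, this could be called as `await poll_until_done(lambda: client.parsing.get(job.id, expand=["markdown_full"]), lambda r: r.job.status)`.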
2. Structured Extraction
The second service takes a parsed document and extracts typed fields using a Pydantic schema. You define what you want, and LlamaParse pulls it out.
For example, the pay stub schema:
from pydantic import BaseModel

# Deduction is another Pydantic model, not shown in this post
class PayStub(BaseModel):
    employer_name: str
    employee_name: str
    pay_period_start: str
    pay_period_end: str
    gross_pay: float
    net_pay: float
    ytd_gross_income: float
    deductions: list[Deduction]
The extraction call passes the schema as JSON Schema to LlamaParse:
job = await client.extract.create(
    file_input=file_obj.id,
    configuration={
        "tier": "agentic",
        "data_schema": PayStub.model_json_schema(),
    }
)
Similar to parsing, you upload a file, use the file ID, and then poll for the result.
Once the job completes, you can validate the result against your schema using PayStub.model_validate(result.extract_result).
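As a rough sketch of that validation step, the snippet below wraps model_validate so that a payload that does not fit the schema is reported rather than crashing the pipeline. It assumes Pydantic v2. The Deduction model here is a guess at a minimal shape (the post does not show it), and validate_pay_stub is a hypothetical helper name.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Deduction(BaseModel):
    # Assumed shape; the real model is not shown in the post
    name: str
    amount: float


class PayStub(BaseModel):
    employer_name: str
    employee_name: str
    pay_period_start: str
    pay_period_end: str
    gross_pay: float
    net_pay: float
    ytd_gross_income: float
    deductions: list[Deduction]


def validate_pay_stub(raw: dict) -> Optional[PayStub]:
    """Return a typed PayStub, or None if the payload does not fit the schema."""
    try:
        return PayStub.model_validate(raw)
    except ValidationError as exc:
        # exc.errors() lists each offending field with its location and message
        for err in exc.errors():
            print(err["loc"], err["msg"])
        return None
```

Returning None keeps the example short; in a real pipeline you would more likely record the errors on the job so a reviewer can see which fields failed.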
3. Cross-Document Analysis
The third service is the most interesting. It takes the extracted data from multiple documents (say, a pay stub and a brokerage statement), combines it into a text buffer, uploads that buffer to LlamaParse, and runs extraction again. This time, though, it uses an underwriting summary schema that looks across all documents and calls for more reasoning than pure extraction.
import io

# Combine extracted data into a text document
text = _format_extractions_as_text(extracted_data)

# Upload as a buffer file
file_obj = await client.files.create(
    file=(f"review_{review_id}.txt", io.BytesIO(text.encode("utf-8"))),
    purpose="extract",
)

# Extract with the cross-document schema
job = await client.extract.create(
    file_input=file_obj.id,
    configuration={
        "tier": "agentic",
        "data_schema": UnderwritingSummary.model_json_schema(),
        "system_prompt": "You are a loan underwriter. Analyze ...",
    }
)
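The post does not show the body of _format_extractions_as_text, so here is one plausible way such a helper could work: render each document's extracted fields as a labeled JSON section. This is a sketch under that assumption, not the repo's implementation, and the sample field values are made up for illustration.

```python
import io
import json


def format_extractions_as_text(extracted_data: dict) -> str:
    """Render each document's extracted fields as a labeled JSON section."""
    sections = []
    for doc_name, fields in extracted_data.items():
        body = json.dumps(fields, indent=2, sort_keys=True)
        sections.append(f"=== {doc_name} ===\n{body}")
    return "\n\n".join(sections)


# Made-up sample data standing in for real extraction results
extracted_data = {
    "pay_stub": {"employer_name": "Acme Corp", "gross_pay": 2500.0},
    "brokerage_statement": {"account_value": 18250.75},
}
text = format_extractions_as_text(extracted_data)

# Same in-memory buffer shape as the upload call above
buffer = io.BytesIO(text.encode("utf-8"))
```

Labeling each section with its source document gives the extraction step something to cite when it flags a discrepancy between documents.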
The underwriting summary schema asks for verified income, total liquid assets, months of reserves, and a list of discrepancies with severity ratings. By setting the system prompt, we can explicitly prompt the service to perform the kind of analysis we want across the documents, rather than just pulling out fields. This is where business-specific knowledge can be injected into the pipeline to produce more actionable outputs.
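For reference, a schema along those lines might look like the sketch below. The field names, the three severity levels, and the field descriptions are all assumptions based on the description above, not the schema from the repo. It assumes Pydantic v2.

```python
from enum import Enum

from pydantic import BaseModel, Field


class Severity(str, Enum):
    # Assumed levels; the post only says discrepancies carry a severity rating
    low = "low"
    medium = "medium"
    high = "high"


class Discrepancy(BaseModel):
    description: str = Field(description="What disagrees between the documents")
    severity: Severity


class UnderwritingSummary(BaseModel):
    verified_annual_income: float = Field(description="Income supported by the pay stubs")
    total_liquid_assets: float
    months_of_reserves: float
    discrepancies: list[Discrepancy] = Field(default_factory=list)


# This dict is what would be passed as "data_schema" in the extract call
schema = UnderwritingSummary.model_json_schema()
```

Field descriptions end up in the JSON Schema, so they are another place, alongside the system prompt, to spell out what each value should mean.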
Try It Yourself
First, grab an API key from LlamaCloud if you don’t have one already.
The repo is set up so you can implement each service incrementally. Each phase has a branch with the TODO stubs filled in:
| Branch | What’s implemented |
| --- | --- |
| main | Starting point with 3 services to implement |
| phase_1 | Parser service (PDF to markdown) |
| phase_2 | + Extraction service (structured data) |
| phase_3 | + Review service (cross-document analysis) |
To get started:
git clone https://github.com/logan-markewich/finparse-pipeline && cd finparse-pipeline
git checkout phase_1 # start with TODO stubs
uv sync --group dev
cp .env.example .env # add your LLAMA_CLOUD_API_KEY
uv run fastapi dev app/main.py
The Swagger UI at http://localhost:8000/docs lets you drive the whole flow: upload a PDF, poll for parsing, trigger extraction, create a review.