@llama_index: Automate a loan underwriting pipeline in just a few lines of code A typical loan file is a stack of pay stubs and broke…


Summary

LlamaIndex demonstrates how to automate a loan underwriting pipeline using LlamaParse to extract structured data from financial PDFs, with cross-document analysis and human-in-the-loop review.


Cached at: 05/27/26, 03:01 AM

Automate a loan underwriting pipeline in just a few lines of code

A typical loan file is a stack of pay stubs and brokerage statements, every one formatted differently, every number re-typed by hand.

Here’s a pipeline that does it automatically with LlamaParse: PDFs to clean markdown, fields into Pydantic models, then cross-document analysis that produces an underwriting summary with discrepancy flags.

Full post and repo: https://llamaindex.ai/blog/building-a-financial-document-pipeline-with-llamaparse…


Build a Loan Underwriting Pipeline with LlamaParse

Source: https://www.llamaindex.ai/blog/building-a-financial-document-pipeline-with-llamaparse

Loan underwriting requires pulling data from multiple financial documents, often pay stubs and brokerage statements, whose complex layouts vary widely across providers. It is a key financial workflow that typically involves heavy manual checking and repetitive processing.

Last week I (Logan, Head of OSS at LlamaIndex) ran a hands-on workshop in NYC where developers built a loan underwriting pipeline from scratch using LlamaParse tools. The resulting application was able to take messy financial PDFs (pay stubs, brokerage statements), extract structured data, and run cross-document analysis.

We wanted to build a pipeline that:

  1. Parses PDFs into clean markdown using LlamaParse’s agentic tier
  2. Extracts structured fields (employer name, gross pay, holdings, account values) into typed Pydantic models
  3. Analyzes data across documents to produce an underwriting summary with discrepancy flags
  4. Reviews the analysis with a human-in-the-loop approval step

This post walks through what we built and how you can try it yourself.

The Workshop Tech Stack

The stack is intentionally simple for a workshop setting. Using a combination of async Python, SQLite, FastAPI, Pydantic, and the LlamaCloud SDK, we built a fully async pipeline with in-memory job queues.

While the tech stack is simple, the architecture is designed to be extensible. You (or your coding agent) can easily swap out components as needed. This means swapping in Celery or Temporal for job queues, Postgres for the database, and S3 instead of local file storage.
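
To make the "swap out components" idea concrete, here is a minimal sketch of an in-memory job queue hidden behind a small interface. The names `JobQueue`, `InMemoryJobQueue`, and the job IDs are hypothetical and are not taken from the workshop repo; the point is only that the rest of the app depends on the interface, so a Celery- or Temporal-backed implementation could replace the in-memory one later.

```python
import asyncio
from typing import Awaitable, Callable, Protocol


class JobQueue(Protocol):
    """Hypothetical interface; a Celery- or Temporal-backed class could implement it too."""

    async def enqueue(self, job_id: str) -> None: ...

    async def run_worker(self, handler: Callable[[str], Awaitable[None]]) -> None: ...


class InMemoryJobQueue:
    """In-process queue backed by asyncio.Queue; pending jobs are lost on restart."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue[str] = asyncio.Queue()

    async def enqueue(self, job_id: str) -> None:
        await self._queue.put(job_id)

    async def run_worker(self, handler: Callable[[str], Awaitable[None]]) -> None:
        # Runs until cancelled, handing each job ID to the handler in FIFO order.
        while True:
            job_id = await self._queue.get()
            try:
                await handler(job_id)
            finally:
                self._queue.task_done()

    async def join(self) -> None:
        # Wait until every enqueued job has been handled.
        await self._queue.join()


async def demo() -> list[str]:
    processed: list[str] = []

    async def handler(job_id: str) -> None:
        processed.append(job_id)

    queue = InMemoryJobQueue()
    worker = asyncio.create_task(queue.run_worker(handler))
    for job_id in ("parse-1", "extract-1"):
        await queue.enqueue(job_id)
    await queue.join()
    worker.cancel()
    return processed
```

Calling `asyncio.run(demo())` processes the two placeholder jobs in the order they were enqueued. A single worker keeps the sketch simple; a real deployment would likely want retries and persistence, which is exactly what the swapped-in queue systems provide.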

LlamaParse Three Ways

The workshop had attendees implement three service files, each using LlamaParse in a different way.

1. Parsing: PDF to Markdown

The first service uploads a PDF and gets back clean markdown. This is LlamaParse’s core capability: its agentic parsing tier handles the messy table layouts and formatting inconsistencies across payroll providers and brokerages.

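import asyncio
from llama_cloud import AsyncLlamaCloud

# This snippet runs inside an async function.
# `settings` is assumed to be the app's config object holding the API key.
client = AsyncLlamaCloud(api_key=settings.llama_cloud_api_key)

# Upload the PDF, then start an agentic-tier parse job for it
file_obj = await client.files.create(file=file_path, purpose="parse")
job = await client.parsing.create(file_id=file_obj.id, tier="agentic", version="latest")

# Poll until the job reaches a terminal state
result = await client.parsing.get(job.id, expand=["markdown_full"])
while result.job.status not in ("COMPLETED", "FAILED", "CANCELLED"):
    await asyncio.sleep(3)
    result = await client.parsing.get(job.id, expand=["markdown_full"])

# A FAILED or CANCELLED job has no usable markdown
if result.job.status != "COMPLETED":
    raise RuntimeError(f"Parse job {job.id} ended with status {result.job.status}")

parsed_markdown = result.markdown_full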

Three API calls: upload the file, create a job, poll for the result. The markdown that comes back preserves table structure, which is critical for the next step.
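
Since parsing and extraction both poll in the same way, the loop can be pulled into a small helper that also enforces a timeout. The sketch below is one possible shape, not code from the workshop repo: `poll_until_done` and its parameters are hypothetical names, and the demo uses a fake fetch function with placeholder status values instead of the real client so it runs on its own.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable

TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED"}


async def poll_until_done(
    fetch: Callable[[], Awaitable[Any]],
    get_status: Callable[[Any], str],
    interval: float = 3.0,
    timeout: float = 300.0,
) -> Any:
    """Call fetch() repeatedly until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        result = await fetch()
        status = get_status(result)
        if status == "COMPLETED":
            return result
        if status in TERMINAL_STATUSES:
            raise RuntimeError(f"job ended with status {status}")
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"job still {status} after {timeout}s")
        await asyncio.sleep(interval)


async def demo() -> dict:
    # Fake job that needs three polls to finish; the statuses are placeholders.
    statuses = iter(["PENDING", "RUNNING", "COMPLETED"])

    async def fake_fetch() -> dict:
        return {"status": next(statuses), "markdown_full": "# Pay stub"}

    return await poll_until_done(fake_fetch, lambda r: r["status"], interval=0.01)
```

With the client shown above, the call would presumably look like `await poll_until_done(lambda: client.parsing.get(job.id, expand=["markdown_full"]), lambda r: r.job.status)`.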

2. Extraction: Document to Structured Data

The second service takes a parsed document and extracts typed fields using a Pydantic schema. You define what you want, and LlamaParse pulls it out.

For example, the pay stub schema:

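from pydantic import BaseModel

# Deduction is referenced by PayStub but not shown in the post;
# the two fields below are an assumed, minimal shape.
class Deduction(BaseModel):
    name: str
    amount: float

class PayStub(BaseModel):
    employer_name: str
    employee_name: str
    pay_period_start: str
    pay_period_end: str
    gross_pay: float
    net_pay: float
    ytd_gross_income: float
    deductions: list[Deduction]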

The extraction call passes the schema as JSON Schema to LlamaParse:

job = await client.extract.create(
    file_input=file_obj.id,
    configuration={
        "tier": "agentic",
        "data_schema": PayStub.model_json_schema(),
    }
)

Similar to parsing, you upload a file, use the file ID, and then poll for the result.

Once the job completes, you can validate it against your schema using PayStub.model_validate(result.extract_result).

3. Cross-Document Analysis

The third service is the most interesting. It takes the extracted data from multiple documents (say, a pay stub and a brokerage statement), combines it into a text buffer, uploads that buffer to LlamaParse, and runs extraction again. This time, the schema is an underwriting summary that looks across all the documents and calls for reasoning rather than pure extraction.

import io

# Combine extracted data into a text document
# (_format_extractions_as_text is a helper that is not shown in this post)
text = _format_extractions_as_text(extracted_data)

# Upload as a buffer file
file_obj = await client.files.create(
    file=(f"review_{review_id}.txt", io.BytesIO(text.encode("utf-8"))),
    purpose="extract",
)

# Extract with the cross-document schema
job = await client.extract.create(
    file_input=file_obj.id,
    configuration={
        "tier": "agentic",
        "data_schema": UnderwritingSummary.model_json_schema(),
        "system_prompt": "You are a loan underwriter. Analyze ...",
    }
)
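
The post does not show the `_format_extractions_as_text` helper. As a rough sketch of what such a helper might do, the function below renders each document's extracted fields as a labelled JSON section. The function name (without the leading underscore), the input shape, the section delimiter, and the sample values are all assumptions made for illustration.

```python
import json


def format_extractions_as_text(extracted_data: dict[str, dict]) -> str:
    """Render {document label: extracted fields} as one labelled plain-text document."""
    sections = []
    for label, fields in extracted_data.items():
        # Pretty-printed JSON keeps field names visible to the extraction model.
        body = json.dumps(fields, indent=2, sort_keys=True)
        sections.append(f"=== {label} ===\n{body}")
    return "\n\n".join(sections)


# Made-up sample values, only to show the output shape.
text = format_extractions_as_text(
    {
        "pay_stub": {"employer_name": "Acme Corp", "gross_pay": 4200.0},
        "brokerage_statement": {"account_value": 58000.0},
    }
)
```

Labelling each section with its source document gives the analysis step a way to say which document a discrepancy came from.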

The underwriting summary schema asks for verified income, total liquid assets, months of reserves, and a list of discrepancies with severity ratings. By setting the system prompt, we can explicitly prompt the service to perform the kind of analysis we want across the documents, rather than just pulling out fields. This is where business-specific knowledge can be injected into the pipeline to produce more actionable outputs.
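
The `UnderwritingSummary` schema itself is not shown in the post. Based only on the fields named above, one plausible sketch is the following; the exact field names, the `Discrepancy` model, the three severity levels, and the sample values are assumptions rather than the workshop's actual definitions. It uses Pydantic v2, in line with the `model_validate` and `model_json_schema` calls earlier.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class Discrepancy(BaseModel):
    description: str = Field(description="What disagrees across the documents")
    # Assumed severity scale; the post only says discrepancies carry severity ratings.
    severity: Literal["low", "medium", "high"]


class UnderwritingSummary(BaseModel):
    verified_income: float = Field(description="Income supported by the pay stub data")
    total_liquid_assets: float = Field(description="Sum of liquid account values")
    months_of_reserves: float = Field(description="How many months the assets would cover")
    discrepancies: list[Discrepancy] = Field(default_factory=list)


# JSON Schema of the kind passed as "data_schema" in the extraction call
schema = UnderwritingSummary.model_json_schema()

# Validating a result payload (made-up values for illustration)
summary = UnderwritingSummary.model_validate(
    {
        "verified_income": 100800.0,
        "total_liquid_assets": 58000.0,
        "months_of_reserves": 6.5,
        "discrepancies": [
            {"description": "Employee name differs between documents", "severity": "high"}
        ],
    }
)
```

Field descriptions end up in the generated JSON Schema, so alongside the system prompt they are another place to state what each value should mean.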

Try It Yourself

First, grab an API key from LlamaCloud if you don’t have one already.

The repo is set up so you can implement each service incrementally. Each phase has a branch with the TODO stubs filled in:

| Branch | What’s implemented |
| --- | --- |
| main | Starting point with 3 services to implement |
| phase_1 | Parser service (PDF to markdown) |
| phase_2 | + Extraction service (structured data) |
| phase_3 | + Review service (cross-document analysis) |

To get started:

git clone https://github.com/logan-markewich/finparse-pipeline && cd finparse-pipeline
git checkout phase_1  # start with TODO stubs
uv sync --group dev
cp .env.example .env  # add your LLAMA_CLOUD_API_KEY
uv run fastapi dev app/main.py

The Swagger UI at http://localhost:8000/docs lets you drive the whole flow: upload a PDF, poll for parsing, trigger extraction, create a review.
