@akshay_pachaar: https://x.com/akshay_pachaar/status/2064700531600458093

X AI KOLs Following News

Summary

This article explains how to use GRPO to fine-tune an LLM (Qwen3-8B) for reliable JSON structured output, improving schema accuracy from 62% to 82%, surpassing GPT-4.1's 58%.

https://t.co/1y913VNynx
Original Article
View Cached Full Text

Cached at: 06/10/26, 01:50 PM

Training an LLM to Generate Reliable Structured Output

You ask a language model for JSON, and most of the time it works.

Then one response comes back broken. A number shows up as a string, or the model wraps the JSON in a line of explanation, and the parser chokes.

That small failure rate is the whole problem. The output feeds a function call or a database write, and looking like valid JSON is not the same as being valid JSON.

The code downstream only finds out which one it got when it breaks.

That is what structured output means. The model returns data in a fixed shape that matches a schema, not free text that happens to read right.

Agents, tool calls, and data pipelines all run on this now. The model is writing for code to run, not for a person to read.

Getting this reliable is the hard part, and the usual answer is to give the model more. You add examples, tighten the prompt, and fine-tune on correct outputs.

It helps a little, then stops. The ceiling was never the amount of data. It is the objective.

DeepSeek-R1 showed a way around this. Training a strong model used to mean annotation pipelines, preference pairs, and a team of labelers.

DeepSeek replaced all of it with one Python function that checks whether an answer is right. If you can define correctness via code, you do not need the rest.

That is the idea behind GRPO. Instead of learning from examples, the model learns from a reward function you write.

For each prompt, it generates a few candidate answers. The reward function scores them, and the model is pushed toward the ones that score higher.

GRPO explained step by step, from data prep to response generation, reward calculation and loss calculation

GRPO explained step by step, from data prep to response generation, reward calculation and loss calculation

In this walkthrough, I use it to fine-tune Qwen3-8B for JSON extraction. The loop runs from a local notebook while the model trains on remote H200s.

The reward function does one thing: it checks whether each output parses and matches the schema.

The whole loop is driven from your local Python process, while the sampling and gradient steps run on the remote H200s.

The whole loop is driven from your local Python process, while the sampling and gradient steps run on the remote H200s.

Schema-valid output went from 62% on the base model to 82% after training, past GPT-4.1 on the same eval at 58%.

Before the build, it helps to see why the obvious approach stalls. That is what makes everything after it click.

Why SFT Hits a Ceiling

SFT learns by copying examples. Show it correct completions, and it gets good at producing output that looks like them.

But looking like valid JSON and being valid JSON are different goals. SFT only ever chases the first one.

At the token level, a completion with one wrong field type scores almost the same as a valid one. More data lifts schema accuracy to a ceiling, then stops.

At the token level, a completion with one wrong field type scores almost the same as a valid one. More data lifts schema accuracy to a ceiling, then stops.

The loss is measured token by token. A completion with one field typed wrong scores almost the same as a perfect one.

So you add more examples. The number ticks up, then flattens, because the limit is the objective, not the data.

Training Against Correctness, Not Examples

Once you see the problem, the fix is clear. You define correct in code, and train the model against that definition directly.

This is what GRPO does. It swaps labeled examples for a reward function.

For each prompt, the model generates a small group of answers, usually four to eight. Your reward function scores every one of them.

GRPO generates a group of completions for each prompt, scores every one with a reward function, and pushes the model toward the higher-scoring ones.

GRPO generates a group of completions for each prompt, scores every one with a reward function, and pushes the model toward the higher-scoring ones.

The scores are normalized inside the group. The update then reinforces the answers that scored above the group’s average.

So the model is always comparing its own outputs against each other. It learns what “more correct” means for your task, not what “more similar to an example” means.

Here is how the reward function scores three different outputs for the same prompt.

  • Output that doesn’t parse as JSON scores 0.0.

  • Output that parses but fails the schema scores 0.5.

  • Output that parses and matches the schema scores 1.0.

Output that doesn’t parse scores 0.0, valid JSON with the wrong schema scores 0.5, and a full match scores 1.0. That 0.5 is what keeps valid structure alive as a signal early in training.

Output that doesn’t parse scores 0.0, valid JSON with the wrong schema scores 0.5, and a full match scores 1.0. That 0.5 is what keeps valid structure alive as a signal early in training.

That middle score matters more than it looks. Without it, valid JSON with the wrong field types would score zero, the same as complete garbage.

The model would lose an important signal, that valid structure is already progress. The 0.5 is what gives training something to climb toward.

Why GRPO Needs Real Infrastructure

GRPO is heavier than SFT. On an 8B model it needs H200s and runs for hours.

Every step, it generates several completions per prompt, scores all of them, and updates the weights. That repeats across the whole dataset, many times over.

This is not something you run on your laptop.

There is also a timing problem SFT never has. During rollout, the model samples answers from its current weights. During training, those same weights are changing underneath it.

Without syncing, the inference deployment lags the trainer and rollouts sample from stale weights. Fireworks resyncs after every step, so completions always come from the current model.

Without syncing, the inference deployment lags the trainer and rollouts sample from stale weights. Fireworks resyncs after every step, so completions always come from the current model.

If the inference side and the trainer fall out of sync, you sample from a stale model and train on answers the current one would never give. This is where most custom RL setups fall apart.

Fireworks’ Training API handles both sides. You write the training logic in Python on your own machine.

Their infrastructure does the rest. It provisions the GPUs, runs the forward and backward passes, saves checkpoints, and resyncs the inference deployment after every step.

The setup is three steps. You write the reward function, upload the dataset, and configure the run.

Let’s go through each one.

Building the Training Loop

Fireworks documents the full setup in their Training API docs. That includes rl_loop, the recipe that runs the entire GRPO loop for you.

Clone the cookbook and install the training dependencies.

Step 1: Write the Reward Function

This is the only place your task is defined. The schema says what a correct output looks like, and score() checks each completion against it.

For invoice extraction, I pull four fields from raw text: vendor, date, amount, and currency.

jsonschema handles the type checks, the required fields, and any nested rules in a single call.

For a different task, like SQL or tool-call formatting, you swap in a new schema. The score() function stays the same.

Step 2: Prepare the Dataset

GRPO does not need labeled outputs. The dataset is just the prompts you would send in production.

The model writes its own completions during training, and score() grades them as they come.

I used 200 training prompts. They cover different vendor names, date formats, amount styles, and currency codes.

I also set aside 50 eval prompts that the model never trains on.

Variety matters more than volume here. Prompts that all look alike produce a model that breaks on real invoice variation.

Upload the dataset and wait for the READY state before the training job can use it.

Step 3: Configure and Run the Loop

rl_loop runs the whole thing. It provisions the trainer, schedules the rollouts, syncs the weights, and cleans up when the run is done.

You connect your score() function by wrapping it and assigning it to rl_loop.reward_fn. The wrapper gets both the completion and the dataset row, so you can reach ground-truth metadata if your reward needs it.

A few of these settings are worth a note.

  • dataset points to the dataset_id you uploaded in Step 2. Fireworks pulls it from their storage directly.

  • completions_per_prompt=4 sets the group size for GRPO. Production runs often use 8 to 16, which gives more signal per step at higher compute cost. Four is enough here. The reward is clear-cut enough that even a small group shows real variance between answers.

  • weight_sync_interval=1 resyncs the inference deployment after every step. That keeps rollout sampling from the exact model being trained. Production runs often set this to 4 or 8 for speed. For a short 200-step run, 1 gives the tightest feedback loop, which is what you want.

  • One Qwen3 quirk to handle. It defaults to thinking mode and adds blocks before the answer. Strip them at eval time with content.split(“”)[-1].strip(). Suppress them in training by adding /no-think to the system prompt. Otherwise the reward function reads the reasoning text instead of the JSON, and scores everything 0.0.

Results on the Held-Out Eval

Base Qwen3-8B scores 62% schema-valid on the 50 held-out prompts. After GRPO training on Fireworks H200s, the fine-tuned model hits 82%.

That is past GPT-4.1 on the same eval, which lands at 58%.

On the held-out eval, the fine-tuned Qwen3-8B reaches 82% schema-valid, past base Qwen3-8B at 62% and GPT-4.1 at 58%.

On the held-out eval, the fine-tuned Qwen3-8B reaches 82% schema-valid, past base Qwen3-8B at 62% and GPT-4.1 at 58%.

Here is the baseline run first, on the 50 prompts the model never trained on.

The base model lands at 62% schema-valid, 31 of the 50 held-out prompts.

The base model lands at 62% schema-valid, 31 of the 50 held-out prompts.

And here is the same eval after training.

After training, the same eval hits 82%, 41 of 50, a 20-point gain over the base model.

After training, the same eval hits 82%, 41 of 50, a 20-point gain over the base model.

The trained model runs on a Fireworks serverless endpoint, at a fraction of GPT-4.1’s per-token cost. Latency is lower too, since the output is short and predictable.

The real difference shows up on messy inputs. A prompted general-purpose model starts to slip, while the trained model holds, because it learned the constraint instead of the examples.

What Transfers to Your Task

This pattern works for any task where you can check correctness in code. SQL that has to parse, API responses that must match a shape, tool calls, code that has to pass a linter.

If you can score an output, you can train a model to chase that score.

What DeepSeek-R1 proved at frontier scale holds for your own small task. The model you get has practiced your definition of correct, not memorized examples of it.

On inputs it has never seen, that is the difference that holds up.

The full code is in the repo below. That includes the reward function, the training config, the dataset builder, and the eval script.

**Training API docs → ** **Finetuning Code → **

(don’t forget to star 🌟)

Thanks for reading, and to Fireworks for partnering on today’s article.

Similar Articles