@ArizePhoenix: One of the oldest lessons in ML is still one of the most useful for working with LLM apps: Don’t evaluate on the same d…
Summary
This article discusses best practices for LLM application development using Arize Phoenix, specifically highlighting the importance of using train/validation/test splits for honest evaluation and tracking regressions.
Cached at: 05/08/26, 10:49 AM
One of the oldest lessons in ML is still one of the most useful for working with LLM apps: don't evaluate on the same data used to build. Train/dev/validation/test splits exist for a reason. They help separate "this worked because it was tuned against it" from "this actually generalizes." The same practice maps naturally to agents, prompts, and evals. A good dataset might include:
- a dev split for fast iteration
- a validation split for prompt/model selection
- a test split for final confidence
- a hard-examples split for the cases the system keeps failing

A single aggregate score over the whole dataset usually hides the thing that matters most: where did the system improve, and where did it regress? Splits make experiments more targeted, more honest, and easier to compare over time. Old ML discipline, very practical LLM engineering pattern. https://arize.com/docs/phoenix/datasets-and-experiments/how-to-experiments/splits#splits
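The splitting discipline above can be sketched in a few lines of plain Python. This is an illustrative helper, not a Phoenix API; the split names, ratios, and the `known_failure` flag are assumptions for the example.

```python
import random

def make_splits(examples, seed=42):
    """Carve one example list into dev/validation/test plus a hard-examples split.

    Illustrative only: ratios and the `known_failure` flag are assumptions,
    not part of any Phoenix API.
    """
    rng = random.Random(seed)  # fixed seed so splits are reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    dev = shuffled[: int(n * 0.5)]                  # fast iteration
    validation = shuffled[int(n * 0.5): int(n * 0.8)]  # prompt/model selection
    test = shuffled[int(n * 0.8):]                  # final confidence
    # the hard-examples split cuts across the others: known failure cases
    hard = [ex for ex in examples if ex.get("known_failure")]
    return {"dev": dev, "validation": validation, "test": test, "hard_examples": hard}

examples = [{"id": i, "known_failure": i % 5 == 0} for i in range(10)]
splits = make_splits(examples)
```

Keeping the hard-examples split disjoint in purpose (but not necessarily in membership) from the evaluation splits is what lets you see regressions on known failure cases separately from aggregate scores.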
Splits - Phoenix
Source: https://arize.com/docs/phoenix/datasets-and-experiments/how-to-experiments/splits
Often we want to run an experiment over just a subset of our entire dataset. These subsets of dataset examples are called “splits.” Common splits include:
- hard examples that frequently produce poor output,
- a split of examples used in a few-shot prompt and a disjoint split of examples used for evaluation,
- train, validation, and test splits for fine-tuning an LLM.
Running experiments over splits rather than entire datasets produces evaluation metrics that better capture the performance of your agent, workflow, or prompt on the particular type of data you care about.
Configuring Splits
Experiments can be run over previously configured splits either via the Python or JavaScript clients or via the Phoenix playground.
Creating Splits
Currently, splits can be created in the UI on the dataset page. When inspecting a dataset, you will see a new splits column along with a splits filter.
From the split filter, you can both create splits and assign examples to them.
A split can be given a name, a description, and a color.
(Screenshots: Creating a Split; Split Created)
Assigning Splits
Splits can currently only be assigned from the UI on the dataset page. To assign dataset examples to splits, select a set of examples, then use the split filter to choose one or more splits; the selected examples are automatically assigned to those splits.
(Screenshots: Assigning Splits; Splits Assigned)
Using Splits
For the rest of this example we will work with the following dataset, which has 3 examples assigned to the test split and 7 examples assigned to the train split.
- UI
Experiments can be run over dataset splits from the playground UI. With dataset splits, the dataset selector now lets you choose either all examples or a particular set of splits.
To run an experiment over the "train" split, select the dataset filtered to the train split, which shows the 7 selected examples, and hit Run.
(Screenshots: Selected Split; Run Experiment on Split)
Splits are implicitly configured when a dataset is pulled from the Phoenix server using the `get_dataset` client method. Subsequent invocations of `run_experiment` run only on the examples belonging to the selected split(s).
```python
# pip install "arize-phoenix-client>1.22.0"
from phoenix.client import Client

client = Client()

# only pulls examples from the selected splits
dataset = client.datasets.get_dataset(
    dataset="my-dataset",
    splits=["test", "hard_examples"],  # names of previously created splits
)

def my_task(input):
    return f"Hello {input['name']}"

experiment = client.experiments.run_experiment(
    dataset=dataset,  # runs only on the selected splits
    task=my_task,
    experiment_name="greeting-experiment",
)
```
Splits can be configured within a `DatasetSelector` when fetching datasets. Dataset examples contained within the selected splits will be used in experiment runs or evaluations.
```typescript
// npm install @arizeai/phoenix-client@latest
import { runExperiment } from "@arizeai/phoenix-client/experiments";
import type { ExperimentTask } from "@arizeai/phoenix-client/types/experiments";
import type { DatasetSelector } from "@arizeai/phoenix-client/types/datasets";

const myTask: ExperimentTask = (example) => {
  return `Hello, ${(example.input as any).name ?? "stranger"}!`;
};

// Create a dataset selector that can be used to fetch a dataset
const datasetSelector: DatasetSelector = {
  datasetName: "my-dataset",
  splits: ["test", "hard_examples"], // names of previously created splits
};

runExperiment({
  // runExperiment will perform a just-in-time fetch of the dataset "my-dataset"
  // with its examples filtered by the provided splits
  dataset: datasetSelector,
  task: myTask,
  experimentName: "greeting-experiment",
});
```
Comparing Experiments on Splits
Splits are mutable: you can add or remove examples from a split at any time. For consistent comparison, however, experiment runs are snapshotted at the time of execution, and the association between an experiment run and the splits it was executed against is immutable. Comparisons therefore remain accurate even if split assignments change later.

Comparisons between experiments always consult the snapshot of the base experiment. This means the system uses the exact set of examples that were included in the base experiment at the time it was executed, regardless of any subsequent changes to split assignments.

For example, if you run an experiment on the "train" split when it contains 7 examples (with specific example IDs), those same 7 example IDs are what will be retrieved and compared in any future experiment comparisons. Even if you later add examples to the "train" split or remove some, the comparison will still include only the original 7 examples from the base experiment's snapshot.

When comparing experiments run on splits, you may now see a new overlap state in which the compared experiment does not contain the base experiment's example IDs, as well as the expected state in which the example IDs overlap.