@DailyDoseOfDS_: A Python decorator is all you need to trace LLM apps (open-source). Most LLM evals treat the app like an end-to-end bla…


Summary

Introduces DeepEval's @observe decorator for component-level tracing and evaluation of LLM apps, enabling granular insight into retrievers, tools, and models.


Cached at: 05/22/26, 03:57 PM

A Python decorator is all you need to trace LLM apps (open-source).

Most LLM evals treat the app like an end-to-end black box.

But LLM apps need component-level evals and tracing since the issue can be anywhere inside the box, like the retriever, tool call, or the LLM itself.

In DeepEval, you can do that with just 3 lines of code:

  • Trace individual LLM components (tools, retrievers, generators) with the “@observe” decorator.
  • Attach different metrics to each part.
  • Get a visual breakdown of what’s working and what’s not.

Done!

You don’t need to refactor any of your existing code.

See the example below for a RAG app.
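
To make the idea concrete, here is a minimal stand-in sketch in plain Python of what component-level tracing with a decorator looks like for a toy RAG app. It is not DeepEval's implementation or API: the `trace_component` decorator, the `TRACE` list, the two toy metrics, and the `retrieve`/`generate`/`rag_app` functions are all hypothetical names invented for illustration, and the "LLM" is a placeholder string formatter.

```python
import functools
import re
import time

# Hypothetical in-memory trace store; a real tool would export spans elsewhere.
TRACE = []


def tokens(text):
    """Lowercased word tokens, used by the toy retriever and metric."""
    return set(re.findall(r"\w+", text.lower()))


def trace_component(kind, metrics=()):
    """Hypothetical stand-in for a tracing decorator (not DeepEval's API).

    Records one span per call and scores the output with each attached metric.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = func(*args, **kwargs)
            TRACE.append({
                "name": func.__name__,
                "kind": kind,
                "seconds": time.perf_counter() - start,
                "scores": {m.__name__: m(args, output) for m in metrics},
            })
            return output
        return wrapper
    return decorator


# Toy metrics: each receives the call's positional args and its output.
def returned_something(args, output):
    return 1.0 if output else 0.0


def mentions_query_term(args, output):
    query = args[0]
    return 1.0 if tokens(query) & tokens(output) else 0.0


DOCS = [
    "DeepEval is an open-source LLM evaluation framework.",
    "Paris is the capital of France.",
]


@trace_component("retriever", metrics=[returned_something])
def retrieve(query):
    # Keep documents that share at least one word with the query.
    return [d for d in DOCS if tokens(query) & tokens(d)]


@trace_component("generator", metrics=[mentions_query_term])
def generate(query, context):
    # Placeholder for an LLM call.
    return f"Based on {len(context)} document(s): " + " ".join(context)


@trace_component("app")
def rag_app(query):
    return generate(query, retrieve(query))


answer = rag_app("capital of France")
for span in TRACE:
    print(span["kind"], span["name"], span["scores"])
```

The existing functions keep their bodies and signatures; only the decorator lines are added. Each component gets its own metric, so a low score points at the retriever or the generator rather than at the app as a whole.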

DeepEval is 100% open-source with 15k+ stars, and you can easily self-host it so your data stays where you want.

Find the repo in the replies!
