@DailyDoseOfDS_: A Python decorator is all you need to trace LLM apps (open-source). Most LLM evals treat the app like an end-to-end bla…
Summary
Introduces DeepEval's @observe decorator for component-level tracing and evaluation of LLM apps, enabling granular insight into retrievers, tools, and models.
Cached at: 05/22/26, 03:57 PM
A Python decorator is all you need to trace LLM apps (open-source).
Most LLM evals treat the app like an end-to-end black box.
But LLM apps need component-level evals and tracing since the issue can be anywhere inside the box, like the retriever, tool call, or the LLM itself.
In DeepEval, you can do that with just 3 lines of code:
- Trace individual LLM components (tools, retrievers, generators) with the “@observe” decorator.
- Attach different metrics to each part.
- Get a visual breakdown of what’s working and what’s not.
Done!
You don’t need to refactor any of your existing code.
See the example below for a RAG app.
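To make the idea concrete, here is a minimal sketch of component-level tracing for a toy RAG pipeline. It does not use DeepEval: the `observe` decorator, the `TRACE` list, and the two toy metrics are hypothetical stand-ins written with the standard library only, and the retriever and generator are placeholders rather than real model calls. It only illustrates how a decorator can record one span per component and attach a different metric to each, without changing the function bodies.

```python
import functools
import time

TRACE = []  # hypothetical in-memory span store, one dict per traced call


def observe(span_type, metrics=()):
    """Stand-in tracing decorator (an assumption, not DeepEval's implementation)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = fn(*args, **kwargs)
            TRACE.append({
                "name": fn.__name__,
                "type": span_type,
                "input": args,
                "output": output,
                "seconds": time.perf_counter() - start,
                # Each metric is a plain callable scoring (input args, output).
                "scores": {m.__name__: m(args, output) for m in metrics},
            })
            return output
        return wrapper
    return decorator


def non_empty_context(args, output):
    """Toy retriever metric: 1.0 if at least one document was returned."""
    return 1.0 if output else 0.0


def mentions_context(args, output):
    """Toy generator metric: 1.0 if the answer quotes a retrieved document."""
    _query, docs = args
    return 1.0 if any(doc in output for doc in docs) else 0.0


# Hypothetical two-document corpus keyed by topic.
DOCS = {
    "france": "Paris is the capital of France.",
    "germany": "Berlin is the capital of Germany.",
}


@observe("retriever", metrics=[non_empty_context])
def retrieve(query):
    # Keyword lookup standing in for a vector search.
    return [text for key, text in DOCS.items() if key in query.lower()]


@observe("generator", metrics=[mentions_context])
def generate(query, docs):
    # Placeholder for an LLM call: echoes the first retrieved document.
    return f"Answer: {docs[0]}" if docs else "I don't know."


@observe("app")
def rag_app(query):
    return generate(query, retrieve(query))


answer = rag_app("What is the capital of France?")
for span in TRACE:
    print(span["type"], span["name"], span["scores"])
```

Because each span stores its own scores, a failing retriever shows up separately from a failing generator instead of being folded into a single end-to-end result. Spans are appended when a call returns, so the inner components are listed before the enclosing `rag_app` span.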
DeepEval is 100% open-source with 15k+ stars, and you can easily self-host it so your data stays where you want.
Find the repo in the replies!