What are you actually evaluating these days: prompts, context, or the whole harness?

Reddit r/AI_Agents News

Summary

A discussion about the focus of AI evaluations, questioning whether practitioners are optimizing prompts, context, or the entire harness, and noting a shift toward holistic optimization.

A question for those who care about evals: what's the main thing you're trying to evaluate and optimize right now? Prompts? Context? The harness itself? Most people I talk to still point evals at individual prompts. But my read is that the frontier has moved: the interesting work now is closing the loop and optimizing the entire context and/or harness, not tuning prompts in isolation. Is anyone already doing this in practice? Curious what your setup looks like and where it breaks down.
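To make the prompt-versus-harness distinction concrete, here is a minimal sketch, not anyone's actual setup. Everything in it is a hypothetical stand-in: `stub_model` replaces a real LLM call, `keyword_retriever` replaces a real retrieval step, and `DOCS` and the two test cases are made up. It scores the same prompt template twice: once in isolation with hand-supplied context, and once through the whole pipeline.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical toy knowledge base; a real harness would query a retriever or index.
DOCS = {
    "capital of France": "Paris",
    "capital of Japan": "Tokyo",
}


@dataclass
class Case:
    question: str
    expected: str


def render_prompt(question: str, context: str) -> str:
    """The prompt template under test."""
    return f"Context: {context}\nQuestion: {question}\nAnswer:"


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): repeats the context line, or 'unknown' if it is empty."""
    context = prompt.split("Context:", 1)[1].split("\n", 1)[0].strip()
    return context or "unknown"


def keyword_retriever(question: str) -> str:
    """Naive retriever (assumption): exact substring match, so paraphrased questions miss."""
    for key, value in DOCS.items():
        if key in question:
            return value
    return ""


def prompt_only(case: Case) -> str:
    """Prompt-level eval: the prompt is handed ideal context, so retrieval is never exercised."""
    return stub_model(render_prompt(case.question, case.expected))


def full_harness(case: Case) -> str:
    """Harness-level eval: retrieval, prompt rendering, and the model run end to end."""
    return stub_model(render_prompt(case.question, keyword_retriever(case.question)))


def evaluate(system: Callable[[Case], str], cases: list[Case]) -> float:
    """Fraction of cases whose output exactly matches the expected answer."""
    hits = sum(system(case) == case.expected for case in cases)
    return hits / len(cases)


cases = [
    Case("What is the capital of France?", "Paris"),
    Case("Which city is Japan's capital?", "Tokyo"),  # paraphrase the retriever misses
]

prompt_score = evaluate(prompt_only, cases)
harness_score = evaluate(full_harness, cases)
print(f"prompt-only: {prompt_score:.2f}, full harness: {harness_score:.2f}")
```

In this toy the prompt looks perfect when scored alone, while the end-to-end score drops because the failure sits in retrieval, a component a prompt-level eval never touches.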

Similar Articles

@AntCaveClub: What exactly is Harness? Harness = Evaluation Harness. In AI, "harness" is industry jargon – a set of tools to "harness" a model and run standardized evaluations. The industry standard is EleutherAI's lm-e…

X AI KOLs Timeline

This article explains in depth why the evaluation framework (harness) matters in AI, analyzes the strategic significance of DeepSeek building its own harness team, and compares the open-source lm-evaluation-harness with an in-house system.

Can prompting reduce AI sycophancy or is it mostly model behavior?

Reddit r/artificial

A user explores whether prompt engineering can reduce AI sycophancy in models like Gemini, ChatGPT, and Claude, or whether it's fundamentally a model alignment issue. The discussion touches on differences between models in handling disagreement and objective criticism.

Are most LLM eval tools still too prompt-focused?

Reddit r/AI_Agents

The author questions whether current LLM evaluation tools are too focused on isolated prompts rather than full workflows and agent interactions, noting that step-by-step accuracy can mask overall behavioral drift in production.