#measurement

Measuring reward-seeking by instilling contrastive beliefs

Hacker News Top · yesterday

Researchers from OpenAI and Apollo Research developed Contrastive Synthetic Document Finetuning (Contrastive SDF), a new test for measuring whether AI models engage in reward-seeking behavior, that is, changing their actions based on what they believe a grader wants even when this contradicts user intent. The test identified such behavior in models trained with reinforcement learning at frontier scale, and the tendency increased over training.

A scorecard for the AI age

OpenAI Blog · 5d ago

OpenAI discusses how CFOs can measure AI value using 'Useful Intelligence per Dollar', a metric that evaluates work accomplished versus cost, rather than just token cost or adoption.

Evaluating Nonuniform Dependability Across Response Conditions: A Conditional Generalizability Framework Illustrated in Automated Essay Scoring

arXiv cs.CL · 2026-07-15

This paper proposes a conditional generalizability framework to evaluate nonuniform dependability across response conditions in automated essay scoring.

@patrickc: It's been interesting and puzzling to witness the problems with accuracy in UK economic statistics over the past few ye…

X AI KOLs Following · 2026-07-14

The post discusses accuracy problems in UK economic statistics, particularly around entrepreneurship, and suggests that official figures may be missing a solopreneur boom that Stripe data indicates.

How do you measure semantic cache correctness in production?

Reddit r/AI_Agents · 2026-07-11

Explores techniques for measuring the correctness of semantic caches in production environments, a key concern for AI/ML systems relying on caching for efficiency.

Validating LLMs in social science: Epistemic threats and emerging norms

arXiv cs.CL · 2026-07-10

This paper analyzes validation practices for using LLMs as measurement instruments in social science, identifying epistemic threats and proposing emerging norms for robust validation.

Every AI Visibility Tool Is Lying to You

Hacker News Top · 2026-07-03

This article critically examines the accuracy of AI visibility tools that claim to measure brand presence in generative AI responses, arguing that they provide false precision due to nondeterminism, personalization, and scraping biases. It calls for transparency in methodology and warns against treating opaque dashboards as stable truth.

Language Models as Measurement Apparatus for Culture

arXiv cs.CL · 2026-07-03

This paper argues that NLP research on culture is a material-discursive practice in which language models participate in constituting cultural reality rather than passively recording it, drawing on Barad's concept of the agential cut.

Goals from Loops

Product Hunt · 2026-07-02

Loops introduces goal tracking features to help users measure whether a campaign drove the desired outcome.

@ickma2311: Efficient AI Lecture 22: Quantum Machine Learning I Quantum ML starts from a different computational primitive: the Qub…

X AI KOLs Timeline · 2026-06-26

Lecture notes on the foundations of quantum machine learning, covering qubits, superposition, measurement, and the Bloch sphere.

@rohanpaul_ai: New Microsoft + York Univ paper argues that LLMs should not be treated as human-like without clear tests and narrower c…

X AI KOLs Following · 2026-06-20

A Microsoft and York University paper argues that ascribing human-like traits to LLMs is problematic because of flawed experimental designs, using Age of Empires II as an analogy to highlight measurement issues.

Your voice agent probably isn't slow because of the LLM.

Reddit r/AI_Agents · 2026-06-17

A developer debunks the common belief that LLM latency is the primary cause of slow voice agents, explaining that delays often stem from earlier stages like audio capture, VAD, and STT. They recommend logging specific latency metrics and testing various STT/TTS providers and orchestration frameworks to diagnose issues.

Linux latency measurements and compositor tuning

Lobsters Hottest · 2026-06-10

A detailed investigation of Linux gaming latency using a Teensy-based LDAT tool: it measures click-to-photon latency under various settings on Nvidia GPUs with KDE Wayland and compares the results with Windows.

Large-scale semantic mapping of learner agency and autonomy reveals what measurement and generative AI research overlook

arXiv cs.AI · 2026-06-10

This paper uses large-scale semantic analysis of over 14,000 publications to map definitions of learner agency and autonomy, revealing three dimensions and a systematic underrepresentation of the sociocultural dimension in existing scales. It argues that current generative AI research in education overly focuses on learning regulation, narrowing the behavioral repertoire for AI-mediated learning environments.

@saranormous: https://x.com/saranormous/status/2064510215056400652

X AI KOLs Following · 2026-06-10

The post argues that, despite rapid advances in AI coding agents like Devin, which have dramatically increased the amount of code written and shipped, the most valuable aspects of software engineering remain illegible to benchmarks and require human judgement and organizational coordination that cannot be easily automated.

The AI Epistemic Deference Index: A Continuous Measure of Sycophancy

arXiv cs.AI · 2026-06-09

The paper introduces the AI Epistemic Deference Index (AEDI), a continuous measure of how much a model's expressed support for a factual claim shifts based on the user's stated attitude, and evaluates eight prominent models, finding substantial sycophancy with differences across providers.

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

arXiv cs.AI · 2026-06-01

Introduces PReMISE, a framework for discovering and auditing policy-level rubrics for LLM judges along four axes: structural adequacy, reliability, preference fit, and adversarial robustness.

AI evaluation may bias perceptions: The importance of context in interpreting academic writing

arXiv cs.CL · 2026-05-27

This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences across countries and fields, and proposes context-aware benchmarks for more accurate measurement.

Our voice agent's p99 was 280ms. Competitor's was 450ms. Users said ours felt slower. We measured why.

Reddit r/AI_Agents · 2026-05-26

A voice agent team found that despite a lower p99 latency (280ms vs the competitor's 450ms), users perceived their agent as slower because of slow barge-in (interruption) handling (380ms vs 60ms). They identified three fixes (memory pinning, VAD threshold tuning, and smaller TTS chunks) that raised the barge-in rate at 100ms from 41% to 89%, which made the agent feel faster to users.

Screen Ruler

Product Hunt · 2026-05-23

Screen Ruler is a tool that provides on-screen measurements for designers and developers.
