Tag
Introduces PReMISE, a framework for discovering and auditing policy-level rubrics for LLM judges along four axes: structural adequacy, reliability, preference fit, and adversarial robustness.
This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences across countries and fields, and proposes context-aware benchmarks for more accurate measurement.
A voice agent team found that despite lower end-to-end latency (280ms vs a competitor's 450ms), users perceived their agent as slower because of slow barge-in (interrupt) handling (380ms vs 60ms). They identified three fixes (memory pinning, VAD threshold tuning, and smaller TTS chunks) that raised the barge-in rate at 100ms from 41% to 89%, after which users perceived the agent as faster.
Screen Ruler is a tool that provides on-screen measurements for designers and developers.
The author explores the difficulty of accurately measuring AI proficiency in hiring, arguing that current certifications and tests focus on memorization rather than practical reasoning and evaluation.
The article argues that, however sophisticated modern scientific instruments are, all measurements ultimately derive from two ancient techniques, comparison and counting, and illustrates this with examples such as rulers and sundials.
A technical deep dive into the historical inconsistency of the typographic point unit, explaining why TeX (72.27 pt/inch) and Inkscape (72 pt/inch) use different definitions, rooted in 19th-century standardization and Donald Knuth's pragmatic adjustment.
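The gap between the two point definitions in the last entry can be made concrete with a small conversion sketch. It assumes only the two ratios quoted there (72.27 pt/inch for TeX, 72 pt/inch for Inkscape). The function and constant names are hypothetical and are not taken from the summarized article.

```python
# Points per inch under each definition, as quoted in the summary above.
TEX_PT_PER_INCH = 72.27  # TeX's "pt"
PS_PT_PER_INCH = 72.0    # the 72 pt/inch point used by Inkscape (TeX calls this unit "bp")


def tex_pt_to_ps_pt(tex_pt: float) -> float:
    """Convert a length in TeX points to 72-per-inch points."""
    inches = tex_pt / TEX_PT_PER_INCH
    return inches * PS_PT_PER_INCH


def ps_pt_to_tex_pt(ps_pt: float) -> float:
    """Convert a length in 72-per-inch points to TeX points."""
    inches = ps_pt / PS_PT_PER_INCH
    return inches * TEX_PT_PER_INCH


if __name__ == "__main__":
    # A nominal 10pt TeX font is slightly smaller than 10 points in a 72 pt/inch tool.
    print(f"10 TeX pt = {tex_pt_to_ps_pt(10.0):.4f} pt at 72 pt/inch")
```

The difference is under 0.4%, which is why it tends to go unnoticed until a drawing has to line up exactly with typeset text.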