This paper introduces Benchmarking-Cultures-25, a dataset for analyzing how AI model builders selectively highlight benchmarks in press releases. It finds a fragmented evaluation landscape with limited cross-model comparability, and argues that benchmarks are used as narrative devices for market positioning rather than as standardized scientific measurement.
A human review of TranslateGemma-12b's translations revealed that 71% of segments rated clean by automated metrics actually contained errors, highlighting significant gaps in metric-only evaluation for multilingual translation quality.
This paper argues that LLM inference should be evaluated as energy-to-token production under constraints of compute, power, cooling, and operational efficiency, proposing new metrics like joules/token and PUE-adjusted delivered power.
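To make the two named metrics concrete, here is a minimal sketch of how joules/token and a PUE-adjusted power figure could be computed. It assumes the standard definition of PUE (total facility energy divided by IT equipment energy); the function names, parameters, and the exact adjustment are illustrative assumptions and are not taken from the paper.

```python
def joules_per_token(it_energy_joules: float, tokens: int, pue: float = 1.0) -> float:
    """Facility-level energy spent per generated token.

    Assumption: it_energy_joules is measured at the IT equipment, and
    pue = total facility energy / IT equipment energy (always >= 1.0).
    """
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    # Scale IT energy up to facility energy, then divide by tokens produced.
    return it_energy_joules * pue / tokens


def delivered_it_power(facility_power_watts: float, pue: float) -> float:
    """Power that reaches IT equipment once facility overhead is accounted for.

    Hypothetical reading of "PUE-adjusted delivered power": facility power / PUE.
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return facility_power_watts / pue


if __name__ == "__main__":
    # 3600 J of IT energy, 1200 tokens, PUE of 1.5
    print(joules_per_token(3600.0, 1200, pue=1.5))
    # 1 MW at the facility meter, PUE of 1.25
    print(delivered_it_power(1_000_000.0, 1.25))
```

Under these assumptions, a lower PUE or a higher token count per joule both reduce the facility-level cost of each token, which is the sense in which inference is framed as energy-to-token production.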
This article discusses the proper use of SI units for measuring request rate in distributed systems. It proposes hertz (Hz) for periodic, regular traffic and becquerel (Bq) for stochastic, organic traffic, with the aim of standardizing how request rates are communicated.
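The hertz/becquerel convention can be sketched as a small formatting helper. Both units have the dimension of inverse seconds, so only the label changes. The function name and the `periodic` flag are hypothetical, and deciding whether traffic counts as periodic or stochastic is left to the caller.

```python
def format_request_rate(requests: int, window_seconds: float, periodic: bool) -> str:
    """Format a request rate using Hz for periodic traffic and Bq for stochastic traffic.

    Assumption: the caller already knows whether the traffic is regular
    (e.g. a cron job or health check) or organic (e.g. user-driven requests).
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    rate = requests / window_seconds  # events per second
    unit = "Hz" if periodic else "Bq"
    return f"{rate:g} {unit}"


if __name__ == "__main__":
    # A health check firing 600 times in one minute
    print(format_request_rate(600, 60, periodic=True))
    # User traffic totalling 4500 requests in one minute
    print(format_request_rate(4500, 60, periodic=False))
```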
Waydev launches a new platform to measure the full AI software development lifecycle, tracking metrics from token-level operations through production deployment.