Tag
A human review of TranslateGemma-12b's translations revealed that 71% of segments rated clean by automated metrics actually contained errors, highlighting significant gaps in metric-only evaluation for multilingual translation quality.
This paper argues that LLM inference should be evaluated as energy-to-token production under constraints of compute, power, cooling, and operational efficiency, proposing new metrics like joules/token and PUE-adjusted delivered power.
An article discussing the proper use of SI units for measuring request rate in distributed systems, proposing the use of hertz (Hz) for periodic/regular traffic and becquerel (Bq) for stochastic/organic traffic patterns to standardize how request rates are communicated.
Waydev launches a new platform to measure the full AI software development lifecycle, tracking metrics from token-level operations through production deployment.