Meet Alice. Alice is impatient

Lobsters Hottest News

Summary

This blog post explains the inspection paradox in system latency and recovery time measurement, showing why customers experience longer average waits than service metrics suggest. It includes an interactive simulation and emphasizes the importance of understanding the tail of the distribution.

<p><a href="https://lobste.rs/s/dswkwr/meet_alice_alice_is_impatient">Comments</a></p>
Original Article
View Cached Full Text

Cached at: 06/20/26, 02:32 PM

# Meet Alice. Alice is impatient. Source: [https://brooker.co.za/blog/2026/06/19/waiting.html](https://brooker.co.za/blog/2026/06/19/waiting.html) ## About Me My name is Marc Brooker\. I like to build things that work, and do cool stuff\. I like building big things\. I also dabble in machining, welding, cooking, and skiing\. I am an engineer at Amazon Web Services \(AWS\) in Seattle, where I work on agentic AI, especially safety and policy for agentic AI\. Before that, I worked on EC2, EBS, databases, serverless, and serverless databases\. All opinions are my own\. ## Links [My Publications and Videos](https://brooker.co.za/blog/publications.html) [@marcbrooker on Mastodon](https://fediscience.org/@marcbrooker)[@MarcJBrooker on Twitter](https://twitter.com/MarcJBrooker) [Is this blog written by AI?](https://brooker.co.za/blog/2026/06/18/my-blog-and-ai.html) What do you mean? Meet Alice\. Alice uses your web service\. Alice, like most humans, measures her time in seconds and minutes\. Alice says your service is slow\. You tell Alice that the mean request to your service completes in 100ms, but Alice says that her mean wait time is 1s\. You’re both right\. Meet Alex\. Alex uses your web service\. Alex, like most humans, measures his time in seconds and minutes\. Alex says that when you have outages, they last a long time and he gets really annoyed\. You tell Alex that your MTTR is less than 1 minute\. Alex says that he sees the mean outage lasting 1 hour\. Again, you’re both right\. What’s going on? What’s going on is that you’re measuring time in requests, or in outages, and Alex and Alice are measuring time in seconds and minutes\. When you have a long request or a long outage, Alex and Alice count that as a long time, with a heavy weight\. But you only count that as one\. More technically, what’s going on here is the*inspection paradox*\. Alex and Alice don’t experience your latency distribution $f\(t\)$, they experience a t\-weighted version of it\. If you have a MTTR or mean request time of $\\mathbb\{E\}\[X\]$, Alex and Alice experience $\\mathbb\{E\}\_a\[X\] = \\frac\{\\mathbb\{E\}\[X^2\]\}\{\\mathbb\{E\}\[X\]\} = \\mathbb\{E\}\[X\] \+ \\frac\{\\mathrm\{Var\}\(X\)\}\{\\mathbb\{E\}\[X\]\}$\. Most of the time they’re waiting, they’re waiting for things that take a long time\. This is \(roughly\) how humans experience time\. Let’s play with this with a little simulation\. Plug in your median latency \(or recovery time\), and 99th percentile latency \(or recovery time\), we’ll fit a log\-normal distribution to it, and then plot both what your service metrics see and what your customers see\. Median: ms p99: ms What your service sees \(mean\):**–ms**\. What your customers experience \(mean\):**–ms**\. For example, put in 30 as the median \(let’s ignore the milliseconds and pretend these are minutes for now\) for a 30 minute Median TTR \(i\.e\. in half of your postmortems you see a recovery time of $\\leq 30$ minutes\), and 600 in as the p99 \(one in every 100 events, recovery takes 10 hours\)\. Your MTTR is just over an hour\. Your customers experience a mean time to recovery of around 6 hours\! There are many arguments for why tail latency \(and long recovery times\) are so important to understand \(e\.g\.[multiple samples](https://brooker.co.za/blog/2021/04/19/latency.html)\), but this is the one that I think is the least widely understood\. For service times, timeout\-and\-retry can hide this latency some of the time \(as long as the running request doesn’t hold locks or other exclusive resources\)\. But, for recovery time, no such hiding is possible\. The heaviness if the tail matters a great deal\. This is also one of the reasons I don’t like trimmed measurements \(like trimmed means\) as a way of thinking about service latency or recovery time\. They throw out some really critical context about the shape of the right tail that dominates the customer experience \(the other reason is related to Little’s Law and capacity usage,[which I’ve written about before](https://brooker.co.za/blog/2017/12/28/mean.html)\)\. *A note on log\-normal:*I chose log\-normal here for numerical convenience\. It has the nice property that $\\mathrm\{lognormal\}\(\\mu, \\sigma^2\)$ becomes $\\mathrm\{lognormal\}\(\\mu \+ \\sigma^2, \\sigma^2\)$\. Also it’s well\-behaved around 0\. I don’t believe that log\-normal is a particularly good choice of distribution for latency or recovery time metrics, and generally would approach these problems entirely non\-parametrically\.

Similar Articles

We should get rid of average CPU utilization

Hacker News Top

The article explains why average CPU utilization is a misleading metric for latency-sensitive workloads, using queueing theory and a real-world production incident. It argues for more nuanced monitoring approaches.

Why Queues Don’t Fix Overload (And What To Do Instead)

Lobsters Hottest

Explains why unbounded queues are a bug in software systems, using Little's Law and the bathtub analogy to show that queues only absorb variance, not sustained load. Discusses latency death spirals and advocates for backpressure instead.

The Return of Rigorous Full-System Timing Simulation

Hacker News Top

This blog post argues for a return to rigorous full-system timing simulation in computer architecture to overcome the 'timing simulation wall' and accurately capture modern system behaviors, advocating for measuring the right execution intervals with statistically sound methods rather than simulating everything in detail.

Notes from evaluating a customer support chat agent system: heuristic evaluators give false signal, retrieval bugs masquerade as LLM failures, and the cost/quality Pareto frontier is rarely where you think [D]

Reddit r/MachineLearning

Practical findings from auditing a production customer support RAG system reveal that heuristic evaluators give false signal, retrieval bugs often masquerade as LLM failures, and the Pareto frontier for cost and quality is often not where expected. Sweeping models showed that replacing the incumbent (Gemini Flash Lite Preview) with Gemma 4 26B achieved a 19% quality improvement at 79% lower cost.