JobBench: Aligning Agent Work With Human Will

arXiv cs.AI 05/27/26, 04:00 AM Papers
benchmark ai-agents human-alignment workplace-automation jobbench o-net
Summary
JobBench is a benchmark built from worker surveys to evaluate AI agents on tasks that workers most want automated, covering 130 tasks across 35 professions with detailed rubrics.
arXiv:2605.26329v1 Announce Type: new Abstract: Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.
# JobBench: Aligning Agent Work With Human Will
Source: [https://arxiv.org/html/2605.26329](https://arxiv.org/html/2605.26329)
Yuetai Li \(1\), Yichen Feng \(1\), Zhangchen Xu \(1, 10\), Zixian Ma \(1\), Kaiyuan Zheng \(1\), Fengqing Jiang \(1\), Xinghua Sun \(1\), Rulin Shao \(1\), Zichen Chen \(2, 3, 10\), Yue Huang \(6\), Xinyang Han \(7\), Brian Lee \(13\), Kayla Xu \(5\), Shenglai Zeng \(8\), Hang Hua \(9\), Xiangliang Zhang \(6\), Basel Alomair \(1, 11\), Ranjay Krishna \(1\), Luke Zettlemoyer \(1\), Pang Wei Koh \(1\), Bhaskar Ramasubramanian \(12\), Luyao Niu \(1\), Xiang Yue \(4\), Radha Poovendran \(1\)

\(1\) University of Washington; \(2\) University of California, Santa Barbara; \(3\) Stanford University; \(4\) Carnegie Mellon University; \(5\) Northwestern University; \(6\) University of Notre Dame; \(7\) University of California, Berkeley; \(8\) Michigan State University; \(9\) MIT\-IBM Watson AI Lab; \(10\) Bake AI; \(11\) King Abdulaziz City for Science and Technology; \(12\) Western Washington University; \(13\) University of Chicago

Homepage: [https://job\-bench\.github\.io/](https://job-bench.github.io/)

Hugging Face: [https://huggingface\.co/datasets/JobBench/job\-bench](https://huggingface.co/datasets/JobBench/job-bench)

GitHub: [https://github\.com/Job\-Bench/job\-bench\-eval](https://github.com/Job-Bench/job-bench-eval)

## 1 Introduction

The conversation about AI in the workplace has been framed almost entirely in economic terms: What fraction of working hours can agents absorb? How much of GDP is exposed to automation? Workplace benchmarks inherit this framing by design\. GDPVal \[[31](https://arxiv.org/html/2605.26329#bib.bib26)\] selects tasks that represent economic value and scores agents on whether they can deliver professional knowledge work judged against expert reference deliverables\. The Remote Labor Index measures end\-to\-end remote\-work projects against contractor pay \[[18](https://arxiv.org/html/2605.26329#bib.bib29)\]\. \$OneMillion\-Bench prices each of its 400 expert tasks by senior\-expert hours times market wage and scores agents on the share of that expert\-priced work they can reliably deliver \[[41](https://arxiv.org/html/2605.26329#bib.bib28)\]\. All these benchmarks ask the same question in different shapes: which economically valuable deliverables can an agent now produce on its own?

This framing, however, ignores a complementary criterion\. If agents are to share the professional workplace with the humans who hold those jobs, evaluations should also weigh which of their duties the workers themselves prefer to delegate\. We treat this as a human\-centered constraint on benchmark design: the professional is positioned not as labor to be displaced, but as a domain expert whose stated preferences over their own work inform which duties merit automation and would enhance productivity\.

JobBench is a benchmark built on that principle\. Every one of its 130 tasks across 35 professions is constructed from work that experts in that field say they most want a capable agent to handle\. We design tasks on top of Workbank \[[34](https://arxiv.org/html/2605.26329#bib.bib30)\], a worker\-centered survey in which more than 1,500 workers rate, for every O\*NET work duty of their own occupation, whether they would prefer an AI agent to take that work over\. \(The Occupational Information Network, O\*NET, maintained by the U\.S\. Department of Labor, provides a detailed list of work duties for each occupation\.\) We select the 35 occupations with high average willingness for automation, and develop the work duties into full benchmark evaluations\. Figure [1](https://arxiv.org/html/2605.26329#S0.F1) illustrates a reporter task example, grounded in a duty that reporters most want delegated to AI: “checking different source reference materials to obtain relevant facts\.” Dashed lines trace each fact from its source file, through the reasoning challenge it raises, to the deliverable that fact ultimately supports\.

JobBench grades each task with a chained rubric set that carries 35\.6 binary criteria per task on average and 4,631 criteria in total\. Every criterion is anchored to a deterministic number, a specific reasoning step, or a documented professional judgment, and a rubric receives its weight only when every criterion in the chain passes together\. There is no partial credit for surfacing the right fact through a wrong inference\.

We summarize our contributions as follows:

- Align with human will\. We ground every JobBench task in domain\-expert delegation desire, so that progress on the leaderboard maps onto how agents lift worker satisfaction and productivity together\.
- Professional reasoning\. Each task provides a workspace of heterogeneous reference files that may contain conflicts and hints for search\. Credit goes to agents that retrieve and reconcile the right sources\.
- Fact\-anchored chained rubrics\. The 4,631 binary criteria are organized into rubric chains that award credit only when every criterion in the chain passes\. Pooled across sampled runs from different agents, 95\.4% of rubrics are passed at least once, evidence that each criterion is verifiable in practice\.
- Challenging for frontier models\. Across 36 agent configurations, the strongest setup, Claude Opus 4\.7 under Claude Code, reaches 45\.9%; outside the Claude and GPT families no agent exceeds 19%\.

## 2 The JobBench Benchmark

### 2\.1 Design Principles of JobBench

Align with human will\. JobBench treats the expert’s own willingness and judgment as the selection signal, drawing on Workbank \[[34](https://arxiv.org/html/2605.26329#bib.bib30)\], in which more than 1,500 workers rate every work duty in their occupation for delegation desire\. By targeting duties that experts want delegated and spend the most preparation time on, JobBench measures capability on work whose automation lifts worker satisfaction and productivity together\.

From knowledge delivery to professional reasoning\. GDPVal \[[31](https://arxiv.org/html/2605.26329#bib.bib26)\] evaluates polished deliverables from relatively clean task packets\. JobBench instead issues heterogeneous, sometimes conflicting workspaces where agents must locate, retrieve, and reconcile source evidence before producing the final artifact\. This shifts the evaluation target from presenting plausible professional output to doing the source\-grounded reasoning that makes such output defensible\.

Enhancement, not replacement\. Scoping a task around economic value and an end\-to\-end deliverable explicitly tells a replacement story: the agent as a stand\-in for the human on the job\. JobBench instead scores the work whose automation augments the expert rather than substitutes for them\.

Table [1](https://arxiv.org/html/2605.26329#S2.T1) compares the design principles of JobBench against GDPVal\. For reporters, the duty experts most want offloaded, according to the WorkBank survey \[[34](https://arxiv.org/html/2605.26329#bib.bib30)\], is "cross\-source fact checking", yet the GDPVal task focuses only on a single article edit over a pre\-assembled source packet, while JobBench scopes the cross\-year reconciliation of water\-quality CSVs, EPA guidance, and surveillance data\.

Table 1: Case comparison between JobBench and GDPVal \[[31](https://arxiv.org/html/2605.26329#bib.bib26)\] on the occupations of Reporter and Technical Sales\.

### 2\.2 JobBench Overview

Data distribution\. Table [2](https://arxiv.org/html/2605.26329#S2.T2) reports the full statistics of JobBench\. It contains a 65\-task main set and a 65\-task easy set, covering 35 O\*NET occupations spanning 10 SOC groups \(the Standard Occupational Classification, SOC, is the U\.S\. federal taxonomy that groups the labor force into occupations\)\. Tasks are backed by 502 reference files in 17 file formats, an average of 3\.9 reference files per task\. Most are sourced from real\-world public records, including federal agency releases \(e\.g\., CDC, EPA, EIA, Census, USDA, FRED, EEOC\), state and municipal portals \(city ordinances, court filings, public\-health and procurement records\), academic and research repositories, and open\-data platforms \(city open\-data hubs, Kaggle, GitHub data dumps\)\. Of the reference files in the main set, 51\.7% come from real\-world public records and the rest are synthesized\. All reference files in the easy set come from real\-world sources\.

![[Uncaptioned image]](https://arxiv.org/html/2605.26329v1/x2.png)

Figure 2: JobBench task distribution\. Top: the 10 SOC groups covered, with each pie showing the category’s share of the 130 total tasks\. Bottom: the 35 occupations grouped by category, with bar length giving per\-occupation task count\.

Task specification\. Each JobBench task is packaged as an agentic workplace bundle with:

- Query: A professional scenario that fixes the context and the deliverables the agent must produce\.
- Reference files: A workspace of heterogeneous source documents that the agent must read across and reconcile before running the required analysis\.
- Binary criteria: Binary checks anchored to verifiable numbers, facts, and documented professional judgments\.
- Rubrics: A reasoning chain whose nodes are criteria; the chain orders the judgments an expert would walk through to defend the underlying claim\. A rubric earns its weighted score only when every node passes, so a fluent answer that quietly drops the methodology check or the threshold flagging fails the chain\.

More task examples are shown in Appendix [E](https://arxiv.org/html/2605.26329#A5)\.

Table 2: Key statistics of JobBench\.

### 2\.3 Benchmark Construction

Occupation selection\. We target occupations that combine high automation desire with significant economic exposure\. We start from Workbank, which provides crowd\-sourced automation\-desire scores \(1\-5 scale\) for the O\*NET work duties of each occupation \[[34](https://arxiv.org/html/2605.26329#bib.bib30)\], and merge its entries with OEWS 2024 total wages to quantify economic exposure \[[36](https://arxiv.org/html/2605.26329#bib.bib63)\]\. The resulting 35 occupations are those with an average desire score above 3, ranked by economic exposure\. A feasibility filter then requires each retained work duty to be *digitalizable*, *evaluable*, and *supportable*, with an automation\-desire score above 3, to form the source pool for task design\. The final occupation distribution is shown in Figure [2](https://arxiv.org/html/2605.26329#S2.F2)\.
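The selection step described above can be sketched as a filter followed by a ranking. This is a minimal illustration, not the authors' pipeline: the `Occupation` record, the `select_occupations` helper, and every number in the toy data are hypothetical, and the sketch assumes that "ranked by economic exposure" means keeping the highest-exposure occupations among those whose mean desire score exceeds 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occupation:
    name: str
    mean_desire: float   # mean automation-desire score on the 1-5 scale
    total_wages: float   # economic-exposure proxy (illustrative numbers only)


def select_occupations(occupations, desire_threshold=3.0, top_k=35):
    """Keep occupations above the desire threshold, ranked by exposure."""
    eligible = [o for o in occupations if o.mean_desire > desire_threshold]
    ranked = sorted(eligible, key=lambda o: o.total_wages, reverse=True)
    return ranked[:top_k]


# Toy records: every number below is invented for illustration.
toy = [
    Occupation("reporters", 3.6, 2.1e9),
    Occupation("financial_managers", 3.2, 1.3e11),
    Occupation("hypothetical_low_desire_job", 2.4, 5.0e11),
    Occupation("customer_service_representatives", 3.4, 1.2e11),
]

selected = select_occupations(toy, top_k=2)
selected_names = [o.name for o in selected]
print(selected_names)
```

Note that the high-wage but low-desire record is excluded before ranking, which is the point of using desire as the gate and exposure only as the ordering.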

![Refer to caption](https://arxiv.org/html/2605.26329v1/x3.png)

Figure 3: JobBench expert onboarding: \(1\) the domain\-expert pool from Prolific, \(2\) onboarding experts onto the JobBench annotation platform, and \(3\) refinement and annotation\.

Expert pool\. Domain experts are recruited through Prolific \[[30](https://arxiv.org/html/2605.26329#bib.bib64)\], a research\-participant recruitment platform, and Upwork \[[37](https://arxiv.org/html/2605.26329#bib.bib65)\], a freelance marketplace\. The Prolific pool covers an average of 26\.5 distinct experts per occupation, as indexed by participant ID\. On Upwork, we search by occupation keyword and retain candidates whose job success rate exceeds 90%\. Selected experts complete a structured onboarding flow and are then assigned jobs on our annotation platform, which integrates with AI tools used during annotation and logs their usage for review and tracking\.

Task curation\. For each occupation, annotators draft task sketches grounded in the high\-desire work duties reported in Workbank, specifying the scenario, reasoning challenges, deliverables, and the standards that should be enforced\. Sketches are expanded with AI assistance into a workspace bundling reference files, self\-contained task queries, and rubric chains\. The Easy set tasks require no web\-search evidence and pose fewer reasoning challenges than the Main set\.

Every criterion in the rubric chain is required to satisfy:

- Self\-contained: it can be judged on its own, without inheriting outside context\.
- Binary: it resolves to a clean pass or fail, leaving no room for partial credit\.
- Objective: it ties to a verifiable artifact or a reproducible computation\.
- Unambiguous: the referent under check is named precisely enough that two graders cannot reasonably disagree on what is being evaluated\. We observed that ambiguity is the main reason different LLM judges disagree on the same answer during evaluation\.

Refine and filter\. A candidate task passes three quality gates before entering the benchmark:

- Automated audit: An audit agent checks the consistency between the task instruction and its reference files, the professional plausibility of the task itself, and the correctness of each rubric in the chain\. Tasks or rubrics that fail any of these checks are dropped\.
- Annotator review: Annotators polish the task instruction and prune low\-quality rubrics, and only tasks that receive positive annotator feedback advance\.
- Solve trial: Surviving tasks are run with different agents under multiple samplings and then judged by the rubrics\. We score each task by the union of rubrics passed across all runs, and retain only tasks whose union covers more than 90% of their own rubric set\.
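The solve-trial gate can be illustrated with plain set operations. The sketch below is an assumption-laden reading of the rule, not the released evaluation code: `union_pass_rate` and `keep_task` are hypothetical helper names, each run is represented as the set of rubric IDs it passed, and the 90% threshold is applied as a strict inequality because the text says "more than 90%".

```python
def union_pass_rate(runs, rubric_ids):
    """Fraction of a task's rubrics passed by at least one run.

    runs: iterable of sets, each holding the rubric ids one run passed.
    rubric_ids: all rubric ids defined for the task.
    """
    rubric_ids = set(rubric_ids)
    if not rubric_ids:
        raise ValueError("a task needs at least one rubric")
    passed_any = set().union(*runs) & rubric_ids
    return len(passed_any) / len(rubric_ids)


def keep_task(runs, rubric_ids, threshold=0.90):
    """Retain the task only if the union covers more than the threshold."""
    return union_pass_rate(runs, rubric_ids) > threshold


# Toy task with ten rubrics and three sampled runs (invented data).
rubrics = [f"r{i}" for i in range(1, 11)]
runs_a = [
    {"r1", "r2", "r3", "r4", "r5"},
    {"r4", "r5", "r6", "r7", "r8", "r9"},
    {"r10"},
]
runs_b = runs_a[:2]  # without the third run, r10 is never passed

print(union_pass_rate(runs_a, rubrics), keep_task(runs_a, rubrics))
print(union_pass_rate(runs_b, rubrics), keep_task(runs_b, rubrics))
```

No single run in the toy data passes more than six rubrics, yet the union covers all ten; the gate measures whether each rubric is achievable by some agent, not whether any one agent solves the task.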

71% of candidate tasks pass the three\-stage quality\-check pipeline\. The final union pass rate across the accepted benchmark is 95\.4%, meaning that more than 95% of all criteria were passed by at least one agent on at least one sample, evidence that the rubric set is achievable in practice\.

## 3 Experiments

### 3\.1 Experimental Setup

Table 3: JobBench\-Main leaderboard across different agentic scaffolds\.

#### Models and agentic scaffolds\.

We evaluate a representative set of recent agentic models across major proprietary and open\-source families\. The evaluated models include Anthropic Claude \[[4](https://arxiv.org/html/2605.26329#bib.bib2),[5](https://arxiv.org/html/2605.26329#bib.bib3),[8](https://arxiv.org/html/2605.26329#bib.bib6),[7](https://arxiv.org/html/2605.26329#bib.bib7),[6](https://arxiv.org/html/2605.26329#bib.bib4),[9](https://arxiv.org/html/2605.26329#bib.bib8),[3](https://arxiv.org/html/2605.26329#bib.bib5)\] \(Opus\-4, Opus\-4\.5, Opus\-4\.6, Opus\-4\.7, Sonnet\-4, Sonnet\-4\.5, Sonnet\-4\.6, and Haiku\-4\.5\); the OpenAI GPT\-5 series \[[23](https://arxiv.org/html/2605.26329#bib.bib9),[24](https://arxiv.org/html/2605.26329#bib.bib10),[25](https://arxiv.org/html/2605.26329#bib.bib11),[27](https://arxiv.org/html/2605.26329#bib.bib12),[28](https://arxiv.org/html/2605.26329#bib.bib13)\] \(GPT\-5, 5\.1, 5\.2, 5\.4, 5\.5\) together with its Codex variants \[[26](https://arxiv.org/html/2605.26329#bib.bib14)\] \(GPT\-5\.1\-Codex, 5\.2\-Codex, 5\.3\-Codex\); Google Gemini 3 \[[14](https://arxiv.org/html/2605.26329#bib.bib15)\] \(Pro and Flash\); Qwen\-3\.5\-Plus \[[32](https://arxiv.org/html/2605.26329#bib.bib20)\]; MiniMax\-M2\.5 \[[20](https://arxiv.org/html/2605.26329#bib.bib21)\]; Kimi\-K2\.5 \[[21](https://arxiv.org/html/2605.26329#bib.bib22)\]; and xAI Grok\-4\.2\-Fast \[[39](https://arxiv.org/html/2605.26329#bib.bib23)\]\.

We use four agentic scaffolds that span the major deployment surfaces: Claude Code \(v2\.1\.2\) \[[2](https://arxiv.org/html/2605.26329#bib.bib16)\], Codex CLI \(v0\.125\.0\) \[[22](https://arxiv.org/html/2605.26329#bib.bib17)\], OpenCode \(v1\.14\.18\) \[[35](https://arxiv.org/html/2605.26329#bib.bib18)\], and OpenClaw \(v2026\.3\.8\) \[[29](https://arxiv.org/html/2605.26329#bib.bib19)\]\. Each scaffold wraps a base model with its own tool\-use, planning, and file\-edit policies, including shell execution, multi\-file editing, sub\-agent delegation, context compression, and web browsing and fetching\. We always choose the maximum reasoning\-effort level that each model and scaffold supports by default\. The agent evaluation prompt is reported in Appendix [F\.1](https://arxiv.org/html/2605.26329#A6.SS1)\.

#### Task execution\.

Each task is presented as an isolated workspace, containing the local reference documents, a task instruction, and a dedicated output directory\. The agent reads the instructions, operates only within this workspace, and writes its final deliverables to the output directory\. Tasks run in non\-interactive headless agentic mode\. Agent runners use a 60\-minute wall\-clock timeout per task and grant the agent file access only to the task’s temporary workspace\. We use each scaffold’s default sampling configuration\.
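A runner of this shape can be sketched with the Python standard library. Everything here is an assumption for illustration: `run_task`, the `instruction.md` file name, and the `output/` directory name are hypothetical, and a trivial Python one-liner stands in for a real headless agent command. Setting the working directory does not by itself restrict file access; a real runner would need a sandbox or container for that.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_task(agent_cmd, instruction, reference_files, timeout_s=60 * 60):
    """Run one task in a throwaway workspace; return (status, outputs).

    reference_files maps file names to bytes. The layout (instruction.md,
    output/) is a hypothetical convention, not the benchmark's actual one.
    """
    with tempfile.TemporaryDirectory(prefix="jobbench_task_") as tmp:
        workspace = Path(tmp)
        (workspace / "output").mkdir()
        (workspace / "instruction.md").write_text(instruction, encoding="utf-8")
        for name, content in reference_files.items():
            (workspace / name).write_bytes(content)
        try:
            # cwd only sets the working directory; it is not a sandbox.
            proc = subprocess.run(
                agent_cmd, cwd=workspace, timeout=timeout_s, capture_output=True
            )
            status = "ok" if proc.returncode == 0 else "error"
        except subprocess.TimeoutExpired:
            status = "timeout"
        # Collect deliverables before the temporary workspace is deleted.
        outputs = {
            p.name: p.read_bytes()
            for p in (workspace / "output").iterdir()
            if p.is_file()
        }
    return status, outputs


# Stand-in "agent": a Python one-liner that writes a single deliverable.
stand_in_agent = [
    sys.executable,
    "-c",
    "from pathlib import Path; Path('output/report.txt').write_text('done')",
]
status, outputs = run_task(
    stand_in_agent, "Summarize the data.", {"data.csv": b"a,b\n1,2\n"}, timeout_s=30
)
print(status, sorted(outputs))
```

The default `timeout_s` mirrors the 60-minute wall-clock limit stated above; the usage example overrides it only to keep the demonstration short.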

#### Evaluation and judge\.

Each task is graded with rubrics by an LLM\-as\-a\-judge\. The judge evaluates each rubric jointly over its binary criteria\. A rubric $r$ receives its full weight $w_r$ only if every criterion passes; with the binary indicator $z_r \in \{0,1\}$, the per\-task normalized score is $s_t = \sum_r w_r z_r / \sum_r w_r$\. The leaderboard reports the average $\frac{1}{N}\sum_{t=1}^{N} s_t$ as the model performance\. The judge prompt is shown in Appendix [F\.2](https://arxiv.org/html/2605.26329#A6.SS2)\. For cost reasons, we use x\-ai/grok\-4\.1\-fast as the judge by default\. We validate the judge results against a stronger reference model, Opus\-4\.5\. The two judges agree on scores within 0\.7% variance across different configurations\. Switching judges reduced the full\-run judge API cost from \$38–\$46 with the Opus 4\.5 judge to \$1\.5–\$2 with Grok\-4\.1\-Fast, an approximately $20\times$ reduction\. See Section [3\.2](https://arxiv.org/html/2605.26329#S3.SS2) for more details\.
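The all-or-nothing scoring rule can be made concrete with a small sketch. It assumes the judge's verdicts are already available as booleans; the `Rubric` class and the `task_score` and `leaderboard_score` helpers are hypothetical names, and the weights and verdicts are invented.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    weight: float          # rubric weight
    criteria: list[bool]   # judge verdict for each binary criterion

    @property
    def passed(self) -> bool:
        # The rubric indicator is 1 only if every criterion in the chain passes.
        return all(self.criteria)


def task_score(rubrics):
    """Per-task score: weight earned by fully passed rubrics over total weight."""
    total = sum(r.weight for r in rubrics)
    earned = sum(r.weight for r in rubrics if r.passed)
    return earned / total


def leaderboard_score(tasks):
    """Unweighted mean of per-task scores."""
    return sum(task_score(t) for t in tasks) / len(tasks)


# Invented verdicts for two toy tasks.
task_a = [
    Rubric(3, [True, True, True]),
    Rubric(2, [True, False]),   # one failed criterion forfeits the whole rubric
    Rubric(5, [True]),
]
task_b = [
    Rubric(1, [False]),
    Rubric(1, [True, True]),
]

print(task_score(task_a), task_score(task_b))
print(leaderboard_score([task_a, task_b]))
```

In `task_a`, the second rubric passes one of its two criteria but contributes nothing, which is the behaviour the chained design is meant to enforce: a partially correct chain earns no credit.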

![Refer to caption](https://arxiv.org/html/2605.26329v1/x4.png)

Figure 4: Comparison between GDPVal and JobBench\. \(a\) Model scores on GDPVal, JobBench\-Easy, and JobBench\-Main\. GDPval scores approach saturation above 70, while JobBench\-Main scores stay below 40% and JobBench\-Easy lands in between\. \(b\) Runtime, tool\-call, and trajectory\-line complexity for GDPVal, JobBench\-Easy, and JobBench\-Main, normalized to GDPval $=1.0$\. \(c\) For GPT\-5\.4, JobBench\-Main tasks shift toward lower scores and longer runtimes than the easy set\.

### 3\.2 Main Results

Table [3](https://arxiv.org/html/2605.26329#S3.SS1) reports the JobBench leaderboard across 36 model–scaffold configurations on the main set\. We merge the four smallest occupational categories \(Sales and Related; Legal; Life, Physical, and Social Science; and Educational Instruction and Library\) into a single Others column\.

#### Challenging nature of JobBench\.

As shown in Table [3](https://arxiv.org/html/2605.26329#S3.SS1), agents still fall short on the tasks that experts actually want AI to perform\. The strongest configuration, Claude Opus 4\.7 under Claude Code, reaches only 45\.9 on JobBench, and the next two rows, GPT\-5\.5 under Codex \(42\.7\) and GPT\-5\.4 under Codex \(38\.9\), also remain below 50\. Beyond the Claude and GPT families, no configuration exceeds 19 points, and the weakest, Grok\-4\.2\-Fast, scores just 4\.38\. Today’s agents continue to make substantial errors on the complex, professional reasoning that experts most want AI to handle\.

#### GDPVal vs\. JobBench\.

Figure [4](https://arxiv.org/html/2605.26329#S3.F4)\(a\) shows that GDPval has approached saturation: GPT\-5\.3\-Codex reaches 70\.9 and GPT\-5\.4 reaches 83\.0 \(GDPval scores are taken from the official OpenAI report \[[31](https://arxiv.org/html/2605.26329#bib.bib26)\] and correspond to the win\+tie rate against industry\-expert reference deliverables\)\. The corresponding JobBench main\-set Codex scores, by contrast, are only 33\.7 and 38\.9\. Figure [4](https://arxiv.org/html/2605.26329#S3.F4)\(b\) further shows that, on JobBench\-Main, GPT\-5\.4 under Codex takes $2.40\times$ the runtime of GDPval, while tool calls and trajectory events both rise to roughly $1.3\times$ GDPval levels, indicating that JobBench\-Main demands substantially greater agentic work complexity\.

#### Main vs\. Easy

We compare the main set against the easy set, which has the same occupation coverage\. Figure [4](https://arxiv.org/html/2605.26329#S3.F4)\(c\) shows the shift\. Scores rise by 26–31 points on the easy set, and GPT\-5\.4 Codex traces shorten from 16\.4 to 6\.4 minutes of runtime\. Recall that the easy set by design has fewer reasoning challenges, multi\-source file conflicts, and deliverable files, and requires no web search during task completion: all reference files are in the local workspace\.

#### Reasoning effort\.

Figure [6](https://arxiv.org/html/2605.26329#S3.F6) shows that sweeping GPT\-5\.4 under Codex from low to xhigh reasoning effort yields a monotonic JobBench\-Main performance gain of $+7.0$ points\.

![Refer to caption](https://arxiv.org/html/2605.26329v1/x5.png)

Figure 5: Cost analysis for JobBench evaluation\. \(a\) Full\-run inference cost \(USD\) on the main set, computed by provider list pricing\. \(b\) Judge score agreement and \(c\) full\-run judge cost for Claude Opus 4\.5 and Grok\-4\.1\-Fast as the judge\. Grok\-4\.1\-Fast tracks the Opus 4\.5 judge within 0\.1–0\.7 points variance, while reducing full\-run judge cost by roughly $20\times$, from \$40 to \$2\.
#### Scaffold matters as much as base model\.

Scaffold choice can shift the overall score by several points when the base model is fixed\. Claude Sonnet 4\.6 scores 36\.9 under Claude Code but only 30\.6 under OpenClaw; Claude Opus 4\.5 scores 32\.3 under Claude Code versus 29\.1 under OpenCode; GPT\-5\.4 scores 38\.9 under Codex versus 34\.6 under OpenCode\. The induced gaps are largest in Office / Admin Support, Computer / Mathematical, and Management\.

![Refer to caption](https://arxiv.org/html/2605.26329v1/x6.png)

Figure 6: GPT\-5\.4 under Codex on JobBench\-Main as reasoning effort scales from low to xhigh\.

#### Performance vs\. Cost\.

Figure [5](https://arxiv.org/html/2605.26329#S3.F5)\(a\) plots overall JobBench\-Main score against full\-run inference cost at provider list pricing\. Costs span from \$32 for GPT\-5\.3\-Codex under Codex CLI to \$210 for Claude Opus\-4\.7 under Claude Code\. GPT\-5\.5 dominates the mid\-tier with 42\.7 points at \$44 \(cheaper *and* higher\-scoring in the same scaffold\), and Opus\-4\.7 is the only configuration above 43 points, but at a $\sim 5\times$ premium over GPT\-5\.5\.

#### Judge cost optimization\.

Figure [5](https://arxiv.org/html/2605.26329#S3.F5)\(b\) and \(c\) compare Claude Opus\-4\.5 and Grok\-4\.1\-Fast as the rubric judge under different model–scaffold configurations\. Figure [5](https://arxiv.org/html/2605.26329#S3.F5)\(b\) reports that the two judges agree on judge scores within 0\.7 point variance, while Figure [5](https://arxiv.org/html/2605.26329#S3.F5)\(c\) shows Grok\-4\.1\-Fast reduces full\-run judge cost from \$38–\$46 \(Opus 4\.5\) to \$1\.97, an approximately $20\times$ reduction\.
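As a quick check of the stated ratio, the endpoints of the Opus 4.5 judge cost range can be divided by the Grok-4.1-Fast cost quoted in this paragraph. This uses only the figures given here and is a rough bracket, not a per-configuration breakdown:

```latex
\frac{38}{1.97} \approx 19.3,
\qquad
\frac{46}{1.97} \approx 23.4
```

Both endpoints land near 20, consistent with the approximate reduction reported.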

### 3\.3 Occupational Analysis

#### Setup\.

We further study how the research and startup communities allocate attention across high\-automation\-desire occupations\. We tag 3,516 agentic\-related LLM arXiv paper abstracts \(April 2025–March 2026\) and 2,283 AI\-related YC company descriptions \(batches 2006–W26\) with GPT\-5\.4\-mini, deciding for each whether it falls within the 35 high\-desire occupations and corresponding work duties covered by JobBench\. This yields 2,932 papers and 2,089 startups labeled with at least one matching occupation; a single paper or company may map to multiple occupations\. For each occupation, we plot its mean JobBench\-Main score \(averaged across all Claude and GPT family models\) against its WORKBank automation\-desire score in Figure [7](https://arxiv.org/html/2605.26329#S3.F7), with the per\-occupation paper/startup count encoded as bubble size\. The median split of capability against desire defines the high\-capability *Sweet Zone* and low\-capability *Research & Develop \(R&D\)* quadrants\. We compute the Pearson correlation $r$ between attention \(paper/startup count per occupation\) and JobBench score\.
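The two quantities in this setup, the Pearson correlation and the median-split quadrant labels, can be sketched as follows. The data are invented, `pearson_r` and `quadrant` are hypothetical helpers, and the quadrant rule is one assumed reading of the median split: both named quadrants sit in the upper-desire half and differ only in capability.

```python
import math
from statistics import median


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def quadrant(score, desire, score_median, desire_median):
    """Assumed rule: label only the upper-desire half, split by capability."""
    if desire < desire_median:
        return "other"
    return "sweet_zone" if score >= score_median else "r_and_d"


# Invented per-occupation values: capability score, desire score, paper count.
scores = [10.0, 20.0, 30.0, 40.0]
desires = [3.2, 3.8, 3.9, 3.3]
attention = [40, 30, 20, 10]

r = pearson_r(attention, scores)
score_med, desire_med = median(scores), median(desires)
labels = [quadrant(s, d, score_med, desire_med) for s, d in zip(scores, desires)]
print(round(r, 3), labels)
```

In the toy data, attention falls as capability rises, so the correlation is strongly negative; the reported correlations are much weaker, and this sketch shows only the direction of the computation.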

#### Both research papers and startups concentrate more on the R&D quadrant\.

Figure [7](https://arxiv.org/html/2605.26329#S3.F7)\(a\) and \(b\) show that attention correlates *negatively* with model capability \($r=-0.15$ for papers, $r=-0.34$ for YC\), and the R&D\-to\-Sweet\-Zone attention ratio exceeds one in both cases \($1.56$ for papers, $1.62$ for YC\)\. In other words, R&D\-quadrant occupations attract more attention than the Sweet\-Zone ones that agents already handle well\. Figure [7](https://arxiv.org/html/2605.26329#S3.F7)\(c\) decomposes this attention into a per\-occupation gap, isolating where the two communities focus differently on the same areas\. Research piles disproportionately onto computer\_and\_information\_research\_scientists \($+20.5$ pp gap\), likely reflecting the recent boom in LLM\-related research, and also leans toward knowledge\-heavy occupations such as social\_science\_research\_assistants\. Startups instead concentrate on customer\_service\_representatives \($+4.6$\) and financial\_managers \($+4.4$\), spreading attention toward more economically lucrative occupations\.

![Refer to caption](https://arxiv.org/html/2605.26329v1/x7.png)

Figure 7: Occupation analysis for research and startup attention over JobBench\. Each occupation is placed by JobBench\-Main scores on the x\-axis \(mean across all Claude and GPT family models\) and by the WorkBank expert automation\-desire score on the y\-axis; dashed median splits define the high\-capability *Sweet Zone* and low\-capability *Research & Develop \(R&D\)* quadrants\. \(a\) Bubble size encodes the count of agentic\-LLM arXiv papers \(April 2025–March 2026; 2,932 unique papers tagged by GPT\-5\.4\-mini against the 35 high\-desire occupations\)\. \(b\) Bubble size encodes the per\-occupation count of AI\-related YC companies \(batches 2006–W26; 2,089 unique companies\)\. Attention correlates *negatively* with capability \($r=-0.15$ for papers, $r=-0.34$ for YC\), and the R&D\-to\-Sweet\-Zone attention ratio exceeds one \($1.56$ and $1.62$\); external attention therefore concentrates more on the R&D quadrant than on the Sweet Zone, which agents already handle well\. \(c\) Attention divergence: research leans toward knowledge\-heavy occupations, while startups instead concentrate on more economically significant areas\.

## 4 Conclusion

In this paper, we introduce JobBench, a benchmark that aligns agentic evaluation with human will instead of focusing only on economic value\. Its 130 tasks span 35 occupations, each built from a Workbank\-elicited delegation preference, packaged as a workspace of heterogeneous reference files, and graded by chained rubrics whose 4,631 binary criteria award credit only when every step in the chain holds together\. Across 36 model–scaffold configurations, the strongest setup, Claude Opus 4\.7 under Claude Code, reaches 45\.9%, and outside the Claude and GPT families no configuration exceeds 20%\. JobBench is a leaderboard for tracking agent capability on work duties whose automation would most directly enhance workers’ satisfaction and productivity\. We hope it shifts the community’s labour\-market target from replacement to enhancement, building agents that do what humans actually want delegated\.

## Acknowledgment

This work is partially supported by the National Science Foundation \(NSF\) AI Institute for Agent\-based Cyber Threat Intelligence and Operation \(ACTION\) under grant IIS 2229876, the Office of Naval Research \(ONR\) under grant N0014\-23\-1\-2386, and the Air Force Office of Scientific Research \(AFOSR\) under grant FA9550\-23\-1\-0208\.

This work is supported in part by funds provided by the National Science Foundation, Department of Homeland Security, and IBM\. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author\(s\) and do not necessarily reflect the views of the NSF or its federal agency and industry partners\.

## References

- \[1\]A\. F\. Akyürek, A\. Gosai, C\. B\. C\. Zhang, V\. Gupta, J\. Jeong, A\. Gunjal, T\. Rabbani, M\. Mazzone, D\. Randolph, M\. M\. Meymand,et al\.\(2025\)PRBench: large\-scale expert rubrics for evaluating high\-stakes professional reasoning\.arXiv preprint arXiv:2511\.11562\.External Links:[Link](https://arxiv.org/abs/2511.11562)Cited by:[Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px3.p1.1)\.
- \[2\]Anthropic\(2025\)Claude Code: an agentic coding tool for the terminal\.Note:[https://docs\.claude\.com/en/docs/claude\-code/overview](https://docs.claude.com/en/docs/claude-code/overview)Source:[https://github\.com/anthropics/claude\-code](https://github.com/anthropics/claude-code)Cited by:[§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p2.1)\.
- \[3\]Anthropic\(2025\-10\)System card: Claude Haiku 4\.5\.Note:[https://www\.anthropic\.com/claude\-haiku\-4\-5\-system\-card](https://www.anthropic.com/claude-haiku-4-5-system-card)Cited by:[§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[4\]Anthropic\(2025\-05\)System card: Claude Opus 4 & Claude Sonnet 4\.Note:[https://www\.anthropic\.com/claude\-4\-system\-card](https://www.anthropic.com/claude-4-system-card)Covers Claude Opus 4, Sonnet 4, and the broader Claude 4 model familyCited by:[§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[5\]Anthropic\(2025\-11\)System card: Claude Opus 4\.5\.Note:[https://www\.anthropic\.com/claude\-opus\-4\-5\-system\-card](https://www.anthropic.com/claude-opus-4-5-system-card)Cited by:[§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[6\]Anthropic\(2025\-09\)System card: Claude Sonnet 4\.5\.Note:[https://www\.anthropic\.com/claude\-sonnet\-4\-5\-system\-card](https://www.anthropic.com/claude-sonnet-4-5-system-card)Cited by:[§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[7\]Anthropic\(2026\-04\)Introducing Claude Opus 4\.7\.Note:[https://www\.anthropic\.com/news/claude\-opus\-4\-7](https://www.anthropic.com/news/claude-opus-4-7)Release announcement; the corresponding system card is listed at[https://www\.anthropic\.com/system\-cards](https://www.anthropic.com/system-cards)Cited by:[§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[8\]Anthropic\(2026\-02\)System card: Claude Opus 4\.6\.Note:[https://www\.anthropic\.com/claude\-opus\-4\-6\-system\-card](https://www.anthropic.com/claude-opus-4-6-system-card)Cited by:[§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[9\]Anthropic\(2026\-02\)System card: Claude Sonnet 4\.6\.Note:[https://www\.anthropic\.com/claude\-sonnet\-4\-6\-system\-card](https://www.anthropic.com/claude-sonnet-4-6-system-card)Cited by:[§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[10\] E\. Brynjolfsson, D\. Li, and L\. R\. Raymond \(2023\) Generative AI at work\. Note: NBER Working Paper No\. 31161\. External Links: [Link](https://arxiv.org/abs/2304.11771)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px4.p1.1), [Appendix G](https://arxiv.org/html/2605.26329#A7.SS0.SSS0.Px2.p1.1)\.
- \[11\] W\. X\. Chen, S\. Srinivasan, and S\. Zakerinia \(2024\) Displacement or complementarity? the labor market impact of generative AI\. Note: Harvard Business School Working Paper 25\-039\. External Links: [Link](https://www.hbs.edu/ris/Publication%20Files/25-039_05fbec84-1f23-459b-8410-e3cd7ab6c88a.pdf)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px4.p1.1)\.
- \[12\] A\. Drouin, M\. Gasse, M\. Caccia, I\. H\. Laradji, M\. Del Verme, T\. Marty, L\. Boisvert, M\. Thakkar, Q\. Cappart, D\. Vazquez, N\. Chapados, and A\. Lacoste \(2024\) WorkArena: how capable are web agents at solving common knowledge work tasks? External Links: [Link](https://arxiv.org/abs/2403.07718)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px2.p1.1)\.
- \[13\] T\. Eloundou, S\. Manning, P\. Mishkin, and D\. Rock \(2023\) GPTs are GPTs: an early look at the labor market impact potential of large language models\. External Links: [Link](https://arxiv.org/abs/2303.10130)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px4.p1.1)\.
- \[14\] Google DeepMind \(2025\-12\) Gemini 3 model documentation\. Note: [https://storage\.googleapis\.com/deepmind\-media/Model\-Cards/Gemini\-3\-Pro\-Model\-Card\.pdf](https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf)\. Gemini 3 Pro model card; Gemini 3 Flash documentation is available at [https://deepmind\.google/models/gemini/flash/](https://deepmind.google/models/gemini/flash/)\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[15\] D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\) Measuring massive multitask language understanding\. In International Conference on Learning Representations\. External Links: [Link](https://arxiv.org/abs/2009.03300)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px1.p1.1)\.
- \[16\] C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan \(2024\) SWE\-bench: can language models resolve real\-world GitHub issues? In International Conference on Learning Representations\. External Links: [Link](https://arxiv.org/abs/2310.06770)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px2.p1.1)\.
- \[17\] X\. Liu et al\. \(2024\) AgentBench: evaluating LLMs as agents\. In International Conference on Learning Representations\. External Links: [Link](https://arxiv.org/abs/2308.03688)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px2.p1.1)\.
- \[18\] M\. Mazeika et al\. \(2025\) Remote labor index: measuring AI automation of remote work\. External Links: [Link](https://arxiv.org/abs/2510.26787), 2510\.26787\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px3.p1.1), [§1](https://arxiv.org/html/2605.26329#S1.p1.1)\.
- \[19\] G\. Mialon, C\. Fourrier, C\. Swift, T\. Wolf, Y\. LeCun, and T\. Scialom \(2024\) GAIA: a benchmark for general AI assistants\. In International Conference on Learning Representations\. External Links: [Link](https://arxiv.org/abs/2311.12983)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px2.p1.1)\.
- \[20\] MiniMax \(2026\-02\) MiniMax M2\.5: built for real\-world productivity\. Note: [https://www\.minimax\.io/news/minimax\-m25](https://www.minimax.io/news/minimax-m25)\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[21\] Moonshot AI \(2026\-01\) Kimi K2\.5\. Note: [https://github\.com/MoonshotAI/Kimi\-K2\.5](https://github.com/MoonshotAI/Kimi-K2.5)\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[22\] OpenAI \(2025\) Codex CLI: a lightweight coding agent for the terminal\. Note: [https://developers\.openai\.com/codex/cli](https://developers.openai.com/codex/cli)\. Source: [https://github\.com/openai/codex](https://github.com/openai/codex)\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p2.1)\.
- \[23\] OpenAI \(2025\-08\) GPT\-5 system card\. Note: [https://openai\.com/index/gpt\-5\-system\-card/](https://openai.com/index/gpt-5-system-card/)\. Also available as arXiv:2601\.03267\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[24\] OpenAI \(2025\-11\) GPT\-5\.1 instant and GPT\-5\.1 thinking system card addendum\. Note: [https://openai\.com/index/gpt\-5\-system\-card\-addendum\-gpt\-5\-1/](https://openai.com/index/gpt-5-system-card-addendum-gpt-5-1/)\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[25\] OpenAI \(2025\-12\) Update to GPT\-5 system card: GPT\-5\.2\. Note: [https://openai\.com/index/gpt\-5\-system\-card\-update\-gpt\-5\-2/](https://openai.com/index/gpt-5-system-card-update-gpt-5-2/)\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[26\] OpenAI \(2026\-02\) GPT\-5 Codex system cards\. Note: [https://cdn\.openai\.com/pdf/23eca107\-a9b1\-4d2c\-b156\-7deb4fbc697c/GPT\-5\-3\-Codex\-System\-Card\-02\.pdf](https://cdn.openai.com/pdf/23eca107-a9b1-4d2c-b156-7deb4fbc697c/GPT-5-3-Codex-System-Card-02.pdf)\. Primary citation for GPT\-5\.3\-Codex; GPT\-5\-Codex and GPT\-5\.2\-Codex are documented at [https://cdn\.openai\.com/pdf/97cc5669\-7a25\-4e63\-b15f\-5fd5bdc4d149/gpt\-5\-codex\-system\-card\.pdf](https://cdn.openai.com/pdf/97cc5669-7a25-4e63-b15f-5fd5bdc4d149/gpt-5-codex-system-card.pdf) and [https://openai\.com/index/gpt\-5\-2\-codex\-system\-card/](https://openai.com/index/gpt-5-2-codex-system-card/)\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[27\] OpenAI \(2026\-03\) GPT\-5\.4 thinking system card\. Note: [https://openai\.com/index/gpt\-5\-4\-thinking\-system\-card/](https://openai.com/index/gpt-5-4-thinking-system-card/)\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[28\] OpenAI \(2026\-04\) GPT\-5\.5 system card\. Note: [https://openai\.com/index/gpt\-5\-5\-system\-card/](https://openai.com/index/gpt-5-5-system-card/)\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[29\] OpenClaw Project \(2026\) OpenClaw: a self\-hosted personal AI assistant platform\. Note: [https://github\.com/openclaw/openclaw](https://github.com/openclaw/openclaw)\. Version 2026\.3\.8 used in the experiments\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p2.1)\.
- \[30\] S\. Palan and C\. Schitter \(2018\) Prolific\.ac—a subject pool for online experiments\. Journal of Behavioral and Experimental Finance 17, pp\. 22–27\. External Links: [Document](https://dx.doi.org/10.1016/j.jbef.2017.12.004)\. Cited by: [§2\.3](https://arxiv.org/html/2605.26329#S2.SS3.p2.1)\.
- \[31\] T\. Patwardhan, R\. Dias, E\. Proehl, G\. Kim, M\. Wang, O\. Watkins, S\. P\. Fishman, M\. Aljubeh, P\. Thacker, L\. Fauconnet, et al\. \(2025\) GDPVal: evaluating AI model performance on real\-world economically valuable tasks\. arXiv preprint arXiv:2510\.04374\. External Links: [Link](https://arxiv.org/abs/2510.04374)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px3.p1.1), [§1](https://arxiv.org/html/2605.26329#S1.p1.1), [§2\.1](https://arxiv.org/html/2605.26329#S2.SS1.p2.1), [Table 1](https://arxiv.org/html/2605.26329#S2.T1), [Table 1](https://arxiv.org/html/2605.26329#S2.T1.12.2), [footnote 3](https://arxiv.org/html/2605.26329#footnote3)\.
- \[32\] Qwen Team \(2026\-02\) Qwen3\.5\-Plus model release\. Note: [https://qwen\.ai/blog?id=qwen3\.5](https://qwen.ai/blog?id=qwen3.5)\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[33\] D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2024\) GPQA: a graduate\-level google\-proof q&a benchmark\. In Conference on Language Modeling \(COLM\)\. External Links: [Link](https://arxiv.org/abs/2311.12022)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px1.p1.1)\.
- \[34\] Y\. Shao, H\. Zope, Y\. Jiang, J\. Pei, D\. Nguyen, E\. Brynjolfsson, and D\. Yang \(2025\) Future of work with AI agents: auditing automation and augmentation potential across the U\.S\. workforce\. arXiv preprint arXiv:2506\.06576\. External Links: [Link](https://arxiv.org/abs/2506.06576), [Document](https://dx.doi.org/10.48550/arXiv.2506.06576)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px4.p1.1), [Figure 1](https://arxiv.org/html/2605.26329#S0.F1), [Figure 1](https://arxiv.org/html/2605.26329#S0.F1.3.2), [§1](https://arxiv.org/html/2605.26329#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.26329#S2.SS1.p1.1), [§2\.1](https://arxiv.org/html/2605.26329#S2.SS1.p4.1), [§2\.3](https://arxiv.org/html/2605.26329#S2.SS3.p1.1)\.
- \[35\] SST \(2025\) OpenCode: the open source AI coding agent\. Note: [https://opencode\.ai](https://opencode.ai/)\. Source: [https://github\.com/sst/opencode](https://github.com/sst/opencode)\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p2.1)\.
- \[36\] U\.S\. Bureau of Labor Statistics \(2025\-04\) Occupational employment and wage statistics, May 2024\. Note: [https://www\.bls\.gov/oes/tables\.htm](https://www.bls.gov/oes/tables.htm)\. National employment and wage estimates by occupation \(SOC\), released April 2025\. Cited by: [§2\.3](https://arxiv.org/html/2605.26329#S2.SS3.p1.1)\.
- \[37\] Upwork Inc\. \(2026\) Upwork: the world’s work marketplace\. Note: [https://www\.upwork\.com/](https://www.upwork.com/)\. Online freelance labor platform; accessed 2026\. Cited by: [§2\.3](https://arxiv.org/html/2605.26329#S2.SS3.p2.1)\.
- \[38\] Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, et al\. \(2024\) MMLU\-Pro: a more robust and challenging multi\-task language understanding benchmark\. Advances in Neural Information Processing Systems 37, pp\. 95266–95290\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px1.p1.1)\.
- \[39\] xAI \(2026\-03\) Grok 4\.2 Fast\. Cited by: [§3\.1](https://arxiv.org/html/2605.26329#S3.SS1.SSS0.Px1.p1.1)\.
- \[40\] T\. Xie et al\. \(2024\) OSWorld: benchmarking multimodal agents for open\-ended tasks in real computer environments\. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track\. External Links: [Link](https://arxiv.org/abs/2404.07972)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px2.p1.1)\.
- \[41\] Q\. Yang, Y\. Liu, J\. Li, J\. Bai, H\. Chen, K\. Chen, T\. Duan, J\. Dong, X\. Hu, Z\. Jia, Y\. Liu, T\. Peng, Y\. Ren, R\. Tian, Z\. Wang, Y\. Xiao, G\. Yao, L\. Yin, G\. Zhang, C\. Zhang, J\. Jiao, Z\. Zheng, and Y\. Gong \(2026\) $OneMillion\-Bench: how far are language agents from human experts? arXiv preprint arXiv:2603\.07980\. External Links: 2603\.07980, [Link](https://arxiv.org/abs/2603.07980)\. Cited by: [§1](https://arxiv.org/html/2605.26329#S1.p1.1)\.
- \[42\] S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2022\) WebShop: towards scalable real\-world web interaction with grounded language agents\. In Advances in Neural Information Processing Systems\. External Links: [Link](https://arxiv.org/abs/2207.01206)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px2.p1.1)\.
- \[43\] S\. Yao, N\. Shinn, P\. Razavi, and K\. Narasimhan \(2024\) τ\-Bench: a benchmark for tool\-agent\-user interaction in real\-world domains\. External Links: [Link](https://arxiv.org/abs/2406.12045)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px2.p1.1)\.
- \[44\] X\. Yue, Y\. Ni, K\. Zhang, T\. Zheng, R\. Liu, G\. Zhang, S\. Stevens, D\. Jiang, W\. Ren, Y\. Sun, et al\. \(2024\) MMMU: a massive multi\-discipline multimodal understanding and reasoning benchmark for expert AGI\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 9556–9567\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px1.p1.1)\.
- \[45\] S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig \(2024\) WebArena: a realistic web environment for building autonomous agents\. In International Conference on Learning Representations\. External Links: [Link](https://arxiv.org/abs/2307.13854)\. Cited by: [Appendix A](https://arxiv.org/html/2605.26329#A1.SS0.SSS0.Px2.p1.1)\.

- [A Related Work](https://arxiv.org/html/2605.26329#A1)
- [B Limitations, Ethics, Broader Impact, and LLM Usage](https://arxiv.org/html/2605.26329#A2)
- [C Leaderboard](https://arxiv.org/html/2605.26329#A3)
- [D Task Split by Occupation](https://arxiv.org/html/2605.26329#A4)
- E Representative Task Examples
  - [E\.1 Reporters and Correspondents](https://arxiv.org/html/2605.26329#A5.SS1)
  - [E\.2 Data Entry Keyers](https://arxiv.org/html/2605.26329#A5.SS2)
  - [E\.3 Lawyers](https://arxiv.org/html/2605.26329#A5.SS3)
  - [E\.4 Web Administrators](https://arxiv.org/html/2605.26329#A5.SS4)
- F Prompt Templates and Evaluation Interface
  - [F\.1 Agent Evaluation Prompt](https://arxiv.org/html/2605.26329#A6.SS1)
  - [F\.2 Judge Prompt](https://arxiv.org/html/2605.26329#A6.SS2)
- [G Supplementary Analysis on JobBench’s Position in AI Labour Market](https://arxiv.org/html/2605.26329#A7)

## Appendix A Related Work

#### Frontier knowledge and reasoning benchmarks\.

Many benchmarks score whether a model reaches the correct answer on increasingly hard exam\-style or code\-style prompts\. MMLU \[[15](https://arxiv.org/html/2605.26329#bib.bib31)\] established a 57\-subject baseline, GPQA \[[33](https://arxiv.org/html/2605.26329#bib.bib35)\] pushes to graduate\-level science questions designed to resist web search, and MMLU\-Pro \[[38](https://arxiv.org/html/2605.26329#bib.bib25)\] expands the answer space to reward multi\-step reasoning\. MMMU \[[44](https://arxiv.org/html/2605.26329#bib.bib24)\] extends expert\-level evaluation to multimodal inputs\.

#### Agentic benchmarks in interactive environments\.

A parallel line of work targets agents that act rather than answer\. SWE\-bench \[[16](https://arxiv.org/html/2605.26329#bib.bib44)\] measures whether agents resolve real GitHub issues with a working patch, WebArena and WorkArena \[[45](https://arxiv.org/html/2605.26329#bib.bib42), [12](https://arxiv.org/html/2605.26329#bib.bib47)\] instantiate enterprise web stacks, and OSWorld \[[40](https://arxiv.org/html/2605.26329#bib.bib43)\] extends the setting to full desktop environments\. GAIA \[[19](https://arxiv.org/html/2605.26329#bib.bib41)\], AgentBench \[[17](https://arxiv.org/html/2605.26329#bib.bib40)\], WebShop \[[42](https://arxiv.org/html/2605.26329#bib.bib46)\], and τ\-bench \[[43](https://arxiv.org/html/2605.26329#bib.bib48)\] cover assistant tool use, multi\-environment generality, grounded shopping, and tool\-agent\-user interaction, respectively\. These benchmarks score whether an agent can *reach* a target state in an environment\.

#### Workplace and occupational benchmarks\.

Recent benchmarks target the work that experts actually perform\. GDPVal \[[31](https://arxiv.org/html/2605.26329#bib.bib26)\] frames AI progress in terms of economically valuable tasks judged by expert graders, PRBench \[[1](https://arxiv.org/html/2605.26329#bib.bib27)\] supplies expert\-authored rubrics for legal and financial reasoning, and the Remote Labor Index \[[18](https://arxiv.org/html/2605.26329#bib.bib29)\] scores end\-to\-end remote\-work projects and surfaces deliverable\-level failure clusters\. These benchmarks are scoped primarily by economic value, which tells a replacement story\. JobBench instead evaluates AI agents on the workflows that experts identify as high\-priority for delegation, empowering humans based on their needs rather than replacing them on the basis of GDP value\.

#### Labor\-market exposure\.

Recent work measures how AI maps onto the U\.S\. workforce\. Eloundou et al\. \[[13](https://arxiv.org/html/2605.26329#bib.bib55)\] estimate task\-level occupational exposure to LLMs, Brynjolfsson et al\. \[[10](https://arxiv.org/html/2605.26329#bib.bib56)\] measure productivity gains from a customer\-support deployment, and Chen et al\. \[[11](https://arxiv.org/html/2605.26329#bib.bib62)\] study early displacement and complementarity effects in the labor market\. Closer to task design, Workbank \[[34](https://arxiv.org/html/2605.26329#bib.bib30)\] surveys over 1,500 U\.S\. workers on the O\*NET task taxonomy and labels each task with a reported delegation preference and a desired human\-AI collaboration mode\. However, it covers only worker surveys about specific work duties\. JobBench builds on these signals and turns them into benchmark task packages\.

## Appendix B Limitations, Ethics, Broader Impact, and LLM Usage

### Limitations

This study is limited to U\.S\.\-centered, digital, document\-heavy professional tasks across 35 selected O\*NET occupations\. It does not represent all occupations, non\-U\.S\. labor markets, non\-English workplaces, physical work, real\-time collaboration, or long\-term organizational workflows\. The dataset is designed for benchmark evaluation, not deployment validation\. It should not be used to decide whether an AI system can replace workers, make professional judgments, or operate without human oversight in legal, medical, financial, engineering, public\-sector, or other high\-stakes settings\.

### Ethics Statement

This study involved human participants who provided data annotation through an online annotation platform\. The University Human Subjects Division \(HSD\) reviewed the protocol and determined, with IRB approval, that the research qualifies as exempt human subjects research \(Category 3; minimal\-risk behavioral research\) under U\.S\. federal regulations\. Participation was voluntary, and participants could stop at any time\. No sensitive personal information was collected\. Only aggregated benchmark results are publicly released, and no identifiable participant data are included in any publications or datasets\. The exempt determination letter is on file with the authors\. All participating annotators were informed of and consented to the use of their work for research\. We are committed to crediting and fairly compensating all human annotators in accordance with occupational wage standards\.

This study is not intended to include direct personal identifiers of private individuals\. Names and person\-specific details of private individuals appearing in task text or reference files are fictional, redacted, or replaced with synthetic placeholders\. Real reference files, where included, are derived from public records, official materials, or public reporting\. These files may describe public activities of public officials acting in their official capacity or appearing in public contexts\. The dataset may still contain information about geography, language, socio\-economic status, experience or seniority, and occupational context, sourced from public records in which personally identifiable information has been masked or redacted\. Public references are included only as contextual source material and should not be used for impersonation, profiling, harassment, surveillance, or making decisions about individuals\.

### Broader Impact

Positive impact: This study supports the evaluation of AI agents on work that professionals report wanting help with, encouraging augmentation rather than replacement, and highlighting where current systems still fail on realistic workplace tasks\.

Risks: The dataset could be misused as evidence that agents are ready to replace workers, or as a proxy for safe deployment in legal, financial, engineering, or public\-sector settings\. Because it is U\.S\.\-centric and document\-heavy, it may underrepresent non\-U\.S\., non\-English, physical\-labor, and low\-resource work contexts\.

Mitigations: This study is framed as an evaluation benchmark only, with explicit limitations against deployment certification or worker replacement claims\. The release should include usage terms, data provenance notes, and guidance requiring expert validation and human oversight for any high\-stakes use\.

### LLM Usage

We used LLMs to support the presentation of this manuscript, including assistance with writing, editing, and improving the clarity of presentation\.

## Appendix C Leaderboard

Figure [8](https://arxiv.org/html/2605.26329#A3.F8) presents the leaderboard for the JobBench main set, ranking 15 models\.

![[Uncaptioned image]](https://arxiv.org/html/2605.26329v1/x8.png)

Figure 8: Leaderboard on the JobBench main set \(Claude Code for Anthropic models, Codex for OpenAI models, OpenCode for the remaining models\)\.

## Appendix D Task Split by Occupation

In this appendix, we report the per\-occupation task split\. Table [4](https://arxiv.org/html/2605.26329#A4.T4) reports the number of tasks each occupation contributes to the 65\-task main set and the 65\-task easy set\.

Table 4: Per\-occupation task counts on JobBench\.

## Appendix E Representative Task Examples

In this appendix, we show four representative JobBench tasks drawn from the main set: one Reporters task \(Section [E\.1](https://arxiv.org/html/2605.26329#A5.SS1)\), one Data Entry Keyers task \(Section [E\.2](https://arxiv.org/html/2605.26329#A5.SS2)\), one Lawyers task \(Section [E\.3](https://arxiv.org/html/2605.26329#A5.SS3)\), and one Web Administrators task \(Section [E\.4](https://arxiv.org/html/2605.26329#A5.SS4)\)\. For each task, we present three views: a one\-page schematic \(assignment brief, evidence files, reasoning burden, required deliverables, and rubric pressure points\), the verbatim agent instruction, and a reviewer\-facing parse of the task rubrics and the task card\.

### E\.1 Reporters: Lead Exposure Editorial Package

Reporters and Correspondents – Task 1: Pre\-publication editorial package

Assignment Brief: Act as an investigative reporter at a Connecticut regional outlet\. Build a front\-page package on lead in drinking water and childhood health, following the editor’s assignment email and checking interview claims against public data\.

Evidence Files:

- Editor assignment email
- Hartford\-area water lead samples CSV
- CT 2024 lead surveillance report
- NRDC Newark drinking\-water factsheet
- Dr\. Martinez interview notes

Reasoning Burden:

- Cross\-source reconciliation: Map FOIA water samples to city\-level surveillance claims\.
- Quantitative validation: Compute 2020–2024 lead\-level trends and CT\-vs\-national blood\-lead comparisons\.
- Editorial grounding: Pair quotes with verified data and avoid unsupported water\-versus\-paint claims\.
- Regulatory precision: Use the EPA LCRI action\-level timeline correctly\.

Required Deliverables:

- pitch\_memo\.docx: thesis and evidentiary backbone\.
- data\_analysis\.xlsx: water trends, CT vs national, and city cross\-reference sheets\.
- source\_log\.csv: 15\+ verification entries across all sources\.

Rubric Pressure Points: High\-risk community mapping; lead\-level percentage changes; CT surveillance extraction; Newark numerical parallel; interview claim verification; 15\-entry source log coverage; workbook sheet completeness\.

Figure 9: Task schematic for Reporters\. The agent receives an editor’s assignment and a heterogeneous evidence pack of water\-quality CSVs, state surveillance reports, regulatory documents, and interview notes, and must deliver a pitch memo, a workbook of cross\-source analyses, and a 15\-entry source log\.

Instruction

You are an investigative reporter at a regional Connecticut news outlet\. Your editor, Sarah Chen, has emailed you \(editors\_assignment\_email\.txt\) requesting a pre\-publication editorial package for a front\-page story on lead in drinking water and childhood health\. Read her email carefully \-\- it contains specific requirements and editorial concerns you must address\.

Prepare three deliverables\. The first is a story pitch memo \(pitch\_memo\.docx\), a Word document of no more than two pages that presents your thesis and evidentiary backbone\. The memo should connect water lead contamination to childhood health outcomes in Connecticut, ground its key claims in the FOIA water testing data \-\- including the infrastructure columns \-\- and the state surveillance report, and draw a specific numerical parallel between Waterbury’s trajectory and Newark’s crisis \-\- not merely summarize Newark\. Any quotes from Dr\. Martinez’s interview must be paired with verifiable data points, and her factual claims should be checked against the state surveillance numbers\. The editor flagged concerns about the water\-versus\-paint angle and the regulatory timeline for the LCRI’s new action level; the EPA’s public factsheet on the Lead and Copper Rule Improvements, available from EPA, provides the enforcement specifics you’ll need for precision\.

The second deliverable is a data analysis workbook \(data\_analysis\.xlsx\) with three sheets: "Water Lead Trends" covering the five water systems across the 2020\-\-2024 monitoring periods with threshold flags and trend calculations; "CT vs National" comparing Connecticut’s blood lead data against national totals from the CDC’s Childhood Blood Lead Surveillance national dataset \(2017\-\-2022\), sourced from CDC; and "City Cross\-Reference" mapping each water system’s service city to the corresponding town\-level blood lead data in the state surveillance report, including high\-risk rankings\.

The third deliverable is a source verification log \(source\_log\.csv\) with columns Data\_Point, Source\_File, Page\_or\_Location, Verified \(Yes/No/Partial\), and Notes\. Log at minimum fifteen entries spanning all source files used, and flag any claims where the data doesn’t cleanly support the assertion\.

Local Reference Files:

- CT\_2024\_Lead\_Surveillance\_Report\.pdf
- NRDC\_Lead\_Newark\_Drinking\_Water\_Factsheet\.pdf
- editors\_assignment\_email\.txt
- hartford\_area\_water\_lead\_samples\_2020\_2024\.csv
- interview\_notes\_dr\_martinez\.txt
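The structural requirements on the source verification log (five named columns, at least fifteen entries, a Yes/No/Partial verification value) can be checked mechanically. The following is a minimal sketch of such a check, not the benchmark's grader: the function name `check_source_log` is hypothetical, and it assumes the header for the fourth column starts with `Verified`, with or without the `(Yes/No/Partial)` suffix.

```python
import csv
import io

# Column names taken from the task instruction. Whether the fourth header carries
# the "(Yes/No/Partial)" suffix is unknown, so only its prefix is checked (assumption).
REQUIRED_COLUMNS = ["Data_Point", "Source_File", "Page_or_Location", "Verified", "Notes"]
ALLOWED_VERIFIED = {"Yes", "No", "Partial"}
MIN_ENTRIES = 15


def check_source_log(csv_text: str) -> list[str]:
    """Return a list of problems; an empty list means these structural checks pass."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []

    # Header checks: exactly five columns, in the order given by the instruction.
    if len(fieldnames) != len(REQUIRED_COLUMNS):
        problems.append(f"expected {len(REQUIRED_COLUMNS)} columns, found {len(fieldnames)}")
    for expected, actual in zip(REQUIRED_COLUMNS, fieldnames):
        if not actual.strip().startswith(expected):
            problems.append(f"column {actual!r} should be {expected!r}")

    # Entry-count check.
    rows = list(reader)
    if len(rows) < MIN_ENTRIES:
        problems.append(f"expected at least {MIN_ENTRIES} entries, found {len(rows)}")

    # Value check on the verification column, if one is present.
    verified_key = next((n for n in fieldnames if n.strip().startswith("Verified")), None)
    if verified_key is not None:
        for line_no, row in enumerate(rows, start=2):  # line 1 is the header
            value = (row.get(verified_key) or "").strip()
            if value not in ALLOWED_VERIFIED:
                problems.append(f"line {line_no}: Verified value {value!r} is not Yes/No/Partial")
    return problems
```

A check like this covers only the format; whether entries span all source files and cite specific locations still needs a reader or a judge model.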

Rubrics \(8 checks, 60 total points\)

1. 8 pts\. Does the response correctly map which of the five CSV water system cities appear on the CT surveillance report’s 2024 high\-risk communities list, and which do not?
   - The response must correctly state that all five water system cities from the CSV \(Hartford, New Haven, Bridgeport, Waterbury, Meriden\) appear on the CT 2024 Surveillance Report’s top 10 high\-risk communities list, with their rankings: New Haven \(\#1\), Bridgeport \(\#2\), Waterbury \(\#3\), Hartford \(\#4\), Meriden \(\#5\)\.
   - The response must explain the significance of the high\-risk designation – that it is based on rates of newly poisoned children, age of housing, and income levels, not water lead levels specifically\.
   - The response must acknowledge that the remaining 5 high\-risk communities \(New Britain, Norwich, West Haven, Stamford, Manchester\) are not represented in the water testing CSV data, representing a potential gap in the story’s coverage\.
2. 8 pts\. Does the data analysis workbook’s ‘Water Lead Trends’ sheet correctly compute the percentage change in 90th\-percentile lead levels from 2020 to 2024 for each of the five water systems?
   - The workbook must compute percentage changes in 90th\-percentile lead levels from 2020 to 2024 for all five water systems, with approximately correct values: Hartford ~\+26\.8% \(8\.2 to 10\.4 ppb\), New Haven ~\+14\.1% \(6\.4 to 7\.3 ppb\), Bridgeport ~\+15\.7% \(5\.1 to 5\.9 ppb\), Waterbury ~\+31\.6% \(9\.8 to 12\.9 ppb\), Meriden ~\+41\.9% \(4\.3 to 6\.1 ppb\)\.
   - Alternative valid calculation methodologies \(e\.g\., using 2020 averages or 2024 averages instead of single periods\) are acceptable, but the methodology must be clearly stated and the math must be correct\.
   - The workbook should flag which systems exceed 15 ppb and 10 ppb thresholds across which monitoring periods\.
3. 10 pts\. Does the ‘CT vs National’ sheet correctly extract Connecticut’s data from the CDC dataset and compare it to national totals, noting that CT data is only available for 2017\-2019?
   - The sheet must note that Connecticut data in the CDC dataset is only available for 2017, 2018, and 2019, with years 2020\-2022 marked as not submitted\.
   - The sheet must present correct figures: 2017 CT had 1,666 tested with 93 \(5\.6%\) at \>=5 µg/dL vs\. national 2,580,144 tested with 38,427 \(1\.5%\); 2018 CT had 19,946 tested with 112 \(0\.6%\) vs\. national 2,923,721 with 36,990 \(1\.3%\); 2019 CT had 35,881 tested with 101 \(0\.3%\) vs\. national 2,691,749 with 29,137 \(1\.1%\)\.
   - The sheet must note the 2017 CT anomaly \(only 1,666 tested, yielding an artificially high 5\.6% percentage\) and explain that the BLL \>=3\.5 threshold data is only available nationally for 2022 \(31,767 or 1\.7%\) and is unavailable for CT in the CDC dataset\.
4. 6 pts\. Does the ‘City Cross\-Reference’ sheet correctly map each water system city to the town\-level blood lead data from the CT surveillance report, including total confirmed tests, and counts at each BLL threshold?
   - The sheet must include correct town\-level data for all five cities: Hartford \(3,539 total, 150 at \>=3\.5, 92 at \>=5, 29 at \>=10, 12 at \>=15, 7 at \>=20 µg/dL\), New Haven \(3,185 total, 224 at \>=3\.5, 103 at \>=5, 21 at \>=10, 5 at \>=15, 2 at \>=20\), Bridgeport \(4,188 total, 215 at \>=3\.5, 113 at \>=5, 29 at \>=10, 14 at \>=15, 5 at \>=20\), Waterbury \(3,662 total, 196 at \>=3\.5, 92 at \>=5, 23 at \>=10, 15 at \>=15, 9 at \>=20\), and Meriden \(1,923 total, 71 at \>=3\.5, 29 at \>=5, 6 at \>=10, 2 at \>=15, 2 at \>=20\)\.
   - The sheet must include the high\-risk rankings for each city: New Haven \(\#1\), Bridgeport \(\#2\), Waterbury \(\#3\), Hartford \(\#4\), Meriden \(\#5\)\.
   - Significant errors in the town\-level numbers \(beyond minor rounding\) fail this rubric\.
5. 8 pts\. Does the source verification log contain at least 15 entries and use the correct CSV format with columns Data\_Point, Source\_File, Page\_or\_Location, Verified \(Yes/No/Partial\), and Notes?
   - The source log must be a CSV file with exactly five columns: Data\_Point, Source\_File, Page\_or\_Location, Verified \(Yes/No/Partial\), and Notes\.
   - The log must contain at least 15 distinct entries covering key statistics, quotes, and claims used in the pitch memo and data workbook\.
   - Entries must span multiple source files – at minimum including entries from: the CSV water data, the CT surveillance report, the CDC national data, the EPA LCRI factsheet, the NRDC Newark factsheet, and Dr\. Martinez’s interview\.
   - Each entry must have a specific page/location reference \(e\.g\., ‘CT report p\.3’, ‘CSV row 38’, ‘EPA factsheet p\.1’\) rather than vague references like ‘report’\.
6. 6 pts\. Does the response reference the correct legislative basis for Connecticut’s lowered blood lead reference value – Public Act 22\-49, effective January 1, 2023, lowering the threshold from 5 µg/dL to 3\.5 µg/dL?
   - The response must correctly identify Connecticut Public Act 22\-49 as the legislative basis for the lowered blood lead reference value\.
   - The response must state the correct effective date \(January 1, 2023\) and the correct threshold change \(from 5 µg/dL to 3\.5 µg/dL\)\.
   - Attributing the change to the wrong act number, wrong date, or wrong threshold values fails this rubric\.
7. 8 pts\. Does the pitch memo discuss the lead service line counts and replacement burden across the five water systems, connecting this to the LCRI’s 10\-year replacement mandate?
   - The pitch memo must include lead service line counts and population served for each system from the 2024 CSV data: Hartford \(2,500 lines, 400,000 people, 0\.63% ratio\), New Haven \(1,250 lines, 430,000 people, 0\.29%\), Bridgeport \(1,550 lines, 350,000 people, 0\.44%\), Waterbury \(1,180 lines, 220,000 people, 0\.54%\), Meriden \(650 lines, 120,000 people, 0\.54%\)\.
   - The pitch memo must identify Hartford as having both the most absolute lead service lines \(2,500\) and the highest ratio of lines to population served \(0\.63%\)\.
   - The pitch memo must connect these service line counts to the LCRI’s requirement that systems replace lead service lines within 10 years, identifying which systems face the largest replacement burdens\.
8. 6 pts\. Does the data analysis workbook flag 90th\-percentile readings that exceed both the 15 ppb and 10 ppb thresholds for each monitoring period, correctly identifying all exceedance periods?
   - The workbook must correctly identify the 15 ppb exceedances: Waterbury Jan\-Jun 2023 \(15\.3 ppb\) and Jul\-Dec 2023 \(16\.1 ppb\)\.
   - The workbook must correctly identify periods exceeding 10 ppb but below 15 ppb, including: Hartford Jan\-Jun 2022 \(11\.3\), Jul\-Dec 2022 \(12\.7\), Jan\-Jun 2023 \(13\.2\), Jul\-Dec 2023 \(14\.1\), Jan\-Jun 2024 \(11\.8\), Jul\-Dec 2024 \(10\.4\); and Waterbury Jul\-Dec 2020 \(10\.4\), Jan\-Jun 2021 \(11\.2\), Jul\-Dec 2021 \(12\.1\), Jan\-Jun 2022 \(13\.5\), Jul\-Dec 2022 \(14\.8\), Jan\-Jun 2024 \(14\.2\), Jul\-Dec 2024 \(12\.9\)\.
   - The workbook must clearly distinguish which periods exceed the 15 ppb current action level versus which exceed only the 10 ppb LCRI threshold\.
   - The CSV’s ‘Action\_Level\_Exceedance’ column marks Hartford’s 2024 periods and Waterbury’s 2023\-2024 periods as ‘Yes’ \(under the current 15 ppb standard\), and the workbook should reflect this\.
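The arithmetic behind checks 2 and 8 can be sketched in a few lines of Python. This is an illustrative sketch, not the benchmark's grader: the function names and the dictionary layout are assumptions, strict "greater than" is assumed for "exceeds", and the readings are the 2020 and 2024 endpoint values quoted in check 2.

```python
# Illustrative sketch only; names and data layout are assumptions, not JobBench code.
# Endpoint 90th-percentile lead readings in ppb (2020, 2024), as quoted in check 2.
READINGS_PPB = {
    "Hartford": (8.2, 10.4),
    "New Haven": (6.4, 7.3),
    "Bridgeport": (5.1, 5.9),
    "Waterbury": (9.8, 12.9),
    "Meriden": (4.3, 6.1),
}

CURRENT_ACTION_LEVEL_PPB = 15.0  # current federal action level
LCRI_ACTION_LEVEL_PPB = 10.0     # finalized LCRI level


def percent_change(start: float, end: float) -> float:
    """Percentage change from start to end, rounded to one decimal place."""
    return round((end - start) / start * 100, 1)


def classify(reading_ppb: float) -> str:
    """Label a reading by the strictest threshold it exceeds (assumes strict >)."""
    if reading_ppb > CURRENT_ACTION_LEVEL_PPB:
        return "exceeds 15 ppb"
    if reading_ppb > LCRI_ACTION_LEVEL_PPB:
        return "exceeds 10 ppb only"
    return "below both"


# Per-city percentage change between the two endpoint readings.
changes = {city: percent_change(a, b) for city, (a, b) in READINGS_PPB.items()}

for city, pct in changes.items():
    print(f"{city}: {pct:+.1f}%")
```

Keeping the two thresholds as separate named constants mirrors the rubric's requirement that the workbook distinguish the current 15 ppb action level from the 10 ppb LCRI level.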

Task Card

Summary\. This task asks the agent to act as an investigative reporter at a Connecticut outlet, building a pre\-publication editorial package on lead in drinking water and childhood health: a thesis\-driven pitch memo, a three\-sheet data analysis workbook, and a 15\+ entry source verification log – all internally cross\-checked against the FOIA water data, the CT surveillance report, and an interview with Dr\. Martinez\.

O\*NET task: Check reference materials, such as books, news files, or public records, to obtain relevant facts\.

Expert desire: 4\.00 \(scale: 0–5, from WorkBank\)

In practice: harvesting verifiable data from heterogeneous reference materials \(FOIA CSVs, state PDF reports, NRDC factsheets, EPA regulatory factsheets, federal CDC datasets, interview transcripts\), pairing every claim with a sourceable data point, and assembling a fact\-tracked analytical package that an editor can defend on the front page\.

Why desired\. Investigative beat reporting is gated by source\-verification time – extracting numbers from PDFs, joining FOIA CSVs to surveillance tables, and cross\-checking interview claims against public records\. An agent that handles the verification layer lets the reporter focus on the editorial argument and the framing\.

Reasoning challenges\.

- Reconciling contradictory water\-vs\-paint evidence\. The CT 2024 surveillance report shows 0% of investigated homes identified water as a lead hazard, yet the FOIA CSV shows multiple Hartford\-area systems exceeding the 15 ppb federal action level\. The agent must reason about home\-investigation methodology limits versus system\-level monitoring data and produce a coherent editorial argument, not paper over the contradiction\.
- Fact\-checking Dr\. Martinez’s "30% increase" quote\. Dr\. Martinez claims a 30% post\-threshold\-change increase in referrals\. The CT surveillance report shows ~66% statewide prevalence increase \(2022→2023\) and ~84% incidence increase \(2022→2024\)\. The agent must classify the quote as directionally correct but quantitatively conservative – likely clinic\-specific – rather than accept or reject it outright\.
- Waterbury\-vs\-Newark data parallel\. Waterbury peaked at 16\.1 ppb \(just above the 15 ppb threshold\); Newark hit 47\.9 ppb \(over 3× the threshold\)\. The agent must construct a structural\-trajectory parallel \(sustained exceedances, delayed enforcement, vulnerable populations\) rather than a point\-in\-time comparison that obscures the gap\.
- City\-to\-surveillance mapping with a credibility gap\. All five FOIA water\-system service cities appear on the CT surveillance Top\-10 high\-risk communities list, but five high\-risk communities \(New Britain, Norwich, West Haven, Stamford, Manchester\) are NOT in the water data – a sourcing gap the agent must flag for story credibility\.
- LCRI: finalized vs\. enforceable\. The Lead and Copper Rule Improvements were finalized October 2024, but the new 10 ppb action level does not yet apply to current monitoring cycles\. The agent must read the EPA factsheet carefully and distinguish "finalized" from "in enforcement" – the distinction the editor explicitly flagged\.
- CDC data submission gaps for Connecticut\. CDC childhood blood lead surveillance for CT is only available 2017–2019; 2020–2022 are dagger\-marked non\-submissions\. The 2017 row covers only 1,666 children tested, producing an artificially elevated 5\.6% rate that the agent must flag for sampling bias\.
- Counterintuitive racial\-disparity ordering\. The CT report shows Non\-Hispanic Asian children \(4\.0%\) and Hispanic children \(4\.0%\) above Non\-Hispanic Black children \(2\.9%\) in incidence\. The agent must report these accurately rather than default to common\-assumption ordering, and connect them to Dr\. Martinez’s quote about children of color\.

Files requiring search\. External federal data and regulatory factsheets the agent must locate\.

- CDC\_2017\_2022\_Blood\_Lead\_National\_Data\.xlsx: State\-level CDC childhood blood lead surveillance data – provides Connecticut 2017–2019 rows and national totals for the "CT vs National" sheet\. *Source:* cdc\.gov \([https://www\.cdc\.gov/lead\-prevention/media/files/2025/08/2017\-2022\-cbls\-national\-data\-508\-1\.xlsx](https://www.cdc.gov/lead-prevention/media/files/2025/08/2017-2022-cbls-national-data-508-1.xlsx)\)\.
- EPA\_LCRI\_General\_Public\_Factsheet\.pdf: EPA factsheet on the Lead and Copper Rule Improvements – supplies the finalized 10 ppb action level, 10\-year service\-line replacement mandate, and enforcement timing needed to discuss current vs\. forthcoming regulation accurately\. *Source:* epa\.gov \([https://www\.epa\.gov/system/files/documents/2024\-10/final\_lcri\_fact\-sheet\_general\_public\.pdf](https://www.epa.gov/system/files/documents/2024-10/final_lcri_fact-sheet_general_public.pdf)\)\.
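The CDC submission\-gap challenge comes down to a rate comparison with a sample\-size guard. The sketch below uses the tested and elevated counts quoted in rubric check 3; the 10,000\-test cutoff and all names are hypothetical choices made for illustration, not values from the benchmark.

```python
# Illustrative sketch; MIN_TESTED is an arbitrary assumption, not a JobBench value.
# (tested, elevated at >=5 ug/dL) per year, as quoted in rubric check 3.
CT = {2017: (1_666, 93), 2018: (19_946, 112), 2019: (35_881, 101)}
NATIONAL = {
    2017: (2_580_144, 38_427),
    2018: (2_923_721, 36_990),
    2019: (2_691_749, 29_137),
}
MIN_TESTED = 10_000  # hypothetical cutoff for flagging an unrepresentative sample


def rate_pct(tested: int, elevated: int) -> float:
    """Share of tested children with elevated blood lead, in percent (1 decimal)."""
    return round(elevated / tested * 100, 1)


# Build one comparison row per year, flagging years with too few CT tests.
rows = []
for year in sorted(CT):
    ct_tested, ct_elevated = CT[year]
    rows.append({
        "year": year,
        "ct_rate": rate_pct(ct_tested, ct_elevated),
        "national_rate": rate_pct(*NATIONAL[year]),
        "small_sample": ct_tested < MIN_TESTED,
    })

for row in rows:
    print(row)
```

Under this cutoff only the 2017 row is flagged, which matches the anomaly the task card describes: a 5\.6% rate computed from just 1,666 children tested.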

### E\.2Data Entry Keyers: Q3 Sales Order Reconciliation

Data Entry Keyers – Task 1: Three\-way order reconciliation

Assignment Brief\. Act as a senior data entry keyer before a quarterly compliance audit\. Reconcile Q3 2025 sales orders across CRM exports, scanned handwritten forms, and ERP records using the specified source hierarchy\.

Evidence Files\. CRM sales\-order workbook; ERP data\-entry workbook; 10 scanned order forms; customer master list; error\-log and compliance templates\.

Reasoning Burden\.

- Source hierarchy: Use scans as authoritative when present, otherwise prioritize CRM over ERP\.
- Reverse calculations: Validate units, prices, and totals bidirectionally\.
- Entity validation: Check names and countries against the customer master list\.
- Pattern finding: Separate systematic errors from random data\-entry mistakes\.

Required Deliverables\.

- Discrepancy log: field\-level conflicts, source system, and reconciled values\.
- Reconciled master: clean Q3 order table ready for audit\.
- Findings report: systematic error patterns and compliance checklist\.

Rubric Pressure Points\. Known calculation errors; typo and transposition detection; unregistered customers; country normalization; systematic\-vs\-random classification; audit\-ready reconciled master\.

Figure 10:Task schematic for Data Entry Keyers\. The agent receives Q3 sales orders across CRM exports, scanned handwritten forms, and ERP records, and must reconcile them under a specified source hierarchy and deliver a discrepancy log, a clean reconciled master, and a findings report\.

Instruction

As Senior Data Entry Keyer at Global Trade Solutions Inc\., reconcile Q3 2025 sales order data across the CRM export \(Source\_Sales\_Orders\_Q3\_2025\.xlsx\), scanned handwritten order forms for orders 0001\-0010, and ERP data entry records \(Data\_Entry\_Records\_Q3\_2025\.xlsx\) before the quarterly compliance audit\.

Identify all discrepancies where values differ between sources\. For orders with scanned originals available, treat these as the authoritative source; otherwise rely on CRM data over ERP entries\. Document each discrepancy in the provided error log template, including which source system introduced the error and whether the error represents a systematic pattern or random occurrence\. Revenue calculation deviations should include the expected formula and variance amount\.

Validate customer information against Global Trade Solutions Inc\.’s registered customer master data in Customer\_Master\_List\.csv, which covers 35 registered customers \(CUST\-001 through CUST\-035\)\. Orders from unregistered entities require flagging for follow\-up\.

Produce a reconciled master dataset indicating each record’s verification status and any source conflicts\. Prepare an executive audit findings report covering data quality metrics, identified systematic error patterns, and process improvement recommendations\. Complete the compliance checklist with pass/fail assessments based on your verification results\.

Local Reference Files:

- Compliance\_Checklist\.xlsx
- Customer\_Master\_List\.csv
- Data\_Entry\_Records\_Q3\_2025\.xlsx
- Error\_Log\_Template\.xlsx
- Order\_Form\_ORD\_202507\_0001\.png
- Order\_Form\_ORD\_202507\_0002\.png
- Order\_Form\_ORD\_202507\_0003\.png
- Order\_Form\_ORD\_202507\_0004\.png
- Order\_Form\_ORD\_202507\_0005\.png
- Order\_Form\_ORD\_202508\_0006\.png
- Order\_Form\_ORD\_202508\_0007\.png
- Order\_Form\_ORD\_202508\_0008\.png
- Order\_Form\_ORD\_202509\_0009\.png
- Order\_Form\_ORD\_202509\_0010\.png
- Source\_Sales\_Orders\_Q3\_2025\.xlsx

Rubrics \(12 checks, 92 total points\)

1. 10 pts\. Does the Error Log correctly document the calculation error for ORD\-202507\-0002 where ERP shows Total\_Revenue of $6794\.25 instead of the correct $6749\.25?
   - The Error Log must document that for ORD\-202507\-0002, ERP shows Total\_Revenue as $6794\.25 while the correct calculation is 75 × $89\.99 = $6749\.25 \(matching CRM and scanned form\)\.
   - The error type must be classified as ‘Calculation’ with the source system identified as ‘ERP’\.
   - The reconciled value must be recorded as $6749\.25\.
   - Root cause notes should show the expected formula \(75 × $89\.99\) and the $45\.00 deviation\.
2. 6 pts\. Does the Error Log document the customer name typo for ORD\-202507\-0004 where ERP has ‘Pacific Rim Export’ instead of ‘Pacific Rim Exports’?
   - The Error Log must document that for ORD\-202507\-0004, ERP has Customer\_Name as ‘Pacific Rim Export’ \(missing ‘s’\) while CRM and scanned form have ‘Pacific Rim Exports’\.
   - The error type must be classified as ‘Typo’ with the source system identified as ‘ERP’\.
   - The reconciled value must be ‘Pacific Rim Exports’, confirmed by Customer\_Master\_List\.csv where CUST\-004 is registered as ‘Pacific Rim Exports’\.
3. 6 pts\. Does the Error Log identify the digit transposition error for ORD\-202507\-0005 where ERP shows 1020 units sold instead of the correct 1200 units?
   - The Error Log must document that for ORD\-202507\-0005, ERP shows Units\_Sold as 1020 while CRM and scanned form show 1200\.
   - The error type must be classified as ‘Transposition’ \(digits 0 and 2 swapped\) with the source system identified as ‘ERP’\.
   - The reconciled value must be 1200\.
4. 10 pts\. Does the Error Log correctly flag the country naming inconsistency for ORD\-202508\-0006 where CRM shows ‘Italia’ instead of ‘Italy’?
   - The Error Log must document that for ORD\-202508\-0006, CRM shows Country as ‘Italia’ while ERP and scanned form show ‘Italy’\.
   - The Error Log must reference Customer\_Master\_List\.csv to confirm CUST\-006 \(Mediterranean Trade Co\) is registered with Country ‘Italy’\.
   - The error type must be classified as ‘Format’ \(non\-standard naming\) with the source system identified as ‘CRM’\.
   - The reconciled value must be ‘Italy’\.
5. 10 pts\. Does the Error Log identify the Unit Price discrepancy for ORD\-202508\-0008 where ERP shows $5\.05 instead of $5\.50?
   - The Error Log must document that for ORD\-202508\-0008, ERP shows Unit\_Price as $5\.05 while CRM and scanned form show $5\.50\.
   - The log should note that ERP Total\_Revenue of $4400\.00 matches 800 × $5\.50, indicating the ERP Unit\_Price field is internally inconsistent with its own revenue\.
   - The error type must be classified as ‘Typo’ with the source system identified as ‘ERP’\.
   - The reconciled value must be $5\.50\.
6. 6 pts\. Does the Error Log correctly identify the date error for ORD\-202509\-0009 where CRM shows 09/30/2025 but ERP and scanned form show 09/03/2025?
   - The Error Log must document that for ORD\-202509\-0009, CRM shows Order\_Date as ‘09/30/2025’ while ERP shows ‘09/03/2025’ and scanned form confirms ‘09/03/2025’\.
   - Per source priority \(scanned highest\), the reconciled date must be 09/03/2025\.
   - The error type must be classified as ‘Transposition’ \(day digits 03 vs 30 swapped\) with the source system identified as ‘CRM’\.
7. 6 pts\. Does the Error Log document the country name typo for ORD\-202509\-0010 where ERP has ‘Canda’ instead of ‘Canada’?
   - The Error Log must document that for ORD\-202509\-0010, ERP has Country as ‘Canda’ while CRM and scanned form show ‘Canada’\.
   - The Error Log must reference Customer\_Master\_List\.csv to confirm CUST\-010 \(Canadian Wholesale Inc\) is registered with Country ‘Canada’\.
   - The error type must be classified as ‘Typo’ with the source system identified as ‘ERP’ and the reconciled value as ‘Canada’\.
8. 6 pts\. Does the Error Log identify the Item Type typo for ORD\-202508\-0018 where ERP has ‘Houshold’ instead of ‘Household’?
   - The Error Log must document that for ORD\-202508\-0018, ERP has Item\_Type as ‘Houshold’ \(missing ‘e’\) while CRM shows ‘Household’\.
   - The error type must be classified as ‘Typo’ with the source system identified as ‘ERP’\.
   - The reconciled value must be ‘Household’\.
9. 6 pts\. Does the Error Log identify the Unit Price transposition for ORD\-202509\-0023 where ERP shows $108\.00 instead of $180\.00?
   - The Error Log must document that for ORD\-202509\-0023, ERP shows Unit\_Price as $108\.00 while CRM shows $180\.00\.
   - The Error Log must use Total\_Revenue of $8100\.00 to validate the CRM price \(45 × $180\.00 = $8100\.00\), not the ERP price \(45 × $108\.00 = $4860\.00\)\.
   - The error type must be classified as ‘Transposition’ with the source system identified as ‘ERP’\.
   - The reconciled value must be $180\.00\.
10. 6 pts\. Does the Error Log document the Sales Channel discrepancy for ORD\-202509\-0026 where ERP shows ‘Online’ but CRM shows ‘Offline’?
    - The Error Log must document that for ORD\-202509\-0026, ERP has Sales\_Channel as ‘Online’ while CRM shows ‘Offline’\.
    - Since no scanned form exists for this order, CRM takes priority; the source system causing the error must be identified as ‘ERP’ with the reconciled value as ‘Offline’\.
    - The error pattern should be classified as ‘Systematic’ as multiple Sales Channel errors from ERP exist \(e\.g\., ORD\-202508\-0007\)\.
11. 10 pts\. Does the Audit Findings Report include a data quality score calculated as percentage of records with no discrepancies?
    - The Audit\_Findings\_Report\.txt must include a data quality score calculated as \(records with no discrepancies / total records\) × 100%\.
    - The calculation must be based on 30 total records, with discrepancies identified in all 17 of the following orders: 0001\-0010, 0012, 0015, 0018, 0020, 0023, 0026, 0029\.
    - The data quality score should accurately reflect the count of clean vs\. discrepant records\.
    - The report should clearly state the methodology used for the data quality score calculation\.
12. 10 pts\. Does the Compliance Checklist correctly assess CRT\-005 \(Total Revenue Calculation Accuracy\) as ‘Fail’?
    - The Compliance\_Checklist\.xlsx must mark CRT\-005 ‘Total Revenue calculations are accurate \(Units × Unit Price\)’ as ‘Fail’\.
    - The Compliance\_Checklist\.xlsx must justify the failure by referencing ORD\-202507\-0002 \(ERP: $6794\.25 vs correct $6749\.25\) having a calculation error exceeding the \+/\-$0\.01 tolerance\.
    - The Compliance\_Checklist\.xlsx must also reference ORD\-202508\-0015 \(ERP: $4480\.00 vs correct $4880\.00\) as an additional calculation error exceeding the \+/\-$0\.01 tolerance\.
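Checks 11 and 12 both reduce to short calculations. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the order list is taken from check 11, and `Decimal` is used so that the ±$0\.01 tolerance is not blurred by binary floating point. It is not the benchmark's grading code.

```python
# Illustrative sketch; function names are hypothetical, values come from the rubric.
from decimal import Decimal

TOTAL_RECORDS = 30
# Order suffixes with at least one discrepancy, as listed in check 11.
DISCREPANT = {f"{n:04d}" for n in range(1, 11)} | {
    "0012", "0015", "0018", "0020", "0023", "0026", "0029",
}
TOLERANCE = Decimal("0.01")  # +/- $0.01 tolerance from check 12


def data_quality_score(total: int, discrepant: int) -> float:
    """(records with no discrepancies / total records) x 100, one decimal place."""
    return round((total - discrepant) / total * 100, 1)


def revenue_variance(units: int, unit_price: str, recorded_total: str) -> Decimal:
    """Recorded total minus the expected Units x Unit Price."""
    return Decimal(recorded_total) - units * Decimal(unit_price)


def revenue_ok(units: int, unit_price: str, recorded_total: str) -> bool:
    """True when the recorded total is within tolerance of Units x Unit Price."""
    return abs(revenue_variance(units, unit_price, recorded_total)) <= TOLERANCE


score = data_quality_score(TOTAL_RECORDS, len(DISCREPANT))
variance_0002 = revenue_variance(75, "89.99", "6794.25")  # ERP row for ORD-202507-0002
print(score, variance_0002)
```

With 17 discrepant orders out of 30, the formula in check 11 leaves 13 clean records. The ORD\-202507\-0002 variance is the $45\.00 deviation named in check 1, far outside the tolerance, which is why CRT\-005 fails.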

Task Card

Summary\. This task asks the agent to act as Senior Data Entry Keyer at Global Trade Solutions, three\-way reconciling Q3 2025 sales orders across the CRM export, ten scanned handwritten order forms, and the ERP entry records ahead of a quarterly compliance audit – producing a discrepancy log, a reconciled master, an executive findings report, and a completed compliance checklist\.

O\*NET task: Compare data with source documents, or re\-enter data in verification format to detect errors\.

Expert desire: 3\.50 \(scale: 0–5, from WorkBank\)

In practice: OCR\-reading the ten scanned PNG order forms \(orders 0001–0010\), comparing them against Source\_Sales\_Orders\_Q3\_2025\.xlsx and Data\_Entry\_Records\_Q3\_2025\.xlsx, validating customer entities against the 35 registered customers in Customer\_Master\_List\.csv, applying source priority \(scans \> CRM \> ERP\) to each conflict, and tagging each error as systematic vs random in the error log template\.

Why desired\. A keyer’s pre\-audit day is dominated by line\-by\-line comparison across three formats and by mechanical checks like Total = Units × Price and country\-name normalization – not the audit judgment\. Automating the three\-way diff, the calculation reverse\-validation, and the master\-list cross\-check lets the keyer focus on the systematic\-pattern findings and the process\-improvement recommendations the audit committee actually reads\.

Reasoning challenges\.

- Three\-way reconciliation with missing sources\. Only orders 0001–0010 have scanned originals\. For orders 0011–0030 the agent must fall back to CRM\-vs\-ERP with CRM as priority, applying the source hierarchy correctly when the authoritative source is absent\.
- Reverse\-validation of calculation errors\. Some records have correct Total\_Revenue but wrong Unit\_Price or Units\_Sold\. The agent must apply Total = Units × Price bidirectionally to identify which component field is in error rather than flagging the total\.
- Unregistered customer detection\. Two orders \(ORD\-202507\-0021 and ORD\-202508\-0022\) name "Unknown Supplier Co" and "Mystery Trading LLC" – neither in the 35\-customer master\. The agent must cross\-reference all 30 orders against the master to surface them\.
- Country naming standardization\. "Italia" vs "Italy" is only catchable by anchoring against the registered country list in Customer\_Master\_List\.csv, not by a CRM\-vs\-ERP diff alone\.
- Systematic vs random error classification\. The agent must aggregate individual findings into pattern categories – transposition errors \(IDs, digits, dates\), calculation errors, Sales Channel encoding errors, typos – and report frequencies, not just a flat list of discrepancies\.

Files requiring search\. None – all required references are provided in task\_folder/\.
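The first two reasoning challenges, falling back through the source hierarchy and locating which field breaks Total = Units × Price, can be sketched as follows. The function names, the dictionary keys, and the use of a higher\-priority source as the reference are assumptions made for illustration; the example values are the ones quoted in the rubric for ORD\-202508\-0008, ORD\-202507\-0002, ORD\-202509\-0009, and ORD\-202509\-0026.

```python
# Illustrative sketch; names and record layout are hypothetical, not JobBench code.
from decimal import Decimal
from typing import Optional

SOURCE_PRIORITY = ("scan", "crm", "erp")  # highest priority first


def reconcile(values: dict) -> tuple:
    """Return (value, source) from the highest-priority source that has a value."""
    for source in SOURCE_PRIORITY:
        if values.get(source) is not None:
            return values[source], source
    raise ValueError("no source provides a value")


def field_in_error(units: int, unit_price: Decimal, total: Decimal,
                   ref_units: int, ref_price: Decimal) -> Optional[str]:
    """Locate the inconsistent field in a record, given reference units and
    price from a higher-priority source. Returns None if the record is
    internally consistent."""
    if units * unit_price == total:
        return None
    if units * ref_price == total:
        return "unit_price"  # total agrees with the reference price
    if ref_units * unit_price == total:
        return "units"       # total agrees with the reference units
    return "total"           # neither component explains the total


# ERP row for ORD-202508-0008: price typo, total is still correct.
err_0008 = field_in_error(800, Decimal("5.05"), Decimal("4400.00"),
                          800, Decimal("5.50"))
# ERP row for ORD-202507-0002: components are right, total is wrong.
err_0002 = field_in_error(75, Decimal("89.99"), Decimal("6794.25"),
                          75, Decimal("89.99"))
date_0009 = reconcile({"scan": "09/03/2025", "crm": "09/30/2025", "erp": "09/03/2025"})
channel_0026 = reconcile({"scan": None, "crm": "Offline", "erp": "Online"})
print(err_0008, err_0002, date_0009, channel_0026)
```

Testing the total against each reference component in turn is what lets the check attribute the error to a component field instead of flagging the total by default.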

### E\.3Lawyers: Short\-Term Rental Ordinance Settlement Analysis

Lawyers – Task 1: STR ordinance settlement response

Assignment Brief\. Act as a North Carolina real\-estate and land\-use attorney\. Evaluate a town settlement offer for six short\-term\-rental properties and prepare constitutional arguments, comparative regulation analysis, and a counter\-proposal\.

Evidence Files\. Town settlement letter; client property database; Belvidere STR ordinance; Cornell STR regulations PDF; case\-law and Penn Central context\.

Reasoning Burden\.

- Legal synthesis: Compare local ordinance provisions with relevant STR litigation and doctrine\.
- Financial exposure: Quantify fines, revenue loss, permit denials, and settlement pressure\.
- Property\-specific analysis: Separate facts across six properties rather than arguing generically\.
- Negotiation framing: Translate legal and financial analysis into settlement terms\.

Required Deliverables\.

- Legal memo: constitutional and settlement\-risk analysis\.
- Regulatory table: jurisdiction\-by\-jurisdiction STR comparison\.
- Counter\-proposal: client\-oriented settlement response\.

Rubric Pressure Points\. Fine exposure; seven\-category regulatory table; owner\-occupancy and permit caps; Penn Central factors; property\-by\-property financials; viable counterterms\.

Figure 11:Task schematic for Lawyers\. The agent receives a town settlement letter, a six\-property client database, and a packet of ordinances and case law, and must deliver a constitutional and settlement\-risk memo, a jurisdiction\-by\-jurisdiction regulatory table, and a counter\-proposal\.

Instruction

You are a real estate and land use attorney at a mid\-sized law firm in North Carolina\. Your clients \-\- Coastal Property Group, LLC \(owned by James and Patricia Morgan\) and Sandra Whitfield \-\- own six residential properties in the Town of Millbrook, NC that they have been operating as short\-term rentals\. In May 2025, the Town enacted Ordinance 2025\-14, a comprehensive STR regulation that effectively shuts down their operations\. Permit applications under the new ordinance have been denied\.

Three of the clients’ properties have been operating past the November 28, 2025 amortization deadline and are accruing daily fines\. The Town Attorney has sent a settlement proposal offering transitional permits\. Your clients need you to evaluate this proposal and advise on whether to accept, reject, or counter\-propose\. Sandra Whitfield has privately mentioned she might consider selling her properties if the numbers make sense, but wants your analysis before deciding\. The response deadline is January 5, 2026, and the fine waiver is conditioned on executing a settlement agreement by January 31, 2026\.

The Town Attorney’s letter relies heavily on Hignell\-Stark v\. City of New Orleans, No\. 21\-30643 \(5th Cir\. 2022\) and cites amortization precedent that may not withstand scrutiny given the clients’ distinct property interests and the Penn Central regulatory takings framework applicable here\. You will need to obtain the actual Fifth Circuit opinion and consult the Penn State Law Review article on the Penn Central takings test \(Vol\. 118, Issue 3, Article 3\) to evaluate the Town’s legal position and build the counter\-argument\. A comparative regulatory analysis against other municipal STR ordinances \-\- including the Belvidere Township ordinance provided \-\- will strengthen the clients’ negotiating posture by demonstrating that less restrictive alternatives exist\. The client property database contains all financial data, settlement terms, and fine accrual records across eight tables\.

Produce three deliverables:

A settlement evaluation memo \(PDF, max 6 pages\) analyzing the Town’s proposal as attorney work product\. The memo should evaluate the economic value of the settlement offer versus continued litigation on a per\-property basis, drawing on the database for income projections under the proposed transitional cap, current fine exposure, and long\-term rental conversion comparables\. Address the R\-1 zone exclusion for Property 4 and the viability of the proposed rezoning path given that property’s financial profile\.

A comparative regulatory analysis table \(CSV\) covering Millbrook’s ordinance, the Belvidere Township ordinance, and the New Orleans regulatory scheme as described in the Hignell\-Stark opinion\. The table should include a column explaining the legal significance of each comparison for the clients’ position\.

A settlement counter\-proposal \(TXT\) with terms grounded in case law and the financial data, addressing the transitional period duration, the R\-1 property’s inclusion, the annual review mechanism for the night cap, and preservation of litigation rights during negotiations\.

Query the SQLite database to extract all relevant financial figures, fine accruals, settlement terms, and property details needed across all three deliverables\.

Local Reference Files:

- belvidere\_str\_ordinance\.pdf
- client\_properties\_str\.db
- cornell\_str\_regulations\.pdf
- town\_attorney\_settlement\_letter\.txt

Rubrics \(8 checks, 48 total points\)

1\. 6 pts\. Does the analysis correctly identify and quantify the fine exposure for each of the three properties accruing $500/day fines, using the database data showing fines started accruing on November 29, 2025, and correctly project ongoing exposure through the January 31, 2026 fine\-waiver deadline?
• Identifies the three specific properties accruing fines at $500/day: Property 1 \(28 Oceanfront Lane\), Property 3 \(15 Dune Ridge Court\), and Property 5 \(203 Lighthouse Road\)
• Calculates that as of December 10, 2025, each property had accrued 12 days x $500 = $6,000 in fines
• Projects forward that if settlement is not reached by January 31, 2026, fines will have accrued for 64 days per property totaling $32,000 per property or $96,000 aggregate
• Notes that the fine\-waiver is conditioned on executing a settlement agreement by January 31, 2026, creating significant time pressure

2\. 6 pts\. Does the comparative regulatory analysis table include at minimum the seven required provision categories \(owner\-occupancy requirements, permit caps, night limits, spacing rules, amortization/grandfathering, fine structures, and zone restrictions\) across all three jurisdictions \(Millbrook Ordinance 2025\-14, Belvidere Township, and New Orleans per Hignell\-Stark\)?
• The CSV table includes all seven required provision categories: owner\-occupancy requirements, permit caps, night limits, spacing rules, amortization/grandfathering, fine structures, and zone restrictions
• Accurately compares provisions across all three jurisdictions \(Millbrook Ordinance 2025\-14, Belvidere Township, and New Orleans per Hignell\-Stark\)
• Owner\-occupancy data is accurate: Millbrook requires primary residence/owner\-occupancy; Belvidere requires only a local\-agent within 45 miles \(no owner\-occupancy\); New Orleans required homestead exemption \(found facially discriminatory under dormant Commerce Clause\)
• Night limits, permit caps, and spacing data is accurate: Millbrook caps at 90 nights/year \(120 transitional\), 25 town\-wide permits, 1,000\-ft spacing; Belvidere has no annual night cap \(27\-night max per stay\), no permit cap, no spacing; New Orleans had none referenced
• Amortization, fines, and zone data is accurate: Millbrook 180 days with no grandfathering and $500/day fines, R\-1 prohibition; Belvidere 30\-day grace period with escalating fines \($100/$500/$1,500 plus $250/day\), all zones allowed; New Orleans distinguished residential from non\-residential

3\. 6 pts\. Does the counter\-proposal include a request to extend the transitional period beyond the proposed 3 years \(ending December 1, 2028\), with economic justification grounded in the financial data?
• Counter\-proposal argues for extending the transitional period beyond the proposed 3 years \(ending December 1, 2028\)
• Justifies extension with reference to clients’ substantial mortgage balances \(ranging from $198,000 to $385,000\) and property\-specific investment\-backed expectations based on years of established STR operations \(earliest permit from 2017\)
• Suggests a specific longer period \(e\.g\., 5\-7 years\) with justification
• References the financial data showing the gap between STR income and LTR income \(e\.g\., Property 3 earned ~$67,860/year net from STR vs\. $33,600/year gross from LTR\)

4\. 6 pts\. Does the counter\-proposal include a request for a mutual tolling agreement on the statute of limitations during settlement negotiations?
• Counter\-proposal includes a request for a mutual tolling agreement that pauses the running of applicable statutes of limitations on the clients’ potential constitutional claims \(Takings, Due Process, dormant Commerce Clause under 42 U\.S\.C\. 1983\)
• Provides rationale that the January 5, 2026 response deadline and January 31, 2026 fine\-waiver deadline create time pressure that could force premature settlement
• Notes that the Town’s proposal requires dismissal "with prejudice," meaning if negotiations fail after limitations period expires, clients lose their litigation option
• Explains that a tolling agreement protects both parties by allowing good\-faith negotiation without the pressure of expiring claims

5\. 6 pts\. Does the analysis apply the Penn Central regulatory takings factors \(economic impact, investment\-backed expectations, and character of government action\) to the clients’ specific circumstances to assess litigation strength?
• Applies the economic impact factor: the ordinance eliminates clients’ primary income stream, with historical STR net income ranging from $47,130 to $67,860 annually versus LTR gross income of $21,000 to $33,600 \(roughly 50%\+ income reduction\), though properties retain residual value for other uses
• Applies the investment\-backed expectations factor: clients purchased properties between 2016\-2021 specifically for or relying on STR operations, held valid permits under prior regime \(earliest from 2017\), and the regulatory climate before the 2023 study commission did not signal imminent prohibition
• Applies the character of government action factor: the ordinance is a broad public\-welfare regulation but the 180\-day amortization with no grandfathering and $500/day fines create a punitive character; retroactive elimination of previously permitted use weighs toward finding a taking
• Explicitly identifies and names all three Penn Central factors being applied

6\. 6 pts\. Does the analysis identify the dormant Commerce Clause vulnerability in Millbrook’s owner\-occupancy requirement by drawing on the Hignell\-Stark holding that New Orleans’ similar residency/homestead requirement was facially discriminatory against interstate commerce?
• Identifies that Ordinance 2025\-14’s owner\-occupancy requirement \(primary residence\) is analogous to New Orleans’ homestead exemption requirement held facially discriminatory against interstate commerce in Hignell\-Stark
• Notes the Fifth Circuit found such requirements forbid out\-of\-state property owners from participating in the STR market in residential zones and applied strict scrutiny \(not Pike balancing\), under which the law was "virtually per se invalid"
• Notes that reasonable nondiscriminatory alternatives existed \(enforcement, penalties, night caps, permit caps\) as the court found
• Identifies this as a significant legal weakness in the Town’s position that strengthens the clients’ litigation posture, noting the Fourth Circuit has not yet addressed this but the Fifth Circuit reasoning is persuasive authority

7\. 6 pts\. Does the settlement evaluation correctly identify that the Town’s settlement requires dismissal of all claims ’with prejudice’ and analyze the strategic implications of this term?
• Flags that under the "Mutual Terms" section, clients would be required to "dismiss any pending or contemplated legal claims against the Town with prejudice"
• Explains that dismissal with prejudice permanently extinguishes the clients’ constitutional challenges \(Takings, dormant Commerce Clause, Due Process\) and cannot be refiled
• Analyzes strategic significance: the dormant Commerce Clause claim is strong based on Hignell\-Stark, and a successful constitutional challenge could invalidate the entire ordinance \(not just provide transitional relief\)
• The counter\-proposal addresses this by proposing dismissal without prejudice, or alternatively a carve\-out preserving the right to challenge future amendments

8\. 6 pts\. Does the analysis correctly identify that Millbrook’s 180\-day amortization period with no grandfathering clause is legally vulnerable when compared to amortization periods upheld in the cases cited by the Town Attorney?
• Identifies that the Town Attorney cites AVR, Inc\. v\. City of St\. Louis Park and City of Los Angeles v\. Gage as support, but these cases involved different types of uses \(adult entertainment, industrial\) with different investment profiles than residential STR operations
• Notes that both cited cases provided longer amortization than Millbrook’s 180 days
• Notes Millbrook provides no grandfathering for established operators with years of history, and under Penn Central the adequacy of amortization must be assessed in light of specific investment\-backed expectations – clients with substantial mortgages cannot reasonably recoup investments in 180 days
• Contrasts with Belvidere Township’s less restrictive approach as further evidence of Millbrook’s unreasonableness
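The fine-exposure figures in rubric 1 follow from simple date arithmetic, and a short sketch makes the inclusive day count explicit. The $500 daily rate, the November 29, 2025 start date, and the three accruing properties come from the rubric; the function names are illustrative, and counting both endpoints as fine days is an assumption chosen because it reproduces the rubric's 12-day and 64-day figures.

```python
from datetime import date

DAILY_FINE = 500                      # dollars per property per day (from the rubric)
FIRST_FINE_DAY = date(2025, 11, 29)   # first day fines accrue (from the rubric)
PROPERTIES_ACCRUING = 3               # Properties 1, 3, and 5


def accrued_days(as_of: date, start: date = FIRST_FINE_DAY) -> int:
    """Count fine days from start through as_of, including both endpoints (assumed)."""
    return max((as_of - start).days + 1, 0)


def fine_exposure(as_of: date) -> tuple[int, int]:
    """Return (fines per property, aggregate fines) in dollars as of the given date."""
    per_property = accrued_days(as_of) * DAILY_FINE
    return per_property, per_property * PROPERTIES_ACCRUING


per_dec10, _ = fine_exposure(date(2025, 12, 10))
per_jan31, total_jan31 = fine_exposure(date(2026, 1, 31))
print(per_dec10, per_jan31, total_jan31)
```

Under these assumptions the December 10 figure is $6,000 per property and the January 31 figures are $32,000 per property and $96,000 in aggregate, matching the rubric.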

Task Card

Summary\. This task asks the agent to act as a North Carolina real estate attorney evaluating a Town’s settlement offer on six short\-term rental properties, building constitutional counter\-arguments and producing a memo, comparative regulatory table, and counter\-proposal grounded in case law and per\-property financials\.

ONET task: Study Constitution, statutes, decisions, regulations, and ordinances of quasi\-judicial bodies to determine ramifications for cases\.

Expert desire: 3\.17 \(scale: 0–5, from WorkBank\)

In practice: pulling the Hignell\-Stark Fifth Circuit opinion and the Penn State Law Review Penn Central article, comparing the Millbrook ordinance against Belvidere’s STR rules and the New Orleans scheme, and reconciling all of that with the financial and permitting records sitting in the eight\-table client database – then deriving the constitutional theory that drives the settlement response\.

Why desired\. Constitutional litigation work is research\-saturated – attorneys spend hours reading opinions, scholarly articles, and comparative ordinances before writing a single line of advocacy\. An agent that handles the doctrinal sourcing, ordinance comparison, and per\-property financial roll\-up frees the attorney to focus on strategy and the judgment call about whether to settle, counter, or litigate\.

Reasoning challenges\.
• Turning the Town’s own cited case against the Town\. The settlement letter cites Hignell\-Stark v\. New Orleans \(5th Cir\. 2022\) defensively, but the same opinion’s dormant Commerce Clause holding struck down an owner\-occupancy requirement closely matching Millbrook’s\. The agent must distinguish the adverse takings holding on license\-vs\-fee\-simple grounds and mine the favorable holding the Town’s attorney quietly omitted\.
• Penn Central three\-factor framework recovery\. The instructions only hint at "regulatory takings framework\." The agent must independently invoke Penn Central Transportation Co\. v\. New York City and apply the three\-factor test \(economic impact, investment\-backed expectations, character of government action\) to each of the six properties\.
• Per\-property financial modeling across five database tables\. Settlement valuation requires joining str\_income, ltr\_comparables, settlement\_offers, fine\_accruals, and properties to compute transitional STR income at the 120\-night cap, LTR conversion income, and fine exposure of $500/day from Nov 29 through Jan 31 – across six properties with different ownership and zoning\.
• R\-1 equal protection angle on Property 4\. 77 Seaside Terrace is excluded from the transitional permits but had a valid prior STR permit visible only in the str\_permits table\. The agent must recognize that singling out one property creates an arbitrary classification raising equal protection concerns and counter\-propose either inclusion or a binding rezoning commitment\.
• Amortization vulnerability across jurisdictions\. The Town cites AVR, Inc\. and City of Los Angeles v\. Gage, but those involved 1\-year and 5\-year periods for adult entertainment and industrial uses\. The agent must build the argument that 180 days with no grandfathering is constitutionally inadequate for residential STRs carrying mortgages\.
• Dismissal\-with\-prejudice and tolling traps\. The settlement letter buries dismissal\-with\-prejudice language that would extinguish the dormant Commerce Clause and takings claims permanently\. The agent must flag this and demand a mutual tolling agreement so the 1983 limitations clock does not expire during negotiations\.

Files requiring search\. External legal references the agent must locate to evaluate the Town’s position and ground the counter\-proposal\.
• hignell\_stark\_v\_new\_orleans\_5thcir\_2022\.pdf: Distinguish the adverse takings holding and mine the favorable dormant Commerce Clause holding against owner\-occupancy rules; populates the New Orleans column of the comparative table\. *Source:* ca5\.uscourts\.gov \([https://www\.ca5\.uscourts\.gov/opinions/pub/21/21\-30643\-CV0\.pdf](https://www.ca5.uscourts.gov/opinions/pub/21/21-30643-CV0.pdf)\)\.
• penn\_central\_takings\_test\.pdf: Apply the Penn Central three\-factor regulatory takings test to assess litigation strength and ground counter\-proposal terms\. *Source:* pennstatelawreview\.org \([https://www\.pennstatelawreview\.org/footnotes/3996/](https://www.pennstatelawreview.org/footnotes/3996/)\)\.
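The per-property roll-up named in the reasoning challenges is essentially a multi-table join, which can be sketched with Python's built-in `sqlite3` module. The table names `properties`, `str_income`, and `ltr_comparables` come from the task card; every column name below is hypothetical, since the paper does not show the schema, and the single row uses only the Property 3 figures quoted in the rubric. The real database has eight tables and six properties.

```python
import sqlite3

# In-memory stand-in for client_properties_str.db; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE properties (property_id INTEGER PRIMARY KEY, address TEXT);
CREATE TABLE str_income (property_id INTEGER, annual_net REAL);
CREATE TABLE ltr_comparables (property_id INTEGER, annual_gross REAL);
INSERT INTO properties VALUES (3, '15 Dune Ridge Court');
INSERT INTO str_income VALUES (3, 67860);
INSERT INTO ltr_comparables VALUES (3, 33600);
""")

# Join STR income against the long-term-rental comparable for each property.
rows = conn.execute("""
SELECT p.property_id, p.address, s.annual_net, l.annual_gross,
       s.annual_net - l.annual_gross AS income_gap
FROM properties AS p
JOIN str_income AS s ON s.property_id = p.property_id
JOIN ltr_comparables AS l ON l.property_id = p.property_id
ORDER BY p.property_id
""").fetchall()

for property_id, address, str_net, ltr_gross, gap in rows:
    reduction = gap / str_net  # share of STR net income lost on LTR conversion
    print(property_id, address, gap, round(reduction, 3))
```

Like the rubric, this compares STR net income with LTR gross income, so the gap is if anything understated. A fuller version would also join `settlement_offers` and `fine_accruals`, as the task card describes.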

### E\.4 Web Administrators: ShopVault Incident Reconstruction

Web Administrators – Task 1: Incident reconstruction and hardening

Assignment Brief: Act as a web administrator investigating ShopVault after a hardening change request was closed without verification\. Reconstruct attack chains from logs, audit configs, and produce hardened nginx and firewall rules\.

Evidence Files:
• Access logs and threat indicators
• Change request CR\-2025\-0347
• nginx\.conf and iptables\.rules
• Network topology JSON
• SSL audit and incident template

Reasoning Burden:
• Kill\-chain reconstruction: Sequence SQLi, brute force, IDOR, XSS, traversal, and destructive actions\.
• Compliance audit: Map missed change\-request controls to observed exploit outcomes\.
• Topology\-aware repair: Respect zone CIDRs when hardening nginx and iptables\.
• Incident reporting: Convert technical evidence into NIST\-style findings and actions\.

Required Deliverables:
• findings\_summary\.csv: attack\-chain events with phases and success flags\.
• cr\_compliance\_audit\.csv: control gaps tied to outcomes\.
• hardened configs: nginx, iptables, and incident report\.

Rubric Pressure Points: Full kill\-chain timeline; CR gap\-to\-outcome mapping; dotfile/debug/admin\-control fixes; topology\-safe firewall rules; TLS findings; NIST incident report completeness\.

Figure 12: Task schematic for Web Administrators\. The agent receives access logs, threat indicators, a closed change request, live nginx and iptables configuration, and a network topology, and must reconstruct the kill chain, audit the missed controls, and deliver hardened configurations and a NIST\-style incident report\.

Instruction

On November 10, 2025, ShopVault’s monitoring flagged anomalous traffic patterns across the e\-commerce platform\. Concurrently, the CISO discovered that change request CR\-2025\-0347 \-\- a security hardening sprint with an October 31 deadline \-\- was marked "Closed \(Implemented\)" without post\-implementation verification\. You are tasked with a full incident investigation and remediation\.

Reconstruct the attack chains from the access logs by correlating them against the threat intelligence feed\. Produce findings\_summary\.csv capturing each event with its matched signature, kill chain phase, and whether it succeeded\. The compliance audit must assess every item in the change request against the live configurations and explain, for each gap, how it contributed to the incident’s outcome\. Deliver this as cr\_compliance\_audit\.csv\. Assess the TLS posture by cross\-referencing the certificate inventory with the change request’s TLS requirements and the actual web server configuration\.

Produce hardened replacements for both the web server and firewall configurations\. The firewall rules must enforce the approved zone segmentation \-\- consult the network topology for the correct source\-IP restrictions per zone\. The iptables\.firewall\.rules file from the abdusfauzi/wordpress\-nginx GitHub repository may be useful as a structural baseline, though it will require adaptation to ShopVault’s architecture\.

Populate the incident report template and save it as incident\_report\.html\. The executive summary must trace causation from unverified change request items to specific, observed attack outcomes\.

Deliverables: findings\_summary\.csv, cr\_compliance\_audit\.csv, nginx\_hardened\.conf, iptables\_hardened\.rules, incident\_report\.html

Local Reference Files:
- access\_logs\.csv
- change\_request\_CR2025\-0347\.txt
- incident\_report\_template\.html
- iptables\.rules
- network\_topology\.json
- nginx\.conf
- ssl\_cert\_audit\.csv
- threat\_indicators\.json

Rubrics \(8 checks, 50 total points\)

1\. 8 pts\. Does the findings\_summary\.csv reconstruct the full kill chain progression for 203\.0\.113\.42 showing the correct sequence: initial probing \(SQL injection at 08:13:01\-05Z\), credential access \(brute force at 08:13:08\-16Z\), enumeration \(IDOR user scraping at 08:16:00\-09Z\), exploitation \(XSS at 08:19:00\-08Z\), data exfiltration \(path traversal \+ config download at 08:21:00\-04Z\), and post\-compromise actions \(DELETE/PUT at 08:25:00\-05Z\)?
• In findings\_summary\.csv, the kill chain must show initial SQL injection probing at 08:13:01\-05Z \(reconnaissance/weaponization\) followed by brute force authentication at 08:13:08\-16Z \(credential access\), correctly sequenced and labeled\.
• In findings\_summary\.csv, IDOR enumeration of /api/users/1 through /api/users/10 at 08:16:00\-09Z must be mapped to a discovery/enumeration phase\.
• In findings\_summary\.csv, XSS injection attempts at 08:19:00Z, 08:19:05Z, and 08:19:08Z using script tags, onerror handlers, and SVG onload, plus a stored XSS review submission at 08:19:02Z, must be mapped to an exploitation phase\.
• In findings\_summary\.csv, path traversal and successful config file download at 08:21:00\-04Z, including the critical /api/download?file=/etc/nginx/nginx\.conf returning 200 with 2840 bytes, must be mapped to an exfiltration phase\.
• In findings\_summary\.csv, destructive actions at 08:25:00\-05Z including DELETE of products 1042\-1044, PUT to /api/users/1\-2, and PUT to /api/config/settings must be mapped to an impact/post\-compromise phase\.

2\. 6 pts\. Does the cr\_compliance\_audit\.csv include risk assessments that correctly connect non\-compliance to specific observed attack outcomes – particularly that the missing dotfile blocking \(1c\) enabled \.env and \.git/config exposure, the missing debug endpoint removal \(1f\) enabled information leakage of 8921 bytes, and the missing admin IP restriction \(1e\) left the admin panel accessible to external attackers?
• Item 1c’s risk description must reference that the Nikto scanner \(198\.51\.100\.77\) successfully retrieved \.env \(487 bytes of configuration/credentials\) and \.git/config \(312 bytes of repository metadata\) due to missing dotfile blocking\.
• Item 1f’s risk description must note that /api/v1/debug was accessed and returned 8921 bytes of debug information\.
• Item 1e’s risk description must note that the /admin/ path was accessible to external IPs because Nginx had no source IP restriction, with the 403 coming only from the application itself\.
• In cr\_compliance\_audit\.csv, firewall items \(2a\-2d\) must connect non\-compliance to the fact that internal services are directly reachable from the internet, violating the approved zone segmentation in network\_topology\.json\.

3\. 6 pts\. Does the hardened iptables configuration \(iptables\_hardened\.rules\) use zone\-based source IP restrictions matching network\_topology\.json – specifically restricting port 3000 to 10\.100\.0\.0/24 \(DMZ\), port 3001 to 10\.200\.0\.0/24 \(Management\), ports 5432 and 6379 to 10\.100\.1\.0/24 \(Application zone\), and SSH to 10\.200\.0\.0/24?
• Port 3000 \(Node\.js\) must only accept connections from DMZ zone 10\.100\.0\.0/24, and port 3001 \(admin panel\) must only accept from Management zone 10\.200\.0\.0/24\.
• Port 5432 \(PostgreSQL\) and port 6379 \(Redis\) must only accept connections from Application zone 10\.100\.1\.0/24\.
• SSH must be restricted to Management zone 10\.200\.0\.0/24 and include rate limiting \(e\.g\., ’\-m recent’ or ’\-m limit’\) enforcing a maximum of 3 new connections per minute as specified in CR item 2e\.
• The rules must NOT use ’ACCEPT’ from anywhere for any of these ports, and must NOT rely solely on localhost/127\.0\.0\.1 bindings\.

4\. 6 pts\. Does the TLS posture assessment correctly identify all four certificate issues: \(1\) shopvault\.example\.com and api\.shopvault\.example\.com using TLSv1\.0 minimum instead of TLSv1\.2, \(2\) admin\.shopvault\-internal\.com certificate expired on 2025\-11\-01, \(3\) staging\.shopvault\.example\.com using a self\-signed certificate with 1024\-bit key on SSLv3, and \(4\) the nginx\.conf having no TLS/SSL configuration at all \(no port 443 listener\)?
• The TLS assessment must identify that shopvault\.example\.com and api\.shopvault\.example\.com both have protocol\_min set to TLSv1\.0, violating CR item 3c which requires minimum TLSv1\.2\.
• The assessment must identify that admin\.shopvault\-internal\.com certificate expired on 2025\-11\-01 \(9 days before the incident on Nov 10\), violating CR item 3a\.
• The assessment must identify that staging\.shopvault\.example\.com has a self\-signed certificate with only 1024\-bit key size and SSLv3 protocol minimum, and should have been decommissioned per CR item 3b\.
• The assessment must note that nginx\.conf only has a port 80 listener with no SSL/TLS server block, meaning HTTPS is not configured at the web server level regardless of certificate availability\.
• The assessment must note that payments\.shopvault\.example\.com \(4096\-bit, TLSv1\.2\) is the only properly configured certificate\.

5\. 6 pts\. Does the incident report’s executive summary explicitly explain how the unimplemented change request CR\-2025\-0347 directly enabled the observed attacks – specifically linking missing dotfile protection to \.env/\.git exposure, missing rate limiting to successful brute force, missing admin restrictions to admin panel accessibility, and missing firewall zone segmentation to direct service exposure?
• The executive summary must state that CR item 1c \(dotfile blocking\) was not implemented, directly allowing 198\.51\.100\.77 to download \.env \(containing credentials/secrets\) and \.git/config \(containing repository metadata\)\.
• The summary must state that the absence of rate limiting \(CR item 2e and web\-layer\) allowed 203\.0\.113\.42 to perform 18\+ rapid brute force login attempts without throttling, ultimately succeeding\.
• The summary must state that CR item 1e \(/admin/ IP restriction\) was not implemented, leaving the admin panel accessible from any IP, and CR items 2a\-2d \(firewall port restrictions\) were not implemented, meaning ports 3000, 3001, 5432, and 6379 are directly accessible from the internet\.
• The summary must explicitly state that the CR was marked ’Closed \(Implemented\)’ without verification, and this process failure enabled the entire attack surface\.

6\. 6 pts\. Does the incident report contain a properly populated incident timeline table with the correct chronological sequence of events, covering all three malicious IPs \(203\.0\.113\.42, 198\.51\.100\.77, 45\.33\.32\.156\) and their respective attack categories?
• The timeline must include 203\.0\.113\.42’s attack progression: SQL injection at 08:13:01Z, brute force starting at 08:13:08Z with success at 08:13:16Z, IDOR enumeration at 08:16:00Z, XSS attempts at 08:19:00Z, path traversal/config exfiltration at 08:21:00Z, second brute force wave at 08:23:00Z, and post\-compromise destructive actions \(DELETE, PUT\) at 08:25:00Z\.
• The timeline must include 198\.51\.100\.77’s Nikto scanning starting at 08:14:00Z\.
• The timeline must include 45\.33\.32\.156’s Nmap scanning starting at 08:17:00Z and Shellshock attempts at 08:28:00Z\.
• All events must be in chronological order with correct timestamps, source IPs, and attack categories\.

7\. 6 pts\. Does the incident report’s vulnerabilities section correctly classify exploited vulnerabilities using OWASP categories, distinguishing between successful exploitations \(brute force login, IDOR user enumeration, \.env/\.git exposure, config file download, XSS reflected, post\-auth destructive actions\) and merely attempted attacks \(SQL injection returning 500, Shellshock against non\-existent CGI paths\)?
• SQL Injection \(A03:2021\-Injection\) must be classified as attempted but failed \(500 status codes\), and Shellshock \(A03:2021\-Injection\) must be classified as attempted but failed \(404, CGI paths don’t exist\)\.
• Brute Force \(A07:2021\-Identification and Authentication Failures\) must be classified as successful \(attacker gained authenticated session\), and IDOR/Broken Access Control \(A01:2021\-Broken Access Control\) must be classified as successful \(10 user records enumerated with 200 status\)\.
• XSS \(A03:2021\-Injection\) must be classified as successful \(requests returned 200 with 15230\-byte response bodies\), and Path Traversal \(A01:2021\-Broken Access Control\) must be classified as partially successful \(\.\./\.\./etc/passwd failed with 500, but /api/download?file=/etc/nginx/nginx\.conf succeeded with 200/2840 bytes\)\.
• Sensitive File Exposure \(A05:2021\-Security Misconfiguration\) must be classified as successful for \.env and \.git/config, and the distinction between attempted and successful exploitation must be clearly made for each vulnerability\.

8\. 6 pts\. Does the incident report include a comprehensive remediation actions table with prioritized items covering immediate containment \(block attacker IPs, revoke compromised sessions\), short\-term fixes \(deploy hardened configs, renew expired certs\), and long\-term improvements \(WAF deployment, rate limiting, automated CR verification\)?
• P1/Immediate actions must include: block malicious IPs \(203\.0\.113\.42, 198\.51\.100\.77, 45\.33\.32\.156\) at the firewall, revoke all active sessions \(especially the brute\-forced session\), reset compromised user credentials for at least users 1 and 2, and verify/restore deleted products \(1042\-1044\) and revert the config settings change\.
• P2/Short\-term actions must include: deploy the hardened nginx\.conf and iptables\.rules, renew the expired admin\.shopvault\-internal\.com certificate, decommission staging\.shopvault\.example\.com, and enforce TLS 1\.2 minimum\.
• P3/Long\-term actions must include: deploy a Web Application Firewall \(WAF\), implement rate limiting at application and infrastructure layers, establish automated post\-implementation verification for change requests, and enhance logging and monitoring capabilities\.
• All remediation actions must be specific to the findings from this incident, not generic security recommendations\.
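Rubrics 1 and 6 both hinge on ordering interleaved events from three source IPs and attaching a kill-chain phase to each. The sketch below shows one way to build such a timeline. The IPs, start timestamps, and attack categories are taken from the rubric text, and the phase labels for 203.0.113.42 follow rubric 1. The rubrics assign no phase to the Nikto, Nmap, or Shellshock events, so those are left as `unlabeled` rather than guessed. The data layout and function names are illustrative, not the benchmark's.

```python
from datetime import datetime, timezone

INCIDENT_DAY = "2025-11-10"

# Phase labels from rubric 1; categories it does not cover stay "unlabeled".
PHASE_BY_CATEGORY = {
    "SQL injection": "reconnaissance",
    "brute force": "credential access",
    "IDOR enumeration": "enumeration",
    "XSS": "exploitation",
    "path traversal / config download": "exfiltration",
    "destructive DELETE/PUT": "impact",
}

# (source IP, start time in UTC, attack category), deliberately out of order.
RAW_EVENTS = [
    ("45.33.32.156", "08:28:00", "Shellshock"),
    ("203.0.113.42", "08:25:00", "destructive DELETE/PUT"),
    ("203.0.113.42", "08:13:01", "SQL injection"),
    ("198.51.100.77", "08:14:00", "Nikto scan"),
    ("203.0.113.42", "08:13:08", "brute force"),
    ("203.0.113.42", "08:16:00", "IDOR enumeration"),
    ("45.33.32.156", "08:17:00", "Nmap scan"),
    ("203.0.113.42", "08:19:00", "XSS"),
    ("203.0.113.42", "08:21:00", "path traversal / config download"),
    ("203.0.113.42", "08:23:00", "brute force"),
]


def build_timeline(raw_events):
    """Parse timestamps, attach phases, and return events in chronological order."""
    timeline = []
    for source_ip, clock, category in raw_events:
        stamp = datetime.strptime(f"{INCIDENT_DAY} {clock}", "%Y-%m-%d %H:%M:%S")
        timeline.append({
            "timestamp": stamp.replace(tzinfo=timezone.utc),
            "source_ip": source_ip,
            "category": category,
            "phase": PHASE_BY_CATEGORY.get(category, "unlabeled"),
        })
    return sorted(timeline, key=lambda event: event["timestamp"])


def chain_for(timeline, source_ip):
    """Extract one actor's kill chain from the unified timeline."""
    return [event for event in timeline if event["source_ip"] == source_ip]


timeline = build_timeline(RAW_EVENTS)
for event in timeline:
    print(event["timestamp"].isoformat(), event["source_ip"],
          event["category"], event["phase"])
```

Keeping one unified, sorted list and filtering it per IP lets the same data serve both the per-actor reconstruction and the chronological table in the incident report.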

**Task Card**

**Summary\.** This task asks the agent to act as the web administrator investigating ShopVault’s November 10, 2025 incident – change request CR\-2025\-0347 was closed without verification – by reconstructing attack chains from access logs against threat intel, auditing every CR item against live nginx and iptables configurations, hardening both configs against the actual zone topology, and populating a NIST\-structured incident report\.

**O\*NET task\.** Monitor systems for intrusions or denial of service attacks, and report security breaches to appropriate personnel\.

**Expert desire\.** 3\.80 \(scale: 0\-5, from WorkBank\)

**In practice\.** Correlating access\_logs\.csv against threat\_indicators\.json to label each event with a kill\-chain phase and success flag, walking every CR\-2025\-0347 item against the running nginx\.conf and iptables\.rules to find the gap that enabled each observed exploit, and producing hardened replacement configs that respect the network\_topology\.json zone CIDRs\.

**Why desired\.** Incident reconstruction is an evidence\-joining job – match an IP across hundreds of log lines, line up the matching IOC, trace it to the configuration weakness, then write it up\. Doing this by hand under post\-incident pressure is slow and error\-prone\. An agent that produces the audited matrix and hardened configs lets the administrator focus on containment decisions and executive communication\.

**Reasoning challenges\.**

- **Causal chain from CR non\-compliance to attack success\.** CR item 1c \(dotfile blocking\) was never implemented, so nginx\.conf still allows \.env access and Nikto scanner 198\.51\.100\.77 retrieved \.env \(HTTP 200, 487 bytes\)\. Each compliance gap in cr\_compliance\_audit\.csv must be tied to specific log entries showing exploitation\.
- **Zone segmentation reconciliation for hardened iptables\.** The network\_topology\.json CIDRs must drive source\-IP restrictions: port 3000 to DMZ \(10\.100\.0\.0/24\), port 3001 to Management \(10\.200\.0\.0/24\), ports 5432/6379 to Application \(10\.100\.1\.0/24\), SSH to Management \(10\.200\.0\.0/24\)\.
- **Adapting the WordPress\-Nginx reference, not copying it\.** The external iptables\_reference\.rules contains FTP rules on port 21 that are not part of ShopVault’s approved architecture\. The agent must exclude these rather than blindly inherit them\.
- **Multi\-actor kill chain differentiation\.** Three malicious IPs \(203\.0\.113\.42, 198\.51\.100\.77, 45\.33\.32\.156\) and three legitimate users are interleaved in the logs\. Each threat actor’s kill chain must be reconstructed independently while a unified chronological timeline is maintained for the incident report\.
- **TLS gap as absence, not misconfiguration\.** Cross\-referencing ssl\_cert\_audit\.csv against CR items 3a\-3c and nginx\.conf reveals that nginx has no port 443 listener and no SSL/TLS directives at all – HTTPS is entirely unconfigured despite valid certificates existing\.
- **SSH rate\-limit threshold extraction\.** CR item 2e specifies SSH rate limiting at 3 connections per minute \(embedded in the change request text\) – the hardened iptables must implement this exactly using `-m recent` or `-m limit`, not generic rate\-limiting\.
- **Kill chain phase taxonomy applied independently\.** The instructions name "kill chain phase" without defining the phases – the agent must apply Cyber Kill Chain or MITRE ATT&CK labels \(reconnaissance, credential access, enumeration, exploitation, exfiltration, post\-compromise/impact\) on its own\.
- **NIST SP 800\-61 report structure inferred from template\.** The incident\_report\_template\.html follows NIST SP 800\-61 Rev\. 2 \(preparation, detection, containment, eradication, recovery, lessons learned\) – the agent must recognize the structure and populate sections accordingly, with the executive summary tracing causation from unverified CR items to observed attack outcomes\.

**Files requiring search\.** External firewall reference the agent must locate as a structural baseline for the hardened iptables ruleset\.

- iptables\_reference\.rules: WordPress\-Nginx iptables hardening reference providing default\-deny baseline and service exposure patterns; must be adapted to drop FTP and incorporate ShopVault’s zone\-based architecture\. *Source:* github\.com/abdusfauzi/wordpress\-nginx \([https://raw\.githubusercontent\.com/abdusfauzi/wordpress\-nginx/master/etc/iptables\.firewall\.rules](https://raw.githubusercontent.com/abdusfauzi/wordpress-nginx/master/etc/iptables.firewall.rules)\)\.
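The zone-segmentation and reference-adaptation challenges in this task card can be illustrated with a minimal sketch. It is not the benchmark's reference solution: the dictionary layout, the function names `zone_rules`, `opens_port`, and `drop_unapproved`, and the sample reference rules are all assumptions made for illustration. Only the CIDRs and the port-to-zone mapping come from the task card, and the real schema of network_topology.json is not shown in the paper.

```python
import ipaddress

# Zone CIDRs as quoted in the task card. The dict layout is an assumption,
# not the actual schema of network_topology.json.
ZONES = {
    "dmz": "10.100.0.0/24",
    "management": "10.200.0.0/24",
    "application": "10.100.1.0/24",
}

# Port -> zone allowed to reach it, as listed in the task card.
PORT_POLICY = {
    3000: "dmz",
    3001: "management",
    5432: "application",
    6379: "application",
    22: "management",
}


def zone_rules(zones, port_policy):
    """Return iptables-save style ACCEPT rules, one per port, sorted by port."""
    rules = []
    for port, zone in sorted(port_policy.items()):
        # ip_network raises ValueError on a malformed CIDR, so bad input fails early.
        cidr = ipaddress.ip_network(zones[zone])
        rules.append(f"-A INPUT -p tcp -s {cidr} --dport {port} -j ACCEPT")
    return rules


def opens_port(rule, port):
    """True if the rule carries an exact '--dport <port>' token pair."""
    tokens = rule.split()
    return any(a == "--dport" and b == str(port) for a, b in zip(tokens, tokens[1:]))


def drop_unapproved(reference_rules, banned_ports=(21,)):
    """Keep only reference rules that do not open a banned port (FTP by default)."""
    return [r for r in reference_rules
            if not any(opens_port(r, p) for p in banned_ports)]


if __name__ == "__main__":
    # Hypothetical sample lines, not the contents of the real reference file.
    reference = [
        "-A INPUT -p tcp --dport 21 -j ACCEPT",
        "-A INPUT -p tcp --dport 80 -j ACCEPT",
    ]
    for line in drop_unapproved(reference) + zone_rules(ZONES, PORT_POLICY):
        print(line)
```

Matching on token pairs rather than substrings keeps a rule for port 2121 from being mistaken for FTP. The SSH rate limit from CR item 2e is deliberately left out of the sketch, since its exact encoding depends on the change request text.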

## Appendix F Prompt Templates and Evaluation Interface

In this appendix, we record the runtime prompts used by the JobBench evaluation\. Section [F\.1](https://arxiv.org/html/2605.26329#A6.SS1) reports the agent evaluation prompt used by all CLI runners\. Section [F\.2](https://arxiv.org/html/2605.26329#A6.SS2) reports the rubric\-level judge prompt used for automatic evaluation\.

### F\.1 Agent Evaluation Prompt

The OpenCode, Claude Code, and Codex CLI runners share a single task prompt\. Before each evaluation, the runner copies the task into an isolated temporary workspace and substitutes the path placeholders shown below\. The prompt directs the agent to the task instructions, the reference files, and the output directory\.

`Agent Evaluation Prompt`
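The workspace preparation described in this subsection (copy the task into an isolated temporary directory, then substitute path placeholders into the shared prompt) can be sketched roughly as follows. The template text, the placeholder names, the `instructions.md`, `reference`, and `output` paths, and the function `prepare_workspace` are all hypothetical; the actual prompt and directory layout used by the JobBench runners are not reproduced here.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical template and placeholder names, for illustration only.
PROMPT_TEMPLATE = (
    "Read the task instructions at {instructions_path}. "
    "Reference files are in {reference_dir}. "
    "Write all deliverables to {output_dir}."
)


def prepare_workspace(task_dir):
    """Copy a task into a fresh temporary directory and render the prompt."""
    # mkdtemp creates a unique directory, so concurrent runs cannot collide.
    workspace = Path(tempfile.mkdtemp(prefix="jobbench_")) / "task"
    shutil.copytree(task_dir, workspace)

    output_dir = workspace / "output"
    output_dir.mkdir(exist_ok=True)

    prompt = PROMPT_TEMPLATE.format(
        instructions_path=workspace / "instructions.md",
        reference_dir=workspace / "reference",
        output_dir=output_dir,
    )
    return workspace, prompt
```

Working on a copy means the agent cannot modify the original task files, and every run starts from the same state.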

### F\.2 Judge Prompt

The judge evaluates one rubric at a time\. Before each call, JobBench extracts text from the model’s output directory and normalizes the rubric criteria\. The prompt then supplies the rubric, all criteria, and the extracted output contents, and requires a structured JSON response\.

`Judge Prompt`

When the rubric wording requires visual evidence, the judge call additionally attaches image files from the model’s output directory\. The evidence bullet then appends “and the attached images”, and the user content includes the following multimodal attachment block before the image payloads\.

`Vision Attachment Block`

## Appendix G Supplementary Analyses on JobBench’s Position in the AI Labour Market

**Worker perspective\.** JobBench starts from what workers actually want to delegate, not just from which jobs look economically exposed\. That matters because demand for automation and model capability do not always line up: some tasks workers want to offload are already easy for current systems, while others still fail even when demand is high\. JobBench is useful because it shows this gap at the task level instead of hiding it inside occupation averages\.

**Labour\-market outlook\.** Current AI can already speed up routine work, consistent with prior evidence of 15–34% productivity gains in customer\-service settings \[10\], but it still struggles with the judgment\-heavy details needed for reliable delegation\. JobBench is useful for tracking whether future capability gains are expanding human productivity or increasing replacement pressure\.