A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets
Summary
This paper introduces a benchmark for predicting spreadsheet user actions, addressing challenges in edit history availability and complex action spaces through manual curation and online evaluation methodology.
Cached at: 06/18/26, 03:58 PM
Source: https://huggingface.co/papers/2606.13802
Abstract
A benchmark for predicting spreadsheet user actions is introduced, addressing challenges in edit history availability and complex action spaces through manual curation and online evaluation methodology.
Predictive code completion greatly accelerates how quickly developers work. In spreadsheets, despite being much more common, such auto-completion features are virtually non-existent. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges are (1) the absence of edit histories in public spreadsheet corpora and (2) the complex space of spreadsheet actions (spatial, temporal, composite). To address (1), we manually curate 52 sequences of 12K actions that recreate spreadsheets from public corpora, seeded by parametrized heuristics and LLM refinement. To address (2), we propose an online evaluation that expects a prediction after each user action, accepts or rejects that prediction, updates the future actions upon acceptance, and repeats this until the target spreadsheet is obtained. We use multiple baseline predictors (including zero-shot LLMs, fine-tuned SLMs, and classical models) and analyze different properties that our benchmark teaches us, including but not limited to: properties of saved actions and false positives, efficiency, effect of user profiles, effect of triggers, and effect of context.
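The online evaluation loop described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: actions are simplified to (cell, value) writes, a prediction is accepted only if every suggested action is still pending in the recorded sequence, and the names `online_eval`, `fill_series`, and the counters are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical, simplified action: write `value` into `cell` (e.g. ("A3", "3")).
Action = Tuple[str, str]
Sheet = Dict[str, str]
Predictor = Callable[[List[Action], Sheet], Optional[List[Action]]]


def online_eval(actions: List[Action], predict: Predictor) -> Dict[str, int]:
    """Replay a recorded sequence, asking the predictor after each user action."""
    remaining = list(actions)  # future user actions still needed to reach the target
    history: List[Action] = []
    sheet: Sheet = {}
    stats = {"user_actions": 0, "saved_actions": 0, "accepted": 0, "rejected": 0}

    while remaining:
        # The simulated user performs the next recorded action.
        cell, value = remaining.pop(0)
        sheet[cell] = value
        history.append((cell, value))
        stats["user_actions"] += 1
        if not remaining:
            break  # target spreadsheet obtained; no further prediction is needed

        suggestion = predict(history, dict(sheet))
        if not suggestion:
            continue
        unique = list(dict.fromkeys(suggestion))  # drop repeated suggestions

        # Assumed acceptance rule: every suggested action must still be pending.
        if all(a in remaining for a in unique):
            for a in unique:
                remaining.remove(a)  # update the future actions upon acceptance
                sheet[a[0]] = a[1]
                history.append(a)
            stats["accepted"] += 1
            stats["saved_actions"] += len(unique)
        else:
            stats["rejected"] += 1  # counted as a false positive in this sketch
    return stats


def fill_series(history: List[Action], sheet: Sheet) -> Optional[List[Action]]:
    """Toy predictor: continue an integer series down a single-letter column."""
    if len(history) < 2:
        return None
    (c1, v1), (c2, v2) = history[-2], history[-1]
    if c1[0] != c2[0] or not (v1.isdigit() and v2.isdigit()):
        return None
    step = int(v2) - int(v1)
    return [(f"{c2[0]}{int(c2[1:]) + 1}", str(int(v2) + step))]


series = [("A1", "1"), ("A2", "2"), ("A3", "3"), ("A4", "4")]
print(online_eval(series, fill_series))
# {'user_actions': 3, 'saved_actions': 1, 'accepted': 1, 'rejected': 0}
```

In this toy run the predictor's suggestion for A3 is accepted, so the simulated user performs three of the four recorded actions. A real benchmark harness would need a richer action space (the abstract mentions spatial, temporal, and composite actions) and its own acceptance criteria.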
Similar Articles
PreAct-Bench: Benchmarking Predictive Monitoring in LLMs
PreAct-Bench is a benchmark of 1,000 paired ethical and unethical action trajectories across five domains, designed to evaluate the ability of LLMs to predict harmful outcomes from partial trajectories (predictive monitoring). Results show that while humans perform well, current LLMs struggle, highlighting the need for future-oriented risk reasoning.
BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces
BehaviorBench is a benchmark for evaluating personalized decision modeling from real-world behavioral traces, using prediction-market and on-chain records to test belief and trade prediction tasks.
TabClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning
TabClaw is an open-source interactive AI agent for spreadsheet manipulation and table reasoning that uses LLMs to automate data analysis, support multi-table reasoning, and adapt to user preferences through memory and skill extraction.
From Heuristics to Analytics: Forecasting Effort and Progress in Online Learning
This paper introduces engagement forecasting for intelligent tutoring systems, predicting weekly minutes practiced and new skills mastered using interaction logs from 425 middle-school students. Feature-based models reduce error by 22-33% over heuristic baselines, offering explainable patterns for tutor-learner goal setting.
ForecastBench-Sim: A Simulated-World Forecasting Benchmark
Introduces ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, designed to provide controlled, immediately resolvable tasks for evaluating probabilistic reasoning in AI systems.