Abstracting Cross-Domain Action Sequences into Interpretable Workflows

arXiv cs.AI Papers

Summary

This paper introduces WorkflowView, a framework that uses large language models to abstract low-level, noisy user action sequences into interpretable high-level activities, demonstrating effectiveness across browser logs, MOOC dropout prediction, and privacy-preserving document workflow analysis.

arXiv:2606.14654v1 Announce Type: new Abstract: Sequential or time-stamped interaction logs provide objective records of digital application usage, yet their granularity and noise often obscure meaningful insights into people's work. Such insights are essential for improving digital products in ways grounded in real-world user interactions. Prior research has applied deep learning models to cluster user actions into high-level activities, but these approaches are highly sensitive to noise and struggle to generalize across applications. To address this limitation, we introduce WorkflowView, a framework that uses large language models (LLMs) to abstract low-level action sequences into high-level activities. We establish the effectiveness and generality of our approach across three distinct, challenging sequential tasks and diverse domains: (a) zero-shot task description reconstruction from browser logs (achieving high semantic similarity, $\mu_{sim} = 0.91$), (b) few-shot student dropout prediction using MOOC interaction logs (reaching weighted $F_1 = 0.90$ with only five few-shot examples), and (c) anonymized, privacy-preserving analysis of AI tool integration within document workflows in Microsoft Word. Our work demonstrates that LLM-based abstraction is a robust and efficient path forward for transforming low-level behavioral data into high-level, interpretable, and actionable insights. We also discuss practical considerations for deploying LLM-based inferences within logging infrastructures, including computational efficiency and user privacy.

Cached at: 06/15/26, 09:13 AM

# Abstracting Cross-Domain Action Sequences into Interpretable Workflows
Source: [https://arxiv.org/html/2606.14654](https://arxiv.org/html/2606.14654)
###### Abstract

Sequential or time-stamped interaction logs provide objective records of digital application usage, yet their granularity and noise often obscure meaningful insights into people's work. Such insights are essential for improving digital products in ways grounded in real-world user interactions. Prior research has applied deep learning models to cluster user actions into high-level activities, but these approaches are highly sensitive to noise and struggle to generalize across applications. To address this limitation, we introduce WorkflowView, a framework that uses large language models (LLMs) to abstract low-level action sequences into high-level activities. We establish the effectiveness and generality of our approach across three distinct, challenging sequential tasks and diverse domains: (a) zero-shot task description reconstruction from browser logs (achieving high semantic similarity, $\mu_{sim} = 0.91$), (b) few-shot student dropout prediction using MOOC interaction logs (reaching weighted $F_1 = 0.90$ with only five few-shot examples), and (c) anonymized, privacy-preserving analysis of AI tool integration within document workflows in Microsoft Word. Our work demonstrates that LLM-based abstraction is a robust and efficient path forward for transforming low-level behavioral data into high-level, interpretable, and actionable insights. We also discuss practical considerations for deploying LLM-based inferences within logging infrastructures, including computational efficiency and user privacy.


Gaurav Verma and Scott Counts, Microsoft Corporation, {gauravverma,counts}@microsoft.com

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.14654v1/x1.png)

Figure 1: We propose an LLM-based framework for hierarchical abstraction of user action sequences into interpretable high-level activities (WorkflowView). The left panel illustrates raw action sequences from three domains. WorkflowView enables downstream inferences with task-specific steerability, such as reconstructing user intent in browsers, predicting student dropout in MOOCs, and privacy-preserving categorization of document-centric workflows.

Terabytes of user interface (UI) interaction logs are captured every hour as users interact with digital applications. These logs enable unobtrusive analysis of usage patterns, facilitate bug identification, and support iterative deployment of product improvements that better align with users' needs. UI interaction logs provide an objective account of what actions users perform and when they perform them (e.g., (DD/MM/YY HH:MM:SS, ClickedLayoutRibbon)). However, such timestamped action sequences are often too granular and noisy to yield a clear view of the high-level task a user is performing within the application. A single high-level task (e.g., formatting the content of a document) may comprise hundreds of actions executed over a 10-15 minute interval, making the action sequence highly granular. Moreover, these sequences may include actions that are not directly related to the user's underlying intent, introducing noise; for example, users may briefly click on unrelated features while exploring the interface, intentionally or unintentionally.

Earlier studies that model time-stamped interaction logs to understand user behavior have relied on statistical techniques such as frequent itemset mining and sequential pattern mining Mannila et al. ([1997](https://arxiv.org/html/2606.14654#bib.bib25)); Cuke et al. ([2009](https://arxiv.org/html/2606.14654#bib.bib26)); Agrawal and Srikant ([1995](https://arxiv.org/html/2606.14654#bib.bib27)); Agrawal et al. ([1993](https://arxiv.org/html/2606.14654#bib.bib34)). These approaches have been noted to struggle with incorporating domain context and with explicitly modeling noise in user behavior Dev and Liu ([2017](https://arxiv.org/html/2606.14654#bib.bib62)). More recent work has explored adapting language modeling techniques to sequential log data; for example, using LSTMs to preemptively identify when users might need assistance within an application Nambhi et al. ([2019](https://arxiv.org/html/2606.14654#bib.bib61)), or training BERT- and LLM-based classifiers to detect anomalies in logs Guo et al. ([2021](https://arxiv.org/html/2606.14654#bib.bib60)); Zhou et al. ([2024](https://arxiv.org/html/2606.14654#bib.bib59)). While these methods demonstrate the promise of interpreting interaction logs using language models, they typically operate in settings that require task-specific fine-tuning on thousands of annotated training samples.

Motivated by the strong generalization capabilities of large language models (LLMs) across tasks and domains, this work investigates whether state-of-the-art LLMs can interpret real-world timestamped action sequences that do not follow the usual syntax or semantics of natural language and infer the high-level activities that users perform as part of their workflows. LLMs are also known to integrate presented instances with broader encoded knowledge Bai et al. ([2024](https://arxiv.org/html/2606.14654#bib.bib42)), which may further enrich observability of system and user states. To this end, we propose WorkflowView, a hierarchical abstraction of granular action sequences using LLMs. In WorkflowView, the initial layer generates natural language descriptions of the observed actions, while subsequent layers infer high-level activities and, optionally, categorize them into a set of discovered or predefined categories. To demonstrate the generality of the proposed approach, we evaluate WorkflowView across three domains that differ in action set cardinality and in the degree to which user behaviors are mutually exclusive. Figure [1](https://arxiv.org/html/2606.14654#S1.F1) shows an overview of the method.

Our results show that WorkflowView provides a reliable abstraction over action sequences across diverse tasks and domains. Specifically, we find that the method (a) generates task descriptions that closely align with ground-truth tasks performed in a browser (e.g., prediction: the user is trying to “find a car while sorting by lowest price”; ground truth: the user wants to “find the cheapest car”), (b) predicts student dropout in MOOCs with a weighted $F_1$ score of 0.90 while using only five in-context examples (a performance comparable to several state-of-the-art predictive models trained on thousands of labeled instances), and (c) contextualizes the use of AI tools in Microsoft Word (i.e., a document creation, collaboration, and consumption application) by interpreting action sequences, discovering task categories, and performing multi-class classification. We further show that such anonymous, privacy-preserving, and aggregated insights can inform user-centric product improvements.

Because WorkflowView relies on LLM-based inference over action sequences, we discuss practical considerations around deployment, including cost, latency, and user privacy, as well as the limitations of our approach. We also outline a broader vision in which LLM capabilities are embedded deeper into the logging infrastructure. This vision is especially relevant to human–AI collaboration, provided that strong guarantees around user privacy and security are maintained.

## 2 Related work

Below, we group related work into three themes: (a) modeling interaction logs, (b) discovering user intents from user utterances, and (c) using LLMs to model non-language data.

Modeling interaction logs: Prior work on interpreting timestamped UI logs has largely framed the problem as pattern mining or sequence modeling. Techniques such as frequent itemset mining and sequential pattern mining have been widely used to extract common action patterns from large log corpora (e.g., identifying frequently occurring operation groups) Mannila et al. ([1997](https://arxiv.org/html/2606.14654#bib.bib25)); Cuke et al. ([2009](https://arxiv.org/html/2606.14654#bib.bib26)); Agrawal and Srikant ([1995](https://arxiv.org/html/2606.14654#bib.bib27)); Agrawal et al. ([1993](https://arxiv.org/html/2606.14654#bib.bib34)). While effective at identifying recurring structures, these statistical approaches are largely domain-agnostic: they treat UI actions as abstract tokens without semantic grounding Dev and Liu ([2017](https://arxiv.org/html/2606.14654#bib.bib62)) and are sensitive to noise and spurious correlations in action sequences Yang et al. ([2002](https://arxiv.org/html/2606.14654#bib.bib41)). Subsequent work addressed some of these limitations through learning-based approaches, including RNN/LSTM- and Transformer-based models Hochreiter and Schmidhuber ([1997](https://arxiv.org/html/2606.14654#bib.bib48)); Vaswani et al. ([2017](https://arxiv.org/html/2606.14654#bib.bib46)), applied to domain- and task-specific applications Nambhi et al. ([2019](https://arxiv.org/html/2606.14654#bib.bib61)); Krishna et al. ([2018](https://arxiv.org/html/2606.14654#bib.bib40)); Zhu et al. ([2021](https://arxiv.org/html/2606.14654#bib.bib47)). However, these methods rely on task-specific training data and hand-crafted labels, making them costly to deploy across new domains, tasks, or evolving user behaviors.
In contrast, WorkflowView relies on LLM-based inference via prompting, enabling flexible adaptation across tasks and domains without fine-tuning or annotating data, while explicitly abstracting away low-level noise through hierarchical reasoning.

Intent discovery over user utterances: A related line of work focuses on inferring user intent from textual interactions, such as search queries Wang et al. ([2022](https://arxiv.org/html/2606.14654#bib.bib43)) or conversational utterances in dialogue systems Schuurmans and Frasincar ([2019](https://arxiv.org/html/2606.14654#bib.bib45)). Modern dialogue systems and virtual assistants typically include an intent classification module that maps user input to predefined task labels (e.g., booking a flight or checking the weather), often trained using supervised learning over large annotated corpora Serban et al. ([2015](https://arxiv.org/html/2606.14654#bib.bib44)). More recent work explores discovering new or evolving intents by clustering user queries that fall outside known categories Shah et al. ([2025](https://arxiv.org/html/2606.14654#bib.bib31)); Wan et al. ([2024](https://arxiv.org/html/2606.14654#bib.bib28)). A key distinction between this body of work and ours lies in the nature of the input: textual utterances are already semantic and human-interpretable, and often explicitly encode user goals (e.g., “find the cheapest car” or “schedule a meeting”). In contrast, our work operates on telemetry data consisting of low-level UI events, where intent must be inferred indirectly from noisy, granular action sequences. This setting is both more challenging and more ubiquitous in modern applications, motivating the need for methods that can bridge raw interaction logs and high-level intent.

LLMs for non-language sequential data: Beyond text, recent work has examined the ability of LLMs to reason over non-language data. Existing approaches include learning projection layers to map image or numeric sensor data into representations suitable for LLM-based inference Verma et al. ([2024](https://arxiv.org/html/2606.14654#bib.bib37)); Moon et al. ([2024](https://arxiv.org/html/2606.14654#bib.bib38)), adapting LLM embeddings for time-series classification Kaur et al. ([2025](https://arxiv.org/html/2606.14654#bib.bib35)), and reprogramming time series into textual prototype representations that align more naturally with LLM pretraining Jin et al. ([2024](https://arxiv.org/html/2606.14654#bib.bib36)). Liu et al. ([2024a](https://arxiv.org/html/2606.14654#bib.bib32)) demonstrate that off-the-shelf LLMs such as GPT-4 Achiam et al. ([2023](https://arxiv.org/html/2606.14654#bib.bib33)) can outperform pre-trained zero-shot baselines (and, in many cases, supervised models) on forecasting numeric sequences across domains including epidemiology, finance, and weather. These results suggest that LLMs can, to some extent, interpret structured sequences with limited linguistic content by leveraging patterns learned during large-scale pretraining. Building on this insight, WorkflowView extends zero-shot and few-shot LLM prompting to the domain of user interaction logs.

## 3 WorkflowView: Hierarchical abstraction of action sequences with LLMs

WorkflowView is a simple yet effective framework that leverages large language models (LLMs) to reason over action sequences. The method demonstrates that LLMs can be prompted to address a range of sequence modeling tasks across domains in zero-shot or few-shot settings, highlighting ease of customization without fine-tuning. To encourage stage-wise abstraction from low-level actions to high-level activities, WorkflowView adopts a hierarchical design. We distinguish three levels of behavioral granularity: individual actions (atomic UI events, such as a single click or keystroke); high-level activities (coherent, interpretable units of behavior abstracted from a span of actions, such as “reviewing comments”); and workflows (goal-directed processes that these activities compose, such as collaborating on a document). Specifically, action sequences are first converted into detailed natural language descriptions (Layer 1), after which the high-level activity captured by these descriptions is inferred (Layer 2). If required by the task, additional layers can be introduced to further categorize the inferred high-level activity into known or discovered classes, for example predicting student dropout in a MOOC or distinguishing between active document editing and text formatting. Figure [1](https://arxiv.org/html/2606.14654#S1.F1) provides an overview of the approach along with example outputs from the datasets used in this work.

The hierarchical LLM-based inference is motivated by two principles: modularity and progressive denoising. Modularity ensures that the outputs of lower layers (i.e., action sequence $\rightarrow$ natural language description $\rightarrow$ high-level task inference) can be reused across multiple objectives (such as frequent task discovery at the population level or categorization of individual sequences) by adapting only the higher layers. Progressive denoising is essential for modeling action sequences with LLMs, as it enables the transformation of raw timestamped actions into coherent textual representations that are better suited for higher-order reasoning. For instance, lower layers may capture temporal patterns in natural language, such as “the user responded to a collaborator's comment after no significant activity for $N$ minutes.” In this case, low-salience actions are denoised at earlier layers, and depending on the value of $N$ (e.g., 2 vs. 10 minutes), subsequent layers can characterize the level of deliberation involved in responding to the comment. See our discussion on the effectiveness of progressive denoising in Appendix [A.2](https://arxiv.org/html/2606.14654#A1.SS2).
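The layered design described above can be sketched as a chain of prompt calls. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the function names, and the prompt wording are all assumptions made for this sketch (the actual prompts are the ones given in the paper's appendix tables), and any chat-completion client could be wrapped to fit the `LLM` signature.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]
Action = Tuple[str, str]  # (timestamp, action name)


def describe_actions(llm: LLM, actions: List[Action]) -> str:
    """Layer 1: turn raw timestamped actions into a natural language description."""
    log = "\n".join(f"{ts}, {name}" for ts, name in actions)
    # Placeholder prompt text, not the prompt used in the paper.
    return llm(f"Describe in detail what the user did in this action log:\n{log}")


def infer_activity(llm: LLM, description: str) -> str:
    """Layer 2: infer the high-level activity from the Layer 1 description."""
    return llm(f"State succinctly the high-level activity described here:\n{description}")


def categorize(llm: LLM, activity: str, categories: List[str]) -> str:
    """Optional Layer 3: map the inferred activity onto one of the given categories."""
    options = ", ".join(categories)
    return llm(f"Choose exactly one category from [{options}] for this activity:\n{activity}")


def workflow_view(
    llm: LLM, actions: List[Action], categories: Optional[List[str]] = None
) -> Dict[str, str]:
    """Run the layers in order; each layer only sees the previous layer's output."""
    description = describe_actions(llm, actions)
    activity = infer_activity(llm, description)
    result = {"description": description, "activity": activity}
    if categories:  # the categorization layer is added only when the task needs it
        result["category"] = categorize(llm, activity, categories)
    return result
```

Because each layer consumes only the text produced by the layer below it, the Layer 1 and Layer 2 outputs can be cached and reused while only the final layer is swapped per objective, which is the modularity property the section describes.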

We provide the prompts used in WorkflowView in Appendix Tables [7](https://arxiv.org/html/2606.14654#A1.T7), [8](https://arxiv.org/html/2606.14654#A1.T8), [9](https://arxiv.org/html/2606.14654#A1.T9), and [10](https://arxiv.org/html/2606.14654#A1.T10) to support reproducibility and future work. In the following section, we evaluate WorkflowView on three tasks spanning three domains: inferring browser tasks, predicting student dropout in MOOCs, and contextualizing the use of an AI tool in Microsoft Word. Given the substantial variation in action spaces and behavioral patterns across these domains, our experiments are designed to evaluate WorkflowView's effectiveness and generalizability.

## 4 Applications & Evaluation of WorkflowView

### 4.1 Inferring tasks from browser interaction logs

Task and dataset: We evaluate the ability of WorkflowView to infer the tasks people perform in browsers using observed interaction logs alone. We use the Mind2Web dataset Deng et al. ([2023](https://arxiv.org/html/2606.14654#bib.bib30)), which contains an ordered sequence of web actions taken in a browser to complete 2,022 general-purpose web tasks described in natural language. The tasks span 137 websites and 5 different domains: service (e.g., gov.uk), shopping (e.g., instacart.com), entertainment (e.g., espn.com), travel (e.g., delta.com), and information (e.g., finance.yahoo.com). The action space for this dataset is characterized by the HTML UI elements that the user interacts with on a webpage (for instance, [button] ‘Go Back’, [textbox] ‘Enter your name’) and the operation they perform (such as CLICK, TYPE, or SCROLL). Our goal is to perform LLM abstractions over the action sequences, as exemplified in Table [2](https://arxiv.org/html/2606.14654#S4.T2) (action sequences), to predict the task the users are doing across different websites (task). Methodologically, for this task, we operationalize WorkflowView using the prompts shown in the Appendix, where Layer 1 (shown in Table [7](https://arxiv.org/html/2606.14654#A1.T7)) provides a detailed description of the action sequences in natural language and Layer 2 (shown in Table [8](https://arxiv.org/html/2606.14654#A1.T8)) infers the overall task the user is doing and generates its succinct description. Note that this evaluation is zero-shot. All our key results are based on experimentation with GPT-4o (more specifically gpt-4o-2024-05-13), a leading proprietary state-of-the-art large language model released by OpenAI (OpenAI, [2024](https://arxiv.org/html/2606.14654#bib.bib57)).
However, we also demonstrate that WorkflowView works effectively with smaller and open-weights models like Phi-4 Abdin et al. ([2024](https://arxiv.org/html/2606.14654#bib.bib29)) and gpt-oss-20b OpenAI ([2025](https://arxiv.org/html/2606.14654#bib.bib5)) in Appendix Table [6](https://arxiv.org/html/2606.14654#A1.T6).

Table 1: Embedding-based retrieval of ground-truth task descriptions using task descriptions generated using WorkflowView; the candidate set varies as ‘global’ or ‘website-specific’ across the two settings. Values are reported as $\mu (\pm \sigma)$.

Action Sequence: [svg] → CLICK, [link] Your lists → CLICK, [link] Create a list → CLICK, [svg] → CLICK, [span] Walgreens → CLICK, [textbox] Add a title (Required) → TYPE: Walgreens, [img] → CLICK, [button] Next → CLICK, [link] Personal Care → CLICK, [svg] → CLICK, [img] → CLICK, [span] Add to list → CLICK, [checkbox] Walgreens New → CLICK, [button] Done → CLICK, [path] → CLICK, [path] → CLICK, [path] → CLICK, [svg] → CLICK, [img] → CLICK, [span] Add to list → CLICK, [checkbox] Walgreens New → CLICK, [button] Done → CLICK, [path] → CLICK, [link] View More → CLICK, [img] → CLICK, [span] Add to list → CLICK, [checkbox] Walgreens New → CLICK, [button] Done → CLICK, [button] Back → CLICK, [path] → CLICK, [link] Shower Essentials → CLICK, [img] → CLICK, [span] Add to list → CLICK, [checkbox] Walgreens New → CLICK, [button] Done → CLICK, [button] Back → CLICK, [link] Lists → CLICK

Generated Task Description: Create a Walgreens shopping list and add personal care and shower essentials items.

Ground-truth Task Description: Create a new list and add four items from the personal care category at Walgreens.

Table 2: Qualitative example of a task description generated by WorkflowView from the action sequence, and the corresponding ground-truth task description.

Evaluation settings: As the first measure to compare the task descriptions generated from sequences of web interactions with their corresponding ground-truth descriptions, we compute the cosine similarity between the embeddings of the descriptions obtained from the text-embedding-ada-002 model OpenAI ([2022](https://arxiv.org/html/2606.14654#bib.bib58)).
Additionally, we compute retrieval metrics, namely Mean Reciprocal Rank (MRR) and Recall@$K$ ($K \in \{1, 3, 5, 10\}$), under two settings. In the first setting (i.e., ‘global’), for each generated task description we retrieve the most similar ground-truth task descriptions from the entire dataset. In the second setting (i.e., ‘website-specific’), the candidate set is restricted to the ground-truth descriptions from the same website as the generated description.
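The retrieval evaluation described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: embeddings are taken to be plain lists of floats already obtained from some embedding model, `generated[i]` is assumed to correspond to `ground_truth[i]`, and the function names are hypothetical. How the paper breaks ties or aggregates standard deviations is not specified in this section, so neither is modeled here.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_of_truth(query: Sequence[float], candidates: List[Sequence[float]], truth_index: int) -> int:
    """1-based rank of the true candidate when candidates are sorted by similarity."""
    sims = [cosine(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: sims[i], reverse=True)
    return order.index(truth_index) + 1


def retrieval_metrics(
    generated: List[Sequence[float]],
    ground_truth: List[Sequence[float]],
    ks: Sequence[int] = (1, 3, 5, 10),
) -> Dict[str, float]:
    """MRR and Recall@K, where generated[i] should retrieve ground_truth[i]."""
    ranks = [rank_of_truth(g, ground_truth, i) for i, g in enumerate(generated)]
    n = len(ranks)
    metrics = {"mrr": sum(1.0 / r for r in ranks) / n}
    for k in ks:
        # Recall@K: fraction of queries whose true description is in the top K.
        metrics[f"recall@{k}"] = sum(r <= k for r in ranks) / n
    return metrics
```

Passing the whole dataset as `ground_truth` corresponds to the ‘global’ setting; calling the same function once per website, on only that website's generated and ground-truth embeddings, would correspond to the ‘website-specific’ setting.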

Results: The average similarity (and standard deviation) between the generated and ground-truth tasks is $0.911$ ($\pm 0.042$); $N = 2{,}022$ tasks. These high absolute similarity scores indicate that the inferences made using WorkflowView are accurate. Table [1](https://arxiv.org/html/2606.14654#S4.T1) shows the average MRR and Recall@$K$ values (along with standard deviations). The near-perfect MRR and Recall@$K$ values also indicate that the true ground-truth descriptions are ranked at the top for a large majority of the corresponding generated task descriptions. In Table [2](https://arxiv.org/html/2606.14654#S4.T2), we qualitatively illustrate the close alignment between a generated task description and the corresponding ground-truth description; an expanded set of qualitative examples is provided in Appendix Table [11](https://arxiv.org/html/2606.14654#A1.T11). The key strength of our work lies in demonstrating zero-shot, cross-domain generality of WorkflowView; nonetheless, we compare against domain-specific fine-tuned seq2seq Sutskever et al. ([2014](https://arxiv.org/html/2606.14654#bib.bib9)) baselines for this task in Appendix [A.1.1](https://arxiv.org/html/2606.14654#A1.SS1.SSS1).

### 4.2 Predicting student dropouts based on MOOC interaction logs

![Refer to caption](https://arxiv.org/html/2606.14654v1/x2.png)

Figure 2: Dropout prediction performance using WorkflowView. The plots show the weighted $F_1$ score, precision, and recall in response to the variations in ‘start time’ (i.e., time when the action sequences are modeled) and ‘end time’ (i.e., time before the last action until which the action sequences are modeled). For comparison, we include the scores corresponding to baselines where only the ‘majority’ class would be predicted (i.e., all dropout) and predictions based on biased ‘random’ guesses as per prior class probabilities. The best $F_1$ score ($F_1 = 0.89$; Precision = 0.81; Recall = 0.97) corresponds to a start time of 6 days and an end time of 24 hours. The number of few-shot examples provided to WorkflowView was 3 for this analysis; Figure [3](https://arxiv.org/html/2606.14654#S4.F3) below shows the sensitivity to the number of few-shot examples.

![Refer to caption](https://arxiv.org/html/2606.14654v1/x3.png)

Figure 3: Variation in predictive performance of WorkflowView (weighted $F_1$, precision, and recall) on the MOOC dropout prediction task, in response to the number of few-shot examples considered ($N \in \{0, 1, 3, 5, 10, 20\}$).

Task and dataset: To assess the method's generalizability across diverse tasks and domains, we experiment with interaction logs of a MOOC platform to predict student dropouts. The test set of the dataset curated by [Feng et al.](https://arxiv.org/html/2606.14654#bib.bib56) ([2019](https://arxiv.org/html/2606.14654#bib.bib56)) comprises interaction logs from a total of 44,008 unique students enrolled in 247 unique courses, resulting in 67,699 unique (student, course) pairs. Of the 67,699 unique enrollments, 51,316 (75.8%) resulted in a dropout. Across all student interactions, 22 unique actions were logged; these constitute the action space.
The goal of this task is to use WorkflowView to process the time-stamped sequence of actions recorded at least $N$ hours before the last action, with $N \in \{1, 6, 12, 18, 24\}$, and determine whether the enrollment is going to result in a dropout. This design allows for a potential intervention, delivered when the student performs what is currently their last action, that may discourage them from dropping out.
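The windowing step implied by this task definition can be sketched as follows. The sketch rests on an assumption that is not confirmed by the text: both the ‘start time’ and the ‘end time’ are interpreted as offsets measured backward from the last logged action of the enrollment. The function name and the action names in the usage example are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Action = Tuple[datetime, str]  # (timestamp, action name)


def window_before_last_action(
    actions: List[Action], start_days: float, end_hours: float
) -> List[Action]:
    """Keep actions between `start_days` and `end_hours` before the last action.

    Assumption: both offsets are measured backward from the final logged action.
    """
    if not actions:
        return []
    last = max(ts for ts, _ in actions)
    window_start = last - timedelta(days=start_days)
    window_end = last - timedelta(hours=end_hours)
    # Return the retained actions in chronological order.
    return [(ts, name) for ts, name in sorted(actions) if window_start <= ts <= window_end]
```

For example, with `start_days=6` and `end_hours=24`, an action logged 25 hours before the final action is kept, while actions logged 23 hours or 7 days before it are dropped, so the model never sees the last 24 hours of activity.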

Adapting WorkflowView: For this task, to facilitate binary classification, we adapt WorkflowView to have a third categorization layer on top of the first two layers (natural language description and succinct summary). Effectively, if the final binary classification labels are accurate, it indicates that WorkflowView can interpret and extract meaningful task-specific signals from the raw action sequences. This particular task and the low-barrier adaptation of WorkflowView to address it emphasize the modularity of the underlying hierarchical abstractions. We also explore WorkflowView's compatibility with few-shot settings by modifying the prompts at each layer to include illustrative mappings. Specifically, for the natural language description layer (Layer 1), this was done by providing the mapping between action sequences and the final category; for the succinct summary layer (Layer 2), this was done by additionally including the natural language descriptions from the previous layer for the same examples; and similarly, for Layer 3, we additionally included the succinct summaries for the examples. The prompts used to adapt WorkflowView for this task are shown in Appendix Tables [9](https://arxiv.org/html/2606.14654#A1.T9) and [10](https://arxiv.org/html/2606.14654#A1.T10). We explored few-shot settings where the number of examples per category varied in $\{1, 3, 5, 10\}$, with that many examples randomly sampled from the training set per category (i.e., the total number of examples was twice as many).
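The per-category sampling and the Layer 1 few-shot prompt described above can be sketched as follows. The function names, the label strings, and the prompt wording are assumptions made for illustration; the prompts actually used are those in the paper's appendix tables, and the paper does not state how its random sampling was seeded.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (action sequence rendered as text, category label)


def sample_few_shot(train: List[Example], k: int, seed: int = 0) -> List[Example]:
    """Randomly sample k examples per category, giving k * (number of categories) shots."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Example]] = {}
    for example in train:
        by_label.setdefault(example[1], []).append(example)
    shots: List[Example] = []
    for label in sorted(by_label):
        # random.Random.sample raises ValueError if a category has fewer than k examples.
        shots.extend(rng.sample(by_label[label], k))
    return shots


def build_layer1_prompt(shots: List[Example], query_sequence: str) -> str:
    """Layer 1 prompt: show (sequence, final category) pairs, then the query sequence."""
    parts = [
        "Describe the following action sequence in natural language.",
        "Examples of action sequences and their final category:",
    ]
    for sequence, label in shots:
        parts.append(f"Sequence: {sequence}\nFinal category: {label}")
    parts.append(f"Sequence to describe: {query_sequence}")
    return "\n\n".join(parts)
```

With two categories, `k = 5` yields ten demonstrations in total, matching the "twice as many" remark above. Layers 2 and 3 would extend each demonstration with the previous layer's output for the same example.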

![Refer to caption](https://arxiv.org/html/2606.14654v1/x4.png)

Figure 4: Activities users do in the context of document editing, at most 30 minutes before (left, in blue) and after (right, in red) prompting the integrated AI tool and accepting its output. The action sequences are processed using WorkflowView to discover the high-level categories; corresponding definitions are provided in the adjoining table.

- Active editing of content: Modifying the content of a document, such as copying, pasting, deleting, and reorganizing text.
- Refining document layout/presentation: Refining and enhancing document layout and presentation, including inserting graphics.
- Formatting text and layout: Changing the appearance of text and its layout, including font adjustments, applying styles, and using tools like Format Painter.
- Add-ins usage, VBA projects, external application content: Interactions with external applications or add-ins within the context of document editing.
- Version management and backups: Creating different versions or backups of a document.
- Creating a new document and immediate edits: Starting a new document and immediately engaging in editing or content insertion.
- Content transfer within or between documents: Copying and pasting content within or between documents.
- Applying themes and styles to the document: Organizing and structuring document content, including applying themes and styles.
- Creating drafts and templates with some edits: Creating drafts or templates, making edits, and possibly using draft generation features.
- Reviewing comments (collaboration): Engaging with comments, indicating a review or collaboration phase.
- Real-time collaborative editing with multiple users: Engaging in real-time collaboration and editing with others.
- Final edits before closing/printing: Preparing the document for presentation or distribution, including final edits, formatting, and printing.
- Collaborative features in shared documents: Shared document activities, including co-authoring, managing comments, and using collaborative features.
- Using AI features (e.g., word suggestions): Using AI features to edit the document content.
- Content reconsideration: Experimenting with content by adding and then removing it, indicating reconsideration of content placement or inclusion.
- Document navigation: Navigating through the document, including moving the cursor or scrolling.

Table 3: Discovered document editing categories and their corresponding descriptions.

Evaluation setup: The evaluation is designed to measure how effectively and reliably WorkflowView can perform the binary classification task of predicting student dropouts from MOOC action sequences. Our evaluations are centered around two axes that could drive predictive variability: (a) the time horizon for the action sequences under consideration, and (b) the number of few-shot examples provided to the model. For the former, we vary the start time as well as the end time and observe the weighted $F_1$ score, precision, and recall for each combination. For the latter, we explore both zero-shot and few-shot settings, while varying the number of examples considered in the few-shot setting. In the few-shot setting, we randomly sample $K$ examples per category from the train set of the data; we acknowledge that sampling strategies that result in better performance and greater robustness are possible Wang et al. ([2020](https://arxiv.org/html/2606.14654#bib.bib54)); Nookala et al. ([2023](https://arxiv.org/html/2606.14654#bib.bib55)). To limit the amount of experimentation, we first evaluate the performance of WorkflowView on a 2-dimensional hyperparameter grid of start and end times while fixing the number of few-shot examples to 3, and then evaluate the sensitivity to the number of few-shot examples.

Results: Figure [2](https://arxiv.org/html/2606.14654#S4.F2) shows the predictive performance in response to variations in the start and end time hyperparameters\. For reference, we also include comparisons with two simple baselines: ‘majority’, where only the majority class is predicted \(i\.e\., all sequences are categorized as dropout\), and ‘random’, where the categorizations are sampled according to the class probabilities in the training data distribution\. The first key observation is that regardless of the start and end time hyperparameters, WorkflowView categorizations are consistently and notably better than either of the baselines\. In fact, the best performance across all the combinations is an $F_1$ score of $0.89$ \(Precision $= 0.81$ and Recall $= 0.97$\), corresponding to a start time of 6 days and an end time of 24 hours before the last activity\. It is worth noting that this performance is on par with several learning\-based methods that use hundreds of thousands of training examples \(see footnote 1\)\.

Footnote 1: [Fu et al\.](https://arxiv.org/html/2606.14654#bib.bib51) \([2021](https://arxiv.org/html/2606.14654#bib.bib51)\) train a long short\-term memory network to build a predictive model that achieves an $F_1$ score of 0\.869 on the binary classification task of MOOC dropout prediction\. Similarly, [Basnet et al\.](https://arxiv.org/html/2606.14654#bib.bib50) \([2022](https://arxiv.org/html/2606.14654#bib.bib50)\) propose training\-based approaches that rely on large\-scale annotated data and result in an $F_1$ score of 0\.84; [Feng et al\.](https://arxiv.org/html/2606.14654#bib.bib56) \([2019](https://arxiv.org/html/2606.14654#bib.bib56)\) report an $F_1$ score of 0\.91\. See App\. [A\.1\.2](https://arxiv.org/html/2606.14654#A1.SS1.SSS2) for baseline comparisons\.

Next, we fix the start time of the action sequence used for predictive modeling to 6 days and the end time to 24 hours, and vary the number of few\-shot examples supplied to the model\.
Figure [3](https://arxiv.org/html/2606.14654#S4.F3) shows that using 3 or 5 few\-shot examples per category improves the performance considerably over the zero\-shot setting \($F_1$ improves from $0.84$ to $0.89$ and $0.90$, respectively\)\. The minor drop in performance with only a single few\-shot example can be explained by the two operating modes of in\-context learning Lin and Lee \([2024](https://arxiv.org/html/2606.14654#bib.bib52)\); Min et al\. \([2022](https://arxiv.org/html/2606.14654#bib.bib53)\): with insufficient demonstrations, models tend to rely on retrieving familiar tasks from pretraining \(‘task retrieval’\) rather than adapting to the presented task \(‘task learning’\)\. However, on further increasing the number of in\-context learning examples from $5 \rightarrow 10 \rightarrow 20$, the performance drops, possibly because of increasing context length\. More broadly, applying WorkflowView to predict student dropouts based on MOOC interaction logs not only indicates the effectiveness of the method regardless of the hyperparameters related to sequence length and duration but also provides insights into how the method interacts with the broader literature around in\-context learning\.
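
For readers who want to reproduce the two reference baselines, the following is a minimal sketch under stated assumptions: the weighted $F_1$ is taken to be the usual support\-weighted mean of per\-class $F_1$ scores, the helper names are hypothetical, and the labels in the usage lines are illustrative toy data, not values from the MOOC dataset\.

```python
import random
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    if not y_true:
        return 0.0
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


def majority_baseline(y_train, n):
    """Predict the most frequent training label for every test item."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label] * n


def random_baseline(y_train, n, seed=0):
    """Sample predictions according to the training class frequencies."""
    rng = random.Random(seed)
    counts = Counter(y_train)
    labels = sorted(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=n)


# Toy labels for illustration only.
y_train = ["dropout", "dropout", "dropout", "retained"]
y_test = ["dropout", "dropout", "dropout", "retained"]
majority_score = weighted_f1(y_test, majority_baseline(y_train, len(y_test)))
random_score = weighted_f1(y_test, random_baseline(y_train, len(y_test)))
```

Because the majority baseline never predicts the minority class, that class contributes an $F_1$ of zero to the weighted average, which is why the weighted score penalizes it even when the classes are imbalanced\.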

### 4\.3 Analyzing before/after activities around key actions: A case\-study of how AI tools help in document workflows

As AI assistance tools are transforming how users engage with documents across different digital applications, we conduct a case\-study on analyzing the action sequences before and after users accept the response provided by an AI tool \(Copilot in Word; as captured by a specific action in the application telemetry\)\. The AI tool considered in this study is embedded as a product feature within Microsoft Word, a document editing application with hundreds of millions of active users\. The application allows the users to prompt the AI tool at any stage of their workflow while working with a document and accept or discard its presented output\. The case\-study demonstrates how WorkflowView could provide interpretable and actionable user\-centric insights that could improve interactions and product design\.

We used WorkflowView to analyze the anonymous and privacy\-preserving telemetry of users of Microsoft Word\. We sampled 50,000 users who had interacted with the AI tool at least once in the month of June 2025 and were located in the United States with their application language set to us\-en\. Users consented to log collection as part of the user agreement\. Note that the interaction logs are devoid of textual data and writer data, and the telemetry that captures users’ interactions with the application UI is highly granular and includes approximately 2000 unique actions\. We only present aggregated, percentage\-based insights over the random sample of the users\.

Two key modifications exist beyond applying Layers 1 \(i\.e\., natural language descriptions\) and 2 \(succinct activity summary\) of WorkflowView for this task\. Since there is no list of prior activities that the action sequences need to be mapped to, these activities have to be ‘discovered’ from the data\. Once these activities have been discovered, there is a need for a categorization layer that maps the succinct summaries to one of the discovered categories\. For the category discovery step, we use an existing method \(TnT\-LLM; Wan et al\. \([2024](https://arxiv.org/html/2606.14654#bib.bib28)\)\) that does end\-to\-end label generation based on the raw succinct activity summaries \(i\.e\., Layer 2’s output\)\. The identified labels are then used in Layer 3 of WorkflowView for multi\-class classification, akin to the binary classification task for MOOC dropout prediction\. We included the description of the high\-level categories that were obtained using TnT\-LLM in the previous step to inform the multi\-class classification\. This case\-study also illustrates another example of easy adaptation of WorkflowView to applications that may involve identifying categories of activities or modeling evolving user behavior, where existing categories may become outdated over time\.

Dataset: Using WorkflowView, we processed the sequences before and after \(at most 30 minutes in duration\) each occurrence of the action that indicates keeping the AI tool’s output\. Finally, for discovering the categories of activities using TnT\-LLM, we used $20\%$ of the occurrences and then inferred the categories \(using Layer 3\) on the entire action sequence set\.
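
The windowing step described above can be sketched as follows\. This is an illustrative reading, not the production pipeline: it assumes a time\-sorted list of `(timestamp, action_name)` tuples, applies the 30\-minute cap separately on each side of the key action, and uses hypothetical action names such as `KeepAIOutput`\.

```python
from datetime import datetime, timedelta


def windows_around(actions, anchor_name, max_span=timedelta(minutes=30)):
    """Return one (before, after) pair per occurrence of `anchor_name`.

    `actions` must be sorted by timestamp. Each side only keeps actions that
    fall within `max_span` of the anchor occurrence.
    """
    windows = []
    for i, (ts, name) in enumerate(actions):
        if name != anchor_name:
            continue
        before = [a for a in actions[:i] if ts - a[0] <= max_span]
        after = [a for a in actions[i + 1:] if a[0] - ts <= max_span]
        windows.append((before, after))
    return windows


# Hypothetical telemetry for a single session.
t0 = datetime(2025, 6, 1, 9, 0)
session = [
    (t0 - timedelta(minutes=45), "OpenDocument"),  # outside the 30-minute cap
    (t0 - timedelta(minutes=10), "TypeText"),
    (t0, "KeepAIOutput"),                          # the key action
    (t0 + timedelta(minutes=5), "ApplyStyle"),
    (t0 + timedelta(minutes=40), "SaveDocument"),  # outside the 30-minute cap
]
pairs = windows_around(session, "KeepAIOutput")
```

Each `before` and `after` list would then be passed through Layers 1 and 2 independently, so that the resulting summaries can be compared across the two sides of the key action\.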

Analysis and Insights: Figure [4](https://arxiv.org/html/2606.14654#S4.F4) shows that active content editing \(described in Table 3 as modifying the content of a document, such as copying, pasting, deleting, and reorganizing text\) is the most frequent activity both before \(15%\) and after \(15%\) AI assistance, indicating its prominence in document\-related workflows where AI tools are used\. Active editing of content tends to continue as such after accepting the AI tool’s outputs or, in certain cases, the user tends to transition to other activities like formatting text and its layout or transferring content within or across documents\. It is worth noting that the share of activities pertaining to formatting or refining layouts is greater *after* the AI tool’s responses are accepted when compared to their share before, which may indicate that users try to incorporate the AI tool’s output in a manner that is consistent with the original content’s formatting\. The insights obtained with WorkflowView enable interpreting user engagement patterns from noisy and granular action sequences\. Additionally, these insights offer actionable guidance for product improvements such as introducing more context\-aware formatting suggestions or adaptive layout tools that align with post\-AI interaction behaviors\.
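
The aggregated, percentage\-based comparison above amounts to computing category shares on each side of the key action and taking their difference\. A minimal sketch follows, assuming each window has already been assigned one category label by Layer 3; the function names and the toy labels are hypothetical and do not reproduce the reported numbers\.

```python
from collections import Counter


def category_shares(labels):
    """Percentage share of each category among the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: 100.0 * n / total for cat, n in counts.items()}


def share_shift(before_labels, after_labels):
    """Percentage-point change (after minus before) for every category."""
    before = category_shares(before_labels)
    after = category_shares(after_labels)
    cats = sorted(set(before) | set(after))
    return {cat: after.get(cat, 0.0) - before.get(cat, 0.0) for cat in cats}


# Toy labels for illustration only.
before = ["editing", "editing", "formatting", "navigation"]
after = ["editing", "formatting", "formatting", "navigation"]
shift = share_shift(before, after)
```

A positive entry in `shift` marks a category that is more common after the AI output is kept, which is the kind of signal the formatting observation above relies on\.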

It is also worth noting that in an evolving landscape where AI tools are changing how users interact with applications, it is a strength that the activities \(and the corresponding descriptions\) are *synthesized* from the activity summaries inferred by WorkflowView from the action sequence data, rather than being predefined labels\. This ensures that the taxonomy reflects actual user behavior as it evolves over time\. We discuss practical considerations around efficiency in the context of real\-world deployment in the following section\.

## 5 Discussion: Deployment and Extensions

WorkflowView is an LLM\-powered approach to do hierarchical abstractions over action sequences to understand users’ behavior within digital applications\. We demonstrate that the method can be easily adapted to work with a diverse set of tasks involving action sequences \(task description generation, binary classification, category discovery and multi\-class classification\) across different domains \(browser, MOOC, and document editing application\)\. Quantitative and qualitative analyses demonstrate that WorkflowView is on par with training\-based models for these tasks in zero\-shot or few\-shot settings\. The results indicate the promise of embedding LLM\-powered inferences in the lowest level of data infrastructure to drive user\-centric product improvements\. Below, we discuss some of the future extensions and applications of WorkflowView\.

Deployment cost and latency: Relying on LLM\-based inference requires careful consideration of deployment cost and latency, particularly for applications with large user bases where interaction logs may span terabytes of data\. Two trends are worth noting\. First, the cost and latency of LLM inference have decreased rapidly in recent years Agarwal et al\. \([2023](https://arxiv.org/html/2606.14654#bib.bib49)\); Cottier et al\. \([2025](https://arxiv.org/html/2606.14654#bib.bib39)\)\. In parallel, there has been growing support for deploying smaller language models, which our evaluations suggest can perform on par with larger models for activity inference tasks\. For example, Phi\-4 \(14B parameters; Abdin et al\. \([2024](https://arxiv.org/html/2606.14654#bib.bib29)\)\) achieves performance comparable to GPT\-4o on the web browser inference task while requiring substantially fewer hardware resources \(Appendix Table [6](https://arxiv.org/html/2606.14654#A1.T6)\)\. Together, these trends make it increasingly feasible to apply LLM\-based methods such as WorkflowView to large\-scale interaction logs\. In the near term, WorkflowView can be deployed in offline settings to help developers understand how users interact with their applications and to identify opportunities for product improvement\. Periodic offline analyses can also surface shifts in user behavior over time, which is particularly relevant in dynamic human–AI collaborative workflows\. Such deployments can further control cost by operating on representative samples of users, while avoiding the latency constraints associated with real\-time inference\.

![Refer to caption](https://arxiv.org/html/2606.14654v1/x5.png)

Figure 5: Illustrative multimodal extension of WorkflowView for task inference using browser snapshots and UI click logs\. In this example, task inference is performed unobtrusively as the user interacts with the application\. The visual modality complements the textual action sequence by grounding descriptions in on\-screen context \(e\.g\., visible content and time spent\), leading to more accurate and informative inferences than action sequences alone\.

Multimodal extensions of WorkflowView: In this work, WorkflowView operates on action sequences captured in text\. However, given the multimodal capabilities of modern LLMs, the framework can be naturally extended to incorporate UI screenshots captured at key transition points during user interaction\. Visual context can provide complementary signals about user behavior, reduce reliance on application\-specific instrumentation, and ground textual descriptions in the actual interface state\. Figure [5](https://arxiv.org/html/2606.14654#S5.F5) qualitatively illustrates such a multimodal setting, where both high\-level task inference and behavioral descriptions are captured accurately\.

Multimodal extensions of WorkflowView open the door to real\-time, proactive AI assistance that supports users in achieving their high\-level goals\. For example, a user browsing an e\-commerce website could be offered structured recommendations—such as items to compare, stratified by viewed and unviewed—based on their inferred goal \(e\.g\., evaluating alternative keyboards\) and current progress toward that goal\. More broadly, accurate real\-time modeling of user behavior, encompassing both intent and task progress, is likely to be foundational for human–AI collaboration, enabling seamless hand\-offs between users and AI systems\.

## 6 Conclusion

Building on the generalization capabilities of LLMs, including their demonstrated effectiveness on non\-language data, we introduce WorkflowView, a framework for hierarchical abstraction over action sequences to infer users’ high\-level activities within digital applications\. We show that WorkflowView can be applied reliably across three diverse domains and across multiple tasks\. Through a case study on real\-world telemetry from Microsoft Word, we illustrate how WorkflowView enables anonymous, privacy\-preserving, and aggregated analysis of user behavior that can inform user\-centric product improvements\. Finally, we outline multimodal extensions of WorkflowView that can support more effective human–AI collaboration and discuss key deployment considerations\.

## 7 Limitations

Privacy and security considerations: It is critical to consider the privacy and security implications associated with the design, deployment, and future extensions of WorkflowView\. First, UI action sequences should only be collected with informed user consent, and inferences should be limited to behavioral understanding that does not reveal PII or sensitive content; for example, inferring that a user is “actively applying formatting changes to text” rather than “actively formatting text in a legal contract\.” For real\-time assistance, particularly in multimodal settings where private data could appear in UI screenshots, operationalization could involve strictly performing *on\-device* inferences, while logging only privacy\- and security\-compliant textual abstractions server\-side for offline analysis that informs product improvements\. Concrete privacy budgets or differential privacy \(DP\)\-style guarantees could be explored in future work\. Transparent and informed user consent is essential to ensure trust in AI\-powered technologies\.

Limitations: First, the action names that make up action sequences must convey meaningful information about user interactions \(e\.g\., ‘ClickLayoutRibbon’ rather than ‘Action1’\)\. This limitation also highlights the importance of developing an informative logging infrastructure to fully leverage LLM capabilities\. Second, future work could explore more token\-efficient prompting mechanisms to represent raw action sequences, as our current approach uses a direct textual representation of time\-stamped actions\. Simple strategies, such as chunking actions based on temporal proximity, could substantially reduce token counts\. Finally, while this work evaluates off\-the\-shelf LLMs under zero\-shot and few\-shot settings with a focus on cross\-task and cross\-domain generalizability, future research could investigate large\-scale pre\-training on action sequences from diverse domains to further improve generalization across both in\-sample and out\-of\-sample tasks\.
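
As an illustration of the temporal\-proximity chunking mentioned above, the sketch below groups time\-stamped actions whenever consecutive actions are close in time, then renders each group with a single timestamp\. It is one possible strategy and was not evaluated in this work: the 30\-second gap threshold, the collapsing of repeated actions, and the action names other than ‘ClickLayoutRibbon’ are assumptions chosen for the example\.

```python
from datetime import datetime, timedelta


def chunk_by_gap(actions, max_gap=timedelta(seconds=30)):
    """Split time-sorted (timestamp, action_name) pairs into chunks.

    A new chunk starts whenever the gap to the previous action exceeds max_gap.
    """
    chunks = []
    for ts, name in actions:
        if chunks and ts - chunks[-1][-1][0] <= max_gap:
            chunks[-1].append((ts, name))
        else:
            chunks.append([(ts, name)])
    return chunks


def render_chunks(chunks):
    """Render one line per chunk: the start time once, then the action names.

    Consecutive repeats of the same action are collapsed as "Name xN"
    (an extra, lossy assumption made to shorten the prompt further).
    """
    lines = []
    for chunk in chunks:
        runs = []
        for _, name in chunk:
            if runs and runs[-1][0] == name:
                runs[-1][1] += 1
            else:
                runs.append([name, 1])
        body = ", ".join(n if c == 1 else f"{n} x{c}" for n, c in runs)
        lines.append(f"{chunk[0][0].isoformat()} {body}")
    return "\n".join(lines)


# Hypothetical telemetry: three quick actions, then one after a long pause.
t0 = datetime(2025, 6, 1, 9, 0, 0)
log = [
    (t0, "ClickLayoutRibbon"),
    (t0 + timedelta(seconds=5), "Scroll"),
    (t0 + timedelta(seconds=8), "Scroll"),
    (t0 + timedelta(minutes=5), "TypeText"),
]
prompt_text = render_chunks(chunk_by_gap(log))
```

Here four time\-stamped lines become two, at the cost of dropping per\-action timestamps inside a chunk; whether that loss affects downstream inference quality would need to be measured\.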

Data: Two of the datasets were curated by prior work and are publicly available Deng et al\. \([2023](https://arxiv.org/html/2606.14654#bib.bib30)\); Feng et al\. \([2019](https://arxiv.org/html/2606.14654#bib.bib56)\); we comply with their terms of use\. Users of Microsoft Word consented to interaction log collection as part of the user agreement\.

## Disclaimer

Some of the information in this document relates to pre\-released content which may be subsequently modified\. Microsoft makes no warranties, express or implied, with respect to the information provided here\. This document is provided “as\-is”\. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice\. Some examples depicted herein are provided for illustration only and are fictitious\. No real association or connection is intended or should be inferred\. This document does not provide you with any legal rights to any intellectual property in any Microsoft product\.

## References

- M\. Abdin, J\. Aneja, H\. Behl, S\. Bubeck, R\. Eldan, S\. Gunasekar, M\. Harrison, R\. J\. Hewett, M\. Javaheripi, P\. Kauffmann, et al\. \(2024\) Phi\-4 technical report\. arXiv preprint arXiv:2412\.08905\.
- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat, et al\. \(2023\) GPT\-4 technical report\. arXiv preprint arXiv:2303\.08774\.
- M\. Agarwal, A\. Qureshi, N\. Sardana, L\. Li, J\. Quevedo, and D\. Khudia \(2023\) LLM inference performance engineering: best practices\. Accessed: 2026\-01\-05\. [Link](https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices)
- R\. Agrawal, T\. Imieliński, and A\. Swami \(1993\) Mining association rules between sets of items in large databases\. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pp\. 207–216\.
- R\. Agrawal and R\. Srikant \(1995\) Mining sequential patterns\. In Proceedings of the eleventh international conference on data engineering, pp\. 3–14\.
- Y\. Bai, S\. Feng, V\. Balachandran, Z\. Tan, S\. Lou, T\. He, and Y\. Tsvetkov \(2024\) KGQuiz: evaluating the generalization of encoded knowledge in large language models\. In Proceedings of the ACM Web Conference 2024, pp\. 2226–2237\.
- R\. B\. Basnet, C\. Johnson, and T\. Doleck \(2022\) Dropout prediction in MOOCs using deep learning and machine learning\. Education and Information Technologies 27\(8\), pp\. 11499–11513\.
- D\. Britz, A\. Goldie, M\. Luong, and Q\. Le \(2017\) Massive exploration of neural machine translation architectures\. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing\.
- S\. K\. Card \(2018\) The psychology of human\-computer interaction\. CRC Press\.
- B\. Cottier, B\. Snodin, D\. Owen, and T\. Adamczewski \(2025\) LLM inference prices have fallen rapidly but unequally across tasks\. Accessed: 2026\-01\-05\. [Link](https://epoch.ai/data-insights/llm-inference-price-trends)
- B\. Cuke, B\. Goethals, and C\. Robardet \(2009\) A new constraint for mining sets in sequences\. In Proceedings of the 2009 SIAM international conference on data mining, pp\. 317–328\.
- X\. Deng, Y\. Gu, B\. Zheng, S\. Chen, S\. Stevens, B\. Wang, H\. Sun, and Y\. Su \(2023\) Mind2Web: towards a generalist agent for the web\. Advances in Neural Information Processing Systems 36, pp\. 28091–28114\.
- H\. Dev and Z\. Liu \(2017\) Identifying frequent user tasks from application logs\. In Proceedings of the 22nd international conference on intelligent user interfaces, pp\. 263–273\.
- W\. Feng, J\. Tang, T\. X\. Liu, S\. Zhang, and J\. Guan \(2019\) Understanding dropouts in MOOCs\. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence\.
- S\. Fine, Y\. Singer, and N\. Tishby \(1998\) The hierarchical hidden Markov model: analysis and applications\. Machine learning 32\(1\), pp\. 41–62\.
- Q\. Fu, Z\. Gao, J\. Zhou, and Y\. Zheng \(2021\) CLSA: a novel deep learning model for MOOC dropout prediction\. Computers & Electrical Engineering 94, pp\. 107315\.
- H\. Guo, S\. Yuan, and X\. Wu \(2021\) LogBERT: log anomaly detection via BERT\. In 2021 international joint conference on neural networks \(IJCNN\), pp\. 1–8\.
- S\. Hochreiter and J\. Schmidhuber \(1997\) Long short\-term memory\. Neural computation 9\(8\), pp\. 1735–1780\.
- Y\. Ji, Y\. Liu, F\. Yao, M\. He, S\. Tao, X\. Zhao, C\. Su, X\. Yang, W\. Meng, Y\. Xie, et al\. \(2025\) Adapting large language models to log analysis with interpretable domain knowledge\. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp\. 1135–1144\.
- M\. Jin, S\. Wang, L\. Ma, Z\. Chu, J\. Y\. Zhang, X\. Shi, P\. Chen, Y\. Liang, Y\. Li, S\. Pan, and Q\. Wen \(2024\) Time\-LLM: time series forecasting by reprogramming large language models\. In International Conference on Learning Representations \(ICLR\)\.
- R\. Kaur, Z\. Zeng, T\. Balch, and M\. Veloso \(2025\) LETS\-C: leveraging text embedding for time series classification\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 32365–32399\.
- K\. Krishna, D\. Jain, S\. V\. Mehta, and S\. Choudhary \(2018\) An LSTM based system for prediction of human activities with durations\. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1\(4\), pp\. 1–31\.
- V\. Le and H\. Zhang \(2023\) Log parsing with prompt\-based few\-shot learning\. In 2023 IEEE/ACM 45th International Conference on Software Engineering \(ICSE\), pp\. 2438–2449\.
- Z\. Lin and K\. Lee \(2024\) Dual operating modes of in\-context learning\. In Forty\-first International Conference on Machine Learning\.
- H\. Liu, Z\. Zhao, J\. Wang, H\. Kamarthi, and B\. A\. Prakash \(2024a\) LSTPrompt: large language models as zero\-shot time series forecasters by long\-short\-term prompting\. In Findings of the Association for Computational Linguistics ACL 2024, pp\. 7832–7840\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024b\) Lost in the middle: how language models use long contexts\. Transactions of the Association for Computational Linguistics 12, pp\. 157–173\.
- H\. Mannila, H\. Toivonen, and A\. Inkeri Verkamo \(1997\) Discovery of frequent episodes in event sequences\. Data mining and knowledge discovery 1\(3\), pp\. 259–289\.
- T\. Mikolov, K\. Chen, G\. Corrado, and J\. Dean \(2013\) Efficient estimation of word representations in vector space\. arXiv preprint arXiv:1301\.3781\.
- S\. Min, X\. Lyu, A\. Holtzman, M\. Artetxe, M\. Lewis, H\. Hajishirzi, and L\. Zettlemoyer \(2022\) Rethinking the role of demonstrations: what makes in\-context learning work?\. arXiv preprint arXiv:2202\.12837\.
- S\. Moon, A\. Madotto, Z\. Lin, T\. Nagarajan, M\. Smith, S\. Jain, C\. Yeh, P\. Murugesan, P\. Heidari, Y\. Liu, et al\. \(2024\) AnyMAL: an efficient and scalable any\-modality augmented language model\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pp\. 1314–1332\.
- A\. M\. Nambhi, B\. P\. Reddy, A\. P\. Agarwal, G\. Verma, H\. Singh, and I\. A\. Burhanuddin \(2019\) Stuck? no worries\! task\-aware command recommendation and proactive help for analysts\. In Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization, pp\. 271–275\.
- V\. P\. S\. Nookala, G\. Verma, S\. Mukherjee, and S\. Kumar \(2023\) Adversarial robustness of prompt\-based few\-shot learning for natural language understanding\. In Findings of the Association for Computational Linguistics: ACL 2023, pp\. 2196–2208\.
- OpenAI \(2022\) New and improved embedding model\. Accessed: 2026\-01\-05\. [Link](https://openai.com/index/new-and-improved-embedding-model/)
- OpenAI \(2024\) Hello GPT\-4o\. Accessed: 2026\-01\-05\. [Link](https://openai.com/index/hello-gpt-4o/)
- OpenAI \(2025\) gpt\-oss\-120b & gpt\-oss\-20b model card\. arXiv:2508\.10925\. [Link](https://arxiv.org/abs/2508.10925)
- S\. Rothe, S\. Narayan, and A\. Severyn \(2020\) Leveraging pre\-trained checkpoints for sequence generation tasks\. Transactions of the Association for Computational Linguistics 8, pp\. 264–280\.
- J\. Schuurmans and F\. Frasincar \(2019\) Intent classification for dialogue utterances\. IEEE Intelligent Systems 35\(1\), pp\. 82–88\.
- I\. V\. Serban, R\. Lowe, P\. Henderson, L\. Charlin, and J\. Pineau \(2015\) A survey of available corpora for building data\-driven dialogue systems\. arXiv preprint arXiv:1512\.05742\.
- C\. Shah, R\. White, R\. Andersen, G\. Buscher, S\. Counts, S\. Das, A\. Montazer, S\. Manivannan, J\. Neville, N\. Rangan, et al\. \(2025\) Using large language models to generate, validate, and apply user intent taxonomies\. ACM Transactions on the Web 19\(3\), pp\. 1–29\.
- I\. Sutskever, O\. Vinyals, and Q\. V\. Le \(2014\) Sequence to sequence learning with neural networks\. Advances in neural information processing systems 27\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. Advances in neural information processing systems 30\.
- Vellum \(2025\) Open LLM leaderboard\. [https://www\.vellum\.ai/open\-llm\-leaderboard](https://www.vellum.ai/open-llm-leaderboard)\. Accessed: 2026\-01\-05\.
- G\. Verma, M\. Choi, K\. Sharma, J\. Watson\-Daniels, S\. Oh, and S\. Kumar \(2024\) Cross\-modal projection in multimodal LLMs doesn’t really project visual attributes to textual space\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\), pp\. 657–664\.
- M\. Wan, T\. Safavi, S\. K\. Jauhar, Y\. Kim, S\. Counts, J\. Neville, S\. Suri, C\. Shah, R\. W\. White, L\. Yang, et al\. \(2024\) TnT\-LLM: text mining at scale with large language models\. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pp\. 5836–5847\.
- Y\. Wang, S\. Wang, Y\. Li, and D\. Dou \(2022\) Recognizing medical search query intent by few\-shot learning\. In Proceedings of the 45th International ACM SIGIR Conference on research and development in information Retrieval, pp\. 502–512\.
- Y\. Wang, Q\. Yao, J\. T\. Kwok, and L\. M\. Ni \(2020\) Generalizing from a few examples: a survey on few\-shot learning\. ACM computing surveys \(CSUR\) 53\(3\), pp\. 1–34\.
- J\. Yang, W\. Wang, P\. S\. Yu, and J\. Han \(2002\) Mining long sequential patterns in a noisy environment\. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pp\. 406–417\.
- S\. Yang, Y\. Xiao, and F\. Meng \(2024\) Deep learning\-based method for predicting student dropouts in MOOCs\. In 2024 7th International Conference on Machine Learning and Natural Language Processing \(MLNLP\), pp\. 1–6\.
- D\. Zhou, N\. Schärli, L\. Hou, J\. Wei, N\. Scales, X\. Wang, D\. Schuurmans, C\. Cui, O\. Bousquet, Q\. V\. Le, et al\. \(2023\) Least\-to\-most prompting enables complex reasoning in large language models\. In The Eleventh International Conference on Learning Representations\.
- Y\. Zhou, Y\. Chen, X\. Rao, Y\. Zhou, Y\. Li, and C\. Hu \(2024\) Leveraging large language models and BERT for log parsing and anomaly detection\. Mathematics\.
- Y\. Zhu, W\. Meng, Y\. Liu, S\. Zhang, T\. Han, S\. Tao, and D\. Pei \(2021\) UniLog: deploy one model and specialize it for all log analysis tasks\. arXiv preprint arXiv:2112\.03159\.

## Appendix A

### A\.1 Comparison against fine\-tuned baselines

While the core value proposition of WorkflowView is its zero\-shot, cross\-domain applicability, for completeness, we benchmark the performance against domain\-specific fine\-tuned baselines\. For the browser task inference dataset, we compare against LSTM and BERT\-based variants of sequence\-to\-sequence models Sutskever et al\. \([2014](https://arxiv.org/html/2606.14654#bib.bib9)\)\. For the MOOC dropout prediction, we compare against several approaches proposed in prior work that use different feature sets to perform the binary classification task, while keeping the evaluation settings consistent\.

#### A\.1\.1 Baselines for browser task inference

Since generating the task description from browser action sequences is a sequence\-to\-sequence task \(akin to neural machine translation\), we use existing seq2seq models and their implementations\. Using the train set of the Mind2Web dataset, the models learn to map action sequences to the task description \(word by word\), with actions demarcated using the \[ACTION\] token\. We use word2vec Mikolov et al\. \([2013](https://arxiv.org/html/2606.14654#bib.bib10)\) embeddings, randomly initializing out\-of\-vocabulary words and keeping the embeddings trainable \(as some of the interaction log vocabulary is not aligned with conventional language vocabulary\)\. To avoid extensive hyperparameter tuning, we closely follow the training recipe and best practices described by [Britz et al\.](https://arxiv.org/html/2606.14654#bib.bib11) \([2017](https://arxiv.org/html/2606.14654#bib.bib11)\) for LSTM\-seq2seq and [Rothe et al\.](https://arxiv.org/html/2606.14654#bib.bib13) \([2020](https://arxiv.org/html/2606.14654#bib.bib13)\) for BERT\-seq2seq; code is available at [https://github\.com/google/seq2seq](https://github.com/google/seq2seq) and [https://github\.com/google\-research/google\-research/tree/master/bertseq2seq](https://github.com/google-research/google-research/tree/master/bertseq2seq), respectively\. Our evaluations \(shown in Table [4](https://arxiv.org/html/2606.14654#A1.T4)\) are centered around the same metrics for the ‘Global’ setting described in Section [4\.1](https://arxiv.org/html/2606.14654#S4.SS1)\.
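To illustrate the input format described above, here is a minimal sketch that flattens a list of browser actions into one source string, marking each action with the `[ACTION]` token. The helper name `serialize_actions`, the example action strings, and the choice to prefix (rather than separate) actions with the token are assumptions for illustration; the exact preprocessing used for the baselines is not specified in this section.

```python
ACTION_TOKEN = "[ACTION]"


def serialize_actions(actions):
    """Join low-level actions into one source string for a seq2seq model.

    Each non-empty action is prefixed with ACTION_TOKEN so the model can
    see where one action ends and the next begins (an assumed convention).
    """
    cleaned = [a.strip() for a in actions if a.strip()]
    return " ".join(f"{ACTION_TOKEN} {a}" for a in cleaned)


# Hypothetical action strings, not taken from Mind2Web.
example = serialize_actions([
    "click search box",
    "type 'flights to Tokyo'",
    "  ",  # blank entries are dropped
    "click button Search",
])
```

The resulting string would serve as the source side of a training pair, with the ground-truth task description as the target.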

Table 4: Comparing our training\-free, zero\-shot approach \(WorkflowView\) with domain\-specific fine\-tuned baselines for task inference using browser action sequences\.

#### A\.1\.2 Baselines for MOOC dropout prediction

We compare against the Context\-aware Feature Interaction Network \(CFIN\) introduced by Feng et al\. \([2019](https://arxiv.org/html/2606.14654#bib.bib56)\)\. CFIN models each enrollment using two feature groups: \(i\) learning\-activity features $X(u,c)$ extracted from historical logs \(primarily statistics over student actions\), and \(ii\) context features $Z(u,c)$ capturing user and course attributes \(e\.g\., demographics and course category\)\. CFIN combines context\-aware smoothing and feature\-interaction modeling, and uses a *3\-layer* deep neural network \(DNN\) classifier as its prediction head\. As their strongest variant, [Feng et al\.](https://arxiv.org/html/2606.14654#bib.bib56) propose an ensemble strategy \(“CFIN\-en”\) analogous to stacking: they take the representation from the penultimate DNN layer and train an XGBoost classifier jointly on this representation and the original features $(X,Z)$\. We use the authors’ public implementation \([https://github\.com/wzfhaha/dropout\_prediction](https://github.com/wzfhaha/dropout_prediction)\) and re\-train CFIN and the XGBoost\-stacked ensemble using features constructed under our 6\-day input/24\-hour evaluation window, while otherwise following the paper/code defaults\.

We additionally compare against recent deep learning baselines that operate on week\-level temporal feature vectors: following Yang et al\. \([2024](https://arxiv.org/html/2606.14654#bib.bib3)\), we include CNN\-LSTM and its bi\-attention variant \(CNN\-LSTM Bi\-Att\)\. We keep the architecture and hyperparameters consistent with the original work \(e\.g\., dropout and early stopping\), and reconstruct the inputs to match our 6\-day input/24\-hour evaluation window\.

Table [5](https://arxiv.org/html/2606.14654#A1.T5) shows the comparison; WorkflowView achieves its best performance with 5 in\-context examples per class \(10 total\), reaching a weighted $\mathcal{F}_1$ of 0\.90\. Notably, performance is competitive even at 0\-shot and remains stable across a range of few\-shot budgets \(see Fig\. [2](https://arxiv.org/html/2606.14654#S4.F2) & [3](https://arxiv.org/html/2606.14654#S4.F3)\)\.
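For readers less familiar with the metric, the sketch below computes a weighted $\mathcal{F}_1$ under the assumption that "weighted" means the standard support-weighted average of per-class F1 scores. The function name `weighted_f1` and the toy labels are illustrative and are not data from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each class contributes in proportion to its share of the true labels.
        score += (n / total) * f1
    return score


# Toy labels (1 = dropout, 0 = retained); illustrative only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
score = weighted_f1(y_true, y_pred)
```

Weighting by support keeps the majority class from being ignored while still letting a poorly predicted minority class pull the score down, which matters when dropout labels are imbalanced.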

Table 5: Comparing our few\-shot adapted approach to domain\-specific supervised baselines for MOOC dropout prediction\. Reported values are $\mu (\pm \sigma)$\.

#### A\.1\.3 Experiments with smaller LLMs

Table 6: Embedding\-based retrieval of ground\-truth task descriptions using task descriptions generated using WorkflowView\. Phi\-4 and gpt\-oss\-20b were used to generate the task descriptions\.

We evaluate the performance of smaller LLMs on browser task inference \(‘Global’ setting; see Section [4\.1](https://arxiv.org/html/2606.14654#S4.SS1)\) and consider two models: Phi\-4 \(14B parameters\) Abdin et al\. \([2024](https://arxiv.org/html/2606.14654#bib.bib29)\) and gpt\-oss\-20b OpenAI \([2025](https://arxiv.org/html/2606.14654#bib.bib5)\)\. The aim is largely to assess whether LLMs that are among the fastest \(as measured by tokens per second\) and cheapest \(cost per token\) Vellum \([2025](https://arxiv.org/html/2606.14654#bib.bib4)\) can interpret user action sequences as well as more expensive counterparts like GPT\-4o\. We find that the task descriptions generated using Phi\-4 and gpt\-oss\-20b have a mean similarity of $0.902 \pm 0.036$ and $0.909 \pm 0.039$, respectively\. Table [6](https://arxiv.org/html/2606.14654#A1.T6) shows that the retrieval\-based metrics are also on par with those obtained using the GPT\-4o model in Table [1](https://arxiv.org/html/2606.14654#S4.T1)\.
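The evaluation above can be sketched as follows: embed each generated description and each ground-truth description, then report the mean cosine similarity of matched pairs and how often a generated description retrieves its own ground truth. This is a minimal sketch under stated assumptions: the toy 3-dimensional vectors stand in for sentence embeddings, and the top-k hit rate is one common retrieval metric, not necessarily the one defined in Section 4.1.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_report(generated, ground_truth, k=1):
    """generated[i] and ground_truth[i] are embeddings of the same task.

    Returns (mean similarity of matched pairs, fraction of queries whose
    own ground truth ranks within the top k candidates).
    """
    sims, hits = [], 0
    for i, g in enumerate(generated):
        scores = [cosine(g, t) for t in ground_truth]
        sims.append(scores[i])
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        hits += i in ranked[:k]
    n = len(generated)
    return sum(sims) / n, hits / n


# Toy vectors standing in for sentence embeddings (illustrative only).
truth = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gen = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.6, 0.0, 0.4]]
mean_sim, top1 = retrieval_report(gen, truth, k=1)
_, top2 = retrieval_report(gen, truth, k=2)
```

In this toy example the third generated vector sits closer to the first ground truth than to its own, so it counts as a miss at k=1 but a hit at k=2.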

### A\.2 Additional related work

Hierarchical action modeling: Beyond the pattern mining techniques discussed in Section 2, our work shares conceptual roots with hierarchical representation learning for sequential data\. Traditional approaches in robotics and plan recognition have long used Hierarchical Hidden Markov Models \(HHMMs\) Fine et al\. \([1998](https://arxiv.org/html/2606.14654#bib.bib155)\) to decompose complex behaviors\. In the human\-computer interaction community, GOMS modeling Card \([2018](https://arxiv.org/html/2606.14654#bib.bib153)\) provided early foundations for decomposing user goals into operators; in WorkflowView’s context, this decomposition is carried out by LLM\-based inference rather than manual task analysis\.

LLMs for system logs: While we discuss LLMs for user action sequences, there is a growing body of work specifically targeting system logs for anomaly detection and root cause analysis\. Methods like LogPPT Le and Zhang \([2023](https://arxiv.org/html/2606.14654#bib.bib152)\) and Log\-LLM Ji et al\. \([2025](https://arxiv.org/html/2606.14654#bib.bib151)\) demonstrate the utility of prompt\-based learning for structured log parsing\. WorkflowView differs by focusing on semantic user intent across diverse UI domains rather than system\-level health monitoring\.

In\-context learning for sequence tasks and hierarchical prompting: The sensitivity analysis in Section 4\.2 regarding few\-shot examples aligns with recent findings on the “lost in the middle” phenomenon and context window saturation Liu et al\. \([2024b](https://arxiv.org/html/2606.14654#bib.bib150)\)\. More recently, the “least\-to\-most” prompting paradigm Zhou et al\. \([2023](https://arxiv.org/html/2606.14654#bib.bib149)\) demonstrated that LLMs are significantly more effective when complex problems are decomposed into a series of simpler sub\-problems, with the solution to each step facilitating the next\. WorkflowView translates this principle to the domain of telemetry by treating raw event logs as the input “complex problem” and using a layered architecture to progressively abstract user intent\. The current study does not include a formal ablation on the hierarchical structure itself \(e\.g\., comparing single\-pass prompting against the multi\-layered approach\)\. We maintain that this hierarchical design is central to the “progressive denoising” required for noisy and granular telemetry\. As least\-to\-most prompting has already been shown to significantly outperform single\-pass reasoning in tasks requiring complex decomposition and easy\-to\-hard generalization, we chose to implement this proven hierarchical logic rather than re\-evaluating its effectiveness through ablations\.
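The layered design described above can be sketched as two chained calls, where the output of the first layer becomes the input of the second. This is a minimal sketch, not the authors' implementation: the prompt wording is hypothetical (the actual prompts are those listed in Tables 7 to 10), the function names are invented for illustration, and `fake_llm` is a stub that stands in for a real model call.

```python
from typing import Callable, List


def layer1_describe(actions: List[str], llm: Callable[[str], str]) -> str:
    """Layer 1: turn raw actions into a natural-language description."""
    prompt = "Describe in plain language what the user did:\n" + "\n".join(actions)
    return llm(prompt)


def layer2_summarize(description: str, llm: Callable[[str], str]) -> str:
    """Layer 2: condense the Layer 1 description into a succinct intent."""
    prompt = "Summarize the user's intent in one sentence:\n" + description
    return llm(prompt)


def abstract_workflow(actions: List[str], llm: Callable[[str], str]) -> str:
    # The simpler step's output feeds the next step, as in least-to-most prompting.
    return layer2_summarize(layer1_describe(actions, llm), llm)


# Stub model so the sketch runs offline: records prompts, returns a tag.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"<response {len(calls)}>"


result = abstract_workflow(["click search box", "type 'hotels'"], fake_llm)
```

Passing the model as a callable keeps the layering logic independent of any particular provider, so a single-pass variant could be swapped in for the ablation the text mentions.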

### A\.3 Quality & stability of inferred activities

It is important to address the stability of our unsupervised categorization, particularly within the Microsoft Word case study involving enterprise document software\. Because this domain lacks labeled ground truth for the inferred high\-level categories, we conducted a sensitivity analysis to evaluate the consistency of the LLM’s discovered categories\. We re\-ran the entire categorization pipeline across three independent trials using different random seeds\. Qualitative analysis of the inferred categories indicates that the “top\-N” discovered categories, accounting for over 90% of the total analyzed sequences, were inferred consistently across all runs\. For instance, core categories such as “Reviewing comments \(collaboration\)” and “Content transfer within or between documents” appeared in every trial\. The variance in these high\-frequency categories was limited strictly to lexical changes in naming; for example, one trial labeled a cluster as “Active editing of content” while another named it “Editing content actively,” yet both mapped to the same underlying distribution of low\-level user action patterns, as verified by TF\-IDF scores of the action sequences associated with these categories\. In contrast, the “long\-tail” activities representing less than 10% of the dataset exhibited higher instability; for example, “Document navigation” was not consistently isolated as a standalone category across all runs, often being absorbed into broader clusters\. While future work could involve extensive human evaluation to further validate these semantic boundaries, the qualitative results observed across our experiments, particularly the browser inference tasks in Table [2](https://arxiv.org/html/2606.14654#S4.T2) and Table [11](https://arxiv.org/html/2606.14654#A1.T11), are encouraging\. The precision with which the framework translates low\-level user action sequences into human\-readable intent builds confidence in the stability and quality of these inferred categories for functional product telemetry, where the goal is to understand the most common user journeys\.
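One way to run the cross\-trial check described above is to treat each category in each run as a bag of action tokens, weight tokens by TF-IDF, and match categories across runs by cosine similarity. The sketch below is an assumption-laden illustration: the action tokens are invented, the smoothed IDF is one common variant, and the paper does not state exactly how its TF-IDF comparison was computed.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: {name: list of action tokens}. Returns {name: {token: weight}}."""
    n = len(docs)
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    vectors = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        # Term frequency times a smoothed inverse document frequency.
        vectors[name] = {
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Hypothetical action tokens per inferred category in two trials.
run_a = {
    "Active editing of content": ["type", "type", "delete", "format"],
    "Reviewing comments (collaboration)": ["open_comment", "reply", "resolve"],
}
run_b = {
    "Editing content actively": ["type", "delete", "type", "format", "type"],
    "Reviewing comments": ["open_comment", "resolve", "reply", "reply"],
}

docs = {("a", k): v for k, v in run_a.items()}
docs.update({("b", k): v for k, v in run_b.items()})
vecs = tfidf_vectors(docs)

# For each category in run A, pick the most similar category in run B.
best_match = {
    a: max(run_b, key=lambda b: cosine(vecs[("a", a)], vecs[("b", b)]))
    for a in run_a
}
```

Categories whose names differ only lexically should match each other with high similarity, while an unstable long-tail category would show no clear counterpart.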

Table 7: WorkflowView Layer 1 prompt for browser task inference; obtaining natural language descriptions from action sequences\.

Table 8: WorkflowView Layer 2 prompt for browser task inference; obtaining succinct summary of user intent\.

Table 9: WorkflowView Layer 1 prompt for MOOC student dropout prediction; obtaining natural language descriptions\.

Table 10: WorkflowView Layer 2 prompt for MOOC student dropout prediction; obtaining succinct activity summary\.

Table 11: Qualitative examples of task descriptions generated using WorkflowView from the action sequences, and the corresponding ground\-truth task descriptions\.
