TwinBI: An Agentic Digital Twin for Efficient Augmented Interactions with Business Intelligence Dashboards

arXiv cs.AI 06/15/26, 04:00 AM Papers

Summary

TwinBI is a framework that couples an LLM-based agent with an executable BI dashboard state to maintain consistency during multi-step analytical interactions, improving accuracy and reducing timeout rates in benchmarks.

arXiv:2606.13731v1 Announce Type: new Abstract: Business intelligence (BI) increasingly combines dashboard interaction with LLM-based assistance, but these two modes often fall out of sync during multi-step analysis. As users switch between direct dashboard manipulation and natural-language queries, it becomes difficult to preserve a consistent analytical state across filters, hierarchies, metrics, and chart context. We present TwinBI, an agentic digital-twin framework that couples an LLM-based agent system with an executable BI dashboard state. TwinBI unifies conversational interaction, dashboard manipulation, semantic grounding, and provenance tracking through a shared analytical state reconstructed from a unified interaction log. It also exposes artifacts such as schema views, SQL, logs, and an /insights command for state-grounded analytical summaries. We evaluate TwinBI in two complementary ways. In a controlled A/B benchmark with the same backbone agent, TwinBI improves exact-match accuracy from 43.3% to 63.3%, partial-credit accuracy from 48.3% to 70.8%, and substantially reduces timeout rate from 40.0% to 10.0% relative to Dashboard alone. In a usability study, participants benefited from the integrated dashboard-and-chat workflow, with high task accuracy, moderate workload, and favorable ratings for state-aware interaction mechanisms. These results suggest that TwinBI improves both agent-level analytical reliability and user-facing analytical support by turning visible dashboard state into richer actionable context. Our dataset and source code are available at: https://github.com/simonjisu/TwinBI


# TwinBI: An Agentic Digital Twin for Efficient Augmented Interactions with Business Intelligence Dashboards
Source: [https://arxiv.org/html/2606.13731](https://arxiv.org/html/2606.13731)
Graduate School of Data Science, Seoul National University

Email: simonjisu@snu.ac.kr & wensyanli@snu.ac.kr

###### Abstract

Business intelligence (BI) increasingly combines dashboard interaction with LLM-based assistance, but these two modes often fall out of sync during multi-step analysis. As users switch between direct dashboard manipulation and natural-language queries, it becomes difficult to preserve a consistent analytical state across filters, hierarchies, metrics, and chart context. We present TwinBI, an agentic digital-twin framework that couples an LLM-based agent system with an executable BI dashboard state. TwinBI unifies conversational interaction, dashboard manipulation, semantic grounding, and provenance tracking through a shared analytical state reconstructed from a unified interaction log. It also exposes artifacts such as schema views, SQL, logs, and an /insights command for state-grounded analytical summaries. We evaluate TwinBI in two complementary ways. In a controlled A/B benchmark with the same backbone agent, TwinBI improves exact-match accuracy from 43.3% to 63.3%, partial-credit accuracy from 48.3% to 70.8%, and substantially reduces timeout rate from 40.0% to 10.0% relative to Dashboard alone. In a usability study, participants benefited from the integrated dashboard-and-chat workflow, with high task accuracy, moderate workload, and favorable ratings for state-aware interaction mechanisms. These results suggest that TwinBI improves both agent-level analytical reliability and user-facing analytical support by turning visible dashboard state into richer actionable context. Our dataset and source code are available at:

[https://github\.com/simonjisu/TwinBI](https://github.com/simonjisu/TwinBI)

## 1 Introduction

Business Intelligence \(BI\) systems form the core infrastructure that underpins data\-driven decision\-making in modern organizations\. They allow analysts and decision\-makers to investigate structured data, track organizational performance, and ground their decisions in measurable evidence\[bi\_system\]\. Recent progress in natural language processing, particularly in LLM\-based agent architectures, has introduced a new interaction paradigm for BI\. These systems are often presented as potential successors to traditional dashboards and analytics tools, converting natural language requests into tool executions and structured query language \(SQL\) statements\.

Yet this emerging replacement narrative overlooks a long\-standing disconnect between fluent natural language generation and analytically sound decision support\. Enterprise business intelligence \(BI\) is grounded in precisely defined semantics—such as metric definitions, time assumptions, aggregation grains, and filter scopes—that are often encoded only implicitly in dashboards and semantic layers\. LLM\-based agents can drift outside these constraints, yielding answers that read well but are analytically inconsistent with the system’s actual state\. We therefore suggest that robust “agentic BI” may benefit from combining interactive BI tools with LLM\-based assistance through an explicit coordination layer that aligns user intent, semantic definitions, and query execution, rather than relying on natural language alone as the interface\.

To tackle this challenge, we present TwinBI, a framework that supports BI through two interconnected digital twins: an LLM agent twin that models user intent and reasoning, and a BI twin that represents an executable analytics state, with both twins remaining synchronized throughout the interaction. TwinBI fuses natural-language interaction with explicit machine-readable representations of analytic schemas and hierarchies, metric-dimension mappings, executable query specifications, and their associated result sets, while grounding the agent’s behavior in the user’s current analytical context, inferred from dashboard interactions. The system exposes intermediate analytical states (including tool invocations and query parameters) and captures complete provenance through unified event logging and persistent identifiers. This design promotes transparency and traceability for both user interactions and system-level reasoning. TwinBI thereby shifts the role of LLM-based agents from “replacing BI” to “working in concert with BI,” enhancing the robustness and reliability of decision support for business users.

In this paper, we present the design of TwinBI and evaluate it in two complementary ways\. In a controlled A/B benchmark with the same backbone agent, TwinBI improves exact\-match accuracy, partial\-credit accuracy, and completion reliability over Dashboard alone\. We further report a usability study showing that users benefit from the integrated dashboard\-and\-chat workflow for completing analysis tasks and interpreting results\.

## 2 Background

Business Intelligence (BI) encompasses the methods, tools, and technologies that transform organizational data into actionable insights [bi_system]. Many BI platforms rely on Online Analytical Processing (OLAP), where data cubes organize multidimensional aggregates over measures (e.g., sales or units sold) and dimensions (e.g., time, geography, or product) [gray1997data,chaudhuri1997overview]. Dimensions often contain hierarchies, such as Year ≻ Quarter ≻ Month, that support analysis at multiple granularities.

Analytical exploration over cubes is commonly described through operators such as slice, dice, roll up, drill down, and pivot. In BI dashboards, these operators correspond to familiar actions such as filtering, changing time granularity, cross-filtering, changing group-by fields, and reconfiguring chart views, thereby forming the interaction vocabulary for navigating the underlying multidimensional space.
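As an illustration of how these operators relate to each other, the sketch below applies slice, dice, roll up, and drill down to a toy fact table. It is not taken from the TwinBI codebase: the rows, column names, and helper functions are hypothetical and chosen only to mirror the Year ≻ Quarter ≻ Month hierarchy described above.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, month, region, units_sold).
COLUMNS = ("year", "quarter", "month", "region", "units_sold")
FACTS = [
    (2024, "Q3", "2024-07", "North", 120),
    (2024, "Q3", "2024-08", "North", 100),
    (2024, "Q3", "2024-09", "South", 80),
    (2024, "Q4", "2024-10", "North", 150),
    (2024, "Q4", "2024-11", "South", 90),
]


def aggregate(rows, group_by):
    """Sum the units_sold measure grouped by the given dimension columns."""
    idx = [COLUMNS.index(c) for c in group_by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_(rows, column, value):
    """Slice: fix one dimension to a single value."""
    i = COLUMNS.index(column)
    return [r for r in rows if r[i] == value]


def dice(rows, allowed):
    """Dice: keep rows whose dimensions fall inside the allowed value sets."""
    return [
        r for r in rows
        if all(r[COLUMNS.index(c)] in vals for c, vals in allowed.items())
    ]


# Drill down: group at the finer (quarter, month) level.
monthly = aggregate(FACTS, ["quarter", "month"])
# Roll up: group at the coarser quarter level.
quarterly = aggregate(FACTS, ["quarter"])
# Slice on region, then roll up to quarters.
north_by_quarter = aggregate(slice_(FACTS, "region", "North"), ["quarter"])
# Dice on two dimensions at once.
south_h2 = dice(FACTS, {"region": {"South"}, "quarter": {"Q3", "Q4"}})
```

In a dashboard, changing the time granularity corresponds to switching between `monthly` and `quarterly`, while a filter widget corresponds to `slice_` or `dice`.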

Large language models (LLMs) enable natural language interaction with data by converting user questions into structured representations and generating explanations grounded in retrieved evidence. Extending this idea, LLM agents move beyond single-turn prompts to create tool-augmented workflows in which the agent decomposes a request, invokes external tools, and combines their outputs into a final response [fan2024survey,yao2022react].

A widely used strategy for building LLM-powered BI assistants is the Natural Language to SQL (NL2SQL) pipeline, which converts a user’s request into an SQL query, executes it, and returns the result in natural language [zhang2024natural,liu2024survey]. This approach is practical because it maps user intent directly to executable analytical queries over the underlying database.
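The three stages of such a pipeline (translate, execute, verbalize) can be sketched as follows. This is a minimal illustration under stated assumptions: the translation step is replaced by a fixed lookup instead of a model call, and the `sales` table, its contents, and the function names are hypothetical. Only the standard-library `sqlite3` module is used.

```python
import sqlite3


def translate_to_sql(question):
    """Stand-in for the LLM step: a fixed lookup instead of a model call."""
    canned = {
        "total units sold by region": (
            "SELECT region, SUM(units) FROM sales "
            "GROUP BY region ORDER BY region"
        ),
    }
    return canned[question.lower().rstrip("?")]


def answer(conn, question):
    """Translate the question, execute the SQL, and verbalize the rows."""
    sql = translate_to_sql(question)
    rows = conn.execute(sql).fetchall()
    summary = "; ".join(f"{region}: {total}" for region, total in rows)
    return sql, summary


# Hypothetical in-memory data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120), ("South", 80), ("North", 30)],
)

sql, summary = answer(conn, "Total units sold by region?")
```

Note that nothing in this pipeline knows about dashboard filters or the active chart, which is the gap the next paragraph describes.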

Despite these advances, LLM agents still fall short of being full BI platforms, as operational analytics needs far more than SQL generation\. In particular, BI systems must preserve analytical state across both natural\-language interaction and direct dashboard manipulation while keeping metrics, filters, and hierarchy semantics aligned throughout multi\-step exploration\.

![Refer to caption](https://arxiv.org/html/2606.13731v1/figs/tb.png)

Figure 1: The TwinBI user interface: (1) a chat interface for natural-language analytics queries, (2) an embedded dashboard for interactive exploration, and (3) an inspection panel exposing artifacts such as SQL and the hierarchical schema to support schema understanding.

![Refer to caption](https://arxiv.org/html/2606.13731v1/x1.png)

Figure 2: System Architecture of TwinBI.

## 3 System Architecture

TwinBI adopts a layered architecture that synchronizes an LLM\-based agent system with an executable BI dashboard state\. The architecture preserves a consistent analytical state across conversational interaction and direct dashboard manipulation while maintaining end\-to\-end traceability through unified logs\.

Figure [1](https://arxiv.org/html/2606.13731#S2.F1) shows the TwinBI interface, which combines natural-language interaction, embedded dashboards, and inspection views for schema- and query-level artifacts. Figure [2](https://arxiv.org/html/2606.13731#S2.F2) illustrates the overall design of the system. The system is containerized using Docker [docker_docs] to ensure isolated and reproducible deployment. The architecture consists of five layers: (1) Presentation Layer, (2) Orchestration Layer, (3) Semantic Layer, (4) BI Tool Layer, and (5) Data Layer.

### 3.1 Presentation Layer

The presentation layer provides the user-facing experience, combining a chat interface with embedded BI dashboards. Users can submit natural-language queries or interact directly with visualizations via filtering, tab switching, cross-filtering, and drill-down operations.

To keep both modalities aligned, the interface tracks the active analytical context, including the selected chart, tab, and recent dashboard interactions, and sends these signals to the backend for state reconstruction\.

The interface is built with Streamlit [streamlit_docs] and incorporates Apache Superset [superset_software] dashboards to provide interactive visualizations.

Table 1: Major user interface activities used for unified interaction logs.

### 3.2 Backend Orchestration & Intelligence Layer

The backend orchestration and intelligence layer manages both the multi-LLM agent system and the executable dashboard state. Implemented with FastAPI [fastapi_docs], it combines recent dialogue history, dashboard interaction logs, and tool outputs into a unified analytical context, routes sub-tasks to specialized agents, and consolidates their results into responses anchored in the current state.

All interactions with external systems occur through backend-governed tools. The backend also maintains a unified interaction log that captures conversational exchanges, dashboard operations, and tool metadata as the authoritative record for state reconstruction and provenance. Table [1](https://arxiv.org/html/2606.13731#S3.T1) summarizes the major dashboard interaction events recorded in this log.

### 3.3 Semantic Layer

The semantic layer captures business meaning using declarative models of measures, dimensions, hierarchies, and join paths. It provides the shared semantic model for both conversational outputs and dashboard queries, enforcing compatible grains and valid joins. We also derive a Hierarchy Schema Graph from fact tables and dimension hierarchies, giving the Schema Explorer agent a structured and navigable view of the analytical schema.
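To make the idea of a declarative model with enforced join paths concrete, the following sketch compiles a measure and a list of dimensions into SQL and rejects requests that have no declared join. It is a simplified illustration, not Cube's modeling syntax or API: the `MODEL` dictionary, the table and column names, and `compile_query` are all hypothetical.

```python
# Hypothetical declarative model: measures, dimensions, and join paths.
MODEL = {
    "measures": {
        "units_sold": {"table": "sales", "sql": "SUM(sales.units)"},
    },
    "dimensions": {
        "category": {"table": "product", "sql": "product.category"},
        "quarter": {"table": "date_dim", "sql": "date_dim.quarter"},
        "supplier": {"table": "supplier", "sql": "supplier.name"},
    },
    "joins": {
        ("sales", "product"): "sales.product_id = product.id",
        ("sales", "date_dim"): "sales.date_id = date_dim.id",
        # No join path from sales to supplier is declared on purpose.
    },
}


def compile_query(measure, dimensions):
    """Build SQL from the model, refusing undeclared fields or joins."""
    m = MODEL["measures"].get(measure)
    if m is None:
        raise KeyError(f"unknown measure: {measure}")
    fact = m["table"]
    dim_sql, joins = [], []
    for name in dimensions:
        d = MODEL["dimensions"].get(name)
        if d is None:
            raise KeyError(f"unknown dimension: {name}")
        if d["table"] != fact:
            cond = MODEL["joins"].get((fact, d["table"]))
            if cond is None:
                raise ValueError(f"no join path from {fact} to {d['table']}")
            clause = f"JOIN {d['table']} ON {cond}"
            if clause not in joins:
                joins.append(clause)
        dim_sql.append(d["sql"])
    parts = [f"SELECT {', '.join(dim_sql + [m['sql']])}", f"FROM {fact}"]
    parts += joins
    if dim_sql:
        parts.append(f"GROUP BY {', '.join(dim_sql)}")
    return " ".join(parts)


sql = compile_query("units_sold", ["category", "quarter"])
```

Because both chat-generated and dashboard-generated queries would pass through the same compile step, a request such as `compile_query("units_sold", ["supplier"])` fails early with a `ValueError` instead of producing an invalid join.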

This layer is built on top of Cube\[cube\_core\], which provides REST and SQL interfaces for executing model\-driven queries\.

### 3.4 BI Tool

The BI tool layer offers interactive dashboards for visual data exploration. TwinBI uses Apache Superset [superset_software] both to render visualizations and to capture detailed interaction events, which are ingested into the unified log and replayed through the chart data API when contextual grounding is required.

### 3.5 Data Layer

The Data Layer serves as analytical storage for domain-specific data. Analytical datasets are stored in DuckDB [duckdb_docs] and accessed solely through the semantic layer, preserving uniform metric definitions and aggregation behavior while keeping storage decoupled from interaction handling.

## 4 Functionality of TwinBI System

TwinBI is designed for the common BI workflow in which users alternate between clicking through an existing dashboard and asking follow\-up questions in natural language\. Instead of treating these as separate modes, the system reuses the dashboard state accumulated through interaction and applies it to subsequent chat requests\. In practice, this means that charts, filters, and follow\-up questions are resolved against the same reconstructed analytical state rather than against an isolated prompt\.

![Refer to caption](https://arxiv.org/html/2606.13731v1/x2.png)

Figure 3: (Left) Example query requesting product categories with more than 15% quarter-over-quarter unit sales growth in the fourth quarter. (Right) Example /insights output summarizing high-growth product categories, caveats about percentage-based interpretation, and suggested next analysis steps.

### 4.1 Finding and Creating Charts

Users can access or generate charts through two complementary mechanisms: \(1\) direct interaction with the BI dashboard and \(2\) natural\-language prompts via the chat interface\.

In dashboard\-centric interaction, users navigate and refine existing visualizations by applying filters, switching tabs, using cross\-filtering, drilling down, and toggling series visibility\. We explicitly log these actions because the resulting state is often not recoverable from a later chat turn alone\. For example, a follow\-up question such as “Why did this category increase?” is underspecified unless the system can recover which chart was active, which filters were already applied, and which hierarchy level the user had reached\. TwinBI therefore encodes dashboard actions as structured events and uses them to reconstruct the current analytical state before interpreting a new conversational request\.
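The replay idea in this paragraph can be sketched as a fold over logged events. This is a minimal sketch rather than TwinBI's implementation: the event type names, the event fields, and the `DashboardState` class are hypothetical, since the actual log schema is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DashboardState:
    """Analytical state implied by the log so far."""
    active_tab: Optional[str] = None
    active_chart: Optional[str] = None
    filters: dict = field(default_factory=dict)
    drill_level: Optional[str] = None


def apply_event(state, event):
    """Fold one logged event into the state (event names are hypothetical)."""
    kind = event["type"]
    if kind == "tab_switch":
        state.active_tab = event["tab"]
        state.active_chart = None  # assumption: chart focus resets with the tab
    elif kind == "chart_select":
        state.active_chart = event["chart"]
    elif kind in ("filter_set", "cross_filter"):
        state.filters[event["column"]] = event["value"]
    elif kind == "filter_clear":
        state.filters.pop(event["column"], None)
    elif kind == "drill_down":
        state.drill_level = event["level"]
    # Chat turns and tool calls stay in the log for provenance but do not
    # change the dashboard state in this sketch.


def reconstruct(log):
    """Replay the whole log in order to obtain the current state."""
    state = DashboardState()
    for event in log:
        apply_event(state, event)
    return state


log = [
    {"type": "tab_switch", "tab": "Category"},
    {"type": "filter_set", "column": "quarter_start", "value": "2024-10-01"},
    {"type": "cross_filter", "column": "department", "value": "Marketing"},
    {"type": "chart_select", "chart": "Category Scatter (Growth vs. Scale)"},
    {"type": "chat", "text": "Why did this category increase?"},
]
state = reconstruct(log)
```

With the state recovered this way, the underspecified phrase "this category" in the final chat turn can be resolved against `state.active_chart` and `state.filters` rather than against the chat text alone.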

In conversational interaction, users can request new visualizations without needing to specify schema names or rebuild the dashboard context from scratch\. The system resolves the request through the semantic layer and keeps any generated chart aligned with the current slice of analysis\. We found this important in cases where users wanted to branch from an existing dashboard view rather than start a new query from a blank state\.

### 4.2 Hierarchy Schema Graph, SQL, and Logs

TwinBI exposes three inspection artifacts for users who want to verify what the system is doing rather than accept a final answer at face value\. First, the Hierarchy Schema Graph gives a compact view of measures, dimensions, and hierarchies through the Schema Explorer\. This is mainly useful when a user knows the business concept they want but not the exact field names supported by the semantic model\. Second, TwinBI exposes the SQL associated with each chart query so that users can inspect joins, filters, and aggregation choices\. Third, the unified interaction log can be inspected directly to trace how a conversational answer relates to earlier dashboard operations and tool calls\. We include these artifacts because debugging BI answers often requires checking whether an error came from schema selection, filter carry\-over, or answer generation\.

### 4.3 Finding Insights by Agent

As shown on the right side of Figure [3](https://arxiv.org/html/2606.13731#S4.F3), TwinBI offers a dedicated /insights command for moments when users want a state-aware summary of the current view rather than an answer to a new question. When this command is executed, the backend assembles a compact execution context from the unified interaction log, including recent conversations, tool traces, the active chart, and its current filters. This context is passed to a specialized insight agent that returns a short summary organized around three elements: the current analytical slice, the main quantitative observations visible in that slice, and sensible next checks for the user.
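The context-assembly step can be illustrated with a small function that extracts the four ingredients named above from a log. This is a sketch under assumptions: the event type names, the field names, and the truncation limits are hypothetical, and the call to the insight agent itself is omitted.

```python
def build_insight_context(log, max_chat_turns=3, max_tool_traces=3):
    """Assemble a compact context for a summary request from the log."""
    chats = [e["text"] for e in log if e["type"] == "chat"]
    tools = [e["name"] for e in log if e["type"] == "tool_call"]
    active_chart, filters = None, {}
    for event in log:
        if event["type"] == "chart_select":
            active_chart = event["chart"]
        elif event["type"] == "filter_set":
            filters[event["column"]] = event["value"]
        elif event["type"] == "filter_clear":
            filters.pop(event["column"], None)
    return {
        "recent_chat": chats[-max_chat_turns:],   # most recent turns only
        "tool_traces": tools[-max_tool_traces:],
        "active_chart": active_chart,
        "filters": filters,
    }


# Hypothetical log mixing dashboard events, chat turns, and tool calls.
log = [
    {"type": "filter_set", "column": "quarter_start", "value": "2024-10-01"},
    {"type": "chart_select", "chart": "Department QoQ Growth"},
    {"type": "chat", "text": "Which departments grew the most?"},
    {"type": "tool_call", "name": "run_chart_query"},
    {"type": "chat", "text": "Show categories above 15% growth."},
    {"type": "chat", "text": "/insights"},
]
context = build_insight_context(log, max_chat_turns=2)
```

Keeping the context limited to the active chart, its filters, and a few recent turns is one way to keep a summary tied to what is currently visible.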

This function is intentionally constrained so summary-style outputs remain grounded in the currently visible analytical evidence. It only summarizes information supported by the current analytical state and must indicate when evidence is insufficient for a stronger claim. Figure [3](https://arxiv.org/html/2606.13731#S4.F3) illustrates this by showing a summary tied to the current dashboard context rather than to new exploratory queries.

## 5 Experiments

We design a benchmark-style A/B evaluation to measure whether TwinBI’s state-grounded orchestration improves analytical task completion over a Dashboard system under matched model and environment conditions. Unlike the usability study in Section [6](https://arxiv.org/html/2606.13731#S6), this experiment targets controlled agent performance on a fixed query set and focuses on exact-match accuracy, robustness, and interaction efficiency.

### 5.1 Experimental Setting

The evaluation uses a retail sales dashboard environment built over a shared semantic model with product, store, and date as its primary analytical dimensions. In both experimental conditions, we employ a Playwright-based browser agent configured with gpt-5-mini as the decision-making model [playwright,openaiIntroducingGPT5], subject to a maximum budget of 30 interaction steps. At each step, the agent observes the current state, selects the next action, executes it through Playwright, and then bases its next decision on the newly observed state. Here, the state includes the current screenshot, actionable UI candidates, recent action history, and task-specific prompt context. We compare two systems: (A) Dashboard, which makes these stepwise decisions from the visible dashboard alone, and (B) TwinBI, which makes the same kind of stepwise decisions but augments them with the chat interface and backend support, with gpt-5-mini also powering the Orchestration & Intelligence Layer.
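The observe, decide, execute loop with a fixed step budget can be sketched as below. The sketch makes no Playwright or model calls: `ToyEnv` and the two policies are hypothetical stand-ins, and only the 30-step budget is taken from the text.

```python
MAX_STEPS = 30  # step budget stated in the experimental setting


class ToyEnv:
    """Hypothetical stand-in for the browser environment."""

    def __init__(self):
        self.clicks = 0

    def observe(self):
        return {"clicks": self.clicks}

    def execute(self, action):
        if action["type"] == "click":
            self.clicks += 1


def run_episode(env, policy, max_steps=MAX_STEPS):
    """Run one query until the policy answers or the budget is exhausted."""
    history = []
    for step in range(1, max_steps + 1):
        observation = env.observe()
        action = policy(observation, history)
        history.append(action)
        if action["type"] == "answer":
            return {"status": "answered", "answer": action["value"], "steps": step}
        env.execute(action)
    return {"status": "timeout", "answer": None, "steps": max_steps}


def answering_policy(observation, history):
    """Clicks twice, then answers (a stand-in for the model's decision)."""
    if observation["clicks"] < 2:
        return {"type": "click", "target": (10, 20)}
    return {"type": "answer", "value": "Mobile"}


def probing_policy(observation, history):
    """Never answers, so the episode ends by timeout."""
    return {"type": "hover", "target": (5, 5)}


ok = run_episode(ToyEnv(), answering_policy)
stuck = run_episode(ToyEnv(), probing_policy)
```

Under this framing, the two conditions differ only in what the policy can observe and which actions it may take, while the loop and the budget stay fixed.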

The benchmark consists of 30 analytical queries\. The query set is balanced across five task families, with six queries in each family: \(1\) store and district ranking, \(2\) premium product analysis, \(3\) quarter\-over\-quarter growth analysis, \(4\) comparison and aggregation tasks across dashboard views, and \(5\) robustness and trap tasks that test policy compliance and filter stability\.

To construct the target answers used for evaluation, we resolved each query through three independent paths: direct database queries, cube\-API queries, and dashboard\-level queries\. We then checked these three answers for self\-consistency and performed a final manual verification step before fixing the reference annotation for each benchmark item\.
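The self-consistency check across the three resolution paths can be illustrated with a small helper. The function name, the path labels, and the answer structure are hypothetical; the manual verification step described above happens outside this code.

```python
def agreed_answer(answers):
    """Return the shared answer if all paths agree, else None."""
    values = list(answers.values())
    if values and all(v == values[0] for v in values):
        return values[0]
    return None


# Hypothetical structured answers for one benchmark item.
consistent = {
    "database": {"department": "Marketing", "category": "Mobile"},
    "cube_api": {"department": "Marketing", "category": "Mobile"},
    "dashboard": {"department": "Marketing", "category": "Mobile"},
}
conflicting = dict(consistent, dashboard={"department": "Marketing", "category": "Audio"})

reference = agreed_answer(consistent)
flagged = agreed_answer(conflicting)  # None signals a disagreement to resolve
```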

### 5.2 Evaluation Metrics

We evaluate our system using both outcome-oriented and behavior-oriented metrics. The primary outcome measures are: (1) Exact match accuracy, which assesses whether the final structured prediction is identical to the reference annotation; (2) Partial-credit accuracy, which quantifies slot-level correctness for partially accurate structured outputs, thereby enabling us to distinguish near-miss reasoning from complete failure; and (3) Average steps to completion, which operationalizes interaction efficiency as the total number of recorded steps divided by the number of queries.
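A minimal sketch of these three outcome measures follows. The exact scoring rule for partial credit is not spelled out in the text, so the sketch assumes it is the fraction of reference slots reproduced by the prediction; the record layout and the example values are hypothetical.

```python
def exact_match(pred, ref):
    """1.0 only when the structured prediction equals the reference."""
    return 1.0 if pred == ref else 0.0


def partial_credit(pred, ref):
    """Assumed rule: fraction of reference slots predicted correctly."""
    pred = pred or {}  # a timed-out query has no prediction
    if not ref:
        return 1.0 if not pred else 0.0
    hits = sum(1 for slot, value in ref.items() if pred.get(slot) == value)
    return hits / len(ref)


def outcome_metrics(records):
    """Aggregate per-query records into the three outcome measures."""
    n = len(records)
    return {
        "exact_match": sum(exact_match(r["pred"], r["ref"]) for r in records) / n,
        "partial_credit": sum(partial_credit(r["pred"], r["ref"]) for r in records) / n,
        "avg_steps": sum(r["steps"] for r in records) / n,
    }


# Hypothetical records: one near miss and one exact match.
records = [
    {"pred": {"department": "Marketing", "category": "Audio"},
     "ref": {"department": "Marketing", "category": "Mobile"},
     "steps": 12},
    {"pred": {"store": "S1"}, "ref": {"store": "S1"}, "steps": 6},
]
metrics = outcome_metrics(records)
```

The near miss contributes 0 to exact match but 0.5 to partial credit, which is how the second metric separates near-miss reasoning from complete failure.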

The behavioral metrics are defined as follows: (1) Timeout rate captures the proportion of queries that terminate upon reaching the maximum step budget instead of producing a valid answer. (2) Invalid action rate denotes the proportion of recorded interaction steps that either violate the prescribed action policy or reference an interface element that is unusable. (3) Loop query rate represents the proportion of queries that exhibit consecutive repeated action signatures, whereas loop step rate denotes the proportion of all recorded steps that are part of such repetitive loops. For example, the metric counts repeated chat steps only when the same chat prompt is issued in consecutive turns, and repeated click steps only when the same click target coordinates recur consecutively.
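The behavioral metrics can be computed from per-query traces as sketched below. One detail is an assumption: a step is counted as "part of a loop" when it belongs to a run of two or more identical consecutive action signatures, which is one reasonable reading of the definition above. The trace layout and the example values are hypothetical.

```python
def loop_steps(signatures):
    """Count steps inside runs of >= 2 identical consecutive signatures."""
    count, i = 0, 0
    while i < len(signatures):
        j = i
        while j + 1 < len(signatures) and signatures[j + 1] == signatures[i]:
            j += 1
        run_length = j - i + 1
        if run_length >= 2:
            count += run_length
        i = j + 1
    return count


def behavior_metrics(traces):
    """Aggregate timeout, invalid-action, and loop rates over all queries."""
    n_queries = len(traces)
    total_steps = sum(len(t["steps"]) for t in traces)
    per_query_loops = [loop_steps([s["sig"] for s in t["steps"]]) for t in traces]
    return {
        "timeout_rate": sum(t["status"] == "timeout" for t in traces) / n_queries,
        "invalid_action_rate": sum(
            s["invalid"] for t in traces for s in t["steps"]
        ) / total_steps,
        "loop_query_rate": sum(c > 0 for c in per_query_loops) / n_queries,
        "loop_step_rate": sum(per_query_loops) / total_steps,
    }


# Hypothetical traces: signatures are (action, target) or (action, prompt).
traces = [
    {"status": "answered", "steps": [
        {"sig": ("click", (10, 20)), "invalid": False},
        {"sig": ("chat", "highest QoQ growth?"), "invalid": False},
    ]},
    {"status": "timeout", "steps": [
        {"sig": ("hover", (5, 5)), "invalid": False},
        {"sig": ("hover", (5, 5)), "invalid": False},
        {"sig": ("hover", (5, 5)), "invalid": True},
        {"sig": ("click", (1, 1)), "invalid": False},
    ]},
]
rates = behavior_metrics(traces)
```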

Table 2: Aggregate A/B results with gpt-5-mini fixed across both conditions over the 30-query benchmark set. Average steps are computed from the total recorded steps in each run.

![Refer to caption](https://arxiv.org/html/2606.13731v1/figs/result_dashboard_eg.png)

Figure 4: Category-level dashboard view with Department QoQ Growth (left), Units sold by quarter (center), and Category Scatter (Growth vs. Scale) (right).

### 5.3 Results

Table [2](https://arxiv.org/html/2606.13731#S5.T2) summarizes the aggregate A/B results. TwinBI improved both exact-match and partial-credit accuracy while requiring fewer steps per query. This gain appears to come from enriching visible dashboard state with structured context through the backend agent and chat interface, rather than relying on dashboard probing alone. Figure [4](https://arxiv.org/html/2606.13731#S5.F4) shows the category-level dashboard view used in the representative Q14 example. Q14 is a QoQ slot-filling task that asks for the department–category pair with the highest quarter-over-quarter unit sales growth from Q3 to Q4. To solve it from Dashboard alone, the system must apply the 2024-10-01 filter, select Marketing in the Department QoQ Growth table to trigger cross-filtering, and then hover over Mobile in the Category Scatter (Growth vs. Scale) chart to recover the required tuple. Dashboard located this view but then spent the full 30-step budget on repeated hover-based probing without completing this chain reliably, whereas TwinBI opened the same view and recovered the answer through a single chat query grounded in the visible dashboard state.

Table 3: Behavior-oriented metrics from the same A/B run. TwinBI reduced timeouts and invalid actions substantially, although its loop-step rate remained non-trivial because some failures still involved repeated chat-centered reasoning patterns.

Table [3](https://arxiv.org/html/2606.13731#S5.T3) reports the corresponding behavior metrics. The lower timeout and invalid-action rates indicate that TwinBI makes the interaction process more stable, not only more accurate. This robustness gain appears to be associated with replacing brittle UI-only probing with chat-supported interpretation over richer structured context; Q17 is a representative case. The task asks for the Q3-to-Q4 QoQ growth rate of the Mobile category in the Marketing department. In the Dashboard trace, the system reached the correct category-level view and attempted the right interaction chain, but then remained stuck in repeated hover-based tooltip probing and late filter switching to recover the final value. In contrast, TwinBI opened the same view and issued a chat query over the visible dashboard, which returned the required QoQ growth rate directly and allowed the run to terminate successfully. The loop metrics suggest a more mixed shift in failure mode: TwinBI reduced repeated-failure queries, but some remaining failures still involved repeated chat-centered steps rather than dashboard-level probing.

This A/B test is intended to isolate the effect of system support for the same backbone agent\. The results suggest that TwinBI improves completion reliability and structured interpretation by turning visible dashboard state into richer actionable context under the same agent and step budget\.

## 6 Usability Study

We evaluate how TwinBI supports users as they move between dashboard\-based exploration and conversational follow\-up\. Rather than measuring open\-ended long\-term adoption, this study focuses on whether users can complete representative analytical tasks, how much interaction effort they expend, which interaction patterns they prefer, and whether the system helps them articulate correct higher\-level interpretations\.

### 6.1 Methodology

We conducted a within\-subjects usability study with five participants\. Each participant completed three analytical scenarios with progressively increasing analytical complexity and system support\.

In each scenario, participants had to locate entities that met predefined performance constraints using multidimensional filtering and aggregation\. The scenarios varied in how many interaction modalities were available, enabling us to disentangle the effects of orchestration and analytical state grounding\.

- S1: Store Performance Analysis (Limited Support). Participants identified the top-performing store in the North district based on average daily sales, excluding stores with fewer than 15 active days per month. Dashboard filtering and chart inspection were available; chat assistance was optional.
- S2: Product Pricing Analysis (Moderate Support). Participants identified product types whose average revenue per unit exceeded the overall portfolio average. The interface provided dashboard interaction and conversational chart requests after initial dashboard interaction.
- S3: Category Growth Analysis (Full Support). Participants identified categories that achieved a minimum of 15% growth in unit sales quarter over quarter between Q3 (beginning on 2024-07-01) and Q4 (beginning on 2024-10-01). All interaction mechanisms, including conversational chart generation and insight support, were available.
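The S3 criterion can be made concrete with a short calculation. Quarter-over-quarter growth is taken here as (Q4 units − Q3 units) / Q3 units; the category names and unit counts are invented for illustration and are not study data.

```python
GROWTH_THRESHOLD = 0.15  # minimum QoQ growth required in scenario S3


def qoq_growth(previous, current):
    """Relative change from the previous quarter to the current one."""
    return (current - previous) / previous


# Hypothetical (Q3 units, Q4 units) per category.
units = {
    "Mobile": (200, 260),
    "Audio": (100, 110),
    "Wearables": (80, 90),
}

growth = {name: qoq_growth(q3, q4) for name, (q3, q4) in units.items()}
qualifying = sorted(name for name, g in growth.items() if g >= GROWTH_THRESHOLD)
```

With these invented numbers only Mobile qualifies (30% growth), while Audio (10%) and Wearables (12.5%) fall below the threshold.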

This staged design lets us observe not only whether participants succeed, but also how their behavior changes as more state\-aware assistance becomes available\. In particular, we were interested in whether participants would continue relying on direct dashboard interaction, shift toward chat once context was established, or combine the two modes\.

For evaluation, we report both objective and subjective measures. The objective measures are: (1) Task Accuracy, indicating whether participants solved each scenario correctly; (2) Interaction Cost, measured as the number of dashboard events and chat turns per scenario; and (3) Insight Accuracy, indicating whether participants produced correct higher-level interpretations. The subjective measures are: (1) Perceived Difficulty on a 5-point Likert scale; (2) Feature Usefulness, covering dashboard interaction, chart finding, click+chat, chat-only interaction, SQL inspection, schema exploration, interaction log inspection, and /insights; and (3) NASA-TLX dimensions for mental demand, temporal demand, performance, effort, and frustration [hart1988nasa_tlx].

Table 4: Summary of scenario-level results (S1–S3). Accuracy metrics are reported as percentages. Interaction cost represents the mean number of dashboard clicks and chat turns per scenario. Perceived difficulty reflects participants’ average self-reported task difficulty on a 5-point scale.

| Metric | S1 | S2 | S3 |
| --- | --- | --- | --- |
| Task Accuracy (%) | 100% | 73.33% | 100% |
| Insight Accuracy (%) | 80% | 100% | 80% |
| Average Interaction Cost (Clicks) | 6.4 | 34 | 49 |
| Average Interaction Cost (Chats) | 0.6 | 6 | 5.2 |
| Average Perceived Difficulty (1–5) | 1.8 | 3.4 | 4.2 |

![Refer to caption](https://arxiv.org/html/2606.13731v1/x3.png)

Figure 5: Feature usefulness combining relative preference (Borda scores) and usefulness ratings (mean with standard deviation). Bars show Borda scores; points show feature usefulness assessment ratings (0 responses treated as N/A).

![Refer to caption](https://arxiv.org/html/2606.13731v1/x4.png)

Figure 6: NASA-TLX workload ratings (N=5). Bars represent mean scores with standard deviation error bars. Dashed vertical lines indicate workload thresholds (Low, Moderate, High), color-coded for interpretability.

### 6.2 Results

RQ1: To what extent does TwinBI simplify BI workflows and minimize operational friction without compromising analytical accuracy?

Table [4](https://arxiv.org/html/2606.13731#S6.T4) reports the scenario-level outcomes. Task Accuracy remained high across all scenarios, although the harder scenarios required substantially more interaction. Participants completed S1 with little assistance and relatively few clicks, whereas S2 and S3 led to heavier use of both dashboard interaction and chat. In the study, participants usually used the dashboard to establish context first and then turned to chat for comparison, threshold checking, or explanation across multiple views. Figure [6](https://arxiv.org/html/2606.13731#S6.F6) shows that workload still stayed in the low-to-moderate range despite this increase in interaction cost. Given the small sample, we interpret this result as evidence that the combined workflow is usable for moderately complex tasks rather than as proof of broad efficiency gains.

RQ2: Does analytical state awareness improve agent assistance effectiveness?

Figure [5](https://arxiv.org/html/2606.13731#S6.F5) presents the perceived usefulness results. Participants consistently ranked state-aware combinations such as the Clickable Dashboard, Finding Charts via Agent, and Click+Chat above chat-only interaction and direct SQL inspection. During the study, participants rarely abandoned the dashboard entirely. More often, they located the relevant view first and then used chat to clarify or summarize what they were already seeing. Borda aggregation, together with inter-participant agreement (Kendall’s W = 0.62, p < 0.01), shows that this ordering was fairly stable across participants [emerson2013original,kendall1939problem].
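For readers unfamiliar with the two statistics, the sketch below computes Borda scores and Kendall's coefficient of concordance from a small ranking matrix. It uses the standard untied formula W = 12S / (m²(n³ − n)), where m is the number of raters, n the number of items, and S the sum of squared deviations of the rank sums from their mean. The rankings are invented for illustration, not the study's data, and the significance test behind the reported p-value is omitted.

```python
def borda_scores(rankings):
    """Each item earns (n - rank) points per rater; rank 1 is best."""
    n = len(rankings[0])
    return [sum(n - r[i] for r in rankings) for i in range(n)]


def kendalls_w(rankings):
    """Kendall's W for m raters ranking n items without ties."""
    m, n = len(rankings), len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((x - mean_rank_sum) ** 2 for x in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical rankings: 3 participants ranking 4 features (1 = most useful).
rankings = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]
scores = borda_scores(rankings)
w = kendalls_w(rankings)
```

W ranges from 0 (no agreement) to 1 (identical rankings), so a value such as the reported 0.62 indicates substantial but not perfect agreement among participants.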

Chat use increased in the more complex scenarios, especially when participants needed help interpreting a filtered view rather than finding it\. In that sense, the benefit came less from replacing dashboard interaction and more from supporting follow\-up interpretation once the dashboard context was already in place\.

RQ3: Does the unified interaction log improve insight generation and reflective reasoning?

Insight Accuracy was 80% in S1, 100% in S2, and 80% in S3, so most participants were able to produce correct higher-level interpretations in the guided scenarios. Direct use of /insights, however, was limited: three of the five participants invoked it, and two of those three produced fully correct insights. That usage pattern is too small to support a strong claim, but it does suggest that the feature can be useful when participants have already narrowed the analysis to a specific view.

The usefulness ratings in Figure [5](https://arxiv.org/html/2606.13731#S6.F5) point in the same direction. Participants responded more positively to log-derived support when it appeared through higher-level tools than when the raw interaction log was shown directly. In this study, provenance seemed most helpful when it reduced interpretation work rather than when it became another object of inspection.

Table 5: Representative functionality comparison of closely related systems. Type: C = commercial, P/S = paper/system. Y = explicitly supported; P = partially supported or unclear; – = not stated or not supported.

## 7 Related Works

Natural language interfaces to data (NLIDB), NL-to-SQL systems, and recent LLM agents lower the barrier to querying structured data [zhang2024natural,liu2024survey,yao2022react,openai_agents]. In BI settings, however, the central issue is not only query generation but also preserving semantic consistency across metrics, filter scope, aggregation grain, and iterative interaction with dashboards [bi_system]. Table [5](https://arxiv.org/html/2606.13731#S6.T5) therefore compares representative prior work along the capabilities most relevant to this setting: natural-language querying, dashboard support, synchronization between chat and dashboard state, dashboard operations, schema support, and logging.

Prior academic systems mainly address partial slices of this space, such as NL\-driven chart generation, chat\-oriented analytical assistance, or dashboard usability, rather than synchronized state management across conversational and dashboard interaction\[dibia2023lida,maddigan2023chat2vis,xie2024waitgpt,weng2025insightlens,dhanoa2025hey\]\.

Commercial BI copilots increasingly combine NLQ with dashboards and semantic layers\[powerbi\_copilot\_intro,amazonq\_quicksight\_ga\_2024,tableau\_einstein\_copilot\_release,looker\_conversational\_analytics\_ga\]\. However, public documentation still suggests only partial support for explicit synchronization, schema\-grounded interaction continuity, or comprehensive provenance logging\. TwinBI fills this gap by combining dashboard interaction, conversational querying, explicit synchronization, schema\-aware reasoning, and unified logging in one system\.

## 8 Conclusion

We presented TwinBI, an agentic digital-twin framework that unifies conversational interaction and direct dashboard manipulation through a shared analytical state. TwinBI combines semantic grounding, dashboard state reconstruction, and unified provenance tracking so that users and agents can operate over the same analytical context. Across a controlled A/B benchmark and a usability study, our results suggest that this design improves analytical reliability, interaction stability, and user-facing analytical support over dashboard interaction alone.

Future work includes testing on larger datasets and more diverse users, improving chart grounding and value extraction for complex analyses, extending TwinBI to transfer analytical state across dashboards, and exploring how TwinBI can better support agentic decision\-making workflows\.

## References
