Task-Routed Mixture-of-Experts with Cognitive Appraisal for Implicit Sentiment Analysis
Summary
This paper proposes a task-routed mixture-of-experts model with cognitive appraisal theory for implicit sentiment analysis, introducing auxiliary tasks to improve reasoning about sentiment from context and outperforming existing approaches.
# Task-Routed Mixture-of-Experts with Cognitive Appraisal for Implicit Sentiment Analysis
Source: [https://arxiv.org/html/2605.20916](https://arxiv.org/html/2605.20916)
Yaping Chai, Haoran Xie, Joe S. Qin

This work was supported by the Research Impact Fund by the Research Grants Council of Hong Kong (Project No. 130272); a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (R1015-23); the Faculty Research Grants (SDS24A8, SDS25A15, and SDS24A19), Interdisciplinary & Strategic Research Grant (ISRG252606), and the Direct Grants (DR25E8 and DR26F2) of Lingnan University, Hong Kong. *(Corresponding author: Haoran Xie.)*

Yaping Chai, Haoran Xie, and Joe S. Qin are with the Division of Artificial Intelligence, Lingnan University, Hong Kong (e-mail: yapingchai@ln.hk; hrxie@ln.edu.hk; joeqin@ln.edu.hk).
###### Abstract
Implicit sentiment analysis is challenging because sentiment toward an aspect is often inferred from events rather than expressed through explicit opinion words. Existing models typically learn from the final polarity label, which provides limited guidance for reasoning about sentiment from the context. Motivated by cognitive appraisal theory, we propose an appraisal-aware multi-task learning (MTL) framework for implicit sentiment analysis that supplements polarity prediction with two complementary auxiliary tasks: implicit sentiment detection and cognitive rationale generation. However, training several objectives with different targets on a single backbone shared across tasks limits flexibility and can lead to task interference. To reduce interference among these related but distinct objectives, we adopt a task-level mixture-of-experts model in which all tasks share a common set of experts, and task identity controls the sparse combination of these experts. Our method builds on an encoder-decoder architecture and replaces the feed-forward sublayers in a subset of encoder and decoder blocks with these sparse mixtures. We use a task-conditioned router to select sparse expert mixtures for each task, and a task-separated routing objective to encourage different tasks to learn distinct expert-selection patterns. Experimental results show that our model outperforms recently proposed approaches, with strong gains on the implicit sentiment subset. Our code is available at [https://github.com/yaping166/TRMoE-ISA](https://github.com/yaping166/TRMoE-ISA).
## I Introduction
Aspect-based sentiment analysis (ABSA) aims to identify the sentiment polarity expressed toward a specific aspect in a review and has become a central research direction for understanding fine-grained opinions\[[4](https://arxiv.org/html/2605.20916#bib.bib12)\]. However, much of the existing work assumes that sentiment evidence is stated explicitly in the text. Words such as “delicious”, “rude”, or “overpriced” often indicate the speaker’s evaluative intention. In contrast, the sentence “We waited forty minutes before anyone came to the table” does not contain an obvious negative adjective, yet it still expresses dissatisfaction with the service. Handling such cases, known as implicit sentiment analysis (ISA), is challenging because polarity cannot be inferred from explicit sentiment words alone\[[17](https://arxiv.org/html/2605.20916#bib.bib1)\].
Existing approaches have made progress by improving contextual representations\[[29](https://arxiv.org/html/2605.20916#bib.bib18)\] or aligning implicit expressions with explicit sentiment words\[[20](https://arxiv.org/html/2605.20916#bib.bib20)\]. Nevertheless, most learning objectives are still centered on a single polarity label. This guidance tells the model what sentiment is expressed, but does not provide information about the reasoning behind the sentiment judgment. In implicit cases, a reasoning chain is important for interpretation\[[11](https://arxiv.org/html/2605.20916#bib.bib22),[9](https://arxiv.org/html/2605.20916#bib.bib31)\]. For example, a consumer who receives the main course much later than everyone else at the table may appraise the event as blocking the goal of timely service (goal inconduciveness) and violating expected service norms (norm incompatibility)\[[32](https://arxiv.org/html/2605.20916#bib.bib25)\]. These appraisals explain how speakers evaluate events along core psychological dimensions, e.g., goals and expectations, before a negative attitude toward service emerges. These intermediate cognitive steps explain why attitudes toward service are likely to be negative, yet single-task polarity learning only provides the final polarity outcome (negative), rather than the intermediate steps that led to it, which limits the learning signal for implicit sentiment reasoning.
According to cognitive appraisal theory, affective responses arise from the cognitive evaluation of psychological factors, including goals, expectations, agency, and consequences\[[15](https://arxiv.org/html/2605.20916#bib.bib3),[24](https://arxiv.org/html/2605.20916#bib.bib2)\]\. This perspective suggests that aspect\-level sentiment is not only a polarity decision but also the outcome of affective processes\. Motivated by this view, we formulate ISA as an appraisal\-aware multi\-task learning problem\. In addition to polarity classification, we introduce cognitive appraisal reasoning, in which the model generates a brief explanation of why the speaker holds an attitude toward the target aspect\. The rationale provides intermediate affective information that links the event in the review to the final polarity\. We also introduce implicit sentiment detection, where the model predicts whether the sentiment evidence is explicit or implicit\. This task guides the model to look beyond lexical sentiment words and infer the affective implication of events\.
These auxiliary tasks are complementary to polarity classification\. However, they require different targets: polarity classification focuses on the final sentiment label, implicit sentiment detection focuses on an explicit\-implicit evidence judgment, and rationale generation requires explanatory sentence generation\. In multi\-task learning, it is common to use a single backbone across tasks with different formats and objectives, which can lead to task interference and negative transfer when parameters are updated jointly\[[27](https://arxiv.org/html/2605.20916#bib.bib4),[8](https://arxiv.org/html/2605.20916#bib.bib30)\]\. Sharing affective knowledge across related objectives while limiting interference among them remains challenging\.
Task-level mixture-of-experts (MoE) provides an efficient way to address this challenge\[[19](https://arxiv.org/html/2605.20916#bib.bib28),[1](https://arxiv.org/html/2605.20916#bib.bib29)\]. Prior task-level MoE studies\[[5](https://arxiv.org/html/2605.20916#bib.bib5),[31](https://arxiv.org/html/2605.20916#bib.bib6),[13](https://arxiv.org/html/2605.20916#bib.bib7)\] show that expert routing can help multi-task models support both cooperation and specialization by assigning different tasks to different expert mixtures while still allowing shared experts when tasks need overlapping knowledge. This motivates us to use a task-routed MoE architecture, where the router is conditioned on task identity, allowing different tasks to use task-specific expert mixtures rather than passing through the same network at every layer.
Inspired by the above motivations, we propose a unified framework for implicit sentiment analysis. It formulates each aspect-level instance into three natural-language tasks: polarity classification, implicit sentiment detection, and cognitive appraisal reasoning. Our method builds on a sequence-to-sequence backbone\[[6](https://arxiv.org/html/2605.20916#bib.bib16)\] with task-routed expert layers. Each task has a learnable representation, and each routed layer uses this representation to select a sparse mixture of feed-forward experts. We further introduce a task-separated routing objective. This objective reduces the similarity between the gate distributions of different tasks, encouraging them to form separable routing patterns across expert layers. The model therefore preserves shared pre-trained language knowledge while creating task-conditioned pathways for appraisal-aware implicit sentiment inference.
Our contributions are as follows:
- We adopt task-routed mixture-of-experts for appraisal-aware implicit sentiment analysis and use task identity to route samples, enabling related tasks to share knowledge while keeping their expert selection patterns different.
- We propose a task-separated routing objective that encourages different tasks to acquire separable expert-selection patterns, reducing task interference.
- Empirical evaluation on the benchmarks shows that our method performs strongly against recent methods, including on the implicit sentiment subsets, demonstrating the effectiveness of the proposed approach.
## II Related Work
### II-A Implicit Sentiment Analysis
Implicit sentiment analysis focuses on situations in which opinion words do not directly express sentiment toward an aspect. Prior work addresses the scarcity of implicit sentiment expressions by aligning representations or using external knowledge and synthetic data. For example, \[[17](https://arxiv.org/html/2605.20916#bib.bib1)\] introduces supervised contrastive pre-training for implicit sentiment, using contrastive learning, review reconstruction, and masked aspect prediction to align representations of explicit and implicit sentiment expressions. \[[28](https://arxiv.org/html/2605.20916#bib.bib17)\] constructs multi-aspect samples with aspect and polarity augmentation channels, and uses contrastive learning with an entropy-based filter to reduce noise from generated examples. Recent text data augmentation methods enhance model performance by leveraging the generative capabilities of large language models\[[3](https://arxiv.org/html/2605.20916#bib.bib27)\]. For example, \[[11](https://arxiv.org/html/2605.20916#bib.bib22)\] uses chain-of-thought prompting to induce the latent aspect, the underlying opinion, and the final polarity step by step. Other studies utilize the internal logic and syntactic dependencies that connect aspects to their implicit sentiments. For example, \[[29](https://arxiv.org/html/2605.20916#bib.bib18)\] proposes a relational graph attention network that uses an aspect-oriented dependency tree to reshape syntactic structure around the target aspect, helping the model connect aspects with their relevant opinion expressions. \[[30](https://arxiv.org/html/2605.20916#bib.bib19)\] studies the influence of confounding sentiment words and uses instrumental variables with stochastic perturbations to estimate a cleaner causal relation between a sentence and its sentiment. These methods show that ISA benefits from richer context modeling and implicit-explicit alignment.
However, most of them still optimize the model mainly toward the final polarity decision\. In contrast, our work treats implicitness detection and cognitive rationale generation as auxiliary tasks, so that auxiliary objectives complement the polarity classification with additional supervision dimensions beyond surface sentiment labels\.
### II-B Multi-task Learning for ABSA
Multi-task learning enables knowledge sharing across related tasks, allowing the model to leverage complementary information from different tasks\[[2](https://arxiv.org/html/2605.20916#bib.bib24)\]. In ABSA, \[[22](https://arxiv.org/html/2605.20916#bib.bib21)\] improves aspect-target sentiment classification through domain-specific language model fine-tuning followed by task-specific supervised training, showing that domain-aware auxiliary training can reduce the mismatch between general pre-training and target-domain sentiment prediction. \[[20](https://arxiv.org/html/2605.20916#bib.bib20)\] generates explicit sentiment augmentations for implicit cases and integrates them as additional clues for polarity prediction. \[[14](https://arxiv.org/html/2605.20916#bib.bib23)\] studies multi-task implicit sentiment analysis with large language models, constructing auxiliary sentiment-element tasks and using automatic weight learning to handle data and task uncertainty. However, training several objectives with different targets and sharing a single backbone across tasks in MTL limits flexibility and can lead to task interference\[[27](https://arxiv.org/html/2605.20916#bib.bib4)\]. Task-level mixture-of-experts mitigates this by maintaining a single set of experts and activating different experts for each task. For example, \[[31](https://arxiv.org/html/2605.20916#bib.bib6)\] uses task representations to route tasks through different expert combinations and analyzes the learned cross-task skills. \[[5](https://arxiv.org/html/2605.20916#bib.bib5)\] uses MoE layers and a mutual-information objective to encourage sparse dependencies between tasks and experts, balancing cooperation and specialization in multi-task learning. Beyond task-conditioned sparse experts, we propose a task-separated routing objective that allows related tasks to share knowledge while enabling different tasks to select distinct experts, reducing task interference.
## III Method
Figure 1: Overview of our framework. A. Multi-task Data Construction: each aspect-level instance is formulated into three text-to-text tasks. B. Rationale Generation: prompt template for the cognitive appraisal rationale generation task, along with an example of the model output. C. Task-routed MoE-FFN Block: layer normalization is applied before the experts; only the top-k expert weights are kept; the outputs of those experts are then combined using the routing weights.

Our framework has three components. First, we convert each aspect-level instance into a unified text-to-text multi-task problem containing polarity classification, implicit sentiment detection, and cognitive appraisal reasoning. Second, we replace the feed-forward modules in selected encoder and decoder layers with a set of task-conditioned experts, allowing the model to share general linguistic knowledge while assigning different expert mixtures based on the task identifier. Third, we use a task-separated routing objective to encourage the learned routing patterns to be separable across tasks. Fig. [1](https://arxiv.org/html/2605.20916#S3.F1) shows the overview of our method.
### III-A Problem Formulation
Each annotated instance is a tuple $(x,a,y,e,r)$, where $x\in\mathcal{X}$ is a review sentence; $a$ is an aspect term in $x$; $y\in\mathcal{Y}=\{\text{positive},\text{negative},\text{neutral}\}$ is the gold polarity toward $a$. The dataset provides an implicitness indicator $e\in\{0,1\}$, with $e{=}1$ for implicit sentiment and $e{=}0$ for explicit sentiment, together with a short rationale $r$ that explains why $y$ is assigned to $(x,a)$. Instead of only predicting the polarity $y$, we formulate each aspect-level instance as three text-to-text tasks:

$$\mathcal{T}=\{\textsc{pol},\textsc{imp},\textsc{rea}\}$$

covering polarity classification (pol), implicitness detection (imp), and cognitive appraisal reasoning (rea). Training on all three tasks yields supervision at different targets: the sentiment decision, whether the sentiment is expressed via explicit opinion cues or inferred from context, and a textual description of the underlying appraisal.
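The conversion of one annotated tuple into the three text-to-text tasks can be sketched as follows. This is a minimal illustration only: the `Instance` class, the `TEMPLATES` wording, the `to_tasks` helper, and the example sentence and rationale are all hypothetical, since the exact instruction prompts are not given in the text.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    sentence: str   # x: review sentence
    aspect: str     # a: aspect term in x
    polarity: str   # y: positive / negative / neutral
    implicit: int   # e: 1 = implicit sentiment, 0 = explicit sentiment
    rationale: str  # r: short appraisal rationale


# Hypothetical instruction templates (assumed wording, not the paper's prompts).
TEMPLATES = {
    "pol": "What is the sentiment polarity toward '{a}'? Review: {x}",
    "imp": "Is the sentiment toward '{a}' explicit or implicit? Review: {x}",
    "rea": "Explain the speaker's attitude toward '{a}'. Review: {x}",
}


def to_tasks(inst):
    """Turn one (x, a, y, e, r) tuple into three (task, source, target) examples."""
    targets = {
        "pol": inst.polarity,
        "imp": "implicit" if inst.implicit == 1 else "explicit",
        "rea": inst.rationale,
    }
    return [
        {
            "task": task,
            "source": TEMPLATES[task].format(a=inst.aspect, x=inst.sentence),
            "target": targets[task],
        }
        for task in ("pol", "imp", "rea")
    ]


# Illustrative instance (made up for this sketch).
example = Instance(
    sentence="The waiter took forty minutes to come to our table.",
    aspect="waiter",
    polarity="negative",
    implicit=1,
    rationale="The long wait blocks the goal of timely service.",
)
tasks = to_tasks(example)
for item in tasks:
    print(item["task"], "->", item["target"])
```

Keeping the task name alongside each example is a design choice of this sketch: it lets a task-level router later look up the task identity without parsing the prompt.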
### III-B Cognitive Appraisal Rationale Generation
In order to improve the reasoning capability of ISA language models, we use a large language model (LLM) to generate a cognitive appraisal rationale that explicitly links the sentence, the target term, and the sentiment polarity. The LLM generates an appraisal-based rationale $r$ for cognitive evaluations, indicating the relationship between $(x,a)$ and $y$, creating an augmented element set $(x,a,y,r)$ that can be used to train models to perform ISA with appraisal reasoning. Fig. [1](https://arxiv.org/html/2605.20916#S3.F1)(B) shows the prompt template and an example output. These rationales enable the model to integrate appraisal-based information for implicit sentiment analysis.
### III-C Task-Routed Mixture-of-Experts
#### Expert Set
We adopt a pretrained encoder-decoder language model\[[6](https://arxiv.org/html/2605.20916#bib.bib16)\] containing $L$ transformer blocks. Each block contains an attention sublayer and a feed-forward sublayer $\mathrm{FFN}$. Following the previous sparse MoE design\[[10](https://arxiv.org/html/2605.20916#bib.bib9)\], we only modify layers in $\mathcal{M}\subset\{1,\dots,L\}$. For each $\ell\in\mathcal{M}$, the dense $\mathrm{FFN}_{\ell}$ in the corresponding encoder and decoder blocks is replaced with $N$ parallel experts $E_{\ell}^{(1)},\dots,E_{\ell}^{(N)}$, where $N$ represents the number of experts in that layer. Each $E_{\ell}^{(i)}$ keeps the same architecture as the original $\mathrm{FFN}_{\ell}$.
#### Task\-conditioned Router
In contrast to the token-level routers of large language MoEs\[[10](https://arxiv.org/html/2605.20916#bib.bib9),[25](https://arxiv.org/html/2605.20916#bib.bib8),[16](https://arxiv.org/html/2605.20916#bib.bib35)\], our router operates at the task level, where the task identity $t$ determines the expert mixture. The three tasks, pol, imp, and rea, are encoded as task IDs, and each task ID is associated with a task embedding $\boldsymbol{\tau}_{t}\in\mathbb{R}^{d_{\tau}}$. For each routed layer $\ell\in\mathcal{M}$, we use a separate gating network $g_{\ell}$ to map this task embedding to $N$ expert logits:

$$g_{\ell}(\boldsymbol{\tau}_{t})=\mathrm{MLP}_{\ell}\!\bigl(\mathrm{LN}(\boldsymbol{\tau}_{t})\bigr)\in\mathbb{R}^{N} \tag{1}$$

where $\mathrm{MLP}_{\ell}$ is a two-layer feed-forward network with a $\tanh$ activation. The expert gate is then obtained by normalizing the logits:

$$\boldsymbol{\pi}_{\ell,t}=\mathrm{softmax}\!\bigl(g_{\ell}(\boldsymbol{\tau}_{t})\bigr) \tag{2}$$

Thus $\boldsymbol{\pi}_{\ell,t}$ is the routing weight assigned to the $N$ experts for task $t$ at layer $\ell$.
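A minimal pure-Python sketch of the router in Eqs. (1) and (2) for a single routed layer is given below. It is not the authors' implementation: the dimensions `D_TAU` and `D_HID`, the random initialisation, and the helper names are assumptions, and the layer normalisation here has no learnable scale or shift.

```python
import math
import random


def layer_norm(v, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no learnable affine)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def linear(W, b, v):
    """Dense layer: W is a list of rows, b a bias vector."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_linear(rng, n_out, n_in):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def router_gate(tau, params):
    """Eq. (1): logits = MLP(LN(tau)); Eq. (2): gate = softmax(logits)."""
    W1, b1, W2, b2 = params
    hidden = [math.tanh(x) for x in linear(W1, b1, layer_norm(tau))]
    logits = linear(W2, b2, hidden)
    return softmax(logits)


rng = random.Random(0)
D_TAU, D_HID, N_EXPERTS = 8, 16, 5  # assumed sizes; N_EXPERTS = 5 follows the paper's setup
task_emb = {t: [rng.gauss(0, 1) for _ in range(D_TAU)] for t in ("pol", "imp", "rea")}
params = (*init_linear(rng, D_HID, D_TAU), *init_linear(rng, N_EXPERTS, D_HID))

# One dense gate per task; every input of the same task shares this gate.
gates = {t: router_gate(task_emb[t], params) for t in task_emb}
```

Because the gate depends only on the task embedding, it can be computed once per task and layer rather than once per token, which is the practical difference from token-level routing.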
#### Sparse Top-$k$ Selection
We use top-$k$ gating in sparse MoE layers. For task $t$ at layer $\ell$, we only keep the $k$ experts with the largest probabilities in $\boldsymbol{\pi}_{\ell,t}$, and $\tilde{\boldsymbol{\pi}}_{\ell,t}$ denotes the resulting sparse gate after re-normalizing the retained weights. Let $\mathbf{h}$ represent the hidden state after the preceding sublayer, $\mathrm{LN}(\cdot)$ layer normalisation, $\{E_{\ell}^{(j)}\}_{j=1}^{N}$ the expert feed-forward networks in this layer, and $\mathrm{Drop}(\cdot)$ dropout. We compute $\mathrm{LN}(\mathbf{h})$, mix the expert outputs with $\tilde{\boldsymbol{\pi}}_{\ell,t}$, apply $\mathrm{Drop}(\cdot)$ to the mixture, and add the result to $\mathbf{h}$:

$$\mathrm{FFN}^{\mathrm{MoE}}_{\ell}(\mathbf{h},t)\;=\;\mathbf{h}\;+\;\mathrm{Drop}\!\left(\sum_{j=1}^{N}\tilde{\pi}_{\ell,t}^{(j)}\;E_{\ell}^{(j)}\!\bigl(\mathrm{LN}(\mathbf{h})\bigr)\right) \tag{3}$$
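The top-$k$ selection and the residual mixture of Eq. (3) can be sketched in a few lines. This is an illustrative toy, not the paper's code: the experts are stand-in scaling functions rather than feed-forward networks, dropout is omitted (as at inference time), and all names are hypothetical.

```python
import math


def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def top_k_gate(pi, k):
    """Keep the k largest gate weights and re-normalise them to sum to 1."""
    keep = sorted(range(len(pi)), key=lambda j: pi[j], reverse=True)[:k]
    total = sum(pi[j] for j in keep)
    return [pi[j] / total if j in keep else 0.0 for j in range(len(pi))]


def moe_ffn(h, sparse_gate, experts):
    """Eq. (3) without dropout: h + sum_j gate_j * E_j(LN(h))."""
    normed = layer_norm(h)
    mix = [0.0] * len(h)
    for weight, expert in zip(sparse_gate, experts):
        if weight == 0.0:
            continue  # experts outside the top-k are never evaluated
        out = expert(normed)
        mix = [m + weight * o for m, o in zip(mix, out)]
    return [hi + mi for hi, mi in zip(h, mix)]


# Toy experts: expert j multiplies its input by (j + 1).
experts = [lambda v, s=j + 1: [s * x for x in v] for j in range(5)]

dense = [0.10, 0.40, 0.05, 0.30, 0.15]  # example dense gate for one task
sparse = top_k_gate(dense, k=2)         # only experts 1 and 3 stay active
out = moe_ffn([1.0, -1.0], sparse, experts)
```

With this dense gate, the retained weights 0.40 and 0.30 are rescaled to 4/7 and 3/7, so only two of the five experts contribute to the output.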
### III-D Learning Objectives
The objectives below are defined on the encoder-decoder forward pass of $p_{\theta}$, where $\theta$ denotes all trainable parameters and $p_{\theta}$ is the distribution over target tokens defined by the model with parameters $\theta$. At layers $\ell\in\mathcal{M}$, the dense feed-forward sublayer is replaced by $\mathrm{FFN}^{\mathrm{MoE}}_{\ell}$ (Eq. ([3](https://arxiv.org/html/2605.20916#S3.E3))).
#### Multi\-task generation objective
All three tasks are trained in a unified generative manner. For each task $t\in\mathcal{T}=\{\textsc{pol},\textsc{imp},\textsc{rea}\}$, given the review-aspect pair under the corresponding task instruction, the model generates the task-specific target sequence $y_{t}$. The multi-task generation objective is the standard sequence-to-sequence negative log-likelihood:

$$\mathcal{L}_{\mathrm{gen}}(\theta)\;=\;-\sum_{t\in\mathcal{T}}\mathbb{E}_{(x,a,y_{t})\in\mathcal{D}_{t}}\sum_{s=1}^{|y_{t}|}\log p_{\theta}\!\left(y_{t,s}\mid y_{t,<s},x,a,t\right) \tag{4}$$

where $y_{t,s}$ is the $s$-th token of the target sequence for task $t$, $\mathcal{D}_{t}$ is the training set for task $t$, and $\theta$ denotes the trainable parameters.
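To make Eq. (4) concrete, the sketch below computes the loss from the probabilities a model assigns to the gold target tokens. The probabilities are made-up numbers, the expectation is approximated by a per-task mean over the listed sequences, and the function names are hypothetical.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of one target sequence, summed over its tokens."""
    return -sum(math.log(p) for p in token_probs)


def generation_loss(batches):
    """Sum over tasks of the mean per-sequence NLL (a finite-sample form of Eq. 4).

    batches maps a task name to a list of sequences; each sequence is the list of
    probabilities the model assigned to its gold tokens.
    """
    total = 0.0
    for sequences in batches.values():
        total += sum(sequence_nll(s) for s in sequences) / len(sequences)
    return total


# Toy probabilities: label tasks have one-token targets, the rationale has several.
toy = {
    "pol": [[0.9], [0.6]],
    "imp": [[0.8]],
    "rea": [[0.5, 0.5, 0.25]],
}
loss = generation_loss(toy)
print(round(loss, 4))
```

In this toy case the rationale task dominates the total because its target is longer, which is one reason a multi-task setup may need care when balancing short-label and long-text objectives.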
#### Task\-separated routing objective
To reduce task interference among related yet distinct objectives, we propose a routing objective that pushes the gate distributions of different tasks apart. Each routed layer $\ell\in\mathcal{M}$ produces one dense gate vector $\boldsymbol{\pi}_{\ell,t}$ for each task $t\in\mathcal{T}$. Let $\mathcal{P}=\{(t,t^{\prime}):t,t^{\prime}\in\mathcal{T},\,t\neq t^{\prime}\}$ denote the ordered task pairs. We minimize the average cosine similarity between their gates:

$$\mathcal{L}_{\mathrm{sep}}(\theta)\;=\;\frac{1}{|\mathcal{M}|}\sum_{\ell\in\mathcal{M}}\,\frac{1}{|\mathcal{P}|}\sum_{(t,t^{\prime})\in\mathcal{P}}\cos\!\left(\boldsymbol{\pi}_{\ell,t},\,\boldsymbol{\pi}_{\ell,t^{\prime}}\right) \tag{5}$$

where $\cos(\mathbf{u},\mathbf{v})=\mathbf{u}^{\top}\mathbf{v}/(\lVert\mathbf{u}\rVert_{2}\lVert\mathbf{v}\rVert_{2})$ is the cosine similarity. The dense gate $\boldsymbol{\pi}_{\ell,t}$ is the softmax-normalised output of the layer-specific router in Eq. ([2](https://arxiv.org/html/2605.20916#S3.E2)), with logits $g_{\ell}(\boldsymbol{\tau}_{t})$ given in Eq. ([1](https://arxiv.org/html/2605.20916#S3.E1)).
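Eq. (5) translates directly into code. The sketch below takes the dense gates as plain lists and averages the cosine similarity over ordered task pairs and routed layers; the function names and the two example gate sets are illustrative assumptions.

```python
import math
from itertools import permutations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def separation_loss(gates_per_layer):
    """Eq. (5): mean over routed layers of the mean pairwise cosine similarity.

    gates_per_layer is a list with one dict per routed layer, mapping each
    task name to its dense gate vector.
    """
    layer_means = []
    for gates in gates_per_layer:
        pairs = list(permutations(gates, 2))  # ordered pairs (t, t') with t != t'
        sims = [cosine(gates[t], gates[u]) for t, u in pairs]
        layer_means.append(sum(sims) / len(pairs))
    return sum(layer_means) / len(layer_means)


# All tasks route identically: the loss takes its maximum value of 1.
shared = [{"pol": [0.2] * 5, "imp": [0.2] * 5, "rea": [0.2] * 5}]

# Idealised limit in which each task uses a different single expert: the loss is 0.
disjoint = [{"pol": [1, 0, 0], "imp": [0, 1, 0], "rea": [0, 0, 1]}]

print(separation_loss(shared), separation_loss(disjoint))
```

Since softmax gates have strictly positive entries, each cosine term lies in $(0, 1]$, so the one-hot case above is a limit the loss can approach but not reach exactly. With three tasks there are six ordered pairs, and each unordered pair is counted twice, which leaves the average unchanged.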
#### Overall objective
The final training objective combines generation loss and task-separated routing loss:

$$\mathcal{L}(\theta)\;=\;\mathcal{L}_{\mathrm{gen}}(\theta)\;+\;\lambda_{\mathrm{sep}}\,\mathcal{L}_{\mathrm{sep}}(\theta), \tag{6}$$

where $\lambda_{\mathrm{sep}}\geq 0$ controls the strength of routing separation. When $\lambda_{\mathrm{sep}}=0$, the router is trained only through the generation objective.
## IV Experiment
### IV-A Datasets and Metrics
We conduct experiments on the SemEval-2014 Restaurant and Laptop datasets\[[21](https://arxiv.org/html/2605.20916#bib.bib13)\]. Following\[[17](https://arxiv.org/html/2605.20916#bib.bib1)\], we further divide each benchmark into explicit and implicit sentiment subsets according to whether opinion words directly express the sentiment polarity toward the target aspect. We reserve 10% of the original training set as the validation set and train the model on the remaining training instances. We evaluate both overall and implicit-sentiment performance. Specifically, we report accuracy and macro-F1 score on the full test set, and accuracy on the implicit sentiment subset.
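The three reported quantities can be computed as in the following sketch. The gold labels, predictions, and implicitness flags are toy data, the function names are hypothetical, and assigning an F1 of 0 to a class with no gold or predicted instances is a convention assumed here rather than one stated in the paper.

```python
LABELS = ("positive", "negative", "neutral")


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def implicit_accuracy(gold, pred, implicit_flags):
    """Accuracy restricted to instances flagged as implicit (e = 1)."""
    kept = [(g, p) for g, p, e in zip(gold, pred, implicit_flags) if e == 1]
    return sum(g == p for g, p in kept) / len(kept)


# Toy data for illustration only.
gold = ["positive", "negative", "neutral", "negative", "positive", "negative"]
pred = ["positive", "negative", "positive", "negative", "positive", "neutral"]
flags = [0, 1, 1, 1, 0, 1]

all_acc = accuracy(gold, pred)
all_f1 = macro_f1(gold, pred)
isa_acc = implicit_accuracy(gold, pred, flags)
print(all_acc, all_f1, isa_acc)
```

On this toy data the implicit-subset accuracy is lower than the overall accuracy, which mirrors why the subset is reported separately: a model can look strong overall while still failing on implicit cases.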
### IV-B Baselines
We compare our method with three groups of baselines\.
- Conventional ABSA baselines: these methods typically treat aspect-based sentiment analysis as a supervised classification task. The models we use for comparison include BERT+SPC\[[7](https://arxiv.org/html/2605.20916#bib.bib14)\], BERT+ADA\[[22](https://arxiv.org/html/2605.20916#bib.bib21)\], RGAT\[[29](https://arxiv.org/html/2605.20916#bib.bib18)\], BERTAsp+CEPT\[[17](https://arxiv.org/html/2605.20916#bib.bib1)\], BERTAsp+SCAPT\[[17](https://arxiv.org/html/2605.20916#bib.bib1)\], C3DA\[[28](https://arxiv.org/html/2605.20916#bib.bib17)\], and ISAIV\[[30](https://arxiv.org/html/2605.20916#bib.bib19)\].
- Inference-only baselines: these methods rely directly on the zero-shot ability of LLMs, prompting the LLM to classify each test instance without task-specific training on the ABSA datasets. The baseline models include GPT-5.4-mini\[[26](https://arxiv.org/html/2605.20916#bib.bib32)\], DeepSeek-V3.2\[[18](https://arxiv.org/html/2605.20916#bib.bib33)\], and Llama-3.3-70B-Instruct\[[12](https://arxiv.org/html/2605.20916#bib.bib34)\].
- Instruction-based fine-tuning baselines: these methods formulate the task as a generation problem, in which the model receives an instruction prompt and is fine-tuned to generate the target sentiment output. For comparison, we use ABSA-ESA\[[20](https://arxiv.org/html/2605.20916#bib.bib20)\], Flan-T5\[[6](https://arxiv.org/html/2605.20916#bib.bib16)\], InstructABSA\[[23](https://arxiv.org/html/2605.20916#bib.bib36)\], THOR-prompt\[[11](https://arxiv.org/html/2605.20916#bib.bib22)\], THOR\[[11](https://arxiv.org/html/2605.20916#bib.bib22)\], and MT-ISA\[[14](https://arxiv.org/html/2605.20916#bib.bib23)\] as baselines.
### IV-C Models and Hyperparameters
We use Flan-T5-large as the backbone model and GPT-5.4-mini for cognitive appraisal rationale generation. We replace the encoder and decoder FFN sublayers in layers 8, 10, 12, 14, 16, 18, 20, and 22 with task-routed MoE layers. Each routed layer contains 5 experts and activates the top 2 experts for each task. The routing separation weight $\lambda_{\mathrm{sep}}$ is set to 0.4, the value that gives the best validation performance, as discussed in Section [IV-F](https://arxiv.org/html/2605.20916#S4.SS6.SSS0.Px1). We train with a batch size of 4 and accumulate gradients over 4 steps, with a learning rate of $3\times 10^{-5}$. All reported results are the average of three runs with different random seeds.
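For reference, the stated settings can be collected in one configuration object. The class and field names below are hypothetical and do not come from the released code; only the values are taken from the text, and the effective batch size assumes training on a single device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as stated in the text; field names are illustrative."""
    backbone: str = "Flan-T5-large"
    rationale_llm: str = "GPT-5.4-mini"
    routed_layers: tuple = (8, 10, 12, 14, 16, 18, 20, 22)
    num_experts: int = 5
    top_k: int = 2
    lambda_sep: float = 0.4
    batch_size: int = 4
    grad_accum_steps: int = 4
    learning_rate: float = 3e-5
    num_seeds: int = 3

    @property
    def effective_batch_size(self):
        # Examples contributing to each optimizer step (single-device assumption).
        return self.batch_size * self.grad_accum_steps


cfg = TrainConfig()
print(cfg.effective_batch_size, len(cfg.routed_layers))
```

Under the single-device assumption, each optimizer step aggregates 4 x 4 = 16 examples, and 8 layers are routed.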
### IV-D Main Results
TABLE I: Benchmark evaluation of the proposed method against representative baselines on Rest14 and Lap14. The best score in each column is shown in bold. † indicates results are referred from\[[14](https://arxiv.org/html/2605.20916#bib.bib23)\].

| Model | Backbone | Rest14 All$_A$ | Rest14 All$_F$ | Rest14 ISA$_A$ | Lap14 All$_A$ | Lap14 All$_F$ | Lap14 ISA$_A$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Conventional ABSA Baselines | | | | | | | |
| BERT+SPC†\[[7](https://arxiv.org/html/2605.20916#bib.bib14)\] | BERT-base (110M) | 83.57 | 77.16 | 65.54 | 78.22 | 73.45 | 69.54 |
| BERT+ADA†\[[22](https://arxiv.org/html/2605.20916#bib.bib21)\] | BERT-base (110M) | 87.14 | 80.05 | 65.92 | 78.96 | 74.18 | 70.11 |
| RGAT†\[[29](https://arxiv.org/html/2605.20916#bib.bib18)\] | BERT-base (110M) | 86.6 | 81.35 | 67.79 | 78.21 | 74.07 | 72.99 |
| BERTAsp+CEPT†\[[17](https://arxiv.org/html/2605.20916#bib.bib1)\] | BERT-base (110M) | 87.5 | 82.07 | 67.79 | 81.66 | 78.38 | 75.86 |
| BERTAsp+SCAPT†\[[17](https://arxiv.org/html/2605.20916#bib.bib1)\] | BERT-base (110M) | 89.11 | 83.79 | 72.28 | 82.76 | 79.15 | 77.59 |
| C3DA†\[[28](https://arxiv.org/html/2605.20916#bib.bib17)\] | BERT-base (110M) | 86.93 | 81.23 | 65.54 | 80.61 | 77.11 | 73.57 |
| ISAIV†\[[30](https://arxiv.org/html/2605.20916#bib.bib19)\] | BERT-base (110M) | 87.05 | 81.4 | - | 80.41 | 77.25 | - |
| ABSA-ESA†\[[20](https://arxiv.org/html/2605.20916#bib.bib20)\] | T5-base (220M) | 88.29 | 81.74 | 70.78 | 82.44 | 79.34 | 80 |
| Inference-only Baselines | | | | | | | |
| GPT-5.4-mini | | 88.84 | 81.92 | 71.76 | 82.39 | 78.28 | 73.29 |
| DeepSeek-V3.2 | | 87.14 | 76.14 | 65.54 | 79.62 | 72.78 | 60.57 |
| Llama-3.3-70b-instruct | | 86.34 | 74.91 | 63.3 | 80.56 | 73.45 | 59.43 |
| Instruction-based Fine-tuning | | | | | | | |
| Flan-T5\[[6](https://arxiv.org/html/2605.20916#bib.bib16)\] | Flan-T5-base (250M) | 86.43 | 77.45 | 63.3 | 80.41 | 75.2 | 71.43 |
| InstructABSA\[[23](https://arxiv.org/html/2605.20916#bib.bib36)\] | Flan-T5-base (250M) | 85.62 | 75.52 | 62.17 | 81.82 | 77.72 | 76.57 |
| THOR-prompt\[[11](https://arxiv.org/html/2605.20916#bib.bib22)\] | Flan-T5-base (250M) | 86.7 | 79.49 | 65.92 | 81.19 | 77.42 | 74.28 |
| THOR\[[11](https://arxiv.org/html/2605.20916#bib.bib22)\] | Flan-T5-base (250M) | 87.05 | 80.09 | 66.67 | 81.66 | 77.51 | 74.29 |
| MT-ISA†\[[14](https://arxiv.org/html/2605.20916#bib.bib23)\] | Flan-T5-base (250M) | 88.21 | 82.45 | 70.41 | 82.91 | 79.86 | 80.57 |
| Ours | Flan-T5-base (250M) | 88.23 | 80.36 | 69.16 | 81.09 | 75.88 | 72.43 |
| Flan-T5\[[6](https://arxiv.org/html/2605.20916#bib.bib16)\] | Flan-T5-large (780M) | 88.75 | 81 | 71.91 | 84.01 | 79.58 | 78.29 |
| InstructABSA\[[23](https://arxiv.org/html/2605.20916#bib.bib36)\] | Flan-T5-large (780M) | 87.86 | 80.08 | 68.54 | 83.39 | 79.27 | 78.86 |
| THOR-prompt\[[11](https://arxiv.org/html/2605.20916#bib.bib22)\] | Flan-T5-large (780M) | 88.13 | 82.03 | 72.29 | 83.54 | 80.3 | 80 |
| THOR\[[11](https://arxiv.org/html/2605.20916#bib.bib22)\] | Flan-T5-large (780M) | 88.21 | 80.85 | 70.41 | 83.7 | 81.01 | 80.57 |
| Ours | Flan-T5-large (780M) | **90.21** | **84.74** | **75.9** | **85** | **81.63** | **82.67** |

#### Analysis of conventional ABSA baselines
As shown in Table [I](https://arxiv.org/html/2605.20916#S4.T1), conventional ABSA baselines are not competitive in overall performance, especially on the implicit sentiment subset. These methods mainly train the model to predict the polarity label, relying on explicit opinion words when present. However, they do not guide the model to detect implicit evidence or explain why an event supports a sentiment. Instruction-based fine-tuning improves over conventional ABSA baselines across both domains, with larger performance gains on the implicit subset, which is a particular challenge for ISA.
#### Analysis of inference\-only baselines
The inference-only baselines show that LLMs can leverage their pretrained knowledge to interpret sentiment reviews without task-specific training. However, zero-shot reasoning remains limited, and under the same prompting settings, stronger LLMs tend to score higher. For example, the GPT baseline ranks above the Llama baseline across all evaluation metrics and datasets. Although LLMs have strong general reasoning ability, instruction-based fine-tuning can still improve over inference-only LLMs: our method with the Flan-T5-large backbone exceeds all three inference-only baselines on every metric in Table [I](https://arxiv.org/html/2605.20916#S4.T1). This improvement suggests that aspect-specific supervision aligns the model with the benchmark and lets it learn task-specific inference skills.
#### Analysis of instruction\-based fine\-tuning baselines
In the instruction-based fine-tuning setup, smaller instruction-tuned models benefit from the generative formulation but appear to lack the capacity to exploit the auxiliary supervision introduced in our framework. Increasing the backbone to Flan-T5-large further improves this group’s performance. With the Flan-T5-large backbone, our method achieves the best results, indicating that the proposed multi-task learning framework becomes more effective when the backbone has sufficient representation capacity. We attribute the improvement over prior instruction-tuned methods to training polarity prediction together with implicitness detection and cognitive appraisal reasoning, which allows the model to learn not only the lexical polarity cues but also the latent evaluation process that supports the sentiment. The routing mechanism then allows these objectives to share general language knowledge while preserving task-specific expert pathways.
### IV-E Ablation Study
We perform ablation experiments to verify the impact of multi-task learning and the MoE routing architecture. As shown in Table [II](https://arxiv.org/html/2605.20916#S4.T2), Ours (w/o MTL) keeps the MoE architecture and trains only the polarity prediction task, and Ours (w/o MoE) removes the MoE layers and performs multi-task learning with the standard Flan-T5-large architecture. Removing multi-task learning (w/o MTL) degrades performance on both datasets. The larger decrease on ISA$_A$ indicates that implicit sentiment prediction benefits from auxiliary supervision of implicitness detection and cognitive appraisal rationale generation. Without these auxiliary tasks, the model mainly learns the direct mapping from the review and aspect to the polarity label, which weakens its ability to capture latent sentiment evidence. Removing the MoE architecture (w/o MoE) also leads to clear performance degradation, especially on Rest14, where ISA$_A$ decreases from 75.9% to 71.16%. This performance drop shows that simply applying multi-task learning with a shared Flan-T5-large backbone is insufficient to fully exploit different supervision signals. The task-routed MoE layers enable different tasks to share general linguistic knowledge while maintaining task-specific expert pathways, thereby improving both overall sentiment classification and implicit sentiment inference.
TABLE II:Ablation results on Rest14 and Lap14, isolating the effects of multi\-task learning and MoE architecture\. The best score in each column is shown in bold\.
### IV\-F Further Analysis
#### Effect of $\lambda_{\mathrm{sep}}$
Figure 2: Effect of the hyperparameter $\lambda_{\mathrm{sep}}$ on two benchmarks, where performance is measured by F1 score\. Both datasets achieve their best performance at $\lambda_{\mathrm{sep}}=0.4$, and all $\lambda$ values remain above the THOR baseline\. We mark the highest value with a $\star$\.

We study how the hyperparameter $\lambda_{\mathrm{sep}}$ affects experimental outcomes\. As shown in Fig\. [2](https://arxiv.org/html/2605.20916#S4.F2), our method outperforms THOR \[[11](https://arxiv.org/html/2605.20916#bib.bib22)\] in all $\lambda$ settings, indicating that the gain does not hinge on a finely tuned separation weight\. Both benchmarks achieve their best performance at an intermediate value of $\lambda_{\mathrm{sep}}=0.4$, indicating that a balanced routing separation weight improves model performance\. If $\lambda_{\mathrm{sep}}$ is too small, tasks still overshare experts, which weakens task\-specific signals; if it is too large, routing becomes overly constrained, which weakens useful cross\-task transfer\.
#### Routing entropy across tasks
To study whether different tasks use MoE routing in different ways, we analyze routing entropy for implicitness detection \(imp\), polarity classification \(pol\), and cognitive appraisal rationale generation \(rea\), and compare these entropies under different routing separation weights $\lambda$\. Routing entropy reflects how the model activates experts for each task: higher values indicate broader mixing across experts, while lower values indicate more concentrated, sparse expert usage\.
Figure 3: Routing entropy $H(\cdot)$ for the three tasks on Rest14 and Lap14 under different $\lambda$\. Each point is the mean over routed MoE layers\.

As shown in Fig\. [3](https://arxiv.org/html/2605.20916#S4.F3), as $\lambda$ increases, the rea curve decreases, indicating that a larger routing separation weight concentrates routing for this task on fewer experts and thereby lowers its routing entropy\. Additionally, across all $\lambda$ settings and in both domains, we observe a consistent pattern: $H(\textsc{imp}) \approx H(\textsc{pol}) \ll H(\textsc{rea})$, where $H(\cdot)$ denotes routing entropy\. This pattern indicates that easier targets, such as classification and detection, can be handled with a small subset of experts, whereas rationale generation demands richer expert mixing\.

Figure 4: An example of the routing probabilities heatmap at the MoE decoder’s 18th layer, where each row represents one of three tasks, each column represents an expert, and each cell represents the probability that the task assigns to that expert at this layer\.
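The entropy comparison above can be illustrated with a minimal sketch. It assumes that routing entropy is the Shannon entropy of a task's routing distribution over experts, averaged over routed layers as stated in the Figure 3 caption; whether the entropy is taken before or after sparse expert selection is not specified here, so that choice is an assumption. The function names and all probability values below are hypothetical and are not taken from the paper's experiments.

```python
import math

def routing_entropy(probs):
    """Shannon entropy (in nats) of one routing distribution over experts."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("routing probabilities must sum to a positive value")
    entropy = 0.0
    for p in probs:
        q = p / total  # renormalize defensively
        if q > 0:      # treat 0 * log(0) as 0
            entropy -= q * math.log(q)
    return entropy

def mean_layer_entropy(layer_probs):
    """Average the per-layer entropy over all routed MoE layers."""
    return sum(routing_entropy(p) for p in layer_probs) / len(layer_probs)

# Hypothetical routing distributions over 5 experts at two routed layers.
routing = {
    "imp": [[0.90, 0.10, 0.0, 0.0, 0.0], [0.85, 0.15, 0.0, 0.0, 0.0]],
    "pol": [[0.0, 0.88, 0.12, 0.0, 0.0], [0.0, 0.90, 0.10, 0.0, 0.0]],
    "rea": [[0.0, 0.0, 0.55, 0.45, 0.0], [0.0, 0.0, 0.50, 0.50, 0.0]],
}
entropies = {task: mean_layer_entropy(layers) for task, layers in routing.items()}

for task, value in entropies.items():
    print(f"H({task}) = {value:.3f}")
```

With these illustrative numbers, the two classification\-style tasks concentrate most of their mass on one expert and therefore receive a lower entropy than the rationale task, which splits its mass almost evenly between two experts\.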
#### Specialization of expert assignments
We measure expert specialization by selecting the expert with the highest probability \(Top\-1\) among the five experts in each routed layer, reflecting the dominance of the most active expert in each context\. As shown in Fig\. [4](https://arxiv.org/html/2605.20916#S4.F4), for simpler detection and classification tasks, the Top\-1 values across all MoE layers are notably high, indicating that a single expert dominates the activation\. For the complex sequence generation task rea, the Top\-1 is lower and can decrease further in certain layers, reflecting more mixed expert usage\. This pattern aligns with the rea objective, which involves generating rationales that benefit from combining multiple experts rather than relying primarily on a single expert\. The same phenomenon also occurs in the routing entropy discussed in Section [IV\-F](https://arxiv.org/html/2605.20916#S4.SS6.SSS0.Px2), where the entropy of imp and pol is lower, while the entropy of rea is higher\.
TABLE III: Effect of expert number under the same experimental settings and top\-2 expert selection\. Best results are highlighted in bold\. $N$ refers to the number of experts per routed layer\.
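The Top\-1 statistic described above can be sketched as follows. This is only an illustration under the assumption that the statistic is the largest routing probability in each routed layer, summarized here by a mean over layers; the helper names and the probability values are hypothetical\.

```python
def top1(probs):
    """Return (expert index, probability) of the most probable expert."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def mean_top1(layer_probs):
    """Average the Top-1 probability over all routed layers."""
    return sum(top1(p)[1] for p in layer_probs) / len(layer_probs)

# Hypothetical per-layer routing probabilities over five experts.
layers = {
    "imp": [[0.92, 0.08, 0.0, 0.0, 0.0], [0.88, 0.12, 0.0, 0.0, 0.0]],
    "rea": [[0.0, 0.0, 0.58, 0.42, 0.0], [0.0, 0.0, 0.52, 0.48, 0.0]],
}
summary = {task: mean_top1(p) for task, p in layers.items()}

for task, value in summary.items():
    print(f"mean Top-1 for {task}: {value:.2f}")
```

A high mean Top\-1 value corresponds to one dominant expert, while a value near 0\.5 under top\-2 selection corresponds to two experts sharing the load almost equally\.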
#### Effect of expert number
We vary the number of experts per routed layer $N\in\{4,5,6,7\}$ while keeping top\-2 expert selection to study how the number of experts affects performance\. Table [III](https://arxiv.org/html/2605.20916#S4.T3) shows that the overall performance improves from $N=4$ to $N=5$ and then decreases as $N$ increases to $6$ and $7$\. $N=5$ achieves the best average results\. A likely reason is that with too few experts, different tasks tend to rely on overlapping expert subsets, which is less effective for the rationale task that benefits from broader expert mixing; with too many experts, enlarging the expert pool does not yield further gains and can instead reduce the learning signal received by each expert\. We therefore use $N=5$ in the main experiments\.
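To make the setting of $N=5$ experts with top\-2 selection concrete, the following is a minimal sketch of task\-level sparse routing\. It is not the released implementation: the per\-task logits, the renormalization of the two kept probabilities, the toy scaling experts, and the names `top_k_route` and `moe_layer` are all assumptions made for illustration\.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_route(task_logits, k=2):
    """Keep the k largest routing probabilities and renormalize them (assumed)."""
    probs = softmax(task_logits)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

def moe_layer(x, experts, task_logits, k=2):
    """Combine the outputs of the selected experts with their routing weights."""
    weights = top_k_route(task_logits, k)
    out = [0.0] * len(x)
    for idx, w in weights.items():
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, weights

# Toy stand-ins for N = 5 expert networks: each scales its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0, 5.0)]

# Hypothetical router logits: one vector per task, shared by all inputs of that task.
task_router_logits = {
    "imp": [3.0, 0.5, -1.0, -1.0, -1.0],
    "pol": [0.0, 2.0, 2.0, -1.0, -1.0],
    "rea": [-1.0, -1.0, 1.0, 1.2, 0.9],
}

output, weights = moe_layer([1.0, -1.0], experts, task_router_logits["pol"])
print(weights, output)
```

Because the routing weights depend only on the task identity, every input of a given task passes through the same sparse mixture, which is the property that the routing analyses in this section examine\.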
## V Conclusion
In implicit aspect\-level sentiment analysis, sentiment toward an aspect is often conveyed through events rather than explicit opinion words\. Polarity supervision alone is not enough to teach a model how an event implies an attitude toward an aspect\. We introduce cognitive appraisal reasoning and implicit sentiment detection as auxiliary tasks, so that polarity prediction is trained together with signals about why the sentiment holds and whether the evidence is explicit or implicit\. We then address the sharing problem introduced by the multi\-task formulation with task\-routed mixture\-of\-experts layers\. Instead of forcing all objectives to pass through the same feed\-forward networks, the model learns task\-conditioned expert mixtures\. The task\-separated routing objective further encourages the routing distributions of different tasks to remain distinguishable\. Performance gains on the benchmarks, including improvements on the implicit subset, demonstrate the effectiveness of the proposed method\.
## References
- \[1\] \(2025\) A survey on mixture of experts in large language models\. IEEE Transactions on Knowledge and Data Engineering\.
- \[2\] R\. Caruana \(1997\) Multitask learning\. Machine learning 28\(1\), pp\. 41–75\.
- \[3\] Y\. Chai, H\. Xie, and S\. J\. Qin \(2026\) Text data augmentation for large language models: a comprehensive survey of methods, challenges, and opportunities\. Artif\. Intell\. Rev\. 59\(1\), pp\. 35\.
- \[4\] X\. Chen, H\. Xie, S\. J\. Qin, Y\. Chai, X\. Tao, and F\. L\. Wang \(2024\) Cognitive\-inspired deep learning models for aspect\-based sentiment analysis: a retrospective overview and bibliometric analysis\. Cogn\. Comput\. 16\(6\), pp\. 3518–3556\.
- \[5\] Z\. Chen, Y\. Shen, M\. Ding, Z\. Chen, H\. Zhao, E\. G\. Learned\-Miller, and C\. Gan \(2023\) Mod\-Squad: designing mixtures of experts as modular multi\-task learners\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), pp\. 11828–11837\.
- \[6\] H\. W\. Chung, L\. Hou, S\. Longpre, B\. Zoph, Y\. Tay, W\. Fedus, Y\. Li, X\. Wang, M\. Dehghani, S\. Brahma, et al\. \(2024\) Scaling instruction\-finetuned language models\. J\. Mach\. Learn\. Res\. 25\(70\), pp\. 1–53\.
- \[7\] J\. Devlin, M\.\-W\. Chang, K\. Lee, and K\. Toutanova \(2019\) BERT: pre\-training of deep bidirectional transformers for language understanding\. In Proc\. Conf\. North Amer\. Chapter Assoc\. Comput\. Linguistics: Human Lang\. Technol\., pp\. 4171–4186\.
- \[8\] C\. Ding, Z\. Lu, S\. Wang, R\. Cheng, and V\. N\. Boddeti \(2023\) Mitigating task interference in multi\-task learning via explicit task routing with non\-learnable primitives\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 7756–7765\.
- \[9\] Z\. Duan and J\. Wang \(2024\) Implicit sentiment analysis based on chain of thought prompting\. arXiv:2211\.10986\.
- \[10\] W\. Fedus, B\. Zoph, and N\. Shazeer \(2022\) Switch Transformers: scaling to trillion parameter models with simple and efficient sparsity\. Journal of Machine Learning Research 23\(120\), pp\. 1–39\.
- \[11\] H\. Fei, B\. Li, Q\. Liu, L\. Bing, F\. Li, and T\. Chua \(2023\) Reasoning implicit sentiment with chain\-of\-thought prompting\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\), pp\. 1171–1182\.
- \[12\] A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv:2407\.21783\.
- \[13\] C\. He, F\. Gao, H\. Liu, S\. Zhu, Y\. Jia, H\. Zan, and M\. Peng \(2025\) Task\-aware contrastive mixture of experts for quadruple extraction in conversations with code\-like replies and non\-opinion detection\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(NAACL\)\.
- \[14\] W\. Lai, H\. Xie, G\. Xu, and Q\. Li \(2025\) Multi\-task learning with llms for implicit sentiment analysis: data\-level and task\-level automatic weight learning\. IEEE Transactions on Knowledge and Data Engineering\.
- \[15\] R\. S\. Lazarus \(1991\) Emotion and adaptation\. Oxford University Press, New York\.
- \[16\] Y\. Li, S\. Jiang, B\. Hu, L\. Wang, W\. Zhong, W\. Luo, L\. Ma, and M\. Zhang \(2025\) Uni\-moe: scaling unified multimodal llms with mixture of experts\. IEEE Transactions on Pattern Analysis and Machine Intelligence\.
- \[17\] Z\. Li, Y\. Zou, C\. Zhang, Q\. Zhang, and Z\. Wei \(2021\) Learning implicit sentiment in aspect\-based sentiment analysis with supervised contrastive pre\-training\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp\. 246–256\. External Links: [Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.22)\.
- \[18\] A\. Liu, A\. Mei, B\. Lin, B\. Xue, B\. Wang, B\. Xu, B\. Wu, B\. Zhang, C\. Lin, C\. Dong, et al\. \(2025\) Deepseek\-v3\. 2: pushing the frontier of open large language models\. arXiv:2512\.02556\.
- \[19\] S\. Mu and S\. Lin \(2025\) A comprehensive survey of mixture\-of\-experts: algorithms, theory, and applications\. arXiv:2503\.07137\.
- \[20\] J\. Ouyang, Z\. Yang, S\. Liang, B\. Wang, Y\. Wang, and X\. Li \(2024\) Aspect\-based sentiment analysis with explicit sentiment augmentations\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 38, pp\. 18842–18850\.
- \[21\] M\. Pontiki, D\. Galanis, J\. Pavlopoulos, H\. Papageorgiou, I\. Androutsopoulos, and S\. Manandhar \(2014\) SemEval\-2014 task 4: aspect based sentiment analysis\. In Proceedings of the 8th International Workshop on Semantic Evaluation, SemEval@COLING 2014, Dublin, Ireland, August 23\-24, 2014, pp\. 27–35\.
- \[22\] A\. Rietzler, S\. Stabinger, P\. Opitz, and S\. Engl \(2020\) Adapt or get left behind: domain adaptation through bert language model finetuning for aspect\-target sentiment classification\. In Proceedings of the twelfth language resources and evaluation conference, pp\. 4933–4941\.
- \[23\] K\. Scaria, H\. Gupta, S\. Goyal, S\. Sawant, S\. Mishra, and C\. Baral \(2024\) Instructabsa: instruction learning for aspect based sentiment analysis\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 2: Short Papers\), pp\. 720–736\.
- \[24\] K\. R\. Scherer \(2001\) Appraisal considered as a process of multilevel sequential checking\. In Appraisal Processes in Emotion: Theory, Methods, Research, K\. R\. Scherer, A\. Schorr, and T\. Johnstone \(Eds\.\), pp\. 92–120\.
- \[25\] N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. V\. Le, G\. E\. Hinton, and J\. Dean \(2017\) Outrageously large neural networks: the sparsely\-gated mixture\-of\-experts layer\. In Proceedings of the 5th International Conference on Learning Representations \(ICLR\)\.
- \[26\] A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram, et al\. \(2025\) Openai gpt\-5 system card\. arXiv:2601\.03267\.
- \[27\] T\. Standley, A\. R\. Zamir, D\. Chen, L\. Guibas, J\. Malik, and S\. Savarese \(2020\) Which tasks should be learned together in multi\-task learning?\. In Proceedings of the 37th International Conference on Machine Learning \(ICML\), Proceedings of Machine Learning Research, Vol\. 119, pp\. 9120–9132\.
- \[28\] B\. Wang, L\. Ding, Q\. Zhong, X\. Li, and D\. Tao \(2022\) A contrastive cross\-channel data augmentation framework for aspect\-based sentiment analysis\. In Proceedings of the 29th international conference on computational linguistics, pp\. 6691–6704\.
- \[29\] K\. Wang, W\. Shen, Y\. Yang, X\. Quan, and R\. Wang \(2020\) Relational graph attention network for aspect\-based sentiment analysis\. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp\. 3229–3238\.
- \[30\] S\. Wang, J\. Zhou, C\. Sun, J\. Ye, T\. Gui, Q\. Zhang, and X\. Huang \(2022\) Causal intervention improves implicit sentiment analysis\. In Proceedings of the 29th international conference on computational linguistics, pp\. 6966–6977\.
- \[31\] Q\. Ye, J\. Zha, and X\. Ren \(2022\) Eliciting and understanding cross\-task skills with task\-level mixture\-of\-experts\. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, pp\. 2567–2592\.
- \[32\] G\. Yeo, S\. Furniturewala, and K\. Jaidka \(2024\) Beyond text: leveraging multi\-task learning and cognitive appraisal theory for post\-purchase intention analysis\. In Findings of the Association for Computational Linguistics: ACL 2024, pp\. 12353–12360\.
![[Uncaptioned image]](https://arxiv.org/html/2605.20916v1/pic/Yaping.jpg) Yaping Chai is a Ph\.D\. candidate in the School of Data Science at Lingnan University, Hong Kong, supervised by Prof\. Joe S\. Qin and Prof\. Haoran Xie\. Her research interests center on large language models, natural language processing, and aspect\-based sentiment analysis\. She investigates advanced techniques for fine\-tuning and assessing language models for sentiment analysis and other related NLP tasks\.

![[Uncaptioned image]](https://arxiv.org/html/2605.20916v1/pic/xie.jpg) Haoran Xie \(Senior Member, IEEE\) received a Ph\.D\. degree in Computer Science from City University of Hong Kong and an Ed\.D degree in Digital Learning from the University of Bristol\. He is currently a Professor and the Person\-in\-Charge at the Division of Artificial Intelligence, Director of LEO Dr David P\. Chan Institute of Data Science, and Associate Dean of the School of Data Science, Lingnan University, Hong Kong\. His research interests include natural language processing, large language models, language learning, and AI in education\. He has published 500 research publications, including 300 journal articles\. He is the Editor\-in\-Chief of Natural Language Processing Journal, Computers & Education: Artificial Intelligence, and Computers & Education: X Reality\. He has been selected as the World’s Top 2% Scientists by Stanford University\.

![[Uncaptioned image]](https://arxiv.org/html/2605.20916v1/pic/joeqin.jpg) S\. Joe Qin \(Fellow, IEEE\) received the B\.S\. and M\.S\. degrees in automatic control from Tsinghua University, Beijing, China, in 1984 and 1987, respectively, and the Ph\.D\. degree in chemical engineering from the University of Maryland, College Park, MD, USA, in 1992\. He is currently the Wai Kee Kau Chair Professor and President of Lingnan University, Hong Kong\.
His research interests include data science and analytics, machine learning, process monitoring, model predictive control, system identification, smart manufacturing, smart cities, and predictive maintenance\. Prof\. Qin is a Fellow of the U\.S\. National Academy of Inventors, IFAC, and AIChE\. He was the recipient of the 2022 CAST Computing Award by AIChE, 2022 IEEE CSS Transition to Practice Award, U\.S\. NSF CAREER Award, and NSF\-China Outstanding Young Investigator Award\. His h\-indices for Web of Science, SCOPUS, and Google Scholar are 66, 73, and 89, respectively\.