lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation
Summary
This paper presents a system for constrained humor generation that uses a generate-many select-best strategy with a preference model learned from human comparisons. It achieved top ranks in English and Chinese subtasks and second in Spanish at SemEval-2026 Task 1.
# Humor Is an Audience. Preference Modeling for Constrained Humor Generation
Source: [https://arxiv.org/html/2606.00022](https://arxiv.org/html/2606.00022)
## lmfaoooo at SemEval\-2026 Task 1: Humor Is an Audience\. Preference Modeling for Constrained Humor Generation
Alexey Ivanov OpenAI Mountain View, CA SaveTheRbtz@GMail\.com
###### Abstract
Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because “funny” is audience\-dependent and supervision is noisy: preferences vary with audience, context, and culture, and annotator agreement is often low\. In this paper, we describe our system for the SemEval\-2026 Task 1 \(MWAHAHA\) Castro et al\. \([2026](https://arxiv.org/html/2606.00022#bib.bib2)\), which focuses on humor generation under explicit constraints\. The task evaluates submitted systems via human preference judgments in 1\-on\-1 arena\-style comparisons\.
We adopt a “generate\-many → select\-best” strategy\. First, we generate a diverse pool of candidates per instance using multi\-step prompting, model ensembling, and diversity\-oriented decoding\. Second, we select outputs using a preference model that approximates a “reader” by learning from human comparisons rather than absolute funniness scores\. To support this approach, we release ∼2\.5K human pairwise judgments collected through the Humor Arena prototype Ivanov and Tikhonov \([2025](https://arxiv.org/html/2606.00022#bib.bib1)\)\. We further propose an interpretable pipeline that converts labeled comparisons into a preference model\. Across three preference datasets, our models consistently outperform baselines and show stronger cross\-domain transfer\. Finally, we apply the learned preference model to rank candidates for the MWAHAHA setting and release intermediate artifacts \(candidate pools and rankings\) to facilitate follow\-up work \([https://github\.com/altsoph/lmfaoooo](https://github.com/altsoph/lmfaoooo)\)\.
Our system ranked 1st in the English and Chinese subtasks of MWAHAHA and 2nd in the Spanish subtask\.
Alexey Tikhonov, Inworld\.AI, Berlin, Germany, altsoph@gmail\.com; Alexey Ivanov, OpenAI, Mountain View, CA, SaveTheRbtz@GMail\.com
## 1 Introduction
> *“The question is,” said Alice, “whether you can make words mean so many different things\.” “The question is,” said Humpty Dumpty, “which is to be master—that’s all\.”*Carroll \([1871](https://arxiv.org/html/2606.00022#bib.bib7)\)
Can computers be funny? Humor generation has long been viewed as a stress test for natural language generation: beyond grammatical and semantic coherence, a successful joke must trigger an audience response that is strongly shaped by culture, demographics, and context\. Different authors, from Propp \([1976](https://arxiv.org/html/2606.00022#bib.bib6)\) to Warren et al\. \([2021](https://arxiv.org/html/2606.00022#bib.bib5)\), describe many competing theories of humor appreciation and emphasize that humor depends on incongruity, violation/benign framing, and the audience\. A practical consequence is that humor supervision is inherently noisy: even careful human labeling setups often show limited agreement, and different groups systematically disagree about what they prefer Ruch \([2013](https://arxiv.org/html/2606.00022#bib.bib24)\), Murakami et al\. \([2026](https://arxiv.org/html/2606.00022#bib.bib23)\)\. Even in controlled dataset settings, annotators disagree substantially; moderate agreement is common in humor\-related annotation tasks, e\.g\., Fleiss’ $\kappa \approx 0.49$ for pun\-related data Sun et al\. \([2022](https://arxiv.org/html/2606.00022#bib.bib4)\)\. As a result, “gold labels” are often better viewed as samples from a preference distribution than as objective truth\.
A second practical challenge is memorization and mode collapse in joke generation\. In a systematic probe of ChatGPT’s joke outputs Jentzsch and Kersting \([2023](https://arxiv.org/html/2606.00022#bib.bib8)\), over 90% of 1008 generated jokes repeated the same 25 jokes, suggesting that unconstrained generation can default to a small set of high\-frequency templates rather than produce novel humor\. This motivates evaluation and selection strategies that reward diversity and penalize near\-duplicates\. Moreover, models are often compared against “polished” jokes that have been socially selected \(e\.g\., through comedian curation or online popularity\), whereas AI outputs are largely unfiltered, an asymmetry that can bias evaluations against the model\.
SemEval\-2026 Task 1, MWAHAHA \(Models Write Automatic Humor And Humans Annotate\) Castro et al\. \([2026](https://arxiv.org/html/2606.00022#bib.bib2)\), is the first SemEval task dedicated to advancing computational humor generation, including text\-based constrained joke generation\. The task aims to push models beyond memorization by requiring generation under constraints and evaluating outputs with human annotation\. In this work, we argue that for constrained humor generation, the bottleneck is often selection rather than raw generation: modern LLMs can produce many plausible candidates, but reliably identifying which candidate will be funniest for the target audience is the core difficulty\. We therefore treat humor generation as a two\-stage process: \(i\) generate diverse candidates, then \(ii\) select the optimal one using learned audience preferences\.
Our main contributions are:
- Preference data release\. We release ∼2\.5K human pairwise judgments from Humor Arena Ivanov and Tikhonov \([2025](https://arxiv.org/html/2606.00022#bib.bib1)\)\.
- Preference\-first framing\. We formalize constrained humor generation as generation and ranking of multiple candidates using repeated pairwise preference decisions and show this yields more stable learning signals than baseline scoring\.
- Interpretable preference model\. We propose an interpretable “humor feature” extraction pipeline that converts comparisons into a compact feature basis and enables training of lightweight preference models\.
- lmfaoooo system\. We apply the pipeline to SemEval\-2026 Task 1 and document our full generation and selection workflow\. We publicly release our prompts, candidate pools, features, and rankings used in our experiments, as well as the versions of the models used and other information to support reproducibility \([https://github\.com/altsoph/lmfaoooo](https://github.com/altsoph/lmfaoooo)\)\.
## 2 Background: humor, novelty, and evaluation
Humor is not a single phenomenon but a family of mechanisms \(e\.g\., incongruity, wordplay, expectation violation\)\. Surveys of humor theories enumerate many partially overlapping accounts, suggesting that a universal definition of “funniness” is unlikely to be operationally sufficient for model training and evaluation Warren et al\. \([2021](https://arxiv.org/html/2606.00022#bib.bib5)\)\. Classic humor studies also report differences in humor appreciation across demographic groups, reinforcing that evaluation depends on the rater population Mundorf et al\. \([1988](https://arxiv.org/html/2606.00022#bib.bib9)\), Dore \([2019](https://arxiv.org/html/2606.00022#bib.bib10)\)\.
Tikhonov and Shtykovskiy \([2024](https://arxiv.org/html/2606.00022#bib.bib11)\) show, in blind evaluations, that careful task formulation can elicit machine\-generated humor of a quality comparable to human jokes, including through the use of brainstorming\-style prompting techniques\. A common pattern in creative production is to generate many candidates and then select the best\. In LLM creativity studies, explicitly separating brainstorming from selection improves performance over brainstorming alone Summers\-Stay et al\. \([2023](https://arxiv.org/html/2606.00022#bib.bib12)\)\. Different tricks can be used to increase the diversity of generations Zhang et al\. \([2025](https://arxiv.org/html/2606.00022#bib.bib13)\)\. Moreover, direct training for joke generation also yields strong results Wang et al\. \([2025](https://arxiv.org/html/2606.00022#bib.bib14)\)\.
Effective and unbiased evaluation is a tricky problem in the era of massive adoption of LLMs Tikhonov and Yamshchikov \([2023](https://arxiv.org/html/2606.00022#bib.bib17)\)\. However, some interesting results have already been achieved even for the subjective task of creative text evaluation Agafonova et al\. \([2020](https://arxiv.org/html/2606.00022#bib.bib16)\)\. Automatic humor recognition, detection, or selection is also a known open problem Mihalcea and Strapparava \([2006](https://arxiv.org/html/2606.00022#bib.bib15)\), with some recent promising results Kalloniatis \([2024](https://arxiv.org/html/2606.00022#bib.bib18)\), Bago et al\. \([2025](https://arxiv.org/html/2606.00022#bib.bib19)\)\. In subjective NLG tasks, pointwise \(Likert\-style\) scoring is often noisy or inconsistent\. Comparative protocols can be better aligned with human judgments in many settings Novikova et al\. \([2018](https://arxiv.org/html/2606.00022#bib.bib20)\), although they introduce their own biases and costs\. MWAHAHA Castro et al\. \([2026](https://arxiv.org/html/2606.00022#bib.bib2)\) itself adopts a pairwise arena\-style human preference evaluation for final scoring\. Learning ranking from paired comparisons has a long tradition in statistics \(e\.g\., Bradley–Terry style models Bradley and Terry \([1952](https://arxiv.org/html/2606.00022#bib.bib21)\)\)\. A practical advantage is that one can infer global rankings even with incomplete comparison graphs, which is useful when comparisons are expensive\. Recent studies on pairwise LLM evaluation and rank aggregation \(e\.g\., uncertainty\-guided comparison scheduling and evaluator design\) provide complementary evidence and actionable recommendations for improving both accuracy and cost\.
This motivates our system design: treat candidate generation as a diversity\-maximization problem and selection as a preference\-learning problem; then, generate the final ranking using Evalica Ustalov \([2025](https://arxiv.org/html/2606.00022#bib.bib22)\), an open\-source toolkit designed to support reliable, reproducible evaluation and ranking\.
## 3 Task description: SemEval\-2026 MWAHAHA
MWAHAHA Subtask A Castro et al\. \([2026](https://arxiv.org/html/2606.00022#bib.bib2)\) requires generating jokes under constraints \(e\.g\., conditioning on a news headline or including specified words\) in multiple languages \(English, Spanish, Chinese\)\. The task aims to encourage novelty and fairness through constraint design and human evaluation\.
## 4 Data
We train and evaluate preference models on three datasets\.
### 4\.1 Reddit\-based comparisons
We use a Reddit jokes dataset with human ratings and comparisons derived from up/downvotes, originally collected by Weller and Seppi \([2019](https://arxiv.org/html/2606.00022#bib.bib29)\)\. This dataset represents relatively broad, community\-driven humor with strong topical and stylistic regularities\. Since vote\-derived labels can reflect confounds \(e\.g\., visibility and temporal effects\) rather than pure funniness, we treat them as noisy preference signals and take care to reduce near\-duplicate leakage across train/test splits\. The dataset was downloaded using scripts provided by Baranov et al\. \([2023](https://arxiv.org/html/2606.00022#bib.bib30)\) and processed in exactly the same way as described in Tikhonov and Shtykovskiy \([2024](https://arxiv.org/html/2606.00022#bib.bib11)\)\.
### 4\.2 Humor Mechanics dataset
We also use the publicly available human comparison data released with the Humor Mechanics project repository \([https://github\.com/altsoph/humor\-mechanics](https://github.com/altsoph/humor-mechanics)\)\. It contains both human\-written and generated jokes, with human labels collected from assessors recruited via the Scale\.AI service\.
### 4\.3 Humor Arena pairwise judgments
We collect pairwise judgments in Humor Arena \([https://humor\.ph34r\.me/](https://humor.ph34r.me/)\), a lightweight platform for comparing jokes generated by different models and forming model/joke rankings via repeated comparisons\. We release a dataset of 2,543 pairwise comparisons over generated one\-liners collected in 391 sessions \([https://github\.com/SaveTheRbtz/humor](https://github.com/SaveTheRbtz/humor)\)\. Each comparison has one of four outcomes: A wins, B wins, both are good, or both are bad\. For training preference models, we encode wins/losses as $y \in \{1, 0\}$ and discard both\-good/both\-bad outcomes\.
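As a minimal sketch of this encoding step, the snippet below drops the both\-good and both\-bad outcomes and maps the remaining ones to binary labels\. The outcome strings and the convention that `1` means “A wins” are assumptions made for illustration; the released dataset may use different field names and values\.

```python
from typing import List, Tuple

# Hypothetical outcome labels, chosen for illustration only.
A_WINS, B_WINS, BOTH_GOOD, BOTH_BAD = "a_wins", "b_wins", "both_good", "both_bad"

Comparison = Tuple[str, str, str]  # (joke_a, joke_b, outcome)


def encode_comparisons(rows: List[Comparison]) -> List[Tuple[str, str, int]]:
    """Keep decisive comparisons and encode them as y in {1, 0}."""
    encoded = []
    for joke_a, joke_b, outcome in rows:
        if outcome == A_WINS:
            encoded.append((joke_a, joke_b, 1))  # assumed convention: 1 = A preferred
        elif outcome == B_WINS:
            encoded.append((joke_a, joke_b, 0))  # 0 = B preferred
        # both_good / both_bad have no winner and are dropped
    return encoded


rows = [
    ("joke 1", "joke 2", A_WINS),
    ("joke 1", "joke 3", BOTH_BAD),
    ("joke 2", "joke 3", B_WINS),
]
print(encode_comparisons(rows))
```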
## 5 Method
### 5\.1 Candidate generation
For the candidate generation stage, we closely follow the multi\-step generation strategy described in Tikhonov and Shtykovskiy \([2024](https://arxiv.org/html/2606.00022#bib.bib11)\) and inspired by Summers\-Stay et al\. \([2023](https://arxiv.org/html/2606.00022#bib.bib12)\): instead of attempting to produce the joke in one pass, we first generate, expand, and refine a list of associations, then in the final phase we merge them into several candidate jokes \(see Appendix B of Tikhonov and Shtykovskiy \([2024](https://arxiv.org/html/2606.00022#bib.bib11)\) for the specific prompts\)\.
To increase diversity and reduce memorization, we use \(i\) an ensemble of models \(specifically claude\-sonnet\-4 Anthropic \([2025a](https://arxiv.org/html/2606.00022#bib.bib25)\), claude\-opus\-4\.5 Anthropic \([2025b](https://arxiv.org/html/2606.00022#bib.bib26)\) and GPT\-5 OpenAI \([2025](https://arxiv.org/html/2606.00022#bib.bib27)\), all with the default temperature = 1\.0\), and \(ii\) a tail\-focused sampling trick Zhang et al\. \([2025](https://arxiv.org/html/2606.00022#bib.bib13)\) designed to draw candidates from lower\-probability regions of the model distribution\. In our experiments, we generate 50 candidates per task instance \(before deduplication and filtering\)\. To filter the pool, we apply deterministic checks to enforce hard constraints of the MWAHAHA task \(keyword inclusion and length limits\) and discard invalid candidates\. To reduce near\-duplicate leakage and improve selection robustness, we additionally de\-duplicate candidates using embedding similarity, keeping the centroid candidate per near\-duplicate cluster\.
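The filtering and de\-duplication steps could be sketched as follows\. This is an illustration under stated assumptions rather than the code used for the submission: the bag\-of\-words `toy_embed` stands in for a real embedding model, the 0\.8 similarity threshold is arbitrary, and “centroid candidate” is interpreted as the cluster member most similar to the other members\.

```python
import math
from collections import Counter
from typing import List, Sequence


def satisfies_constraints(text: str, required_words: Sequence[str], max_chars: int) -> bool:
    """Deterministic hard-constraint check: length limit and keyword inclusion."""
    lowered = text.lower()
    return len(text) <= max_chars and all(w.lower() in lowered for w in required_words)


def toy_embed(text: str) -> Counter:
    """Stand-in for a real sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def deduplicate(candidates: List[str], threshold: float = 0.8) -> List[str]:
    """Greedy near-duplicate clustering; keeps one central member per cluster."""
    vecs = [toy_embed(c) for c in candidates]
    clusters: List[List[int]] = []
    for i, vec in enumerate(vecs):
        for cluster in clusters:
            if cosine(vec, vecs[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    kept = []
    for cluster in clusters:
        # Assumed reading of "centroid candidate": the member most similar to the rest.
        central = max(cluster, key=lambda i: sum(cosine(vecs[i], vecs[j]) for j in cluster))
        kept.append(candidates[central])
    return kept


def filter_pool(candidates: List[str], required_words: Sequence[str], max_chars: int) -> List[str]:
    valid = [c for c in candidates if satisfies_constraints(c, required_words, max_chars)]
    return deduplicate(valid)


pool = [
    "My cat runs the office now.",
    "My cat runs the office now!",       # near-duplicate of the first candidate
    "Budget cuts hit the moon.",         # missing the required keyword
    "cat " * 100,                        # violates the length limit
    "The cat filed a complaint about Mondays.",
]
print(filter_pool(pool, required_words=["cat"], max_chars=100))
```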
### 5\.2 Building the “humor basis”
To select the best candidate, we need to compare candidates and obtain results as close as possible to those of a human audience\. A core difficulty in preference modeling is representation: raw text embeddings can work, but they are hard to interpret and diagnose\. We therefore construct a compact, interpretable “humor basis”, a key element of our proposed approach: a relatively short list of humor aspects that a particular joke may or may not use\. For example, such rules could include recommendations to use *Dark Humor*, *Exaggeration*, or *Wordplay*\. To some degree, this is similar to the Constitutional AI approach from Bai et al\. \([2022](https://arxiv.org/html/2606.00022#bib.bib31)\), but we do not require the rules to be co\-aligned or coherent\.
At this stage, our goal is to achieve maximum diversity of popular humor aspects without overinflating the list\. To construct such a basis, we extract qualitative “difference hints” from each training preference pair by prompting an LLM to produce 7–10 short descriptions of how the two jokes differ\. This yields several thousand candidate hints\. We then embed the hints, cluster them with DP\-means Kulis and Jordan \([2012](https://arxiv.org/html/2606.00022#bib.bib28)\), and discard small clusters\. Finally, we apply semantic \(LLM\-based\) de\-duplication across clusters to obtain a compact basis \(17 features in the current run; see Appendix A for the full list\)\.
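For readers unfamiliar with DP\-means, a compact sketch is given below\. It follows the general scheme of Kulis and Jordan \(a point opens a new cluster when its squared distance to every existing center exceeds a penalty `lam`\), but the toy 2D points, the penalty value, and the minimum cluster size are illustrative assumptions, not the settings used in this work\.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple

Vector = Sequence[float]


def sq_dist(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(points: List[Vector]) -> List[float]:
    return [sum(col) / len(points) for col in zip(*points)]


def dp_means(points: List[Vector], lam: float, max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    centers = [mean(points)]  # start from a single global cluster
    labels = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sq_dist(p, c) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] > lam:  # too far from every center: open a new cluster
                centers.append(list(p))
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # Recompute centers as cluster means and drop clusters that became empty.
        members = {}
        for i, k in enumerate(labels):
            members.setdefault(k, []).append(points[i])
        order = sorted(members)
        remap = {old: new for new, old in enumerate(order)}
        centers = [mean(members[old]) for old in order]
        labels = [remap[k] for k in labels]
        if not changed:
            break
    return labels, centers


def large_clusters(labels: List[int], min_size: int) -> Set[int]:
    """Cluster ids that survive the 'discard small clusters' step."""
    return {k for k, count in Counter(labels).items() if count >= min_size}


# Toy stand-ins for hint embeddings: two dense groups and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
labels, centers = dp_means(points, lam=1.0)
kept = large_clusters(labels, min_size=2)
print(labels, kept)
```

Unlike k\-means, the number of clusters is not fixed in advance; it grows with the data and is controlled only through the penalty, which suits a setting where the number of distinct humor aspects is unknown\.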
### 5\.3 Feature vector extraction
Given the humor basis, we extract feature vectors by presenting a joke or a pair of jokes to an LLM alongside the basis and asking it to score the input against each basis rule\. We implement two setups: \(i\) pointwise, where we ask the LLM to score a given joke independently by producing a vector of weights \(one per rule\), and \(ii\) pairwise, where we ask the LLM to compare two jokes and return a vector of differences indicating which joke in the pair is more aligned with each rule\. See Appendix B and Appendix C for examples of pointwise and pairwise decompositions, respectively\. The exact decomposition prompts are available in the public GitHub repository \([https://github\.com/altsoph/lmfaoooo](https://github.com/altsoph/lmfaoooo)\)\.
### 5\.4 Preference modeling
Having these \(pointwise and pairwise\) feature vectors, we can now train lightweight preference predictors\. We first use L1 \(LASSO\) regularization Tibshirani \([1996](https://arxiv.org/html/2606.00022#bib.bib33)\) for feature selection and collinearity reduction; then we train an L2\-regularized regression Tikhonov \([1963](https://arxiv.org/html/2606.00022#bib.bib32)\) over the selected features for better solution stability\.
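The two\-stage fit can be illustrated with a small self\-contained sketch\. We assume here a logistic model without intercept over pairwise difference features, fitted by proximal gradient descent; the text only specifies L1 selection followed by an L2\-regularized regression, so the loss, the optimizer, the penalty strengths, and the toy data are all assumptions\.

```python
import itertools
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X: List[List[float]], y: List[int], l1: float = 0.0, l2: float = 0.0,
                 lr: float = 0.1, steps: int = 2000) -> List[float]:
    """Logistic regression (no intercept) via proximal gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        for j in range(d):
            wj = w[j] - lr * (grad[j] + l2 * w[j])  # gradient step with L2 shrinkage
            # Soft-thresholding is the proximal step for the L1 penalty.
            w[j] = math.copysign(max(abs(wj) - lr * l1, 0.0), wj)
    return w


def select_then_refit(X: List[List[float]], y: List[int],
                      l1: float = 0.05, l2: float = 0.1) -> Tuple[List[int], List[float]]:
    w_l1 = fit_logistic(X, y, l1=l1)                      # stage 1: sparse selection
    selected = [j for j, w in enumerate(w_l1) if w != 0.0]
    X_sel = [[row[j] for j in selected] for row in X]
    return selected, fit_logistic(X_sel, y, l2=l2)        # stage 2: stable refit


# Toy pairwise data: each feature is +1 if joke A is more aligned with a rule, -1 if B is.
# Only feature 0 determines the (hypothetical) human choice; features 1 and 2 are irrelevant.
X = [list(row) for row in itertools.product([-1.0, 1.0], repeat=3)]
y = [1 if row[0] > 0 else 0 for row in X]
selected, weights = select_then_refit(X, y)
print(selected, weights)
```

Omitting the intercept keeps the model antisymmetric: swapping A and B negates the feature vector and therefore flips the predicted preference\.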
While pairwise feature vectors are more expensive \(requiring up to $\Theta(n^{2})$ evaluations to obtain a useful ranking\), they often yield better self\-consistency and stability Novikova et al\. \([2018](https://arxiv.org/html/2606.00022#bib.bib20)\) in global ranking\. Thus, we use pairwise\-based regression as our main approach, with some optimizations to reduce the actual number of comparisons: \(i\) start with a random allocation of edges, \(ii\) prioritize comparisons that connect disconnected components, \(iii\) compare dissimilar candidates to reduce uncertainty\. Under some circumstances, $\Theta(n \log n)$ edges suffice \(see, for example, Negahban et al\. \([2012](https://arxiv.org/html/2606.00022#bib.bib34)\), Ailon and Mohri \([2010](https://arxiv.org/html/2606.00022#bib.bib35)\)\)\.
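Steps \(i\) and \(ii\) of this schedule might look like the sketch below, which samples random comparison edges and then adds the edges needed to make the comparison graph connected\. The function name, the seed, and the way components are bridged are illustrative choices; step \(iii\) is omitted because it depends on a candidate dissimilarity measure\.

```python
import random
from typing import List, Tuple


class DisjointSet:
    """Union-find structure for tracking connected components."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def schedule_comparisons(n: int, n_random: int, seed: int = 0) -> List[Tuple[int, int]]:
    rng = random.Random(seed)
    all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # (i) random initial allocation of comparison edges
    edges = rng.sample(all_pairs, min(n_random, len(all_pairs)))
    components = DisjointSet(n)
    for a, b in edges:
        components.union(a, b)
    # (ii) add one bridging comparison between consecutive component representatives
    roots = sorted({components.find(i) for i in range(n)})
    for a, b in zip(roots, roots[1:]):
        edges.append((a, b))
        components.union(a, b)
    return edges


edges = schedule_comparisons(n=10, n_random=5)
print(len(edges), edges)
```

A connected comparison graph is the minimum requirement for a paired\-comparison model to place all candidates on a single scale; additional edges then mainly reduce uncertainty\.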
At the same time, we use the pointwise\-based model as a baseline, calculating an absolute score for each joke and deriving the winner by comparing the scores of the two jokes in a pair\.
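In code, the pointwise baseline reduces to scoring each joke separately and comparing the two scores\. The weights and feature values below are invented for illustration; only the feature names are taken from the humor basis examples in Section 5\.2\.

```python
from typing import Dict


def pointwise_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Absolute score of one joke: weighted sum of its per-rule feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def predict_winner(features_a: Dict[str, float], features_b: Dict[str, float],
                   weights: Dict[str, float]) -> int:
    """Return 1 if joke A is predicted to win, else 0 (ties go to B in this sketch)."""
    return 1 if pointwise_score(features_a, weights) > pointwise_score(features_b, weights) else 0


# Hypothetical learned weights and LLM-assigned pointwise feature values.
weights = {"Wordplay": 0.8, "Exaggeration": 0.3, "Dark Humor": -0.2}
joke_a = {"Wordplay": 0.9, "Exaggeration": 0.1, "Dark Humor": 0.0}
joke_b = {"Wordplay": 0.2, "Exaggeration": 0.7, "Dark Humor": 0.5}
print(predict_winner(joke_a, joke_b, weights))
```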
### 5\.5 Global ranking
Given a pair of candidates $(a, b)$ for the same constraint, our preference model predicts which candidate an audience would prefer\. We aggregate outcomes for each pair into a global ranking using paired\-comparison models \(Bradley–Terry–Luce family\) Bradley and Terry \([1952](https://arxiv.org/html/2606.00022#bib.bib21)\) and Elo\-style procedures using the Evalica library Ustalov \([2025](https://arxiv.org/html/2606.00022#bib.bib22)\)\.
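To make the aggregation step concrete, the sketch below fits a Bradley–Terry model, in which candidate $a$ beats candidate $b$ with probability $p_a / (p_a + p_b)$, using the standard minorization\-maximization update\. It is a plain\-Python illustration and does not reproduce the Evalica API; the toy outcomes are invented, and the code assumes every candidate has at least one win so that no strength collapses to zero\.

```python
from collections import defaultdict
from typing import List, Tuple


def bradley_terry(n_items: int, outcomes: List[Tuple[int, int]],
                  iters: int = 500, tol: float = 1e-10) -> List[float]:
    """Fit Bradley-Terry strengths from (winner, loser) pairs; strengths sum to 1."""
    wins = [0.0] * n_items
    games = defaultdict(int)  # number of comparisons per unordered pair
    for winner, loser in outcomes:
        wins[winner] += 1.0
        games[(min(winner, loser), max(winner, loser))] += 1
    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), count in games.items():
            share = count / (p[i] + p[j])
            denom[i] += share
            denom[j] += share
        # MM update: p_i <- wins_i / sum_j n_ij / (p_i + p_j), then renormalize.
        new_p = [wins[i] / denom[i] if denom[i] > 0 else p[i] for i in range(n_items)]
        total = sum(new_p)
        new_p = [x / total for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


# Toy predicted outcomes over three candidates: (winner, loser).
outcomes = [(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)] + [(0, 2)] * 3 + [(2, 0)]
strengths = bradley_terry(3, outcomes)
ranking = sorted(range(3), key=lambda i: -strengths[i])
print(ranking, strengths)
```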
## 6 Experiments
### 6\.1 Within\-dataset results
Table 1: 10\-fold CV accuracy for pointwise, NoBasis\-ablation, and pairwise models\.
In Table [1](https://arxiv.org/html/2606.00022#S6.T1), we compare preference predictors trained on pairwise labels against baselines\. NoBasis is a direct LLM\-judge baseline: given a pair of jokes, an evaluator LLM is prompted only with the generic question “which joke is better?” \(no feature basis\); its predicted winner is compared to the human label to compute accuracy\. Across datasets, our pairwise feature\-based models are more reliable for predicting human choices\. These results suggest that pairwise learning provides a stronger signal than absolute scoring for humor preferences, especially in the noisiest setting \(H\.Arena\), where audience effects and style variance are high\.
### 6\.2 Cross\-dataset transfer
Table 2: Cross\-dataset transfer accuracy for pairwise models \(train on rows, test on columns\)\. 50% is random choice\.
Table 3: Cross\-dataset transfer accuracy for pointwise baselines \(control\)\. 50% is random choice\.
We test whether a preference model trained on one dataset generalizes to another\. Transfer is partial and asymmetric, indicating that humor preferences include both transferable components \(e\.g\., punchline clarity, expectation violation\) and community\-specific biases\.
Overall, pairwise models show materially stronger transfer than pointwise baselines in most off\-diagonal settings, suggesting that relative judgments capture more stable structure than absolute ratings\.
## 7 SemEval\-2026 MWAHAHA Submission
MWAHAHA evaluates systems using human preference battles in an arena setting\. Our submission follows the same principle: generate a diverse candidate set, then select via a learned preference proxy for the listener\. We originally planned to use trial\-phase feedback to calibrate the target audience profile for selection; however, the competition workflow did not enable using trial labels to adapt final\-phase selection\. We therefore treated H\.Arena preferences as our best available proxy\. For non\-English subtasks, we use exactly the same workflow \(including the same 17\-feature humor basis\), but ask the LLMs to generate associations and candidate jokes directly in the target language\.
Our system ranked 1st in the English and Chinese subtasks of MWAHAHA, with Elo scores of 1041 \(95% CI \[1009, 1064\]\) and 1081 \(95% CI \[1031, 1127\]\), respectively, and 2nd in the Spanish subtask, with an Elo score of 1091 \(95% CI \[1053, 1121\]\)\.
## 8 Discussion
Our findings support a preference\-first view of humor generation: generating diverse candidates is increasingly easy for modern LLMs, while reliably selecting what a target audience prefers remains the key difficulty\. This also explains why results in the literature can appear heterogeneous: evaluation depends strongly on rater pools, cultural context, and the measurement protocol\. Empirically, LLMs can generate many grammatically valid “joke\-shaped” candidates, but novelty and audience alignment are fragile\. The tendency to repeat common templates is well documented in LLM joke generation experiments Jentzsch and Kersting \([2023](https://arxiv.org/html/2606.00022#bib.bib8)\)\. This supports investing compute into diverse generation and focusing research effort on preference modeling and selection\.
An advantage of an interpretable feature basis is that it enables targeted error analysis: for example, we observe that certain features \(e\.g\., Clear Punchline, Subverting Expectations\) are consistently selected across datasets, while other features appear community\-specific\.
## 9 Limitations and Ethics
#### Subjectivity and demographic variation\.
Humor depends on culture, context, and the listener; thus, any single preference model risks overfitting to its annotator population\. Humor can easily drift into toxic or offensive content\. We recommend explicit safety filtering and constraint\-based generation to keep outputs within task norms\. A further limitation is evaluator circularity: our pipeline relies on LLMs for feature extraction/decomposition, which can introduce evaluator bias\. However, we hope that aligning the preference model with human choices eliminates this effect\.
## 10 Conclusion
We presented a system for SemEval\-2026 Task 1 \(MWAHAHA\) that treats constrained humor generation as a two\-stage pipeline: diverse candidate generation followed by pairwise preference\-based selection\. We release a new set of human pairwise humor judgments and propose an interpretable feature\-based preference modeling approach\. Across three datasets, pairwise preference models outperform pointwise baselines and transfer more robustly across domains\. Overall, our results support the view that continued progress in humor generation will increasingly depend on improving evaluation and preference modeling, rather than solely improving raw text generation\.
## References
- Y\. Agafonova, A\. Tikhonov, and I\. P\. Yamshchikov \(2020\) Paranoid transformer: reading narrative of madness as computational approach to creativity\. Future Internet 12\(11\)\. External Links: [Link](https://www.mdpi.com/1999-5903/12/11/182), ISSN 1999\-5903, [Document](https://dx.doi.org/10.3390/fi12110182)\.
- N\. Ailon and M\. Mohri \(2010\) Active learning ranking from pairwise preferences with almost optimal query complexity\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. External Links: [Link](https://papers.neurips.cc/paper/4428-active-learning-ranking-from-pairwise-preferences-with-almost-optimal-query-complexity.pdf)\.
- Anthropic \(2025a\) System card: claude opus 4 & claude sonnet 4\. Technical report, Anthropic\. Note: Model version identifier: claude\-sonnet\-4\-20250514\. External Links: [Link](https://www-cdn.anthropic.com/6d8a8055020700718b0c49369f60816ba2a7c285.pdf)\.
- Anthropic \(2025b\) System card: claude opus 4\.5\. Technical report, Anthropic\. Note: Model version identifier: claude\-opus\-4\-5\-20251101\. External Links: [Link](https://www-cdn.anthropic.com/bf10f64990cfda0ba858290be7b8cc6317685f47.pdf)\.
- P\. Bago et al\. \(2025\) Evaluating large language models for humor detection in low\-resource settings\. In Proceedings of the 10th Workshop on Slavic Natural Language Processing \(Slavic NLP 2025\), pp\. 9–16\. External Links: [Link](https://aclanthology.org/2025.bsnlp-1.2.pdf)\.
- Y\. Bai, S\. Kadavath, S\. Kundu, A\. Askell, J\. Kernion, A\. Jones, A\. Chen, A\. Goldie, A\. Mirhoseini, C\. McKinnon, et al\. \(2022\) Constitutional AI: harmlessness from AI feedback\. External Links: 2212\.08073, [Document](https://dx.doi.org/10.48550/arXiv.2212.08073), [Link](https://arxiv.org/abs/2212.08073)\.
- A\. Baranov, V\. Kniazhevsky, and P\. Braslavski \(2023\) You told me that joke twice: a systematic investigation of transferability and robustness of humor detection models\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 13701–13715\. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.845), [Link](https://aclanthology.org/2023.emnlp-main.845/)\.
- R\. A\. Bradley and M\. E\. Terry \(1952\) Rank analysis of incomplete block designs: I\. the method of paired comparisons\. Biometrika 39\(3\-4\), pp\. 324–345\. External Links: [Document](https://dx.doi.org/10.1093/biomet/39.3-4.324)\.
- L\. Carroll \(1871\) Through the looking\-glass, and what alice found there\. Macmillan and Co\., London\. Note: Quote from Chapter VI: “Humpty Dumpty”\.
- S\. Castro, L\. Chiruzzo, S\. Góngora, S\. Rahili, N\. Deng, I\. Sastre, V\. Amoroso, G\. Rey, A\. Rosá, G\. Moncecchi, J\. A\. Meaney, J\. J\. Prada, and R\. Mihalcea \(2026\) SemEval\-2026 Task 1: MWAHAHA, Models Write Automatic Humor And Humans Annotate\. In Proceedings of the 20th International Workshop on Semantic Evaluation \(SemEval\-2026\)\.
- M\. Dore \(2019\) Humour in audiovisual translation: theories and applications\. Routledge, New York\. External Links: ISBN 9780367432317, [Document](https://dx.doi.org/10.4324/9781003001928)\.
- A\. Ivanov and A\. Tikhonov \(2025\) humor: LLM humor arena\. Note: [https://github\.com/SaveTheRbtz/humor](https://github.com/SaveTheRbtz/humor) GitHub repository\. Latest commit on main: 43c4d2f \(Jul 11, 2025\)\. Accessed: 2026\-02\-14\.
- S\. Jentzsch and K\. Kersting \(2023\) ChatGPT is fun, but it is not funny\! humor is still challenging large language models\. Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media Analysis, pp\. 325–340\.
- A\. Kalloniatis \(2024\) Computational humor recognition: a systematic literature review\. Artificial Intelligence Review\. External Links: [Document](https://dx.doi.org/10.1007/s10462-024-11043-3)\.
- B\. Kulis and M\. I\. Jordan \(2012\) Revisiting k\-means: new algorithms via bayesian nonparametrics\. In Proceedings of the 29th International Conference on Machine Learning \(ICML 2012\), Edinburgh, Scotland, UK, pp\. 1131–1138\. Note: Introduces the DP\-means hard clustering objective with a cluster\-creation penalty\. External Links: [Link](https://icml.cc/2012/papers/291.pdf)\.
- R\. Mihalcea and C\. Strapparava \(2006\) Learning to laugh \(automatically\): computational models for humor recognition\. Computational Intelligence 22, pp\. 126–142\. External Links: [Document](https://dx.doi.org/10.1111/j.1467-8640.2006.00278.x)\.
- N\. Mundorf, A\. Bhatia, D\. Zillmann, P\. Lester, and S\. Robertson \(1988\) Gender differences in humor appreciation\. Humor: International Journal of Humor Research 1\(3\), pp\. 231–244\. External Links: [Document](https://dx.doi.org/10.1515/humr.1988.1.3.231)\.
- S\. Murakami, H\. Kamigaito, H\. Takamura, and M\. Okumura \(2026\) Who laughs with whom? disentangling influential factors in humor preferences across user clusters and llms\. External Links: 2601\.03103, [Link](https://arxiv.org/abs/2601.03103)\.
- S\. Negahban, S\. Oh, and D\. Shah \(2012\) Rank centrality: ranking from pair\-wise comparisons\. External Links: 1209\.1688, [Link](https://arxiv.org/abs/1209.1688)\.
- J\. Novikova, O\. Dušek, and V\. Rieser \(2018\) RankME: reliable human ratings for natural language generation\. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(NAACL\-HLT\)\. External Links: [Link](https://aclanthology.org/N18-2012.pdf)\.
- OpenAI \(2025\) OpenAI GPT\-5\.2 system card\. OpenAI\. Note: Accessed: 2026\-02\-23\. External Links: [Link](https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf)\.
- V\. I\. Propp \(1976\) Problemy komizma i smekha\. Iskusstvo, Moscow \(in Russian\)\. Note: Posthumous first publication; English translation published as *On the Comic and Laughter* \(University of Toronto Press, 2009\)\.
- W\. Ruch \(2013\) Assessment of appreciation of humor: studies with the 3 wd humor test\. In Advances in Personality Assessment: Volume 9, C\. D\. Spielberger and J\. N\. Butcher \(Eds\.\), pp\. 27–75\. External Links: [Document](https://dx.doi.org/10.4324/9781315827483-2), [Link](https://www.taylorfrancis.com/chapters/edit/10.4324/9781315827483-2/assessment-appreciation-humor-studies-3-wd-humor-test-willibald-ruch)\.
- D\. Summers\-Stay, S\. M\. Lukin, and C\. R\. Voss \(2023\) Brainstorm, then select: a generative language model improves its creativity score\. In 2023 IEEE ICRA\. External Links: [Link](https://api.semanticscholar.org/CorpusID:259305709)\.
- J\. Sun, A\. Narayan\-Chen, S\. Oraby, A\. Cervone, T\. Chung, J\. Huang, Y\. Liu, and N\. Peng \(2022\) ExPUNations: augmenting puns with keywords and explanations\. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\), Abu Dhabi, United Arab Emirates, pp\. 4590–4605\. External Links: [Link](https://aclanthology.org/2022.emnlp-main.304/), [Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.304)\.
- R\. Tibshirani \(1996\) Regression shrinkage and selection via the lasso\. Journal of the Royal Statistical Society: Series B \(Methodological\) 58\(1\), pp\. 267–288\. External Links: [Document](https://dx.doi.org/10.1111/j.2517-6161.1996.tb02080.x), [Link](https://academic.oup.com/jrsssb/article/58/1/267/7027929)\.
- A\. N\. Tikhonov \(1963\) Solution of incorrectly formulated problems and the regularization method\. Soviet Mathematics Doklady 4, pp\. 1035–1038\. Note: English translation of: Dokl\. Akad\. Nauk SSSR 151 \(1963\), 501–504\.
- A\. Tikhonov and P\. Shtykovskiy \(2024\) Humor mechanics: advancing humor generation with multistep reasoning\. In Proceedings of the 15th International Conference on Computational Creativity, ICCC 2024, Jönköping, Sweden, June 17\-21, 2024, K\. Grace, M\. T\. Llano, P\. Martins, and M\. M\. Hedblom \(Eds\.\), pp\. 31–41\. External Links: [Link](https://computationalcreativity.net/iccc24/papers/ICCC24%5C_paper%5C_128.pdf)\.
- A\. Tikhonov and I\. P\. Yamshchikov \(2023\) Post Turing: mapping the landscape of LLM evaluation\. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics \(GEM\), S\. Gehrmann, A\. Wang, J\. Sedoc, E\. Clark, K\. Dhole, K\. R\. Chandu, E\. Santus, and H\. Sedghamiz \(Eds\.\), Singapore, pp\. 398–412\. External Links: [Link](https://aclanthology.org/2023.gem-1.31/)\.
- D\. Ustalov \(2025\)Reliable, Reproducible, and Really Fast Leaderboards with Evalica\.InProceedings of the 31st International Conference on Computational Linguistics: System Demonstrations,Abu Dhabi, UAE,pp\. 46–53\(english\)\.External Links:2412\.11314,[Link](https://aclanthology.org/2025.coling-demos.6)Cited by:[§2](https://arxiv.org/html/2606.00022#S2.p4.1),[§5\.5](https://arxiv.org/html/2606.00022#S5.SS5.p1.1)\.
- H\. Wang, Y\. Zhao, D\. Li, X\. Wang, G\. Liu, X\. Lan, and H\. Wang \(2025\)Innovative thinking, infinite humor: humor research of large language models through structured thought leaps\.InInternational Conference on Learning Representations \(ICLR\),Note:PosterExternal Links:[Link](https://openreview.net/forum?id=CGhgB8Kz8i)Cited by:[§2](https://arxiv.org/html/2606.00022#S2.p2.1)\.
- C\. Warren, A\. Barsky, and A\. P\. McGraw \(2021\)What makes things funny? an integrative review of the antecedents of laughter and amusement\.Personality and Social Psychology ReviewVol\. 25\(1\),pp\. 41–65\.Cited by:[§1](https://arxiv.org/html/2606.00022#S1.p2.1),[§2](https://arxiv.org/html/2606.00022#S2.p1.1)\.
- O\. Weller and K\. Seppi \(2019\)Humor detection: a transformer gets the last laugh\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),Hong Kong, China,pp\. 3621–3625\.Note:Introduces/uses the Reddit jokes \(ReJ\) dataset for humor detection\.External Links:[Document](https://dx.doi.org/10.18653/v1/D19-1372),[Link](https://aclanthology.org/D19-1372/)Cited by:[§4\.1](https://arxiv.org/html/2606.00022#S4.SS1.p1.1)\.
- J\. Zhang, S\. Yu, D\. Chong, A\. Sicilia, M\. R\. Tomz, C\. D\. Manning, and W\. Shi \(2025\)Verbalized sampling: how to mitigate mode collapse and unlock llm diversity\.External Links:2510\.01171,[Link](https://arxiv.org/abs/2510.01171)Cited by:[§2](https://arxiv.org/html/2606.00022#S2.p2.1),[§5\.1](https://arxiv.org/html/2606.00022#S5.SS1.p2.1)\.
## Appendix A\. Humor basis
- Clear Punchline: Ensure the joke delivers a strong, unmistakable punchline for maximum impact\.
- Wordplay with Purpose: Use puns or wordplay that serves the joke, rather than relying on repetition or forced cleverness\.
- Universality: Use references that are widely understood or relatable to broaden appeal\.
- Natural Dialogue: Employ conversational exchanges to make the joke feel organic and engaging\.
- Subtlety Over Obviousness: Favor subtle humor that allows audiences to connect the dots over jokes that spell everything out\.
- Avoid Cliche: Steer away from jokes that rely on overused wordplay or tired humor structures\.
- Fresh Perspective: Offer a novel or surprising angle on familiar situations to keep material original\.
- Exaggeration: Amplifying a characteristic, situation, or behavior to absurd levels to highlight its comedic potential\.
- Subverting Expectations: Twisting a familiar setup creates delight by catching the audience off guard\.
- Character\-Driven Humor: Use established stereotypes or behaviors to anchor the joke and build richer scenarios\.
- Economy of Words: Be concise and efficient with language, trimming unnecessary details to maximize comedic payoff\.
- Self\-Deprecation: Playfully targeting oneself can disarm the audience and make humor more relatable\.
- Satirical Edge: Employ satire to critique social trends or behaviors, adding depth to the humor\.
- Anthropomorphism: Attribute human qualities to non\-human entities for humorous effect\.
- Clever Analogies: Use creative comparisons that link unrelated concepts for a surprising comedic twist\.
- Memorable Imagery: Create vivid or amusing mental pictures that stick with the audience\.
- Dark Humor: Making light of subjects that are generally considered serious, taboo, or morbid\.
## Appendix B\. Pointwise feature vector example
Joke: Putting air in your tires used to be free now its costs a dollar\.\.\. Its called inflation\.

Absolute scores:

- Clear Punchline: 1\.0
- Wordplay with Purpose: 1\.0
- Universality: 1\.0
- Natural Dialogue: 0\.0
- Subtlety Over Obviousness: 0\.6
- Avoid Cliche: 0\.6
- Fresh Perspective: 0\.0
- Exaggeration: 0\.0
- Subverting Expectations: 0\.0
- Character\-Driven Humor: 1\.0
- Economy of Words: 1\.0
- Self\-Deprecation: 0\.0
- Satirical Edge: 0\.2
- Anthropomorphism: 0\.0
- Clever Analogies: 0\.9
- Memorable Imagery: 0\.1
- Dark Humor: 0\.0
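As an illustration, a pointwise feature vector like the one above could be stored as a name-to-score mapping and flattened into a fixed-order list that follows the Appendix A basis. This is a minimal sketch, not the authors' code: the names `HUMOR_BASIS`, `to_vector`, and `inflation_scores` are hypothetical, and the assumption that every score lies in [0, 1] is inferred from the example values only.

```python
from typing import Dict, List

# Rule names in the order they are listed in Appendix A.
HUMOR_BASIS: List[str] = [
    "Clear Punchline",
    "Wordplay with Purpose",
    "Universality",
    "Natural Dialogue",
    "Subtlety Over Obviousness",
    "Avoid Cliche",
    "Fresh Perspective",
    "Exaggeration",
    "Subverting Expectations",
    "Character-Driven Humor",
    "Economy of Words",
    "Self-Deprecation",
    "Satirical Edge",
    "Anthropomorphism",
    "Clever Analogies",
    "Memorable Imagery",
    "Dark Humor",
]


def to_vector(scores: Dict[str, float]) -> List[float]:
    """Return the scores in basis order (hypothetical helper).

    Raises ValueError if a rule is missing, an unknown rule is present,
    or a score falls outside the assumed [0, 1] range.
    """
    missing = [name for name in HUMOR_BASIS if name not in scores]
    unknown = [name for name in scores if name not in HUMOR_BASIS]
    if missing or unknown:
        raise ValueError(f"missing={missing}, unknown={unknown}")
    vector = [float(scores[name]) for name in HUMOR_BASIS]
    if any(not 0.0 <= value <= 1.0 for value in vector):
        raise ValueError("scores are assumed to lie in [0, 1]")
    return vector


# Absolute scores from the inflation-joke example above.
inflation_scores: Dict[str, float] = {
    "Clear Punchline": 1.0,
    "Wordplay with Purpose": 1.0,
    "Universality": 1.0,
    "Natural Dialogue": 0.0,
    "Subtlety Over Obviousness": 0.6,
    "Avoid Cliche": 0.6,
    "Fresh Perspective": 0.0,
    "Exaggeration": 0.0,
    "Subverting Expectations": 0.0,
    "Character-Driven Humor": 1.0,
    "Economy of Words": 1.0,
    "Self-Deprecation": 0.0,
    "Satirical Edge": 0.2,
    "Anthropomorphism": 0.0,
    "Clever Analogies": 0.9,
    "Memorable Imagery": 0.1,
    "Dark Humor": 0.0,
}

inflation_vector = to_vector(inflation_scores)
```

Fixing the order once keeps vectors for different jokes aligned position by position, so any downstream weights can be read back as per-rule importances.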
## Appendix C\. Pairwise feature vector example
Joke A: "How many digits of pi do you know?" \- "All of them\.\.\. I just always forget the order\!"

Joke B: A friend has a fear of pi\. I keep telling him it’s irrational, but he doesn’t listen\.

Relative scores \(1\.0 means A is closer, 0\.0 means B is closer to match the rule\):

- Clear Punchline: 0\.5
- Wordplay with Purpose: 0\.0
- Universality: 0\.05
- Natural Dialogue: 1\.0
- Subtlety Over Obviousness: 0\.45
- Avoid Cliche: 0\.55
- Fresh Perspective: 0\.95
- Exaggeration: 0\.75
- Subverting Expectations: 0\.95
- Character\-Driven Humor: 0\.25
- Economy of Words: 0\.05
- Self\-Deprecation: 0\.85
- Satirical Edge: 0\.35
- Anthropomorphism: 0\.25
- Clever Analogies: 0\.45
- Memorable Imagery: 0\.95
- Dark Humor: 0\.0
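To show how a pairwise vector like the one above might feed an interpretable preference model, the sketch below centers each relative score at 0.5, takes a weighted sum, and maps it through a sigmoid to get a probability that Joke A is preferred. This is an assumption about one plausible linear form, not the model described in the paper: the function `preference_probability`, the uniform weights, and the zero bias are all hypothetical placeholders, and real weights would have to be fit on labeled comparisons.

```python
import math
from typing import Sequence


def preference_probability(
    relative: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Probability that Joke A is preferred under a linear-logistic sketch.

    Each relative score is centered at 0.5, so a value of 0.5 (no
    difference between A and B on that rule) contributes nothing.
    """
    if len(relative) != len(weights):
        raise ValueError("relative scores and weights must have equal length")
    logit = bias + sum(w * (r - 0.5) for w, r in zip(weights, relative))
    return 1.0 / (1.0 + math.exp(-logit))


# Relative scores from the pi-joke example above, in Appendix A order.
pi_relative = [
    0.5, 0.0, 0.05, 1.0, 0.45, 0.55, 0.95, 0.75, 0.95,
    0.25, 0.05, 0.85, 0.35, 0.25, 0.45, 0.95, 0.0,
]

# Placeholder weights: every rule counts equally (an illustration only).
uniform_weights = [1.0] * len(pi_relative)

p_a = preference_probability(pi_relative, uniform_weights)

# Swapping the roles of A and B mirrors every score around 0.5.
p_b = preference_probability([1.0 - r for r in pi_relative], uniform_weights)
```

With a zero bias this form is symmetric: a vector of all 0.5 yields a probability of 0.5, and swapping A and B yields the complementary probability, which is a reasonable sanity check for any pairwise scorer.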