HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

arXiv cs.CL Papers

Summary

HumorRank introduces a tournament-based leaderboard using pairwise evaluations and Bradley-Terry MLE to rank LLMs on humor generation, showing humor quality depends on comedic mastery rather than scale.

arXiv:2604.19786v1 Announce Type: new Abstract: Evaluating humor in large language models (LLMs) is an open challenge because existing approaches yield isolated, incomparable metrics rather than unified model rankings, making it difficult to track progress across systems. We introduce HumorRank, a tournament-based evaluation framework and leaderboard for textual humor generation. Using SemEval-2026 MWAHAHA test dataset, we conduct an extensive automated pairwise evaluation across nine models spanning proprietary, open-weight, and specialized systems. Pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) are aggregated via an Adaptive Swiss tournament, with Bradley-Terry Maximum Likelihood Estimation (MLE) producing globally consistent humor generation capability rankings. Our results demonstrate that HumorRank yields statistically grounded model stratifications, showing that humor quality is driven by mastery of comedic mechanisms rather than model scale alone. HumorRank thus provides a scalable, interpretable methodology for benchmarking and understanding LLM-generated humor.

Cached at: 04/23/26, 10:02 AM

# HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models
Source: [https://arxiv.org/html/2604.19786](https://arxiv.org/html/2604.19786)
Edward Ajayi, Carnegie Mellon University Africa, Kigali, Rwanda, eaajayi@andrew\.cmu\.edu; Prasenjit Mitra, Carnegie Mellon University Africa, Kigali, Rwanda, prasenjm@andrew\.cmu\.edu

###### Abstract

Evaluating humor in large language models \(LLMs\) is an open challenge because existing approaches yield isolated, incomparable metrics rather than unified model rankings, making it difficult to track progress across systems\. We introduce HumorRank, a tournament\-based evaluation framework and leaderboard for textual humor generation\. Using the SemEval\-2026 MWAHAHA test dataset, we conduct an extensive automated pairwise evaluation across nine models spanning proprietary, open\-weight, and specialized systems\. Pairwise judgments grounded in the General Theory of Verbal Humor \(GTVH\) are aggregated via an Adaptive Swiss tournament, with Bradley\-Terry Maximum Likelihood Estimation \(MLE\) producing globally consistent humor generation capability rankings\. Our results demonstrate that HumorRank yields statistically grounded model stratifications, showing that humor quality is driven by mastery of comedic mechanisms rather than model scale alone\. HumorRank thus provides a scalable, interpretable methodology for benchmarking and understanding LLM\-generated humor\.

## 1 Introduction

Humor generation is a domain that requires a highly nuanced understanding of language, context, and pragmatic reasoning \(Quan et al\., [2025](https://arxiv.org/html/2604.19786#bib.bib27); Kim & Chilton, [2025](https://arxiv.org/html/2604.19786#bib.bib22)\), posing a significant challenge for evaluating the capabilities of large language models \(LLMs\) \(Narad et al\., [2025](https://arxiv.org/html/2604.19786#bib.bib25)\)\. This difficulty is reflected in the fragmented landscape of existing evaluation methods, where different works adopt incompatible paradigms, including humor detection \(Ajayi & Mitra, [2025b](https://arxiv.org/html/2604.19786#bib.bib3); Romanowski et al\., [2025](https://arxiv.org/html/2604.19786#bib.bib29)\), scalar scoring \(Goes et al\., [2022](https://arxiv.org/html/2604.19786#bib.bib15)\), classification \(Wu et al\., [2025a](https://arxiv.org/html/2604.19786#bib.bib41)\), LLM\-as\-a\-Judge approaches \(Shafiei & Saffari, [2025](https://arxiv.org/html/2604.19786#bib.bib31)\), and costly human preference evaluations \(Romanowski et al\., [2025](https://arxiv.org/html/2604.19786#bib.bib29); Horvitz et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib19)\)\.

A central limitation of these approaches is the lack of a unified and scalable framework for comparing models\. Existing methods measure different aspects of humor and do not produce consistent rankings across systems, making it difficult to track progress\. As LLMs increasingly operate in interactive and creative settings, establishing a reliable and comparable evaluation protocol for humor generation becomes essential\.

To address this gap, we introduce HumorRank, a leaderboard\-oriented framework for ranking humor generation in language models\. HumorRank casts evaluation as a pairwise preference problem and aggregates outcomes into a globally consistent ranking\. We evaluate nine models on the SemEval\-2026 Task 1: MWAHAHA dataset \(Castro et al\., [2026](https://arxiv.org/html/2604.19786#bib.bib10)\), demonstrating scalable, interpretable, and reproducible comparison of humor generation systems\.

Our contributions are as follows:

1. We introduce HumorRank, a scalable framework that formulates humor evaluation as a global ranking problem over competing language models\.
2. We formalize humor assessment as a pairwise preference learning task and show that Bradley–Terry estimation yields stable and comparable rankings\.
3. We develop a theory\-grounded LLM\-as\-a\-Judge protocol that produces structured and interpretable evaluation signals, and demonstrate its effectiveness through a large\-scale comparative study on the SemEval\-2026 MWAHAHA test dataset with consistent cross\-judge rankings\.

## 2 HumorRank

The subjective and multidimensional nature of humor presents fundamental challenges to absolute quality scoring\. To address this, we operationalize humor theoretically as a continuous cognitive reward arising from the successful resolution of deliberately constructed linguistic incongruities \(for our full formal definition and its theoretical derivation, see Appendix [A](https://arxiv.org/html/2604.19786#A1)\)\. Because lexical and semantic features of humor such as comedic delivery \(Romanowski et al\., [2025](https://arxiv.org/html/2604.19786#bib.bib29); Kim & Chilton, [2025](https://arxiv.org/html/2604.19786#bib.bib22)\) interact in ways that resist direct quantification or scoring \(Winters & Van der Stockt, [2025](https://arxiv.org/html/2604.19786#bib.bib40)\), we use pairwise comparison, which mitigates these limitations \(Ravi et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib28)\) by constraining the humor evaluation task to a relative preference judgment between two model\-generated jokes conditioned on the same prompt \(Hossain et al\., [2020](https://arxiv.org/html/2604.19786#bib.bib20)\)\. This formulation substantially reduces the cognitive load on the evaluator and is more robust to inter\-annotator variance than uncalibrated scalar annotation\.

While pairwise comparisons provide high\-fidelity local signal, they remain fundamentally discrete and unordered, and are therefore insufficient, in isolation, to support a system\-level leaderboard\. To transform a collection of $\binom{K}{2}$ pairwise outcomes over $K$ competing models into a globally consistent capability ranking, an aggregation framework capable of resolving local inconsistencies and propagating information across the full tournament graph is required\.

HumorRank addresses this through a two\-stage pipeline: an Adaptive Swiss Tournament that efficiently generates the pairwise comparison graph, and a Bradley\-Terry \(BT\) global Maximum Likelihood Estimation \(MLE\) that maps the observed outcomes to statistically grounded, continuous capability estimates\. We additionally derive Stable Elo ratings as a secondary reference metric, reported alongside BT scores for cross\-validation purposes\.

### 2\.1 Bradley\-Terry Global Maximum Likelihood Estimation

The Bradley\-Terry model \(Bradley & Terry, [1952](https://arxiv.org/html/2604.19786#bib.bib9)\) serves as our primary, order\-independent ranking algorithm in this work\. By maximizing the likelihood of the observed pairwise outcomes across the entire tournament graph, the BT model estimates a latent “humor capability” score for each model\.

Given models $i$ and $j$ with latent ratings $R_i$ and $R_j$, the probability that model $i$ wins over model $j$ is formulated as an Elo\-scaled logistic function:

$$P(i \text{ wins against } j) = \frac{1}{1 + 10^{(R_j - R_i)/400}} \tag{1}$$

Instead of sequential updates, HumorRank fits the global Maximum Likelihood Estimation \(MLE\) using the iterative Minorization\-Maximization \(MM\) algorithm until a strict convergence tolerance \($\epsilon < 10^{-6}$\) is met, and ratings are anchored at a base of 1000\. To quantify uncertainty and test whether the separation between model tiers is statistically meaningful, we compute and report 95% confidence intervals by resampling the match history with replacement over 100 bootstrap iterations\.
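The fitting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the function names `fit_bradley_terry` and `bootstrap_ci` are hypothetical, ties are ignored, "anchored at a base of 1000" is read as fixing the mean rating at 1000 (the ratings in Table 1 do average to roughly 1000), and a small floor on strengths is added so that a model with no wins in a resample does not break the logarithm.

```python
import math
import random

def fit_bradley_terry(matches, models=None, tol=1e-6, max_iter=10_000, base=1000.0):
    """Fit Bradley-Terry strengths by MM and return Elo-scaled ratings.

    matches: list of (winner, loser) tuples; ties are ignored in this sketch.
    """
    if models is None:
        models = sorted({m for pair in matches for m in pair})
    wins = {m: 0.0 for m in models}
    games = {}  # number of games per unordered pair
    for winner, loser in matches:
        wins[winner] += 1.0
        key = frozenset((winner, loser))
        games[key] = games.get(key, 0.0) + 1.0

    strength = {m: 1.0 for m in models}
    for _ in range(max_iter):
        updated = {}
        for i in models:
            denom = sum(
                n / (strength[i] + strength[j])
                for key, n in games.items() if i in key
                for j in key - {i}
            )
            # Floor avoids log(0) for a model with no wins (assumption of this sketch).
            updated[i] = max(wins[i] / denom, 1e-12) if denom > 0 else strength[i]
        # Rescale so the geometric mean is 1; win probabilities are unchanged.
        log_mean = sum(math.log(v) for v in updated.values()) / len(models)
        updated = {m: v / math.exp(log_mean) for m, v in updated.items()}
        delta = max(abs(math.log(updated[m]) - math.log(strength[m])) for m in models)
        strength = updated
        if delta < tol:
            break
    # Map to the Elo scale of Equation 1, with the mean rating at `base`.
    return {m: base + 400.0 * math.log10(strength[m]) for m in models}

def bootstrap_ci(matches, n_boot=100, seed=0):
    """Percentile 95% intervals from resampling the match list with replacement."""
    rng = random.Random(seed)
    models = sorted({m for pair in matches for m in pair})
    samples = {m: [] for m in models}
    for _ in range(n_boot):
        resampled = rng.choices(matches, k=len(matches))
        for m, rating in fit_bradley_terry(resampled, models=models).items():
            samples[m].append(rating)
    intervals = {}
    for m, values in samples.items():
        values.sort()
        lower = values[int(0.025 * n_boot)]
        upper = values[min(int(0.975 * n_boot), n_boot - 1)]
        intervals[m] = (lower, upper)
    return intervals
```

With strengths $p_i$, the Bradley\-Terry probability $p_i/(p_i+p_j)$ coincides with Equation 1 once $R_i = 400\log_{10} p_i + c$, which is why the final line only rescales the fitted strengths.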

### 2\.2 Stable Elo \(Sequential Reference\)

While the BT model provides the global MLE, we simultaneously compute a sequential Elo rating \(Albers & Vries, [2001](https://arxiv.org/html/2604.19786#bib.bib4)\) to track dynamic stability and provide a secondary reference metric\. The generalized sequential update rule is:

$$R_{\text{new}} = R_{\text{curr}} + K_{fac} \cdot (S - E) \tag{2}$$

where $K_{fac} = 32$ specifies the maximum volatility factor, $S$ denotes the observed outcome \($1.0$ for a win, $0.5$ for a tie, $0.0$ for a loss\), and $E$ is the expected probability derived from Equation 1\.

A known deficiency of standard Elo is order dependence, wherein the specific sequence of matches heavily influences the final ratings\. HumorRank mitigates this vulnerability by implementing Stable Elo: the entire tournament history is replayed across $N = 5$ randomly shuffled topological orderings\. The final assigned score is the arithmetic mean of the resulting terminal ratings, ensuring sequence\-agnostic stability \(see Appendix [I](https://arxiv.org/html/2604.19786#A9) for convergence proofs across shuffles\)\.
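The replay\-and\-average idea can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, a starting rating of 1000 is assumed, and plain random permutations of the match list stand in for the "randomly shuffled topological orderings" mentioned above.

```python
import random

def expected_score(r_i, r_j):
    """Equation 1: probability that the model rated r_i beats the one rated r_j."""
    return 1.0 / (1.0 + 10 ** ((r_j - r_i) / 400.0))

def replay_elo(matches, k_factor=32.0, base=1000.0):
    """Apply the sequential update of Equation 2 in the given match order.

    matches: iterable of (model_a, model_b, score_a) with score_a in {1.0, 0.5, 0.0}.
    """
    ratings = {}
    for a, b, score_a in matches:
        r_a = ratings.setdefault(a, base)
        r_b = ratings.setdefault(b, base)
        e_a = expected_score(r_a, r_b)
        ratings[a] = r_a + k_factor * (score_a - e_a)
        ratings[b] = r_b - k_factor * (score_a - e_a)  # zero-sum counterpart
    return ratings

def stable_elo(matches, n_shuffles=5, seed=0):
    """Average terminal Elo ratings over several random replays of the history."""
    rng = random.Random(seed)
    totals = {}
    for _ in range(n_shuffles):
        order = list(matches)
        rng.shuffle(order)
        for model, rating in replay_elo(order).items():
            totals[model] = totals.get(model, 0.0) + rating
    return {model: total / n_shuffles for model, total in totals.items()}
```

Averaging over replays reduces, but does not remove, the dependence on match order, which is one reason the BT estimate remains the primary score.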

### 2\.3 Adaptive Swiss Pairing

For large model pools, exhaustive $\mathcal{O}(K^{2})$ pairwise comparisons become financially and computationally prohibitive\. The HumorRank tournament engine resolves this through an Adaptive Swiss Pairing algorithm controlled by a single budget parameter $C_{\max}$\. Rather than random assignment, the engine preferentially matches models of similar standing while avoiding previously observed pairings, maximizing information gain per match and driving Bradley–Terry convergence in $\mathcal{O}(K \log K)$ comparisons\. The coverage depth is determined entirely by $C_{\max}$: setting $C_{\max} = \binom{K}{2} \times P$ \(where $P$ denotes the number of prompts\) yields exhaustive pair coverage across all prompts; reducing $C_{\max}$ below this threshold recovers the sub\-quadratic efficiency without altering the pairing logic \(see Algorithm [1](https://arxiv.org/html/2604.19786#alg1)\)\.

**Algorithm 1** Adaptive Swiss Pairing for Tournament Evaluation

- **Require:** Set of models $\mathcal{M}$, max comparisons $C_{\max}$, tracking Elo ratings $R$
- Initialize match history graph $G \leftarrow (V = \mathcal{M}, E = \emptyset)$
- **while** $|E| < C_{\max}$ **do**
  - Sort $\mathcal{M}$ descending by temporal Elo ratings $R$
  - Initialize unmatched subset $U \leftarrow \mathcal{M}$
  - Initialize round pairings $P \leftarrow \emptyset$
  - **while** $|U| \geq 2$ **do**
    - $i \leftarrow$ highest\-rated model in $U$
    - Find $j \in U \setminus \{i\}$ minimizing $|R_i - R_j|$ where $(i, j)$ is under\-sampled in $G$
    - **if** valid match $j$ exists **then**
      - $P \leftarrow P \cup \{(i, j)\}$
      - $U \leftarrow U \setminus \{i, j\}$
    - **else**
      - **break** $\triangleright$ Fallback resolution for pairing exhaustion
    - **end if**
  - **end while**
  - Execute LLM\-as\-a\-judge evaluations for all pairs in $P$
  - Update match edges $E \leftarrow E \cup P$
  - Update tracking ratings $R$ using observed outcomes
- **end while**
- **return** Global match history graph $G$
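Algorithm 1 can be rendered as a small runnable sketch. Several details are assumptions made for illustration: `judge` is a hypothetical callback standing in for the LLM\-as\-a\-judge call, a pair counts as "under\-sampled" while it has been played fewer than `max_per_pair` times, the per\-prompt dimension is collapsed into that counter, and the tracking ratings are updated with the Elo rule of Equation 2.

```python
def expected_score(r_i, r_j):
    """Equation 1: probability that i beats j on the Elo scale."""
    return 1.0 / (1.0 + 10 ** ((r_j - r_i) / 400.0))

def swiss_round(models, ratings, pair_counts, max_per_pair):
    """One pairing round: closest-rated opponents, skipping saturated pairs."""
    unmatched = sorted(models, key=lambda m: ratings[m], reverse=True)
    pairings = []
    while len(unmatched) >= 2:
        i = unmatched[0]  # highest-rated model still unmatched
        candidates = [
            j for j in unmatched[1:]
            if pair_counts.get(frozenset((i, j)), 0) < max_per_pair
        ]
        if not candidates:
            break  # mirrors the `break` fallback in Algorithm 1
        j = min(candidates, key=lambda m: abs(ratings[i] - ratings[m]))
        pairings.append((i, j))
        unmatched.remove(i)
        unmatched.remove(j)
    return pairings

def run_tournament(models, judge, c_max, max_per_pair=1, k_factor=32.0):
    """judge(i, j) is a hypothetical callback returning 1.0 if i wins, else 0.0."""
    ratings = {m: 1000.0 for m in models}  # assumed starting rating
    pair_counts = {}
    history = []  # list of (i, j, score_i) edges
    while len(history) < c_max:
        pairings = swiss_round(models, ratings, pair_counts, max_per_pair)
        if not pairings:
            break  # no under-sampled pair can be formed any more
        for i, j in pairings[: c_max - len(history)]:
            score_i = judge(i, j)
            shift = k_factor * (score_i - expected_score(ratings[i], ratings[j]))
            ratings[i] += shift
            ratings[j] -= shift
            key = frozenset((i, j))
            pair_counts[key] = pair_counts.get(key, 0) + 1
            history.append((i, j, score_i))
    return history
```

Setting `max_per_pair` to the number of prompts and `c_max` to $\binom{K}{2} \times P$ would correspond to the exhaustive setting; smaller budgets simply stop the loop earlier. Because the sketch keeps the `break`, a round can end with valid lower\-rated pairs still unmatched, which a production implementation might handle differently.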

## 3 Experimental Setup

To empirically validate the HumorRank methodology, we execute a comprehensive large\-scale evaluation using officially recognized humor generation benchmarks\. Our experimental design explicitly tests the framework’s discriminative power across varying model architectures, access paradigms, and parameter scales\. Full reproducibility details, including hyperparameters and computational budget, are provided in Appendix[D](https://arxiv.org/html/2604.19786#A4)\.

### 3\.1 Dataset: SemEval\-2026 MWAHAHA

We leverage the full test set of the SemEval\-2026 Task 1: MWAHAHA dataset \(Castro et al\., [2026](https://arxiv.org/html/2604.19786#bib.bib10)\)\. The dataset comprises 300 distinct prompt conditions encompassing both headline\-based associative humor and constrained word\-combination tasks\. By utilizing an established, external, and heterogeneous test set designed as part of the competition, we ensure robust domain generalizability and inoculate the evaluation against selective cherry\-picking\.

### 3\.2 Model Evaluation Suite

We evaluate a deliberately diverse suite of 9 language models to assess the leaderboard’s capacity to resolve fine\-grained capability differences\. The inclusion criteria strictly span multiple model lineages and access paradigms:

- Frontier Proprietary Models: GPT\-5 \(Singh et al\., [2025](https://arxiv.org/html/2604.19786#bib.bib33)\), Gemini 2\.5 Pro \(Comanici et al\., [2025](https://arxiv.org/html/2604.19786#bib.bib12)\), Claude 3\.5 Haiku \(Anthropic, [2024](https://arxiv.org/html/2604.19786#bib.bib5)\), and Kimi K2 \(Team et al\., [2026](https://arxiv.org/html/2604.19786#bib.bib35)\)\.
- Open\-Weight Competitive Models: Qwen 3 32B \(Bai et al\., [2025](https://arxiv.org/html/2604.19786#bib.bib8)\), GPT OSS 120B \(Agarwal et al\., [2025](https://arxiv.org/html/2604.19786#bib.bib1)\), and Llama 3\.3 70B Instruct \(Grattafiori et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib17)\)\.
- Specialized and Baseline Models: HumorGen 7B \(Ajayi & Mitra, [2025a](https://arxiv.org/html/2604.19786#bib.bib2)\), a fine\-tuned specialist designed explicitly for humor generation, alongside Base Qwen 7B \(Team, [2024](https://arxiv.org/html/2604.19786#bib.bib36)\), acting as the critical zero\-shot architecture control\.

### 3\.3 Evaluation Protocol and LLM\-as\-Judge Ablation

HumorRank employs open\-weight models as pairwise judges, executing the full round\-robin schedule across all 9 contestants \($\binom{9}{2} = 36$ matches per prompt $\times$ 300 prompts $= 10{,}800$ total automated judgments\)\. We structure the judgeship around two complementary roles\.
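The judgment count follows directly from the round\-robin design, and the same arithmetic gives the exhaustive budget $\binom{K}{2} \times P$ for any pool size. A minimal check, with hypothetical helper names:

```python
from math import comb

def exhaustive_budget(num_models, num_prompts):
    """Total judgments when every model pair is compared once on every prompt."""
    return comb(num_models, 2) * num_prompts

def matches_per_model(num_models, num_prompts):
    """Each model meets every other model once per prompt."""
    return (num_models - 1) * num_prompts

total = exhaustive_budget(9, 300)      # 36 pairs x 300 prompts
per_model = matches_per_model(9, 300)  # 8 opponents x 300 prompts
print(total, per_model)
```

For the nine\-model, 300\-prompt setting this reproduces the 10,800 judgments stated above, with each model taking part in 2,400 of them.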

Primary Judge: Llama 3\.3 70B Instruct \(quantized for local inference\) serves as the primary judge, instructed to act as a comedy critic and select the funnier joke between two options with structured reasoning grounded in the GTVH taxonomy \(humor mechanisms, delivery features, and failure modes\)\. The exact prompt template is reproduced in Appendix [E](https://arxiv.org/html/2604.19786#A5)\.

Judge Ablation & Validity Check: A recurrent critique of LLM\-as\-a\-Judge is the risk of judge\-specific preference artifacts\. To address this, we independently re\-evaluate the identical 10,800 matches using Qwen 2\.5 72B as an alternative judge\. High cross\-judge rank correlation would constitute direct evidence of a latent, model\-agnostic humor ordering\. Additionally, to ground the automated pipeline against human perception, we conduct a controlled annotation study on 60 representative pairwise matches, with independent annotators blind to model identity \(annotator guidelines and inter\-annotator agreement metrics are detailed in Appendix [H](https://arxiv.org/html/2604.19786#A8)\)\.

## 4 Results

Our evaluation yields an extensive empirical profile of humor capability across current language models\. We present the system\-level Bradley\-Terry \(BT\) leaderboard, validate its stability across independent LLM judges, and subsequently decompose these ratings into interpretable psychometric features\.

### 4\.1 HumorRank Leaderboard

The main tournament, evaluated by Llama 3\.3 70B across 10,800 matches, reveals a clear, statistically significant stratification of model capabilities\. Table [1](https://arxiv.org/html/2604.19786#S4.T1) details the derived BT ratings, 95% Confidence Intervals, and win rates for the evaluated suite\.

| Rank | Model | BT Rating | 95% CI | Win Rate |
|---|---|---|---|---|
| 1 | GPT\-5 | 1307\.5 | \[1288\.5, 1325\.9\] | 84\.0% |
| 2 | Kimi\-K2 | 1156\.9 | \[1139\.9, 1169\.1\] | 67\.8% |
| 3 | Gemini 2\.5 Pro | 1115\.1 | \[1096\.6, 1128\.1\] | 62\.6% |
| 4 | HumorGen\-7B¹ | 1092\.8 | \[1077\.6, 1108\.5\] | 59\.8% |
| 5 | Claude 3\.5 Haiku | 1037\.5 | \[1022\.4, 1050\.9\] | 52\.7% |
| 6 | GPT OSS 120B | 1015\.0 | \[1004\.2, 1029\.9\] | 49\.8% |
| 7 | Qwen 3 32B | 976\.9 | \[965\.0, 987\.8\] | 45\.0% |
| 8 | Llama 3\.3 70B | 761\.0 | \[745\.1, 781\.4\] | 21\.8% |
| 9 | Base Qwen 7B | 537\.4 | \[513\.5, 566\.7\] | 6\.5% |

¹ HumorGen\-7B is referred to as HumorGen SFT 7B in plots and figures\.

Table 1: Primary HumorRank Leaderboard \(Llama 3\.3 70B Judge\)\. Models cleanly separate into a Frontier Elite tier \($> 1100$\), a Competitive Mid\-Tier \($970$ to $1100$\), and a Weak Baseline \($< 800$\)\.

![Refer to caption](https://arxiv.org/html/2604.19786v1/images/bt_leaderboard_llama.png)

![Refer to caption](https://arxiv.org/html/2604.19786v1/images/winrate_heatmap_llama.png)

Figure 1: HumorRank Leaderboard \(left\) and Pairwise Win\-Rate Heatmap \(right\) showing the performance of the 9 models\.

Remarkably, the specialized HumorGen\-7B model \(Rank 4, BT = 1092\.8\) successfully bridges the gap between the mid\-tier open\-weights and the proprietary frontier, cleanly outperforming models an order of magnitude larger \(e\.g\., GPT OSS 120B, Rank 6\) as shown in Table [1](https://arxiv.org/html/2604.19786#S4.T1) and Figure [1](https://arxiv.org/html/2604.19786#S4.F1)\. It is also notable that the Llama 3\.3 70B judge penalizes its own generations \(ranking itself 8th out of 9, BT = 761\.0; Table [1](https://arxiv.org/html/2604.19786#S4.T1)\), providing a strong empirical counter\-signal to frequent concerns regarding LLM self\-preference bias\.
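Because the BT ratings live on the Elo scale of Equation 1, rating gaps in Table 1 can be read as model\-implied head\-to\-head win probabilities. The snippet below is an illustrative calculation only; the function name is hypothetical and the ratings are copied from Table 1.

```python
def win_probability(r_i, r_j):
    """Equation 1: probability that the model rated r_i beats the one rated r_j."""
    return 1.0 / (1.0 + 10 ** ((r_j - r_i) / 400.0))

# BT ratings copied from Table 1 (Llama 3.3 70B judge).
gpt5, kimi_k2 = 1307.5, 1156.9
humorgen_7b, gpt_oss_120b = 1092.8, 1015.0

p_top = win_probability(gpt5, kimi_k2)              # gap of 150.6 points
p_mid = win_probability(humorgen_7b, gpt_oss_120b)  # gap of 77.8 points
print(round(p_top, 3), round(p_mid, 3))
```

Under the fitted model, GPT\-5 would be preferred over Kimi\-K2 in roughly 70% of direct comparisons, and HumorGen\-7B over GPT OSS 120B in roughly 61%. These are model\-implied values, not the empirical pairwise rates shown in the heatmap.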

### 4\.2 Cross\-Judge Validity and Rank Stability

Evaluating subjective data is inherently sensitive to the choice of the primary evaluator\. To confirm that the derived rankings represent a generalized latent humor ordering rather than Llama\-specific architectural artifacts, we conduct a full ablation, re\-running the identical 10,800 tournament matches with Qwen 2\.5 72B as the evaluator; the resulting ranking is detailed in Appendix [B](https://arxiv.org/html/2604.19786#A2)\.

The resulting Qwen\-adjudicated BT ratings exhibit exceptionally high rank correlation against the primary Leaderboard, achieving a Kendall’s $\tau = 0.889$ \($p < 0.01$\)\. The top\-two models \(GPT\-5, Kimi K2\) and the bottom\-two models remain rigidly fixed, with minimal mid\-tier variance \(see Appendix [B](https://arxiv.org/html/2604.19786#A2)\)\. The absence of intransitive loops \(Transitivity Score = 1\.0\) across both paradigms supports the use of BT modeling for this domain\.
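To give a sense of scale for the reported correlation: for nine models without ties, Kendall's $\tau = (C - D)/36$, so $\tau = 0.889$ corresponds to exactly two discordant pairs out of 36. The sketch below computes tau\-a for two tie\-free rankings. The `hypothetical_alt` ordering is invented purely for illustration (two adjacent mid\-tier swaps); the actual Qwen ordering is in Appendix B and is not reproduced here.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two tie-free rankings given as ordered lists."""
    position_b = {model: k for k, model in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):  # x is ranked above y in rank_a
        if position_b[x] < position_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Primary ordering from Table 1.
primary = ["GPT-5", "Kimi-K2", "Gemini 2.5 Pro", "HumorGen-7B", "Claude 3.5 Haiku",
           "GPT OSS 120B", "Qwen 3 32B", "Llama 3.3 70B", "Base Qwen 7B"]
# HYPOTHETICAL alternative ordering with two adjacent mid-tier swaps.
hypothetical_alt = ["GPT-5", "Kimi-K2", "HumorGen-7B", "Gemini 2.5 Pro", "GPT OSS 120B",
                    "Claude 3.5 Haiku", "Qwen 3 32B", "Llama 3.3 70B", "Base Qwen 7B"]

tau = kendall_tau(primary, hypothetical_alt)
print(round(tau, 3))
```

Any ordering that differs from the primary one in exactly two pairwise relations gives the same value, so the statistic alone does not identify which models moved.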

### 4\.3 Human Evaluation

To assess the reliability of our automated evaluator, we conducted a blind annotation study with human evaluators \($n = 2$, then $n = 3$\) on 60 randomly sampled pairs\. We utilize Krippendorff’s Alpha \($\alpha$\) as it accommodates the subjective nominal nature of humor preference judgments and handles incomplete annotator overlap\. Inter\-annotator reliability is $\alpha = 0.432$ with two evaluators, decreasing to $\alpha = 0.397$ with a third\. This decrease suggests that additional perspectives increase variance, which is consistent with humor’s inherent subjectivity\. Full protocol details are in Appendix [H](https://arxiv.org/html/2604.19786#A8)\.
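For readers unfamiliar with the statistic, nominal Krippendorff's $\alpha$ can be computed from a coincidence matrix as $\alpha = 1 - (n-1)\sum_{c \neq k} o_{ck} / \sum_{c \neq k} n_c n_k$. The sketch below is a generic textbook\-style implementation, not the authors' annotation pipeline; the function name and the toy labels are assumptions, and `None` marks a missing annotation.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    units: list of per-item label sequences, one label per annotator,
    with None where an annotator did not label the item.
    """
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels carry no agreement signal
        for a, b in permutations(values, 2):  # ordered pairs of distinct annotators
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        return 1.0  # only one label was ever used, so no disagreement is possible
    return 1.0 - (n - 1) * observed / expected

# Toy data: two annotators choosing joke "A" or "B" on ten pairs.
toy = [("A", "A")] * 5 + [("A", "B")] * 2 + [("B", "A")] * 2 + [("B", "B")]
alpha = krippendorff_alpha_nominal(toy)
print(round(alpha, 3))
```

The handling of `None` is what allows a third annotator to cover only part of the sample, matching the "incomplete annotator overlap" property cited above.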

### 4\.4 Theory\-Grounded Feature Interpretability

To transcend zero\-dimensional scalar leaderboards, HumorRank utilizes a Layered Psychometric Model grounded in the General Theory of Verbal Humor \(GTVH\) \(Attardo, [2017](https://arxiv.org/html/2604.19786#bib.bib6)\), detailed in Table [2](https://arxiv.org/html/2604.19786#S4.T2)\. We isolate Humor Mechanisms \(deep semantic incongruities\), Delivery Features \(surface psycholinguistic presentations\), and Failure Modes\.

Table 2: Taxonomy of Humor Features extracted from model generations, grounded in the General Theory of Verbal Humor \(GTVH\)\.

Analysis of automated tagging over 10,000 wins delineates highly distinctive comedic signatures across three model archetypes:

- The Frontier Generalist \(e\.g\., GPT\-5\): Relies heavily on pristine delivery, exhibiting exceptionally high Conciseness \(over 30% of wins\) alongside standard Incongruity \(Figure [2](https://arxiv.org/html/2604.19786#S4.F2), right\)\. Its polished, safe generations result in Cliché as its primary failure mode\.
- The Absurdist Specialist \(e\.g\., HumorGen 7B\): Surpasses its scale limitation by mastering deep structural mechanisms, with the highest corpus rates of Absurdity \(25\.8%\) and Sarcasm \(9\.2%\) across all models \(Figure [2](https://arxiv.org/html/2604.19786#S4.F2), left\), and leans on Escalation rather than brevity \(Figure [2](https://arxiv.org/html/2604.19786#S4.F2), right\)\.
- The Weak Baseline \(e\.g\., Llama 3\.3 70B\): Over\-indexes on superficial Wordplay \(29\.5%; Figure [2](https://arxiv.org/html/2604.19786#S4.F2), left\) while chronically suffering from Weak Punchlines \(45% of failure tags; Table [3](https://arxiv.org/html/2604.19786#S4.T3)\), lacking the structural commitment of higher\-rated models\.

Table [3](https://arxiv.org/html/2604.19786#S4.T3) presents a concrete pairwise example illustrating these distinctions\. The corresponding per\-model feature distributions, adjudicated by the primary Llama 3\.3 70B judge, are visualized in Figure [2](https://arxiv.org/html/2604.19786#S4.F2); Qwen 2\.5 72B results are provided in Appendix [F](https://arxiv.org/html/2604.19786#A6) for cross\-judge comparison\.

Table 3: Qualitative pairwise example illustrating contrasting GTVH feature profiles of a winning vs\. losing generation\.

![Refer to caption](https://arxiv.org/html/2604.19786v1/images/feature_winner_heatmap_llama.png)

![Refer to caption](https://arxiv.org/html/2604.19786v1/images/feature_delivery_heatmap_llama.png)

Figure 2: Per\-model winning feature distributions \(Llama 3\.3 70B judge\)\. Left: Humor mechanisms \(% of wins\)\. Right: Delivery features \(% of wins\)\. Frontier models dominate via Conciseness; the specialist model leads on Absurdity and Escalation; baseline models over\-index on Wordplay\.

Complementing the winner profiles, Figure [3](https://arxiv.org/html/2604.19786#S4.F3) reveals the failure mode distribution for each model\. Notably, the dominant failure modes for most models are Cliché and Weak Punchline, but HumorGen\-7B exhibits a strikingly different signature: it accrues substantially higher Overexplained \(25\.2%\) and Buried Punchline \(20\.4%\) failure rates than any other model, indicating that its aggressive structural depth occasionally overreaches, building setups that collapse before delivery rather than landing superficial wordplay\.

![Refer to caption](https://arxiv.org/html/2604.19786v1/images/feature_loser_heatmap_llama.png)

Figure 3: Per\-model failure mode distributions \(Llama 3\.3 70B judge\)\. Cliché and Weak Punchline dominate most models, but HumorGen\-7B stands out with markedly higher Overexplained and Buried Punchline rates, a byproduct of its deep\-structure comedic strategy\. Qwen 2\.5 72B failure modes are in Appendix [F](https://arxiv.org/html/2604.19786#A6)\.

This feature landscape suggests that scaling alone yields refined delivery, whereas specialized alignment appears necessary for deep structural absurdity, a finding consistent across both judge architectures \(Appendix [F](https://arxiv.org/html/2604.19786#A6)\)\.

## 5 Related Works

### 5\.1 Model Evaluation in Humor Generation Systems

Despite growing interest in LLM humor capabilities, existing evaluation protocols remain inconsistent and largely incomparable across studies\. Existing approaches span a wide range of paradigms: automated metrics applied to human\-AI co\-creative humor \(Wu et al\., [2025b](https://arxiv.org/html/2604.19786#bib.bib42)\); crowd\-sourced AI voting panels that assess different humor types independently \(Goes et al\., [2022](https://arxiv.org/html/2604.19786#bib.bib15)\); Best\-Worst Scaling \(BWS\), which elicits relative judgments across joke sets rather than absolute ratings \(Yamane, [2024](https://arxiv.org/html/2604.19786#bib.bib43)\); funniness rating scales and Likert\-style templates \(Gorenz & Schwarz, [2024](https://arxiv.org/html/2604.19786#bib.bib16)\); and fully human evaluation, which, while considered the gold standard, is costly, slow, and typically limited to small validation sets \(Zhang et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib44); Goel et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib14); Wang et al\., [2025](https://arxiv.org/html/2604.19786#bib.bib39); Jain et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib21)\)\. Broader evaluations of LLM humor understanding and generation ability \(Ajayi & Mitra, [2025b](https://arxiv.org/html/2604.19786#bib.bib3); Zhou et al\., [2025](https://arxiv.org/html/2604.19786#bib.bib47); Song et al\., [2025](https://arxiv.org/html/2604.19786#bib.bib34)\) similarly rely on these fragmented protocols\.

Critically, none of these approaches produce *ranked preference orderings* across multiple generation systems, nor do they provide interpretable rationale for their judgments\. Each evaluation is essentially a one\-off measurement rather than a scalable, comparative framework\. This limitation stands in contrast to the broader NLP evaluation literature, where tournament\-based rating systems have emerged as a principled and scalable alternative, a development we discuss in Section [5\.3](https://arxiv.org/html/2604.19786#S5.SS3)\.

### 5\.2 Computational Humor: Datasets, Theory, and Generation

The study of humor is rooted in psychology \(Larkin\-Galiñanes, [2017](https://arxiv.org/html/2604.19786#bib.bib23)\) and linguistics \(Attardo, [2024](https://arxiv.org/html/2604.19786#bib.bib7)\), with classical theories such as superiority, relief, and incongruity providing explanatory accounts of what constitutes humor \(Veatch, [1998](https://arxiv.org/html/2604.19786#bib.bib38)\)\. These frameworks motivate interpretable dimensions of humor, such as expectation violation, tension release, and social positioning, that remain influential in computational modeling\. However, they do not provide a deterministic recipe for generation \(Larkin\-Galiñanes, [2017](https://arxiv.org/html/2604.19786#bib.bib23)\) because humor is inherently subjective, varying across context, culture, and individual perception\. Despite this, linguistic and pragmatic analyses identify relatively stable cues, such as timing, delivery, ambiguity, and form–meaning incongruity, that support both dataset construction and automated evaluation\.

Building on these foundations, prior work has introduced a range of humor benchmarks, historically focused on text\-based tasks\. More recently, advances in large language models have expanded this landscape to include multimodal datasets and evaluation settings spanning humor generation, understanding, and ranking \(Zhong et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib46); Zhang et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib44); He et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib18); Ryan et al\., [2025](https://arxiv.org/html/2604.19786#bib.bib30); Jain et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib21)\)\. While these developments broaden the empirical scope of computational humor, they also underscore the need for scalable and comparable evaluation methodologies\. In particular, reliably assessing humor requires frameworks that can accommodate subjectivity while enabling consistent comparison across models, motivating structured, ranking\-based approaches to evaluation\.

### 5\.3 LLM Leaderboard Rating Systems in NLP Tasks

The use of leaderboard\-based evaluation has become increasingly prevalent in NLP, providing a standardized framework for comparing model performance across tasks and benchmarks \(Toloka Team, [2023](https://arxiv.org/html/2604.19786#bib.bib37); Chiang et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib11); Myrzakhan et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib24)\)\. Modern leaderboard platforms, such as Chatbot Arena \(Chiang et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib11)\) and the Open LLM Leaderboard \(Silva et al\., [2026](https://arxiv.org/html/2604.19786#bib.bib32)\), often leverage LLM\-as\-a\-Judge paradigms \(Zheng et al\., [2023](https://arxiv.org/html/2604.19786#bib.bib45)\) to enable scalable and automated evaluation of model outputs\. This approach is particularly attractive as it supports both human and model\-based preference judgments, allowing for flexible and cost\-effective evaluation pipelines\. Furthermore, leaderboard\-based systems facilitate direct comparison of models under consistent conditions, making them a practical choice for benchmarking progress in NLP \(Federiakin, [2025](https://arxiv.org/html/2604.19786#bib.bib13); Myrzakhan et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib24)\)\. Prior work suggests that such ranking frameworks provide a reliable proxy for model quality and can be adapted to diverse settings, including multilingual and domain\-specific evaluation scenarios \(Park et al\., [2024](https://arxiv.org/html/2604.19786#bib.bib26); Silva et al\., [2026](https://arxiv.org/html/2604.19786#bib.bib32)\)\.

## 6 Conclusions

We presented HumorRank, a leaderboard\-oriented framework for comparing LLM humor generation under conditioned prompts\. The pipeline combines exhaustive pairwise coverage via Adaptive Swiss pairing with Bradley–Terry MLE \(primary\) and Stable Elo as a sequence\-robust reference, enabling scalable evaluation that extends to sub\-quadratic match budgets for larger model pools\. On SemEval\-2026 MWAHAHA \(300 prompts\) \(Castro et al\., [2026](https://arxiv.org/html/2604.19786#bib.bib10)\), we rank nine large language models spanning proprietary and open\-weight systems\. We used Llama 3\.3 70B as the primary judge and Qwen 2\.5 72B as a secondary evaluator, observing high rank agreement \(Kendall’s $\tau = 0.889$\), which indicates that the induced rankings are stable across distinct judge architectures\.

Empirically, frontier models cluster at the top of the leaderboard; however, a domain\-specialized 7B model \(HumorGen 7B; Ajayi & Mitra, [2025a](https://arxiv.org/html/2604.19786#bib.bib2)\) achieves competitive performance with substantially larger general\-purpose systems, supporting the hypothesis that humor quality depends on mechanisms beyond scale alone\. Post\-hoc analysis of judge rationales, grounded in the General Theory of Verbal Humor, reveals consistent patterns in winning responses, highlighting the importance of delivery, timing, and structured incongruity in distinguishing high\-quality humor\.

HumorRank is designed as a reusable evaluation stack, enabling consistent comparison as new models and prompts emerge\. Preliminary human evaluation indicates directional alignment between automated judgments and human preferences, with a fuller annotation study ongoing\. We defer a detailed discussion of limitations to Section [7](https://arxiv.org/html/2604.19786#S7) and outline directions for expanding coverage and validation in future work\. This work establishes humor generation as a comparable and rankable capability, enabling systematic evaluation beyond isolated metrics\.

## 7 Limitations

We note several limitations of the current study:

- Monolingual Scope: Evaluation is conducted on English data; cross\-lingual and cross\-cultural humor settings are not examined in this work\.
- Model Coverage: Experiments are limited to nine models, which do not exhaust the full space of contemporary systems\.
- Dataset Scope: Evaluation is directly anchored to the SemEval\-2026 MWAHAHA test dataset\. While it rigorously tests generative capability within its established distribution, it does not evaluate performance across alternative interactive or multimodal humor paradigms\.

## 8 Reproducibility Statement

To ensure full reproducibility of the HumorRank framework, we detail the core hyperparameter configurations and computational hardware requirements necessary to execute the generative tournament\.

Generation Hyperparameters: For all candidate models evaluated in the tournament, we standardized the generation settings to prioritize creative diversity while maintaining structural coherence\. Specifically, we configured the candidate models with a standard sampling temperature of $T = 0.7$\. Default nucleus sampling \(top\-$p = 1.0$\) and token limit thresholds were inherited directly from the respective model APIs to preserve native instructional adherence without imposing artificial truncation\.

Judge Hyperparameters: The LLM judges \(Llama 3\.3 70B and Qwen 2\.5 72B\) were configured with a low sampling temperature of $T = 0.1$ and a maximum retry threshold of 3 \(with exponential backoff\) for all pairwise JSON evaluation calls\. Given the substantial financial and computational cost of the tournament, this near\-deterministic setting keeps the judge stable, so that it does not yield erratic or contradictory evaluations of the same prompt upon reassessment, thereby preserving the integrity of the Bradley\-Terry ratings\.
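The retry behavior described above can be sketched as follows\. This is a minimal illustration rather than the released implementation: `call_judge` is a hypothetical callable standing in for the client that queries the judge model, the one\-second base delay is an assumption \(the source does not state it\), and the retry threshold is read here as one initial attempt plus up to three retries\.

```python
import json
import time

JUDGE_TEMPERATURE = 0.1  # from the text above
MAX_RETRIES = 3          # from the text above
BASE_DELAY_S = 1.0       # assumption: the source does not state the base delay


def judge_with_retries(call_judge, prompt, sleep=time.sleep):
    """Call a judge and parse its JSON reply, retrying with exponential backoff.

    `call_judge(prompt, temperature=...)` is a hypothetical callable that
    returns the judge's raw text reply.
    """
    last_error = None
    for attempt in range(MAX_RETRIES + 1):  # initial attempt + MAX_RETRIES retries
        try:
            raw = call_judge(prompt, temperature=JUDGE_TEMPERATURE)
            return json.loads(raw)
        except (json.JSONDecodeError, RuntimeError) as err:
            last_error = err
            if attempt < MAX_RETRIES:
                sleep(BASE_DELAY_S * 2 ** attempt)  # waits 1s, 2s, 4s
    raise RuntimeError(f"judge failed after {MAX_RETRIES} retries") from last_error
```

Passing `sleep` as a parameter keeps the backoff schedule testable without real waiting\.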

Computational Hardware: Running the HumorRank tournament is computationally intensive\. Orchestrating the complete 10,800\-match evaluation sequence for both the primary and secondary leaderboard rankings required 48 hours of dedicated NVIDIA H100 \(80GB\) GPU compute time\. Replicating the tournament at the same comparative throughput calls for a similar hardware footprint\. To ensure full reproducibility, our tournament code, evaluation scripts, and the Adaptive Swiss pairing implementation are provided as a ZIP file in the supplementary materials\.
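The 10,800 figure is consistent with a full round robin over the reported setup: nine models form 36 unordered pairs, and 36 pairs over 300 prompts give 10,800 comparisons\. The short check below reproduces this arithmetic\. Reading the count as one comparison per model pair per prompt is an inference from the reported numbers, not a statement taken from this section\.

```python
from itertools import combinations

NUM_MODELS = 9     # models on the leaderboard
NUM_PROMPTS = 300  # SemEval-2026 MWAHAHA prompts used

# Every unordered pair of models, assumed to be compared once per prompt.
pairs = list(combinations(range(NUM_MODELS), 2))
matches_total = len(pairs) * NUM_PROMPTS
matches_per_model = (NUM_MODELS - 1) * NUM_PROMPTS

print(len(pairs), matches_total, matches_per_model)  # 36 10800 2400
```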

## 9 Ethics Statement

This work proposes a framework for the systematic evaluation and ranking of humor generation across large language models; it does not itself constitute a humor generation system\. Two ethical considerations warrant explicit acknowledgment\. First, humor is a culturally and contextually variable phenomenon whose boundaries with offensive or exclusionary expression are highly sensitive to audience and setting\. Evaluation frameworks that rank models on comedic output implicitly surface content generated by those models, and practitioners adapting such pipelines for downstream applications bear responsibility for enforcing appropriate content\-moderation constraints\. Second, the validity of automated ranking is constrained by the cultural and stylistic distribution of the judge model’s pretraining corpus\. LLM\-based evaluators trained predominantly on high\-resource, Western\-centric text may systematically disadvantage humor conventions from linguistically or culturally underrepresented communities—a limitation shared by the broader LLM\-as\-a\-judge literature\. HumorRank should therefore be interpreted as a reproducible diagnostic benchmark rather than a definitive assessment of comedic or creative quality\.

## References

- Agarwal et al\. \(2025\)Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al\.gpt\-oss\-120b & gpt\-oss\-20b model card\.*arXiv preprint arXiv:2508\.10925*, 2025\.
- Ajayi & Mitra \(2025a\)Edward Ajayi and Prasenjit Mitra\.Humorgen: Cognitive synergy for humor generation in large language models via persona\-based distillation\.[https://huggingface\.co/Jayi2424/HumorGen\-7B](https://huggingface.co/Jayi2424/HumorGen-7B), 2025a\.Preprint\.
- Ajayi & Mitra \(2025b\)Edward Ajayi and Prasenjit Mitra\.Automatic humor detection: A comprehensive survey from theoretical foundations to large language models\.December 2025b\.doi:10\.13140/RG\.2\.2\.24393\.61288\.URL[https://doi\.org/10\.13140/RG\.2\.2\.24393\.61288](https://doi.org/10.13140/RG.2.2.24393.61288)\.Preprint\.
- Albers & Vries \(2001\)Paul CH Albers and Han de Vries\.Elo\-rating as a tool in the sequential estimation of dominance strengths\.*Animal behaviour*, pp\. 489–495, 2001\.
- Anthropic \(2024\)Anthropic\.The claude 3 model family: Opus, sonnet, haiku\.[https://www\-cdn\.anthropic\.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model\_Card\_Claude\_3\.pdf](https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf), 2024\.Model card\.
- Attardo \(2017\)Salvatore Attardo\.The general theory of verbal humor\.In*The Routledge handbook of language and humor*, pp\. 126–142\. Routledge, 2017\.
- Attardo \(2024\)Salvatore Attardo\.*Linguistic theories of humor*, volume 1\.Walter de Gruyter GmbH & Co KG, 2024\.
- Bai et al\. \(2025\)Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al\.Qwen3\-vl technical report\.*arXiv preprint arXiv:2511\.21631*, 2025\.
- Bradley & Terry \(1952\)Ralph Allan Bradley and Milton E Terry\.Rank analysis of incomplete block designs: I\. the method of paired comparisons\.*Biometrika*, 39\(3/4\):324–345, 1952\.
- Castro et al\. \(2026\)Santiago Castro, Luis Chiruzzo, Santiago Góngora, Salar Rahili, Naihao Deng, Ignacio Sastre, Victoria Amoroso, Guillermo Rey, Aiala Rosá, Guillermo Moncecchi, J\. A\. Meaney, Juan José Prada, and Rada Mihalcea\.SemEval\-2026 Task 1: MWAHAHA, Models Write Automatic Humor And Humans Annotate\.In*Proceedings of the 20th International Workshop on Semantic Evaluation \(SemEval\-2026\)*, 2026\.
- Chiang et al\. \(2024\)Wei\-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al\.Chatbot arena: An open platform for evaluating llms by human preference\.In*Forty\-first International Conference on Machine Learning*, 2024\.
- Comanici et al\. \(2025\)Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al\.Gemini 2\.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.*arXiv preprint arXiv:2507\.06261*, 2025\.
- Federiakin \(2025\)Denis Federiakin\.Improving llm leaderboards with psychometrical methodology\.*arXiv preprint arXiv:2501\.17200*, 2025\.
- Goel et al\. \(2024\)Mayank Goel, Parameswari Krishnamurthy, and Radhika Mamidi\.Automating humor: A novel approach to joke generation using template extraction and infilling\.In*Proceedings of the 21st International Conference on Natural Language Processing \(ICON\)*, pp\. 442–448, 2024\.
- Goes et al\. \(2022\)Fabricio Goes, Zisen Zhou, Piotr Sawicki, Marek Grzes, and Daniel G Brown\.Crowd score: A method for the evaluation of jokes using large language model ai voters as judges\.*arXiv preprint arXiv:2212\.11214*, 2022\.
- Gorenz & Schwarz \(2024\)Drew Gorenz and Norbert Schwarz\.How funny is chatgpt? a comparison of human\-and ai\-produced jokes\.*Plos one*, 19\(7\):e0305364, 2024\.
- Grattafiori et al\. \(2024\)Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al\.The llama 3 herd of models\.*arXiv preprint arXiv:2407\.21783*, 2024\.
- He et al\. \(2024\)Ruiqi He, Yushu He, Longju Bai, Jiarui Liu, Zhenjie Sun, Zenghao Tang, He Wang, Hanchen Xia, Rada Mihalcea, and Naihao Deng\.Chumor 2\.0: Towards benchmarking chinese humor understanding\.*arXiv preprint arXiv:2412\.17729*, 2024\.
- Horvitz et al\. \(2024\)Zachary Horvitz, Jingru Chen, Rahul Aditya, Harshvardhan Srivastava, Robert West, Zhou Yu, and Kathleen McKeown\.Getting serious about humor: Crafting humor datasets with unfunny large language models\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\)*, pp\. 855–869, 2024\.
- Hossain et al\. \(2020\)Nabil Hossain, John Krumm, Michael Gamon, and Henry Kautz\.Semeval\-2020 task 7: Assessing humor in edited news headlines\.In*Proceedings of the fourteenth workshop on semantic evaluation*, pp\. 746–758, 2020\.
- Jain et al\. \(2024\)Veedant Jain, Felipe dos Santos Alves Feitosa, and Gabriel Kreiman\.Is ai fun? humordb: a curated dataset and benchmark to investigate graphical humor\.*arXiv preprint arXiv:2406\.13564*, 2024\.
- Kim & Chilton \(2025\)Sean Kim and Lydia B Chilton\.Ai humor generation: Cognitive, social and creative skills for effective humor\.*arXiv preprint arXiv:2502\.07981*, 2025\.
- Larkin\-Galiñanes \(2017\)Cristina Larkin\-Galiñanes\.An overview of humor theory\.*The Routledge handbook of language and humor*, pp\. 4–16, 2017\.
- Myrzakhan et al\. \(2024\)Aidar Myrzakhan, Sondos Mahmoud Bsharat, and Zhiqiang Shen\.Open\-llm\-leaderboard: From multi\-choice to open\-style questions for llms evaluation, benchmark, and arena\.*arXiv preprint arXiv:2406\.07545*, 2024\.
- Narad et al\. \(2025\)Reuben Narad, Siddharth Suresh, Jiayi Chen, Pine SL Dysart\-Bricken, Bob Mankoff, Robert Nowak, Jifan Zhang, and Lalit Jain\.Which llms get the joke? probing non\-stem reasoning abilities with humorbench\.*arXiv preprint arXiv:2507\.21476*, 2025\.
- Park et al\. \(2024\)Chanjun Park, Hyeonwoo Kim, Dahyun Kim, Seonghwan Cho, Sanghoon Kim, Sukyung Lee, Yungi Kim, and Hwalsuk Lee\.Open ko\-llm leaderboard: Evaluating large language models in korean with ko\-h5 benchmark\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 3220–3234, 2024\.
- Quan et al\. \(2025\)Kexin Quan, Pavithra Ramakrishnan, and Jessie Chin\.Can ai take a joke—or make one? a study of humor generation and recognition in llms\.In*Proceedings of the 2025 Conference on Creativity and Cognition*, pp\. 431–437, 2025\.
- Ravi et al\. \(2024\)Sahithya Ravi, Patrick Huber, Akshat Shrivastava, Vered Shwartz, and Arash Einolghozati\.Small but funny: A feedback\-driven approach to humor distillation\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 13078–13090, 2024\.
- Romanowski et al\. \(2025\)Adrianna Romanowski, Pedro HV Valois, and Kazuhiro Fukui\.From punchlines to predictions: A metric to assess llm performance in identifying humor in stand\-up comedy\.In*Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics*, pp\. 36–46, 2025\.
- Ryan et al\. \(2025\)Yuriel Ryan, Rui Yang Tan, Kenny Tsu Wei Choo, and Roy Ka\-Wei Lee\.Humor in pixels: Benchmarking large multimodal models understanding of online comics\.*arXiv preprint arXiv:2509\.12248*, 2025\.
- Shafiei & Saffari \(2025\)Mohammadamin Shafiei and Hamidreza Saffari\.Not all jokes land: Evaluating large language models understanding of workplace humor\.*arXiv preprint arXiv:2506\.01819*, 2025\.
- Silva et al\. \(2026\)João Silva, Luís Gomes, and António Branco\.Clarin\-pt\-ldb: An open llm leaderboard for portuguese to assess language, culture and civility\.*arXiv preprint arXiv:2603\.12872*, 2026\.
- Singh et al\. \(2025\)Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El\-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al\.Openai gpt\-5 system card\.*arXiv preprint arXiv:2601\.03267*, 2025\.
- Song et al\. \(2025\)Changhao Song, Yazhou Zhang, Hui Gao, Ben Yao, and Peng Zhang\.Large language models for subjective language understanding: A survey\.*arXiv preprint arXiv:2508\.07959*, 2025\.
- Team et al\. \(2026\)Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al\.Kimi k2\. 5: Visual agentic intelligence\.*arXiv preprint arXiv:2602\.02276*, 2026\.
- Team \(2024\)Qwen Team\.Qwen2\.5: A party of foundation models, September 2024\.URL[https://qwenlm\.github\.io/blog/qwen2\.5/](https://qwenlm.github.io/blog/qwen2.5/)\.
- Toloka Team \(2023\)Toloka Team\.Understanding llm leaderboards: Metrics, benchmarks, and why they matter, November 2023\.URL[https://toloka\.ai/blog/llm\-leaderboard/](https://toloka.ai/blog/llm-leaderboard/)\.Accessed: 2026\-03\-23\.
- Veatch \(1998\)Thomas C Veatch\.A theory of humor\.1998\.
- Wang et al\. \(2025\)Han Wang, Yilin Zhao, Dian Li, Xiaohan Wang, Gang Liu, Xuguang Lan, and Hui Wang\.Innovative Thinking, Infinite Humor: Humor Research of Large Language Models through Structured Thought Leaps, April 2025\.URL[http://arxiv\.org/abs/2410\.10370](http://arxiv.org/abs/2410.10370)\.arXiv:2410\.10370 \[cs\]\.
- Winters & Van der Stockt \(2025\)Thomas Winters and Stijn Van der Stockt\.Evaluating humor generation in an improvisational comedy setting\.*Computational Linguistics in the Netherlands Journal*, 14:505–523, 2025\.
- Wu et al\. \(2025a\)Shih\-Hung Wu, Tsz\-Yeung Lau, and Yu\-Feng Huang\.Humour classification according to genre and technique by fine\-tuning llms\.In*International Conference of the Cross\-Language Evaluation Forum for European Languages*, pp\. 156–169\. Springer, 2025a\.
- Wu et al\. \(2025b\)Zhikun Wu, Thomas Weber, and Florian Müller\.One does not simply meme alone: Evaluating co\-creativity between llms and humans in the generation of humor\.In*Proceedings of the 30th International Conference on Intelligent User Interfaces*, pp\. 1082–1092, 2025b\.
- Yamane \(2024\)Hiroaki Yamane\.Generic joke generation with moral constraints\.In*International Conference on Artificial Neural Networks*, pp\. 340–355\. Springer, 2024\.
- Zhang et al\. \(2024\)Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan L Zhou, Siddharth Suresh, Andrew Wagenmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, et al\.Humor in ai: Massive scale crowd\-sourced preferences and benchmarks for cartoon captioning\.*Advances in Neural Information Processing Systems*, 37:125264–125286, 2024\.
- Zheng et al\. \(2023\)Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al\.Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.*Advances in neural information processing systems*, 36:46595–46623, 2023\.
- Zhong et al\. \(2024\)Shanshan Zhong, Zhongzhan Huang, Shanghua Gao, Wushao Wen, Liang Lin, Marinka Zitnik, and Pan Zhou\.Let’s Think Outside the Box: Exploring Leap\-of\-Thought in Large Language Models with Creative Humor Generation\.pp\. 13246–13257, 2024\.URL[https://openaccess\.thecvf\.com/content/CVPR2024/html/Zhong\_Lets\_Think\_Outside\_the\_Box\_Exploring\_Leap\-of\-Thought\_in\_Large\_Language\_CVPR\_2024\_paper\.html](https://openaccess.thecvf.com/content/CVPR2024/html/Zhong_Lets_Think_Outside_the_Box_Exploring_Leap-of-Thought_in_Large_Language_CVPR_2024_paper.html)\.
- Zhou et al\. \(2025\)Kuan Lok Zhou, Jiayi Chen, Siddharth Suresh, Reuben Narad, Timothy T Rogers, Lalit K Jain, Robert D Nowak, Bob Mankoff, and Jifan Zhang\.Bridging the creativity understanding gap: Small\-scale human alignment enables expert\-level humor ranking in llms\.*arXiv preprint arXiv:2502\.20356*, 2025\.

## Appendix — Table of Contents

- [A](https://arxiv.org/html/2604.19786#A1) Humor Definition
- [B](https://arxiv.org/html/2604.19786#A2) HumorRank Leaderboard Performance with Qwen 2\.5 72B LLM Judge
- [D](https://arxiv.org/html/2604.19786#A4) Hyperparameter Configurations
- [E](https://arxiv.org/html/2604.19786#A5) LLM\-as\-a\-Judge Prompting Framework
- [F](https://arxiv.org/html/2604.19786#A6) Qualitative Examples and Feature Reasoning
- [H](https://arxiv.org/html/2604.19786#A8) Human Evaluation Details
- [I](https://arxiv.org/html/2604.19786#A9) Mathematical Stability of the Generative Tournament

## Appendix A Humor Definition

Building upon classic Incongruity Theory, psychological frameworks \(Larkin\-Galiñanes, [2017](https://arxiv.org/html/2604.19786#bib.bib23)\), and Normative\-Violation theory \(Veatch, [1998](https://arxiv.org/html/2604.19786#bib.bib38)\), we require a rigorous working definition that can be applied to text\-based evaluation\. For the purposes of this research, we define humor explicitly as:

> Humor is the cognitive reward—experienced as amusement—arising when an interlocutor successfully resolves a deliberately constructed incongruity, such as the narrative shift between a joke’s setup and punchline, within a harmless and non\-threatening context\.

This definition anchors psychological consensus into the practical reality of evaluating generated text\. It explicitly requires four distinct components:

- The Joke Mechanism \(Setup & Punchline\): We evaluate humor not as random surprise, but as a structured linguistic narrative\. The setup creates a logical expectation, and the punchline deliberately subverts it\.
- The “Cognitive Reward”: This maps to the cognitive appraisal process, describing the computational or intellectual achievement of bridging the logical gap between the setup and punchline\.
- Experienced as Amusement: The cognitive resolution must trigger a pleasant response \(mirth\) rather than confusion\.
- Harmless Context: Drawn from benign violation theory, the structural incongruity only produces amusement if it is appraised as non\-threatening\.

Because humor exists on a continuous spectrum determined by these mechanisms rather than as a discrete label, our methodology utilizes pairwise preference ranking\. By prompting the judge to evaluate which generation produces a stronger cognitive reward, we effectively treat humor evaluation as a reward modeling paradigm across the multi\-dimensional feature space of human amusement\.

## Appendix B HumorRank Leaderboard Performance with Qwen 2\.5 72B LLM Judge

To validate the stability of our primary leaderboard, which was generated using Llama 3\.3 70B as a judge, we replicated the evaluation using Qwen 2\.5 72B as the judge\. Figure [4](https://arxiv.org/html/2604.19786#A2.F4) presents the resulting Bradley\-Terry leaderboard from this independent evaluation pipeline\.

![Refer to caption](https://arxiv.org/html/2604.19786v1/images/bt_leaderboard_qwen.png)

![Refer to caption](https://arxiv.org/html/2604.19786v1/images/winrate_heatmap_qwen.png)

Figure 4: HumorRank Leaderboard \(top\) and Pairwise Win\-Rate Heatmap \(bottom\) showing the performance of the 9 models\.

| Rank | Model | BT Rating | 95% CI | Win Rate |
|---|---|---|---|---|
| 1 | GPT\-5 | 1247\.7 | \[1192\.6, 1300\.3\] | 78\.8% |
| 2 | Kimi\-K2 | 1160\.9 | \[1118\.8, 1220\.6\] | 68\.8% |
| 3 | Claude 3\.5 Haiku | 1108\.0 | \[1052\.8, 1154\.7\] | 62\.1% |
| 4 | HumorGen\-7B¹ | 1104\.8 | **\[1070\.9, 1149\.4\]** | 61\.7% |
| 5 | Gemini 2\.5 Pro | 1075\.8 | \[1041\.6, 1115\.5\] | 57\.9% |
| 6 | Qwen 3 32B | 992\.1 | \[955\.3, 1046\.3\] | 47\.1% |
| 7 | GPT OSS 120B | 985\.5 | \[947\.7, 1024\.7\] | 46\.2% |
| 8 | Llama 3\.3 70B | 714\.0 | \[639\.0, 775\.8\] | 17\.5% |
| 9 | Base Qwen 7B | 611\.2 | \[533\.1, 666\.6\] | 10\.0% |

Table 4: Validation HumorRank Leaderboard \(Qwen 2\.5 72B Judge\)\. The hierarchy maintains strong alignment with the primary leaderboard \($\tau = 0.889$\), cleanly separating the frontier models, the mid\-tier models \(including the domain\-specialist HumorGen\-7B\), and the weak baselines\.

¹ HumorGen\-7B is referred to as HumorGen SFT 7B in plots and figures\.
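As a reading aid for the reported agreement: for nine models and no tied ranks, Kendall’s $\tau$ equals $(C - D)/36$, where $C$ and $D$ count concordant and discordant model pairs\. A value of $0.889 \approx 32/36$ therefore corresponds to 34 concordant and 2 discordant pairs\. The sketch below reproduces that value with two hypothetical rankings; the swapped positions are illustrative only and are not the actual differences between the two leaderboards\.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings (dicts: item -> rank), assuming no ties."""
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical rankings of nine placeholder models m1..m9 (not the paper's data).
judge_1 = {f"m{i}": i for i in range(1, 10)}
judge_2 = dict(judge_1)
judge_2["m3"], judge_2["m4"] = judge_1["m4"], judge_1["m3"]  # one adjacent swap
judge_2["m6"], judge_2["m7"] = judge_1["m7"], judge_1["m6"]  # a second adjacent swap

tau = kendall_tau(judge_1, judge_2)
print(round(tau, 3))  # 0.889
```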
## Appendix C LLaMA Judge Sample Decision Making

Below are representative LLaMA judge decisions drawn from the evaluation log\. Each pair shows the preferred joke \(green\) against the rejected entry \(red\), together with the judge’s reasoning and ELO updates\.

Legend: Winning Features · Delivery Features · Loser Features

Decision \#1 · en\_2051 · Tie: No · Confidence: Medium · Score: 1\.0 · ELO Δ: \+7\.66 / −7\.66

Headline: ‘We did it\!’: Countries celebrate World Cup qualification

Reasoning: Joke A has a more relatable and unexpected twist—the concept of “not washing lucky socks” as a claimed contribution to the team’s success makes it more humorous and engaging\.

Winning Features: incongruity · absurdity · surprise · irony

Delivery Features: conciseness · punchline\_positioning · framing\_commitment

Loser Features: cliché · weak\_punchline

Gemini\-2\.5\-Pro · Winner ✓: After my team qualified for the World Cup, I ran into the street yelling, “WE DID IT\!” My neighbor looked at me and said, “What exactly did *you* do?” I told him, “I successfully avoided jinxing them by not washing my lucky socks for six months\. You’re welcome\.”

Claude\-3\.5\-Haiku · Loser ×: Why did the soccer player bring a ladder to celebrate World Cup qualification? Because they wanted to take their excitement to the next level… literally, they were already on cloud nine\!

Decision \#2 · en\_2021 · Tie: No · Confidence: Medium · Score: 1\.0 · ELO Δ: \+7\.14 / −7\.14

Headline: ‘Mind\-blowing’ population of feral goldfish in Ottawa pond to be euthanized

Reasoning: Joke A uses a clever play on words with ‘mind\-blowing’ and creates a vivid, absurd image; Joke B relies on a more straightforward and less surprising punchline\.

Winning Features: wordplay · absurdity · incongruity

Delivery Features: conciseness · punchline\_positioning · framing\_commitment

Loser Features: cliché · weak\_punchline

HumorGen\-SFT\-7B · Winner ✓: So, they’re going to euthanize these ‘mind\-blowing’ feral goldfish in Ottawa’s pond? I hope they have anesthetic for fish because if you don’t do it right, those goldfish might just get their last swim as they float up to heaven saying, “Mind\-blowing, indeed\!”

GPT\-OSS\-120B · Loser ×: Ottawa’s pond got so many feral goldfish that the city had to “euthanize” them—turns out they were just trying to pay the water bill with their own scales\.

Decision \#3 · en\_2034 · Tie: No · Confidence: Medium · Score: 1\.0 · ELO Δ: \+7\.63 / −7\.63

Headline: Three\-quarters of parents let children miss school for ‘duvet day’

Reasoning: Joke B has a more complex and layered structure—the fraction wordplay \(“three\-quarters is exactly the blanket\-to\-face ratio”\) adds an extra dimension of sarcasm and irony over Joke A’s simpler setup\.

Winning Features: incongruity · wordplay · absurdity · irony · sarcasm

Delivery Features: timing · conciseness · punchline\_positioning · framing\_commitment

Loser Features: weak\_punchline

GPT\-5 · Winner ✓: Three\-quarters of parents let kids miss school for “duvet day”—teachers are fuming: “How are we supposed to cover the syllabus when everyone’s already under the covers?” On the bright side, kids finally understand fractions: three\-quarters is exactly the blanket\-to\-face ratio required to breathe\.

Base\-Qwen\-7B · Loser ×: Why did the kid stay home from school? Because when asked if he wanted to go, he replied, “I’d rather be sick three\-quarters of the time\!”

Decision \#4 · en\_2115 · Tie: No · Confidence: Medium · Score: 1\.0 · ELO Δ: \+7\.66 / −7\.66

Headline: How to design an aesthetic home office that boosts productivity

Reasoning: Both jokes rely on irony and sarcasm, but Joke A’s darker, more unexpected twist—“the exact shade of my laptop screen at 3 a\.m\.—black with a faint glow of regret”—is more surprising and absurd\.

Winning Features: incongruity · absurdity · surprise · irony · sarcasm

Delivery Features: conciseness · deadpan · punchline\_positioning

Loser Features: cliché · weak\_punchline

Kimi\-K2 · Winner ✓: My productivity skyrocketed after I designed my aesthetic home office: I painted the walls the exact shade of my laptop screen at 3 a\.m\.—black with a faint glow of regret\.

Gemini\-2\.5\-Pro · Loser ×: I spent all weekend creating the perfect aesthetic home office to boost my productivity\. It worked\! I am now incredibly productive at taking pictures of my aesthetic home office for Instagram\.

Figure 5: Four representative LLaMA judge decisions\. Winner ✓ \(green\) and Loser × \(red\) are labelled directly on each joke box\. Feature rows indicate winning humor traits \(green\), delivery strengths \(blue\), and loser weaknesses \(red\)\. ELO deltas are approximated from the evaluation log\.
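The ELO Δ values in Figure 5 follow the usual zero\-sum Elo pattern, in which the winner gains exactly what the loser drops\. A minimal sketch of the standard Elo update is given below\. The K\-factor of 16 and the starting ratings are hypothetical, since this section does not state the tournament’s actual Elo parameters, and the sketch does not attempt to reproduce whatever distinguishes Stable Elo from the standard rule\.

```python
def elo_expected(r_a, r_b):
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=16.0):
    """Return updated (r_a, r_b); score_a is 1.0 for a win, 0.5 tie, 0.0 loss."""
    delta = k * (score_a - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta


# Hypothetical example: two equally rated models, A wins.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(new_a - 1000.0, new_b - 1000.0)  # 8.0 -8.0
```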
## Appendix D Hyperparameter Configurations

Standardized hyperparameters across the HumorRank tournament are detailed in Table [5](https://arxiv.org/html/2604.19786#A4.T5) \(candidate generation\) and Table [6](https://arxiv.org/html/2604.19786#A4.T6) \(LLM adjudication\)\. All tournament\-level Bradley–Terry and Elo parameters are summarized in Table [7](https://arxiv.org/html/2604.19786#A4.T7)\.

Table 5: Hyperparameters for candidate humor generation across all local and API\-based models\. API\-based models used default top\-$p$ values if not explicitly exposed\.

Table 6: Hyperparameters for LLM\-as\-a\-Judge adjudication\.

Table 7: Tournament configuration and Bradley–Terry global MLE parameters\.
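As an illustration of what a Bradley–Terry global MLE computes from a table of pairwise outcomes, the sketch below fits model strengths with the classical minorization\-maximization \(Zermelo\) iteration\. This is one standard way to fit the model and is not necessarily the optimizer used by the authors\. The win counts, iteration cap, and tolerance are hypothetical, ties are assumed to be absent or already split into half\-wins, and the mapping from strengths to the rating scale shown in the leaderboard is not covered\.

```python
def fit_bradley_terry(wins, iters=1000, tol=1e-10):
    """Fit Bradley-Terry strengths with the MM (Zermelo) iteration.

    wins[i][j] is the number of times item i beat item j.
    Returns strengths that sum to 1. Assumes every item has at least
    one win and one loss; otherwise the MLE does not exist.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        new_p = [x / norm for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical win counts for three models (not data from the paper).
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(range(3), key=lambda i: -strengths[i])
print(ranking)  # [0, 1, 2]
```

Under the fitted model, the probability that model `i` beats model `j` is `strengths[i] / (strengths[i] + strengths[j])`\.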
## Appendix E LLM\-as\-a\-Judge Prompting Framework

The following prompt template was used for all pairwise comparisons in the HumorRank evaluation pipeline\. The judge models received a system prompt establishing their role as comedy critics, followed by a structured user prompt presenting two jokes for comparison\. All 10,800 automated tournament comparisons for each judge utilized this exact template\.

PROMPT\_V3 \(System\): "You are a comedy critic judging which of two jokes is funnier\.\\n" "Analyze both the underlying logic \(humor mechanisms\) and the presentation \(delivery\)\.\\n" "Be direct and honest\. If one joke is clearly better, pick it\. " "If they are genuinely equal in quality, say TIE\.\\n" "Do not overthink it \-\-\- trust your first impression\. Output JSON only\."

User prompt: 'Prompt: "\{headline\}"\\n\\n' "JOKE A:\{joke\_a\}\\n\\n" "JOKE B:\{joke\_b\}\\n\\n" "Which is funnier? Return JSON:\\n" "\{\{\\n" '"reasoning":"brief explanation",\\n' '"decision":"A"or"B"or"TIE",\\n' '"winner\_humor\_features": \[list ALL that apply, 1\-3, from:\{mech\_features\}\],\\n' '"winner\_delivery\_features": \[list ALL that apply, 1\-3, from:\{deliv\_features\}\],\\n' '"loser\_features": \[list ALL that apply, 1\-3, from:\{loser\_features\}\]\\n' "\}\}"

HUMOR MECHANISMS: "incongruity", "wordplay", "absurdity", "surprise", "irony", "sarcasm", "observational", "narrative"

DELIVERY FEATURES: "timing", "conciseness", "deadpan", "escalation", "punchline\_positioning", "framing\_commitment"

LOSER FEATURES: "cliché", "confusing", "offensive", "overexplained", "buried\_punchline", "weak\_punchline"

Figure 6: Prompt template used for all 10,800 pairwise comparisons\. Template variables \(\{headline\}, \{joke\_a\}, \{joke\_b\}\) are instantiated per comparison\. The three feature lists \(humor mechanisms, delivery, and loser features\) enforce structured and consistent JSON outputs across all evaluations\.
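To make the template concrete, the sketch below assembles the user prompt and validates a judge reply against the three feature vocabularies in Figure 6\. It is an illustration under stated assumptions: whitespace inside the template strings is normalized here \(the extracted text appears to drop some spaces\), and the helper names `build_user_prompt` and `validate_reply` are hypothetical rather than taken from the released code\.

```python
import json

HUMOR_MECHANISMS = ["incongruity", "wordplay", "absurdity", "surprise",
                    "irony", "sarcasm", "observational", "narrative"]
DELIVERY_FEATURES = ["timing", "conciseness", "deadpan", "escalation",
                     "punchline_positioning", "framing_commitment"]
LOSER_FEATURES = ["cliché", "confusing", "offensive", "overexplained",
                  "buried_punchline", "weak_punchline"]

# Doubled braces are literal braces after str.format().
USER_TEMPLATE = (
    'Prompt: "{headline}"\n\n'
    "JOKE A: {joke_a}\n\n"
    "JOKE B: {joke_b}\n\n"
    "Which is funnier? Return JSON:\n"
    "{{\n"
    '"reasoning": "brief explanation",\n'
    '"decision": "A" or "B" or "TIE",\n'
    '"winner_humor_features": [list ALL that apply, 1-3, from: {mech_features}],\n'
    '"winner_delivery_features": [list ALL that apply, 1-3, from: {deliv_features}],\n'
    '"loser_features": [list ALL that apply, 1-3, from: {loser_features}]\n'
    "}}"
)


def build_user_prompt(headline, joke_a, joke_b):
    """Fill the template variables for one pairwise comparison."""
    return USER_TEMPLATE.format(
        headline=headline, joke_a=joke_a, joke_b=joke_b,
        mech_features=HUMOR_MECHANISMS,
        deliv_features=DELIVERY_FEATURES,
        loser_features=LOSER_FEATURES,
    )


def validate_reply(raw):
    """Parse a judge reply; reject decisions or features outside the vocabularies."""
    reply = json.loads(raw)
    if reply["decision"] not in {"A", "B", "TIE"}:
        raise ValueError(f"unexpected decision: {reply['decision']!r}")
    allowed = {
        "winner_humor_features": HUMOR_MECHANISMS,
        "winner_delivery_features": DELIVERY_FEATURES,
        "loser_features": LOSER_FEATURES,
    }
    for key, vocab in allowed.items():
        unknown = set(reply.get(key, [])) - set(vocab)
        if unknown:
            raise ValueError(f"{key}: unknown features {sorted(unknown)}")
    return reply
```

A reply that fails validation would be a natural trigger for the retry policy described in the Reproducibility Statement, although the source does not say whether retries are triggered this way\.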
## Appendix F Qualitative Examples and Feature Reasoning

A representative pairwise qualitative example is presented in the main paper \(Table [3](https://arxiv.org/html/2604.19786#S4.T3)\)\. The per\-model feature distributions adjudicated by the primary Llama 3\.3 70B judge are shown in Figure [2](https://arxiv.org/html/2604.19786#S4.F2) in the main paper\. The Qwen 2\.5 72B distributions are reproduced below \(Figure [7](https://arxiv.org/html/2604.19786#A6.F7)\) for cross\-judge comparison\.

![Refer to caption](https://arxiv.org/html/2604.19786v1/images/feature_winner_heatmap_qwen.png)

![Refer to caption](https://arxiv.org/html/2604.19786v1/images/feature_delivery_heatmap_qwen.png)

Figure 7: Per\-model winning feature distributions \(Qwen 2\.5 72B judge\)\. Left: Humor mechanisms\. Right: Delivery features\. Rank patterns are consistent with the primary Llama judge \(Figure [2](https://arxiv.org/html/2604.19786#S4.F2)\), confirming cross\-judge stability\.

![Refer to caption](https://arxiv.org/html/2604.19786v1/images/feature_loser_heatmap_qwen.png)

Figure 8: Per\-model failure mode distributions \(Qwen 2\.5 72B judge\)\. HumorGen\-7B again shows markedly higher *Overexplained* \(49\.5%\) rates compared to other models, consistent with findings under the primary Llama judge \(Figure [3](https://arxiv.org/html/2604.19786#S4.F3)\), confirming that this failure signature is evaluator\-agnostic\.

### F\.1 Key Observations & Findings

A comparison of the heatmaps across both judges reveals several patterns in how different models approach humor generation:

1. The Shallow Heuristics of Open\-Weight Baseline Models: As visualized across both evaluations, the baseline open\-weight models \(e\.g\., Base\-Qwen\-7B and Llama\-3\.3\-70B\) rely heavily on shallow lexical heuristics\. In terms of mechanics \(Figures [2](https://arxiv.org/html/2604.19786#S4.F2) and [7](https://arxiv.org/html/2604.19786#A6.F7), Left\), they lean on superficial *Wordplay* \(nearly 30% of their wins\)\. In delivery \(Figures [2](https://arxiv.org/html/2604.19786#S4.F2) and [7](https://arxiv.org/html/2604.19786#A6.F7), Right\), this heuristic behavior is even more extreme: these models devote over 85% of their winning generations to simple *Conciseness* and strict *Punchline Positioning*, effectively functioning as rigid, template\-based joke generators\.
2. The Structural Depth of Specialized Alignment: Conversely, the fine\-tuned specialist \(HumorGen\-SFT\-7B\) exhibits a markedly different signature\. For mechanics \(Figures [2](https://arxiv.org/html/2604.19786#S4.F2) and [7](https://arxiv.org/html/2604.19786#A6.F7), Left\), it shows the highest rates of *Absurdity* \(over 23% under both judges\) and *Sarcasm*, favoring semantic incongruity over simple wordplay\. In delivery \(Figures [2](https://arxiv.org/html/2604.19786#S4.F2) and [7](https://arxiv.org/html/2604.19786#A6.F7), Right\), rather than delivering immediate punchlines, HumorGen leverages *Framing Commitment* \(approx\. 27%\) and *Escalation* \(approx\. 15%\), suggesting that it can hold the setup and build narrative tension\.
3. The Balanced Proprietary Frontier: Frontier models like GPT\-5 and Kimi\-K2 sit between these extremes\. Their delivery is generally safe, indexing highly on conciseness and optimal positioning \(Figures [2](https://arxiv.org/html/2604.19786#S4.F2) and [7](https://arxiv.org/html/2604.19786#A6.F7), Right\), yet they retain enough depth to win frequently on *Irony* and *Incongruity* without resorting exclusively to the aggressive absurdity of the specialist or the simplistic puns of the baselines\.

## Appendix G Human Evaluation Protocol and Annotator Guidelines

The human evaluation protocol is designed to minimize bias and ensure high\-fidelity preference judgments\. Annotators are presented with the original setup and two model\-generated responses \(labeled anonymously as ‘Joke A’ and ‘Joke B’\)\. They are instructed to evaluate which response is funnier based on the context, or formally declare a tie if both are equally humorous or equally poor\.

#### Inter\-Annotator Reliability \(Krippendorff’s Alpha\)

Due to the highly subjective nature of humor preference and the potential for sparse pairwise intersections, we utilize Krippendorff’s Alpha \($\alpha$\) to rigorously quantify inter\-annotator agreement\. The metric is defined as:

$$\alpha = 1 - \frac{D_o}{D_e} \tag{3}$$

where $D_o$ represents the observed disagreement among annotators, and $D_e$ represents the disagreement expected by chance\. For our pairwise preference schema \(Model A vs\. Model B vs\. Tie\), we compute $\alpha$ using a nominal distance metric\. A value of $\alpha > 0$ indicates agreement beyond chance; notably, values approaching $0.45$ traditionally indicate moderate agreement, which is standard for highly subjective NLP generation benchmarks such as computational comedy evaluation\.
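To make the definition concrete, the following is a minimal sketch of nominal Krippendorff's alpha computed through a coincidence matrix, which tolerates pairs that not every annotator voted on. The function name `krippendorff_alpha_nominal` and the example votes are illustrative assumptions, not the authors' evaluation code or data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    `units` is a list with one inner list of labels per evaluated pair;
    a missing vote is simply omitted from the inner list.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single vote cannot be paired with another
        # Every ordered pair of votes from different annotators counts 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable votes")

    # D_o: share of coincidences whose two labels differ.
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    # D_e: disagreement expected if labels were paired at random.
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Hypothetical votes: each inner list holds the labels one pair received.
example_votes = [["A", "A", "B"], ["B", "B"], ["Tie", "A", "Tie"], ["A"], ["B", "B", "B"]]
alpha = krippendorff_alpha_nominal(example_votes)
```

Pairs with a single vote are skipped rather than treated as agreement, which is how the coincidence-matrix formulation handles incomplete overlap.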

Our human evaluation yields a raw human agreement of 52\.6%, $\alpha = 0.446$, and a 49\.0% micro\-averaged alignment with the primary LLM judge\. Complete cohort demographics and comprehensive statistical alignment tables will be formally incorporated upon final publication\.

## Appendix H Human Evaluation Details

#### Instructions to participants\.

Evaluators were shown the following instructions before starting \(see Figure [9](https://arxiv.org/html/2604.19786#A8.F9) for the screen as displayed\)\.

![Refer to caption](https://arxiv.org/html/2604.19786v1/images/eval_hrank.jpg)

Figure 9: Instructions screen \(HumorRank Blind Evaluation\), as shown to participants before voting\.

#### Participants\.

Human evaluators \($n = 3$\) completed 175 votes across 60 pairs\. They were recruited by invitation and were not paid\. Position bias was mitigated via random A/B assignment\.
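Random A/B assignment can be sketched as a per-pair coin flip that is recorded so blind votes can be mapped back afterwards. This is a hypothetical illustration: the helper names, the `X`/`Y` slot labels, and the seed are assumptions, not the interface code used in the study.

```python
import random


def assign_positions(output_x, output_y, rng):
    """Pick the display order by coin flip.

    Returns (joke_a, joke_b, x_is_a), where x_is_a records whether
    output_x was shown as 'Joke A'.
    """
    x_is_a = rng.random() < 0.5
    if x_is_a:
        return output_x, output_y, True
    return output_y, output_x, False


def decode_vote(vote, x_is_a):
    """Map a blind vote ('A', 'B', or 'Tie') back to slot 'X', 'Y', or 'Tie'."""
    if vote == "Tie":
        return "Tie"
    return "X" if (vote == "A") == x_is_a else "Y"


rng = random.Random(2026)  # fixed seed only so the sketch is reproducible
joke_a, joke_b, x_is_a = assign_positions("joke from X", "joke from Y", rng)
winner_slot = decode_vote("A", x_is_a)
```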

#### Inter\-Annotator Reliability\.

We utilize Krippendorff’s Alpha \($\alpha$\) for nominal data with multiple annotators and incomplete overlap:

$$\alpha = 1 - \frac{D_o}{D_e} \tag{4}$$

where $D_o$ is observed disagreement and $D_e$ is expected chance disagreement\. The decrease from $\alpha = 0.432$ \($n = 2$\) to $\alpha = 0.397$ \($n = 3$\) reflects increased variance with additional perspectives, confirming the inherent subjectivity of humor preference\.

![Refer to caption](https://arxiv.org/html/2604.19786v1/images/pair.jpg)

Figure 10: Sample evaluated pair showing the blind comparison interface\. Evaluators see two anonymized jokes \(Option A and Option B\) for a given headline and select the funnier response\.

## Appendix I Mathematical Stability of the Generative Tournament

To validate the robustness of the derived Elo ratings against sequence\-dependence \(often referred to as “late\-winner bias” in streamed continuous tournaments\), HumorRank utilizes a Stable Elo variant grounded in order\-independent aggregation \(Albers & Vries, [2001](https://arxiv.org/html/2604.19786#bib.bib4)\)\.

In standard sequential Elo implementations, a model $m$ updates its rating $R_m$ after a sequence of matches based on the standard iterative update rule:

$$R_m^{(t+1)} = R_m^{(t)} + K \cdot (S - E_m) \tag{5}$$

where $t$ indexes the chronological order of the match, $S$ is the observed match score, $E_m$ is the expected score of model $m$, and $K$ is the update step size\. Consequently, a model earning a win at the end of the match history block gains an inherently outsized advantage over a model that earned an identical win early in the sequence\.
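The order dependence of this update can be illustrated with a small sketch. The logistic expected score on a 400-point scale, the value `K = 32`, the 1000-point starting rating, and the three-match history are conventional or hypothetical choices made for illustration; the settings actually used for the leaderboard are not stated in this appendix.

```python
def expected_score(r_m, r_opp):
    """Conventional Elo expected score of a model rated r_m against r_opp."""
    return 1.0 / (1.0 + 10 ** ((r_opp - r_m) / 400.0))


def run_elo(matches, k=32.0, base=1000.0):
    """Apply the sequential update to (winner, loser) tuples in the given order."""
    ratings = {}
    for winner, loser in matches:
        r_w = ratings.setdefault(winner, base)
        r_l = ratings.setdefault(loser, base)
        gain = k * (1.0 - expected_score(r_w, r_l))  # K * (S - E) with S = 1
        ratings[winner] = r_w + gain
        ratings[loser] = r_l - gain  # the loser's update is the mirror image
    return ratings


# Hypothetical history: each model wins once and loses once.
history = [("m1", "m2"), ("m2", "m3"), ("m3", "m1")]
forward = run_elo(history)
backward = run_elo(list(reversed(history)))
```

In this toy history all three models have identical records, yet under these assumed settings the model whose win comes last (`m3` in the forward order, `m1` in the reversed order) finishes with the highest rating.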

To completely nullify this temporal artifact, we strip the time dependencies by evaluating the match history $H$ across $N$ independently shuffled topological permutations\. The stable terminal rating $\bar{R}_m$ for each model $m$ is defined as the arithmetic mean across all sequences:

$$\bar{R}_m = \frac{1}{N} \sum_{k=1}^{N} R_{m,k}^{(T)} \tag{6}$$

where $R_{m,k}^{(T)}$ represents the final rating of model $m$ after iterating through all $T = 10{,}800$ matches in the $k$\-th shuffled permutation\.
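The averaging over shuffled match orders can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the Elo constants (`k=32`, 400-point logistic scale, base rating 1000), the function names, and the seed are assumptions.

```python
import random


def expected_score(r_m, r_opp):
    """Conventional Elo expected score (assumed 400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((r_opp - r_m) / 400.0))


def run_elo(matches, k=32.0, base=1000.0):
    """Sequential Elo over (winner, loser) tuples in the given order."""
    ratings = {}
    for winner, loser in matches:
        r_w = ratings.setdefault(winner, base)
        r_l = ratings.setdefault(loser, base)
        gain = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + gain
        ratings[loser] = r_l - gain
    return ratings


def stable_elo(matches, n_permutations, seed=0):
    """Mean final rating per model over independently shuffled match orders."""
    rng = random.Random(seed)
    models = sorted({m for match in matches for m in match})
    totals = dict.fromkeys(models, 0.0)
    for _ in range(n_permutations):
        shuffled = list(matches)
        rng.shuffle(shuffled)  # a fresh random ordering of the same history
        final = run_elo(shuffled)
        for m in models:
            totals[m] += final[m]
    return {m: totals[m] / n_permutations for m in models}
```

Calling `stable_elo(history, n_permutations=N)` on a list of `(winner, loser)` tuples returns one averaged rating per model; because every permutation replays the same set of outcomes, only their ordering varies between runs.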

In our experiments, we set $N = 10$ to ensure convergence\. To empirically validate that this methodology produces a rigid transitive hierarchy rather than high\-variance noise, we measured the standard deviation of final ratings across the permutations:

$$\sigma_m = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left( R_{m,k}^{(T)} - \bar{R}_m \right)^2} \tag{7}$$

Tracking this distribution across both the primary \(Llama 3\.3 70B\) and validation \(Qwen 2\.5 72B\) judges for all 9 contestants yielded the following internal stability metrics on a base 1000\-point scale:

- Maximum standard deviation: bounded at $\sigma_{\max} = 37.5$ Elo points across all models\.
- Mean standard deviation: clustered tightly around $\bar{\sigma} \approx 29.5$ Elo points\.
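The two summary statistics above can be obtained from the per-permutation final ratings with a population standard deviation (the $1/N$ form of Eq. 7). The sketch below assumes the ratings are already collected in a dictionary; the function name and the sample numbers are hypothetical and do not reproduce the reported values.

```python
from statistics import fmean, pstdev


def permutation_stability(final_ratings_by_model):
    """Summarize rating spread across shuffled permutations.

    `final_ratings_by_model` maps each model to its list of final ratings,
    one entry per permutation. Returns (per-model sigma, max sigma, mean sigma).
    """
    # pstdev divides by N, matching the population form used in Eq. 7.
    sigmas = {m: pstdev(r) for m, r in final_ratings_by_model.items()}
    return sigmas, max(sigmas.values()), fmean(sigmas.values())


# Hypothetical final ratings for two models over four permutations.
sample = {
    "model_a": [1210.0, 1190.0, 1205.0, 1195.0],
    "model_b": [980.0, 1010.0, 990.0, 1020.0],
}
sigmas, sigma_max, sigma_mean = permutation_stability(sample)
```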

Given that inter\-model spreads on the leaderboard exceed 200 points, this stringent empirical convergence \($\sigma < 38.0$\) confirms that the generative tournament produced a highly stable latent hierarchy entirely free from tournament\-ordering artifacts\.
