One Ruler: A Same-Hands Re-Evaluation of Bivariate Causal Direction on Tuebingen, with a Parameter-Free Compression Baseline

arXiv cs.LG 06/24/26, 04:00 AM Papers

causal-inference benchmark compression evaluation reproducibility bivariate-causal tubingen

Summary

This paper conducts a same-hands re-evaluation of bivariate causal direction methods on the Tübingen cause-effect pairs, introducing a parameter-free compression baseline that ties with SLOPE. It documents how published accuracy figures are inflated by protocol differences and releases all code and data.

arXiv:2606.23767v1 Announce Type: new Abstract: Headline accuracies on the Tuebingen cause-effect pairs are routinely compared across papers even though each is measured under its authors' own protocol -- different pair subsets, weightings, model-selection, and decision rates. We argue this is the wrong comparison and run the right one: a same-hands re-evaluation in which every method is run by us on the identical 102 pairs, with one strict rule -- no tuning and a decision forced on every pair. As a clean reference point we introduce a deliberately minimal baseline: sorted-conditional compression, which feeds quantized, sorted, first-differenced data to an off-the-shelf compressor (bz2) and has zero fitted parameters. Under the common ruler the ranking differs sharply from the literature. Our baseline reaches 74.7% weighted accuracy (p = 3.7e-7); on the same 100 pairs that SLOPE is evaluated on it scores 76.0%, a 1.2-point gap below the authors' own forced-decision SLOPE (77.2%) that is well inside noise (McNemar p = 0.39). A faithful re-run of RECI lands at 70.7% -- inside the original authors' reported error bar, not the 77.5% often quoted (which we trace to a mis-copied cell). SLOPE's published 82.4% is a decided-subset figure: scoring the authors' own stored output only on the pairs its significance test chose to answer reproduces 81.7%. Under the common ruler the methods cluster in the low-to-mid 70s and the zero-parameter compressor ties the strongest of them. We document the mechanisms that inflate published figures (test-set model selection, significance-gated abstention) and contribute two further results: compression score magnitude is a model-free confounding flag (p = 2.8e-68), and a pre-registered falsification test fails in an instructive way that bounds the method's theoretical interpretation. Code, pre-registrations, and per-pair outputs are released.

Original Article

View Cached Full Text

Cached at: 06/24/26, 07:48 AM

# A Same-Hands Re-Evaluation of Bivariate Causal Direction on Tübingen, with a Parameter-Free Compression Baseline
Source: [https://arxiv.org/html/2606.23767](https://arxiv.org/html/2606.23767)
\(June 2026 — preliminary working paper\)

###### Abstract

Headline accuracies on the Tübingen cause\-effect pairs are routinely compared across papers even though each is measured under its authors’ own protocol — different pair subsets, weightings, model\-selection, and decision rates\. We argue this is the wrong comparison and run the right one: a*same\-hands*re\-evaluation in which every method is run by us, on the identical 102 pairs, with one strict rule — no tuning and a decision*forced on every pair*\. As a clean reference point we introduce a deliberately minimal baseline:*sorted\-conditional compression*, which feeds quantized, sorted, first\-differenced data to an off\-the\-shelf compressor \(bz2\) and has*zero*fitted parameters\. Under the common ruler the ranking differs sharply from the literature\. Our baseline reaches74\.7%74\.7\\%weighted accuracy \(p=3\.7×10−7p=3\.7\\times 10^\{\-7\}\) on our 102\-pair set — and on the*same 100 pairs*that SLOPE is evaluated on it scores76\.0%76\.0\\%, a1\.21\.2\-point gap below the authors’ own forced\-decision SLOPE \(77\.2%77\.2\\%\) that is well inside noise \(McNemarp=0\.39p=0\.39\)\. The authors’ journal SLOPER variant on the same pairs scores71\.1%71\.1\\%weighted \(from the bundledsloper\.tab\), so neither SLOPE configuration significantly beats ours on Tübingen\. A faithful re\-run of RECI lands at70\.7%70\.7\\%— inside the original authors’ own reported error bar, and*not*the77\.5%77\.5\\%often quoted \(which we trace to a mis\-copied cell for a different method on a synthetic dataset\)\. SLOPE’s published82\.4%82\.4\\%is a*decided\-subset*figure: scoring the authors’ own stored output only on the pairs its significance test chose to answer reproduces81\.7%81\.7\\%, demonstrating the inflation mechanism directly\. Under the common ruler the four methods cluster in the low\-to\-mid7070s and the zero\-parameter compressor is a statistical tie with the strongest of them\. We document the mechanisms that inflate the published figures \(test\-set model selection, significance\-gated abstention\) and contribute two further results: the compression score magnitude is a model\-free confounding flag \(p=2\.8×10−68p=2\.8\\times 10^\{\-68\}\), and a pre\-registered falsification test fails in an instructive way that bounds the method’s theoretical interpretation without overturning its empirical result\. All code, pre\-registrations, and per\-pair outputs are released\.

## 1Introduction

The Tübingen cause\-effect pairs \(CEP\)\[[1](https://arxiv.org/html/2606.23767#bib.bib1)\]are the de\-facto real\-world benchmark for the bivariate causal\-direction problem: given samples of two variablesX,YX,Yknown to be causally related, decide whetherX→YX\\to YorY→XY\\to Xwith no intervention\. Progress is tracked by weighted accuracy on this set, and method papers routinely place a new result next to a column of prior numbers copied from earlier papers\.

That comparison is unsound\. Each published number is produced under its own protocol: a slightly different eligible\-pair subset \(95–108 pairs appear in the literature\), different weighting, different handling of low\-confidence pairs \(forced decision vs\. abstention / area\-under\-the\- decision\-rate\-curve\), and — critically — model and hyperparameter choices made after observing benchmark performance\. Differences of several points routinely arise from these choices alone, so a cross\-paper table measures protocol as much as method\.

This paper does the apples\-to\-apples version\. Our contributions:

1. 1\.A same\-hands protocol\(§[2](https://arxiv.org/html/2606.23767#S2)\): one operator, the standard CEP pairs, official weights, no tuning, a decision forced on every pair\. Each competitor is run under the same rule — SLOPE from the authors’*own released code*, RECI matched to the authors’ own reported value, IGCI and our baseline measured directly — rather than quoting headline cells\.
2. 2\.A parameter\-free reference baseline\(§[3](https://arxiv.org/html/2606.23767#S3)\): sorted\-conditional bz2 compression, a single function with zero test\-time parameters, as a fixed yardstick\.
3. 3\.A re\-evaluation\(§[4](https://arxiv.org/html/2606.23767#S4)\) showing the literature ranking does not survive the common ruler\. We reproduce RECI’s*own*reported value \(70\.7%\), show the widely quoted “77\.5%” is a mis\-attribution, run the authors’ own SLOPE code to77\.2%77\.2\\%under forced decision, and reproduce its published82\.4%82\.4\\%as a decided\-subset artifact \(81\.7%81\.7\\%on the pairs it answered\)\. We name the inflation mechanisms\.
4. 4\.Two further results\(§[5](https://arxiv.org/html/2606.23767#S5)\): a model\-free confounding detector from the score magnitude, and an error analysis localizing the baseline’s failures to discrete and near\-bijective pairs\.
5. 5\.An honest boundary\(§[6](https://arxiv.org/html/2606.23767#S6)\): a pre\-registered falsification test that fails, separating what the method*is*\(a validated empirical heuristic\) from what it is*not*\(a clean estimator of the algorithmic asymmetry\)\.

We do not claim state of the art\. We claim that, measured fairly, a zero\-parameter compressor stands level with the best of this particular pack rather than behind it, and that the field’s headline numbers deserve same\-hands scrutiny\.

## 2The Same\-Hands Protocol

We fix one evaluation and apply it to every method:

- •Data\.The 102 CEP pairs with scalar cause, scalar effect, and≥50\\geq 50samples, weighted by the officialpairmeta\.txtweights\.
- •Forced decision\.Every pair receives a\{X→Y,Y→X\}\\\{X\\to Y,Y\\to X\\\}call; ties count as wrong\. No abstention, no decision\-rate curve, no top\-kkreporting\.
- •No tuning\.No hyperparameter or model class is selected using benchmark accuracy\.
- •Own hands\.We run each competitor ourselves \(re\-implemented from the paper or its reference toolbox\), rather than importing a number measured under a different protocol\.

This is intentionally the*strictest*reasonable setting; it is the setting under which a parameter\-free method is naturally evaluated, and we hold the competitors to the same bar\.

## 3A Parameter\-Free Baseline

GivenX,YX,Ywithnnsamples, quantize each independently to 256 levels by linear min\-max scaling, giving integer sequencesqx,qy∈\{0,…,255\}nq\_\{x\},q\_\{y\}\\in\\\{0,\\dots,255\\\}^\{n\}\. For an ordered pair \(key, value\) define the*sorted\-conditional code*

sc\(key,val\)=\|bz2\(Δ\(val\[arg⁡sortkey\]\)\)\|,sc\(\\text\{key\},\\text\{val\}\)=\\big\|\\,\\mathrm\{bz2\}\\big\(\\Delta\(\\text\{val\}\[\\arg\\\!\\mathrm\{sort\}\\,\\text\{key\}\]\)\\big\)\\big\|,wherearg⁡sort\\arg\\\!\\mathrm\{sort\}is a stable sort,Δ\\Deltathe first difference cast toint8, and\|⋅\|\|\\cdot\|the compressed byte length\. With marginal codedc\(q\)=\|bz2\(Δ\(q\)\)\|dc\(q\)=\|\\mathrm\{bz2\}\(\\Delta\(q\)\)\|, the score is

s=sc\(qy,qx\)max⁡\(dc\(qx\),1\)−sc\(qx,qy\)max⁡\(dc\(qy\),1\),s=\\frac\{sc\(q\_\{y\},q\_\{x\}\)\}\{\\max\(dc\(q\_\{x\}\),1\)\}\-\\frac\{sc\(q\_\{x\},q\_\{y\}\)\}\{\\max\(dc\(q\_\{y\}\),1\)\},and the rule is a\-priori:s\>0⇒X→Ys\>0\\Rightarrow X\\to Y,s<0⇒Y→Xs<0\\Rightarrow Y\\to X, confidence\|s\|\|s\|\.

Intuition\.IfY=f\(X\)\+εY=f\(X\)\+\\varepsilon, sortingYYbyXXyields a smoother sequence whose first differences carry less entropy and compress shorter; the reverse ordering does not\. The marginal normalization corrects forX,YX,Yhaving different intrinsic complexity\. This is a computable stand\-in for the Kolmogorov argumentK\(Y\|X\)<K\(X\|Y\)K\(Y\|X\)<K\(X\|Y\)under cause\[[2](https://arxiv.org/html/2606.23767#bib.bib2),[3](https://arxiv.org/html/2606.23767#bib.bib3)\]; it is the algorithmic\-information / MDL tradition, not a new principle\. The*instantiation*— a generic compressor on sorted, delta\-coded, quantized data with no fitted parameters — is the contribution\.

Settingswere frozen before any benchmark run: bz2 level 9; 256 levels; series\>8000\>8000points linearly subsampled to 8000\. bz2 \(not lzma\) is required: lzma’s container overhead makes output length\-determined below∼\\sim500 bytes, so both directions tie by construction on typical CEP series lengths \(§[6](https://arxiv.org/html/2606.23767#S6)\)\.

## 4Re\-Evaluation Results

### 4\.1The baseline on Tübingen

On the 102 pairs the baseline scores74\.7%weighted accuracy \(a\-priori sign, 0 ties\), bootstrap 95% CI\[63\.2,84\.7\]%\[63\.2,84\.7\]\\%\(n=1000n\{=\}1000\),p=3\.7×10−7p=3\.7\\times 10^\{\-7\}vs\. chance\. A200×200\\timessplit\-half out\-of\-sample check gives74\.8%74\.8\\%, confirming the fixed sign was not fitted\. Confidence is usable: restricting to the top\-\|s\|\|s\|fraction raises accuracy monotonically \(Table[1](https://arxiv.org/html/2606.23767#S4.T1)\)\.

Table 1:Abstention curve: accuracy rises with confidence\|s\|\|s\|\.
### 4\.2Every method on one ruler

Table[2](https://arxiv.org/html/2606.23767#S4.T2)reports each method run by us, forced decision, on the same 102 pairs, against the headline each method’s own paper reports\.

Table 2:Same\-hands re\-evaluation, forced decision, no tuning\. Rows ordered by same\-hands accuracy\.‡SLOPE is the authors’*own released code*run on their bundled CEP pairs, forced to decide; its82\.4%82\.4\\%headline reproduces as a*decided\-subset*score \(81\.7%81\.7\\%on the pairs its significance test answered\)\.†The “77\.5%” commonly quoted for RECI is a mis\-attribution \(see text\); the value here matches the RECI authors’ own table\. The two “Compression \(ours\)” rows:74\.7%74\.7\\%is the pre\-registered headline on our 102\-pair set;76\.0%76\.0\\%is the same\-subset re\-run on the authors’ bundled 100 pairs that SLOPE is evaluated on\. On the*identical 100 pairs*ours scores76\.0%76\.0\\%vs SLOPE77\.2%77\.2\\%— a1\.21\.2\-point gap, well inside noise on∼100\\sim 100pairs\.
### 4\.3RECI: reproduced, and the 77\.5% mis\-attribution

We implement RECI faithfully per the authors\[[5](https://arxiv.org/html/2606.23767#bib.bib5)\]and the Causal Discovery Toolbox: min\-max\[0,1\]\[0,1\]rescaling of both variables, a deliberately weak regressor \(shifted monomialax2\+cax^\{2\}\+c, the class the paper reports as best on real data\), held\-out test\-set MSE averaged over random splits, and the smaller\-error\-is\-cause rule\. This yields69\.969\.9–70\.9%70\.9\\%across our two implementations — squarely inside the authors’ own reported70\.73%±1\.5570\.73\\%\\pm 1\.55for the CEP set\. The number frequently quoted as RECI’s Tübingen accuracy,77\.5%77\.5\\%, does not appear for RECI in the source; the nearest value,77\.72%77\.72\\%, is a different method \(ANM\) on a*synthetic*suite \(SIM\-G\)\. Two common reproduction errors collapse RECI to chance and are worth recording: using an expressive regressor \(e\.g\. an RBF kernel\) fits*both*directions equally well and erases the asymmetry; and standardizing to unit variance instead of min\-max breaks the scale condition the method’s theorem requires\.

### 4\.4SLOPE: the 82\.4% is a decided\-subset score

Rather than re\-implement SLOPE we obtained the authors’*own released R code*\[[6](https://arxiv.org/html/2606.23767#bib.bib6)\]\(slope\-v20181208\) and ran it on their bundled CEP pairs\. In their code the causal direction is assigned fromsign\(ϵ\)\\mathrm\{sign\}\(\\epsilon\)only when a seeded Wilcoxon–Mann–Whitney significance test passes \(p<αp<\\alpha\); otherwise the method abstains\. Their own benchmark script setsα=1\.01\\alpha\{=\}1\.01, i\.e\. forced decision\. Run this way the authors’ code scores77\.2%\\mathbf\{77\.2\\%\}weighted \(forced\) on their 100\-pair subset\.

The published82\.4%82\.4\\%is not this number — it is the*decided\-subset*accuracy\. Scoring the authors’ own stored output \(slope\.tab\) only on the pairs it chose to answer — removing its two abstentions from the denominator — yields81\.7%\\mathbf\{81\.7\\%\}, reproducing the headline and exhibiting the inflation mechanism directly: the significance gate lets SLOPE skip the pairs it is least sure of, and accuracy is reported on the remainder\. Two independent evaluations agree the full\-benchmark figure is lower: the RECI paper’s own comparison finds SLOPE only “slightly better than RECI” \(≈\\approxlow\-70s\) on CEP, and a 2025 spline\-MDL paper\[[7](https://arxiv.org/html/2606.23767#bib.bib7)\]drops SLOPE and RECI because they had “similar or worse results than” its baseline\. \(The 2018 journal “mixed” variant in the same package scores lower still,71\.1%71\.1\\%forced /75\.2%75\.2\\%decided\.\)

*Provenance\.*The77\.2%77\.2\\%/81\.7%81\.7\\%figures are the authors’*own code and stored output*, not a re\-implementation; the only operator choice is the forced\-decision setting, taken from their own benchmark script\. The RECI figure likewise matches the authors’ own table\. Both competitor numbers are thus authoritative, not ports\.

### 4\.5Why the headlines read higher

The recurring pattern is selection and abstention\. RECI’s protocol explicitly “tried different \[train\] ratios and selected the best performing model on the test data” and reports the best\-performing function class; fixing the class a\-priori instead drops RECI to∼\\sim63–65%\. SLOPE gains from significance\-gated abstention\. Both are legitimate choices, but they make the published numbers best\-case rather than apples\-to\-apples\. The parameter\-free baseline has no such knobs, yet under the strict shared rule it is not behind — it is level with the best\.

### 4\.6Is the same\-hands tie statistically real? \(McNemar\)

The76\.076\.0vs77\.277\.2gap on the identical 100 CEP pairs is a paired classifier comparison; the appropriate test is McNemar on the pair\-level correct/wrong table \(unweighted: ours67/10067/100, SLOPE73/10073/100\)\. The2×22\{\\times\}2table is: both right5353, ours\-only1414, SLOPE\-only2020, both wrong1313\. McNemar’s exact binomial \(two\-sided\) givesp=0\.39p=0\.39; the continuity\-correctedχ2\\chi^\{2\}likewise yieldsp=0\.39p=0\.39\. We do not reject the null of equal accuracy on the unweighted table; the weighted accuracies sit at76\.0%76\.0\\%and77\.2%77\.2\\%as reported\. The “statistical tie” is now a statistical statement, not rhetoric\.

### 4\.7Generalization: Mooij synthetic suites

The Mooij synthetic suites\[[1](https://arxiv.org/html/2606.23767#bib.bib1)\]test generalization to controlled noise regimes\. We run all methods same\-hands on the three suites bundled with the SLOPE distribution \(SIM, SIM\-G, SIM\-ln; 100 pairs each, balanced direction, forced decision\)\. For SLOPE we report both the default \(plain\) variant used as a conference baseline and SLOPER — the journal mixed\-function variant \(mixedFunctions=T, nof=8\) that the authors use as their headline configuration\.

Table 3:Same\-hands accuracy on the real Tübingen pairs and the three Mooij synthetic suites \(100 pairs each, forced decision\)\.*Bold = row maximum\. Weighted on Tübingen, unweighted on SIM\-\* \(uniform weights\)\.*SLOPER is the authors’ journal mixed\-function variant \(mixedFunctions=T, nof=8\); for CEP we report the authors’ bundledsloper\.taboutput \(71\.1%71\.1\\%weighted forced\) for consistency with how we cite plain SLOPE \(77\.2%77\.2\\%fromslope\.tab\); for SIM we report fresh re\-runs of the sameSlope\(\)function under forced decision since the package bundles no SIM outputs\.*Two findings\.*\(i\) The two SLOPE variants have opposite regimes: plain wins Tübingen and loses SIM; SLOPER wins SIM\-ln but drops to71\.1%71\.1\\%on Tübingen \(−6\.1\-6\.1pp below plain\)\. Configuration matters and the right configuration is data\-dependent\. \(ii\) Against plain SLOPE, ours is a statistical tie on Tübingen and wins all three synthetic suites by1212–2222pp\. Against SLOPER, ours wins Tübingen by\+4\.9\+4\.9pp, wins SIM\-G by\+8\+8, ties on SIM, and loses SIM\-ln by−13\-13\. Across both variants there is no single SLOPE configuration that significantly beats the parameter\-free compressor on Tübingen\.

## 5Beyond the Decision

### 5\.1A model\-free confounding detector — and SLOPE cannot do it

In a pre\-registered simulation \(500 directX→YX\\to Ypairs vs\. 500 confoundedX←H→YX\\leftarrow H\\to Ypairs, matched marginals\), the score*magnitude*\|s\|\|s\|separates the two classes sharply: median\|s\|=0\.082\|s\|=0\.082\(direct\) vs\.0\.0130\.013\(confounded\), Mann–Whitneyp=2\.8×10−68p=2\.8\\times 10^\{\-68\}\. Direction accuracy is88\.6%88\.6\\%on direct pairs but50\.6%50\.6\\%\(chance\) on confounded ones — as expected, a hidden common cause leaves no privileged compression direction\. The top quartile by\|s\|\|s\|is96%96\\%pure for direct links; the bottom quartile recovers74%74\\%of confounded pairs\.

*Head\-to\-head with SLOPE on the same 1,000 pairs\.*We re\-ran the authors’ SLOPE on every pair from the same simulation and asked whether its confidence proxy\|ϵ\|\|\\epsilon\|also separates direct from confounded\. It does not: median\|ϵ\|=0\.0088\|\\epsilon\|=0\.0088\(direct\) vs\.0\.01450\.0145\(confounded\), Mann–Whitney one\-sidedp≈1\.000p\\approx 1\.000*in the wrong direction*\(confounded marginally larger\)\. Direction accuracy on direct pairs is69\.8%69\.8\\%\(decent\), but SLOPE’s score gives no signal that an emitted direction came from a confounded pair\. This is a structural difference: a regression\-MDL method always picks the direction that explains the data with shorter description length, even when no causal direction exists; the compression score collapses precisely because no privileged sort key exists\. To our knowledge no prior compression/MDL causal method reports a confounding flag as a free by\-product of the same score\.

### 5\.2Where the baseline fails

By raw pair count the baseline is right on68/10268/102\(66\.7%66\.7\\%unweighted; the74\.7%74\.7\\%headline is weighted\)\. The errors are not uniform\. Bucketing by the smaller of the two variables’ cardinalities, accuracy on the1515–5050unique\-value band is42%42\\%—*below chance*— and that band holds a third of the benchmark; continuous pairs \(\>50\>50levels\) sit near80%80\\%\. Mechanism: when one variable has few levels, sorting the other by it creates long constant runs that bz2 over\-rewards, manufacturing a spurious score\. The second failure mode is strong monotone near\-bijective relations \(§[6](https://arxiv.org/html/2606.23767#S6)\)\. Series length is*not*a discriminator \(correct and incorrect pairs have near\-identical median length\); confidence is — most errors are low\-\|s\|\|s\|near\-ties, exactly where the abstention curve \(Table[1](https://arxiv.org/html/2606.23767#S4.T1)\) says they should be\.

## 6Falsification and Limitations

A pre\-registered falsification test — and it fails\.The theory predicts that on a perfectly reversible, noise\-free map there is no asymmetry, so accuracy should be∼\\sim50%\. It is not: mean accuracy at zero noise is27\.6%27\.6\\%and strongly function\-dependent \(linear0%0\\%/all\-ties as predicted, but tanh78%78\\%, cubic0%0\\%\)\. The cause is the quantization grid interacting with the function’s curvature, leaking a direction where none exists\. This is a real dent in the*theory*: the score is not a pure measure of causal asymmetry but a blend of genuine signal and a quantization\-geometry artifact — the same marginal\-geometry cue IGCI uses openly\. It is not a dent in the*result*: on real data the74\.7%74\.7\\%stands, and the method is best read as a validated empirical heuristic, not a clean instantiation of the algorithmic principle\. A rank\-transform of the inputs \(which flattens marginal shape\) costs1616points, confirming the signal is partly marginal\-geometric and that inputs must not be rank\-normalized\.

Other boundaries\.\(i\)*Small samples*: thinning the long pairs to fixednncollapses accuracy to chance byn≈30n\\approx 30\(50\.3%50\.3\\%\) and only∼\\sim55–61% up ton=250n\{=\}250, vs\.76\.7%76\.7\\%at full length on the same pairs — the method needs hundreds–thousands of points, and the score there is confident sign\-noise, not abstention\. \(ii\)*Noise*: 10% added noise costs∼\\sim6 points\. \(iii\)*Synthetic gap*: additive\-noise SIM suites score 58–68%, below real\-data performance\. \(iv\)*Calibration*:\|s\|\|s\|is only monotone in its upper range; a band just above the abstain threshold is worse than chance\. \(v\)*Compressor*: lzma ties on 24\.5% of pairs by frame\-overhead artifact; bz2 is required\. \(vi\)*Scope*: scalar bivariate only\. \(vii\)*Categorical data, naïvely label\-encoded\.*On a balanced 100\-pair categorical benchmark \(fan\-in and fan\-out, ground truth,N=2000N\{=\}2000/pair\), ours scores100%100\\%on deterministic many\-to\-one mechanisms and0%0\\%on stochastic one\-to\-many \(overall50%50\\%\); the authors’ SLOPE exhibits the mirror\-image pattern \(0%0\\%/100%100\\%\)\. The two methods make perfectly orthogonal errors on label\-encoded categorical data — complementary, not interchangeable\. The score detects which conditional is more deterministic, not which side is cause\.

## 7Reproducibility

All design choices \(256 levels, bz2, 8000\-cap, decision sign, gate conditions\) were frozen in a pre\-registration before benchmark evaluation; the confounding and falsification experiments were likewise pre\-registered with fixed seeds\. We release the scorer \(a single importable function\), our RECI re\-implementation, the harness that runs the authors’ released SLOPE code and scores its forced and decided\-subset accuracies for Table[2](https://arxiv.org/html/2606.23767#S4.T2), the per\-pair outputs, bootstrap and split\-half scripts, and all pre\-registration documents at[stateoftheart\.nl/causal](https://stateoftheart.nl/causal)\. The headline result reproduces bit\-for\-bit \(102/102102/102pairs agree on re\-run\)\. The SLOPE figures come from the authors’ own code and stored output; the RECI reproduction matches the authors’ table\.*Cost\.*On the same 100 CEP pairs and the same machine, the median per\-pair wall clock is1\.01\.0ms for ours \(pure\-Pythonbz2\) vs\.24\.424\.4ms for the authors’ SLOPE \(≈24×\\approx 24\\times\); aggregate0\.200\.20s vs\.4\.14\.1s\. Ours has no external dependencies on the decision path\.

## 8Conclusion

Compared fairly — one operator, one set of pairs, forced decision, no tuning — a zero\-parameter compressor is level with the best of the Tübingen pack\. On the*same 100 pairs*that SLOPE is evaluated on, ours scores76\.0%76\.0\\%vs the authors’ own forced\-decision SLOPE at77\.2%77\.2\\%— a1\.21\.2\-point gap that is well inside noise \(McNemarp=0\.39p=0\.39\)\. The journal SLOPER variant on the same pairs scores71\.1%71\.1\\%weighted \(authors’ bundledsloper\.tab\), so neither SLOPE configuration significantly beats ours on Tübingen\. On our 102\-pair set the pre\-registered headline is74\.7%74\.7\\%; both are ahead of faithful RECI \(70\.7%70\.7\\%\) and IGCI \(66\.1%66\.1\\%\)\. The published82\.4%82\.4\\%/77\.5%77\.5\\%headlines do not survive the common ruler: SLOPE’s82\.4%82\.4\\%is a decided\-subset score \(we reproduce81\.7%81\.7\\%from the authors’ own output by removing its abstentions\), and RECI’s77\.5%77\.5\\%is a mis\-attribution\. We do not claim to beat SLOPE; we claim a stock compressor reaches the same shelf as the strongest tuned MDL method — with one structural extra: head\-to\-head on1,0001\{,\}000direct vs\. confounded pairs, our score magnitude separates the two classes atp=2\.8×10−68p=2\.8\\times 10^\{\-68\}while SLOPE’s\|ϵ\|\|\\epsilon\|does not separate them at all \(p≈1p\\approx 1\)\. The broader point for benchmarking practice is that real\-world causal\-direction leaderboards mix protocol with method, and that a parameter\-free baseline run by the same hands is a useful and sobering control\. We offer the compression score additionally as a free confounding flag, and we report — via a failed pre\-registered falsification — exactly where its theoretical interpretation stops and its empirical usefulness begins\.

#### Acknowledgements\.

Concept and direction by the author; implementation, experiments, and analysis were carried out with an autonomous AI coding agent\. We thank the maintainers of the Tübingen CEP benchmark\.

## References

- \[1\]J\. Mooij, J\. Peters, D\. Janzing, J\. Zscheischler, B\. Schölkopf\. Distinguishing cause from effect using observational data: methods and benchmarks\.*JMLR*17\(32\):1–102, 2016\.
- \[2\]D\. Janzing, B\. Schölkopf\. Causal inference using the algorithmic Markov condition\.*IEEE Trans\. Inf\. Theory*56\(10\):5168–5194, 2010\.
- \[3\]J\. Lemeire, D\. Janzing\. Replacing causal faithfulness with algorithmic independence\.*Minds and Machines*23\(2\):227–249, 2013\.
- \[4\]D\. Janzing et al\. Information\-geometric approach to inferring causal directions\.*Artificial Intelligence*182–183:1–31, 2012\.
- \[5\]P\. Blöbaum, D\. Janzing, T\. Washio, S\. Shimizu, B\. Schölkopf\. Cause\-effect inference by comparing regression errors\.*AISTATS*2018; extended analysis,*PeerJ CS*2019\.
- \[6\]A\. Marx, J\. Vreeken\. Telling cause from effect using MDL\-based local and global regression\.*ICDM*2017\.
- \[7\]K\. Hlavackova\-Schindler, A\. Marsela\. Identifying causal direction via dense functional classes\. arXiv:2509\.00538, 2025\.

One Ruler: A Same-Hands Re-Evaluation of Bivariate Causal Direction on Tuebingen, with a Parameter-Free Compression Baseline

Similar Articles

Evaluating Bivariate Causal Statements Based on Mutual Compatibility

Using Text-Based Causal Inference to Disentangle Factors Influencing Online Review Ratings

Optimal Experiments for Partial Causal Effect Identification

Lifted Causal Inference

Score-Based Causal Discovery of Latent Variable Causal Models

Submit Feedback

Similar Articles

Evaluating Bivariate Causal Statements Based on Mutual Compatibility

Using Text-Based Causal Inference to Disentangle Factors Influencing Online Review Ratings

Optimal Experiments for Partial Causal Effect Identification

Score-Based Causal Discovery of Latent Variable Causal Models