The Dark Regulome: Disentangling Predictability from Regulation in Genomic Foundation Models

arXiv cs.CL 06/08/26, 04:00 AM Papers
Summary
This paper introduces a residualization-and-permutation diagnostic to separate predictability-driven from regulation-driven variance in regulatory importance scores from genomic foundation models, applied to dark genome elements at glioma-relevant loci.
arXiv:2606.06834v1 Announce Type: new Abstract: High-grade gliomas integrate into neural circuits through functional synapses with neurons, raising the question of which noncoding elements shape synaptogenic gene expression in tumor cells. The regulatory program written across the dark genome, what we call the $\textit{dark regulome}$, is the natural substrate to probe, and sequence foundation models offer a zero-shot route through in-silico mutagenesis (ISM); yet likelihood-based scoring is tautologically coupled to local sequence predictability, leaving the regulatory interpretation underdetermined. Across three architecturally distinct foundation models (Caduceus-Ph, HyenaDNA, Enformer) and 30,448 dark genome elements at 92 glioma-relevant loci, we introduce a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven RIS variance. A sharp 10kb proximal-regulatory horizon survives every control we apply, but the LM-derived element-class hierarchy does not: a six-feature linear baseline matches Caduceus top-decile membership at AUC $= 0.985$. Cross-architecture decomposition cleanly separates a sequence-predictability layer (the two language models co-rank long well-predicted transposable elements) from a regulatory-output layer (Enformer alone retains residual cCRE-discriminative signal), with literally zero overlap between the two top-100 lists. Conservation, brain cis-eQTL, and STRING-PPI cross-checks then anchor what biology survives: top-100 elements across all three models are $3.3\times$ enriched per model for matching brain eQTLs ($p_\mathrm{emp} < 5\times 10^{-3}$), while a tempting transposable-element regulatory layer and a striking NRXN1+NLGN1 protein-pair convergence both fail proper permutation tests once those tests are constructed. We deliver the diagnostic as a general methodological tool for any ISM-based regulatory study.
Cached at: 06/08/26, 09:20 AM
# The Dark Regulome: Disentangling Predictability from Regulation in Genomic Foundation Models
Source: [https://arxiv.org/html/2606.06834](https://arxiv.org/html/2606.06834)
Chahat Baranwal (IIT Jodhpur, b22bb014@iitj.ac.in), Aaditya Baranwal (University of Central Florida, aaditya.baranwal@ucf.edu), Lakshya Nitin Tandon (Northeastern University, lakshya.tandon@neu.edu)

###### Abstract

High-grade gliomas integrate into neural circuits through functional synapses with neurons, raising the question of which noncoding elements shape synaptogenic gene expression in tumor cells. The regulatory program written across the dark genome, what we call the *dark regulome*, is the natural substrate to probe, and sequence foundation models offer a zero-shot route through in-silico mutagenesis (ISM); yet likelihood-based scoring is tautologically coupled to local sequence predictability, leaving the regulatory interpretation underdetermined. Across three architecturally distinct foundation models (Caduceus-Ph, HyenaDNA, Enformer) and 30,448 dark genome elements at 92 glioma-relevant loci, we introduce a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven RIS variance. A sharp 10 kb proximal-regulatory horizon survives every control we apply, but the LM-derived element-class hierarchy does not: a six-feature linear baseline matches Caduceus top-decile membership at AUC $= 0.985$. Cross-architecture decomposition cleanly separates a sequence-predictability layer (the two language models co-rank long well-predicted transposable elements) from a regulatory-output layer (Enformer alone retains residual cCRE-discriminative signal), with literally zero overlap between the two top-100 lists. Conservation, brain cis-eQTL, and STRING-PPI cross-checks then anchor what biology survives: top-100 elements across all three models are $3.3\times$ enriched per model for matching brain eQTLs ($p_{\mathrm{emp}} < 5\times 10^{-3}$), while a tempting transposable-element regulatory layer and a striking NRXN1+NLGN1 protein-pair convergence both fail proper permutation tests once those tests are constructed. We deliver the diagnostic as a general methodological tool for any ISM-based regulatory study.

## 1 Introduction

High-grade gliomas are not merely masses of proliferating cells. They are electrically integrated members of neural circuits (Venkatesh et al., [2019](https://arxiv.org/html/2606.06834#bib.bib1); Venkataramani et al., [2019](https://arxiv.org/html/2606.06834#bib.bib2)), forming functional glutamatergic synapses with cortical neurons, receiving excitatory input, and propagating calcium waves through tumor microtubes in a feed-forward loop in which neural activity accelerates tumor growth (Figure LABEL:fig:synapse_schematic) (Venkatesh et al., [2017](https://arxiv.org/html/2606.06834#bib.bib4); Osswald et al., [2015](https://arxiv.org/html/2606.06834#bib.bib5); Taylor et al., [2023](https://arxiv.org/html/2606.06834#bib.bib11)). The protein machinery supporting this hijacking is increasingly well characterized (Venkatesh et al., [2015](https://arxiv.org/html/2606.06834#bib.bib3); Krishna et al., [2023](https://arxiv.org/html/2606.06834#bib.bib6)), leaving the upstream question wide open: what regulatory program activates this synaptogenic gene expression in tumor cells, and which noncoding elements constitute its substrate? A natural place to look is the dark genome, the $\sim$98% of the genome that does not code for protein, comprising transposable elements, G-quadruplex motifs, enhancers, and chromatin insulators. The regulatory program encoded across it, the *dark regulome*, forms a reservoir of regulatory innovation (Adami and others, [2025](https://arxiv.org/html/2606.06834#bib.bib23); Chakraborty and others, [2023](https://arxiv.org/html/2606.06834#bib.bib19); Feng and Yang, [2025](https://arxiv.org/html/2606.06834#bib.bib8)). Yet experimental dissection of hundreds to thousands of candidate elements per locus is intractable without computational prioritization, and the field has lacked a principled, scalable readout.

Sequence foundation models offer a tempting zero-shot solution. Architectures that learn regulatory grammar directly from DNA sequence enable in-silico mutagenesis (ISM): mask a candidate element, score the model’s prediction at the target TSS, and rank elements by the resulting *Regulatory Influence Score* (RIS) (Kelley et al., [2018](https://arxiv.org/html/2606.06834#bib.bib44); Avsec et al., [2021](https://arxiv.org/html/2606.06834#bib.bib42)). We instantiate three architecturally distinct models, the bidirectional Mamba masked-LM Caduceus-Ph (Schiff et al., [2024](https://arxiv.org/html/2606.06834#bib.bib40)), the causal Hyena-based HyenaDNA (Nguyen et al., [2023](https://arxiv.org/html/2606.06834#bib.bib43)), and the supervised convolutional-transformer Enformer (Avsec et al., [2021](https://arxiv.org/html/2606.06834#bib.bib42)), that span the unsupervised-masked, unsupervised-causal, and supervised-regression objectives respectively. The implicit promise is triangulation: signals that survive all three architectures should reflect genuine regulatory organization rather than artifacts of any one objective.

The promise has a hidden cost. Likelihood-based RIS in masked or causal language models is by construction coupled to local sequence likelihood, since removing any sequence with high mutual information with its neighborhood (including repetitive elements that the model has effectively memorized in pretraining) lowers regional likelihood whether or not the element is regulatory. Cross-architecture agreement, by itself, does not separate this predictability layer from a genuinely regulatory layer. Without a diagnostic that explicitly decomposes the two, an ISM-based regulatory study reads as evidence for whatever its rankings happen to surface, and reported "convergences" can be statistical artifacts of large $n$ rather than biological signal.

Our contribution is threefold. First, we introduce a *residualization-and-permutation diagnostic* that takes any ISM ranking and separates the variance attributable to four nuisance covariates ($k$-mer entropy, GC content, log element length, log TSS distance) from the variance that survives, then evaluates every reported overlap or top-$K$ agreement against a marginal-preserving per-gene permutation null. Second, we apply the diagnostic across three foundation models on 30,448 dark-genome elements at 92 glioma-relevant loci, recovering a clean cross-architecture decomposition: the two language models share a sequence-predictability layer that co-ranks long well-predicted transposable elements, while Enformer alone retains residual cCRE-discriminative signal once predictability is controlled, and the two layers have literally zero top-100 overlap. Third, we identify the surviving biology: a sharp 10 kb proximal-regulatory horizon that holds across architectures, scoring windows, perturbation schemes, and residualization, together with a $3.3\times$ enrichment of matching brain cis-eQTLs in each model’s top-100 elements, supplying a small set of synaptogenic-locus candidates worth experimental follow-up. The same diagnostic also retires several headline patterns that the original framing of this work centered on, including a TE-mediated regulatory layer claim and an NRXN1+NLGN1 protein-pair convergence, both of which fail proper permutation tests.

## 2 Background and Related Work

Glioma as a Circuit Disease and the Dark Regulome:

Glioblastoma remains nearly uniformly fatal, with median survival under fifteen months. The discovery that gliomas form functional glutamatergic synapses with cortical neurons (Venkatesh et al., [2019](https://arxiv.org/html/2606.06834#bib.bib1); Venkataramani et al., [2019](https://arxiv.org/html/2606.06834#bib.bib2)) has reframed the disease as activity-dependent: neuronal firing triggers NLGN3 release, PI3K-mTOR and MAPK activation (Venkatesh et al., [2015](https://arxiv.org/html/2606.06834#bib.bib3)), and LTP-like BDNF-TrkB-CaMKII plasticity (Taylor et al., [2023](https://arxiv.org/html/2606.06834#bib.bib11)); the degree of cortical circuit remodeling inversely predicts survival (Krishna et al., [2023](https://arxiv.org/html/2606.06834#bib.bib6)), and excess glutamate plus disrupted chloride homeostasis form a HEx loop that accelerates tumor growth (Zhang et al., [2025](https://arxiv.org/html/2606.06834#bib.bib7); Picart and Hervey-Jumper, [2024](https://arxiv.org/html/2606.06834#bib.bib12)). Despite advances in defining the coding machinery, the upstream architecture in the dark regulome that enables this synaptogenic program remains largely uncharacterized at scale.

![Refer to caption](https://arxiv.org/html/2606.06834v1/x1.png)

Figure 1: Four regulatory layers of the dark genome converging on the glioma circuit phenotype. L1: BRD4-anchored super-enhancer hubs. L2: lncRNA-miRNA-circRNA networks (miR-128/NRXN1 axis). L3: cohesin-mediated 3D chromatin rewiring and ecDNA amplification. L4: structure-dependent G-quadruplex and Z-DNA regulation.

The dark genome supplies a plausible substrate (Figure [1](https://arxiv.org/html/2606.06834#S2.F1)): transposable elements form a documented regulatory reservoir, with LINEs mediating cis-acting transcriptional control in neural cells (Adami and others, [2025](https://arxiv.org/html/2606.06834#bib.bib23)), ERV-derived LTRs functioning as tissue-specific promoters (Thompson et al., [2016](https://arxiv.org/html/2606.06834#bib.bib17)), and TE subfamilies co-opted as tissue-specific enhancers in cancer (Karttunen and others, [2023](https://arxiv.org/html/2606.06834#bib.bib20)); somatic noncoding mutations in glioblastoma enhancers trigger synaptogenic cascades (Iñiguez-Muñoz et al., [2025](https://arxiv.org/html/2606.06834#bib.bib33)), and 3D chromatin reorganization activates circuit gene modules (Feng and Yang, [2025](https://arxiv.org/html/2606.06834#bib.bib8)). ENCODE cCREs (promoters, enhancers, CTCF insulators) provide the orthogonal regulatory annotation (ENCODE Project Consortium, [2020](https://arxiv.org/html/2606.06834#bib.bib37)).

Table 1: Tier-level RIS summary statistics across 30,448 dark genome elements.

Genomic Foundation Models, ISM, and the Predictability Confound:

We use three genomic foundation models that together span the training-objective landscape. Caduceus-Ph (Schiff et al., [2024](https://arxiv.org/html/2606.06834#bib.bib40)) is a bidirectional Mamba masked language model with 131 kb context and reverse-complement equivariance. HyenaDNA (Nguyen et al., [2023](https://arxiv.org/html/2606.06834#bib.bib43)) is a causal Hyena-based language model with 160 kb context and single-nucleotide resolution. Enformer (Avsec et al., [2021](https://arxiv.org/html/2606.06834#bib.bib42)) is a supervised convolutional-transformer (196 kb) that predicts 5,313 epigenomic tracks at 128 bp resolution. ISM has been applied to regulatory variant scoring (Kelley et al., [2018](https://arxiv.org/html/2606.06834#bib.bib44)) and enhancer grammar (Avsec et al., [2021](https://arxiv.org/html/2606.06834#bib.bib42)), and Integrated Gradients (Sundararajan et al., [2017](https://arxiv.org/html/2606.06834#bib.bib46)) provides a gradient-based perturbation-free cross-check on the resulting rankings. The methodological gap our work addresses is the implicit equation of "model is sensitive to this element" with "this element is regulatory": for likelihood-based scoring, removing any element with high mutual information to its surrounding sequence will lower regional likelihood whether or not the element carries regulatory function, so the resulting rankings can be dominated by sequence predictability rather than by regulation.

## 3 Methods

Gene Panel and Dark\-Genome Annotation

We curated 92 human genes across three functional tiers designed to isolate circuit-specific regulatory effects from generic brain expression. Tier 1 (synaptogenic circuit, 32 genes) covers genes with established roles in glioma-neuron synapse formation; Tier 2 (proliferative, 30 genes) collects canonical glioma drivers without synaptic roles; Tier 3 (brain control, 30 genes) covers brain-expressed genes not implicated in glioma. For each gene we extracted the canonical TSS from GENCODE v44 (GRCh38/hg38) and defined a model-context-matched window $\mathcal{W}_g = [\mathrm{TSS}_g - L/2,\, \mathrm{TSS}_g + L/2]$ ($L = 131$ kb for Caduceus-Ph, 160 kb for HyenaDNA, 196 kb for Enformer). Each window was annotated with three orthogonal tracks: transposable elements from UCSC RepeatMasker (19,947 LINE/SINE/LTR/DNA-transposon elements at $\geq 10$ bp), G-quadruplex motifs via canonical G4 regex with G4Hunter score $\geq 1.2$ (3,213 motifs), and ENCODE SCREEN cCREs v3 (7,288 elements classified as PLS, pELS, dELS, CTCF-bound, or DNase-H3K4me3). The merged *ISM manifest* contains $N = 30{,}448$ elements across the 92 windows (mean 331 per gene; Appendix Table [2](https://arxiv.org/html/2606.06834#Ax1.T2)).
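The G-quadruplex track above combines a motif regex with a score filter. The following is a minimal sketch of how such a filter could look, not the authors' pipeline: the regex (four runs of at least three G separated by 1 to 7 nt loops) and the G4Hunter-style scoring rule (each base in a G run of length n scores +min(n, 4), each base in a C run scores the negative, averaged over the motif) are the commonly cited forms and are assumptions here, as are the function names. Only the G-rich strand is searched.

```python
import re

# Assumed "canonical" G4 pattern: four G runs (>=3) separated by 1-7 nt loops.
G4_REGEX = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")


def g4hunter_score(seq: str) -> float:
    """Mean per-base G4Hunter-style score of a sequence.

    Every base in a G run of length n contributes +min(n, 4); every base in
    a C run contributes -min(n, 4); A and T contribute 0.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    total = 0
    for run in re.finditer(r"G+|C+", seq):
        n = len(run.group())
        sign = 1 if run.group()[0] == "G" else -1
        total += sign * min(n, 4) * n
    return total / len(seq)


def find_g4_motifs(seq: str, min_score: float = 1.2):
    """Return (start, end, score) for regex hits whose score passes the cut.

    Coordinates are 0-based, half-open. The 1.2 default mirrors the
    threshold quoted in the text.
    """
    hits = []
    for m in G4_REGEX.finditer(seq.upper()):
        score = g4hunter_score(m.group())
        if score >= min_score:
            hits.append((m.start(), m.end(), score))
    return hits


if __name__ == "__main__":
    print(find_g4_motifs("ATAT" + "GGGTGGGTGGGTGGG" + "ATAT"))
```

A production annotation would also scan the reverse complement (C-rich motifs on the reference strand), which this sketch omits.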

![Refer to caption](https://arxiv.org/html/2606.06834v1/x2.png)

Figure 2: Schematic of the residualization-and-permutation diagnostic. Dark-genome elements across 92 loci are processed via in-silico mutagenesis across three architecturally distinct foundation models (Schiff et al., [2024](https://arxiv.org/html/2606.06834#bib.bib40); Nguyen et al., [2023](https://arxiv.org/html/2606.06834#bib.bib43); Avsec et al., [2021](https://arxiv.org/html/2606.06834#bib.bib42)). Regulatory Influence Scores are residualized and evaluated against a permutation null to isolate regulation-driven variance from sequence-predictability confounds. Surviving signals are cross-validated using conservation, brain eQTL, and protein-interaction datasets.

In-Silico Mutagenesis and the Regulatory Influence Score

We loaded Caduceus-Ph (checkpoint `kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16`; 7.7M parameters, $L = 131{,}072$ bp) in float16 on a single NVIDIA A6000 48 GB GPU. As a bidirectional masked language model, Caduceus-Ph estimates the per-position conditional distributions $P_\theta(x_i \mid \mathbf{x}_{\setminus i})$, from which we compute the per-position log-likelihood $\ell(i;\mathbf{x}) = \log P_\theta(x_i \mid \mathbf{x}_{\setminus i})$ and the TSS-proximal mean

$$\bar{\ell}(\mathbf{x}) = \frac{1}{|\mathcal{R}|}\sum_{i\in\mathcal{R}} \ell(i;\mathbf{x}), \quad \mathcal{R} = \bigl\{i : |i - i_{\mathrm{TSS}}| \leq W\bigr\}, \quad W = 10{,}000\;\text{bp}, \tag{1}$$

which focuses the metric on the promoter-proximal regulatory neighborhood; $W = 10$ kb is the default and §[4](https://arxiv.org/html/2606.06834#S4) sweeps $W \in \{5, 10, 20, 50, \text{full}\}$ kb. For each annotated element $e_k$ spanning $[s_k, t_k)$ we construct a mutant sequence by replacing the element with N tokens,

$$x^{(k)}_i = \begin{cases}\texttt{N} & s_k \leq i < t_k,\\ x_i & \text{otherwise,}\end{cases} \tag{2}$$

and define the midpoint distance to TSS as $\delta_k = |(s_k + t_k)/2 - i_{\mathrm{TSS}}|$. The *Regulatory Influence Score* is the resulting change in TSS-proximal expectation,

$$\text{RIS}(e_k) = \bar{\ell}(\mathbf{x}^{(k)}) - \bar{\ell}(\mathbf{x}), \tag{3}$$

so that negative RIS indicates the model’s expectations near the TSS drop when the element is removed. The full ISM completed all 30,448 ablations in approximately 90 minutes on a single A6000 (sharded across four jobs of 23 genes each). HyenaDNA (checkpoint `LongSafari/hyenadna-medium-160k-seqlen-hf`; 6.5M parameters, 160 kb) uses the forward conditional $\ell^{\mathrm{causal}}(i;\mathbf{x}) = \log P_\theta(x_{i+1} \mid x_1, \ldots, x_i)$ in place of the masked likelihood, with Eqs. [1](https://arxiv.org/html/2606.06834#S3.E1)–[3](https://arxiv.org/html/2606.06834#S3.E3) otherwise unchanged. Enformer ($\sim$251M parameters, 196 kb context, $\sim$31 s per gene) requires a different functional: with no per-position likelihood, RIS is redefined as the change in predicted brain CAGE activity at the TSS bin,

$$\text{RIS}^{\mathrm{enf}}(e_k) = \overline{\mathrm{CAGE}}_{\mathrm{brain}}(\mathbf{x}^{(k)}, \mathcal{T}) - \overline{\mathrm{CAGE}}_{\mathrm{brain}}(\mathbf{x}, \mathcal{T}), \tag{4}$$

averaged over $\mathcal{T} = \{448 \pm 3\}$ output bins ($\sim$896 bp around the TSS) and over 31 brain-relevant CAGE tracks (filtered from Basenji2 metadata for `brain`, `cerebellum`, `cortex`, `neuron`, `astrocyte`). Integrated Gradients (Sundararajan et al., [2017](https://arxiv.org/html/2606.06834#bib.bib46)) provide a perturbation-free per-nucleotide attribution map computed via Captum (Kokhlikyan et al., [2020](https://arxiv.org/html/2606.06834#bib.bib47)) with $M = 20$ Riemann steps and an all-N baseline; we report contiguous IG peaks as regions where $|a_i|$ exceeds the 95th percentile of the genome-wide distribution, merging peaks separated by less than 50 bp.
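The likelihood-based score in Eqs. 1 to 3 can be sketched in a few lines. The example below is illustrative only: `toy_score` is a hypothetical stand-in for a DNA language model (it is not Caduceus-Ph or HyenaDNA), and the helper names are assumptions. It shows N-masking of a half-open span, the TSS-proximal mean, and the mutant-minus-reference difference.

```python
import math
from typing import Callable, List

# Any function mapping a sequence to per-position log-likelihoods.
LogLik = Callable[[str], List[float]]


def n_mask(seq: str, start: int, end: int) -> str:
    """Eq. 2: replace the half-open span [start, end) with N tokens."""
    return seq[:start] + "N" * (end - start) + seq[end:]


def tss_proximal_mean(loglik: List[float], tss: int, window: int) -> float:
    """Eq. 1: mean per-position log-likelihood over |i - tss| <= window."""
    lo = max(0, tss - window)
    hi = min(len(loglik), tss + window + 1)
    region = loglik[lo:hi]
    return sum(region) / len(region)


def ris(seq: str, start: int, end: int, tss: int, window: int,
        score: LogLik) -> float:
    """Eq. 3: TSS-proximal mean of the mutant minus that of the reference."""
    ref = tss_proximal_mean(score(seq), tss, window)
    mut = tss_proximal_mean(score(n_mask(seq, start, end)), tss, window)
    return mut - ref


def toy_score(seq: str) -> List[float]:
    """Hypothetical scorer: a base is 'well predicted' (log 0.9) when it
    repeats the base two positions back, otherwise it scores log 0.25."""
    out = []
    for i, base in enumerate(seq):
        predicted = i >= 2 and base != "N" and seq[i - 2] == base
        out.append(math.log(0.9) if predicted else math.log(0.25))
    return out


if __name__ == "__main__":
    seq = "AC" * 10
    # Masking a repetitive span next to the TSS window lowers the mean
    # log-likelihood there, so RIS is negative regardless of function.
    print(ris(seq, 4, 8, tss=10, window=3, score=toy_score))
```

The toy scorer makes the confound concrete: the masked span is purely repetitive and carries no regulatory meaning, yet it produces a negative RIS because the neighbouring positions relied on it for prediction.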

![Refer to caption](https://arxiv.org/html/2606.06834v1/x3.png)

Figure 3: The 10 kb regulatory horizon and element-class hierarchy. (A) Mean $|\text{RIS}|$ as a function of distance to TSS (shown at $W = 10$ kb; symlog scale; fold-enrichment across all windows in Table [3](https://arxiv.org/html/2606.06834#Ax1.T3)) reveals a sharp transition near 10 kb (dashed red line); all three tiers trace nearly identical decay profiles. (B) Heatmap of mean RIS by element class and gene tier (all distances); promoters and proximal enhancers dominate; LTR retrotransposons lead within the proximal $<$10 kb subset.

The Diagnostic: Residualization, Permutation Nulls, Effect Sizes

Likelihood-based RIS is tautologically coupled to local sequence likelihood, since removing any element with high mutual information to its surrounding sequence will lower regional likelihood whether or not the element is regulatory; reading raw RIS rankings as direct regulatory evidence therefore conflates two layers. We separate them with three tools. First, for each element we compute four nuisance covariates from the unmasked sequence (4-mer Shannon entropy of $\mathbf{x}_{[s_k, t_k)}$, GC content, $\log(1 + L_k)$, $\log(1 + \delta_k)$) and fit ordinary least squares

$$|\text{RIS}(e_k)| = \beta_0 + \beta_1 \mathrm{H}_4(e_k) + \beta_2 \mathrm{GC}(e_k) + \beta_3 \log(1 + L_k) + \beta_4 \log(1 + \delta_k) + \varepsilon_k, \tag{5}$$

reporting residualized $\widehat{\varepsilon}_k$ alongside raw $|\text{RIS}(e_k)|$. Before running the analysis we pre-committed to a binary decision rule, accepting a cCRE-anchored regulatory framing only if the partial Spearman correlation between $\widehat{\varepsilon}_k$ and ENCODE cCRE membership reaches $|\rho| \geq 0.15$ at $p < 10^{-4}$ for at least two of three architectures. Second, all cross-model and IG-vs-ISM overlap percentages are evaluated against marginal-preserving permutation nulls (per-gene rank shuffle, 2,000 permutations); observed values are reported as fold-enrichment over null mean and as empirical $p$. Third, every $p$-value is paired with an effect size: Cliff’s $\delta$ (bootstrap 95% CI from 1,000 stratified resamples) for two-sample comparisons, $\epsilon^2$ for Kruskal–Wallis, Spearman $\rho$ with bootstrap CI for distance decay; we adopt Romano (2006) thresholds and report Benjamini–Hochberg-corrected $p$ throughout. For Tier 1 we additionally re-run the entire ISM under three perturbation schemes (N-mask, in-place shuffle, random-base substitution) at five scoring windows, so that both the choice of $W$ in Eq. [1](https://arxiv.org/html/2606.06834#S3.E1) and the N-mask in Eq. [2](https://arxiv.org/html/2606.06834#S3.E2) are independently audited (§[4](https://arxiv.org/html/2606.06834#S4)).
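The residualization step of Eq. 5 can be illustrated with NumPy. This is a sketch under stated assumptions rather than the authors' code: entropy is computed in bits over overlapping 4-mers, the helper names are hypothetical, and the fit uses `numpy.linalg.lstsq` with an explicit intercept column.

```python
import math
from collections import Counter

import numpy as np


def kmer_entropy(seq: str, k: int = 4) -> float:
    """Shannon entropy (bits) of the overlapping k-mer distribution."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    n = len(kmers)
    return -sum(c / n * math.log2(c / n) for c in Counter(kmers).values())


def gc_content(seq: str) -> float:
    """Fraction of G/C bases in the element."""
    return sum(b in "GC" for b in seq) / len(seq) if seq else 0.0


def covariates(seq: str, tss_distance: float) -> list:
    """One design-matrix row for Eq. 5 (intercept is added in the fit)."""
    return [kmer_entropy(seq), gc_content(seq),
            math.log1p(len(seq)), math.log1p(tss_distance)]


def residualize(abs_ris, X):
    """OLS fit of |RIS| on the nuisance covariates.

    Returns (residuals, R^2); the residuals play the role of the
    estimated epsilon_k in Eq. 5.
    """
    y = np.asarray(abs_ris, dtype=float)
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - float(np.sum(resid ** 2)) / ss_tot if ss_tot > 0 else 0.0
    return resid, r2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = 0.5 + X @ np.array([0.1, 0.0, 0.3, -0.4]) + rng.normal(scale=0.2, size=200)
    resid, r2 = residualize(y, X)
    print(round(r2, 3))
```

The returned R² corresponds to the fraction of $|\text{RIS}|$ variance attributed to the nuisance covariates; downstream tests would then be run on `resid` rather than on the raw scores.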

We anchor the residualized rankings against three external datasets: UCSC phastCons100way (per-element mean conservation, correlated with raw and residualized $|\text{RIS}|$); GTEx v8 brain-tissue significant cis-eQTL pairs across thirteen brain regions (an element counts as an eQTL hit if its genomic span contains any significant variant whose target gene matches the host gene, with top-$K$ enrichment tested against a uniform-selection null); and STRING v12.0 human PPI at confidence thresholds $\{400, 700, 900\}$, where top-$K$ host genes are tested for interacting protein pairs against a per-element gene-label-shuffle null over 10,000 permutations.
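The eQTL hit rule and the uniform-selection null described above can be sketched as follows. The data structures (`Element`, the variant tuple) and function names are hypothetical, and the sketch assumes element spans and variant positions are already in the same coordinate convention; the empirical p-value uses the common (1 + count) / (1 + permutations) form, which is an assumption about the convention rather than something stated in the text.

```python
import random
from typing import List, NamedTuple, Tuple


class Element(NamedTuple):
    chrom: str
    start: int      # half-open span [start, end)
    end: int
    host_gene: str


# (chromosome, position, eQTL target gene)
Variant = Tuple[str, int, str]


def is_eqtl_hit(el: Element, variants: List[Variant]) -> bool:
    """True if any variant lies inside the element AND targets its host gene."""
    return any(c == el.chrom and el.start <= pos < el.end and g == el.host_gene
               for c, pos, g in variants)


def topk_eqtl_enrichment(top_k: List[Element], manifest: List[Element],
                         variants: List[Variant], n_perm: int = 1000,
                         seed: int = 0):
    """Compare hits among top-K elements with K elements drawn uniformly
    at random (without replacement) from the full manifest."""
    rng = random.Random(seed)
    hit = {el: is_eqtl_hit(el, variants) for el in manifest}
    observed = sum(hit[el] for el in top_k)
    null = [sum(hit[el] for el in rng.sample(manifest, len(top_k)))
            for _ in range(n_perm)]
    null_mean = sum(null) / n_perm
    p_emp = (1 + sum(n >= observed for n in null)) / (1 + n_perm)
    fold = observed / null_mean if null_mean > 0 else float("inf")
    return observed, fold, p_emp


if __name__ == "__main__":
    manifest = [Element("chr1", i * 100, i * 100 + 50, f"G{i % 4}")
                for i in range(40)]
    variants = [("chr1", 10, "G0"), ("chr1", 110, "G1"), ("chr1", 210, "G0")]
    print(topk_eqtl_enrichment(manifest[:2], manifest, variants))
```

Note that the third variant in the demo falls inside an element but targets a different gene, so it does not count as a hit; the gene-matching requirement is what separates this test from a plain positional overlap.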

## 4 Results

A Sharp 10 kb Regulatory Horizon

Across all 30,448 dark genome elements, ablation predominantly reduces TSS-proximal sequence likelihood (mean RIS $-0.035$; 15.6–18.2% of elements per tier carry $|\text{RIS}| > 0.01$, 11–14% carry $|\text{RIS}| > 0.1$; Table [1](https://arxiv.org/html/2606.06834#S2.T1), Fig. [4](https://arxiv.org/html/2606.06834#S4.F4)). Stratified bootstrap over genes shows the three tier means are statistically indistinguishable (Tier 1 $-0.036$, 95% CI $[-0.040, -0.032]$; Tier 2 $-0.038$ $[-0.041, -0.035]$; Tier 3 $-0.031$ $[-0.035, -0.027]$; Kruskal–Wallis $H = 4.50$, $p = 0.105$, Benjamini–Hochberg corrected throughout), so circuit-vs-proliferative differences will have to come from finer stratification. The first sharp pattern that does emerge is geometric. Letting $\delta_k$ denote midpoint distance to the TSS, elements within 10 kb exert mean $|\text{RIS}| \approx 0.21$, while those beyond 10 kb show effectively zero influence ($|\text{RIS}| < 0.001$, Fig. [3](https://arxiv.org/html/2606.06834#S3.F3)A). The 5–10 kb bin and the adjacent 10–20 kb bin differ by roughly $200\times$, the steepest single-step transition in an otherwise continuous distance-dependent decay; aggregated across the proximal-vs-distal split,

$$\mathbb{E}\bigl[|\text{RIS}(e_k)| \bigm| \delta_k < 10\,\text{kb}\bigr] \;\approx\; 480 \times \mathbb{E}\bigl[|\text{RIS}(e_k)| \bigm| \delta_k \geq 10\,\text{kb}\bigr], \tag{6}$$

ranging from $481\times$ on circuit genes to $2{,}012\times$ on controls; the Spearman correlation between $\delta_k$ and $|\text{RIS}|$ is strongly negative across all three tiers ($\rho = -0.583, -0.590, -0.556$; bootstrap 95% CI within $\pm 0.02$). The horizon under sequence-only modeling coincides with the empirical scale of promoter-proximal regulatory domains, is recovered identically across tiers, and persists through every control we apply. It is the one signal in our data that no nuisance covariate explains away.

![Refer to caption](https://arxiv.org/html/2606.06834v1/x4.png)

Figure 4: RIS distributions across 30,448 dark genome elements. (A) Violin plots of RIS across the three gene tiers reveal broadly similar distributions with long negative tails. (B) Mean RIS by element class and tier (all distances); promoters and proximal enhancers dominate overall, while LTR retrotransposons lead within the 10 kb proximal window (Fig. [3](https://arxiv.org/html/2606.06834#S3.F3)B). Error bars: SEM.

Reading the Class Hierarchy: Length and Predictability Drive It

Within the 10 kb proximal regime the raw element-class hierarchy is dramatic: LTR retrotransposons lead at $\overline{\text{RIS}} = -0.307$ ($\epsilon^2 = 0.42$ for the cross-class Kruskal–Wallis), followed closely by promoters ($-0.266$), LINEs ($-0.264$), distal enhancers ($-0.247$), CTCF insulators ($-0.237$), proximal enhancers ($-0.235$), SINEs ($-0.208$), DNA transposons ($-0.187$), and G-quadruplexes ($-0.024$, Table [4](https://arxiv.org/html/2606.06834#Ax1.T4), Appendix Fig. [10](https://arxiv.org/html/2606.06834#Ax1.F10)). The single strongest circuit-gene effect is an ERV3-derived LTR 1.9 kb upstream of NRXN1 ($\text{RIS} = -2.53$); the second is an L1PA7 LINE 11.2 kb from NLGN1 ($\text{RIS} = -1.93$, in the 10–20 kb decay tail); LINEs (18 of 30) and LTRs (7 of 30) dominate the top-30 hits across all tiers (Table [5](https://arxiv.org/html/2606.06834#Ax1.T5), Appendix). Without the diagnostic, this picture invites an immediate biological reading: a TE-mediated regulatory layer at the synaptogenic axis. The diagnostic rules out that reading.

The first cut is a length normalization. Class mean lengths span a factor of thirteen (LTR 326 bp, G-quadruplex 25 bp), so dividing $|\text{RIS}|$ by element length collapses the entire 12-fold raw spread into a 2.5% interval ($|\text{RIS}|/\text{kb}$ ranging from 0.937 to 0.960 across all classes), implying that each ablated base contributes almost equally regardless of class identity. The second cut is the residualization of Eq. [5](https://arxiv.org/html/2606.06834#S3.E5): four nuisance covariates capture 36% of $|\text{RIS}|$ variance for Caduceus-Ph and 28% for HyenaDNA, and the partial Spearman correlation between residualized $|\text{RIS}|$ and ENCODE cCRE membership is $\rho = -0.018$ (Caduceus) and $\rho = +0.015$ (HyenaDNA), both below the pre-committed decision threshold of $|\rho| \geq 0.15$. Enformer is the only architecture whose RIS retains a measurable cCRE-discriminative residual ($\rho = -0.100$, $p < 10^{-68}$), though this too falls short of the pre-committed threshold; it also has the lowest fraction of $|\text{RIS}|$ variance attributable to nuisance covariates ($R^2 = 0.09$), consistent with its training objective being directly tied to experimental output. The third cut is the simplest and most damning: a six-feature linear baseline (GC, $\log L$, $\log\delta$, plus is-TE / is-cCRE / is-G4 indicators) predicts whether an element falls in the top decile by Caduceus $|\text{RIS}|$ at five-fold AUC $= 0.985 \pm 0.001$, with HyenaDNA at $0.945$ and Enformer at $0.818$; ablating $\log\delta$ alone reduces AUC by $0.23$ (Caduceus), confirming that distance to TSS does most of the work. Re-examining the SINE circuit-vs-proliferative comparison in this light, the headline Wilcoxon $p_{\mathrm{adj}} = 4.87\times 10^{-7}$ is paired with Cliff’s $\delta = -0.080$ (Caduceus) and $\delta = +0.097$ (HyenaDNA), both negligible by Romano (2006) thresholds and *disagreeing in direction*, and the residualized effect collapses below $|\delta| = 0.07$ across all three models (Fig. [5](https://arxiv.org/html/2606.06834#S4.F5)). The class hierarchy in raw RIS is essentially a length-and-distance hierarchy with a regulatory veneer.
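The SINE comparison above turns on pairing a p-value with an effect size. A minimal sketch of Cliff's delta with a bootstrap interval follows; the magnitude cut-offs (0.147, 0.33, 0.474) are the values usually attributed to Romano et al. (2006) and are stated here as an assumption, as is the reading of "stratified" as resampling within each group separately.

```python
import random


def cliffs_delta(x, y) -> float:
    """P(x > y) - P(x < y) over all cross-sample pairs; ranges over [-1, 1].

    Quadratic in sample size, which is fine for a sketch; a rank-based
    formulation would be preferable for large samples.
    """
    gt = sum(a > b for a in x for b in y)
    lt = sum(a < b for a in x for b in y)
    return (gt - lt) / (len(x) * len(y))


def magnitude(delta: float) -> str:
    """Label |delta| using the commonly quoted Romano-style thresholds."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"


def bootstrap_ci(x, y, n_boot: int = 1000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI, resampling each group with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        cliffs_delta(rng.choices(x, k=len(x)), rng.choices(y, k=len(y)))
        for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


if __name__ == "__main__":
    # The two effect sizes quoted in the text share a label but not a sign.
    for d in (-0.080, 0.097):
        print(d, magnitude(d))
```

With large samples, a test can reach a very small p-value while the effect size stays in the negligible band, which is the situation the text describes for the SINE tier comparison.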

![Refer to caption](https://arxiv.org/html/2606.06834v1/x5.png)

Figure 5: The SINE tier comparison: $p$-significance without effect size or directional consistency. (A) SINE RIS distributions by gene tier (Caduceus-Ph): the Wilcoxon test reaches $p_{\mathrm{adj}} = 4.87\times 10^{-7}$ but Cliff’s $\delta = -0.080$ (negligible). (B) Cross-model: both LMs hit $p < 10^{-6}$ but disagree in direction (Caduceus $\delta = -0.080$; HyenaDNA $\delta = +0.097$); Enformer is non-significant ($p = 0.189$, $\delta = +0.019$). After residualization $|\delta| < 0.07$ across all three models.

Cross-Architecture Decomposition: Predictability vs. Expression Output

The three-architecture design becomes diagnostic once permutation nulls are in place. Within-tier $|\text{RIS}|$ correlates at Pearson $r = 0.82$ between Caduceus-Ph and HyenaDNA, and 76 of Caduceus-Ph's top 100 elements appear in HyenaDNA's top 100 against a per-gene marginal-preserving null mean of $\mu = 1.27$, a $60\times$ enrichment with empirical $p < 5\times 10^{-4}$ (Fig. [6](https://arxiv.org/html/2606.06834#S4.F6)). This is genuine cross-LM agreement. The corresponding Caduceus-Enformer top-100 intersection, however, is exactly zero (null mean $0.43$), as are the HyenaDNA-Enformer intersection ($0$ vs. null $0.42$) and the triple top-100 intersection ($0$ vs. null $0$). Above-null overlap with Enformer appears only at $K = 500$ (Caduceus-Enformer $65$ vs. null $8.66$, $7.5\times$; triple $35$ vs. null $0.20$, $178\times$). Two top-list layers therefore exist, and at the most stringent cutoff they are disjoint.

The eight-cell Venn decomposition at $K = 100$ resolves the layers cleanly. The Caduceus $\cap$ HyenaDNA-only cell (76 elements) is 91% transposable elements with mean length 1,168 bp and mean TSS distance 5,845 bp; the Enformer-only cell (100 elements) is 87% promoters and proximal enhancers with mean length 292 bp and mean TSS distance 813 bp. The two layers differ by $4\times$ in mean element length and by $7\times$ in mean TSS distance, and at $K = 100$ no element belongs to both. The reading that follows from the residualization analysis is that the two language models share a sequence-predictability layer that surfaces long, well-predicted sequences at moderate TSS proximity regardless of regulatory annotation, while Enformer (whose training objective is CAGE prediction) scores a regulatory-output layer of short proximal cCREs. Cross-model agreement therefore acts as a layer indicator: agreement with Enformer marks elements whose ranking is anchored to expression output, while agreement only between the LMs flags elements whose ranking is at risk of being a sequence-grammar artifact.
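The overlap-versus-null comparison above can be sketched in a few lines. This is a minimal illustration on synthetic scores, not the authors' pipeline: the function names, the toy data, and in particular the exact form of the "per-gene marginal-preserving" null (here, shuffling one model's scores among the elements of each gene) are assumptions.

```python
import numpy as np

def top_k(scores, k):
    """Return the set of indices holding the k largest |score| values."""
    order = np.argsort(-np.abs(scores), kind="stable")
    return set(order[:k].tolist())

def overlap_test(scores_a, scores_b, gene_ids, k=100, n_perm=500, seed=0):
    """Observed top-k overlap, null mean, and empirical p-value.

    ASSUMPTION: the null shuffles model B's scores among the elements of
    each gene, so every gene keeps its own score distribution.
    """
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    gene_ids = np.asarray(gene_ids)
    top_a = top_k(scores_a, k)
    observed = len(top_a & top_k(scores_b, k))
    # Element indices grouped by gene, computed once.
    groups = [np.flatnonzero(gene_ids == g) for g in np.unique(gene_ids)]
    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = scores_b.copy()
        for idx in groups:
            shuffled[idx] = scores_b[rng.permutation(idx)]
        null[i] = len(top_a & top_k(shuffled, k))
    # Add-one correction keeps the empirical p-value strictly positive.
    p_emp = (1 + int(np.sum(null >= observed))) / (n_perm + 1)
    return observed, float(null.mean()), p_emp

# Synthetic demo: 20 hypothetical genes x 50 elements, model B = model A + noise.
rng = np.random.default_rng(1)
gene_ids = np.repeat(np.arange(20), 50)
scores_a = rng.normal(size=gene_ids.size)
scores_b = scores_a + 0.1 * rng.normal(size=gene_ids.size)
observed, null_mean, p_emp = overlap_test(scores_a, scores_b, gene_ids, k=50)
print(observed, round(null_mean, 2), p_emp)
```

With the add-one correction, the smallest attainable empirical p-value is $1/(n_{\mathrm{perm}}+1)$, so the number of permutations bounds how small a reported $p_{\mathrm{emp}}$ can be.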

![Refer to caption](https://arxiv.org/html/2606.06834v1/x6.png)

Figure 6: Three-model cross-validation. (A) $|\text{RIS}|$ scatter, Caduceus vs. Enformer: proximal elements ($<10$ kb, green) and distal (grey); Spearman $\rho$ annotated in panel. (B) Distance-decay across all three models; all recover the 10 kb boundary, sharply for the language models and gently for Enformer. (C) Element-class hierarchy: language models rank TEs highest, Enformer ranks promoters and enhancers highest. (D) Top-$K$ intersection: the two LMs share 76 of 100 top elements ($60\times$ above null), while neither shares any with Enformer at $K = 100$.

### Orthogonal Evidence: Conservation, eQTLs, and Protein Interactions

Three external datasets test whether the LM-derived top-$K$ rankings carry regulatory signal independent of our internal predictability and length covariates. UCSC phastCons100way mean conservation across 30,389 elements has a small positive Spearman correlation with raw $|\text{RIS}|$ (Caduceus $\rho = +0.106$, HyenaDNA $\rho = +0.047$, Enformer $\rho = +0.138$; all $p < 10^{-15}$), but the residualized RIS *flips* sign for the language models ($\rho = -0.105$ Caduceus, $-0.084$ HyenaDNA) and reduces to essentially zero for Enformer ($\rho = -0.013$). The raw positive association is therefore an artifact of length and distance covariates; what remains of the LM signal after residualization is anti-correlated with conservation, consistent with LM responses being driven by recently evolved (and therefore less conserved) repetitive sequences.

GTEx v8 cis-eQTLs across thirteen brain tissues offer a stronger control, and the picture is more positive there: among the top 100 elements per model, 8% overlap a brain eQTL whose target matches the host gene, a $3.3$ to $3.4\times$ enrichment over a uniform-selection null with $p_{\mathrm{emp}} \leq 3.5\times 10^{-3}$, consistent across all three architectures and persistent at $K \leq 500$. The LM RIS therefore does carry an eQTL-aligned signal at the top of the ranking, even though that signal is not aligned with cCRE-class membership in residualized space.

The third cross-check refutes a different headline. The "interacting protein pair" framing of the NRXN1+NLGN1 result, when tested against the full STRING v12.0 human PPI graph at confidence thresholds $\{400, 700, 900\}$, gives observed pair counts at or below the gene-shuffle null at every $K \leq 500$ and for every model (fold enrichment $0.46$–$1.18$); only the Caduceus $K = 10$, threshold-$900$ cell shows a borderline signal ($3$ vs. $1.6$ expected, $p_{\mathrm{emp}} = 0.23$). The trans-synaptic adhesion narrative is post-hoc storytelling on $n = 2$ data points and does not survive a proper PPI null.
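The sign flip under residualization can be reproduced on toy data. The sketch below is illustrative only: the data are synthetic, a single covariate stands in for the paper's nuisance covariates, and the helper names `residualize` and `spearman` are hypothetical. It regresses the score on the covariate by ordinary least squares and compares rank correlations with conservation before and after.

```python
import numpy as np

def residualize(y, covariates):
    """Return OLS residuals of y after regressing on covariates plus an intercept."""
    X = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def spearman(a, b):
    """Spearman rho as the Pearson correlation of ranks (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Synthetic data (NOT the paper's): longer elements are both more conserved
# and higher-scoring, while the score's own link to conservation is negative.
rng = np.random.default_rng(0)
n = 5000
length = rng.normal(size=n)  # stand-in for a standardized length covariate
conservation = 0.8 * length + 0.6 * rng.normal(size=n)
score = 1.0 * length - 0.3 * conservation + 0.3 * rng.normal(size=n)

rho_raw = spearman(score, conservation)
rho_resid = spearman(residualize(score, length), conservation)
print(f"raw rho = {rho_raw:+.3f}, residualized rho = {rho_resid:+.3f}")
```

In this toy setup the raw correlation is positive only because both variables track the covariate; once the covariate is regressed out, the remaining association is negative, which is the same qualitative pattern reported for the language models.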

### Robustness, Saliency, and Generalization

The 10 kb horizon and the cross-architecture decomposition both survive every robustness check we have run. For the Tier 1 cohort (9,512 elements), proximal/distal $|\text{RIS}|$ enrichment is $21{,}197\times$ at $W = 5$ kb and $463.7\times$ at $W = 10$ kb, decaying as expected once $W$ overlaps the distal region (Appendix Table [3](https://arxiv.org/html/2606.06834#Ax1.T3)); class rank order is preserved across narrow windows (Spearman $\rho \geq 0.90$ for $W \leq 20$ kb), so neither boundary nor hierarchy is manufactured by the choice of $W$. Three independent perturbation schemes (N-token masking, in-place shuffling, random-base substitution) agree at $W = 10$ kb (Spearman $\rho = 0.745$ to $0.889$; Pearson $r = 0.841$ to $0.985$), with top-100 overlaps of 32%, 30%, and a 28% triple intersection, well above the $\sim 1\%$ chance baseline (Fig. [7](https://arxiv.org/html/2606.06834#S4.F7)). Held-out gene generalization gives mean train-vs-test Spearman $\rho = 0.76$ (Caduceus), $0.67$ (HyenaDNA), and $0.46$ (Enformer) over fifty 5-fold splits, so the patterns are not gene-leakage artifacts. Integrated Gradients adds a perturbation-free cross-check: 82% of Caduceus IG peaks (430 of 526) overlap ISM-significant elements with $|\text{RIS}| > 0.01$ across the 32 Tier 1 genes (Fig. [8](https://arxiv.org/html/2606.06834#Ax1.F8), Appendix), with NRXN1 attribution maximal at the same ERV3-derived LTR that topped the ISM ranking. The 10 kb horizon, the predictability-vs-output decomposition, and the eQTL-anchored top-$K$ candidates pass every robustness, saliency, and held-out cross-check we have devised.
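For readers unfamiliar with the three perturbation schemes named above, the following sketch shows one plausible way to apply each to a single element. The function names, the half-open coordinate convention, and the toy sequence are assumptions for illustration; the paper's exact tokenization and masking details are not reproduced here.

```python
import random

def n_mask(seq, start, end):
    """Replace seq[start:end] with N tokens."""
    return seq[:start] + "N" * (end - start) + seq[end:]

def shuffle_in_place(seq, start, end, rng):
    """Permute the bases inside seq[start:end], keeping base composition."""
    region = list(seq[start:end])
    rng.shuffle(region)
    return seq[:start] + "".join(region) + seq[end:]

def random_substitute(seq, start, end, rng):
    """Replace seq[start:end] with uniformly drawn bases."""
    region = "".join(rng.choice("ACGT") for _ in range(end - start))
    return seq[:start] + region + seq[end:]

rng = random.Random(0)
wild_type = "ACGTACGTTTGACCAGTACG"  # toy 20 bp sequence
start, end = 5, 13                  # hypothetical element coordinates (half-open)
variants = {
    "n_mask": n_mask(wild_type, start, end),
    "shuffle": shuffle_in_place(wild_type, start, end, rng),
    "random": random_substitute(wild_type, start, end, rng),
}
for name, seq in variants.items():
    print(f"{name:8s} {seq}")
```

All three variants keep the sequence length and flanking context fixed, so any change in a model's score is attributable to the element itself. Shuffling additionally preserves base composition, which is one reason agreement across schemes is informative.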

![Refer to caption](https://arxiv.org/html/2606.06834v1/x7.png)

Figure 7: Robustness across scoring windows and perturbation schemes (Tier 1). (A) Distance-decay across five $W$ values. The 10 kb transition is reproduced for narrow $W$ and necessarily flattens once $W$ overlaps the distal region. (B) Per-element RIS scatter at $W = 10$ kb, N-mask vs. shuffle (blue) and N-mask vs. random (green), tightly along the diagonal. (C) Element-class rank under each $W$. (D) Top-$K$ overlap across the three perturbation schemes for $K \in \{100, 200, 500\}$.

## 5 Discussion

The 10 kb horizon is the one signature in our data that survives every control we have applied: three architectures, three gene tiers, three perturbation schemes, five scoring windows, residualization on four nuisance covariates, distance-binned class stratification, and held-out-gene generalization. The boundary likely reflects the empirical reach over which a primary-sequence model connects a candidate element to its target TSS without help from 3D chromatin contacts (Chakraborty et al., [2023](https://arxiv.org/html/2606.06834#bib.bib19); Feng and Yang, [2025](https://arxiv.org/html/2606.06834#bib.bib8)), so it marks the resolution of the probe rather than the boundary of the regulatory landscape. Beneath that apparent horizon the diagnostic refines what the rankings mean: the two language models share a sequence-predictability layer that ranks long, well-predicted transposable elements at moderate TSS proximity highly regardless of class, while Enformer (trained on CAGE) scores a regulatory-output layer of short proximal cCREs largely orthogonal to it, so cross-model overlap acts as a layer indicator.

The orthogonal cross-checks tell a coherent three-part story: residualized LM-RIS is anti-correlated with phastCons (what survives in the LMs leans on recently evolved repetitive sequence rather than conserved regulatory elements); brain cis-eQTL overlap nonetheless reaches $3.3\times$ enrichment in each model's top 100, a real if modest signal that the LM rankings do carry; and the full STRING-PPI test rejects the trans-synaptic adhesion narrative on which the original framing of this work centered (observed pair counts at or below a gene-shuffle null at every confidence threshold and $K \leq 500$). The methodological consequence is concrete: large $n$ produces small $p$ for many class-tier comparisons, but the "effect" they describe can have negligible Cliff's $\delta$ and can disagree in direction across architectures, so headline $p$-values without effect sizes and permutation nulls are not enough.
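To make the effect-size point concrete, Cliff's $\delta$ is the probability that a value from one group exceeds a value from the other, minus the reverse probability. The sketch below is a generic implementation on synthetic data, with hypothetical names; it is not taken from the authors' code.

```python
import numpy as np

def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    x = np.asarray(x, dtype=float)
    y_sorted = np.sort(np.asarray(y, dtype=float))
    # For each x: how many y values lie strictly below it, and strictly above it.
    n_y_below = np.searchsorted(y_sorted, x, side="left")
    n_y_above = y_sorted.size - np.searchsorted(y_sorted, x, side="right")
    return float((n_y_below.sum() - n_y_above.sum()) / (x.size * y_sorted.size))

# Synthetic example: two large groups separated by a tenth of a standard deviation.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, size=5000)
group_b = rng.normal(loc=0.1, size=5000)
delta = cliffs_delta(group_a, group_b)
print(f"Cliff's delta = {delta:+.3f}")
```

With thousands of observations per group, a shift this small would typically yield a very small rank-test $p$-value even though $|\delta|$ stays well below the roughly 0.147 cutoff often used as a rule of thumb for "negligible", which mirrors the SINE tier comparison in Fig. 5.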

Limitations. Residualization controls predictability in expectation but cannot fully exclude pretraining memorization of repeat-family signatures; the 92-gene panel is small for class-by-tier interactions; the sequence-only models miss distal TAD-mediated contacts; and the orthogonal computational signals we report are corroborative, not a substitute for wet-lab confirmation (Fulco et al., [2016](https://arxiv.org/html/2606.06834#bib.bib39)).

## 6 Conclusion

Sequence foundation models promise a zero-shot lens on the dark regulome, but the lens is silently miscalibrated: likelihood-based ISM scoring conflates regulatory function with sequence predictability, and cross-architecture “convergences” can be statistical artifacts of large $n$. We deliver the calibration. Our residualization-and-permutation diagnostic equips any ISM-based study with a principled separation of the predictability layer from the regulation layer, marginal-preserving nulls for every overlap percentage, and effect sizes alongside every $p$-value, across 30,448 dark-genome ablations spanning 92 glioma synaptic loci. Three results emerge with confidence: a sharp 10 kb proximal-regulatory horizon that survives every control we apply; a clean architecture-level decomposition into a sequence-predictability layer shared by the language models and a regulatory-output layer recovered uniquely by Enformer (top-100 overlap exactly zero); and a $3.3\times$ brain cis-eQTL-enriched shortlist of synaptogenic-locus candidates primed for closed-loop CRISPRi perturbation (Fulco et al., [2016](https://arxiv.org/html/2606.06834#bib.bib39); Tan et al., [2023](https://arxiv.org/html/2606.06834#bib.bib34); Zhang et al., [2025](https://arxiv.org/html/2606.06834#bib.bib7)). The diagnostic is architecture-agnostic and ports immediately to next-generation genomic foundation models, to epigenomic re-weighting via ATAC-seq and HiChIP (Bi et al., [2025](https://arxiv.org/html/2606.06834#bib.bib31)), and to overlap with patient noncoding mutation catalogs (Iñiguez-Muñoz et al., [2025](https://arxiv.org/html/2606.06834#bib.bib33)).

## References

- LINE-1 retrotransposons mediate cis-acting transcriptional control in human pluripotent stem cells and regulate early brain development. Cell Genomics 5(10), pp. 100979.
- Ž. Avsec, V. Agarwal, D. Visentin, J. R. Ledsam, A. Grabska-Barwinska, K. R. Taylor, Y. Assael, J. Jumper, P. Kohli, and D. R. Kelley (2021) Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods 18(10), pp. 1196–1203.
- J. Bi et al. (2025) Systematic decoding of functional enhancer connectomes and risk variants in human glioma. Nature Cell Biology 27(10), pp. 1838–1847.
- C. Chakraborty et al. (2023) Rewiring of the promoter-enhancer interactome and regulatory landscape in glioblastoma orchestrates gene expression underlying neurogliomal synaptic communication. Nature Communications 14(1), pp. 6446.
- ENCODE Project Consortium (2020) Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583(7818), pp. 699–710.
- J. Feng and J. Yang (2025) Glioma–neuron interactions: insights from neural plasticity. Frontiers in Oncology 15, pp. 1661897.
- C. P. Fulco, M. Munschauer, R. Anyoha, G. Munson, S. R. Grossman, E. M. Perez, M. Kane, B. Cleary, E. S. Lander, and J. M. Engreitz (2016) Systematic mapping of functional enhancer–promoter connections with CRISPR interference. Science 354(6313), pp. 769–773.
- S. Iñiguez-Muñoz, P. Llinàs-Arias, M. Ensenyat-Méndez, et al. (2025) Non-coding somatic single-nucleotide variations affecting glioblastoma-specific enhancer elements regulate tumor-promoting gene networks. Genes & Diseases 13(1), pp. 101762. External Links: [Document](https://dx.doi.org/10.1016/j.gendis.2025.101762).
- K. Karttunen et al. (2023) Transposable elements as tissue-specific enhancers in cancers of endodermal lineage. Nature Communications 14(1), pp. 5313.
- D. R. Kelley, Y. A. Reshef, M. Bileschi, D. Belanger, C. Y. McLean, and J. Snoek (2018) Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Research 28(5), pp. 739–750.
- N. Kokhlikyan, V. Miglani, M. Martin, E. Wang, B. Alsallakh, J. Reynolds, A. Melnikov, N. Kliber, C. Fan, D. Zou, et al. (2020) Captum: a unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896.
- S. Krishna, A. Choudhury, M. B. Keough, K. Seo, L. Ni, S. Kaber, N. Samaha, et al. (2023) Glioblastoma remodelling of human neural circuits decreases survival. Nature 617(7961), pp. 599–607.
- E. Nguyen, M. Poli, M. Faber, J. Arber, R. Bai, T. Dao, S. Ermon, C. Ré, S. Massari, et al. (2023) HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution. Advances in Neural Information Processing Systems (NeurIPS). External Links: [Document](https://dx.doi.org/10.48550/arXiv.2306.15794).
- M. Osswald, E. Jung, F. Sahm, G. Solecki, V. Venkataramani, J. Blaes, S. Weil, H. Horstmann, B. Wiestler, et al. (2015) Brain tumour cells interconnect to a functional and resistant network. Nature 528(7580), pp. 93–98.
- T. Picart and S. Hervey-Jumper (2024) Central nervous system regulation of diffuse glioma growth and invasion: from single unit physiology to circuit remodeling. Journal of Neuro-Oncology 169(1), pp. 1–10.
- Y. Schiff, C. Kao, A. Gokaslan, T. Dao, A. Gu, and V. Kuleshov (2024) Caduceus: bi-directional equivariant long-range DNA sequence modeling. International Conference on Machine Learning.
- M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319–3328.
- I. L. Tan et al. (2023) Targeting the non-coding genome and temozolomide signature enables CRISPR-mediated glioma oncolysis. Cell Reports 42(11), pp. 113339.
- K. R. Taylor, T. Barron, A. Hui, et al. (2023) Glioma synapses recruit mechanisms of adaptive plasticity. Nature 623, pp. 366–374.
- P. J. Thompson, T. S. Macfarlan, and M. C. Lorincz (2016) Long terminal repeats: from parasitic elements to building blocks of the transcriptional regulatory repertoire. Molecular Cell 62(5), pp. 766–776.
- V. Venkataramani, D. I. Tanev, C. Strahle, A. Studier-Fischer, L. Fankhauser, T. Kessler, C. Körber, M. Geodber, et al. (2019) Glutamatergic synaptic input to glioma cells drives brain tumour progression. Nature 573(7775), pp. 532–538.
- H. S. Venkatesh, T. B. Johung, V. Caretti, A. Noll, Y. Tang, S. Nagaraja, E. M. Gibson, C. W. Mount, J. Polepalli, et al. (2015) Neuronal activity promotes glioma growth through neuroligin-3 secretion. Cell 161(4), pp. 803–816.
- H. S. Venkatesh, W. Morishita, A. C. Geraghty, D. Silber, J. N. Gabriel, M. Berés, E. H. Wang, A. Luo, A. Demir, et al. (2019) Electrical and synaptic integration of glioma into neural circuits. Nature 573(7775), pp. 539–545.
- H. S. Venkatesh, L. T. Tam, P. J. Woo, J. Lennon, S. Nagaraja, S. M. Gillespie, J. Ni, D. Y. Duveau, P. J. Morris, et al. (2017) Targeting neuronal activity-regulated neuroligin-3 dependency in high-grade glioma. Nature 549(7673), pp. 533–537.
- C. Zhang, H. Zhang, J. Cao, and M. Liu (2025) Neuroscience in glioma biology. Oncology Reports 54(6).

## Supplementary Material

### Gene Sets

Tier 1 (synaptogenic circuit, 32 genes): ADAM10, BDNF, CAMK2A, CREB1, DLG4, GAD1, GAD2, GJA1, GPC6, GRIA1, GRIA2, GRIA3, GRIA4, GRIN1, GRIN2A, GRIN2B, HOMER1, NLGN1, NLGN3, NRXN1, NRXN2, NTRK2, RELN, SHANK2, SLC12A2, SLC12A5, SLC1A2, SLC7A11, SNAP25, SYT1, THBS1, THBS2.

Tier 2 (proliferative/non-circuit, 30 genes): AKT1, ATRX, BRAF, CCND1, CCND2, CDK4, CDK6, CDKN2A, EGFR, FGFR1, FGFR3, H3-3A, HIST1H3B, IDH1, KIT, KRAS, MDM2, MET, MGMT, MTOR, MYC, NF1, NRAS, PDGFRA, PIK3CA, PTEN, RAF1, RB1, TERT, TP53.

Tier 3 (brain housekeeping/control, 30 genes): ALDH1L1, ALDOC, AQP4, CALB1, CALB2, CKB, CNP, ENO2, GAPDH, GFAP, GLUL, MAP2, MBP, MOG, NEFL, NPY, NRGN, OLIG2, PLP1, PVALB, S100B, SLC17A7, SOX10, SST, SYN1, SYP, TH, TUBB3, UCHL1, VHL.

### Annotation Statistics

Table 2: Dark genome element counts by class across all 92 gene windows.

| Element Class | Total Count | Mean per Gene | Mean Length (bp) |
|---|---|---|---|
| SINE | 9,676 | 105.2 | 248 |
| LINE | 5,814 | 63.2 | 583 |
| Distal enhancer (dELS) | 4,466 | 48.5 | 343 |
| G-quadruplex motif | 3,213 | 34.9 | 31 |
| LTR retrotransposon | 2,338 | 25.4 | 494 |
| DNA transposon | 2,094 | 22.8 | 267 |
| Proximal enhancer (pELS) | 1,997 | 21.7 | 309 |
| Promoter (PLS) | 528 | 5.7 | 381 |
| CTCF insulator | 197 | 2.1 | 317 |
| DNase-H3K4me3 | 100 | 1.1 | 298 |
| Retroposon | 25 | 0.3 | 187 |
| Total | 30,448 | 331.0 | – |

### Robustness

Table 3: Robustness across scoring windows and perturbation schemes (Tier 1, $n_{\text{elem}} = 9{,}512$). *Left:* TSS-proximal vs. distal $|\text{RIS}|$ fold-enrichment as a function of scoring window $W$. *Right:* Concordance of N-token masking, in-place sequence shuffling, and random-base substitution at $W = 10$ kb.

| Scoring window $W$ | Prox./dist. $|\text{RIS}|$ |
|---|---|
| 5 kb | 21,197$\times$ |
| 10 kb | 463.7$\times$ |
| 20 kb | 4.66$\times$ |
| 50 kb | 1.19$\times$ |
| full | 0.85$\times$ |

| Scheme pair | Spearman $\rho$ | Pearson $r$ | Top-100 |
|---|---|---|---|
| N-mask vs. shuffle | 0.745 | 0.857 | 32% |
| N-mask vs. random | 0.750 | 0.841 | 30% |
| Shuffle vs. random | 0.889 | 0.985 | – |
| Triple int. (top 100) | | | 28% |

### Computational Resources

All experiments were conducted on a single NVIDIA A6000 48 GB GPU. For Caduceus-Ph, wild-type scoring of all 92 genes used float16 ($\sim$1.7 GB, $\sim$20 s total); ISM (30,448 ablations) was parallelized across four shards ($\sim$6.9 GB aggregate, $\sim$90 min); IG attribution for 32 Tier 1 genes required float32 ($\sim$30 GB, $\sim$12 min). For HyenaDNA, ISM used the same four-shard parallelization in float16 ($\sim$12.5 GB aggregate, $\sim$83 min) and IG used float32 ($\sim$11 min for 32 genes). For Enformer, ISM was parallelized across four shards in float16 ($\sim$2 GB per shard, $\sim$31 s/gene); one gene (TUBB3) required a float32 re-run due to fp16 overflow; IG used float32 ($\sim$13.6 GB peak, $\sim$8 s/gene). Total compute across all three models was less than 6 GPU-hours on a single A6000.

### Element\-Class Regulatory Influence

Table 4: Dark genome element class regulatory influence. “Proximal” denotes elements within 10 kb of the TSS.

### Integrated Gradients Attribution Tracks

![Refer to caption](https://arxiv.org/html/2606.06834v1/x8.png)

Figure 8: Integrated Gradients attribution tracks for three circuit genes. Smoothed $|\mathrm{IG}|$ signal (1 kb rolling mean) for NLGN3, NRXN1, and GRIA2. Red vertical line marks the TSS; pink shading indicates the $\pm$10 kb regulatory horizon. Colored bars at bottom denote annotated dark genome elements (LINE, SINE, LTR, G4, distal enhancer, promoter). Attribution signal concentrates sharply within the 10 kb boundary, with peaks at annotated elements, orthogonally validating the ISM distance-decay finding. The NRXN1 panel shows elevated attribution at the ERV3-derived LTR element ($\text{RIS} = -2.53$).

![Refer to caption](https://arxiv.org/html/2606.06834v1/x9.png)

Figure 9: Element-class regulatory hierarchy within the 10 kb proximal window. Mean RIS (Caduceus-Ph) restricted to elements with TSS distance $<10$ kb, ranked by mean influence. LTR retrotransposons lead at $\overline{\text{RIS}} = -0.307$, followed by promoters ($-0.266$), LINEs ($-0.264$), distal enhancers ($-0.247$), CTCF insulators ($-0.237$), proximal enhancers ($-0.235$), SINEs ($-0.208$), DNA transposons ($-0.187$), and G-quadruplexes ($-0.024$). Sample sizes per class shown in parentheses. Error bars: SEM.

![Refer to caption](https://arxiv.org/html/2606.06834v1/x10.png)

Figure 10: Top 20 dark genome elements by $|\text{RIS}|$ (Caduceus-Ph). Each bar labels the gene and transposable element family; element class and TSS distance annotated at right for elements 14–20. Colors denote gene tier: circuit (red), proliferative (blue), brain control (grey). The top hit NPY$\cdot$L1PA6 ($\text{RIS} = -4.6$) is a control-tier outlier; NRXN1$\cdot$ERV3-16A3_I-int ($\text{RIS} = -2.53$) is the strongest circuit-tier hit and the element with maximal Integrated Gradients attribution.

### Top Hits and Circuit\-Gene Specifics

Table 5: Top 10 dark genome elements by $|\text{RIS}|$ across all 92 gene loci.

Beyond the top 10 overall (Table [5](https://arxiv.org/html/2606.06834#Ax1.T5)), notable circuit-gene-specific hits include SNAP25 L1PA6 ($\text{RIS} = -1.87$, 7.8 kb), THBS2 Tigger1 ($\text{RIS} = -1.69$, 3.7 kb), CREB1 L2a ($\text{RIS} = -1.36$, 6.3 kb), GRIA2 L1MB1 ($\text{RIS} = -1.31$, 9.5 kb), GAD2 L1MA8 ($\text{RIS} = -1.29$, 9.3 kb), NTRK2 L1ME1 ($\text{RIS} = -1.27$, 2.9 kb), GRIA4 L2c ($\text{RIS} = -1.25$, 8.8 kb), CREB1 L2b ($\text{RIS} = -1.22$, 1.9 kb), and SYT1 L1ME3E ($\text{RIS} = -1.07$, 7.9 kb). All are LINE or DNA transposon elements within the 10 kb regulatory horizon.

### Broader Impact Statement

This work develops computational tools for characterizing noncoding regulatory elements in the context of glioma biology. The identified regulatory elements are computational predictions that require experimental validation before any clinical translation. We do not foresee negative societal impacts from this research, as it contributes to basic understanding of gene regulation in cancer. The methods and gene sets are derived from publicly available datasets (GENCODE, ENCODE, UCSC RepeatMasker) and publicly available pretrained models (Caduceus-Ph, HyenaDNA, Enformer). No patient data or human subjects were involved. The computational framework is generalizable and could accelerate noncoding variant interpretation in other diseases, potentially benefiting precision medicine efforts.