GEESE: Genotype-aware End-to-End Spatio-temporal Embedding for Behavioral Phenotyping
Summary
GEESE is an end-to-end deep learning framework that learns behavioral representations directly from 3D pose dynamics without hand-crafted features, surpassing traditional baselines in behavior classification and genotype prediction across three autism-associated genetic models. It also introduces HONK, an interactive tool for natural language-based behavioral phenotyping.
# GEESE: Genotype-aware End-to-End Spatio-temporal Embedding for Behavioral Phenotyping
Source: [https://arxiv.org/html/2605.24370](https://arxiv.org/html/2605.24370)
MS¹; Yuen Gao, PhD²; Chunqi Qian, PhD²; Zijun Cui, PhD¹
¹Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, USA
²Department of Radiology, Michigan State University, East Lansing, Michigan, USA
## Abstract
Behavioral phenotyping of genetic animal models currently requires labor\-intensive manual feature engineering that limits reproducibility and scalability\. We present GEESE, an end\-to\-end deep learning framework that learns behavioral representations directly from 3D pose dynamics without hand\-crafted features\. Using a pretrained time series foundation model, we encode movement sequences into a behavioral manifold that supports both behavior classification and genotype prediction\. Evaluated across three autism\-associated genetic models \(CNTNAP2, CHD8, FMR1\), our deep learning approach surpasses hand\-crafted feature baselines in both tasks, revealing that learned representations capture genotype\-specific behavioral signatures\. The framework generalizes across genetic backgrounds, and an all\-cohort model identifies both genetic background and genotype from movement patterns alone\. We further provide HONK, an interactive intelligent tool enabling researchers without programming expertise to perform behavioral phenotyping from pose data through natural language interaction\.
## 1 Introduction
Behavioral assessment is fundamental to the diagnosis and treatment monitoring of neurological and psychiatric conditions, from motor symptoms in Parkinson’s [11](https://arxiv.org/html/2605.24370#bib.bib23) and Huntington’s disease [1](https://arxiv.org/html/2605.24370#bib.bib24) to social and repetitive behaviors in Autism Spectrum Disorder (ASD) [10](https://arxiv.org/html/2605.24370#bib.bib28). Despite this central role, clinical assessment remains largely subjective, time-intensive, and dependent on specialist availability. These limitations underscore the need for behavioral phenotyping: the process of transforming raw movement into standardized, objective, and scalable digital signatures. Moving beyond qualitative observation, automated phenotyping would enable direct comparison of phenotypes across genetic models and accelerate drug development pipelines, while accessible screening could shorten the time from initial concern to diagnosis and intervention.
While the objectives of behavioral phenotyping are clear in clinical settings, achieving them requires overcoming technical challenges. In preclinical research, rodent models carrying ASD-associated genetic variants are typically evaluated through behavior assays such as open field tests, social interaction paradigms, and grooming analysis [2](https://arxiv.org/html/2605.24370#bib.bib31); [13](https://arxiv.org/html/2605.24370#bib.bib32). Modern markerless pose estimation methods have made high-resolution behavioral kinematic data increasingly accessible [15](https://arxiv.org/html/2605.24370#bib.bib6); [17](https://arxiv.org/html/2605.24370#bib.bib12); [18](https://arxiv.org/html/2605.24370#bib.bib7); [5](https://arxiv.org/html/2605.24370#bib.bib11). However, translating raw pose coordinates into meaningful behavioral descriptions remains challenging, largely because of reliance on hand-crafted feature engineering that introduces researcher bias and limits reproducibility across laboratories. Recent advances in artificial intelligence, particularly deep learning and foundation models pretrained on large-scale data, offer an opportunity to automate this translation and lower the barrier to clinical adoption of behavioral phenotyping tools. Existing deep learning approaches, however, provide behavioral visualization and clustering but do not support downstream predictive tasks such as genotype prediction.
In this work, we propose GEESE (Genotype-aware End-to-End Spatio-temporal Embeddings), a representation learning pipeline for behavioral phenotyping from 3D pose dynamics. We leverage a time series foundation model [7](https://arxiv.org/html/2605.24370#bib.bib16), pretrained on large-scale temporal data, as the encoder backbone, which is then fine-tuned on behavioral labels to learn behavioral representations. The resulting latent space serves as a unified representation space supporting multiple downstream tasks without task-specific feature engineering. We evaluate GEESE on 146 recording sessions across three ASD-associated genetic models (CNTNAP2, CHD8, FMR1), showing that learned representations outperform hand-crafted baselines in both behavior classification and genotype prediction, generalize across cohorts, and capture genotype-specific behavioral signatures.
To support broader adoption, we additionally present HONK \(Hands\-On Natural\-language Knowledgebase\), an interactive analysis tool that enables researchers without programming expertise to perform behavioral phenotyping directly from pose data\. This framework represents a step toward automated behavioral assessment tools that could ultimately support clinical screening and therapeutic development for ASD and other behaviorally\-defined conditions\.
## 2 Related Works
Hand-crafted Feature Engineering for Behavioral Analysis. Traditional behavioral analysis pipelines rely on hand-crafted feature engineering to manage the high dimensionality of kinematic data. Principal Component Analysis (PCA) [24](https://arxiv.org/html/2605.24370#bib.bib1) is commonly used to reduce the dimensionality of pose coordinates, while wavelet transforms [27](https://arxiv.org/html/2605.24370#bib.bib3); [12](https://arxiv.org/html/2605.24370#bib.bib2) approximate temporal dynamics. The s-DANNCE pipeline [14](https://arxiv.org/html/2605.24370#bib.bib5) combines 3D pose tracking with wavelet-based feature extraction to map behavioral structure, but still requires manually designed spectral features. These hand-crafted pipelines require extensive domain expertise to design appropriate features for each specific experimental context, inherently introducing researcher bias into the quantification process. Moreover, the resulting metrics may miss subtle patterns not captured by predefined measurements: subsecond behaviors are often missed [8](https://arxiv.org/html/2605.24370#bib.bib8), and keypoint jitter can produce spurious segmentation transitions [23](https://arxiv.org/html/2605.24370#bib.bib9).
Deep Learning Approaches for Behavioral Representation. Learning behavioral representations directly from movement kinematics offers a fundamentally different paradigm that bypasses the limitations of human-defined features [20](https://arxiv.org/html/2605.24370#bib.bib33). Rather than summarizing motion through predefined descriptors, learning-based approaches embed entire movement sequences into a continuous latent space [14](https://arxiv.org/html/2605.24370#bib.bib5). Recent deep learning methods have been applied in this field: CEBRA uses contrastive learning to produce consistent latent embeddings from behavioral and neural data [20](https://arxiv.org/html/2605.24370#bib.bib33); Keypoint-MoSeq applies generative modeling to parse continuous pose trajectories into discrete behavioral syllables [23](https://arxiv.org/html/2605.24370#bib.bib9); the Social Behavior Atlas (SBeA) framework uses few-shot learning to reduce annotation requirements for pose estimation and identity recognition [8](https://arxiv.org/html/2605.24370#bib.bib8). These approaches demonstrate that end-to-end representation learning can extract behavioral structure from movement data without hand-crafted features. However, none of these methods supports genotype prediction from learned representations, leaving an end-to-end predictive framework absent from existing behavioral analysis pipelines.
## 3 Methodology
GEESE operates as a three-step pipeline (Figure [1](https://arxiv.org/html/2605.24370#S3.F1)). First, continuous 3D pose recordings are segmented into overlapping temporal windows, each represented as a matrix $\mathbf{X} \in \mathbb{R}^{T \times D}$ capturing the instantaneous posture and short-term dynamics of all skeletal keypoints. Second, each window is passed through a pretrained time series foundation model that compresses the high-dimensional pose sequence into a compact embedding $\mathbf{z} \in \mathbb{R}^{d}$; the embeddings form a behavioral manifold where proximity reflects kinematic similarity. Third, task-specific classification heads are attached for downstream prediction.
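The first step of the pipeline can be illustrated with a minimal NumPy sketch. It assumes the window length, stride, and keypoint count reported in the experimental settings (32 frames, stride 16, 23 keypoints); the function name `make_windows` and the synthetic recording are illustrative and not part of the GEESE codebase.

```python
import numpy as np

def make_windows(pose, window=32, stride=16):
    """Slice a (frames, keypoints, 3) pose array into overlapping windows.

    Returns an array of shape (num_windows, window, keypoints * 3).
    """
    frames, keypoints, dims = pose.shape
    # Flatten the xyz coordinates of every keypoint into D channels per frame.
    flat = pose.reshape(frames, keypoints * dims)
    starts = range(0, frames - window + 1, stride)
    if len(starts) == 0:
        # Recording shorter than one window: return an empty batch.
        return np.empty((0, window, keypoints * dims), dtype=pose.dtype)
    return np.stack([flat[s:s + window] for s in starts])

# Synthetic stand-in for one recording: 10 s at 30 fps, 23 keypoints in 3D.
rng = np.random.default_rng(0)
session = rng.normal(size=(300, 23, 3))
windows = make_windows(session)
print(windows.shape)  # (17, 32, 69)
```

With a stride of half the window length, the second half of each window is identical to the first half of the next, which is why windows from the same session are strongly correlated.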
Figure 1: System architecture. Pose sequences are processed by a pretrained time series model, producing behavioral representations. Training on behavioral labels organizes these representations by behavior type, enabling behavior classification. The same representations support genotype prediction after brief fine-tuning on genotype labels.

### 3.1 Preliminary: Time Series Foundation Models
MOMENT [6](https://arxiv.org/html/2605.24370#bib.bib14) is an open-source foundation model pretrained on the Time Series Pile, a large-scale collection spanning healthcare, finance, and engineering domains. Its transformer encoder ($L = 24$ layers) processes multivariate time series through patch embedding and supports transfer learning to classification tasks.
### 3.2 Model Architecture
The encoder $f_\theta$ maps each input window to a fixed-dimensional embedding:

$$\mathbf{z} = f_\theta(\mathbf{X}) \in \mathbb{R}^{d} \tag{1}$$

where $d = 1024$. Internally, MOMENT processes the $D = 69$ input channels through a patch embedding layer, applies $L = 24$ transformer encoder blocks with multi-head self-attention, and produces per-token representations that are aggregated via mean pooling:

$$\mathbf{z} = \frac{1}{N_{\text{patches}}} \sum_{i=1}^{N_{\text{patches}}} \mathbf{h}_i \tag{2}$$

where $\mathbf{h}_i \in \mathbb{R}^{d}$ is the output of the final transformer block for patch $i$, and $N_{\text{patches}}$ is the number of temporal patches.

For downstream tasks, we attach task-specific linear classification heads to the encoder. For behavior classification, a linear head $g_{\phi_b}$ maps the embedding to class logits:

$$\hat{\mathbf{y}}_b = g_{\phi_b}(\mathbf{z}) = \mathbf{W}_b \mathbf{z} + \mathbf{b}_b \tag{3}$$

where $\mathbf{W}_b \in \mathbb{R}^{C_b \times d}$, $\mathbf{b}_b \in \mathbb{R}^{C_b}$, and $C_b = 9$ is the number of behavior categories. For genotype classification, a separate linear head $g_{\phi_g}$ maps to genotype logits:

$$\hat{\mathbf{y}}_g = g_{\phi_g}(\mathbf{z}) = \mathbf{W}_g \mathbf{z} + \mathbf{b}_g \tag{4}$$

where $\mathbf{W}_g \in \mathbb{R}^{C_g \times d}$, $\mathbf{b}_g \in \mathbb{R}^{C_g}$, and $C_g \in \{2, 3\}$ depending on the cohort.
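To make the shapes in Equations (2) to (4) concrete, the following NumPy sketch applies mean pooling and the two linear heads to random stand-in data. The patch count, the random per-patch outputs, and the random head weights are assumptions for illustration only; in the actual model the per-patch outputs come from the MOMENT encoder and the heads are trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_patches = 1024, 4   # d is from the text; the patch count is arbitrary here
C_b, C_g = 9, 3          # nine behaviors; three genotypes (a WT/HET/HOM cohort)

# Stand-in for the final transformer block's per-patch outputs h_i.
H = rng.normal(size=(n_patches, d))

# Eq. (2): mean pooling over patches gives one embedding of shape (d,).
z = H.mean(axis=0)

# Randomly initialised linear heads standing in for Eqs. (3) and (4).
W_b, b_b = rng.normal(size=(C_b, d)) * 0.01, np.zeros(C_b)
W_g, b_g = rng.normal(size=(C_g, d)) * 0.01, np.zeros(C_g)

logits_b = W_b @ z + b_b   # behavior logits, shape (9,)
logits_g = W_g @ z + b_g   # genotype logits, shape (3,)
print(z.shape, logits_b.shape, logits_g.shape)
```

Both heads read the same embedding, so the only task-specific parameters are the weight matrix and bias of each head.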
### 3.3 Training Strategy
We adopt a two\-stage training strategy: the encoder is first trained on expert\-annotated behavioral labels to organize the manifold by behavior type, then fine\-tuned end\-to\-end on genotype labels with a reduced learning rate to gain sensitivity to genotype\-associated movement patterns while preserving learned behavioral structure\.
Stage 1: Supervised Training for Behavior Classification. HLAC (High-Level Action Classification), manually defined by the developers of s-DANNCE based on visual inspection, provides human-annotated behavioral labels including locomotion, grooming, rearing, and other stereotyped actions [14](https://arxiv.org/html/2605.24370#bib.bib5). We train the model by minimizing the cross-entropy loss over behavior labels:

$$\mathcal{L}_{\text{behav}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C_b} y_{b,i}^{(c)} \log \frac{\exp\left(\hat{y}_{b,i}^{(c)}\right)}{\sum_{j=1}^{C_b} \exp\left(\hat{y}_{b,i}^{(j)}\right)} \tag{5}$$

where $y_{b,i}^{(c)}$ is the one-hot encoded ground truth label for sample $i$ and class $c$, and $N$ is the number of training samples. During this stage, both the encoder parameters $\theta$ and the classification head parameters $\phi_b$ are updated.
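As a numerical check on the softmax cross-entropy in Equation (5), here is a small NumPy sketch. It uses integer class labels, which is equivalent to the one-hot form because only the true class contributes to the inner sum; the function name and the toy logits are illustrative.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy for integer class labels, matching Eq. (5)."""
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for each sample.
    return -log_probs[np.arange(len(labels)), labels].mean()

labels = np.array([0, 3, 5, 8])

# Uniform logits over 9 classes give a loss of log(9), the chance-level value.
uniform = np.zeros((4, 9))
loss_uniform = softmax_cross_entropy(uniform, labels)

# Confident, correct logits drive the loss toward zero.
confident = np.full((4, 9), -10.0)
confident[np.arange(4), labels] = 10.0
loss_confident = softmax_cross_entropy(confident, labels)
```

The genotype loss has the same form with $C_g$ classes in place of $C_b$.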
Stage 2: Fine-tuning for Genotype Classification. We then replace the classification head and fine-tune on genotype labels. The genotype loss is:

$$\mathcal{L}_{\text{geno}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C_g} y_{g,i}^{(c)} \log \frac{\exp\left(\hat{y}_{g,i}^{(c)}\right)}{\sum_{j=1}^{C_g} \exp\left(\hat{y}_{g,i}^{(j)}\right)} \tag{6}$$

During this stage, both the encoder parameters $\theta$ and the genotype head parameters $\phi_g$ are updated with a reduced learning rate ($\eta = 10^{-5}$) to preserve learned behavioral structure while gaining sensitivity to genotype-associated movement patterns. Training runs for up to 10 epochs with early stopping (patience 5).
## 4 Evaluation
### 4.1 Experimental Settings
Datasets. We used 3D pose tracking data from the s-DANNCE repository [14](https://arxiv.org/html/2605.24370#bib.bib5), which contains recordings from rodent models of autism-associated genes. We analyzed three cohorts: CNTNAP2 (42 sessions; WT, HET, HOM), CHD8 (80 sessions; WT, HET), and FMR1 (24 sessions; WT, HET). All recordings were lone (single-animal) sessions. Each session contains continuous 3D coordinates for 23 skeletal keypoints captured at 30 fps, with expert-annotated behavioral labels for nine categories (idle, sniff, groom, scrunch, active crouch, rearing, explore, locomotion, fast locomotion). Behavioral annotations were defined by the s-DANNCE developers based on visual inspection of rodent pose dynamics.
For data preprocessing, continuous pose sequences were segmented into overlapping windows for model input. Each window is represented as a matrix $\mathbf{X} \in \mathbb{R}^{T \times D}$, where $T = 32$ frames (approximately 1 second at 30 fps) is the window length and $D = J \times 3 = 69$ is the number of input channels obtained by flattening the 3D coordinates of $J = 23$ skeletal keypoints. Windows are extracted with stride $S = 16$, giving 50% overlap. To prevent data leakage from temporally correlated windows, we employed session-based splitting into train (64%), validation (16%), and test (20%) sets, ensuring that all windows from a given session appear exclusively in one split.
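Session-based splitting can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' code: the session identifiers are hypothetical, the split sizes are obtained by rounding, and the paper does not say whether splits were stratified by genotype.

```python
import random

def split_sessions(session_ids, train=0.64, val=0.16, seed=0):
    """Assign whole sessions to train/val/test so no session spans two splits."""
    ids = sorted(set(session_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(train * len(ids))
    n_val = round(val * len(ids))
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # remainder, roughly 20%
    }

# Hypothetical identifiers for an 80-session cohort (the CHD8 cohort size).
sessions = [f"session_{i:03d}" for i in range(80)]
splits = split_sessions(sessions)
print({name: len(ids) for name, ids in splits.items()})
```

Windows would then be generated per session and inherit that session's split, so overlapping windows never straddle the train/test boundary.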
Evaluation Metrics. We report test accuracy and normalized confusion matrices for both behavior and genotype classification. For unsupervised analysis, we apply K-Means clustering (k=9) and evaluate using silhouette score [19](https://arxiv.org/html/2605.24370#bib.bib46) for cluster compactness and normalized mutual information (NMI) [21](https://arxiv.org/html/2605.24370#bib.bib47) for correspondence with ground truth labels, in order to examine learned representations beyond supervised labels.
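The supervised metrics can be illustrated with a short NumPy sketch. It assumes the confusion matrices are row-normalized (each row sums to one over predictions for a given true class), which matches how per-class rates are quoted later in the text but is not stated explicitly; the labels below are toy data.

```python
import numpy as np

def normalized_confusion(y_true, y_pred, n_classes):
    """Row-normalised confusion matrix: entry (i, j) is P(pred = j | true = i)."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for classes absent from y_true stay at zero instead of dividing by 0.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 0])
cm = normalized_confusion(y_true, y_pred, n_classes=3)
accuracy = float((y_true == y_pred).mean())
print(cm)
print(accuracy)  # 0.625
```

The diagonal of the row-normalized matrix gives per-class recall, so a class with few samples is weighted equally with a frequent one when reading the matrix.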
Implementation Details. Both stages use Adam with mixed-precision (FP16) training. Stage 1 uses learning rate reduction on plateau (factor 0.5, patience 5); Stage 2 uses a fixed learning rate of $10^{-5}$ with early stopping (patience 5).
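The early-stopping rule used in Stage 2 can be sketched as a small helper. The monitored quantity is assumed here to be validation loss, which the text does not specify, and the loss values are made up for illustration.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses for a run capped at 10 epochs.
val_losses = [0.90, 0.80, 0.79, 0.81, 0.80, 0.82, 0.83, 0.85, 0.84, 0.86]
stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)  # 8 0.79
```

With a 10-epoch cap and a patience of 5, the rule can only fire when the best epoch falls within the first five; otherwise training simply runs to the cap.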
### 4.2 Quantitative Evaluation
We evaluate GEESE on three levels: comparison against baseline methods on within-cohort tasks (Section [4.2.1](https://arxiv.org/html/2605.24370#S4.SS2.SSS1)), cross-cohort generalization (Section [4.2.2](https://arxiv.org/html/2605.24370#S4.SS2.SSS2)), and unified multi-cohort phenotyping (Section [4.2.3](https://arxiv.org/html/2605.24370#S4.SS2.SSS3)).
#### 4.2.1 Comparison to SOTA Methods
We train and evaluate each method independently on each cohort using session-based splits. Methods fall into three categories: hand-crafted features (raw pose, PCA, s-DANNCE) paired with a linear SVM, learned representations (CEBRA, MOMENT frozen) also paired with a linear SVM, and our end-to-end fine-tuned GEESE. MOMENT (frozen) uses the pretrained encoder without any parameter updates, serving as a direct measure of off-the-shelf foundation model representations. Random guess accuracy reflects class-balanced chance levels. Table [1](https://arxiv.org/html/2605.24370#S4.T1) summarizes behavior classification and genotype prediction accuracy across all cohorts and methods.
Table 1: Downstream task accuracy across methods. Hand-crafted and learned representation methods extract fixed features from 3D pose sequences, on which a linear SVM is trained for each task. GEESE fine-tunes the encoder end-to-end. Genotype prediction for representation-based methods uses direct SVM classification on genotype labels; GEESE uses a two-stage protocol (behavior training $\rightarrow$ genotype fine-tuning).
Behavior Classification. We first examine whether training on behavioral labels improves the model’s ability to distinguish different behavioral patterns. As shown in Table [1](https://arxiv.org/html/2605.24370#S4.T1), GEESE outperforms all baselines on behavior classification across all three cohorts (78.56%, 77.99%, 76.30%) without any cohort-specific modification.
s\-DANNCE achieves 75\.57%, 75\.33%, and 74\.38% using its 421\-dimensional wavelet\+PCA features\. GEESE surpasses s\-DANNCE on all three cohorts \(78\.56% vs\. 75\.57% on CNTNAP2, 77\.99% vs\. 75\.33% on CHD8, 76\.30% vs\. 74\.38% on FMR1\) without any hand\-crafted features, demonstrating that end\-to\-end fine\-tuning of a pretrained foundation model can outperform carefully designed spectral feature pipelines\.
Among learned representations, MOMENT frozen \(53\.24%, 50\.92%, 49\.75%\) performs comparably to raw pose, confirming that off\-the\-shelf pretraining does not capture behavioral structure\. CEBRA achieves intermediate performance \(64\.91%, 61\.43%, 59\.84%\)\. The improvement from frozen to fine\-tuned demonstrates that fine\-tuning is essential for domain adaptation\.
Genotype Prediction. The model achieves 58.83% accuracy on CNTNAP2 (chance: 33.3%), 62.82% on CHD8 (chance: 50.00%), and 67.70% on FMR1 (chance: 50.00%) through the two-stage protocol (Table [1](https://arxiv.org/html/2605.24370#S4.T1)). On CNTNAP2, the only cohort with three genotypes, HET animals are most reliably identified (70%), followed by HOM (47%) and WT (47%). The confusion pattern reveals a biologically meaningful gradient: WT is often misclassified as HET (48%), consistent with the expected gene-dosage effect where loss of one CNTNAP2 copy produces an intermediate phenotype [4](https://arxiv.org/html/2605.24370#bib.bib19). HOM animals show a more distinct behavioral profile, likely reflecting the more severe social and repetitive behavior abnormalities in complete knockouts.
Note that baseline methods train directly on genotype labels, while GEESE adopts the two-stage protocol described in Section [3.3](https://arxiv.org/html/2605.24370#S3.SS3). Under direct classification, s-DANNCE achieves 57.50%, 61.72%, and 66.45%, consistent with its use of spectral features designed to capture fine-grained kinematic differences. Despite its more constrained protocol, GEESE exceeds s-DANNCE on all three cohorts (58.83%, 62.82%, 67.70%), indicating that the behaviorally-organized manifold not only retains genotype-discriminative information but provides a more effective basis for genotype prediction than features designed purely for kinematic description. Raw pose coordinates also achieve notable accuracy on FMR1 (68.67%), slightly exceeding GEESE (67.70%). This suggests that FMR1-associated behavioral differences manifest as detectable postural signatures even without temporal feature extraction, consistent with the pronounced motor phenotype reported in Fragile X models [3](https://arxiv.org/html/2605.24370#bib.bib21).
We compared MOMENT-Large against ETSformer [25](https://arxiv.org/html/2605.24370#bib.bib45) and DLinear [26](https://arxiv.org/html/2605.24370#bib.bib44) as alternative backbones; MOMENT-Large consistently outperformed both in frozen and fine-tuned settings (Table [4](https://arxiv.org/html/2605.24370#S4.T4)), suggesting that large-scale pretraining provides a stronger initialization for behavioral classification.
#### 4.2.2 Cross Cohort Generalization
We evaluate whether representations transfer across genetic backgrounds by applying source-trained models to unseen target cohorts, comparing against s-DANNCE as the strongest baseline.
Cross-cohort Behavior Transfer. Table [2](https://arxiv.org/html/2605.24370#S4.T2) reports behavior classification accuracy when a model trained on one cohort is applied directly to another without retraining. For s-DANNCE, this involves fitting PCA on the source cohort and computing features for the target cohort using the source PCA basis, then applying the source-trained classifier. For GEESE, the source-trained encoder and classification head are applied directly to target cohort data.
GEESE consistently outperforms s-DANNCE in both within-cohort (diagonal) and cross-cohort (off-diagonal) settings. Notably, GEESE models trained on CNTNAP2 achieve 76.4% on CHD8 and 75.2% on FMR1, close to the within-cohort accuracies for those targets (77.99% and 76.30%). The relatively small drop from diagonal to off-diagonal for both methods suggests that the nine behavioral categories share consistent kinematic signatures across genetic backgrounds. However, GEESE’s larger advantage in the cross-cohort setting indicates that learned representations capture more transferable behavioral structure than hand-crafted spectral features.
Table 2: Cross-cohort generalization. Behavior classification uses zero-shot transfer; genotype classification trains an MLP on 30% of target labels. Rows indicate source cohort; columns indicate target. Diagonal entries (shaded) are within-cohort results from Table [1](https://arxiv.org/html/2605.24370#S4.T1).
Cross-cohort Genotype Classification. While cross-cohort behavior transfer applies a source classifier directly to target data, cross-cohort genotype evaluation requires a different protocol: the genes differ across cohorts, so a genotype classifier trained on CNTNAP2 (WT/HET/HOM) cannot be applied to CHD8 (WT/HET). Instead, we evaluate whether a source cohort’s representation supports genotype classification on a target cohort when given limited target labels. Specifically, we use source-trained feature extractors (GEESE encoder or s-DANNCE PCA basis) to compute target cohort representations, then train an MLP classifier on 30% of the target cohort’s genotype labels and evaluate on the remaining data. Table [2](https://arxiv.org/html/2605.24370#S4.T2) reports the results.
Both methods show comparable performance with off\-diagonal entries close to diagonal values, indicating that learned representations preserve genotype\-discriminative information across genetic backgrounds\.
Notably, s-DANNCE represents a strong baseline derived from domain-expert-designed spectral features [14](https://arxiv.org/html/2605.24370#bib.bib5); GEESE’s comparable or superior performance across all source-target pairs demonstrates that learned representations match expert-crafted features in cross-cohort transferability.
#### 4.2.3 GEESE as A Unified Model for Multi-cohort Phenotyping
To evaluate whether a single model can perform phenotyping across all genetic backgrounds simultaneously, we trained a single model on combined data from all three cohorts and evaluated on two tasks: overall behavior classification and unified 7-class genotype identification (CNTNAP2-WT/HET/HOM, CHD8-WT/HET, FMR1-WT/HET). Table [3](https://arxiv.org/html/2605.24370#S4.T3) reports the results. GEESE achieves 74.17% behavior accuracy, roughly 3.4 percentage points below the 77.62% unweighted mean of the cohort-specific models (78.56%, 77.99%, 76.30%), indicating that a single encoder can represent behavioral structure across diverse genetic backgrounds without cohort-specific tuning. On the 7-class genotype task (chance: 14.29%), GEESE achieves 59.47% compared to s-DANNCE’s 57.73%, demonstrating that both methods can distinguish not only genotype dosage within a cohort but also genetic background across cohorts from behavioral data alone.
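The unified 7-class label space can be built by enumerating every (cohort, genotype) pair. The sketch below shows one way to do this; the ordering of classes and the dictionary names are assumptions, since the paper only lists which classes exist.

```python
COHORT_GENOTYPES = {
    "CNTNAP2": ["WT", "HET", "HOM"],
    "CHD8": ["WT", "HET"],
    "FMR1": ["WT", "HET"],
}

# Enumerate (cohort, genotype) pairs in a fixed order to get class indices 0..6.
CLASS_INDEX = {
    (cohort, genotype): idx
    for idx, (cohort, genotype) in enumerate(
        (c, g) for c, genotypes in COHORT_GENOTYPES.items() for g in genotypes
    )
}

num_classes = len(CLASS_INDEX)
chance = 100.0 / num_classes  # class-balanced chance level, in percent
print(num_classes, f"{chance:.2f}%")  # 7 14.29%
```

Because each class encodes both cohort and genotype, a correct prediction identifies the genetic background as well as the dosage within it.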
Table 3:Joint training across all cohorts\. A single model is trained on combined data from all three cohorts\. Behavior classification accuracy is reported on the combined test set\. Genotype classification is a 7\-class task \(CNTNAP2\-WT/HET/HOM, CHD8\-WT/HET, FMR1\-WT/HET\)\.
#### 4.2.4 Ablation Study
Table [4](https://arxiv.org/html/2605.24370#S4.T4) compares different foundation model configurations on behavior classification to isolate the contributions of model architecture and fine-tuning.
In the frozen setting, all three models perform comparably, confirming that pretrained representations alone do not capture rodent behavioral structure\. Fine\-tuning reveals clear differences: ETSformer plateaus below 70%, DLinear shows inconsistent gains, while GEESE \(MOMENT\-Large\) achieves the highest accuracy across all cohorts \(78\.56%, 77\.99%, 76\.30%\)\. The gap between frozen and fine\-tuned MOMENT indicates that the value of the foundation model lies in its learned inductive biases that facilitate efficient domain adaptation through fine\-tuning\.
Table 4:Ablation study: foundation model comparison
### 4.3 Qualitative Evaluation
#### 4.3.1 Learned Representations Reveal Interpretable Behavioral Structure
To examine whether the learned embeddings organize into interpretable behavioral groups, we applied K-Means clustering [9](https://arxiv.org/html/2605.24370#bib.bib48) (k=9) to the learned embeddings without using any labels.
For visualization, we project embeddings to two dimensions using UMAP [16](https://arxiv.org/html/2605.24370#bib.bib15) with cosine distance. The grouping revealed nine distinct clusters corresponding to different movement patterns (Figure [2(a)](https://arxiv.org/html/2605.24370#S4.F2.sf1)). Visual inspection of representative skeleton sequences (Figures [2(c)](https://arxiv.org/html/2605.24370#S4.F2.sf3), [2(d)](https://arxiv.org/html/2605.24370#S4.F2.sf4), [2(e)](https://arxiv.org/html/2605.24370#S4.F2.sf5), [2(f)](https://arxiv.org/html/2605.24370#S4.F2.sf6), and [2(g)](https://arxiv.org/html/2605.24370#S4.F2.sf7)) shows that each cluster captures characteristic behaviors: Cluster 0 contains primarily idle and forward-looking postures; Clusters 2 and 8 are characterized by head-turning and lateral head-turning; Cluster 4 shows face-grooming; Cluster 5 reflects rearing and forward-locomotion; Cluster 7 captures rearing and ground-sniffing. We further compared cluster assignments with human-labeled behavioral categories to quantify how well these groups correspond to expert annotations. Behaviors with distinctive motion signatures (e.g., fast locomotion, grooming) map cleanly to individual clusters, while kinematically similar behaviors show more overlap.
Figure 2: Learned behavioral representations. (a) Behavioral clusters via K-Means (k=9), visualized with UMAP. (b) Genotype enrichment per cluster, showing differential distribution of WT, HET, and HOM animals. (c–g) Representative skeleton sequences: (c) idle, (d) grooming, (e) locomotion, (f) rearing, (g) head-turning.
#### 4.3.2 Genotype Signatures Emerge in Learned Behavioral Representations
The quantitative results in Section [4.2](https://arxiv.org/html/2605.24370#S4.SS2) establish that GEESE can classify both behavior and genotype. Here we examine how genotype information is organized within the learned representation, and how fine-tuning transforms the manifold from a behavior-only space into one that jointly encodes behavior and genotype.
Genotype Signatures in Behavior-trained Representations. We first asked whether the behavior-trained embedding space already contains genotype information, before any genotype-specific training. We examined how WT, HET, and HOM animals distribute across the behavioral clusters identified in Section [4.3.1](https://arxiv.org/html/2605.24370#S4.SS3.SSS1).
Genotype distributions vary substantially across clusters (Figure [2(b)](https://arxiv.org/html/2605.24370#S4.F2.sf2)), with HOM enriched in Clusters 0, 4, and 6. These clusters correspond to idle postures and repetitive grooming movements, consistent with the reduced exploration and increased stereotypy reported in CNTNAP2 knockout models [22](https://arxiv.org/html/2605.24370#bib.bib17). This indicates that genotype-associated behavioral patterns are already present in the embedding space after behavioral category training alone, suggesting that genotype and behavior information naturally coexist in the learned representation.
Fine-tuning Transforms the Manifold into a Genotype-aware Space. Although genotype information exists in the behavior-trained manifold, the model was not explicitly trained to detect it. We fine-tuned the encoder on genotype labels with a reduced learning rate, aiming to preserve learned behavioral structure while gaining sensitivity to genotype.
Before fine-tuning, predicted genotype distributions show poor correspondence with ground truth (Figure [3(a)](https://arxiv.org/html/2605.24370#S4.F3.sf1)). After fine-tuning, predicted distributions align more closely with ground truth: WT predictions expand toward the center, better matching the true distribution, HET spans the central area, and HOM forms a distinct cluster in the lower-right corner (Figure [3(b)](https://arxiv.org/html/2605.24370#S4.F3.sf2)). The particularly clear separation of HOM indicates that complete knockouts exhibit more distinctive behavioral signatures. Critically, this genotype separation emerges within the existing behavioral structure. The behavioral clusters remain intact while gaining genotype sensitivity. The resulting representations encode both behavior type and genotype dosage in a shared space, enabling the model to capture not only what an animal is doing but how its genetic background modulates that behavior.
To quantify this transformation, we computed mean squared error (MSE) between predicted and ground truth genotype enrichment per behavior class. After fine-tuning, MSE decreases from 0.465 to 0.287 (38% reduction), with particularly notable gains in Idle, Locomotion, and Fast Locomotion, where CNTNAP2 models have shown altered activity levels and repetitive patterns [22](https://arxiv.org/html/2605.24370#bib.bib17). The behavior-specific nature of this improvement confirms that the model captures genotype differences as they manifest within individual behavioral contexts, rather than learning a global genotype signal divorced from behavior.
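A per-behavior enrichment MSE of this kind could be computed as in the NumPy sketch below. The text does not define enrichment precisely, so this sketch assumes it is the proportion of each genotype among the windows of a given behavior class; the function names and toy labels are illustrative, and the authors' definition (and therefore the scale of their MSE values) may differ. The last line only reproduces the arithmetic behind the reported 38% reduction.

```python
import numpy as np

def enrichment(genotypes, behaviors, n_genotypes, n_behaviors):
    """Per-behavior genotype proportions: row b is the genotype mix within behavior b."""
    counts = np.zeros((n_behaviors, n_genotypes))
    for g, b in zip(genotypes, behaviors):
        counts[b, g] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # Behaviors with no windows keep an all-zero row instead of dividing by 0.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

def enrichment_mse(true_geno, pred_geno, behaviors, n_genotypes, n_behaviors):
    """MSE per behavior class between true and predicted enrichment profiles."""
    e_true = enrichment(true_geno, behaviors, n_genotypes, n_behaviors)
    e_pred = enrichment(pred_geno, behaviors, n_genotypes, n_behaviors)
    return ((e_true - e_pred) ** 2).mean(axis=1)

# Toy data: 8 windows, 2 behavior classes, 3 genotypes (0=WT, 1=HET, 2=HOM).
behaviors = np.array([0, 0, 0, 0, 1, 1, 1, 1])
true_geno = np.array([0, 0, 1, 2, 0, 1, 1, 2])
pred_geno = np.array([0, 1, 1, 2, 0, 1, 1, 2])
per_behavior = enrichment_mse(true_geno, pred_geno, behaviors, 3, 2)

# Relative reduction implied by the reported MSE values (0.465 -> 0.287).
reduction = (0.465 - 0.287) / 0.465
print(per_behavior, f"{reduction:.0%}")
```

Computing the error separately for each behavior class is what allows the gains to be attributed to specific behaviors such as Idle or Locomotion rather than to a single global score.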
Figure 3: Genotype-behavior association learning. (a) Before genotype fine-tuning; (b) after genotype fine-tuning. Top row: ground truth genotype distributions; bottom row: predicted distributions. Columns show WT (blue), HET (red), and HOM (green). After fine-tuning, predicted distributions converge toward ground truth, with HOM occupying a distinct region.
### 4\.4 An Interactive and Intelligent Toolbox: from Manifold to Real\-time Phenotyping
To make these methods accessible to users without programming expertise, we developed HONK \(Hands\-On Natural\-language Knowledgebase\), an interactive analysis agent built on the GEESE pipeline for behavioral phenotyping from pose dynamics \(Figure [4](https://arxiv.org/html/2605.24370#S4.F4)\)\.
Figure 4: Concept diagram of HONK, the interactive analysis agent built on the GEESE pipeline\. Users pose natural language queries that are routed to the pipeline, which encodes 3D pose data into a behavioral manifold for behavior classification, genotype prediction, and interactive visualization\. The system provides real\-time analysis without requiring programming expertise or local computational resources\.
Functionality and Accessibility\. Given a raw pose sequence, HONK processes it through the all\-cohort model and returns predicted behavior distributions, genotype probabilities \(including 7\-class identification\), temporal behavior sequences, and manifold visualizations\. Users interact through natural language queries \(e\.g\., "What is the predicted genotype?"\)\. The platform prioritizes interpretability: users can trace how individual windows map onto the manifold and export results for downstream analysis\. HONK requires no local installation or computational resources\. A web\-based deployment of HONK will be available soon\.
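To make the listed outputs concrete, the sketch below shows one way per\-window predictions could be aggregated into a behavior distribution, a temporal sequence of behavior bouts, and session\-level genotype probabilities. This is not HONK's interface: the function `summarize_windows`, its input layout, and the toy values are assumptions made purely for illustration, and the averaging rule for genotype probabilities is one simple choice among several.

```python
from collections import Counter
from itertools import groupby


def summarize_windows(behaviors, genotype_probs):
    """Aggregate per-window predictions into session-level summaries.

    behaviors: predicted behavior label per window, in temporal order.
    genotype_probs: one dict per window mapping genotype -> probability.
    Both inputs are hypothetical stand-ins for model outputs.
    """
    n = len(behaviors)
    counts = Counter(behaviors)
    behavior_distribution = {label: c / n for label, c in counts.items()}

    # Collapse consecutive identical labels into (label, run_length) bouts.
    bouts = [(label, len(list(run))) for label, run in groupby(behaviors)]

    # Average window-level genotype probabilities over the session.
    genotypes = list(genotype_probs[0])
    mean_probs = {g: sum(p[g] for p in genotype_probs) / n for g in genotypes}
    predicted_genotype = max(mean_probs, key=mean_probs.get)

    return {
        "behavior_distribution": behavior_distribution,
        "bouts": bouts,
        "genotype_probabilities": mean_probs,
        "predicted_genotype": predicted_genotype,
    }


# Toy per-window predictions (hypothetical values).
behaviors = ["Idle", "Idle", "Locomotion", "Idle"]
genotype_probs = [
    {"WT": 0.6, "HET": 0.3, "HOM": 0.1},
    {"WT": 0.5, "HET": 0.3, "HOM": 0.2},
    {"WT": 0.2, "HET": 0.5, "HOM": 0.3},
    {"WT": 0.7, "HET": 0.2, "HOM": 0.1},
]

summary = summarize_windows(behaviors, genotype_probs)
print(summary["behavior_distribution"])
print(summary["bouts"])
print(summary["predicted_genotype"])
```

A summary of this shape would be straightforward to export for downstream analysis, and each bout can be traced back to the windows it came from.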
## 5 Conclusion
We present GEESE, an end\-to\-end framework that learns behavioral representations directly from 3D pose dynamics using a pretrained foundation model, eliminating hand\-crafted feature engineering entirely\. Across three autism\-associated genetic models, GEESE surpasses the leading hand\-crafted baseline in both behavior classification and genotype prediction\. A single model trained on all cohorts identifies genetic background and genotype from movement patterns alone, enabling unified phenotyping without prior knowledge of the experimental cohort\. The accompanying tool HONK makes this pipeline accessible to researchers through natural language interaction, providing a step toward automated, reproducible behavioral assessment in genetic disease models\.
Limitations and Future Directions\. The current framework operates on single\-animal 3D pose data, providing a controlled setting for isolating genotype\-behavior associations; extending to multi\-animal social interactions and integrating additional modalities such as facial features are natural next steps\. The reliance on 3D pose estimation is mitigated by the architecture’s agnosticism to spatial dimensionality: substituting 2D pose inputs from tools such as DeepLabCut [15](https://arxiv.org/html/2605.24370#bib.bib6) or SLEAP [18](https://arxiv.org/html/2605.24370#bib.bib7) requires only a change in input channel dimension\. Incorporating explicit spatial relationships between joints into the encoder is a promising direction for capturing coordinated movement patterns that may carry additional phenotypic information\. Validation on independent datasets from other laboratories would further establish the generalizability of the learned representations\.
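The point about spatial dimensionality can be sketched as follows, assuming each joint coordinate is treated as one input channel so that a window of T frames, J joints, and D coordinates becomes T rows of J × D channels. The helper names and the toy window are hypothetical, and dropping the third coordinate is only a stand\-in for real 2D pose estimator output, not a camera projection.

```python
def flatten_pose(window):
    """Flatten a pose window of shape (T, J, D) into (T, J * D) channels."""
    return [[coord for joint in frame for coord in joint] for frame in window]


def drop_depth(window):
    """Keep the first two coordinates per joint (stand-in for 2D pose input)."""
    return [[joint[:2] for joint in frame] for frame in window]


# Hypothetical toy window: T = 2 frames, J = 3 joints, D = 3 coordinates.
window_3d = [
    [(0.0, 0.1, 0.2), (1.0, 1.1, 1.2), (2.0, 2.1, 2.2)],
    [(0.5, 0.6, 0.7), (1.5, 1.6, 1.7), (2.5, 2.6, 2.7)],
]

channels_3d = flatten_pose(window_3d)              # 3 joints * 3 coords
channels_2d = flatten_pose(drop_depth(window_3d))  # 3 joints * 2 coords

# Only the channel count changes; the number of time steps is untouched.
print(len(channels_3d), len(channels_3d[0]))  # 2 9
print(len(channels_2d), len(channels_2d[0]))  # 2 6
```

Under this assumption, moving from 3D to 2D input changes only the width of each time step, which is why no other part of the pipeline would need to be redesigned.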
## References
- G\. P\. Bates, R\. Dorsey, J\. F\. Gusella, et al\. \(2015\) Huntington disease\. Nature reviews Disease primers 1\(1\), pp\. 1–21\. Cited by: [§1](https://arxiv.org/html/2605.24370#S1.p1.1.2.1)\.
- A\. L\. Bey and Y\. Jiang \(2014\) Overview of mouse models of autism spectrum disorders\. Current protocols in pharmacology 66\(1\), pp\. 5–66\. Cited by: [§1](https://arxiv.org/html/2605.24370#S1.p2.1.1.1)\.
- Y\. Chen, S\. Zhang, C\. Yue, P\. Xiang, J\. Li, Z\. Wei, L\. Xu, and Y\. Zeng \(2022\) Early environmental enrichment for autism spectrum disorder fmr1 mice models has positive behavioral and molecular effects\. Experimental Neurology 352, pp\. 114033\. Cited by: [§4\.2\.1](https://arxiv.org/html/2605.24370#S4.SS2.SSS1.p6.1.1.1)\.
- K\. R\. Cording, E\. M\. Tu, H\. Wang, A\. H\. Agopyan\-Miu, and H\. S\. Bateup \(2025\) Cntnap2 loss drives striatal neuron hyperexcitability and behavioral inflexibility\. bioRxiv, pp\. 2024–05\. Cited by: [§4\.2\.1](https://arxiv.org/html/2605.24370#S4.SS2.SSS1.p5.1.2.1)\.
- T\. W\. Dunn, J\. D\. Marshall, K\. S\. Severson, et al\. \(2021\) Geometric deep learning enables 3d kinematic profiling across species and environments\. Nature methods 18\(5\), pp\. 564–573\. Cited by: [§1](https://arxiv.org/html/2605.24370#S1.p2.1.2.1)\.
- M\. Goswami, K\. Szafer, A\. Choudhry, Y\. Cai, S\. Li, and A\. Dubrawski \(2024\) Moment: a family of open time\-series foundation models\. arXiv preprint arXiv:2402\.03885\. Cited by: [§3\.1](https://arxiv.org/html/2605.24370#S3.SS1.p1.1.1.1)\.
- F\. Guo, R\. Guan, Y\. Li, et al\. \(2025\) Foundation models in bioinformatics\. National science review 12\(4\), pp\. nwaf028\. Cited by: [§1](https://arxiv.org/html/2605.24370#S1.p3.1.1.1)\.
- Y\. Han, K\. Chen, Y\. Wang, et al\. \(2024\) Multi\-animal 3d social pose estimation, identification and behaviour embedding with a few\-shot learning framework\. Nature machine intelligence 6\(1\), pp\. 48–61\. Cited by: [§2](https://arxiv.org/html/2605.24370#S2.p1.1.5.1), [§2](https://arxiv.org/html/2605.24370#S2.p2.1.6.1)\.
- J\. A\. Hartigan and M\. A\. Wong \(1979\) Algorithm as 136: a k\-means clustering algorithm\. Journal of the royal statistical society\. series c \(applied statistics\) 28\(1\), pp\. 100–108\. Cited by: [§4\.3\.1](https://arxiv.org/html/2605.24370#S4.SS3.SSS1.p1.1.1.1)\.
- T\. Hirota and B\. H\. King \(2023\) Autism spectrum disorder: a review\. Jama 329\(2\), pp\. 157–168\. Cited by: [§1](https://arxiv.org/html/2605.24370#S1.p1.1.3.1)\.
- J\. Jankovic \(2008\) Parkinson’s disease: clinical features and diagnosis\. Journal of neurology, neurosurgery & psychiatry 79\(4\), pp\. 368–376\. Cited by: [§1](https://arxiv.org/html/2605.24370#S1.p1.1.1.1)\.
- W\. Jia, M\. Sun, J\. Lian, and S\. Hou \(2022\) Feature dimensionality reduction: a review\. Complex & Intelligent Systems 8\(3\), pp\. 2663–2693\. Cited by: [§2](https://arxiv.org/html/2605.24370#S2.p1.1.3.1)\.
- M\. J\. Kas, J\. C\. Glennon, J\. Buitelaar, et al\. \(2014\) Assessing behavioural and cognitive domains of autism spectrum disorders in rodents: current status and future perspectives\. Psychopharmacology 231\(6\), pp\. 1125–1146\. Cited by: [§1](https://arxiv.org/html/2605.24370#S1.p2.1.1.1)\.
- U\. Klibaite, T\. Li, D\. Aldarondo, J\. F\. Akoad, B\. P\. Ölveczky, and T\. W\. Dunn \(2025\) Mapping the landscape of social behavior\. Cell 188\(8\), pp\. 2249–2266\. Cited by: [§2](https://arxiv.org/html/2605.24370#S2.p1.1.4.1), [§2](https://arxiv.org/html/2605.24370#S2.p2.1.3.1), [§3\.3](https://arxiv.org/html/2605.24370#S3.SS3.p2.7.2.1), [§4\.1](https://arxiv.org/html/2605.24370#S4.SS1.p1.5.2.1), [§4\.2\.2](https://arxiv.org/html/2605.24370#S4.SS2.SSS2.p6.1.1.1), [Table 1](https://arxiv.org/html/2605.24370#S4.T1.3.1.1.1.2.1)\.
- A\. Mathis, P\. Mamidanna, K\. M\. Cury, et al\. \(2018\) DeepLabCut: markerless pose estimation of user\-defined body parts with deep learning\. Nature neuroscience 21\(9\), pp\. 1281–1289\. Cited by: [§1](https://arxiv.org/html/2605.24370#S1.p2.1.2.1), [§5](https://arxiv.org/html/2605.24370#S5.p2.1.2.1)\.
- L\. McInnes, J\. Healy, and J\. Melville \(2018\) Umap: uniform manifold approximation and projection for dimension reduction\. arXiv preprint arXiv:1802\.03426\. Cited by: [§4\.3\.1](https://arxiv.org/html/2605.24370#S4.SS3.SSS1.p2.1.1.1)\.
- T\. D\. Pereira, D\. E\. Aldarondo, L\. Willmore, et al\. \(2019\) Fast animal pose estimation using deep neural networks\. Nature methods 16\(1\), pp\. 117–125\. Cited by: [§1](https://arxiv.org/html/2605.24370#S1.p2.1.2.1)\.
- T\. D\. Pereira, N\. Tabris, A\. Matsliah, et al\. \(2022\) SLEAP: a deep learning system for multi\-animal pose tracking\. Nature methods 19\(4\), pp\. 486–495\. Cited by: [§1](https://arxiv.org/html/2605.24370#S1.p2.1.2.1), [§5](https://arxiv.org/html/2605.24370#S5.p2.1.3.1)\.
- P\. J\. Rousseeuw \(1987\) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis\. Journal of computational and applied mathematics 20, pp\. 53–65\. Cited by: [§4\.1](https://arxiv.org/html/2605.24370#S4.SS1.p2.1.2.1)\.
- S\. Schneider, J\. H\. Lee, and M\. W\. Mathis \(2023\) Learnable latent embeddings for joint behavioural and neural analysis\. Nature 617\(7960\), pp\. 360–368\. Cited by: [§2](https://arxiv.org/html/2605.24370#S2.p2.1.2.1), [§2](https://arxiv.org/html/2605.24370#S2.p2.1.4.1), [Table 1](https://arxiv.org/html/2605.24370#S4.T1.4.2.2.1.2.1)\.
- A\. Strehl and J\. Ghosh \(2002\) Cluster ensembles—a knowledge reuse framework for combining multiple partitions\. Journal of machine learning research 3\(Dec\), pp\. 583–617\. Cited by: [§4\.1](https://arxiv.org/html/2605.24370#S4.SS1.p2.1.3.1)\.
- E\. V\. Valeeva, I\. S\. Sabirov, L\. R\. Safiullina, et al\. \(2024\) The role of the cntnap2 gene in the development of autism spectrum disorder\. Research in Autism Spectrum Disorders 114, pp\. 102409\. Cited by: [§4\.3\.2](https://arxiv.org/html/2605.24370#S4.SS3.SSS2.p3.1.1.1), [§4\.3\.2](https://arxiv.org/html/2605.24370#S4.SS3.SSS2.p6.1.1.1)\.
- C\. Weinreb, J\. E\. Pearl, S\. Lin, et al\. \(2024\) Keypoint\-moseq: parsing behavior by linking point tracking to pose dynamics\. Nature Methods 21\(7\), pp\. 1329–1339\. Cited by: [§2](https://arxiv.org/html/2605.24370#S2.p1.1.6.1), [§2](https://arxiv.org/html/2605.24370#S2.p2.1.5.1)\.
- S\. Wold, K\. Esbensen, and P\. Geladi \(1987\) Principal component analysis\. Chemometrics and intelligent laboratory systems 2\(1\-3\), pp\. 37–52\. Cited by: [§2](https://arxiv.org/html/2605.24370#S2.p1.1.2.1)\.
- G\. Woo, C\. Liu, D\. Sahoo, A\. Kumar, and S\. Hoi \(2022\) Etsformer: exponential smoothing transformers for time\-series forecasting\. arXiv preprint arXiv:2202\.01381\. Cited by: [§4\.2\.1](https://arxiv.org/html/2605.24370#S4.SS2.SSS1.p7.1.1.1), [Table 4](https://arxiv.org/html/2605.24370#S4.T4.4.4.4.1.1.1), [Table 4](https://arxiv.org/html/2605.24370#S4.T4.4.7.7.1.1.1)\.
- A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu \(2023\) Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, Vol\. 37, pp\. 11121–11128\. Cited by: [§4\.2\.1](https://arxiv.org/html/2605.24370#S4.SS2.SSS1.p7.1.2.1), [Table 4](https://arxiv.org/html/2605.24370#S4.T4.4.5.5.1.1.1)\.
- D\. Zhang \(2019\) Wavelet transform\. In Fundamentals of image data mining: Analysis, Features, Classification and Retrieval, pp\. 35–44\. Cited by: [§2](https://arxiv.org/html/2605.24370#S2.p1.1.3.1)\.