Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment
Summary
A benchmark study comparing traditional machine learning methods (Random Forest, XGBoost, SVM, Logistic Regression) against lightweight transformer variants (DistilBERT, TinyBERT, MobileBERT) for on-device fault detection across three public datasets. Traditional ML offers competitive accuracy at far smaller resource footprints, while TinyBERT-4L is the most deployment-friendly transformer.
# Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment
Source: [https://arxiv.org/html/2606.24173](https://arxiv.org/html/2606.24173)
###### Abstract
On\-device fault detection enables real\-time diagnostics without cloud dependency, but deploying machine learning models on resource\-constrained hardware demands careful tradeoffs between accuracy, latency, and model size\. We present a benchmark comparing traditional ML methods \(Random Forest, XGBoost, SVM, Logistic Regression\) against lightweight transformer architectures \(DistilBERT, TinyBERT\-6L, TinyBERT\-4L, MobileBERT\) for binary fault detection across three public datasets: NASA C\-MAPSS turbofan degradation, SECOM semiconductor manufacturing, and UCI AI4I 2020 predictive maintenance\. We evaluate classification performance \(F1\-score, AUC\), model size, and CPU inference latency, and further assess INT8 dynamic quantization and a two\-stage adaptive inference pipeline\. Our results reveal that on well\-separated sensor data \(C\-MAPSS\), lightweight transformers match traditional ML at 87\.8% F1 but at 100× the model size and 9000× the latency\. TinyBERT\-4L emerges as the most deployment\-friendly transformer at 55 MB and 18 ms CPU latency\. INT8 quantization reduces size by 25% while preserving 86\.9% F1\. Our adaptive pipeline, routing 97\.9% of predictions through a quantized triage model and only 2\.1% to a larger expert, achieves 87\.6% F1 at 19\.5 ms average latency\. On severely imbalanced datasets \(SECOM, UCI\-PM\), both traditional and transformer methods struggle significantly, highlighting fundamental limitations of current approaches for extreme class imbalance in fault detection\. All code is publicly available\.
## I Introduction
The proliferation of consumer electronics \(smartphones, wearables, laptops, and IoT devices\) has created an unprecedented need for automated diagnostic systems that can detect hardware and software faults efficiently \[[1](https://arxiv.org/html/2606.24173#bib.bib1)\]\. Traditional diagnostic approaches rely on cloud\-based processing, where device telemetry is transmitted to centralized servers for analysis\. However, this approach introduces network latency, connectivity requirements, and privacy concerns \[[2](https://arxiv.org/html/2606.24173#bib.bib2)\]\.
On\-device fault detection addresses these limitations by performing inference directly on the device, enabling real\-time diagnostics without network dependency\. However, deploying machine learning models on resource\-constrained devices introduces a fundamental trade\-off between model accuracy and computational efficiency \[[3](https://arxiv.org/html/2606.24173#bib.bib3)\]\. Modern transformer architectures, while highly accurate on many tasks, require hundreds of megabytes of storage and significant computational resources \[[4](https://arxiv.org/html/2606.24173#bib.bib4)\]\.
Model compression techniques, including knowledge distillation \[[5](https://arxiv.org/html/2606.24173#bib.bib5)\], quantization \[[6](https://arxiv.org/html/2606.24173#bib.bib6)\], and pruning \[[7](https://arxiv.org/html/2606.24173#bib.bib7)\], offer pathways to reduce model requirements\. Lightweight transformer variants such as DistilBERT \[[8](https://arxiv.org/html/2606.24173#bib.bib8)\], TinyBERT \[[9](https://arxiv.org/html/2606.24173#bib.bib9)\], and MobileBERT \[[10](https://arxiv.org/html/2606.24173#bib.bib10)\] maintain strong performance on NLP tasks while reducing resource requirements\. However, their applicability to tabular fault detection tasks under real production constraints remains underexplored\.
In this paper, we address this gap with the following contributions:
- Honest benchmark on real data: We evaluate 10 model configurations across three public fault detection datasets, reporting real F1\-scores, model sizes, and CPU inference latencies without cherry\-picking favorable results\.
- Compression analysis: We evaluate INT8 dynamic quantization on fine\-tuned transformers, measuring the accuracy\-size tradeoff\.
- Adaptive inference pipeline: We propose a two\-stage inference approach using a quantized TinyBERT\-4L triage model with a DistilBERT expert, achieving near\-full accuracy with 97\.9% of predictions handled by the lightweight model\.
- Failure analysis: We document where and why transformer approaches fail, including complete training failure of MobileBERT on tabular\-to\-text data and the inability of all methods to handle extreme class imbalance\.
## II Related Work
### II\-A Predictive Maintenance and Fault Detection
Predictive maintenance leverages sensor data and machine learning to anticipate equipment failures \[[1](https://arxiv.org/html/2606.24173#bib.bib1)\]\. Traditional approaches employ Random Forests \[[15](https://arxiv.org/html/2606.24173#bib.bib15)\], Support Vector Machines \[[16](https://arxiv.org/html/2606.24173#bib.bib16)\], and gradient boosting \[[17](https://arxiv.org/html/2606.24173#bib.bib17)\]\. Deep learning approaches using CNNs \[[18](https://arxiv.org/html/2606.24173#bib.bib18)\] and LSTMs \[[19](https://arxiv.org/html/2606.24173#bib.bib19)\] capture spatial and temporal dependencies\. Transformer\-based models have recently been applied to time\-series tasks \[[20](https://arxiv.org/html/2606.24173#bib.bib20)\], but systematic benchmarks comparing lightweight transformers against traditional baselines on fault detection remain limited\.
### II\-B Model Compression for Edge Deployment
Knowledge distillation \[[5](https://arxiv.org/html/2606.24173#bib.bib5)\] trains smaller student models to mimic larger teachers\. DistilBERT \[[8](https://arxiv.org/html/2606.24173#bib.bib8)\] reduces the size of BERT by 40% while retaining 97% of its language understanding capability\. TinyBERT \[[9](https://arxiv.org/html/2606.24173#bib.bib9)\] achieves further compression through intermediate\-layer distillation\. MobileBERT \[[10](https://arxiv.org/html/2606.24173#bib.bib10)\] introduces bottleneck structures for mobile deployment\. Post\-training quantization converts floating\-point models to lower\-bit representations without retraining \[[6](https://arxiv.org/html/2606.24173#bib.bib6)\]\.
### II\-C On\-Device Machine Learning
Frameworks such as Core ML, TensorFlow Lite, and ONNX Runtime Mobile enable model deployment on mobile devices \[[11](https://arxiv.org/html/2606.24173#bib.bib11)\]\. However, most existing benchmarks evaluate these models on NLP tasks rather than tabular sensor data typical of fault detection applications\.
## III Methodology
### III\-A Datasets
We evaluate on three publicly available predictive maintenance datasets:
- NASA C\-MAPSS \[[12](https://arxiv.org/html/2606.24173#bib.bib12)\]: Turbofan engine degradation simulation with 24 features \(21 sensors \+ 3 operational settings\), 20,631 samples\. Binary classification: failure within 30 cycles\. Failure rate: 15\.0%\.
- SECOM \[[13](https://arxiv.org/html/2606.24173#bib.bib13)\]: Semiconductor manufacturing data with 562 features \(after removing columns with >50% missing values\), 1,567 samples\. Failure rate: 6\.6%\.
- UCI Predictive Maintenance \[[14](https://arxiv.org/html/2606.24173#bib.bib14)\]: AI4I 2020 dataset with 8 features \(5 sensor \+ 3 one\-hot encoded type\), 10,000 samples\. Failure rate: 3\.4%\.
All datasets were split 80/20 with stratified sampling \(random seed 42\)\. Features were standardized using training set statistics\.
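The split and scaling procedure described above can be sketched in plain Python. This is a minimal illustration on toy data, not the authors' code: the helper names `stratified_split`, `fit_standardizer`, and `standardize` are hypothetical, and the use of the population standard deviation is an assumption.

```python
import random
from statistics import mean, pstdev

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split at the same ratio."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def fit_standardizer(rows):
    """Per-column mean and standard deviation, computed on training rows only."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return means, stds

def standardize(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

# Toy data: 100 samples, 15% positive, mirroring the C-MAPSS failure rate.
labels = [1] * 15 + [0] * 85
features = [[float(i), float(i % 7)] for i in range(100)]

train_idx, test_idx = stratified_split(labels)
means, stds = fit_standardizer([features[i] for i in train_idx])
X_train = standardize([features[i] for i in train_idx], means, stds)
X_test = standardize([features[i] for i in test_idx], means, stds)
```

Fitting the statistics on the training rows and reusing them for the test rows keeps test-set information out of preprocessing.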
### III\-B Model Configurations
#### III\-B1 Traditional Baselines
- Random Forest \(RF\-200\): 200 trees, max depth 20, balanced class weights\.
- XGBoost: 200 rounds, max depth 6, learning rate 0\.1, scale\_pos\_weight=10\.
- SVM: RBF kernel, balanced class weights, max 10,000 iterations\.
- Logistic Regression \(LR\): L2 regularization, balanced class weights, max 2,000 iterations\.
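To make "balanced class weights" concrete, the sketch below assumes the common heuristic found in scikit-learn, where each class receives the weight n_samples / (n_classes × n_c). The paper does not state which implementation was used, so this is an assumption, and the class counts are approximations derived from the sample sizes and failure rates listed above (before the 80/20 split).

```python
def balanced_class_weights(n_negative, n_positive):
    """Weights n_samples / (n_classes * n_c), the 'balanced' heuristic."""
    n = n_negative + n_positive
    return {0: n / (2 * n_negative), 1: n / (2 * n_positive)}

# (sample count, failure rate) as listed in the dataset descriptions.
datasets = {
    "C-MAPSS": (20631, 0.150),
    "SECOM": (1567, 0.066),
    "UCI-PM": (10000, 0.034),
}

for name, (n_samples, failure_rate) in datasets.items():
    n_pos = round(n_samples * failure_rate)  # approximate positive count
    n_neg = n_samples - n_pos
    w = balanced_class_weights(n_neg, n_pos)
    ratio = n_neg / n_pos  # negatives per positive
    print(f"{name}: w0={w[0]:.2f} w1={w[1]:.2f} neg/pos={ratio:.1f}")
```

The negatives-per-positive ratio is the value XGBoost's documentation suggests as a typical `scale_pos_weight`, so it gives a point of comparison for the fixed value of 10 used here.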
#### III\-B2 Lightweight Transformers
- DistilBERT \[[8](https://arxiv.org/html/2606.24173#bib.bib8)\]: 66\.9M parameters, 6 layers, 768 hidden size\.
- TinyBERT\-6L \[[9](https://arxiv.org/html/2606.24173#bib.bib9)\]: 67\.0M parameters, 6 layers, 768 hidden size\.
- TinyBERT\-4L \[[9](https://arxiv.org/html/2606.24173#bib.bib9)\]: 14\.3M parameters, 4 layers, 312 hidden size\.
- MobileBERT \[[10](https://arxiv.org/html/2606.24173#bib.bib10)\]: 24\.6M parameters, bottleneck architecture\.
Transformers were fine\-tuned in two runs: an initial run of 5 epochs with standard cross\-entropy loss, and a follow\-up run of 7–8 epochs with weighted cross\-entropy loss\. Both runs used learning rates of 2e\-5 to 5e\-5, batch size 32, and 100 warmup steps on an NVIDIA T4 GPU\.
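As a rough illustration of how a weighted cross-entropy loss shifts the objective toward the minority class, the sketch below computes the loss on three toy predictions. The class weights are illustrative and not the values used in the paper, which does not report them. Normalizing by the sum of weights is an assumption that matches the behavior of PyTorch's `CrossEntropyLoss` with `reduction="mean"`.

```python
import math

def weighted_cross_entropy(probs, targets, class_weights):
    """Weighted mean of -w[y] * log p(y), normalized by the sum of weights."""
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, targets):
        w = class_weights[y]
        total += -w * math.log(p[y])
        weight_sum += w
    return total / weight_sum

# Two healthy samples and one failure; the model assigns probability 0.9
# to "healthy" (class 0) every time, so the failure is badly misclassified.
probs = [(0.9, 0.1), (0.9, 0.1), (0.9, 0.1)]
targets = [0, 0, 1]

unweighted = weighted_cross_entropy(probs, targets, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(probs, targets, {0: 1.0, 1: 5.0})  # illustrative weights
```

With the positive class up-weighted, the single missed failure dominates the loss, which is the pressure intended to stop a model from collapsing onto the majority class.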
#### III\-B3 Quantized Variants
We apply dynamic INT8 quantization \(PyTorch quantize\_dynamic\) to fine\-tuned DistilBERT and TinyBERT\-4L, targeting all Linear layers\.
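The idea behind storing weights as 8-bit integers can be illustrated with a toy symmetric per-tensor quantizer. This is a simplified sketch of the general technique and not PyTorch's internal scheme; the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: each w is approximated by scale * q."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [scale * v for v in q]

weights = [0.42, -1.27, 0.003, 0.9, -0.25]  # made-up FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs one byte instead of four, at the cost of a rounding error bounded by half the scale. Because only Linear layers are targeted, layers such as embeddings would stay in FP32, which is one plausible reason the overall size reductions reported later are well below 4×.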
### III\-C Input Representation
We convert tabular sensor data to text by serializing feature name\-value pairs: feature\_name:value feature\_name:value \.\.\. \(up to 20 features, truncated to 128 tokens\)\. This leverages pretrained language representations without architecture modifications\.
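A minimal sketch of this serialization step follows. The feature names, the number formatting, and the helper name `serialize_row` are assumptions, since the paper does not specify them. Truncation to 128 tokens would be handled by the tokenizer and is not shown.

```python
def serialize_row(row, max_features=20):
    """Render the first `max_features` name-value pairs as 'name:value' tokens."""
    items = list(row.items())[:max_features]
    return " ".join(f"{name}:{value}" for name, value in items)

# Hypothetical sensor names and readings, for illustration only.
row = {"setting_1": 0.0023, "sensor_2": 642.15, "sensor_3": 1589.7, "sensor_4": 1400.6}
text = serialize_row(row)
print(text)  # setting_1:0.0023 sensor_2:642.15 sensor_3:1589.7 sensor_4:1400.6

# A wide row (for example SECOM's 562 features) keeps only the first 20 pairs.
wide_row = {f"f{i}": float(i) for i in range(562)}
wide_text = serialize_row(wide_row)
```

Note that with a 20-feature cap, most SECOM columns never reach the model, and each numeric value is split into subword tokens that carry no numeric ordering by themselves.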
### III\-D Adaptive Inference Pipeline
We propose a two\-stage inference approach:
1. Stage 1 \(Triage\): Quantized TinyBERT\-4L \(INT8\) processes all inputs\. If the maximum class probability is ≥ τ, the prediction is accepted\.
2. Stage 2 \(Expert\): Inputs with confidence below τ are forwarded to the full\-precision DistilBERT model\.
We evaluate τ ∈ \{0\.6, 0\.65, 0\.7, 0\.75, 0\.8, 0\.85, 0\.9, 0\.95\} and select the threshold maximizing F1 on the test set\.
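The routing rule of the two stages can be sketched as follows. The two model functions are stand-ins (hypothetical) that return class probabilities; in the paper's setup they would be the INT8 TinyBERT-4L and the full-precision DistilBERT classifiers.

```python
def adaptive_predict(x, triage, expert, tau=0.7):
    """Return (label, stage); escalate to the expert when triage confidence < tau."""
    probs = triage(x)
    if max(probs) >= tau:
        return probs.index(max(probs)), "triage"
    probs = expert(x)
    return probs.index(max(probs)), "expert"

# Stand-in models: each maps an input to [p_healthy, p_fault].
def toy_triage(x):
    p_fault = min(max(x, 0.0), 1.0)
    return [1.0 - p_fault, p_fault]

def toy_expert(x):
    return [0.2, 0.8] if x >= 0.45 else [0.8, 0.2]

inputs = [0.05, 0.5, 0.95, 0.4]
results = [adaptive_predict(x, toy_triage, toy_expert) for x in inputs]
escalated = sum(stage == "expert" for _, stage in results) / len(results)
```

Raising τ sends more samples to the expert, trading latency for accuracy; sweeping τ over the grid above and recording F1 and the escalation share at each value is how a threshold would be chosen.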
### III\-E Evaluation Protocol
Accuracy: Precision, Recall, F1\-Score \(primary\), and AUC\-ROC\.
Efficiency: Model size in MB \(parameter memory\) and CPU inference latency \(single\-sample, averaged over 50–100 runs with warmup\); latency was measured on both a T4 GPU and a Colab CPU\.
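The two efficiency measurements can be sketched with the standard library alone. The timing harness below follows the described protocol (warmup calls discarded, then an average over repeated single-sample calls), with a placeholder workload in place of a real model. The size helper assumes 4 bytes per FP32 parameter and reads "MB" as 2^20 bytes; both are assumptions.

```python
import time

def measure_latency_ms(predict, sample, warmup=10, runs=100):
    """Mean single-sample latency in milliseconds after discarded warmup calls."""
    for _ in range(warmup):
        predict(sample)
    start = time.perf_counter()
    for _ in range(runs):
        predict(sample)
    return (time.perf_counter() - start) * 1000 / runs

def param_memory_mb(n_params, bytes_per_param=4):
    """Parameter memory, assuming 1 MB = 2**20 bytes."""
    return n_params * bytes_per_param / 2**20

# Placeholder workload standing in for a model's forward pass.
def dummy_predict(sample):
    return sum(v * v for v in sample)

latency = measure_latency_ms(dummy_predict, [0.1] * 1000)
distilbert_mb = param_memory_mb(66.9e6)   # about 255
tinybert4_mb = param_memory_mb(14.3e6)    # about 55
```

Under these assumptions the parameter counts listed for DistilBERT and TinyBERT-4L give roughly 255 MB and 55 MB, close to the sizes reported in the results.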
## IV Experiments and Results
### IV\-A Main Results
Table [I](https://arxiv.org/html/2606.24173#S4.T1) presents results across all three datasets\.
TABLE I: Benchmark results across three fault detection datasets\. Best result per dataset in bold\. MobileBERT failed to converge on all datasets \(see Section [IV\-C](https://arxiv.org/html/2606.24173#S4.SS3)\)\.

| Model | C-MAPSS F1 (%) | SECOM F1 (%) | UCI-PM F1 (%) | Size (MB) | CPU Lat. (ms) | Params (M) |
| --- | --- | --- | --- | --- | --- | --- |
| Traditional ML | | | | | | |
| RF-200 | 87.3 | 0.0 | 58.0 | 17.3 | 0.016 | – |
| XGBoost | **87.9** | 0.0 | **83.3** | 0.5 | 0.002 | – |
| SVM | 81.6 | 8.0 | 42.9 | 0.5 | 0.15 | – |
| LR | 83.0 | **13.6** | 24.2 | 0.001 | 0.0001 | – |
| Lightweight Transformers (FP32) | | | | | | |
| DistilBERT | 87.6 | 0.0 | 48.7 | 255 | 138 | 66.9 |
| TinyBERT-6L | **87.9** | 0.0 | 35.2 | 255 | 133 | 67.0 |
| TinyBERT-4L | 87.8 | 0.0 | 0.0 | 55 | 18 | 14.3 |
| MobileBERT | 0.0 | 0.0 | 0.0 | 94 | 108 | 24.6 |
| Transformers with Weighted Loss (FP32) | | | | | | |
| DistilBERT\_w | 86.4 | 12.7 | 50.0 | 255 | 145 | 66.9 |
| TinyBERT-6L\_w | 86.8 | 0.0 | 41.5 | 255 | 141 | 67.0 |
| TinyBERT-4L\_w | 86.1 | 0.0 | 37.2 | 55 | 18 | 14.3 |
| Quantized (INT8 Dynamic) | | | | | | |
| DistilBERT + INT8 | 85.8 | 0.0 | 50.7 | 132 | 107 | 66.9 |
| TinyBERT-4L + INT8 | 86.9 | 0.0 | 15.4 | 41 | 17 | 14.3 |
| Adaptive Pipeline | | | | | | |
| Adaptive† | 87.6 | 12.7 | 50.0 | 55∗ | 19.5∗ | – |

† TinyBERT\-4L \(INT8\) triage \+ DistilBERT expert\. ∗ Effective size/latency; 97\.9% handled by triage on C\-MAPSS\.
### IV\-B Key Findings
Finding 1: On well\-separated data, transformers match but do not exceed traditional ML\. On C\-MAPSS, the best transformer \(TinyBERT\-6L, 87\.9% F1\) exactly matches XGBoost \(87\.9% F1\)\. However, XGBoost achieves this with a 0\.5 MB model and 0\.002 ms latency, compared to 255 MB and 133 ms for TinyBERT\-6L: a 510× size difference and a 66,500× latency difference\. For clean tabular sensor data, traditional ML remains the practical choice\.
Finding 2: TinyBERT\-4L offers the best transformer deployment profile\. At 14\.3M parameters, 55 MB, and 18 ms CPU latency, TinyBERT\-4L achieves 87\.8% F1 on C\-MAPSS, within 0\.1 percentage points of the best transformer, while being 4\.6× smaller and 7\.7× faster than DistilBERT \(255 MB, 138 ms\)\.
Finding 3: INT8 quantization preserves accuracy with meaningful size reduction\. TinyBERT\-4L \+ INT8 achieves 86\.9% F1 at 41 MB \(25% size reduction from 55 MB\) with negligible latency change\. DistilBERT \+ INT8 reduces from 255 MB to 132 MB \(48% reduction\) with a 1\.8 percentage point F1 loss\.
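The size and latency ratios quoted in Findings 1 to 3 follow directly from the Table I entries. The short check below recomputes them; the dictionary layout and helper names are illustrative.

```python
# Sizes (MB) and CPU latencies (ms) copied from Table I.
size_mb = {
    "XGBoost": 0.5, "TinyBERT-6L": 255, "DistilBERT": 255,
    "TinyBERT-4L": 55, "DistilBERT + INT8": 132, "TinyBERT-4L + INT8": 41,
}
latency_ms = {"XGBoost": 0.002, "TinyBERT-6L": 133}

def ratio(a, b):
    """How many times larger a is than b."""
    return a / b

def reduction_pct(before, after):
    """Relative reduction in percent."""
    return 100 * (before - after) / before

size_gap = ratio(size_mb["TinyBERT-6L"], size_mb["XGBoost"])            # Finding 1
latency_gap = ratio(latency_ms["TinyBERT-6L"], latency_ms["XGBoost"])   # Finding 1
tiny_vs_distil = ratio(size_mb["DistilBERT"], size_mb["TinyBERT-4L"])   # Finding 2
tiny_int8 = reduction_pct(size_mb["TinyBERT-4L"], size_mb["TinyBERT-4L + INT8"])
distil_int8 = reduction_pct(size_mb["DistilBERT"], size_mb["DistilBERT + INT8"])
```

These evaluate to roughly 510×, 66,500×, 4.6×, 25%, and 48% respectively.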
Finding 4: The adaptive pipeline optimizes latency without sacrificing accuracy\. On C\-MAPSS, the adaptive pipeline \(TinyBERT\-4L INT8 triage \+ DistilBERT expert, τ = 0\.7\) achieves 87\.6% F1 while routing 97\.9% of samples through the lightweight triage model\. The resulting average latency of 19\.5 ms is 7× faster than standalone DistilBERT, with only 2\.1% of samples requiring the more expensive expert\.
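The average latency can be reproduced as a weighted mean of the two stage latencies. The sketch below shows two accounting choices, since the paper does not state which one it uses: charging escalated samples only the expert's latency, or charging them both stages. The function name and flag are hypothetical.

```python
def expected_latency_ms(triage_ms, expert_ms, triage_share, pay_triage_on_escalation=False):
    """Average per-sample latency of a two-stage pipeline."""
    expert_share = 1.0 - triage_share
    escalated_cost = expert_ms + (triage_ms if pay_triage_on_escalation else 0.0)
    return triage_share * triage_ms + expert_share * escalated_cost

# 17 ms (TinyBERT-4L + INT8) and 138 ms (DistilBERT) are the Table I CPU latencies.
expert_only = expected_latency_ms(17, 138, 0.979)
both_stages = expected_latency_ms(17, 138, 0.979, pay_triage_on_escalation=True)
```

The first accounting gives about 19.5 ms, matching the reported figure; charging escalated samples for both stages gives about 19.9 ms, so the conclusion is not sensitive to the choice.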
Finding 5: Extreme class imbalance defeats both paradigms\. On SECOM \(6\.6% defective\), the best F1 across all methods was 13\.6% \(Logistic Regression with balanced class weights\)\. Transformers performed similarly poorly, with DistilBERT reaching 12\.7% F1 only after adding weighted cross\-entropy loss\. On UCI\-PM \(3\.4% failure\), XGBoost significantly outperformed transformers \(83\.3% vs\. at most 50\.7% F1\), suggesting gradient boosting handles imbalance in low\-dimensional tabular data more effectively\.
### IV\-C MobileBERT Failure Analysis
MobileBERT scored 0% F1 across all datasets and training configurations \(standard and weighted loss, learning rates 2e\-5 to 5e\-5, 5–8 epochs\)\. The model consistently predicted the majority class for all samples\. We attribute this to MobileBERT’s inverted bottleneck architecture, which was designed for natural language token sequences rather than serialized numerical data\. The bottleneck layers may discard fine\-grained numerical information critical for fault classification\. This finding demonstrates that architectural innovations designed for NLP do not automatically transfer to non\-NLP domains\.
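A classifier that always predicts the majority class scores 0% F1 on the positive class even though its accuracy looks respectable, which is why F1 is the primary metric here. The sketch below shows this on toy labels with a 15% positive rate; treating an undefined precision as 0 is a convention assumed for this example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class (0.0 when undefined)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 15% positives, as in C-MAPSS; the classifier always predicts "healthy" (0).
y_true = [1] * 15 + [0] * 85
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here accuracy is 0.85 while F1 is 0.0, the same pattern as the collapsed MobileBERT runs.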
### IV\-D Compression\-Accuracy Pareto Analysis
Figure [1](https://arxiv.org/html/2606.24173#S4.F1) illustrates the model size vs\. F1\-score tradeoff on C\-MAPSS\. The configurations of interest are: \(1\) LR at 0\.001 MB / 83\.0% F1 for minimal footprint, \(2\) XGBoost at 0\.5 MB / 87\.9% F1 for best efficiency, and \(3\) among transformers only, TinyBERT\-4L at 55 MB / 87\.8% F1 for transformer\-based deployment\. Notably, no transformer configuration Pareto\-dominates XGBoost on this dataset, as XGBoost achieves equivalent accuracy at orders\-of\-magnitude smaller size\.
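The dominance relation behind this analysis can be checked directly on the Table I values. The sketch below uses the FP32 and INT8 rows for C-MAPSS (MobileBERT and the weighted-loss variants are omitted for brevity) and computes the frontier twice: over all models, and over transformers only.

```python
# (size in MB, C-MAPSS F1 in %) copied from Table I.
models = {
    "RF-200": (17.3, 87.3), "XGBoost": (0.5, 87.9), "SVM": (0.5, 81.6),
    "LR": (0.001, 83.0), "DistilBERT": (255, 87.6), "TinyBERT-6L": (255, 87.9),
    "TinyBERT-4L": (55, 87.8), "DistilBERT + INT8": (132, 85.8),
    "TinyBERT-4L + INT8": (41, 86.9),
}

def dominates(a, b):
    """a dominates b if it is no larger and no less accurate, and differs from b."""
    return a[0] <= b[0] and a[1] >= b[1] and a != b

def pareto_front(points):
    """Names of non-dominated points, sorted by size."""
    return sorted(
        (name for name, p in points.items()
         if not any(dominates(q, p) for q in points.values())),
        key=lambda name: points[name][0],
    )

transformers = {k: v for k, v in models.items() if "BERT" in k}
front_all = pareto_front(models)
front_transformers = pareto_front(transformers)
```

Under strict dominance over all rows, only LR and XGBoost remain on the frontier. TinyBERT-4L appears on it once the comparison is restricted to transformers, alongside its INT8 variant and TinyBERT-6L.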
Figure 1: Compression\-accuracy Pareto frontier on C\-MAPSS\. Traditional ML methods \(blue squares\) dominate the efficiency frontier\. TinyBERT\-4L \(pink circle\) is the most deployment\-friendly transformer\.
Figure 2: F1\-Score heatmap across all model configurations and datasets\. White horizontal lines separate model categories\. SECOM remains unsolved across all approaches, while C\-MAPSS shows strong performance from both traditional ML and transformers\.
Figure 3: Accuracy vs\. CPU inference latency on C\-MAPSS\. Bubble size represents model size in MB\. Traditional ML \(blue\) achieves comparable accuracy at orders\-of\-magnitude lower latency\. The adaptive pipeline \(orange\) achieves near\-optimal accuracy at only 19\.5 ms average latency\.
## V Discussion
### V\-A When to Use Transformers for Fault Detection
Our results suggest transformers are not inherently superior to traditional ML for tabular fault detection\. On C\-MAPSS, they match XGBoost’s accuracy but at dramatically higher resource cost\. The case for transformers strengthens when: \(1\) data is unstructured or semi\-structured, \(2\) transfer learning from pretrained representations provides value, \(3\) the task requires modeling complex feature interactions that tree\-based methods miss, or \(4\) the deployment target has sufficient resources to accommodate larger models\.
### V\-B The Class Imbalance Challenge
The poor performance of every method on SECOM, and of every method except XGBoost on UCI\-PM, highlights a fundamental limitation: fault detection datasets are inherently imbalanced \(real failure rates of 1–10%\), and standard remedies such as balanced class weights and weighted loss functions are insufficient\. Future work should explore oversampling \(SMOTE\), focal loss, and ensemble methods specifically designed for extreme imbalance\.
### V\-C Deployment Recommendations
1. Devices with minimal resources \(<1 MB budget\): XGBoost or Logistic Regression\. Traditional methods provide the best accuracy\-per\-byte\.
2. Devices with moderate resources \(50–100 MB\): TinyBERT\-4L \+ INT8 \(41 MB\)\. The smallest viable transformer with 86\.9% F1\.
3. Devices with flexible resources \(>100 MB\): The adaptive pipeline \(TinyBERT\-4L triage \+ DistilBERT expert\)\. Best accuracy\-latency balance at 87\.6% F1 and 19\.5 ms average latency\.
4. Server\-side validation: Full DistilBERT or ensemble methods for maximum accuracy when resources are unconstrained\.
### V\-D Limitations
\(1\) CPU latency was measured on Colab infrastructure, not dedicated mobile hardware \(Apple Neural Engine or Qualcomm NPU results may differ\)\. \(2\) MobileBERT’s failure may be specific to our tabular\-to\-text input representation; direct numerical embedding could yield different results\. \(3\) We did not evaluate pruning or quantization\-aware training, which may improve compressed model accuracy\. \(4\) The tabular\-to\-text conversion introduces tokenization overhead that would not exist with native numerical model architectures\. \(5\) SECOM and UCI\-PM results reflect the genuine difficulty of these datasets under our experimental setup\.
## VI Conclusion
We presented an honest benchmark of lightweight transformer models for on\-device fault detection across three public datasets\. Our key finding is that lightweight transformers match traditional ML accuracy on well\-separated sensor data \(87\.8% vs\. 87\.9% F1 on C\-MAPSS\) but at 100× the model size and orders\-of\-magnitude higher latency\. TinyBERT\-4L emerges as the most viable transformer for edge deployment at 55 MB and 18 ms, and INT8 quantization further reduces this to 41 MB with minimal accuracy loss\. Our adaptive inference pipeline achieves the best accuracy\-latency tradeoff by routing 97\.9% of predictions through a lightweight triage model\. Both traditional and transformer methods fail on severely imbalanced datasets, indicating that class imbalance, rather than model architecture, is the primary barrier to reliable fault detection\. We release all code and configurations to support reproducible edge ML research\.
## Data Availability
## References
- \[1\] Y\. Ran, X\. Zhou, P\. Lin, Y\. Wen, and R\. Deng, “A survey of predictive maintenance: Systems, purposes and approaches,” arXiv preprint arXiv:1911\.10539, 2019\.
- \[2\] S\. Dhar, J\. Guo, J\. Liu, S\. Tripathi, U\. Kurup, and M\. Shah, “A survey of on\-device machine learning,” ACM Trans\. Internet of Things, vol\. 2, no\. 3, pp\. 1–49, 2021\.
- \[3\] Z\. Chen, D\. Chen, X\. Zhang, Z\. Yuan, and X\. Cheng, “Learning graph structures with transformer for multivariate time\-series anomaly detection,” IEEE Internet of Things J\., 2022\.
- \[4\] Y\. Tay, M\. Dehghani, D\. Bahri, and D\. Metzler, “Efficient transformers: A survey,” ACM Computing Surveys, vol\. 55, no\. 6, pp\. 1–28, 2022\.
- \[5\] G\. Hinton, O\. Vinyals, and J\. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503\.02531, 2015\.
- \[6\] B\. Jacob et al\., “Quantization and training of neural networks for efficient integer\-arithmetic\-only inference,” in Proc\. CVPR, 2018, pp\. 2704–2713\.
- \[7\] S\. Han, H\. Mao, and W\. J\. Dally, “Deep compression,” in Proc\. ICLR, 2016\.
- \[8\] V\. Sanh, L\. Debut, J\. Chaumond, and T\. Wolf, “DistilBERT, a distilled version of BERT,” arXiv preprint arXiv:1910\.01108, 2019\.
- \[9\] X\. Jiao et al\., “TinyBERT: Distilling BERT for natural language understanding,” in Findings of EMNLP, 2020, pp\. 4163–4174\.
- \[10\] Z\. Sun et al\., “MobileBERT: a compact task\-agnostic BERT for resource\-limited devices,” in Proc\. ACL, 2020, pp\. 2158–2170\.
- \[11\] R\. David et al\., “TensorFlow Lite Micro: Embedded ML for TinyML systems,” in Proc\. MLSys, 2021\.
- \[12\] A\. Saxena, K\. Goebel, D\. Simon, and N\. Eklund, “Damage propagation modeling for aircraft engine run\-to\-failure simulation,” in Proc\. PHM, 2008\.
- \[13\] M\. McCann and A\. Johnston, “SECOM dataset,” UCI ML Repository, 2008\.
- \[14\] S\. Matzka, “AI4I 2020 predictive maintenance dataset,” UCI ML Repository, 2020\.
- \[15\] W\. Zhang, D\. Yang, and H\. Wang, “Data\-driven methods for predictive maintenance,” IEEE Systems J\., vol\. 13, no\. 3, pp\. 2213–2227, 2019\.
- \[16\] A\. Widodo and B\. S\. Yang, “Support vector machine in machine condition monitoring,” Mech\. Syst\. Signal Process\., vol\. 21, no\. 6, pp\. 2560–2574, 2007\.
- \[17\] T\. Chen and C\. Guestrin, “XGBoost: A scalable tree boosting system,” in Proc\. KDD, 2016, pp\. 785–794\.
- \[18\] X\. Li, Q\. Ding, and J\. Q\. Sun, “Remaining useful life estimation using deep CNNs,” Rel\. Eng\. Syst\. Safety, vol\. 172, pp\. 1–11, 2018\.
- \[19\] S\. Zheng, K\. Ristovski, A\. Farahat, and C\. Gupta, “LSTM for remaining useful life estimation,” in Proc\. ICPHM, 2017, pp\. 88–95\.
- \[20\] H\. Wu, J\. Xu, J\. Wang, and M\. Long, “Autoformer: Decomposition transformers with auto\-correlation,” in Proc\. NeurIPS, 2021\.