Transformer-Based Language Models Across Domain Verticals: Architectures, Applications and Critical Assessment

arXiv cs.CL Papers

Summary

A comprehensive survey of transformer-based language models covering architectures, applications across domain verticals (healthcare, finance, legal, etc.), and critical assessment of trade-offs including compute cost, alignment, and data provenance.

arXiv:2606.24331v1 Announce Type: new

# Transformer-Based Language Models Across Domain Verticals: Architectures, Applications and Critical Assessment
Source: [https://arxiv.org/html/2606.24331](https://arxiv.org/html/2606.24331)
###### Abstract

Transformer-based language models have become the default substrate for natural language processing, and the pace of new releases has made it hard for practitioners to separate durable ideas from the noise of incremental announcements. This review works at two levels. At the level of mechanism, we organise the main transformer families into a working taxonomy covering encoder-only, decoder-only, encoder-decoder, long-context, permutation-based, and generator-discriminator variants. We then extend the discussion to post-2023 developments that changed the picture in practice: instruction tuning, reinforcement learning from human feedback, direct preference optimisation, mixture-of-experts scaling, retrieval augmentation, and the current flagship model families from OpenAI, Anthropic, Google, Meta, Mistral and DeepSeek. At the level of use, we survey deployments across healthcare, finance, legal, education, customer service, creative writing and scientific work, and link each to the specific capabilities that make a transformer the appropriate tool. The contribution of this paper is a critical assessment built on that survey. We compare architectures on four axes that matter to deployment decisions and quantify the trade-off between parameter count and energy cost. We also discuss how alignment methods, data provenance and benchmark saturation change what it means to call a model “state of the art”. The final section lists the research questions that we think deserve more attention.

###### keywords:

Large language models, Transformer, BERT, GPT, Survey

Journal: Preprint

Affiliations:

[a] SCOPE, VIT-AP University, Amaravathi, Andhra Pradesh 522241, India

[b] SCORE, VIT, Katpadi, Tamil Nadu 632014, India

## 1 Introduction

Something changed in neural language modelling around 2017. Until that year, anyone wanting to build a translation or summarisation system reached for a recurrent network of some flavour. The paper by Vaswani et al. ([2017](https://arxiv.org/html/2606.24331#bib.bib1)), whose title claimed that attention was all you need, argued that recurrence could be dropped entirely while still matching recurrent baselines on translation. The practical win was parallelisation. Training runs that had been sequential by construction suddenly ran as fast as the hardware allowed. Two years on, most supervised NLP leaderboards were dominated by transformer variants. Five years on, the same architecture, pretrained on enormous web corpora, sat behind products used by hundreds of millions of people every day.

The pressure has since shifted from building models to choosing between them. The release calendar has become relentless, and new flagship models arrive almost monthly. The claimed capabilities are frequently ahead of what a careful reader can verify. Practitioners trying to pick a model for a specific setting, for instance clinical coding or contract review, have to choose between a handful of large proprietary systems accessed through an API, a growing set of open-weight models they can run in-house, and older specialised models that still perform well on narrow benchmarks. The literature does not always make the trade-offs clear.

We have three goals. First, we organise transformer architectures into a taxonomy that is useful for deployment decisions, rather than listing models by release date. Second, we survey applications across seven domain verticals and tie each to the architectural properties that matter for that domain. Third, we provide a critical assessment of the trade-offs that are often glossed over in vendor announcements: compute and energy cost, alignment behaviour, data provenance, and the gap between benchmark scores and field performance.

The review extends earlier surveys such as Zhao et al. ([2023](https://arxiv.org/html/2606.24331#bib.bib2)) and Minaee et al. ([2024](https://arxiv.org/html/2606.24331#bib.bib3)) by covering developments that reshaped practice from 2023 onwards, including instruction-tuned and preference-optimised models (Ouyang et al., [2022](https://arxiv.org/html/2606.24331#bib.bib4); Rafailov et al., [2023](https://arxiv.org/html/2606.24331#bib.bib5)), mixture-of-experts systems (Fedus et al., [2022](https://arxiv.org/html/2606.24331#bib.bib6); Jiang et al., [2024](https://arxiv.org/html/2606.24331#bib.bib7)), retrieval-augmented generation (Lewis et al., [2020b](https://arxiv.org/html/2606.24331#bib.bib8)), and the current generation of flagship models from OpenAI, Anthropic, Google DeepMind, Meta, Mistral AI, and DeepSeek.

The rest of the paper is organised as follows. Section [2](https://arxiv.org/html/2606.24331#S2) gives the technical background needed to follow the later discussion. Section [3](https://arxiv.org/html/2606.24331#S3) presents the architecture taxonomy. Section [4](https://arxiv.org/html/2606.24331#S4) covers the post-2023 developments that changed how transformers are trained and served in practice. Section [5](https://arxiv.org/html/2606.24331#S5) surveys applications by domain. Section [6](https://arxiv.org/html/2606.24331#S6) is the critical assessment. Section [7](https://arxiv.org/html/2606.24331#S7) lists open research questions. Section [8](https://arxiv.org/html/2606.24331#S8) concludes.

## 2 Background

Before the transformer, sequence modelling relied mainly on recurrent networks and their long short-term memory (LSTM) variants (Hochreiter and Schmidhuber, [1997](https://arxiv.org/html/2606.24331#bib.bib9)). Recurrent models process tokens one at a time, which limits how much of the training can be parallelised and makes it difficult to capture dependencies across long passages. Attempts to fix these problems with gated recurrent units and attention mechanisms over recurrent states helped on specific tasks but did not change the basic bottleneck.

The transformer removed the recurrence\. A transformer layer contains two sub\-layers: a self\-attention block, which lets each token attend to every other token in the input through scaled dot\-product attention, and a position\-wise feed\-forward network\. Residual connections and layer normalisation are applied around each sub\-layer\. Because attention is computed in parallel across all token positions, the full sequence can be processed in a single forward pass\. Order information, which recurrence encoded implicitly, is supplied explicitly through positional embeddings\.
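As a rough illustration of the self-attention block described above, the following sketch computes scaled dot-product attention, softmax(QKᵀ/√d_k)·V, in plain Python. It assumes a single head and skips the learned query, key and value projections (the toy input `X` is reused for all three); the function names and values are illustrative only.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K: one d_k-dim vector per token; V: one d_v-dim vector per token."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs, weights = [], []
    for q in Q:  # each query row is independent, hence parallelisable
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        outputs.append(out)
    return outputs, weights

# Toy input: 3 tokens, d_k = d_v = 2 (arbitrary values, no learned projections).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = scaled_dot_product_attention(X, X, X)
```

Each row of `attn` is a probability distribution over the input positions. Because no row depends on another, the loop over queries is the part that real implementations run as a single batched matrix multiplication.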

The original transformer used an encoder–decoder split for machine translation. Later work showed that each half could be used on its own. An encoder-only model such as BERT (Devlin et al., [2019](https://arxiv.org/html/2606.24331#bib.bib10)) produces bidirectional representations useful for classification, tagging, and retrieval. A decoder-only model such as the GPT family (Radford et al., [2018](https://arxiv.org/html/2606.24331#bib.bib11), [2019](https://arxiv.org/html/2606.24331#bib.bib12); Brown et al., [2020](https://arxiv.org/html/2606.24331#bib.bib13)) is trained with a causal mask and generates text one token at a time. An encoder–decoder model such as T5 (Raffel et al., [2020](https://arxiv.org/html/2606.24331#bib.bib14)) keeps both halves and recasts every task as a text-to-text problem.

All of these models share the same training recipe at a high level: self-supervised pretraining on a large unlabelled corpus, followed by supervised fine-tuning on a smaller task-specific dataset, or, more recently, by instruction tuning and preference optimisation on curated demonstration and comparison data. Model scale has grown by more than three orders of magnitude over the last seven years, from BERT-Large at 340 million parameters to publicly discussed systems with over a trillion parameters (Kaplan et al., [2020](https://arxiv.org/html/2606.24331#bib.bib15); Hoffmann et al., [2022](https://arxiv.org/html/2606.24331#bib.bib16)).

## 3 A Working Taxonomy of Transformer Architectures

The literature often groups transformer-based models by release date or by size. Neither is a good guide for someone choosing a model. We organise the main families by the structural properties that determine what the model is good for. Table [1](https://arxiv.org/html/2606.24331#S3.T1) summarises the taxonomy; the rest of this section explains it.

Table 1: Working taxonomy of transformer-based language models.

### 3.1 Encoder-only models

BERT (Devlin et al., [2019](https://arxiv.org/html/2606.24331#bib.bib10)) was the first widely used encoder-only model. It stacks transformer encoder blocks and is pretrained with two objectives. In masked language modelling (MLM), a fraction of input tokens, usually 15%, is selected for prediction; most of the selected tokens are replaced by a [MASK] token, and the model is trained to predict the original tokens from the surrounding context. In next-sentence prediction (NSP), the model is given two segments and asked whether the second follows the first. The BERT-base configuration uses 12 layers, 12 attention heads per layer, and hidden size 768, giving 110 million parameters; BERT-large scales these to 24 layers and 340 million parameters.
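The masking step of MLM can be sketched as follows. This is a minimal illustration, not BERT's actual preprocessing code: it works on whitespace tokens rather than WordPiece ids, and the function name and toy vocabulary are hypothetical. The 80/10/10 split among selected tokens (mask, random token, unchanged) follows the BERT paper and is not spelled out in the text above.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one MLM training example."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None = position is ignored by the MLM loss
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                # not selected for prediction
        labels[i] = tok             # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK                 # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)    # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
corrupted, labels = mask_tokens("the cat sat on the mat".split(), vocab)
```

Only positions with a non-`None` label contribute to the loss, which is why MLM learns from roughly 15% of each input.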

RoBERTa (Liu et al., [2019](https://arxiv.org/html/2606.24331#bib.bib17)) kept the architecture but changed the training. The authors removed NSP, trained on roughly ten times more data for longer, used larger batches and dynamic masking, and showed that a well-tuned BERT recipe could close the gap to newer models. The lesson, which has been repeated many times since, is that architectural novelty is often confounded with training data and training budget.

DeBERTa (He et al., [2021](https://arxiv.org/html/2606.24331#bib.bib18)) added disentangled attention, which separates content and position information in the attention computation, and improved on RoBERTa by a few points on GLUE and SuperGLUE. Encoder-only models are still the right choice for tasks where the input is bounded and the output is a label, a span, or a retrieval score. They are cheaper to fine-tune than a generative model and they produce embeddings that are useful for semantic search.

### 3.2 Decoder-only models

The GPT family trains a transformer decoder with a causal attention mask, so each token can only attend to itself and earlier tokens. The model is trained to predict the next token given the prefix. GPT-2 (Radford et al., [2019](https://arxiv.org/html/2606.24331#bib.bib12)) showed that this simple objective, at 1.5 billion parameters, produced surprisingly fluent text. GPT-3 (Brown et al., [2020](https://arxiv.org/html/2606.24331#bib.bib13)) scaled the same recipe to 175 billion parameters and demonstrated *in-context learning*: given a few worked examples in the prompt, the model could perform a new task without any gradient updates.
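The causal mask can be made concrete with a small sketch. The code below builds the lower-triangular mask for a four-token sequence and applies it inside a softmax; the helper names are illustrative, and uniform raw scores are assumed so that the effect of the mask is easy to read off.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over `scores`, giving zero weight to disallowed positions."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    m = max(kept)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
mask = causal_mask(n)
# With uniform raw scores, row i spreads its weight evenly over positions 0..i.
weights = [masked_softmax([0.0] * n, mask[i]) for i in range(n)]
```

The first token can only see itself, so `weights[0]` puts all its mass on position 0, while the last token attends to the whole prefix. No position ever receives weight from a later one, which is what allows the same network to be trained in parallel and then sampled one token at a time.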

Once developers saw in-context learning work, the way models were used began to shift. Tasks that had previously called for a fine-tuning job could be handled by prompt engineering, which moved the cost of specialisation from training time to inference time. It also exposed a new failure mode. The model is sensitive to the wording of the prompt, the order of the examples, and the random seed of the sampling routine, so the same prompt can produce different outputs on different runs.

Post-2023 decoder-only models, which we cover in Section [4](https://arxiv.org/html/2606.24331#S4), are essentially the GPT recipe with better data, better alignment, and better efficiency. Most open-weight releases now belong to this family. The Llama 2 and Llama 3 series from Meta (Touvron et al., [2023](https://arxiv.org/html/2606.24331#bib.bib19); Grattafiori et al., [2024](https://arxiv.org/html/2606.24331#bib.bib20)), the Mistral and Mixtral models (Jiang et al., [2023](https://arxiv.org/html/2606.24331#bib.bib21), [2024](https://arxiv.org/html/2606.24331#bib.bib7)), and the DeepSeek series (DeepSeek-AI, [2024](https://arxiv.org/html/2606.24331#bib.bib22)) between them cover the bulk of current open-weight deployments.

### 3.3 Encoder-decoder models

T5 (Raffel et al., [2020](https://arxiv.org/html/2606.24331#bib.bib14)) cast every NLP task as text-to-text. Translation inputs are prefixed with “translate English to German:”; summarisation inputs with “summarize:”. The model is trained with a span-corruption objective in which contiguous spans are replaced with sentinel tokens and the decoder has to produce the missing content. T5 comes in sizes from 60 million to 11 billion parameters. BART (Lewis et al., [2020a](https://arxiv.org/html/2606.24331#bib.bib23)) uses a similar encoder-decoder arrangement but with a denoising-autoencoder objective that mixes token masking, sentence permutation, and document rotation.
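A small sketch may help make span corruption concrete. The function below takes a token list and a set of spans and produces the encoder input and decoder target. Two things are assumptions for illustration: the spans are fixed by hand, whereas T5 samples them at random during pretraining, and the sentinel names follow the `<extra_id_N>` convention of the Hugging Face T5 tokenizer.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input, target).

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    inp, tgt = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inp.extend(tokens[cursor:start])  # keep the text before the span
        inp.append(sentinel)              # the span collapses to one sentinel
        tgt.append(sentinel)              # the target names the sentinel ...
        tgt.extend(tokens[start:end])     # ... followed by the dropped tokens
        cursor = end
    inp.extend(tokens[cursor:])
    tgt.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
# inp: Thank you <extra_id_0> me to your party <extra_id_1> week
# tgt: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The target is much shorter than the input, which is part of why the objective is cheap for the decoder.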

Encoder\-decoder models remain competitive on tasks where the output is a constrained rewrite of the input: abstractive summarisation, translation, grammatical correction, and schema\-driven generation\. They allow the encoder and decoder to have different depths and attention patterns, which is useful when the input is long and the output is short\.

### 3.4 Long-context variants

Standard self\-attention scales quadratically in sequence length, which makes it prohibitive to train on inputs longer than a few thousand tokens\. Several families of architecture try to work around that\.
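The quadratic growth is easy to see with back-of-the-envelope arithmetic. The sketch below estimates the memory needed to hold the full attention score matrices of one layer for one sequence. The head count of 12 and the 2 bytes per value (16-bit floats) are assumptions chosen for illustration, not figures from any particular model.

```python
def attention_matrix_bytes(seq_len, num_heads=12, bytes_per_value=2):
    """Memory for one layer's full attention score matrices (one sequence)."""
    return num_heads * seq_len * seq_len * bytes_per_value

for n in (512, 4096, 32768):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"{n:>6} tokens -> {mib:,.0f} MiB per layer")
# Prints 6 MiB, 384 MiB and 24,576 MiB respectively.
```

Multiplying the sequence length by 8 multiplies the memory by 64, before counting layers, batch size, or the activations kept for the backward pass.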

Transformer-XL (Dai et al., [2019](https://arxiv.org/html/2606.24331#bib.bib24)) brings back a limited form of recurrence, but at the segment level rather than the token level. Hidden states from the previous segment are cached and reused as extra context when the next segment is processed, which lets information flow across segment boundaries without forcing the model to ingest everything in a single pass. To keep the caching scheme self-consistent, the authors also swap absolute for relative positional encodings. The effective context window is larger, while the training cost stays close to that of an ordinary fixed-length transformer.

Longformer (Beltagy et al., [2020](https://arxiv.org/html/2606.24331#bib.bib25)) and BigBird (Zaheer et al., [2020](https://arxiv.org/html/2606.24331#bib.bib26)) replace the dense attention matrix with a sparse pattern. Each token attends to a local window and to a small number of global tokens. The resulting attention matrix has a linear or near-linear number of non-zero entries, which makes sequences of tens of thousands of tokens tractable. More recent work uses linear attention approximations (Choromanski et al., [2020](https://arxiv.org/html/2606.24331#bib.bib27)) and dilated or strided patterns (Ding et al., [2023](https://arxiv.org/html/2606.24331#bib.bib28)). FlashAttention (Dao et al., [2022](https://arxiv.org/html/2606.24331#bib.bib29)) takes a different route. It does not change the attention pattern; it re-implements the exact attention computation in a way that avoids materialising the full attention matrix in memory, which gives a large speed-up without an accuracy penalty.
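To illustrate why the sparse patterns scale roughly linearly, the sketch below counts the (query, key) pairs allowed by a local window plus a few global tokens. It is a simplified model of the pattern described above: the first `num_global` positions are treated as global tokens, the window size is arbitrary, and BigBird's additional random connections are left out.

```python
def sparse_attention_pairs(n, window, num_global):
    """Count (query, key) pairs allowed by a local window plus global tokens.

    Position i attends to j when |i - j| <= window, or when either i or j is
    one of the first `num_global` positions (treated here as global tokens).
    """
    count = 0
    for i in range(n):
        for j in range(n):
            local = abs(i - j) <= window
            is_global = i < num_global or j < num_global
            if local or is_global:
                count += 1
    return count

small = sparse_attention_pairs(256, window=16, num_global=2)
large = sparse_attention_pairs(512, window=16, num_global=2)
dense_large = 512 * 512
```

Doubling the sequence length roughly doubles the number of allowed pairs (`large / small` is close to 2), whereas the dense count quadruples.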

### 3.5 Permutation-based and generator–discriminator models

XLNet (Yang et al., [2019](https://arxiv.org/html/2606.24331#bib.bib30)) was an attempt to combine the bidirectional context of BERT with the autoregressive factorisation of GPT. It uses permutation language modelling, in which a factorisation order is sampled at random and each token is predicted from the tokens that precede it in that order. This avoids the MLM artefact of seeing [MASK] tokens at training time but not at inference time. XLNet also introduces two-stream self-attention, which separates the content and query streams so that a token can be predicted without seeing itself. XLNet outperformed BERT on most GLUE tasks at the time of its release, but was expensive to train and has been largely superseded by models that optimise the MLM recipe more directly.

ELECTRA (Clark et al., [2020](https://arxiv.org/html/2606.24331#bib.bib31)) replaces MLM with replaced token detection. A small generator network proposes replacements for masked tokens, and the main model, the discriminator, is trained to classify each token as original or replaced. Because the loss is defined over all input positions rather than only the masked ones, ELECTRA learns more per training example and matches BERT’s accuracy with a small fraction of the compute.
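The discriminator targets in replaced token detection can be sketched in a few lines. The generator is not modelled here; its samples are written in by hand as a hypothetical outcome, and the helper name is illustrative.

```python
def rtd_labels(original, corrupted):
    """Per-position discriminator targets: 1 = replaced, 0 = original."""
    return [int(o != c) for o, c in zip(original, corrupted)]

original = ["the", "chef", "cooked", "the", "meal"]
# Suppose positions 1 and 2 were masked and a (hypothetical) generator sampled
# "chef" (which happens to match the original) and "ate" (which does not).
corrupted = ["the", "chef", "ate", "the", "meal"]
labels = rtd_labels(original, corrupted)

masked_positions = 2            # positions an MLM loss would score
scored_positions = len(labels)  # positions the discriminator loss scores
```

A generator sample that matches the original token counts as "original", so only position 2 is labelled as replaced. All five positions carry a training signal, against two for MLM on the same example.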

### 3.6 Mixture-of-experts

Mixture-of-experts (MoE) layers replace a dense feed-forward block with a set of expert sub-networks and a gating function that routes each token to a small number of experts (Shazeer et al., [2017](https://arxiv.org/html/2606.24331#bib.bib32); Fedus et al., [2022](https://arxiv.org/html/2606.24331#bib.bib6)). Only the selected experts are activated for a given token, so the number of parameters touched per forward pass is a fraction of the total parameter count. MoE is the main way production systems decouple model capacity from inference cost.
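The routing step can be sketched as follows. This shows one common formulation, top-k selection followed by a softmax over the selected gate logits; it is not the exact gating function of any cited system, and the toy experts and gate values are made up for illustration.

```python
import math

def top_k_route(gate_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(gate_logits)), key=lambda e: gate_logits[e], reverse=True)
    top = order[:k]
    m = max(gate_logits[e] for e in top)
    exps = [math.exp(gate_logits[e] - m) for e in top]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(top, exps)]

def moe_layer(token, gate_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; the others never run."""
    routes = top_k_route(gate_logits, k)
    out = [0.0] * len(token)
    for e, w in routes:
        y = experts[e](token)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, [e for e, _ in routes]

# Eight toy "experts": expert e simply scales its input by (e + 1).
experts = [lambda x, s=e + 1: [s * xi for xi in x] for e in range(8)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # arbitrary values
out, used = moe_layer([1.0, -1.0], gate_logits, experts, k=2)
```

Only two of the eight experts execute for this token, which is the sense in which capacity (eight experts' worth of parameters) is decoupled from per-token compute (two experts' worth).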

Mixtral-8x7B (Jiang et al., [2024](https://arxiv.org/html/2606.24331#bib.bib7)) is an open-weight MoE with 47 billion total parameters and about 13 billion activated per token. DeepSeek-V3 (DeepSeek-AI, [2024](https://arxiv.org/html/2606.24331#bib.bib22)) pushes the same design to a much larger scale. What you give up for this is not accuracy but operational convenience. Serving an MoE system is harder than serving a dense one. The routing function produces irregular memory-access patterns that sit awkwardly on top of GPU kernels written for dense matrix multiplication, and balancing the load across experts, so that no one expert becomes a bottleneck, is a research question on its own.

## 4 The Post-2023 Turn

By mid\-2023, scaled\-up decoder\-only transformers were fluent enough for commercial deployment\. They were also unreliable in ways that mattered to customers\. They hallucinated facts\. They followed instructions when it suited them\. They sometimes produced output that no support team wanted in a log file\. A cluster of techniques, some of them older than 2023 and some newer, came together at roughly this point to address those problems\. The rest of this section takes the ones that stuck and describes them\.

### 4.1 Instruction tuning

The InstructGPT paper (Ouyang et al., [2022](https://arxiv.org/html/2606.24331#bib.bib4)) reported a surprising result. Annotators preferred the outputs of a 1.3-billion-parameter model, fine-tuned on a few tens of thousands of human-written demonstrations and preference comparisons, to those of the raw 175-billion-parameter GPT-3, a model more than a hundred times larger. The demonstrations spanned summarisation, classification, question answering, translation, and rewriting. Supervised fine-tuning on data of this kind, now routinely called instruction tuning, teaches the model the genre of following user requests rather than simply the distribution of web text.

Instruction tuning data is expensive to collect at scale. Open efforts such as Dolly (Conover et al., [2023](https://arxiv.org/html/2606.24331#bib.bib37)) and Alpaca (Taori et al., [2023](https://arxiv.org/html/2606.24331#bib.bib38)) demonstrated that synthetic instructions generated by stronger models could substitute, with caveats. Models trained on synthetic data inherit the errors of the generator and can fail in correlated ways.

### 4.2 Reinforcement learning from human feedback and its successors

InstructGPT also introduced, at production scale, reinforcement learning from human feedback (RLHF). Human annotators rank pairs of model outputs. A reward model is trained to predict these rankings. The policy model is then fine-tuned using proximal policy optimisation (Schulman et al., [2017](https://arxiv.org/html/2606.24331#bib.bib33)) to maximise the learned reward while staying close to the supervised baseline through a Kullback–Leibler penalty. RLHF is fiddly. Reward hacking, mode collapse, and reward over-optimisation are all documented failure modes (Gao et al., [2023](https://arxiv.org/html/2606.24331#bib.bib34)).
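Two pieces of this pipeline reduce to short formulas, sketched below under simplifying assumptions. The reward model is trained with a pairwise (Bradley–Terry style) loss on scalar rewards, and the policy is optimised against the learned reward minus a KL term, estimated here per sample from log-probabilities. The function names, the value of `beta` and the numbers are illustrative; the PPO update itself is not shown.

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), written as log(1 + exp(-diff))."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))

def kl_penalised_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Learned reward minus beta times a per-sample estimate of the KL term."""
    return reward - beta * (logp_policy - logp_reference)

# The loss is small when the reward model agrees with the human ranking ...
agree = pairwise_reward_loss(r_chosen=2.0, r_rejected=0.0)
# ... and large when it ranks the pair the wrong way round.
disagree = pairwise_reward_loss(r_chosen=0.0, r_rejected=2.0)

# A response the policy has made far more likely than the reference model
# did has part of its reward taken back.
shaped = kl_penalised_reward(reward=1.0, logp_policy=-3.0, logp_reference=-5.0)
```

The second function is where reward over-optimisation is held in check: with `beta` set to zero the policy is free to drift as far from the supervised baseline as the reward model will pay for.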

Direct preference optimisation (DPO) (Rafailov et al., [2023](https://arxiv.org/html/2606.24331#bib.bib5)) removes the reward model entirely. It rewrites the RLHF objective so that the policy is trained directly on preference pairs with a supervised-style loss. DPO is much easier to implement and tune, and it has become a standard alternative for open-weight models. Variants include identity preference optimisation (IPO) and Kahneman–Tversky optimisation (Ethayarajh et al., [2024](https://arxiv.org/html/2606.24331#bib.bib35)).
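For a single preference pair, the DPO loss can be written in a few lines. The sketch assumes the summed log-probabilities of the chosen and rejected responses are already available from the policy and from a frozen reference model; the argument names and the value of `beta` are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi/pi_ref, chosen
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi/pi_ref, rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))

# At initialisation the policy equals the reference, so the loss is log 2.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response's likelihood relative to the reference lowers it.
better = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

There is no sampling loop and no separate reward network: the loss is an ordinary differentiable function of four log-probabilities, which is why it slots into a standard supervised training script.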

Constitutional AI (Bai et al., [2022](https://arxiv.org/html/2606.24331#bib.bib36)) takes a different route. Instead of collecting pairwise human comparisons, the model is asked to critique and revise its own outputs against a set of written principles, and the revised outputs, together with AI-generated preference labels, are used as the training signal. This reduces the amount of human labelling required but makes the behaviour of the final model depend on the quality of the principles.

### 4.3 Retrieval augmentation

Retrieval-augmented generation (RAG) (Lewis et al., [2020b](https://arxiv.org/html/2606.24331#bib.bib8)) addresses two of the sharpest weaknesses of pretrained models: stale knowledge and factual hallucination. The system retrieves passages from an external corpus at inference time and conditions generation on the retrieved text. Retrieval can be done with sparse methods such as BM25, dense methods using an embedding model trained with contrastive objectives, or hybrids.

RAG is now the default pattern for enterprise deployments where the model has to answer questions about documents the vendor did not train on\. It also shifts the reliability problem from the model to the retriever: if the retriever surfaces the wrong passage, the model will faithfully quote the wrong passage\.
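The retrieve-then-condition pattern can be sketched end to end with a sparse retriever. The code below scores three toy passages with BM25 and places the best match in the prompt. The passages, the prompt wording, the whitespace tokenisation and the parameter values `k1=1.5`, `b=0.75` are all assumptions for illustration; the call to the language model is left out.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the BM25 ranking function."""
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(d) for d in tokenised) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in tokenised for term in set(d))
    scores = []
    for d in tokenised:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

def build_prompt(question, docs, top_k=1):
    """Rank passages and prepend the best ones to the question."""
    scores = bm25_scores(question, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    context = "\n".join(f"[{i}] {docs[i]}" for i in ranked)
    prompt = f"Answer using only the passages below.\n{context}\n\nQuestion: {question}"
    return prompt, ranked

docs = [
    "Refunds are issued within 14 days of a return being received.",
    "Standard shipping takes three to five business days.",
    "Gift cards cannot be exchanged for cash.",
]
prompt, ranked = build_prompt("how long do refunds take", docs)
```

Everything the model will say about refunds now depends on `ranked`: had the retriever returned the shipping passage instead, the prompt would be just as well-formed and the answer just as confident.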

### 4.4 Parameter-efficient fine-tuning

Full fine-tuning of a multi-billion parameter model is expensive. Low-rank adaptation (LoRA) (Hu et al., [2022](https://arxiv.org/html/2606.24331#bib.bib39)) freezes the pretrained weights and inserts trainable low-rank matrices into each attention projection. The number of trainable parameters drops by two to three orders of magnitude with a small loss in downstream quality. Quantised variants such as QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2606.24331#bib.bib40)) combine LoRA with 4-bit weight quantisation and make it possible to fine-tune 65-billion-parameter models on a single 48 GB GPU.
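The arithmetic behind the parameter saving, and the shape of the adapted forward pass, can be sketched as follows. The 4096-wide projection and rank 8 are assumed values for illustration, and the tiny matrices exist only to make the forward pass checkable by hand.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full fine-tuning vs. a rank-`rank` LoRA update."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

full, lora = lora_param_counts(4096, 4096, rank=8)  # 16,777,216 vs 65,536

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (toy 2x2 identity)
A = [[1.0, 1.0]]              # rank-1 down-projection
B_init = [[0.0], [0.0]]       # B starts at zero, so the update starts at zero
y0 = lora_forward(W, A, B_init, [2.0, 3.0], alpha=2.0, rank=1)
```

For this single projection the reduction is a factor of 256; the larger overall savings come from adapting only some of the model's matrices. Initialising `B` to zero means the adapted model starts out identical to the pretrained one.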

### 4.5 Current flagship families

Public information on the largest current models is uneven, so we restrict the discussion to facts that are documented in technical reports or reliable press\.

OpenAI released GPT-4 in March 2023 (OpenAI, [2023](https://arxiv.org/html/2606.24331#bib.bib41)); it accepts images as well as text and is accessed through an API. The company has since released GPT-4 Turbo, GPT-4o, and successor models, and has not published parameter counts. Anthropic released Claude 2 in July 2023 and the Claude 3 family (Haiku, Sonnet, Opus) in March 2024, followed by Claude 3.5 Sonnet in June 2024 (Anthropic, [2024](https://arxiv.org/html/2606.24331#bib.bib42)). Anthropic has published research on constitutional AI and on interpretability, but not architectural details of the deployed models.

Google DeepMind released Gemini 1.0 in December 2023 and Gemini 1.5 in February 2024 (Gemini Team, Google, [2024](https://arxiv.org/html/2606.24331#bib.bib43)). Gemini 1.5 Pro reports context windows up to one million tokens, enabled by a combination of mixture-of-experts and long-context techniques. Meta released Llama 2 in July 2023 (Touvron et al., [2023](https://arxiv.org/html/2606.24331#bib.bib19)) and Llama 3 in April 2024 (Grattafiori et al., [2024](https://arxiv.org/html/2606.24331#bib.bib20)); both are open-weight, which has made them the starting point for a large collection of community fine-tunes. Mistral AI released Mistral 7B in September 2023 and Mixtral 8x7B in December 2023 (Jiang et al., [2023](https://arxiv.org/html/2606.24331#bib.bib21), [2024](https://arxiv.org/html/2606.24331#bib.bib7)); both are open-weight and focus on strong per-parameter performance. DeepSeek has released a series of open-weight dense and MoE models, including DeepSeek-V3 (DeepSeek-AI, [2024](https://arxiv.org/html/2606.24331#bib.bib22)), that have been competitive with closed-weight models on standard benchmarks.

Table [2](https://arxiv.org/html/2606.24331#S4.T2) summarises the public facts about these families. Parameter counts for closed-weight models are omitted where they are not officially disclosed, rather than guessed.

Table 2: Flagship model families as of 2024. Dashes indicate that the value has not been publicly disclosed by the vendor.

## 5 Applications Across Domain Verticals

Transformer\-based models are now deployed across a wide set of industries\. The interesting question is no longer whether they can be used, but where they give a real advantage over simpler baselines and where they introduce risks that the user should know about\. This section covers seven verticals\. For each, we describe the workload, summarise representative work, and identify the property of the architecture that makes it suitable\.

### 5.1 Healthcare

Healthcare text is a natural target for transformers. Clinical notes are unstructured, long, and full of domain-specific abbreviations. Extracting structured information from them, for example diagnosis codes or medication lists, is a well-defined supervised problem that benefits from pretrained contextual embeddings. Med-BERT (Rasmy et al., [2021](https://arxiv.org/html/2606.24331#bib.bib44)) adapts BERT to structured electronic health records and improves disease prediction relative to non-transformer baselines. BioBERT (Lee et al., [2020](https://arxiv.org/html/2606.24331#bib.bib45)) and ClinicalBERT (Alsentzer et al., [2019](https://arxiv.org/html/2606.24331#bib.bib46)) continue pretraining on PubMed and MIMIC-III respectively, producing embeddings that generalise better to biomedical text than general-purpose models.

Beyond extraction, transformer models have been used for radiology reporting (Hou et al., [2021](https://arxiv.org/html/2606.24331#bib.bib47)), for answering patient questions about medications, and for summarising discharge notes. The practical bottleneck in this vertical is not model quality. It is regulatory: clinical deployment requires auditability, and the opacity of a decoder-only generative model is a real barrier. Recent work on retrieval-augmented clinical question answering (Zakka et al., [2024](https://arxiv.org/html/2606.24331#bib.bib48)) attempts to close this gap by grounding outputs in citable sources.

### 5.2 Finance

Finance is a difficult vertical for language models. The text is noisy, the timestamps are part of the meaning, and the numbers embedded in the text are often the whole point of the document. FinBERT (Araci, [2019](https://arxiv.org/html/2606.24331#bib.bib49)) took the obvious route and fine-tuned BERT on financial news with a sentiment head. NumHTML (Yang et al., [2022](https://arxiv.org/html/2606.24331#bib.bib50)) did something more interesting: it combined textual and numerical features from earnings-call transcripts inside a hierarchical transformer, and used the joint representation to forecast stock returns. The property being exploited there is not fluency. It is the ability of attention to align signals of different kinds, text on one side and numbers on the other, inside a shared representation. Recent work has tilted towards prompting frontier models for fraud detection or analyst-report drafting. The results are uneven. When an error shows up later as a regulatory fine, domain-specific adaptation still earns its keep.

### 5.3 Legal

Legal writing sits at the opposite end of the spectrum from social-media text. Documents are long, sentences are precise, and small changes in wording matter. Transformer models have been used for contract clause classification, legal judgment prediction, and retrieval across case law (Chalkidis et al., [2020](https://arxiv.org/html/2606.24331#bib.bib51); Shaheen et al., [2020](https://arxiv.org/html/2606.24331#bib.bib52)). The most successful systems in this domain combine long-context encoders with retrieval, rather than relying on a pure generative model. Generation is risky: a model that invents a non-existent case citation can cause real harm (Dahl et al., [2024](https://arxiv.org/html/2606.24331#bib.bib53)). This is one of the clearest examples of a vertical where the right deployment pattern is “retrieve, extract, and verify” rather than “generate freely”.

### 5.4 Education

Two applications have received the most attention: automated essay scoring and personalised tutoring. Ormerod et al. ([2021](https://arxiv.org/html/2606.24331#bib.bib54)) show that efficient transformer models can match traditional automated scoring on standardised tests. Kulshreshtha et al. ([2022](https://arxiv.org/html/2606.24331#bib.bib55)) use few-shot generation to produce follow-up questions in an intelligent tutoring system. The field has been affected more than most by the release of free consumer chatbots; whether transformer-assisted grading is fair when students are also using transformers to write, and how to detect the latter, are now active research questions.

### 5.5 Customer service and conversational agents

Conversational agents are the most visible deployment of transformer models. The shift from rule-based bots to instruction-tuned generative models has improved handling of out-of-distribution queries, at the cost of new failure modes: hallucinated policies, fabricated order numbers, and inconsistent tone. Production systems increasingly wrap the language model in a retrieval layer, a policy checker, and a set of tools the model can call, rather than letting it answer freely (Schick et al., [2023](https://arxiv.org/html/2606.24331#bib.bib56)). What matters architecturally for this workload is a context window long enough to hold the relevant policy documents, reliable tool invocation, and function calls that actually return the structured response the deployer expects.

### 5\.6 Creative and content applications

Decoder\-only models produce coherent prose and, with the right prompt or a light fine\-tune, can be pushed towards a particular genre\. Marco et al\. \([2022](https://arxiv.org/html/2606.24331#bib.bib57)\) evaluated several transformer models across poetry, fiction, and lyrics\. Their finding is worth repeating: output quality depended more on how narrowly the stylistic prompt was specified than on which base model did the generating\. The open research problem is less about fluency and more about control: how to keep a long narrative consistent, how to respect constraints such as rhyme or metre, and how to avoid repeating patterns from the training data in ways that approach plagiarism\. Copyright is now a live legal question in this area rather than an abstract concern\.

### 5\.7 Scientific work

Transformer models are being used for literature search, hypothesis suggestion, and draft writing \(Qazvinian et al\., [2013](https://arxiv.org/html/2606.24331#bib.bib58); Zakka et al\., [2024](https://arxiv.org/html/2606.24331#bib.bib48)\)\. Recent work has gone further\. AlphaFold\-style protein language models adapt the transformer to amino\-acid sequences for structure prediction; code models such as Codex and its successors \(Chen et al\., [2021](https://arxiv.org/html/2606.24331#bib.bib59)\) have changed day\-to\-day software engineering\. Scientific applications are useful as a stress test for the claim that these models “reason”\. They do not, in any strong sense; they pattern\-match over their training distribution\. When the distribution is rich enough, as it is for protein sequences or common programming patterns, the pattern matching is useful\.

## 6 Critical Assessment

The survey part of this paper describes what has been built\. This section asks what the trade\-offs actually are\. We focus on five issues that come up in every real deployment and that are under\-represented in typical survey papers: architecture trade\-offs, compute cost, alignment and safety, data provenance, and the gap between benchmarks and field performance\.

### 6\.1 Architecture trade\-offs

No single architecture dominates across all axes\. Table [3](https://arxiv.org/html/2606.24331#S6.T3) summarises the trade\-offs for the families introduced in Section [3](https://arxiv.org/html/2606.24331#S3)\. A few observations from the table deserve pulling out\.

The first is that encoder\-only models have not gone away, and should not\. For bounded supervised tasks they are still the most sensible choice\. They train faster, they serve cheaper, and they are far easier to calibrate than a generative model of comparable competence\. Swapping a fine\-tuned RoBERTa classifier for a prompted frontier decoder, which we have seen teams do in production, often makes classification accuracy worse while also making the system slower and more expensive\. The decoder is strictly larger, but size is not the point\.

A second concerns decoder\-only models\. They pay for their generality, and the bill is not always obvious until it arrives\. Evaluation is harder because the output space is unbounded\. Alignment is harder because what counts as correct behaviour is under\-specified\. Prompt injection \(Greshake et al\., [2023](https://arxiv.org/html/2606.24331#bib.bib60)\) opens a class of attack that did not exist when the same team was fine\-tuning a BERT\-style classifier\. In\-context learning, the very feature that made these models interesting in the first place, is also where a lot of the brittleness sits, and users rarely see it unless they go looking\.

The last observation is about mixture\-of\-experts\. MoE is a cost\-model change more than an architectural revolution\. A 100\-billion\-parameter MoE with 10 billion active parameters per token costs less per inference than a dense 100\-billion model, but it is not equivalent to a dense 10\-billion either\. Serving becomes more complex, the router adds another component that can misbehave, and benchmark tables that report only the total parameter count give a misleading picture of effective capacity\.
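
A back\-of\-envelope calculation shows why the MoE sits between the two dense models. The sketch below uses the common approximation that a forward pass costs about 2 FLOPs per active parameter per token, and it assumes 16\-bit weights; both are simplifying assumptions of this illustration, and it ignores attention cost, router overhead, and the KV cache.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token as 2 x active parameters."""
    return 2.0 * active_params


def weight_memory_gb(total_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory needed to hold all weights, assuming 16-bit weights by default."""
    return total_params * bytes_per_param / 1e9


# name -> (total parameters, active parameters per token)
configs = {
    "dense-100B": (100e9, 100e9),
    "moe-100B-10B-active": (100e9, 10e9),
    "dense-10B": (10e9, 10e9),
}

summary = {
    name: (forward_flops_per_token(active), weight_memory_gb(total))
    for name, (total, active) in configs.items()
}

for name, (flops, mem) in summary.items():
    print(f"{name:>20}: {flops:.1e} FLOPs/token, {mem:.0f} GB of weights")
```

Under these assumptions the MoE matches the dense 10\-billion model on compute per token but matches the dense 100\-billion model on the memory needed to hold its weights, which is why a single parameter count describes neither its cost nor its capacity.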

Table 3: Architecture trade\-offs on four deployment\-relevant axes\. “Interpretability” here means the ease of attributing outputs to input features, not mechanistic interpretability of the model weights\.

### 6\.2 Compute and energy cost

Training compute has grown faster than benchmark accuracy has improved\. That is not a new observation but it is worth restating\. Strubell et al\. \([2019](https://arxiv.org/html/2606.24331#bib.bib61)\) estimated that training a large transformer at 2019 scale produced around 284 tonnes of carbon dioxide equivalent, roughly what five cars put out over their working lives\. Patterson et al\. \([2021](https://arxiv.org/html/2606.24331#bib.bib62)\) followed up with a more careful accounting\. Their numbers were lower, but the curve still pointed the same way\. Training runs for current flagship models are largely opaque\. The commonly cited estimate for GPT\-4\-class training puts the figure around $10^{25}$ floating\-point operations, though vendors do not confirm this\.

Inference cost turns out to matter more at the margin than training cost does\. A model that answers a billion queries a day eats through the cost of its training run within a few weeks\. The techniques that determine whether inference stays affordable are unglamorous but decisive: 8\-bit and 4\-bit quantisation \(Dettmers et al\., [2022](https://arxiv.org/html/2606.24331#bib.bib63)\), speculative decoding \(Leviathan et al\., [2023](https://arxiv.org/html/2606.24331#bib.bib64)\), key\-value cache compression, continuous batching, and FlashAttention \(Dao et al\., [2022](https://arxiv.org/html/2606.24331#bib.bib29)\)\. Treat these as the difference between a model being useful and a model being a worrying line item on a cloud bill\.
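
The break\-even claim can be checked with rough arithmetic. The sketch below compares compute only, not money, and uses the approximation of 2 FLOPs per active parameter per processed token. The training budget and query volume are the figures quoted in this section; the tokens per query and the active parameter count are assumptions chosen purely for illustration, and the result moves in direct proportion to them.

```python
def breakeven_days(
    train_flops: float,
    queries_per_day: float,
    tokens_per_query: float,
    active_params: float,
) -> float:
    """Days of serving after which cumulative inference FLOPs equal training FLOPs."""
    flops_per_day = queries_per_day * tokens_per_query * 2.0 * active_params
    return train_flops / flops_per_day


days = breakeven_days(
    train_flops=1e25,        # estimate quoted above for GPT-4-class training
    queries_per_day=1e9,     # query volume quoted above
    tokens_per_query=500,    # assumption: prompt plus response tokens
    active_params=2e11,      # assumption: 200 billion active parameters
)
print(f"Inference compute matches training compute after about {days:.0f} days")
```

With these inputs the crossover lands at roughly 50 days, or about seven weeks. Longer prompts or a larger active parameter count shorten that, which is the sense in which inference dominates at the margin.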

The hidden cost sits inside the training data\. Cleaning, deduplication, and filtering are labour\-intensive\. A lot of that labour is done by low\-paid contractors in jurisdictions where complaints are rare, and in some well\-documented cases those workers are exposed to genuinely disturbing material \(Perrigo, [2023](https://arxiv.org/html/2606.24331#bib.bib65)\)\. Most survey papers on this technology step past that fact\. We do not think it should be stepped past\. It is part of the real cost of what we are describing\.

### 6\.3 Alignment and safety

Alignment, as we use the word here, is the work of making a trained model behave in ways that the users and the deployers are willing to stand behind\. It is an active research area, not a solved problem\. Several recurring failure modes are now well documented\.

Hallucination, in the sense of confident generation of false facts, is a structural property of decoder\-only models trained on next\-token prediction\. Retrieval augmentation reduces it but does not remove it \(Ji et al\., [2023](https://arxiv.org/html/2606.24331#bib.bib66)\)\. Jailbreaks and prompt injection let adversarial users override safety training \(Wei et al\., [2023](https://arxiv.org/html/2606.24331#bib.bib67); Greshake et al\., [2023](https://arxiv.org/html/2606.24331#bib.bib60)\)\. Sycophancy, in which the model agrees with whatever the user asserts, appears to emerge from RLHF training objectives \(Sharma et al\., [2024](https://arxiv.org/html/2606.24331#bib.bib68)\)\. Reward hacking occurs under DPO as well as under classical RLHF\.

These failure modes are hard to detect with standard benchmarks\. They require targeted red\-teaming, and they interact with one another\. A model that is safer against one category of attack can be more sycophantic in general conversation\. There is no general\-purpose evaluation that captures all of this, and this is an honest limitation of the state of the art\.

### 6\.4 Data provenance and bias

Large pretraining corpora draw on web crawls, books, code, and social media\. The composition of the corpus is usually not published in detail, and the text is not filtered for consent\. Copyright litigation is active \(The New York Times, [2023](https://arxiv.org/html/2606.24331#bib.bib69)\), and the outcomes will shape what can be trained in the future\. For European deployments, the EU AI Act requires providers of general\-purpose AI models to publish a summary of their training data \(European Parliament and Council, [2024](https://arxiv.org/html/2606.24331#bib.bib70)\)\.

Bias in pretraining data turns into bias in outputs \(Bender et al\., [2021](https://arxiv.org/html/2606.24331#bib.bib71); Blodgett et al\., [2020](https://arxiv.org/html/2606.24331#bib.bib72)\)\. The effect is not evenly distributed either\. Performance degrades on non\-English languages, on English dialects that are thinly represented on the open web, and on topics where the available text is lopsided\. Balanced fine\-tuning data reduces these effects without eliminating them\.

### 6\.5 Benchmarks and the gap to field performance

Almost every headline number in a model announcement comes from the same short list of public benchmarks\. MMLU, HumanEval, GSM8K, HellaSwag, and a handful of multilingual tests between them account for most of the comparisons we see\. At the frontier these benchmarks are saturated\. There is also credible evidence that some of their questions have ended up inside training corpora \(Oren et al\., [2024](https://arxiv.org/html/2606.24331#bib.bib73)\)\. A score that was diagnostic in 2021 tells you less in 2024 than it did then\.
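
When the training corpus is available for inspection, a first\-pass contamination check can be as simple as n\-gram overlap. The sketch below is that simpler heuristic, not the statistical test of Oren et al\., which is designed for black\-box models whose training data cannot be inspected. The function names and the default window of eight whitespace tokens are assumptions made for illustration.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lower-cased whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


question = "what is the capital of france and its population"
scraped = "quiz dump: what is the capital of france and its population answer paris"
score = overlap_fraction(question, scraped, n=3)
```

A high overlap fraction is a reason to inspect the item, not proof of contamination, and paraphrased or translated copies of a question will pass this check untouched.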

The more serious issue is that the distance between benchmark performance and day\-to\-day behaviour in the field is now wider than the distance between successive model versions\. A model can score 90% on HumanEval and still write code that does not compile against the user’s actual repository\. A model can score well on a clinical multiple\-choice test and still miss a contraindication that a registrar would catch on a first read\. We are not arguing that benchmarks should be abandoned\. We are arguing that taking a single leaderboard position as a proxy for capability is a mistake, and that what determines field performance is domain\-specific evaluation and post\-deployment monitoring\.

## 7 Open Research Directions

What follows is a short set of questions that, from our own reading of the literature, are under\-served relative to how much they actually matter to anyone building on top of these models\.

#### Efficient long\-context models\.

The million\-token context windows that vendors now quote are, we would argue, a marketing figure as much as a research result\. Attention cost still dominates for realistic inputs, and recall inside a long context is uneven\. Models reliably remember the beginning and the end of a long document while losing material from the middle \(Liu et al\., [2024](https://arxiv.org/html/2606.24331#bib.bib74)\)\. What the field needs is not a larger window but architectures and training recipes that produce reliable use of long context\.
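
Uneven recall of this kind can be probed with a simple position sweep: place a known fact at different depths of a long filler document and check whether the model retrieves it. The sketch below is a minimal harness under stated assumptions. `generate` stands for any callable that maps a prompt to a model response, and `build_prompt` and `recall_by_depth` are hypothetical helper names; the toy model at the end only sees the first and last 300 characters, to show what a positional failure looks like in the output.

```python
from typing import Callable


def build_prompt(filler: list[str], needle: str, depth: float, question: str) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    position = round(depth * len(filler))
    sentences = filler[:position] + [needle] + filler[position:]
    return " ".join(sentences) + "\n\nQuestion: " + question


def recall_by_depth(
    generate: Callable[[str], str],
    filler: list[str],
    needle: str,
    question: str,
    answer: str,
    depths: list[float],
) -> dict[float, bool]:
    """For each depth, record whether the expected answer appears in the response."""
    return {
        d: answer in generate(build_prompt(filler, needle, d, question))
        for d in depths
    }


def toy_model(prompt: str) -> str:
    """Stand-in for a real model: it can only 'see' the edges of the context."""
    context = prompt.split("\n\nQuestion:")[0]
    return context[:300] + context[-300:]


filler = [f"Filler sentence number {i}." for i in range(100)]
results = recall_by_depth(
    toy_model, filler, "The access code is 7421.",
    "What is the access code?", "7421", depths=[0.0, 0.5, 1.0],
)
```

With a real model in place of `toy_model`, the same sweep repeated over several document lengths and needles gives a recall\-by\-position profile, which says more about usable context than the advertised window size does.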

#### Evaluation beyond leaderboards\.

We need benchmarks that resist training\-data contamination by construction, that measure failure modes such as sycophancy, overclaiming, and pathological refusal rather than only task accuracy, and that are specialised to the setting they are supposed to evaluate rather than aggregating over an unrelated grab\-bag of tasks\. HELM \(Liang et al\., [2023](https://arxiv.org/html/2606.24331#bib.bib75)\) and BIG\-bench \(Srivastava et al\., [2022](https://arxiv.org/html/2606.24331#bib.bib76)\) are starting points, not end points\.

#### Small open models\.

Small open\-weight models, in the 1–10 billion parameter range, are the substrate of most commercial fine\-tuning\. The research community pays them less attention than frontier models, but they are where most deployed systems actually live\. Work on efficient pretraining, strong instruction tuning, and safety fine\-tuning for this size class has outsized practical impact\.

#### Grounded generation\.

Retrieval augmentation is the current default for reducing hallucination, but the interface between retriever and generator is ad hoc\. End\-to\-end training of retriever and generator, joint calibration of citation, and principled handling of conflicting sources are open problems\.

#### Auditable alignment\.

Alignment techniques are applied but rarely audited in a way that makes the resulting behaviour inspectable\. Mechanistic interpretability \(Olah et al\., [2020](https://arxiv.org/html/2606.24331#bib.bib77); Elhage et al\., [2022](https://arxiv.org/html/2606.24331#bib.bib78)\) and causal interventions on internal representations offer a route, but the methods do not yet scale to models of current size\.

#### Cost and environmental reporting\.

Few published papers report the compute, energy, and water cost of their experiments\. Standardised reporting would let readers compare like with like and would shift incentives towards efficiency rather than pure scale\.
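
As an illustration of what a minimal report could contain, the sketch below computes energy and emissions from five quantities that most teams can measure or look up. The formula, energy equals accelerator count times average power times hours times data\-centre PUE, is a common simplification in the spirit of the accounting in Section 6\.2; the class name and every number in the example are hypothetical, and water use is left out because it needs site\-specific data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunReport:
    """Inputs for a simple energy and emissions estimate of one training run."""

    accelerators: int             # number of GPUs or TPUs used
    avg_power_kw: float           # measured average draw per accelerator, in kW
    hours: float                  # wall-clock duration of the run
    pue: float                    # data-centre power usage effectiveness
    grid_kg_co2e_per_kwh: float   # carbon intensity of the local grid

    @property
    def energy_kwh(self) -> float:
        return self.accelerators * self.avg_power_kw * self.hours * self.pue

    @property
    def emissions_kg_co2e(self) -> float:
        return self.energy_kwh * self.grid_kg_co2e_per_kwh


# Hypothetical run: every value here is made up for the example.
report = RunReport(
    accelerators=64, avg_power_kw=0.4, hours=100.0,
    pue=1.1, grid_kg_co2e_per_kwh=0.4,
)
print(f"{report.energy_kwh:.0f} kWh, {report.emissions_kg_co2e:.0f} kg CO2e")
```

For this made\-up run the estimate is about 2,816 kWh and roughly 1\.1 tonnes of CO2e. Reporting the five inputs alongside the totals matters more than the totals themselves, since it lets a reader redo the calculation for a different grid or facility.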

## 8 Conclusion

Transformer\-based language models have moved from a research curiosity to a deployed technology in under a decade\. The core architecture has changed less than the scale, the training data, and the alignment process that surrounds it\. This review organised the main architecture families into a taxonomy, covered the post\-2023 developments that matter in practice, surveyed applications across seven domain verticals, and offered a critical assessment of the trade\-offs that are underplayed in vendor\-driven narratives\.

Two points deserve a closing emphasis\. The right architecture for a given deployment is, very often, not the largest one available\. Encoder\-only and encoder\-decoder models remain the right tool for many production workloads, and reaching for a frontier decoder by default is an expensive habit and, in a surprising number of cases, an unnecessary one\. Equally, capability claims that rest on a benchmark number should be read with scepticism\. Field performance is shaped by data provenance, by alignment quality, and by whatever domain\-specific evaluation a deployer is willing to build, none of which shows up on a leaderboard\.

The next few years of work on this technology will be shaped at least as much by operational and regulatory constraints as by further architectural invention\. In our reading, researchers working on transformer\-based systems will do more practical good by investing in evaluation, auditability, efficiency, and grounding than by racing for the next order of magnitude of parameter count\.

## References

- E\. Alsentzer, J\. Murphy, W\. Boag, W\. Weng, D\. Jindi, T\. Naumann, and M\. McDermott \(2019\)Publicly available clinical BERT embeddings\.InProceedings of the Clinical Natural Language Processing Workshop,Cited by:[§5\.1](https://arxiv.org/html/2606.24331#S5.SS1.p1.1)\.
- Anthropic \(2024\)The Claude 3 model family: Opus, Sonnet, Haiku\.Technical reportAnthropic\.Cited by:[§4\.5](https://arxiv.org/html/2606.24331#S4.SS5.p2.1)\.
- D\. Araci \(2019\)FinBERT: financial sentiment analysis with pre\-trained language models\.InarXiv preprint arXiv:1908\.10063,Cited by:[§5\.2](https://arxiv.org/html/2606.24331#S5.SS2.p1.1)\.
- Y\. Bai, S\. Kadavath, S\. Kundu, A\. Askell, J\. Kernion, A\. Jones, A\. Chen, A\. Goldie, A\. Mirhoseini, C\. McKinnon,et al\.\(2022\)Constitutional AI: harmlessness from AI feedback\.arXiv preprint arXiv:2212\.08073\.Cited by:[§4\.2](https://arxiv.org/html/2606.24331#S4.SS2.p3.1)\.
- I\. Beltagy, M\. E\. Peters, and A\. Cohan \(2020\)Longformer: the long\-document transformer\.arXiv preprint arXiv:2004\.05150\.Cited by:[§3\.4](https://arxiv.org/html/2606.24331#S3.SS4.p3.1)\.
- E\. M\. Bender, T\. Gebru, A\. McMillan\-Major, and S\. Shmitchell \(2021\)On the dangers of stochastic parrots: can language models be too big?\.InProceedings of the ACM Conference on Fairness, Accountability, and Transparency,pp\. 610–623\.Cited by:[§6\.4](https://arxiv.org/html/2606.24331#S6.SS4.p2.1)\.
- S\. L\. Blodgett, S\. Barocas, H\. Daumé III, and H\. Wallach \(2020\)Language \(technology\) is power: a critical survey of “bias” in NLP\.InProceedings of ACL,Cited by:[§6\.4](https://arxiv.org/html/2606.24331#S6.SS4.p2.1)\.
- T\. B\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in Neural Information Processing Systems33,pp\. 1877–1901\.Cited by:[§2](https://arxiv.org/html/2606.24331#S2.p3.1),[§3\.2](https://arxiv.org/html/2606.24331#S3.SS2.p1.1)\.
- I\. Chalkidis, M\. Fergadiotis, P\. Malakasiotis, N\. Aletras, and I\. Androutsopoulos \(2020\)LEGAL\-BERT: the muppets straight out of law school\.InFindings of EMNLP,Cited by:[§5\.3](https://arxiv.org/html/2606.24331#S5.SS3.p1.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. d\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman,et al\.\(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[§5\.7](https://arxiv.org/html/2606.24331#S5.SS7.p1.1)\.
- K\. Choromanski, V\. Likhosherstov, D\. Dohan, X\. Song, A\. Gane, T\. Sarlos, P\. Hawkins, J\. Davis, A\. Mohiuddin, L\. Kaiser,et al\.\(2020\)Rethinking attention with performers\.arXiv preprint arXiv:2009\.14794\.Cited by:[§3\.4](https://arxiv.org/html/2606.24331#S3.SS4.p3.1)\.
- K\. Clark, M\. Luong, Q\. V\. Le, and C\. D\. Manning \(2020\)ELECTRA: pre\-training text encoders as discriminators rather than generators\.InInternational Conference on Learning Representations,Cited by:[§3\.5](https://arxiv.org/html/2606.24331#S3.SS5.p2.1)\.
- M\. Conover, M\. Hayes, A\. Mathur, J\. Xie, J\. Wan, S\. Shah, A\. Ghodsi, P\. Wendell, M\. Zaharia, and R\. Xin \(2023\)Free Dolly: introducing the world’s first truly open instruction\-tuned LLM\.Note:Databricks BlogCited by:[§4\.1](https://arxiv.org/html/2606.24331#S4.SS1.p2.1)\.
- M\. Dahl, V\. Magesh, M\. Suzgun, and D\. E\. Ho \(2024\)Large legal fictions: profiling legal hallucinations in large language models\.Journal of Legal Analysis16\(1\),pp\. 64–93\.Cited by:[§5\.3](https://arxiv.org/html/2606.24331#S5.SS3.p1.1)\.
- Z\. Dai, Z\. Yang, Y\. Yang, J\. Carbonell, Q\. V\. Le, and R\. Salakhutdinov \(2019\)Transformer\-XL: attentive language models beyond a fixed\-length context\.InProceedings of ACL,Cited by:[§3\.4](https://arxiv.org/html/2606.24331#S3.SS4.p2.1)\.
- T\. Dao, D\. Y\. Fu, S\. Ermon, A\. Rudra, and C\. Ré \(2022\)FlashAttention: fast and memory\-efficient exact attention with IO\-awareness\.Advances in Neural Information Processing Systems\.Cited by:[§3\.4](https://arxiv.org/html/2606.24331#S3.SS4.p3.1),[§6\.2](https://arxiv.org/html/2606.24331#S6.SS2.p2.1)\.
- DeepSeek\-AI \(2024\)DeepSeek\-V3 technical report\.arXiv preprint arXiv:2412\.19437\.Cited by:[§3\.2](https://arxiv.org/html/2606.24331#S3.SS2.p3.1),[§3\.6](https://arxiv.org/html/2606.24331#S3.SS6.p2.1),[§4\.5](https://arxiv.org/html/2606.24331#S4.SS5.p3.1)\.
- T\. Dettmers, M\. Lewis, Y\. Belkada, and L\. Zettlemoyer \(2022\)LLM\.int8\(\): 8\-bit matrix multiplication for transformers at scale\.InAdvances in Neural Information Processing Systems,Cited by:[§6\.2](https://arxiv.org/html/2606.24331#S6.SS2.p2.1)\.
- T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\)QLoRA: efficient finetuning of quantized LLMs\.Advances in Neural Information Processing Systems\.Cited by:[§4\.4](https://arxiv.org/html/2606.24331#S4.SS4.p1.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)BERT: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of NAACL\-HLT,pp\. 4171–4186\.Cited by:[§2](https://arxiv.org/html/2606.24331#S2.p3.1),[§3\.1](https://arxiv.org/html/2606.24331#S3.SS1.p1.1)\.
- J\. Ding, S\. Ma, L\. Dong, X\. Zhang, S\. Huang, W\. Wang, N\. Zheng, and F\. Wei \(2023\)LongNet: scaling transformers to 1,000,000,000 tokens\.arXiv preprint arXiv:2307\.02486\.Cited by:[§3\.4](https://arxiv.org/html/2606.24331#S3.SS4.p3.1)\.
- N\. Elhage, T\. Hume, C\. Olsson, N\. Schiefer, T\. Henighan, S\. Kravec, Z\. Hatfield\-Dodds, R\. Lasenby, D\. Drain, C\. Chen,et al\.\(2022\)Toy models of superposition\.Transformer Circuits Thread\.Cited by:[§7](https://arxiv.org/html/2606.24331#S7.SS0.SSS0.Px5.p1.1)\.
- K\. Ethayarajh, W\. Xu, N\. Muennighoff, D\. Jurafsky, and D\. Kiela \(2024\)KTO: model alignment as prospect theoretic optimization\.arXiv preprint arXiv:2402\.01306\.Cited by:[§4\.2](https://arxiv.org/html/2606.24331#S4.SS2.p2.1)\.
- European Parliament and Council \(2024\)Regulation \(EU\) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence \(AI Act\)\.Note:Official Journal of the European UnionCited by:[§6\.4](https://arxiv.org/html/2606.24331#S6.SS4.p1.1)\.
- W\. Fedus, B\. Zoph, and N\. Shazeer \(2022\)Switch transformers: scaling to trillion parameter models with simple and efficient sparsity\.Journal of Machine Learning Research23\(120\),pp\. 1–39\.Cited by:[§1](https://arxiv.org/html/2606.24331#S1.p4.1),[§3\.6](https://arxiv.org/html/2606.24331#S3.SS6.p1.1)\.
- L\. Gao, J\. Schulman, and J\. Hilton \(2023\)Scaling laws for reward model overoptimization\.InInternational Conference on Machine Learning,Cited by:[§4\.2](https://arxiv.org/html/2606.24331#S4.SS2.p1.1)\.
- Gemini Team, Google \(2024\)Gemini 1\.5: unlocking multimodal understanding across millions of tokens of context\.Technical reportGoogle DeepMind\.Cited by:[§4\.5](https://arxiv.org/html/2606.24331#S4.SS5.p3.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The Llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§3\.2](https://arxiv.org/html/2606.24331#S3.SS2.p3.1),[§4\.5](https://arxiv.org/html/2606.24331#S4.SS5.p3.1)\.
- K\. Greshake, S\. Abdelnabi, S\. Mishra, C\. Endres, T\. Holz, and M\. Fritz \(2023\)Not what you’ve signed up for: compromising real\-world LLM\-integrated applications with indirect prompt injection\.InProceedings of the ACM Workshop on Artificial Intelligence and Security,Cited by:[§6\.1](https://arxiv.org/html/2606.24331#S6.SS1.p3.1),[§6\.3](https://arxiv.org/html/2606.24331#S6.SS3.p2.1)\.
- P\. He, X\. Liu, J\. Gao, and W\. Chen \(2021\)DeBERTa: decoding\-enhanced BERT with disentangled attention\.InInternational Conference on Learning Representations,Cited by:[§3\.1](https://arxiv.org/html/2606.24331#S3.SS1.p3.1)\.
- S\. Hochreiter and J\. Schmidhuber \(1997\)Long short\-term memory\.Neural Computation9\(8\),pp\. 1735–1780\.Cited by:[§2](https://arxiv.org/html/2606.24331#S2.p1.1)\.
- J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. de Las Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark,et al\.\(2022\)Training compute\-optimal large language models\.Advances in Neural Information Processing Systems\.Cited by:[§2](https://arxiv.org/html/2606.24331#S2.p4.1)\.
- B\. Hou, G\. Kaissis, R\. M\. Summers, and B\. Kainz \(2021\)RATCHET: medical transformer for chest X\-ray diagnosis and reporting\.arXiv preprint arXiv:2107\.02104\.Cited by:[§5\.1](https://arxiv.org/html/2606.24331#S5.SS1.p2.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,Cited by:[§4\.4](https://arxiv.org/html/2606.24331#S4.SS4.p1.1)\.
- Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. J\. Bang, A\. Madotto, and P\. Fung \(2023\)Survey of hallucination in natural language generation\.ACM Computing Surveys55\(12\),pp\. 1–38\.Cited by:[§6\.3](https://arxiv.org/html/2606.24331#S6.SS3.p2.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier,et al\.\(2023\)Mistral 7B\.arXiv preprint arXiv:2310\.06825\.Cited by:[§3\.2](https://arxiv.org/html/2606.24331#S3.SS2.p3.1),[§4\.5](https://arxiv.org/html/2606.24331#S4.SS5.p3.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Roux, A\. Mensch, B\. Savary, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, E\. B\. Hanna, F\. Bressand,et al\.\(2024\)Mixtral of experts\.arXiv preprint arXiv:2401\.04088\.Cited by:[§1](https://arxiv.org/html/2606.24331#S1.p4.1),[§3\.2](https://arxiv.org/html/2606.24331#S3.SS2.p3.1),[§3\.6](https://arxiv.org/html/2606.24331#S3.SS6.p2.1),[§4\.5](https://arxiv.org/html/2606.24331#S4.SS5.p3.1)\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\)Scaling laws for neural language models\.arXiv preprint arXiv:2001\.08361\.Cited by:[§2](https://arxiv.org/html/2606.24331#S2.p4.1)\.
- D\. Kulshreshtha, M\. Shayan, R\. Belfer, S\. Reddy, I\. V\. Serban, and E\. Kochmar \(2022\)Few\-shot question generation for personalized feedback in intelligent tutoring systems\.arXiv preprint arXiv:2206\.04187\.Cited by:[§5\.4](https://arxiv.org/html/2606.24331#S5.SS4.p1.1)\.
- J\. Lee, W\. Yoon, S\. Kim, D\. Kim, S\. Kim, C\. H\. So, and J\. Kang \(2020\)BioBERT: a pre\-trained biomedical language representation model for biomedical text mining\.Bioinformatics36\(4\),pp\. 1234–1240\.Cited by:[§5\.1](https://arxiv.org/html/2606.24331#S5.SS1.p1.1)\.
- Y\. Leviathan, M\. Kalman, and Y\. Matias \(2023\)Fast inference from transformers via speculative decoding\.InInternational Conference on Machine Learning,Cited by:[§6\.2](https://arxiv.org/html/2606.24331#S6.SS2.p2.1)\.
- M\. Lewis, Y\. Liu, N\. Goyal, M\. Ghazvininejad, A\. Mohamed, O\. Levy, V\. Stoyanov, and L\. Zettlemoyer \(2020a\)BART: denoising sequence\-to\-sequence pre\-training for natural language generation, translation, and comprehension\.InProceedings of ACL,pp\. 7871–7880\.Cited by:[§3\.3](https://arxiv.org/html/2606.24331#S3.SS3.p1.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel,et al\.\(2020b\)Retrieval\-augmented generation for knowledge\-intensive NLP tasks\.Advances in Neural Information Processing Systems33,pp\. 9459–9474\.Cited by:[§1](https://arxiv.org/html/2606.24331#S1.p4.1),[§4\.3](https://arxiv.org/html/2606.24331#S4.SS3.p1.1)\.
- P\. Liang, R\. Bommasani, T\. Lee, D\. Tsipras, D\. Soylu, M\. Yasunaga, Y\. Zhang, D\. Narayanan, Y\. Wu, A\. Kumar,et al\.\(2023\)Holistic evaluation of language models\.Transactions on Machine Learning Research\.Cited by:[§7](https://arxiv.org/html/2606.24331#S7.SS0.SSS0.Px2.p1.1)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024\)Lost in the middle: how language models use long contexts\.Transactions of the Association for Computational Linguistics12,pp\. 157–173\.Cited by:[§7](https://arxiv.org/html/2606.24331#S7.SS0.SSS0.Px1.p1.1)\.
- Y\. Liu, M\. Ott, N\. Goyal, J\. Du, M\. Joshi, D\. Chen, O\. Levy, M\. Lewis, L\. Zettlemoyer, and V\. Stoyanov \(2019\)RoBERTa: a robustly optimized BERT pretraining approach\.arXiv preprint arXiv:1907\.11692\.Cited by:[§3\.1](https://arxiv.org/html/2606.24331#S3.SS1.p2.1)\.
- G\. Marco, J\. Gonzalo, and L\. Rello \(2022\)A systematic evaluation of the creative writing skills of transformer deep neural networks\.SSRN Electronic Journal\.Cited by:[§5\.6](https://arxiv.org/html/2606.24331#S5.SS6.p1.1)\.
- S\. Minaee, T\. Mikolov, N\. Nikzad, M\. Chenaghlu, R\. Socher, X\. Amatriain, and J\. Gao \(2024\)Large language models: a survey\.arXiv preprint arXiv:2402\.06196\.Cited by:[§1](https://arxiv.org/html/2606.24331#S1.p4.1)\.
- C\. Olah, N\. Cammarata, L\. Schubert, G\. Goh, M\. Petrov, and S\. Carter \(2020\)Zoom in: an introduction to circuits\.Note:DistillCited by:[§7](https://arxiv.org/html/2606.24331#S7.SS0.SSS0.Px5.p1.1)\.
- OpenAI \(2023\)GPT\-4 technical report\.Technical reportOpenAI\.Note:arXiv:2303\.08774Cited by:[§4\.5](https://arxiv.org/html/2606.24331#S4.SS5.p2.1)\.
- Y\. Oren, N\. Meister, N\. Chatterji, F\. Ladhak, and T\. B\. Hashimoto \(2024\)Proving test set contamination in black box language models\.International Conference on Learning Representations\.Cited by:[§6\.5](https://arxiv.org/html/2606.24331#S6.SS5.p1.1)\.
- C\. M\. Ormerod, A\. Malhotra, and A\. Jafari \(2021\)Automated essay scoring using efficient transformer\-based language models\.arXiv preprint arXiv:2102\.13136\.Cited by:[§5\.4](https://arxiv.org/html/2606.24331#S5.SS4.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in Neural Information Processing Systems35,pp\. 27730–27744\.Cited by:[§1](https://arxiv.org/html/2606.24331#S1.p4.1),[§4\.1](https://arxiv.org/html/2606.24331#S4.SS1.p1.1)\.
- D\. Patterson, J\. Gonzalez, Q\. Le, C\. Liang, L\. Munguia, D\. Rothchild, D\. So, M\. Texier, and J\. Dean \(2021\)Carbon emissions and large neural network training\.arXiv preprint arXiv:2104\.10350\.Cited by:[§6\.2](https://arxiv.org/html/2606.24331#S6.SS2.p1.1)\.
- B\. Perrigo \(2023\)Exclusive: OpenAI used Kenyan workers on less than $2 per hour to make ChatGPT less toxic\.Note:Time MagazineCited by:[§6\.2](https://arxiv.org/html/2606.24331#S6.SS2.p3.1)\.
- V\. Qazvinian, D\. R\. Radev, S\. M\. Mohammad, B\. Dorr, D\. Zajic, M\. Whidby, and T\. Moon \(2013\)Generating extractive summaries of scientific paradigms\.Journal of Artificial Intelligence Research46,pp\. 165–201\.Cited by:[§5\.7](https://arxiv.org/html/2606.24331#S5.SS7.p1.1)\.
- A\. Radford, K\. Narasimhan, T\. Salimans, and I\. Sutskever \(2018\)Improving language understanding by generative pre\-training\.Technical reportOpenAI\.Cited by:[§2](https://arxiv.org/html/2606.24331#S2.p3.1)\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, and I\. Sutskever \(2019\)Language models are unsupervised multitask learners\.Technical reportOpenAI\.Cited by:[§2](https://arxiv.org/html/2606.24331#S2.p3.1),[§3\.2](https://arxiv.org/html/2606.24331#S3.SS2.p1.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, S\. Ermon, C\. D\. Manning, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.24331#S1.p4.1),[§4\.2](https://arxiv.org/html/2606.24331#S4.SS2.p2.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of Machine Learning Research21\(140\),pp\. 1–67\.Cited by:[§2](https://arxiv.org/html/2606.24331#S2.p3.1),[§3\.3](https://arxiv.org/html/2606.24331#S3.SS3.p1.1)\.
- L\. Rasmy, Y\. Xiang, Z\. Xie, C\. Tao, and D\. Zhi \(2021\)Med\-BERT: pretrained contextualized embeddings on large\-scale structured electronic health records for disease prediction\.npj Digital Medicine4\(1\),pp\. 86\.Cited by:[§5\.1](https://arxiv.org/html/2606.24331#S5.SS1.p1.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\)Toolformer: language models can teach themselves to use tools\.InAdvances in Neural Information Processing Systems,Cited by:[§5\.5](https://arxiv.org/html/2606.24331#S5.SS5.p1.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§4\.2](https://arxiv.org/html/2606.24331#S4.SS2.p1.1)\.
- Z\. Shaheen, G\. Wohlgenannt, and E\. Filtz \(2020\)Large scale legal text classification using transformer models\.arXiv preprint arXiv:2010\.12871\.Cited by:[§5\.3](https://arxiv.org/html/2606.24331#S5.SS3.p1.1)\.
- M\. Sharma, M\. Tong, T\. Korbak, D\. Duvenaud, A\. Askell, S\. R\. Bowman, N\. Cheng, E\. Durmus, Z\. Hatfield\-Dodds, S\. R\. Johnston,et al\.\(2024\)Towards understanding sycophancy in language models\.International Conference on Learning Representations\.Cited by:[§6\.3](https://arxiv.org/html/2606.24331#S6.SS3.p2.1)\.
- N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations. Cited by: [§3.6](https://arxiv.org/html/2606.24331#S3.SS6.p1.1).
- A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al. (2022) Beyond the imitation game: quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615. Cited by: [§7](https://arxiv.org/html/2606.24331#S7.SS0.SSS0.Px2.p1.1).
- E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243. Cited by: [§6.2](https://arxiv.org/html/2606.24331#S6.SS2.p1.1).
- R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto (2023) Stanford Alpaca: an instruction-following LLaMA model. Note: [https://github.com/tatsu-lab/stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca). Cited by: [§4.1](https://arxiv.org/html/2606.24331#S4.SS1.p2.1).
- The New York Times (2023) The New York Times v. Microsoft Corporation and OpenAI. Note: Case 1:23-cv-11195, S.D.N.Y. Cited by: [§6.4](https://arxiv.org/html/2606.24331#S6.SS4.p1.1).
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. Cited by: [§3.2](https://arxiv.org/html/2606.24331#S3.SS2.p3.1), [§4.5](https://arxiv.org/html/2606.24331#S4.SS5.p3.1).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2606.24331#S1.p1.1).
- A. Wei, N. Haghtalab, and J. Steinhardt (2023) Jailbroken: how does LLM safety training fail? Advances in Neural Information Processing Systems. Cited by: [§6.3](https://arxiv.org/html/2606.24331#S6.SS3.p2.1).
- L. Yang, J. Li, R. Dong, Y. Zhang, and B. Smyth (2022) NumHTML: numeric-oriented hierarchical transformer model for multi-task financial forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: [§5.2](https://arxiv.org/html/2606.24331#S5.SS2.p1.1).
- Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems. Cited by: [§3.5](https://arxiv.org/html/2606.24331#S3.SS5.p1.1).
- M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed (2020) Big Bird: transformers for longer sequences. In Advances in Neural Information Processing Systems. Cited by: [§3.4](https://arxiv.org/html/2606.24331#S3.SS4.p3.1).
- C. Zakka, R. Shad, A. Chaurasia, A. R. Dalal, J. L. Kim, M. Moor, R. Fong, C. Phillips, K. Alexander, E. Ashley, et al. (2024) Almanac: retrieval-augmented language models for clinical medicine. NEJM AI. Cited by: [§5.1](https://arxiv.org/html/2606.24331#S5.SS1.p2.1), [§5.7](https://arxiv.org/html/2606.24331#S5.SS7.p1.1).
- W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. (2023) A survey of large language models. arXiv preprint arXiv:2303.18223. Cited by: [§1](https://arxiv.org/html/2606.24331#S1.p4.1).
