Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines

arXiv cs.LG Papers

Summary

This paper presents an end-to-end energy accounting framework for LLM distillation pipelines, measuring stage-wise energy costs and constructing energy-quality Pareto frontiers to reveal previously ignored teacher-side costs.

arXiv:2605.13981v1 Announce Type: new Abstract: The rise in deployment of large language models has driven a surge in GPU demand and datacenter scaling, raising concerns about electricity use, grid stress, and the impacts of modern AI workloads. Distillation is often promoted as one of the most effective paths to obtain cheaper, more efficient models, yet these claims rarely account for the full end-to-end energy and resource costs, including crucial teacher-side workloads such as data generation, logit caching, and evaluation. We present a comprehensive energy accounting framework that measures the complete computational cost of distillation pipelines via detailed stage-wise tracking of GPU device power consumption. In our experiments, we separate and log empirical energy use across distinct phases and systematically measure the energy and emissions of two common distillation methods: the classic logit-based knowledge distillation and synthetic-data supervised fine-tuning, constructing energy-quality Pareto frontiers that expose the previously ignored costs. From these measurements and analyses, we derive practical design rules for selecting distillation methods and hyperparameters under energy and budget constraints, and release an open-source measurement harness and accounting protocol to provide a standardized foundation for comparable, reproducible distillation research, explicitly accountable for complete pipeline energy impact.

Cached at: 05/15/26, 06:26 AM

# Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines
Source: [https://arxiv.org/html/2605.13981](https://arxiv.org/html/2605.13981)
###### Abstract

The rise in deployment of large language models has driven a surge in GPU demand and datacenter scaling, raising concerns about electricity use, grid stress, and the impacts of modern AI workloads\. Distillation is often promoted as one of the most effective paths to obtain cheaper, more efficient models, yet these claims rarely account for the full end\-to\-end energy and resource costs, including crucial teacher\-side workloads such as data generation, logit caching, and evaluation\. We present a comprehensive energy accounting framework that measures the complete computational cost of distillation pipelines via detailed stage\-wise tracking of GPU device power consumption\. In our experiments, we separate and log empirical energy use across distinct phases and systematically measure the energy and emissions of two common distillation methods: the classic logit\-based knowledge distillation and synthetic\-data supervised fine\-tuning, constructing energy–quality Pareto frontiers that expose the previously ignored costs\. From these measurements and analyses, we derive practical design rules for selecting distillation methods and hyperparameters under energy and budget constraints, and release an open\-source measurement harness and accounting protocol to provide a standardized foundation for comparable, reproducible distillation research, explicitly accountable for complete pipeline energy impact\.

Machine Learning, Energy, AI, Deep Learning, ICML, Efficiency

## 1 Introduction

The rapid deployment of large language models \(LLMs\) has accelerated GPU demand and datacenter expansion, intensifying concerns about electricity use, grid stress, and environmental impact\. From a “Green AI” perspective, energy should be evaluated alongside accuracy and performance\. Distillation is often framed as a primary lever for reducing compute by producing smaller students that preserve much of a teacher’s quality while enabling cheaper inference\. However, it can require substantial teacher\-side work—synthetic data generation, logit caching, filtering, and evaluation—that can rival or exceed student training, especially under hyperparameter sweeps \(e\.g\., KD temperature, decoding settings, filtering thresholds\)\. Prior work typically reports student\-side savings \(FLOPs, runtime, or training energy\) while omitting these upstream costs, making sustainability and cost claims hard to substantiate\.

We address this gap with an end\-to\-end, distillation\-specific energy accounting framework that \(i\) delineates pipeline stages, \(ii\) measures energy, quality, and throughput per stage under consistent protocols, and \(iii\) incorporates teacher\-side costs in comparisons across pipelines\. Our analysis is organized around three questions:

1. When do distillation pipelines (KD or synthetic supervised fine-tuning (SFT)) deliver better energy–quality tradeoffs than a strong SFT baseline under fixed hardware and training budgets?
2. How do teacher-side costs (generation, logit caching, evaluation) compare to student training, and when do they dominate the overall energy budget?
3. Under what conditions (student scale, sequence length, teacher reuse, and target quality) does distillation actually reduce end-to-end energy use and emissions relative to alternatives?

To answer these questions, we make three contributions:

- Distillation energy protocol and harness. We formalize distillation stages, logging rules, and metrics to produce comparable energy charts for KD and synthetic SFT, using NVML-based GPU energy as ground truth and empirical CPU/CO2e estimates.
- Controlled benchmark with stage-wise accounting. Under fixed hardware, software, and training budgets, we benchmark 1B/7B/13B OLMo-2 students and report stage-wise energy, runtime, and J/token, constructing Pareto frontiers that separate regimes in the energy–quality space.
- Design rules and break-even conditions. We quantify how teacher reuse, sequence length, and key hyperparameters (KD temperature/loss weight; decoding and synthetic-data reuse) shift these frontiers, and derive conditions where distillation is truly cheaper in energy/emissions, and where it is not.

Taken together, our results provide a distillation\-specific lens on economical and sustainable AI: rather than treating smaller students as automatically efficient, we show how to account for the full pipeline, measure its energy costs, and decide when distillation is an energy\- and resource\-efficient choice\.

## 2 Background and Related Work

The energy and carbon costs of modern ML systems have motivated the “Green AI” perspective, which argues that efficiency should be evaluated alongside predictive performance (Schwartz et al., [2020](https://arxiv.org/html/2605.13981#bib.bib8); Strubell et al., [2019](https://arxiv.org/html/2605.13981#bib.bib19); Patterson et al., [2022](https://arxiv.org/html/2605.13981#bib.bib2)). Prior work also emphasizes the need for transparent reporting of hardware, runtime, and location assumptions, and for careful treatment of measurement uncertainty (Henderson et al., [2020](https://arxiv.org/html/2605.13981#bib.bib23)). Our protocol follows these recommendations, but focuses on *directly measured* energy for distillation pipelines (Section [4](https://arxiv.org/html/2605.13981#S4)) rather than proxy metrics such as GPU-hours or FLOPs.

##### Energy accounting for distillation\.

Recent life-cycle reviews of AI environmental reporting identify post-training adaptation as a major blind spot. Despite the growing use of fine-tuning, distillation, quantization, and related methods, these methods are rarely measured separately, and the energy invested in creating distilled or compressed models is rarely reported or weighed against downstream inference savings (Lambert and Luccioni, [2026](https://arxiv.org/html/2605.13981#bib.bib20)).

A small body of work examines the environmental cost and impacts of distillation. Rafat et al. show that KD for CNNs can be carbon-intensive and highlight the importance of energy-aware tuning (Rafat et al., [2023](https://arxiv.org/html/2605.13981#bib.bib24)). Yuan et al. compare distilled and non-distilled NLP models, focusing on inference-time energy and runtime while largely treating teacher training and the distillation pipeline as sunk costs (Yuan et al., [2024](https://arxiv.org/html/2605.13981#bib.bib22)). These results suggest distillation is not inherently “green,” but existing studies typically do not provide a stage-wise, end-to-end accounting that separates teacher-side workloads (e.g., generation/logit caching, filtering) from student training and evaluation under a consistent budget, nor do they offer guidance on how to reduce these costs.

##### Measurement tools and methodology\.

Energy reporting commonly relies on estimation toolchains such as CodeCarbon and Experiment Impact Tracker (Courty et al., [2024](https://arxiv.org/html/2605.13981#bib.bib16); Henderson et al., [2020](https://arxiv.org/html/2605.13981#bib.bib23)), alongside work comparing telemetry-based measurements with model-based estimates and documenting estimator error modes (Bouza et al., [2023](https://arxiv.org/html/2605.13981#bib.bib10); Bannour et al., [2021](https://arxiv.org/html/2605.13981#bib.bib13)). These tools motivate a fidelity–convenience trade-off: estimator-only approaches are easy to deploy but can misestimate device energy, while telemetry-based logging requires tighter integration. Our framework combines NVML-based GPU telemetry as the ground-truth signal with lightweight estimators for CPU energy and CO2e, packaged into a reusable harness with explicit stage boundaries and logging rules.

## 3 Distillation Pipeline Setup

We study three standard regimes: baseline supervised fine\-tuning \(SFT\), logit\-based knowledge distillation \(KD\), and synthetic supervised fine\-tuning \(synthetic SFT\) on a single, fully open LLM family\. All runs share the same hardware, software stack, tokenizer, and data preprocessing\.

### 3.1 Hardware and Environment

All experiments were run on a single NVIDIA H100 SXM 80 GB GPU and 16 Intel Xeon Gold 6442Y CPU cores, on exclusive nodes to avoid noise from jobs running on neighboring GPUs. The software environment (Linux distribution, kernel, NVIDIA driver, and CUDA/package versions) was fixed across all runs, and the logs of each experiment are versioned by a Git commit record. Fixing the environment controls for drift, so that observed energy differences arise from pipeline structure, model size, and hyperparameter differences. A full configuration table appears in the Appendix; measurement tools and logging details are described in Section [4](https://arxiv.org/html/2605.13981#S4).

### 3.2 Models and Tasks

Table [1](https://arxiv.org/html/2605.13981#S3.T1) summarizes the models used in teacher/student distillation. We chose the OLMo-2 family for its fully open weights and training methodology, a shared tokenizer and pretraining lineage across sizes, and existing environmental and energy accounting, allowing us to represent the full lifecycle of energy use of the model. All models share the same tokenizer, and we tokenize prompts and outputs identically for teacher and students, enabling comparable token-count and Joules-per-token statistics.

Table 1: Teacher and student models used in our experiments.

| Role | Model (Hugging Face ID) | Params |
| --- | --- | --- |
| Teacher | [allenai/OLMo-2-0325-32B-SFT](https://huggingface.co/allenai/OLMo-2-0325-32B-SFT) | 32B |
| Student | [allenai/OLMo-2-0425-1B](https://huggingface.co/allenai/OLMo-2-0425-1B) | 1B |
| Student | [allenai/OLMo-2-1124-7B](https://huggingface.co/allenai/OLMo-2-1124-7B) | 7B |
| Student | [allenai/OLMo-2-1124-13B](https://huggingface.co/allenai/OLMo-2-1124-13B) | 13B |

To avoid domain-specific conclusions, we instantiate pipelines on three supervised workloads: instruction following (TULU-3 SFT mixture) (Lambert et al., [2024](https://arxiv.org/html/2605.13981#bib.bib12)), math reasoning (OpenR1-Math-220k) (Zhao et al., [2025](https://arxiv.org/html/2605.13981#bib.bib14)), and code generation (Open-R1 Codeforces) (Penedo et al., [2025](https://arxiv.org/html/2605.13981#bib.bib11)). All datasets are formatted through a single OLMo-2–compatible chat-style prompting pipeline. TULU serves as our primary setting for most experiments, while the math and code workloads act as robustness checks. Across otherwise identical runs, we observed only minor differences in end-to-end energy between datasets relative to the much larger effects of pipeline choice and student scale. In Section [6](https://arxiv.org/html/2605.13981#S6), we therefore report the main energy metrics (kWh and J/token) as averages over datasets for each pipeline–size configuration.

### 3.3 Distillation Pipelines

We consider two distillation regimes that mirror the most common methods used in practice for deployment-oriented LLMs: logit-based KD and synthetic SFT. Across all regimes, optimizer, schedule, precision, effective batch size, and early stopping are held fixed (see Section [5](https://arxiv.org/html/2605.13981#S5)). Together with the direct SFT baseline, these give three pipelines (direct SFT, logit-based KD, and synthetic sequence distillation) that provide a minimal but representative set of distillation patterns under a shared, controlled environment, supporting the stage-wise energy frontiers and break-even analyses in the following sections.

#### 3.3.1 Logit-Based Knowledge Distillation (KD)

This represents the traditional KD regime as described by Hinton et al. (Hinton et al., [2015](https://arxiv.org/html/2605.13981#bib.bib1)), where the student is trained to match the teacher’s token-level distribution via offline distillation in two stages:

##### Teacher logit caching\.

For a fixed training corpus, the teacher (a 32B model) is run in inference mode, generating samples from the inputs in the given dataset. For each token position we cache the top-$k$ logits ($k=100$) and indices.
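The per-position caching step can be illustrated with a minimal sketch. The function name `top_k_logits` and the toy six-entry vocabulary are hypothetical and are not taken from the released harness; only the idea of keeping the $k$ largest logits together with their indices comes from the text.

```python
import heapq


def top_k_logits(logits, k=100):
    """Return (indices, values) of the k largest logits at one token position."""
    k = min(k, len(logits))
    # enumerate() pairs each logit with its vocabulary index before selection.
    top = heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])
    indices = [i for i, _ in top]
    values = [v for _, v in top]
    return indices, values


# Toy example: a 6-entry vocabulary with k=3 instead of the paper's k=100.
idx, vals = top_k_logits([0.1, 2.5, -1.0, 3.2, 0.7, 2.5], k=3)
print(idx, vals)
```

Storing only $k$ values and $k$ indices per position keeps the cache far smaller than a full-vocabulary dump, at the price of discarding the tail of the teacher distribution.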

##### Student KD training\.

Given the cached teacher distributions $p_t$ and hard labels $y_{\mathrm{hard}}$, the student distribution $p_s$ is trained with:

$$\mathcal{L}_{\text{KD}}(\theta_s)=\alpha\,\mathrm{CE}\big(y_{\mathrm{hard}},p_s\big)+(1-\alpha)\,T^{2}\,\mathrm{KL}\big(p_t^{(T)}\,\|\,p_s^{(T)}\big),\tag{1}$$

where $T$ is the distillation temperature, $p^{(T)}$ denotes distributions softened by $T$, and $\alpha\in[0,1]$ trades off hard-label supervision and soft-label matching. Our core grid uses a default choice of $\alpha=0.5$ and $T=1$; both are varied in sensitivity experiments (see Section [6](https://arxiv.org/html/2605.13981#S6)).
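To make the objective in Eq. (1) concrete, the following is a minimal per-token sketch in plain Python. It assumes full-vocabulary teacher logits for simplicity; the paper caches only the top-100 logits and this excerpt does not say how the truncated tail is handled, so that part is left out. The helper names `softmax` and `kd_loss` are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by the temperature (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, hard_label, alpha=0.5, temperature=1.0):
    """Per-token Eq. (1): alpha * CE(y_hard, p_s) + (1 - alpha) * T^2 * KL(p_t^(T) || p_s^(T))."""
    # Hard-label cross-entropy uses the unsoftened student distribution p_s.
    p_s = softmax(student_logits)
    ce = -math.log(p_s[hard_label])
    # Soft-label term compares temperature-softened teacher and student distributions.
    p_t_soft = softmax(teacher_logits, temperature)
    p_s_soft = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t_soft, p_s_soft) if pt > 0.0)
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


example_logits = [2.0, 0.5, -1.0]
# When student and teacher agree exactly, the KL term vanishes and only alpha * CE remains.
print(kd_loss(example_logits, example_logits, hard_label=0))
```

With $\alpha=1$ the expression reduces to the hard-label cross-entropy alone, which is why $\alpha$ acts as the mixing knob between the two supervision signals.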

#### 3.3.2 Synthetic Supervised Fine-Tuning (SFT)

Often called the “cheaper” distillation method, synthetic SFT implements data or sequence distillation, i.e., the teacher generates outputs that are then treated as hard labels for supervised fine-tuning of the student.

##### Teacher data generation\.

For each dataset, we keep the original prompts (instructions, math problems, programming tasks) and replace original dataset labels with teacher-generated responses, using a standardized decoding configuration (nucleus sampling with fixed top-$p$, temperature, and maximum length; see Section [5](https://arxiv.org/html/2605.13981#S5)). This generation pass is run once per dataset, and the resulting synthetic corpora are reused across student sizes and hyperparameter settings.

##### Student synthetic SFT\.

Students are fine\-tuned on the synthetic datasets using the standard autoregressive cross\-entropy objective:

$$\mathcal{L}_{\text{SFT}}(\theta_s;x,y)=-\sum_{t=1}^{s}\log p_{\theta_s}\big(y_t\mid x,y_{<t}\big),\tag{2}$$

where $x$ is the input prompt, $y=(y_1,\dots,y_s)$ is the target sequence, $\theta_s$ are the student parameters, and $p_{\theta_s}$ is the student conditional token distribution. In the synthetic SFT regime, the targets $y_t$ are teacher-generated continuations of the original prompts.

#### 3.3.3 Baseline Supervised Fine-Tuning

To provide a non-distilled reference, we also train the same students in a baseline SFT regime, where the model is trained directly on the original dataset labels using the same objective as in Eq. [2](https://arxiv.org/html/2605.13981#S3.E2), but with original dataset target labels $y$. We use a single global training budget and early stopping based on validation loss, and select the best checkpoint for evaluation. This baseline anchors both quality and energy, and serves as a control for measuring the incremental cost and potential gains of introducing a teacher.

## 4 Energy Accounting

We adopt a distillation-specific energy accounting protocol aligned with best-practice recommendations for systematic reporting of energy in machine learning (Henderson et al., [2020](https://arxiv.org/html/2605.13981#bib.bib23)), tailored to the design of our pipelines. The goal is to measure *end-to-end* energy in a way that is reproducible, reveals how the energy load is distributed across the stages of distillation pipelines, and is comparable across model sizes and experiments.

### 4.1 Stage-wise protocol

Each run is decomposed into disjoint stages with explicit start/end timestamps\. We define total distillation energy as the sum of teacher\-side work, student training, and evaluation, and map these terms to logged stages:

- $E_{\text{prerun}}$: standalone, one-time environment stabilization and smoke tests (short batch to stabilize energy and validate logging);
- $E_{\text{teacher}}$: teacher forward passes composed of synthetic-data generation ($E_{\text{gen}}$) or logit caching ($E_{\text{logit}}$), corresponding respectively to synthetic SFT and KD;
- $E_{\text{student}}$: student training part of the pipeline, shared across SFT, KD, and synthetic SFT;
- $E_{\text{eval}}$: core and auxiliary evaluation suites used to construct quality metrics.

For each stage we log wall\-clock time, token counts, and energy, and aggregate these into stage\-wise and pipeline totals used in our results and Pareto frontiers\.
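The stage boundaries described above could be logged along the following lines. This is a sketch under stated assumptions, not the released harness: the class `StageEnergyLog`, the helper `nvml_reader`, and the fake counter values are all hypothetical, and reading NVML's cumulative energy counter (rather than integrating sampled power) is an assumption about how the telemetry is consumed.

```python
import time
from contextlib import contextmanager

MJ_PER_KWH = 3.6e9  # 1 kWh = 3.6e6 J = 3.6e9 mJ


class StageEnergyLog:
    """Records timestamps, token counts, and energy for disjoint pipeline stages."""

    def __init__(self, read_energy_mj):
        # read_energy_mj: zero-argument callable returning a cumulative counter in mJ.
        self._read = read_energy_mj
        self.records = []

    @contextmanager
    def stage(self, name, tokens=0):
        start_mj, start_t = self._read(), time.time()
        try:
            yield
        finally:
            end_mj, end_t = self._read(), time.time()
            self.records.append({
                "stage": name,
                "start": start_t,
                "end": end_t,
                "tokens": tokens,
                "kwh": (end_mj - start_mj) / MJ_PER_KWH,
            })

    def total_kwh(self, exclude=("prerun",)):
        # Prerun is logged but kept out of pipeline totals.
        return sum(r["kwh"] for r in self.records if r["stage"] not in exclude)


def nvml_reader(gpu_index=0):
    """Reader backed by NVML's cumulative energy counter (needs pynvml and a GPU)."""
    import pynvml  # imported lazily so the sketch also runs on machines without a GPU
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    return lambda: pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)


# Demo with a fake counter (values in mJ) standing in for real telemetry.
fake_counter = iter([0, 360_000_000, 360_000_000, 36_360_000_000])
log = StageEnergyLog(lambda: next(fake_counter))
with log.stage("prerun"):
    pass
with log.stage("teacher", tokens=1_000_000):
    pass
print(log.total_kwh())
```

Passing the reader as a callable keeps the accounting logic testable without hardware; on a real node one would construct the log with `nvml_reader()` instead of the fake counter.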

### 4.2 Measurement and reporting

We treat GPU energy from NVML telemetry as the primary signal and use CodeCarbon\-style estimators for CPU energy and derived CO2e\. We report energy in kWh and normalize by tokens processed to obtain Joules\-per\-token \(J/token\) and tokens\-per\-second as comparable efficiency axes\. Because carbon depends on deployment\-specific assumptions \(e\.g\., PUE and regional grid intensity\), we treat CO2e as derived and assumption\-sensitive; our main analysis focuses on directly measured energy and J/token\. Full measurement details and assumptions are provided in Appendix A \(Supplementary\)\.
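The unit conversions behind these metrics are simple enough to sketch. The token count below is a hypothetical value chosen so that 7.00 kWh maps to 0.84 J/token; the source does not report token counts in this excerpt. Likewise, treating $\gamma$ as the product of PUE and grid intensity is an assumption, and the PUE and intensity numbers are placeholders.

```python
JOULES_PER_KWH = 3.6e6


def joules_per_token(energy_kwh, tokens):
    """Normalize measured energy by the number of tokens processed."""
    return energy_kwh * JOULES_PER_KWH / tokens


def co2e_kg(energy_kwh, gamma_kg_per_kwh):
    """Derived emissions: linear in energy for a fixed conversion factor gamma."""
    return energy_kwh * gamma_kg_per_kwh


# Hypothetical token count (not reported in the text).
assumed_tokens = 3.0e7
print(joules_per_token(7.00, assumed_tokens))

# Hypothetical deployment assumptions: PUE of 1.2, grid intensity of 0.4 kg CO2e/kWh.
gamma = 1.2 * 0.4
print(co2e_kg(7.00, gamma))
```

Because the emissions figure inherits every assumption baked into $\gamma$, it is reasonable to report it separately from the directly measured kWh and J/token.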

## 5 Experimental Design

Our experimental design answers the posed research questions under fixed hardware, software, and data conditions across the distillation regimes and student scales, so that energy measurements are reproducible and comparable. To facilitate reproducibility, we release the measurement harness, experiment configurations, and quality-score scripts at [https://github.com/StellarLuminosity/Energy](https://github.com/StellarLuminosity/Energy).

### 5.1 Core Grid

The central experiment crosses three pipelines (baseline supervised fine-tuning, logit-based knowledge distillation (KD), and synthetic supervised fine-tuning (SFT)) with three student scales: 1B, 7B, and 13B OLMo-2 models distilled from a 32B instruction-tuned teacher. Each experimental configuration corresponds to a full pipeline instantiated on one of the supervised workloads (instruction following, math, or code, as described in Section [3](https://arxiv.org/html/2605.13981#S3)) and run end-to-end with stage-wise energy logging as defined in Section [4](https://arxiv.org/html/2605.13981#S4).

Across the core experimental conditions, we hold fixed:

- Hardware and environment settings, as described in Section [3.1](https://arxiv.org/html/2605.13981#S3.SS1).
- Tokenizer and preprocessing. All models share the same OLMo-2 tokenizer and a single chat-style formatting pipeline, ensuring that the ‘tokens processed’ measurement is comparable across student sizes and pipelines.
- Training configuration and stopping rule. We use a single baseline training configuration with fixed hyperparameters for all pipelines and student sizes, so that runs are reproducible and directly comparable; the exact hyperparameters and values are listed in the Appendix. Training proceeds until an early-stopping criterion is met (no improvement beyond a small tolerance of $\epsilon=2\times 10^{-3}$ for three consecutive evaluations), and we retain the checkpoint with the best validation loss.
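One plausible reading of the stopping rule is sketched below. The text does not specify whether improvement is measured against the best loss so far or against the previous evaluation; this sketch assumes the former, and the function name `early_stop` and the loss values are illustrative only.

```python
def early_stop(val_losses, tol=2e-3, patience=3):
    """Return (stop_step, best_step) for a sequence of validation losses.

    Training halts once `patience` consecutive evaluations fail to improve on the
    best loss seen so far by more than `tol`; the checkpoint at `best_step`
    (lowest validation loss observed) is the one retained.
    """
    best, best_step, stale = float("inf"), -1, 0
    for step, loss in enumerate(val_losses):
        if loss < best - tol:
            stale = 0          # meaningful improvement resets the counter
        else:
            stale += 1         # improvement within tolerance counts as none
        if loss < best:
            best, best_step = loss, step
        if stale >= patience:
            return step, best_step
    return len(val_losses) - 1, best_step


losses = [1.0, 0.9, 0.85, 0.849, 0.8495, 0.8485, 0.7]
print(early_stop(losses))
```

In this toy trace the last three evaluated losses all sit within the tolerance of the best value, so training halts before the final entry is ever reached. Since fewer evaluations before stopping means fewer optimization steps, this rule directly links convergence speed to student-side training energy.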

### 5.2 Evaluation Protocol

All trained students and the teacher are evaluated on a fixed benchmark suite that covers instruction following, dialogue, mathematical reasoning, and general knowledge: AlpacaEval 2, IFEval, MT-Bench-101, GSM8K, and MMLU. This particular suite was chosen to test general retention and broad knowledge (MMLU, GSM8K), measure domain gains and losses in math and instruction-following / dialogue (GSM8K, AlpacaEval 2, IFEval, MT-Bench-101), and align with the Tülu/OLMo training objectives and benchmark choices from the OLMo-2 paper (OLMo et al., [2024](https://arxiv.org/html/2605.13981#bib.bib21)).

##### Quality score\.

To compare students across benchmarks of differing scales, we aggregate the five evaluation scores via equally\-weighted teacher\-relative retention:

$$Q_i=\frac{1}{B}\sum_{b=1}^{B}\frac{s_{i,b}}{s_{\text{teacher},b}},\tag{3}$$

where $B=5$, $s_{i,b}$ is the score of student model $i$ on benchmark $b$, and $s_{\text{teacher},b}$ is the corresponding score of the OLMo-2 32B teacher. Teacher scores for AlpacaEval 2 LC, IFEval, GSM8K, and MMLU are taken from Ai2’s published OLMo-2 evaluation table (OLMo et al., [2024](https://arxiv.org/html/2605.13981#bib.bib21)), and the MT-Bench-101 teacher score is measured under our evaluation protocol. $Q_i$ measures the average fraction of teacher quality retained by the student across the evaluation suite. Raw per-benchmark scores are reported in Appendix [E](https://arxiv.org/html/2605.13981#A5).
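Eq. (3) amounts to an unweighted mean of per-benchmark ratios, as the short sketch below shows. All scores in the example are made-up placeholders, not values from the paper; only the benchmark names and the formula come from the text.

```python
def quality_score(student_scores, teacher_scores):
    """Equally weighted teacher-relative retention, Eq. (3)."""
    ratios = [student_scores[b] / teacher_scores[b] for b in teacher_scores]
    return sum(ratios) / len(ratios)


# Placeholder scores for illustration only.
teacher = {"AlpacaEval 2 LC": 40.0, "IFEval": 80.0, "MT-Bench-101": 8.0,
           "GSM8K": 80.0, "MMLU": 75.0}
student = {"AlpacaEval 2 LC": 20.0, "IFEval": 60.0, "MT-Bench-101": 6.0,
           "GSM8K": 40.0, "MMLU": 60.0}
print(quality_score(student, teacher))
```

Dividing by the teacher score puts benchmarks with different ranges on a common scale. Note that nothing caps an individual ratio at 1, so a student that beats the teacher on one benchmark can offset a shortfall on another.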

### 5.3 Ablations and Hyperparameter Sweeps

Beyond the core experiments, we run a small set of targeted ablations to probe how distillation\-specific choices move models along the energy–quality frontier, keeping all other factors fixed\.

##### KD hyperparameters\.

For logit-based KD, we vary the distillation temperature $T\in\{1.0,2.0,4.0\}$ and the loss mixing weight $\alpha\in\{0.3,0.5,0.8\}$ in the objective of Eq. ([1](https://arxiv.org/html/2605.13981#S3.E1)), primarily for the 1B and 7B students.

##### Synthetic SFT generation budget\.

For synthetic SFT, we vary the teacher generation budget and decoding configuration by changing (1) the maximum number of new tokens per response, `max_new_tokens` $\in\{256,512,1024\}$, and (2) the number of prompts used to construct the synthetic dataset (e.g., subsampling from $\approx 7000$ to $\approx 3500$ prompts).

## 6 Results

We report end\-to\-end energy, quality, and Pareto frontier results for the three distillation regimes\.

### 6.1 Primary Energy Frontiers

Table [2](https://arxiv.org/html/2605.13981#S6.T2) summarizes the central experimental grid: for each pipeline and student size, we report total end-to-end energy in kWh, Joules per token (J/tok) normalized by tokens processed, and a scalar quality score $Q$ obtained by averaging normalized benchmark scores over the evaluation suite. As expected, larger students consume more energy and achieve higher quality with diminishing returns: moving from 1B to 13B raises total energy by roughly $5\times$ under baseline SFT and roughly $2.5\times$ under KD and synthetic SFT, while $Q$ improves by 0.30 for baseline SFT but only by 0.12–0.14 for the two distillation pipelines.

Table 2: Full-pipeline energy breakdown. Energy values are reported as means over 2–3 repeated runs; totals exclude prerun validation.

| Pipeline | Size | E (kWh) | J/tok | $Q$ |
| --- | --- | --- | --- | --- |
| Baseline SFT | 1B | 7.00 | 0.84 | 0.69 |
| Baseline SFT | 7B | 19.50 | 2.34 | 0.90 |
| Baseline SFT | 13B | 34.60 | 4.15 | 0.99 |
| KD distillation | 1B | 16.90 | 2.03 | 0.70 |
| KD distillation | 7B | 28.40 | 3.41 | 0.78 |
| KD distillation | 13B | 42.50 | 5.10 | 0.82 |
| Synthetic SFT | 1B | 16.65 | 2.00 | 0.71 |
| Synthetic SFT | 7B | 28.25 | 3.39 | 0.79 |
| Synthetic SFT | 13B | 40.70 | 4.88 | 0.85 |

Figure [1](https://arxiv.org/html/2605.13981#S6.F1) below further summarizes the end-to-end energy–quality tradeoff across the three training regimes: the x-axis reports full-pipeline energy per run (kWh), and the y-axis reports the normalized aggregate quality score $Q$ over AlpacaEval 2, IFEval, MT-Bench-101, GSM8K, and MMLU.

![Refer to caption](https://arxiv.org/html/2605.13981v1/x1.png)

Figure 1: Energy–quality frontier across each pipeline’s full end-to-end energy cost

As can be observed from Figure [1](https://arxiv.org/html/2605.13981#S6.F1), the teacher side is a major cost driver in our setting: once logit caching or synthetic label generation is included, KD and synthetic SFT consume substantially more end-to-end energy without commensurate gains in aggregate $Q$. This may appear counter to the common view of distillation as a cheaper method, but that claim typically refers to student-only training cost or downstream inference efficiency; by contrast, our accounting treats teacher computation as a first-class, end-to-end cost. When teacher artifacts are generated per experiment and not reused, the teacher acts as a near-fixed overhead that shifts KD/synthetic SFT curves to the right, an effect especially pronounced for smaller models or shorter training times where the student-side compute cannot amortize the teacher work. In contrast, scaling the student under baseline SFT yields strong returns, whereas KD and synthetic SFT exhibit generally weaker scaling.

Under one-off full-pipeline accounting, baseline SFT defines the energy-dominant frontier in our measured setting. At the 1B scale, KD and synthetic SFT obtain slightly higher aggregate $Q$ than baseline SFT, but require roughly $2.4\times$ more end-to-end energy. At 7B and 13B, baseline SFT strictly dominates both distillation pipelines in the aggregate energy–quality plane.
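The dominance relations described here can be checked mechanically from the reported (energy, $Q$) pairs. The sketch below applies a standard strict Pareto-dominance test (lower energy, higher quality) to the Table 2 values; the function name `pareto_frontier` is illustrative and this is not necessarily how the authors constructed their figure.

```python
# (pipeline, size, energy_kwh, Q) as reported in Table 2.
TABLE2 = [
    ("Baseline SFT", "1B", 7.00, 0.69),
    ("Baseline SFT", "7B", 19.50, 0.90),
    ("Baseline SFT", "13B", 34.60, 0.99),
    ("KD distillation", "1B", 16.90, 0.70),
    ("KD distillation", "7B", 28.40, 0.78),
    ("KD distillation", "13B", 42.50, 0.82),
    ("Synthetic SFT", "1B", 16.65, 0.71),
    ("Synthetic SFT", "7B", 28.25, 0.79),
    ("Synthetic SFT", "13B", 40.70, 0.85),
]


def pareto_frontier(points):
    """Keep points not dominated by any other (no more energy, no less Q, one strict)."""
    frontier = []
    for p in points:
        dominated = any(
            q[2] <= p[2] and q[3] >= p[3] and (q[2] < p[2] or q[3] > p[3])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p[2])


for name, size, kwh, q in pareto_frontier(TABLE2):
    print(f"{name} {size}: {kwh:.2f} kWh, Q={q:.2f}")
```

Under this strict test all three baseline SFT points survive, and so does the 1B synthetic SFT point, because its $Q$ of 0.71 is marginally above the 1B baseline's 0.69 at about 2.4 times the energy. This matches the 1B caveat in the text; every 7B and 13B distillation point is dominated by the 7B baseline.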

Larger students, such as 13B, maximize quality at higher energy cost, while smaller students such as 1B minimize energy. Distillation can still be favorable in settings where teacher artifacts (cached logits or synthetic datasets) are reused across multiple students or hyperparameter sweeps, effectively amortizing the teacher-side energy and moving KD/synthetic SFT toward the Pareto frontier, particularly for instruction-following objectives. These conclusions depend on the definition of $Q$, the small model sizes we study, and the inclusion of full end-to-end accounting costs; student-only energy or per-benchmark analyses may therefore identify regimes where distillation yields superior performance on specific capabilities even if the aggregate frontier is dominated.

Because CO2e scales linearly with energy under a fixed grid/PUE factor $\gamma$ (Appendix [A](https://arxiv.org/html/2605.13981#A1), Eq. [6](https://arxiv.org/html/2605.13981#A1.E6)), the frontier in Figure [1](https://arxiv.org/html/2605.13981#S6.F1) can be read equivalently as an emissions–quality frontier. In our observations, the teacher overhead that shifts KD/synthetic SFT “to the right” in kWh produces the same shift in CO2e.

### 6.2 Stage-wise Distillation Energy Breakdown

Table 3: Stage-wise energy breakdown (kWh)

| Stage | 1B | 7B | 13B |
| --- | --- | --- | --- |
| Prerun | 0.12 | 0.12 | 0.12 |
| Baseline SFT (no teacher) | | | |
| Data preprocess | 0.37 | 0.37 | 0.37 |
| Student training (SFT) | 6.30 | 18.45 | 33.15 |
| Evaluation | 0.33 | 0.68 | 1.08 |
| KD distillation (offline) | | | |
| Data preprocess | 0.37 | 0.37 | 0.37 |
| Logit caching | 11.00 | 11.00 | 11.00 |
| Student training (KD) | 5.20 | 16.35 | 30.05 |
| Evaluation | 0.33 | 0.68 | 1.08 |
| Synthetic SFT | | | |
| Data preprocess | 0.37 | 0.37 | 0.37 |
| Synthetic data generation | 10.60 | 10.60 | 10.60 |
| Student training (SFT on synthetic) | 5.35 | 16.60 | 28.65 |
| Evaluation | 0.33 | 0.68 | 1.08 |

To understand where the energy is actually spent, we decompose each run into pre-run, preprocessing, teacher-side computation, student training, and evaluation, as defined in Section [4](https://arxiv.org/html/2605.13981#S4). Table [3](https://arxiv.org/html/2605.13981#S6.T3) shows the numerical breakdown of energy (kWh) by stage, while Figure 2 visualizes the cost attribution across stages. We observe that for baseline SFT, the vast majority of energy is consumed by student training, with small contributions from preprocessing and evaluation. In contrast, KD and synthetic SFT split their budgets between teacher-side work (logit caching or synthetic data generation) and student training, with the teacher share largest for the 1B student; dataset preprocessing and evaluation remain relatively small in all regimes.

![Refer to caption](https://arxiv.org/html/2605.13981v1/x2.png)

Figure 2: Stage-wise energy breakdown (kWh) across student sizes

Stage-wise accounting makes explicit which component must be optimized to change end-to-end efficiency: for larger students, student training dominates (baseline and teacher-mediated pipelines), while for smaller students, teacher artifact creation can be the primary driver of total energy.

We note that student\-side training energy is consistently lower under KD and synthetic SFT than under baseline SFT at the same scale\. This is a convergence effect, with distilled pipelines reaching the early\-stopping criterion in fewer optimization steps than baseline SFT due to the additional supervision signal\.
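As a consistency check, the stage-wise values in Table 3 can be summed and compared with the pipeline totals in Table 2. The sketch below does this and also computes the teacher share of each total; the dictionary layout and function names are illustrative, while the numbers are copied from Table 3 (prerun excluded, as in Table 2).

```python
SIZES = ["1B", "7B", "13B"]

# kWh per stage for each pipeline, in the order of SIZES (values from Table 3).
STAGES_KWH = {
    "Baseline SFT": {
        "preprocess": [0.37, 0.37, 0.37],
        "teacher": [0.0, 0.0, 0.0],
        "student": [6.30, 18.45, 33.15],
        "eval": [0.33, 0.68, 1.08],
    },
    "KD distillation": {
        "preprocess": [0.37, 0.37, 0.37],
        "teacher": [11.00, 11.00, 11.00],
        "student": [5.20, 16.35, 30.05],
        "eval": [0.33, 0.68, 1.08],
    },
    "Synthetic SFT": {
        "preprocess": [0.37, 0.37, 0.37],
        "teacher": [10.60, 10.60, 10.60],
        "student": [5.35, 16.60, 28.65],
        "eval": [0.33, 0.68, 1.08],
    },
}


def total_kwh(pipeline, size):
    """Sum all non-prerun stages for one pipeline and student size."""
    i = SIZES.index(size)
    return sum(stage[i] for stage in STAGES_KWH[pipeline].values())


def teacher_share(pipeline, size):
    """Fraction of the pipeline total spent on teacher-side work."""
    i = SIZES.index(size)
    return STAGES_KWH[pipeline]["teacher"][i] / total_kwh(pipeline, size)


for pipeline in STAGES_KWH:
    for size in SIZES:
        print(f"{pipeline} {size}: total={total_kwh(pipeline, size):.2f} kWh, "
              f"teacher share={teacher_share(pipeline, size):.0%}")
```

The sums reproduce the Table 2 totals, and the teacher share falls from roughly two thirds at 1B to roughly a quarter at 13B for both teacher-mediated pipelines, which is the shift in dominant stage described above.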

### 6.3 Teacher Reuse and Amortization

Next, we examine how the practice of reusing teacher artifacts can change end-to-end energy conclusions. Figure [3](https://arxiv.org/html/2605.13981#S6.F3) plots the average energy per trained student as a function of the number of student runs $N$ that share a single set of cached logits (KD) or a single synthetic dataset (synthetic SFT) generated by the teacher.

![Refer to caption](https://arxiv.org/html/2605.13981v1/x3.png)

Figure 3: Amortizing teacher cost through reuse for 7B models

Baseline SFT has no teacher-side overhead, so its per-model energy is essentially constant in $N$, dominated by student training. In contrast, KD and synthetic SFT include a near-fixed teacher artifact cost (logit caching or generation) that contributes as $1/N$ when averaged across runs: as $N$ grows, the amortized curves rapidly drop toward their student-only training costs.

The break\-even reuse threshold admits a closed form:

$$N^{*}=\frac{E_{\text{teacher}}}{E_{\text{student}}^{\text{baseline}}-E_{\text{student}}^{\text{distill}}},\tag{4}$$

where $E_{\text{teacher}}$ is the one-time teacher artifact cost (logit caching or synthetic generation) and the denominator is the per-run training-energy gap between the baseline and the distilled student at fixed scale. Applying Eq. [4](https://arxiv.org/html/2605.13981#S6.E4) to our measured stage-wise energies yields the following thresholds:

| Pipeline | 1B | 7B | 13B |
| --- | --- | --- | --- |
| KD | $\approx 10$ | $\approx 5$–$6$ | $\approx 4$ |
| Synthetic SFT | $\approx 11$ | $\approx 6$ | $\approx 2$–$3$ |

The pattern is systematic: smaller students require more reuse to amortize the fixed teacher cost because their per-run energy gap relative to baseline SFT is narrower, while larger students cross the break-even threshold sooner. The design rule *reuse-before-regenerate* is therefore most important at small scale, where one-off distillation is least energy-competitive. The absolute kWh values are hardware-specific, so Eq. [4](https://arxiv.org/html/2605.13981#S6.E4) should be recomputed for new hardware, model families, or parallelism strategies.
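The thresholds can be reproduced from the Table 3 stage energies with a few lines of arithmetic. The sketch below implements Eq. (4) and the $1/N$ amortization; the function names are illustrative, and preprocessing and evaluation are omitted because they are identical across pipelines at a given size and cancel in the comparison.

```python
import math


def break_even_runs(e_teacher, e_student_baseline, e_student_distill):
    """Eq. (4): reuse count at which amortized distillation matches baseline SFT."""
    gap = e_student_baseline - e_student_distill
    if gap <= 0:
        return math.inf  # no per-run saving, so the teacher cost is never recovered
    return e_teacher / gap


def amortized_kwh(e_teacher, e_student_distill, n_runs):
    """Average teacher-plus-training energy per student when n_runs share one artifact."""
    return e_teacher / n_runs + e_student_distill


# Stage energies in kWh from Table 3: (teacher, baseline training, distilled training).
CASES = {
    ("KD", "1B"): (11.00, 6.30, 5.20),
    ("KD", "7B"): (11.00, 18.45, 16.35),
    ("KD", "13B"): (11.00, 33.15, 30.05),
    ("Synthetic SFT", "1B"): (10.60, 6.30, 5.35),
    ("Synthetic SFT", "7B"): (10.60, 18.45, 16.60),
    ("Synthetic SFT", "13B"): (10.60, 33.15, 28.65),
}

for (pipeline, size), (e_t, e_base, e_dist) in CASES.items():
    print(f"{pipeline} {size}: N* = {break_even_runs(e_t, e_base, e_dist):.2f}")
```

For example, KD at 1B gives $11.00 / (6.30 - 5.20) = 10$, and at exactly ten reuses the amortized cost $11.00/10 + 5.20$ equals the 6.30 kWh baseline training energy. Below $N^{*}$ each distilled student still costs more than a baseline run.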

### 6.4 Sensitivity Analysis

Figures [4](https://arxiv.org/html/2605.13981#S6.F4), [5](https://arxiv.org/html/2605.13981#S6.F5), and [6](https://arxiv.org/html/2605.13981#S6.F6) summarize representative sweeps. For KD, increasing temperature $T$ provides only modest quality gains while slightly increasing energy, and some $(T,\alpha)$ settings are Pareto-dominated, wasting energy for negligible benefit. In practice, $T$ is a second-order knob: once $T$ is reasonable, $\alpha$ more directly controls the quality–energy tradeoff.

Figure [5](https://arxiv.org/html/2605.13981#S6.F5) isolates $\alpha$ under offline KD with cached teacher outputs. Quality increases with $\alpha$ for all sizes, while energy shifts are small because teacher-side costs are fixed and $\alpha$ mostly affects convergence. The effect is size-dependent: 1B can favor lower $\alpha$ for slightly lower energy with minimal quality loss, whereas 7B/13B benefit more from moderate-to-higher $\alpha$ with only a mild energy increase.

![Refer to caption](https://arxiv.org/html/2605.13981v1/x4.png)

Figure 4: Hyperparameter sensitivity for the KD 7B student. Left: quality as a function of distillation temperature. Right: energy per run.

![Refer to caption](https://arxiv.org/html/2605.13981v1/x5.png)

Figure 5: Modeled impact of the KD mixing weight $\alpha$ on quality ($Q$) and end-to-end energy (kWh) for 1B/7B/13B students.

For synthetic SFT, the generation budget is the dominant energy driver. Increasing max\_new\_tokens consistently improves quality, but the energy cost grows nonlinearly and exhibits diminishing returns beyond moderate lengths (Fig. [6](https://arxiv.org/html/2605.13981#S6.F6)). Reducing the number of synthetic prompts provides a direct, predictable energy reduction at comparable quality, suggesting that budget-aware synthetic distillation should prioritize moderate sequence lengths and dataset sizes before scaling generation aggressively.

![Refer to caption](https://arxiv.org/html/2605.13981v1/x6.png)

Figure 6: Energy–quality tradeoff for 7B synthetic distillation.

### 6.5 Inference-Side Energy

Our main accounting focuses on training-time energy. For completeness, we also estimate inference energy from evaluation-stage NVML telemetry, using forward passes: 0.27 J/token for 1B, 0.68 J/token for 7B, and 1.44 J/token for 13B. The standard inference-savings argument requires a smaller distilled model to substitute for a larger baseline at comparable quality. We do not observe such cross-size equivalence: KD 1B reaches $Q=0.70$ versus baseline SFT 7B at $Q=0.90$, and KD 7B reaches $Q=0.78$ versus baseline SFT 13B at $Q=0.99$. Same-size comparisons use the same serving architecture, so the inference energy per token is the same, and the relevant cost difference comes from the training-time pipeline.

For settings where a smaller distilled student is an acceptable quality\-equivalent substitute, the inference break\-even point is

$$T^{*} = \frac{E_{\text{extra-train,kWh}} \cdot 3{,}600{,}000}{j_{\text{ref}} - j_{\text{student}}}, \tag{5}$$

where $E_{\text{extra-train,kWh}}$ is the additional training energy of the distilled pipeline and $j_{\text{ref}} - j_{\text{student}}$ is the per-token inference-energy saving.
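A small sketch of Eq. 5 in Python follows. The per-token figures are the 7B and 1B J/token estimates quoted above; the 1 kWh of extra training energy is a hypothetical placeholder, and the pairing itself is illustrative only, since no quality equivalence between those two sizes was observed in this study.

```python
J_PER_KWH = 3_600_000  # 1 kWh = 3.6e6 J


def inference_break_even_tokens(extra_train_kwh, j_ref, j_student):
    """Eq. 5: tokens to serve before inference savings repay extra training energy."""
    saving = j_ref - j_student
    if saving <= 0:
        raise ValueError("student must use less energy per token than the reference")
    return extra_train_kwh * J_PER_KWH / saving


# j_ref and j_student are the J/token estimates from the text (7B and 1B);
# extra_train_kwh is a hypothetical value used only for illustration.
tokens = inference_break_even_tokens(extra_train_kwh=1.0, j_ref=0.68, j_student=0.27)
print(f"{tokens:.3e}")  # 8.780e+06
```

Under these assumptions, each additional kWh of training energy would take roughly 8.8 million served tokens to recover.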

## 7 Discussion and Takeaways

End-to-end accounting shows distillation is not inherently cheap. The dominant energy drivers are (i) student compute (model size and training tokens) and (ii) teacher artifact creation (logit caching or synthetic generation), which acts as a near-fixed overhead unless reused.

##### 1\) Scale dominates\.

Bigger students cost much more for diminishing gains, making student size the largest lever on total kWh. Mid-scale students typically offer a better energy–quality tradeoff than defaulting to the largest.

##### 2\) Teacher reuse can invert the ranking\.

In one-off runs, teacher costs push KD and synthetic SFT to the right of baseline SFT on the frontier (Figure [1](https://arxiv.org/html/2605.13981#S6.F1)), especially for smaller students. With reuse across students, sweeps, or seeds, the ordering flips (Figure [3](https://arxiv.org/html/2605.13981#S6.F3)): distillation becomes energy-competitive only when teacher costs are amortized. Whether distillation is cost-efficient is therefore primarily a workflow question, not an intrinsic property of KD or synthetic SFT.

##### 3\) Stage\-wise attribution dictates the right knob\.

Baseline SFT is dominated by student training; KD splits between logit caching and training; synthetic SFT is dominated by teacher generation (Figure [2](https://arxiv.org/html/2605.13981#S6.F2)). Effective optimization depends on the dominant stage: training tokens for SFT, generation budget for synthetic SFT, and reuse whenever a teacher stage is present.

##### 4\) Hyperparameters are frontier controls\.

For KD, some $(T,\alpha)$ pairs are Pareto-dominated. For synthetic SFT, max\_new\_tokens and dataset size yield non-linear energy growth with diminishing returns. Tune on the energy–quality frontier, not accuracy alone.
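To make "Pareto-dominated" concrete, here is a minimal Python sketch that filters a hyperparameter sweep down to its energy–quality frontier. The function name, the tuple layout, and the sweep numbers are assumptions for illustration; they are not results from this study.

```python
def pareto_frontier(points):
    """Return the configurations not dominated in (energy, quality).

    points: iterable of (label, energy_kwh, quality) tuples.
    A point is dominated if another point has energy <= and quality >=,
    with at least one of the two inequalities strict.
    """
    points = list(points)
    frontier = []
    for label, energy, quality in points:
        dominated = any(
            e <= energy and q >= quality and (e < energy or q > quality)
            for _, e, q in points
        )
        if not dominated:
            frontier.append((label, energy, quality))
    # Sort by energy so the frontier reads left to right.
    return sorted(frontier, key=lambda p: p[1])


# Hypothetical (T, alpha) sweep results; the numbers are placeholders.
sweep = [
    ("T=1, a=0.3", 9.8, 0.74),
    ("T=2, a=0.5", 10.0, 0.78),
    ("T=4, a=0.5", 10.4, 0.78),  # same quality, more energy: dominated
    ("T=2, a=0.9", 10.6, 0.80),
]
for label, energy, quality in pareto_frontier(sweep):
    print(label, energy, quality)
```

In this toy sweep the higher-temperature setting is dropped because it costs more energy for no quality gain, which is the pattern described for some KD settings above.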

##### Practical recommendations\.

(i) Report both student-only and full-pipeline energy with explicit reuse assumptions. (ii) Treat teacher artifacts as shared infrastructure (cached, versioned, and published via lightweight registries logging metadata, decoding settings, licensing, and measured production energy) and adopt a default *reuse-before-regenerate* workflow. (iii) Use baseline SFT when reuse is low; favor KD or synthetic SFT when reuse and quality demands are high.

## 8 Limitations and Future Work

Our study has several limitations that qualify the scope of the empirical findings\.

(1) First, all experiments were conducted on a single hardware and cluster configuration, as described in Section [3.1](https://arxiv.org/html/2605.13981#S3.SS1). While this controlled setting was useful for isolating pipeline and hyperparameter effects, it does not capture variability across different accelerators (e.g., A100, L4, TPUs), power caps, or multi-GPU training. Extending the protocol to multiple hardware devices and experimenting with power- or latency-constrained regimes is a direction for future work.

(2) Our research focused on a single, fully open model family (OLMo-2) and a specific set of teacher–student size configurations (32B teacher, 1B/7B/13B students). Distillation behavior, including teacher-side cost, convergence speed, and quality gains, may differ for other architectures and models, pretraining corpora, and compression methods (e.g., quantization, pruning, LoRA). Likewise, a different teacher size would likely shift the teacher-side cost of distillation, the energy–quality frontier, and the reuse thresholds at which distillation becomes beneficial. Applying the proposed protocol to other model families and compression strategies would test how robust these design rules are.

(3) The training and evaluation domains are limited to three pipelines on three supervised workloads (instruction following, math reasoning, and code generation); we do not cover other task types such as safety alignment, multilingual tasks, or long-context reasoning. Different task mixtures, deployment objectives, or evaluation protocols could lead to different trade-offs on the energy–quality frontier.

(4) Our accounting relies on specific measurement and modeling assumptions. We treat NVML-based GPU telemetry as ground truth and use estimator-based methods for CPU energy and $\mathrm{CO}_2\text{e}$ under fixed PUE and grid-intensity assumptions; deployment-specific datacenter characteristics are not captured.

(5) Finally, the absolute kWh values, J/token values, and the 7B reuse threshold $N$ are configuration-specific. They depend on hardware, parallelism strategy, model family, decoding setup, and convergence behavior. The broader conclusion is not that these numerical thresholds transfer directly, but that teacher artifact creation is a measurable pipeline cost that can change energy–quality rankings unless it is amortized. Eq. [4](https://arxiv.org/html/2605.13981#S6.E4) and the released energy harness are intended to make this threshold recomputable for other hardware and model families.

## 9 Conclusion

This work set out to answer a simple yet unexplored question: when is LLM distillation actually energy-efficient? While distillation and smaller students are often presented as “greener” or “cheaper” methods, existing practice rarely accounts for the full distillation pipeline, especially the teacher-side generation costs. We addressed this gap by holistically evaluating the energy demand across three pipelines, comparing baseline supervised fine-tuning, logit-based KD, and synthetic SFT methods for 1B, 7B, and 13B students distilled from a 32B teacher model. We also release a distillation energy accounting protocol and harness that allows further evaluation of distillation costs: it decomposes each pipeline into stages, measures GPU energy via NVML with complementary CPU and carbon estimates, and normalizes results in Joules per token under controlled hardware, software, and data conditions.

Our stage\-wise breakdowns and energy–quality frontiers show that distillation is not inherently more sustainable: once the teacher\-side costs are included, both KD and synthetic SFT often consume more end\-to\-end energy than a strong SFT baseline for modest quality gains\. At the same time, teacher generation and logit caching costs amortize sharply with reuse; under realistic reuse levels across students or fine\-tuning rounds, KD can match or beat baseline SFT, and synthetic SFT can become strictly more energy\-efficient at higher quality targets\.

## Impact Statement

In light of the current rapid growth in large\-scale AI deployment, demand for GPUs and new datacenter capacity has risen sharply, while systematic tracking and reporting of energy is often overlooked, in spite of the substantial environmental and societal impacts of large training runs, as well as direct financial costs of operating and developing these systems\. Many training and distillation decisions are still driven primarily by benchmark scores and model quality metrics rather than by a careful assessment of return on investment in compute, energy, financial spending, and emissions\. This disconnect risks locking in infrastructure and practices that prove theoretically efficient on benchmarks but are costly in real\-world deployment settings\.

This work aims to make the energy, emissions, and costs of model distillation more transparent and measurable\. By providing stage\-wise energy accounting, end\-to\-end energy–quality frontiers, and concrete design guidelines, our results can help practitioners select distillation pipelines and hyperparameters that substantially reduce operational energy use for training smaller language models\. In principle, this can lower the environmental footprint of deploying capable models, encouraging more honest reporting of compute and energy in both academic and industrial settings\.

At the same time, more efficient distillation pipelines may also lower the cost of training models that are potentially harmful or misaligned. Our analysis does not address questions of content safety, misuse, or governance, and these methods could be applied to optimize the energy use of models with negative societal impacts. We therefore view our contribution as complementary to ongoing work on safety and governance: energy-aware distillation should be combined with strong safeguards on how distilled models are trained, evaluated, and deployed.

Finally, our measurements are limited to a specific hardware and software stack \(H100 GPUs, OLMo\-2 models, and a small set of supervised tasks\) and focus on operational energy during distillation and evaluation\. We do not account for embodied emissions and costs which arise from hardware manufacturing or data center infrastructure\. As a result, our quantitative estimates should be interpreted as lower bounds on the full lifecycle environmental cost of large\-scale language model distillation\.

## References

- N. Bannour, S. Ghannay, A. Névéol, and A. Ligozat (2021) Evaluating the carbon footprint of nlp methods: a survey and analysis of existing tools. In Proceedings of the second workshop on simple and efficient natural language processing, pp. 11–21. Cited by: [§2](https://arxiv.org/html/2605.13981#S2.SS0.SSS0.Px2.p1.1).
- L. Bouza, A. Bugeau, and L. Lannelongue (2023) How to estimate carbon footprint when training deep learning models? a guide and review. Environmental Research Communications 5(11), pp. 115014. Cited by: [§2](https://arxiv.org/html/2605.13981#S2.SS0.SSS0.Px2.p1.1).
- B. Courty, V. Schmidt, S. Luccioni, Goyal-Kamal, MarionCoutarel, B. Feld, J. Lecourt, L. Connell, A. Saboni, Inimaz, supatomic, M. Léval, L. Blanche, A. Cruveiller, ouminasara, F. Zhao, A. Joshi, A. Bogroff, H. de Lavoreille, N. Laskaris, E. Abati, D. Blank, Z. Wang, A. Catovic, M. Alencon, M. Stechiy, C. Bauer, L. O. N. de Araújo, JPW, and MinervaBooks (2024) Mlco2/codecarbon: v2.4.1. External Links: [Document](https://dx.doi.org/10.5281/zenodo.11171501), [Link](https://doi.org/10.5281/zenodo.11171501). Cited by: [§2](https://arxiv.org/html/2605.13981#S2.SS0.SSS0.Px2.p1.1).
- P. Henderson, J. Hu, J. Romoff, E. Brunskill, D. Jurafsky, and J. Pineau (2020) Towards the systematic reporting of the energy and carbon footprints of machine learning. Journal of Machine Learning Research 21(248), pp. 1–43. Cited by: [§2](https://arxiv.org/html/2605.13981#S2.SS0.SSS0.Px2.p1.1), [§2](https://arxiv.org/html/2605.13981#S2.p1.1), [§4](https://arxiv.org/html/2605.13981#S4.p1.1).
- G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: [§3.3.1](https://arxiv.org/html/2605.13981#S3.SS3.SSS1.p1.1).
- K. Lambert and S. Luccioni (2026) From cradle to cloud: a life cycle review of ai’s environmental footprint. arXiv preprint arXiv:2605.05416. Cited by: [§2](https://arxiv.org/html/2605.13981#S2.SS0.SSS0.Px1.p1.1).
- N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, Y. Gu, S. Malik, V. Graf, J. D. Hwang, J. Yang, R. L. Bras, O. Tafjord, C. Wilhelm, L. Soldaini, N. A. Smith, Y. Wang, P. Dasigi, and H. Hajishirzi (2024) Tülu 3: pushing frontiers in open language model post-training. Cited by: [§3.2](https://arxiv.org/html/2605.13981#S3.SS2.p2.1).
- T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y. Gu, S. Huang, M. Jordan, et al. (2024) 2 olmo 2 furious. arXiv preprint arXiv:2501.00656. Cited by: [§5.2](https://arxiv.org/html/2605.13981#S5.SS2.SSS0.Px1.p1.6), [§5.2](https://arxiv.org/html/2605.13981#S5.SS2.p1.1).
- D. Patterson, J. Gonzalez, U. Hölzle, Q. Le, C. Liang, L. Munguia, D. Rothchild, D. R. So, M. Texier, and J. Dean (2022) The carbon footprint of machine learning training will plateau, then shrink. Computer 55(7), pp. 18–28. Cited by: [§2](https://arxiv.org/html/2605.13981#S2.p1.1).
- G. Penedo, A. Lozhkov, H. Kydlíček, L. B. Allal, E. Beeching, A. P. Lajarín, Q. Gallouédec, N. Habib, L. Tunstall, and L. von Werra (2025) CodeForces cots. Hugging Face. Note: [https://huggingface.co/datasets/open-r1/codeforces-cots](https://huggingface.co/datasets/open-r1/codeforces-cots). Cited by: [§3.2](https://arxiv.org/html/2605.13981#S3.SS2.p2.1).
- K. Rafat, S. Islam, A. A. Mahfug, M. I. Hossain, F. Rahman, S. Momen, S. Rahman, and N. Mohammed (2023) Mitigating carbon footprint for knowledge distillation based deep learning model compression. Plos one 18(5), pp. e0285668. Cited by: [§2](https://arxiv.org/html/2605.13981#S2.SS0.SSS0.Px1.p2.1).
- R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni (2020) Green ai. Communications of the ACM 63(12), pp. 54–63. Cited by: [§2](https://arxiv.org/html/2605.13981#S2.p1.1).
- E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in nlp. In Proceedings of the 57th annual meeting of the association for computational linguistics, pp. 3645–3650. Cited by: [§2](https://arxiv.org/html/2605.13981#S2.p1.1).
- Y. Yuan, J. Shi, Z. Zhang, K. Chen, J. Zhang, V. Stoico, and I. Malavolta (2024) The impact of knowledge distillation on the energy consumption and runtime efficiency of nlp models. In Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering-Software Engineering for AI, pp. 129–133. Cited by: [§2](https://arxiv.org/html/2605.13981#S2.SS0.SSS0.Px1.p2.1).
- H. Zhao, H. Wang, Y. Peng, S. Zhao, X. Tian, S. Chen, Y. Ji, and X. Li (2025) 1.4 million open-source distilled reasoning dataset to empower large language model training. arXiv preprint arXiv:2503.19633. Cited by: [§3.2](https://arxiv.org/html/2605.13981#S3.SS2.p2.1).

## Appendix

## Appendix A Measurement Details

NVML provides a GPU power time series $P_{\text{GPU}}(t)$ (Watts), which we sample every 0.5 s as a trade-off between noise and logging overhead. For a stage with start time $t_s$ and end time $t_e$, we approximate GPU energy by numerical integration:

$$E_{\text{GPU}} \approx \int_{t_s}^{t_e} P_{\text{GPU}}(t)\,dt.$$

CPU energy is obtained from the CodeCarbon estimator in process-tracking mode. Stage energy is computed as

$$E^{(\text{stage})}_{\text{total}} = E_{\text{GPU}} + E_{\text{CPU}},$$

and pipeline totals are computed by summing $E^{(\text{stage})}_{\text{total}}$ over stages. We report both Joules and kWh using $1\,\text{kWh} = 3.6 \times 10^{6}\,\text{J}$.

To compare runs with different sequence lengths and token counts, we normalize energy and throughput by tokens processed. For a stage that processes $N_{\text{tokens}}$ tokens,

$$\text{J/token} = \frac{E^{(\text{stage})}_{\text{total}}}{N_{\text{tokens}}}.$$

We derive CO2e using a regional grid factor $g_{\text{region}}$ (kg CO2e / kWh) and a data-center PUE:

$$\mathrm{CO}_2\mathrm{e} \;=\; E_{\text{total,kWh}} \times \mathrm{PUE} \times g_{\text{region}}. \tag{6}$$

Because PUE and $g_{\text{region}}$ are deployment-dependent, we treat CO2e as assumption-sensitive and emphasize direct energy measurements in the main text. Runs are repeated 2–3 times with different seeds to estimate variance in both energy and performance.
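The accounting steps above can be sketched in a few lines of Python. This is a minimal illustration, not the released harness: the trapezoidal rule is one possible choice of numerical integration (the text does not specify which rule is used), the function names are hypothetical, and the power trace, CPU energy, PUE, and grid factor are synthetic placeholder values.

```python
J_PER_KWH = 3.6e6  # 1 kWh = 3.6e6 J


def integrate_power(timestamps_s, power_w):
    """Trapezoidal integration of sampled power (W) over time (s), giving Joules."""
    if len(timestamps_s) != len(power_w) or len(timestamps_s) < 2:
        raise ValueError("need at least two aligned samples")
    energy_j = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        energy_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy_j


def stage_summary(e_gpu_j, e_cpu_j, n_tokens, pue, grid_kg_per_kwh):
    """Combine GPU and CPU energy into stage totals, J/token, and CO2e (Eq. 6)."""
    total_j = e_gpu_j + e_cpu_j
    total_kwh = total_j / J_PER_KWH
    return {
        "total_j": total_j,
        "total_kwh": total_kwh,
        "j_per_token": total_j / n_tokens,
        "co2e_kg": total_kwh * pue * grid_kg_per_kwh,
    }


# Synthetic trace: constant 600 W sampled every 0.5 s for 10 s.
ts = [0.5 * i for i in range(21)]
gpu_j = integrate_power(ts, [600.0] * 21)  # 6000.0 J
# CPU energy, token count, PUE, and grid factor are hypothetical placeholders.
summary = stage_summary(gpu_j, e_cpu_j=1200.0, n_tokens=10_000,
                        pue=1.2, grid_kg_per_kwh=0.4)
print(summary["j_per_token"])  # 0.72
```

In a real run, the power samples would come from NVML telemetry and the CPU term from an estimator such as CodeCarbon, as described above.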

## Appendix B Detailed Training Configuration

Unless otherwise stated, we hold the following training settings fixed across *all* experiments and pipelines:

- Optimizer: Adafactor (memory-efficient variant of Adam; permits all experiments to run on a single GPU).
- Learning rate: $5\times 10^{-5}$.
- Scheduler: cosine decay with 100 warmup steps.
- Batch size: effective batch size of 4 with 16 gradient-accumulation steps; evaluation batch size 1.
- Gradient clipping: maximum gradient norm 1.0.
- Precision: bfloat16 for both training and evaluation.
- Sequence length: maximum sequence length of 1024 tokens for training and evaluation.

## Appendix C Hardware and Software Environment

All experiments are run on the same machine type to ensure comparability of energy measurements. Each run uses a single NVIDIA H100 80GB HBM3 GPU with a power limit of 700 W, paired with an Intel(R) Xeon(R) Gold 6442Y CPU (48 physical cores; 16 used) and approximately 2 TB of system RAM. The software stack consists of Python 3.10.13, PyTorch 2.6.0+cu124, CUDA 12.4, and transformers 4.51.3.

## Appendix D Total GPU Wall-Clock Time

Across the core 3×3 grid (2–3 repeats) plus the KD ($T\times\alpha$) and synthetic SFT (max\_new\_tokens) sweeps, we estimate a total of $\approx 2{,}000$ H100 GPU-hours of wall-clock compute (about 83 GPU-days on a single GPU), obtained by converting the summed measured kWh to time assuming an average H100 draw of $\approx 0.65$ kW.
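As a rough consistency check on this conversion, the arithmetic can be written out as follows. The energy total of about 1,300 kWh is back-computed here from the two stated figures and is an implied value, not a separately reported measurement.

$$
t_{\text{GPU}} \approx \frac{E_{\text{kWh}}}{\bar{P}_{\text{kW}}}, \qquad
E_{\text{kWh}} \approx 2{,}000\,\text{h} \times 0.65\,\text{kW} = 1{,}300\,\text{kWh}, \qquad
\frac{2{,}000\,\text{h}}{24\,\text{h/day}} \approx 83\,\text{days}.
$$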

## Appendix E Per-Benchmark Evaluation Scores

Table 4 below reports the per-benchmark model scores used to compute the aggregate quality score $Q$ (Eq. [3](https://arxiv.org/html/2605.13981#S5.E3), Section [5.2](https://arxiv.org/html/2605.13981#S5.SS2)) for each pipeline–student-size configuration.

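Table 4: Per-benchmark model scores

| Pipeline | Size | AE2 | IFEval | GSM8K | MMLU | MT |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline SFT | 1B | 6 | 70 | 68 | 44 | 5.5 |
| Baseline SFT | 7B | 12 | 70 | 74 | 69 | 7.6 |
| Baseline SFT | 13B | 17 | 75 | 78 | 70 | 7.8 |
| KD | 1B | 6 | 69 | 67 | 43 | 6.2 |
| KD | 7B | 10 | 62 | 76 | 55 | 6.0 |
| KD | 13B | 10 | 70 | 74 | 58 | 6.6 |
| Synthetic SFT | 1B | 6 | 69 | 68 | 43 | 6.2 |
| Synthetic SFT | 7B | 10 | 61 | 77 | 55 | 6.5 |
| Synthetic SFT | 13B | 11 | 70 | 75 | 60 | 6.8 |

The per-benchmark scores show that distillation can provide modest task-specific gains; however, these gains do not reverse the aggregate energy–quality conclusions stated in the main text. Practitioners targeting a specific capability should therefore recompute the frontier using task-appropriate quality metrics; the accounting framework itself is independent of the choice of metric.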
