Do Foundation Model Embeddings Improve Cross-Country Crop Yield Generalisation? A Leave-One-Country-Out Evaluation in Sub-Saharan Africa
Summary
This paper evaluates whether geospatial foundation model embeddings like Prithvi-EO improve cross-country crop yield prediction in Sub-Saharan Africa compared to traditional Sentinel-2 features. The study finds that frozen embeddings do not significantly outperform spectral medians under rigorous Leave-One-Country-Out validation, suggesting country-level distribution shift is the primary bottleneck rather than feature representation quality.
# Do Foundation Model Embeddings Improve Cross-Country Crop Yield Generalisation? A Leave-One-Country-Out Evaluation in Sub-Saharan Africa
Source: [https://arxiv.org/html/2605.08113](https://arxiv.org/html/2605.08113)
Yaw Osei Adjei Department of Computer Science, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana yoadjei@st\.knust\.edu\.gh
###### Abstract
Accurate smallholder maize yield prediction across national boundaries is critical for food security planning in sub\-Saharan Africa, yet most published benchmarks report within\-country performance that overstates true generalisability\. This paper asks whether geospatial foundation model embeddings \(specifically Prithvi\-EO\-1\.0\-100M and ViT\-Base\) outperform traditional Sentinel\-2 spectral features under a rigorous Leave\-One\-Country\-Out \(LOCO\) cross\-validation scheme applied to 6,404 smallholder maize field observations across five African countries \(Kenya, Malawi, Nigeria, Rwanda, Tanzania\) for the years 2017–2022\. We evaluate 18 experimental conditions spanning three feature representations, three regression algorithms \(Ridge, Random Forest, XGBoost\), and two validation protocols \(LOCO and standard five\-fold random cross\-validation\)\. Results reveal a stark generalisability gap: all feature sets achieve $R^{2}\in[0.17,0.30]$ under within\-country evaluation but universally negative $R^{2}\in[-0.09,-0.03]$ under LOCO, and a naive country\-mean baseline is likewise negative\. Critically, frozen Prithvi\-EO embeddings applied to single growing\-season composites offer no meaningful advantage over 10\-band Sentinel\-2 band medians for cross\-country prediction\. This result must be interpreted carefully: Prithvi\-EO was designed for multi\-temporal input, and using a single annual composite is a deliberate mismatch that isolates the question of whether frozen representations alone carry cross\-country invariance\. We find they do not, and interpret this as evidence that the primary bottleneck is country\-level yield distribution shift \(a data problem\) rather than deficiencies in feature representation\. Our findings challenge the assumption that domain\-specific geospatial pre\-training closes the generalisability gap, and establish a reproducible negative benchmark that future work should surpass\.
All code and processed results are released at [https://github\.com/yoadjei/yield\-africa](https://github.com/yoadjei/yield-africa)\.
## I Introduction
Food insecurity remains a defining challenge across sub\-Saharan Africa, where smallholder farms account for more than 70% of staple crop production \[[2](https://arxiv.org/html/2605.08113#bib.bib9)\]\. Timely and spatially disaggregated yield forecasts are essential inputs for early warning systems, agricultural policy, and targeted humanitarian response\. Remote sensing offers a scalable, cost\-effective route to such forecasts by linking satellite observations directly to field\-level yield outcomes \[[14](https://arxiv.org/html/2605.08113#bib.bib10)\]\.
A central ambition of this programme is to train predictive models in data\-rich countries and apply them in data\-scarce settings, a transfer that requires genuine cross\-country generalisation\. Yet the dominant evaluation paradigm in the literature applies random $k$\-fold cross\-validation within a pooled dataset, inadvertently preserving country\-specific statistical patterns in both training and test folds\. This produces optimistic accuracy estimates that collapse when models are deployed across national boundaries\.
The recent emergence of geospatial foundation models pre\-trained on large\-scale satellite archives has raised hopes that self\-supervised representations might encode invariant physical features of agricultural landscapes, thereby reducing this generalisation gap\. Prithvi\-EO\-1\.0\-100M \[[10](https://arxiv.org/html/2605.08113#bib.bib2)\], jointly released by IBM and NASA, is a Masked Autoencoder \[[8](https://arxiv.org/html/2605.08113#bib.bib20)\] pre\-trained on six\-channel Harmonised Landsat\-Sentinel \(HLS\) time series specifically for crop and land\-cover applications\. The Vision Transformer \(ViT\-Base\) \[[4](https://arxiv.org/html/2605.08113#bib.bib3)\], pre\-trained on ImageNet at scale, provides a general\-purpose visual baseline for comparison\.
This paper makes three contributions:
1. We present a systematic LOCO evaluation of frozen foundation model embeddings against engineered spectral features for smallholder maize yield prediction in Africa, using 6,404 observations from five countries spanning 2017–2022\. To our knowledge, no prior published work applies this evaluation protocol to Prithvi\-EO or comparable geospatial foundation models in the African smallholder yield regression setting\.
2. We demonstrate that all feature representations fail to generalise cross\-country under the frozen, single\-frame evaluation protocol: LOCO $R^{2}$ is universally negative for all 9 combinations of feature set and regressor, and a naive country\-mean predictor also yields negative $R^{2}$, confirming that the primary bottleneck is distribution shift in yield itself rather than in the feature space\.
3. We provide a reproducible negative benchmark: a clearly documented failure mode that future work on domain adaptation, meta\-learning, and cross\-country yield transfer must surpass\.
The remainder of this paper is organised as follows\. Section [II](https://arxiv.org/html/2605.08113#S2) reviews related work\. Section [III](https://arxiv.org/html/2605.08113#S3) describes the study area and data sources\. Section [IV](https://arxiv.org/html/2605.08113#S4) presents the experimental design\. Section [V](https://arxiv.org/html/2605.08113#S5) reports results\. Section [VI](https://arxiv.org/html/2605.08113#S6) discusses implications, and Section [VII](https://arxiv.org/html/2605.08113#S7) concludes\.
## II Related Work
### II\-A Satellite\-Based Yield Prediction in Africa
Burke and Lobell \[[2](https://arxiv.org/html/2605.08113#bib.bib9)\] demonstrated that Landsat\-derived NDVI time series can explain a significant fraction of smallholder yield variation in Uganda, establishing early proof of concept for satellite\-based prediction in Africa\. Lobell et al\. \[[14](https://arxiv.org/html/2605.08113#bib.bib10)\] conducted a systematic comparison of satellite imagery and traditional household surveys, finding that imagery\-based prediction achieves moderate accuracy within\-country while highlighting the challenge of transferring models across contexts\. You et al\. \[[22](https://arxiv.org/html/2605.08113#bib.bib11)\] applied deep Gaussian processes to MODIS time series for county\-level yield prediction in the United States, showing that spatial dependencies are a key driver of predictive accuracy\. These studies share a common limitation: cross\-location evaluation is rare, and cross\-country evaluation is nearly absent\.
### II\-B Foundation Models for Geospatial Applications
The Vision Transformer \[[4](https://arxiv.org/html/2605.08113#bib.bib3)\], through its attention mechanism, demonstrated that large\-scale pre\-training on natural images enables powerful general\-purpose visual representations\. Mai et al\. \[[15](https://arxiv.org/html/2605.08113#bib.bib16)\] surveyed opportunities and challenges of foundation models for geospatial AI, noting a persistent gap between general pre\-training objectives and specialised downstream tasks\. Jakubik et al\. \[[10](https://arxiv.org/html/2605.08113#bib.bib2)\] introduced Prithvi\-EO\-1\.0\-100M, a 100\-million parameter masked autoencoder pre\-trained on six\-channel HLS imagery across the continental United States and fine\-tuned for flood mapping, multi\-temporal crop segmentation, and burned area detection\. Wang et al\. \[[20](https://arxiv.org/html/2605.08113#bib.bib24)\] curated the SSL4EO\-S12 benchmark for self\-supervised pre\-training on Sentinel\-1/2 imagery, highlighting the importance of modality alignment between pre\-training and downstream tasks\. No prior work has benchmarked frozen Prithvi\-EO embeddings against spectral baselines for *cross\-country* yield regression in Africa\.
### II\-C Domain Shift and Cross\-Location Transfer
Tseng et al\. \[[18](https://arxiv.org/html/2605.08113#bib.bib14)\] introduced CropHarvest, a global crop\-type dataset, and evaluated few\-shot cross\-country crop classification, finding severe performance degradation without target\-country adaptation\. Kerner et al\. \[[12](https://arxiv.org/html/2605.08113#bib.bib13)\] developed rapid\-response crop mapping for data\-sparse regions, but relied on within\-region fine\-tuning\. Tuia et al\. \[[19](https://arxiv.org/html/2605.08113#bib.bib22)\] provided a comprehensive survey of domain adaptation in remote sensing classification, establishing that covariate shift between acquisition conditions routinely degrades cross\-site generalisation; their findings extend conceptually to the regression setting studied here\. Wolanin et al\. \[[21](https://arxiv.org/html/2605.08113#bib.bib23)\] demonstrated that explainable deep learning can achieve strong within\-country yield estimates in the Indian wheat belt, yet their evaluation was confined to a single country and did not address cross\-country transfer\. Mañas et al\. \[[16](https://arxiv.org/html/2605.08113#bib.bib21)\] showed that seasonal contrast pre\-training on uncurated RS data improves linear probe accuracy across multiple downstream RS tasks, but did not evaluate cross\-country yield regression\. These studies reinforce a consistent pattern: geospatial models trained in one country or region degrade substantially when applied elsewhere without adaptation\. Our work extends this evidence to the yield regression setting and explicitly quantifies the gap using a LOCO protocol across five African countries; it additionally compares foundation model embeddings against engineered spectral features under that same protocol\.
## III Study Area and Data
### III\-A Countries and Crop
The study covers five sub\-Saharan African countries: Kenya, Malawi, Nigeria, Rwanda, and Tanzania\. Maize \(Zea mays\) is the target crop, selected as the most widely cultivated staple in the region and the crop with the highest GROW\-Africa observation density\. The five countries represent diverse agroecological zones \(semi\-arid savanna to highland tropics\) and a wide range of smallholder yield outcomes, making cross\-country generalisation genuinely challenging\.
### III\-B Yield Labels
Field\-level yield labels are drawn from two complementary sources:
GROW\-Africa \[[7](https://arxiv.org/html/2605.08113#bib.bib1)\] provides GPS\-tagged, farm\-level maize yield observations collected by household surveys and agronomic trials across sub\-Saharan Africa\. We applied the following filters: \(i\) observation years 2017–2022 to align with Sentinel\-2 availability; \(ii\) point\-level GPS coordinates only, excluding administrative\-polygon centroids; \(iii\) countries with at least 100 post\-filter observations\.
HarvestStat Africa \[[13](https://arxiv.org/html/2605.08113#bib.bib15)\] provides subnational crop production statistics\. Nigeria has negligible GPS\-level GROW\-Africa coverage; HarvestStat admin\-unit centroids were used as proxy observations for that country\. A label\_source indicator distinguishes the two data types\. This coarser spatial resolution for Nigeria is a disclosed limitation \(see Section [VI\-C](https://arxiv.org/html/2605.08113#S6.SS3)\)\.
After merging and quality filtering, the dataset contains 6,404 observations: Kenya \(1,396\), Malawi \(1,552\), Nigeria \(955\), Rwanda \(1,138\), and Tanzania \(1,363\)\. Table [I](https://arxiv.org/html/2605.08113#S3.T1) summarises the per\-country distribution\.
TABLE I: Per\-country dataset summary\. Yield in kg/ha on original scale\. The wide inter\-country yield variation \(Nigeria mean 1,256 kg/ha vs\. Rwanda mean 3,993 kg/ha\) is a key driver of the distribution shift studied here\.
### III\-C Sentinel\-2 Imagery
Sentinel\-2 Level\-2A surface reflectance imagery was acquired via Google Earth Engine for a 500 m buffer around each field centroid, composited as annual growing\-season medians \[[5](https://arxiv.org/html/2605.08113#bib.bib8)\]\. Ten spectral bands were extracted: B2 \(blue\), B3 \(green\), B4 \(red\), B5–B7 \(red\-edge\), B8 \(NIR\), B8A \(narrow NIR\), B11 and B12 \(SWIR\)\. A $224\times 224$ pixel patch centred on each field was exported as a GeoTIFF for foundation model inference\. Patches with estimated cloud cover exceeding 20% were discarded\.
### III\-D Spectral Indices
Four spectral indices were computed from the Sentinel\-2 composites: NDVI \[[17](https://arxiv.org/html/2605.08113#bib.bib17)\] \(Normalised Difference Vegetation Index\), EVI \[[11](https://arxiv.org/html/2605.08113#bib.bib18)\] \(Enhanced Vegetation Index\), LSWI \(Land Surface Water Index\), and NDWI \(Normalised Difference Water Index\)\. These indices capture vegetation greenness and canopy moisture status relevant to yield formation\.
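The four indices can be sketched as follows. This is a minimal illustration, not the released implementation: the text does not give the formulas, so the standard definitions are assumed, with the band mapping B2 (blue), B3 (green), B4 (red), B8 (NIR), and B11 (SWIR-1). NDWI is taken in its green/NIR form, which is an assumption; the NIR/SWIR form would coincide with LSWI. The function name `spectral_indices` and the `eps` guard are hypothetical.

```python
import numpy as np

def spectral_indices(blue, green, red, nir, swir1, eps=1e-9):
    """Compute NDVI, EVI, LSWI and NDWI from surface reflectance scaled to [0, 1].

    Formulas are the commonly used definitions and are assumed here;
    eps only guards against division by zero on empty pixels.
    """
    ndvi = (nir - red) / (nir + red + eps)
    evi = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)
    lswi = (nir - swir1) / (nir + swir1 + eps)
    ndwi = (green - nir) / (green + nir + eps)  # green/NIR form (assumed)
    return {"ndvi": ndvi, "evi": evi, "lswi": lswi, "ndwi": ndwi}

# Toy reflectances for a single vegetated pixel (illustrative values only).
idx = spectral_indices(
    blue=np.array([0.04]),
    green=np.array([0.08]),
    red=np.array([0.06]),
    nir=np.array([0.40]),
    swir1=np.array([0.20]),
)
```

The same function applies unchanged to full composite arrays, since every operation is element-wise.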
### III\-E Rainfall
Growing\-season cumulative precipitation was extracted from the CHIRPS \[[6](https://arxiv.org/html/2605.08113#bib.bib7)\] daily dataset at 0\.05° resolution\. Three features were derived per field: total seasonal rainfall, mean daily rainfall, and the coefficient of variation \(a proxy for drought stress\)\.
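As a sketch of the three rainfall statistics, the snippet below derives them from one field's daily growing-season series. It assumes the coefficient of variation is taken over daily values using the population standard deviation; the text does not specify either choice. The function and key names are hypothetical.

```python
import numpy as np

def rainfall_features(daily_mm):
    """Summarise one field's growing-season daily rainfall (mm/day)."""
    daily_mm = np.asarray(daily_mm, dtype=float)
    total = daily_mm.sum()
    mean = daily_mm.mean()
    # Coefficient of variation over daily values (assumed definition);
    # undefined for a completely dry season.
    cv = daily_mm.std() / mean if mean > 0 else np.nan
    return {"rain_total": total, "rain_mean": mean, "rain_cv": cv}

# Four-day toy series, far shorter than a real growing season.
feats = rainfall_features([0.0, 10.0, 0.0, 30.0])
```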
## IV Methods
### IV\-A Feature Representations
Three feature sets were evaluated, each representing a distinct data representation paradigm:
#### IV\-A1 Spectral \(baseline\)
A hand\-engineered 23\-dimensional vector comprising the 10 Sentinel\-2 band medians, four spectral indices, and three CHIRPS rainfall statistics\. This represents the standard engineered\-feature approach used in most operational yield prediction systems\.
#### IV\-A2 Prithvi\-EO Embeddings
Prithvi\-EO\-1\.0\-100M \[[10](https://arxiv.org/html/2605.08113#bib.bib2)\] is a Vision Transformer with Masked Autoencoder pre\-training on six\-channel \(HLS blue, green, red, NIR, SWIR\-1, SWIR\-2\) multi\-temporal imagery\. Sentinel\-2 patches were resampled to the six HLS channels, normalised using the model’s published per\-channel statistics, and fed to the frozen encoder as a single\-frame tensor of shape $(B,6,1,224,224)$\. The 768\-dimensional CLS token from the final encoder block was used as the embedding\. No fine\-tuning was performed; weights were loaded directly from the publicly released checkpoint \(Prithvi\_EO\_V1\_100M\.pt\)\. This isolates the contribution of pre\-trained representations from task\-specific adaptation\.
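The input preparation described above can be sketched with NumPy alone. This covers only the band selection, normalisation, and reshaping to a single-frame tensor; it does not load or call the encoder. The Sentinel-2 to HLS band mapping (in particular B8A for NIR) is an assumption, and the `mean` and `std` values are placeholders, not the model's published statistics.

```python
import numpy as np

# Order of the ten exported Sentinel-2 bands, as listed in the imagery section.
S2_BANDS = ["B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A", "B11", "B12"]
# Assumed counterparts of HLS blue, green, red, NIR, SWIR-1, SWIR-2.
HLS_FROM_S2 = ["B2", "B3", "B4", "B8A", "B11", "B12"]

def to_single_frame_input(patches, mean, std):
    """Map (B, 10, 224, 224) patches to a normalised (B, 6, 1, 224, 224) tensor."""
    idx = [S2_BANDS.index(b) for b in HLS_FROM_S2]
    x = patches[:, idx, :, :].astype(np.float32)
    x = (x - mean[None, :, None, None]) / std[None, :, None, None]
    return x[:, :, None, :, :]  # insert a singleton time axis

rng = np.random.default_rng(42)
patches = rng.uniform(0, 3000, size=(2, 10, 224, 224))
mean = np.full(6, 1500.0, dtype=np.float32)  # placeholder statistics
std = np.full(6, 800.0, dtype=np.float32)    # placeholder statistics
x = to_single_frame_input(patches, mean, std)
```

The singleton third axis is what makes this a single-frame input: a multi-temporal input would carry several time steps along that axis instead.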
#### IV\-A3 ViT\-Base Embeddings
ViT\-Base/16 \[[4](https://arxiv.org/html/2605.08113#bib.bib3)\] pre\-trained on ImageNet\-21k was applied to RGB \(B4, B3, B2\) Sentinel\-2 patches, producing a 768\-dimensional CLS token\. ViT\-Base serves as a general\-purpose vision baseline, enabling comparison between geospatial\-specialised pre\-training \(Prithvi\-EO\) and domain\-agnostic pre\-training on natural images\.
### IV\-B Regressors
Three regression algorithms were evaluated:
Ridge Regression \[[9](https://arxiv.org/html/2605.08113#bib.bib6)\] with $\alpha\in\{0.1,1,10,100,1000\}$ selected by inner cross\-validation\. Ridge provides a linear baseline that is robust to multicollinearity, which is relevant given the high dimensionality of the embedding features\.
Random Forest \[[1](https://arxiv.org/html/2605.08113#bib.bib5)\] with 100 trees and all other hyperparameters left at their defaults\. Random Forest captures non\-linear interactions while remaining relatively robust to overfitting individual training points\.
XGBoost \[[3](https://arxiv.org/html/2605.08113#bib.bib4)\] with 300 trees, learning rate 0\.05, maximum depth 4, and subsampling 0\.8\. XGBoost is among the strongest tree ensemble baselines in tabular regression benchmarks and serves as a competitive reference point for performance within this feature paradigm\.
All models were wrapped in a StandardScaler $\rightarrow$ regressor pipeline\. Random state was fixed at 42 across all experiments\.
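A minimal scikit-learn sketch of the scaler-plus-regressor pipelines is given below. It is an illustration under assumptions: the number of inner cross-validation folds for Ridge is not stated, so five is used, and the data are synthetic. XGBoost is left as a comment so that the example depends only on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42

def build_models():
    """Return StandardScaler -> regressor pipelines keyed by a short name."""
    return {
        # Alpha grid from the text; cv=5 for the inner search is an assumption.
        "ridge": make_pipeline(
            StandardScaler(),
            RidgeCV(alphas=[0.1, 1, 10, 100, 1000], cv=5),
        ),
        "rf": make_pipeline(
            StandardScaler(),
            RandomForestRegressor(n_estimators=100, random_state=SEED),
        ),
        # With xgboost installed, the third model would be configured as
        # xgboost.XGBRegressor(n_estimators=300, learning_rate=0.05,
        #                      max_depth=4, subsample=0.8, random_state=SEED).
    }

# Synthetic stand-in for a feature matrix and log-yield target.
rng = np.random.default_rng(SEED)
X = rng.normal(size=(200, 8))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200)

models = build_models()
for model in models.values():
    model.fit(X, y)
```

Placing the scaler inside the pipeline means its statistics are re-estimated on each training fold, so no information from a held-out fold reaches the scaler.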
### IV\-C Target Variable
The yield target was log\-transformed \(yield\_log\) after confirming right skewness greater than 1\.0 in the pooled distribution\. Log transformation stabilises variance and is standard practice for smallholder yield regression\. All reported RMSE values are back\-transformed to kg/ha for interpretability\.
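The back-transformation step can be illustrated as follows. The sketch assumes a natural-log transform with no offset and a plain exponential back-transform without bias correction; the text states neither, and the function name is hypothetical.

```python
import numpy as np

def rmse_kg_ha(y_true_kg_ha, y_pred_log):
    """RMSE on the original kg/ha scale for predictions made in log space."""
    y_pred = np.exp(y_pred_log)  # back-transform (natural log assumed)
    return float(np.sqrt(np.mean((y_pred - y_true_kg_ha) ** 2)))

yield_kg_ha = np.array([800.0, 1500.0, 2500.0, 6000.0])
yield_log = np.log(yield_kg_ha)

# A prediction that is uniformly 10% too high: a constant offset in log space.
pred_log = yield_log + np.log(1.1)
error = rmse_kg_ha(yield_kg_ha, pred_log)
```

A constant error in log space becomes a proportional error in kg/ha, so high-yield fields dominate the back-transformed RMSE.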
### IV\-D Missing Value Imputation
NaN values in feature vectors — arising from cloud\-affected patches or partial CHIRPS coverage — were imputed with the training\-set column median\. All\-NaN columns \(which can occur for held\-out countries with zero valid spectral observations\) were filled with 0\. Rows with NaN yield values were dropped\.
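One possible reading of this imputation rule is sketched below: medians are estimated on the training split only and applied to both splits, and a column with no valid training values falls back to 0. The text leaves open which split the all-NaN case refers to, so this is an assumption, as is the function name.

```python
import numpy as np

def impute_with_train_median(X_train, X_test):
    """Fill NaNs with training-set column medians; columns with no valid
    training values are filled with 0."""
    all_nan = np.isnan(X_train).all(axis=0)
    fill = np.zeros(X_train.shape[1])
    fill[~all_nan] = np.nanmedian(X_train[:, ~all_nan], axis=0)
    X_train_f = np.where(np.isnan(X_train), fill, X_train)
    X_test_f = np.where(np.isnan(X_test), fill, X_test)
    return X_train_f, X_test_f

# Column 0 has a training median of 2.0; column 1 is entirely missing in training.
X_train = np.array([[1.0, np.nan], [3.0, np.nan], [np.nan, np.nan]])
X_test = np.array([[np.nan, 5.0], [2.0, np.nan]])
X_train_f, X_test_f = impute_with_train_median(X_train, X_test)
```

Estimating the fill values on the training split alone keeps held-out country statistics out of the model inputs.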
### IV\-E Cross\-Validation Schemes
Two orthogonal evaluation protocols were applied to every feature–regressor combination, yielding 18 experimental conditions in total\.
#### IV\-E1 Random Five\-Fold CV
Standard stratified five\-fold cross\-validation with shuffling\. This protocol leaks country\-specific signal into the training folds and measures within\-distribution predictive accuracy\.
#### IV\-E2 Leave\-One\-Country\-Out \(LOCO\) CV
For each of the five countries, the model is trained on the remaining four countries and evaluated on the held\-out country\. Per\-country RMSE, MAE, and $R^{2}$ are recorded, and predictions are concatenated across folds to produce aggregate metrics\. LOCO is the appropriate evaluation protocol for the stated goal of cross\-country generalisation\.
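The LOCO loop can be sketched in a few lines. To keep the example self-contained, a closed-form ridge fit stands in for the scaler-plus-regressor pipelines, and the data are synthetic with a deliberate country-level offset in the target. All names here are hypothetical.

```python
import numpy as np

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge with an unpenalised intercept (stand-in regressor)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return lambda X_new: (X_new - x_mean) @ w + y_mean

def loco_cv(X, y, country):
    """Train on all other countries, test on the held-out one, for every country."""
    per_country, pooled_pred = {}, np.empty_like(y)
    for held_out in np.unique(country):
        test = country == held_out
        predict = fit_ridge(X[~test], y[~test])
        pooled_pred[test] = predict(X[test])
        per_country[str(held_out)] = r2_score(y[test], pooled_pred[test])
    # Aggregate metric from predictions concatenated across folds.
    return per_country, r2_score(y, pooled_pred)

rng = np.random.default_rng(42)
country = np.repeat(np.array(["A", "B", "C"]), 50)
offset = np.repeat(np.array([0.0, 1.0, 3.0]), 50)  # country-level target shift
X = rng.normal(size=(150, 4))
y = 0.3 * X[:, 0] + offset + rng.normal(scale=0.2, size=150)
per_country, aggregate = loco_cv(X, y, country)
```

In this toy setting the features carry real signal, yet the held-out scores are dragged down by the offset, which no training fold can reveal about the held-out country.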
### IV\-F Naive Baseline
A non\-parametric baseline was constructed for each held\-out country by predicting all test observations with the mean yield of the corresponding training set \(all other countries\)\. Because country yield means differ substantially, this baseline also achieves negative $R^{2}$, confirming that the challenge is not merely underfitting but genuine distribution shift\.
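A small worked example of this baseline on toy data follows. Whether the mean is taken on the log or the original scale is not stated in the text; the sketch simply uses whatever scale `y` is supplied in, and the function name is hypothetical.

```python
import numpy as np

def naive_loco_baseline(y, country):
    """Predict each held-out country with the mean target of all other countries."""
    scores = {}
    for held_out in np.unique(country):
        test = country == held_out
        pred = np.full(test.sum(), y[~test].mean())
        ss_res = np.sum((y[test] - pred) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores[str(held_out)] = 1.0 - ss_res / ss_tot
    return scores

# Three toy "countries" with clearly separated yield levels.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 11.0, 12.0])
country = np.array(["A"] * 3 + ["B"] * 3 + ["C"] * 3)
scores = naive_loco_baseline(y, country)
# scores == {"A": -54.0, "B": -3.375, "C": -84.375}
```

Every score is negative, and the most negative ones belong to the countries whose own mean lies furthest from the mean of the others.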
### IV\-G Evaluation Metrics
Three metrics are reported: root mean squared error \(RMSE, kg/ha\), mean absolute error \(MAE, kg/ha\), and the coefficient of determination $R^{2}$\. $R^{2}<0$ indicates that the model performs worse than predicting the test\-set mean yield, a more severe failure than simply underfitting\.
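The link between a constant predictor and negative $R^{2}$ can be made explicit. The short derivation below assumes the usual definition of $R^{2}$ against the test-set mean; the symbols $c$, $\bar{y}$, and $n$ are introduced here for illustration.

$$
R^{2} = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^{2}}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}},
\qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i .
$$

For a constant prediction $\hat{y}_i = c$, the residual sum of squares decomposes as

$$
\sum_{i=1}^{n}(y_i-c)^{2} = \sum_{i=1}^{n}(y_i-\bar{y})^{2} + n(\bar{y}-c)^{2}
\quad\Longrightarrow\quad
R^{2} = -\,\frac{n(\bar{y}-c)^{2}}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}} \le 0 ,
$$

with equality only when $c=\bar{y}$. A baseline that sets $c$ to the mean yield of the training countries is therefore negative whenever that mean differs from the held-out country's own mean, and more negative the larger the difference is relative to the within-country spread.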
## V Results
### V\-A Within\-Country Performance \(Random CV\)
Figure 1: Within\-country \(random CV\) vs\. cross\-country \(LOCO\) $R^{2}$ for all nine feature–regressor combinations\. Every condition shows a large drop when moving from random to LOCO evaluation\.
Table [II](https://arxiv.org/html/2605.08113#S5.T2) and Figure [1](https://arxiv.org/html/2605.08113#S5.F1) report five\-fold random CV results\. All nine feature–model combinations achieve positive $R^{2}$, with values ranging from 0\.169 \(ViT\-Base / Ridge\) to 0\.300 \(Prithvi\-EO / XGBoost\)\. Tree ensemble methods \(RF and XGBoost\) consistently outperform Ridge regardless of feature set, consistent with non\-linear yield–feature relationships\. Prithvi\-EO and spectral features are nearly equivalent: best\-case Prithvi\-EO $R^{2}=0.300$ vs\. spectral $R^{2}=0.291$, a difference of 0\.009 $R^{2}$ units\. ViT\-Base lags behind both domain\-specific representations \(best $R^{2}=0.219$\), suggesting that ImageNet pre\-training provides less useful inductive biases for vegetation analysis than geospatial pre\-training\.
TABLE II: Random five\-fold CV results\. Best $R^{2}$ per feature set in bold\.
### V\-B Cross\-Country Generalisation \(LOCO CV\)
Figure 2: LOCO $R^{2}$ heatmap across all nine feature–regressor combinations\. All values are negative\. Prithvi\-EO / Ridge achieves the least\-negative result \($-0.027$\)\.
Table [III](https://arxiv.org/html/2605.08113#S5.T3) and Figure [2](https://arxiv.org/html/2605.08113#S5.F2) report LOCO results\. All 9 conditions yield negative $R^{2}$, ranging from $-0.093$ \(Spectral / Ridge\) to $-0.027$ \(Prithvi\-EO / Ridge\)\. The aggregate RMSE \(1,850–1,910 kg/ha\) is notably higher than the random CV RMSE \(1,527–1,664 kg/ha\), reflecting the additional difficulty of predicting across country boundaries\.
TABLE III: LOCO CV results\. Least\-negative $R^{2}$ per feature set in bold\.
### V\-C Naive Baseline
Figure 3: Per\-country RMSE for the best model per feature set \(bars\) vs\. the naive country\-mean baseline \(diamond, dashed line\)\. Learned models beat the naive baseline on RMSE, but all remain below zero $R^{2}$\.
Table [IV](https://arxiv.org/html/2605.08113#S5.T4) confirms that the naive country\-mean predictor also achieves universally negative $R^{2}$ under LOCO\. The naive RMSE \(1,719–2,392 kg/ha\) exceeds the best\-model RMSE for all countries, indicating that learned representations do provide some signal beyond the simplest possible predictor\. However, both learned models and the naive baseline lie below zero $R^{2}$, confirming that predicting held\-out countries from other\-country data is a fundamentally difficult task regardless of model sophistication\.
TABLE IV: Naive country\-mean baseline under LOCO\.
### V\-D Generalisation Gap
Figure 4: Generalisation gap \(random CV $R^{2}$ minus LOCO $R^{2}$\) per feature–regressor combination\. Ridge consistently shows smaller gaps than tree ensembles despite lower within\-country accuracy\.
The generalisation gap, defined as random $R^{2}$ minus LOCO $R^{2}$, is visualised in Figure [4](https://arxiv.org/html/2605.08113#S5.F4) and ranges from 0\.216 \(Prithvi\-EO / Ridge: $0.180-(-0.027)$\) to 0\.384 \(Spectral / Ridge: $0.191-(-0.093)$\)\. XGBoost, despite being the strongest within\-country model, exhibits generalisation gaps of 0\.359 \(spectral\), 0\.360 \(Prithvi\-EO\), and 0\.259 \(ViT\-Base\), demonstrating that higher within\-country accuracy does not imply better cross\-country transfer\. Ridge consistently exhibits smaller generalisation gaps than tree ensembles, suggesting that simpler models are more robust to distribution shift\.
### V\-E NDVI\-Only Ablation
To bound the lower end of feature richness, Table [V](https://arxiv.org/html/2605.08113#S5.T5) evaluates a single\-feature baseline using only NDVI under LOCO\. NDVI\-only Ridge achieves $R^{2}=-0.170$, worse than the 23\-feature spectral baseline \($-0.093$\), confirming that the full engineered feature set provides incremental signal even when all models fail to generalise cross\-country\. The gap between NDVI\-only and the best full\-feature model \(Prithvi\-EO / Ridge, $-0.027$\) is only 0\.143 $R^{2}$ units, which is small relative to the within\-country to LOCO gap of $>0.2$ units and underscores that feature richness is not the binding constraint\.
TABLE V: NDVI\-only ablation under LOCO \(single feature, all three regressors\)\.
### V\-F Nigeria Sensitivity Analysis
Nigeria labels are admin\-centroid proxies rather than GPS\-tagged field observations \(Section [III](https://arxiv.org/html/2605.08113#S3)\)\. Table [VI](https://arxiv.org/html/2605.08113#S5.T6) reports LOCO results on the four remaining countries after excluding Nigeria from both training and test folds\. For tree ensembles, aggregate $R^{2}$ remains strongly negative \(spectral/RF: $-0.156$; spectral/XGBoost: $-0.139$\) while Ridge deteriorates \($-0.278$\)\. This non\-monotone pattern arises because Nigeria occupies a distinct low\-yield region of the training distribution that anchors the linear fit; its removal shifts the pooled training mean toward higher yields, increasing prediction error for Malawi \(mean 1,918 kg/ha\)\. Critically, all conditions remain well below zero $R^{2}$, confirming that distribution shift is pervasive and not an artefact of Nigeria’s label quality\.
TABLE VI: Nigeria\-excluded LOCO sensitivity \(spectral features; Kenya, Malawi, Rwanda, Tanzania only\)\.
### V\-G Per\-Fold Variability
With only five LOCO folds, aggregate $R^{2}$ figures mask large per\-country variance\. Table [VII](https://arxiv.org/html/2605.08113#S5.T7) reports the mean and standard deviation of per\-country $R^{2}$ across the five country holdouts\. Standard deviations of 0\.32–0\.70 $R^{2}$ units dwarf the differences between feature sets \($<0.07$ aggregate $R^{2}$ units\) and between regressors \($<0.05$ units\)\. Differences between conditions in Tables [II](https://arxiv.org/html/2605.08113#S5.T2) and [III](https://arxiv.org/html/2605.08113#S5.T3) should therefore be interpreted as indicative directional trends, not statistically separable findings\.
TABLE VII: Per\-country $R^{2}$ mean $\pm$ std across the five LOCO holdouts\. Large std values confirm that aggregate results are driven by which country is hardest to predict, not by small feature differences\.
### V\-H Per\-Country Analysis
Figure 5: Per\-country LOCO RMSE \(kg/ha\) for all nine conditions, with test\-set sample sizes annotated\. Rwanda and Nigeria are consistently the most difficult held\-out countries\.
Figure 6: Predicted vs\. actual yield scatter under LOCO for the Prithvi\-EO / Ridge condition \(one panel per held\-out country\)\. The 1:1 line is shown dashed\. Systematic under\- and over\-prediction reflects country\-level yield distribution shift\.
Figure [5](https://arxiv.org/html/2605.08113#S5.F5) shows per\-country LOCO RMSE across all conditions and Figure [6](https://arxiv.org/html/2605.08113#S5.F6) plots predicted vs\. actual yield for the Prithvi\-EO / Ridge condition\. Rwanda is the hardest held\-out country across all conditions \(RMSE $>$ 2,100 kg/ha\), consistent with its unusually high mean yield \(3,993 kg/ha\) relative to training\-set means\. Nigeria similarly shows the largest naive\-baseline deficit \(naive $R^{2}=-2.121$\), plausibly explained by the use of admin\-centroid proxy observations\. Kenya and Tanzania are easiest to predict \(lowest RMSE under LOCO\), likely because their yield ranges overlap more substantially with the pooled training distribution\.
### V\-I Label Shift as Primary Bottleneck
Figure 7: Yield distributions \(kg/ha\) per country\. Nigeria and Rwanda exhibit markedly different central tendency and spread relative to the pooled training set, explaining why LOCO performance collapses for these folds regardless of feature representation\.
Figure 8: Pairwise KL divergence of log\-yield distributions between countries \(Gaussian approximation\)\. Nigeria is the distributional outlier: $\mathrm{KL}(\text{Nigeria}\,\|\,\text{Rwanda})=3.93$, while Kenya–Tanzania divergence is near zero \($0.001$\)\. Higher KL predicts worse LOCO performance\.
Figure [7](https://arxiv.org/html/2605.08113#S5.F7) visualises the yield distribution for each country\. Country means span from 1,256 kg/ha \(Nigeria\) to 3,993 kg/ha \(Rwanda\), a factor of $3.2\times$ on the original scale\. Figure [8](https://arxiv.org/html/2605.08113#S5.F8) quantifies pairwise label shift via Gaussian KL divergence on log\-yield distributions\. Nigeria is the distributional outlier: $\mathrm{KL}(\text{Nigeria}\,\|\,\text{Kenya})=2.69$ and $\mathrm{KL}(\text{Nigeria}\,\|\,\text{Rwanda})=3.93$\. Kenya and Tanzania are nearly identical: $\mathrm{KL}(\text{Kenya}\,\|\,\text{Tanzania})=0.001$\. The hardest held\-out countries under LOCO \(Nigeria, Rwanda\) correspond exactly to those with the largest KL divergence from the pooled training set; the easiest \(Kenya, Tanzania\) have near\-zero divergence from each other\.
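The Gaussian KL divergence used for this comparison can be sketched as below. It assumes each country's log-yield distribution is summarised by a univariate Gaussian fitted with the sample mean and population variance; the estimator details are not given in the text, and the synthetic samples are illustrative only.

```python
import numpy as np

def gaussian_kl(log_yield_p, log_yield_q):
    """KL(P || Q) between univariate Gaussians fitted to two log-yield samples."""
    mu_p, mu_q = np.mean(log_yield_p), np.mean(log_yield_q)
    var_p, var_q = np.var(log_yield_p), np.var(log_yield_q)
    return float(
        0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)
    )

rng = np.random.default_rng(42)
low = rng.normal(loc=7.0, scale=0.5, size=1000)   # a low-yield country (synthetic)
high = rng.normal(loc=8.2, scale=0.5, size=1000)  # a high-yield country (synthetic)

kl_same = gaussian_kl(low, low)    # identical distributions
kl_shift = gaussian_kl(low, high)  # mean shift of 1.2 in log space
```

The divergence is asymmetric in general, so $\mathrm{KL}(P\,\|\,Q)$ and $\mathrm{KL}(Q\,\|\,P)$ should be reported separately, as in a pairwise matrix.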
### V\-J Statistical Separability of Conditions
Figure 9: Mean LOCO $R^{2}$ $\pm$ one standard deviation across the five LOCO country folds, by feature set and regressor\. Per\-fold standard deviations \(0\.32–0\.70\) dwarf the between\-condition differences \($<0.07$\), indicating that no condition is statistically separable from any other\.
Figure 10: Feature ablation under LOCO: a single NDVI feature achieves $R^{2}$ comparable to 768\-dimensional Prithvi\-EO or ViT\-Base embeddings, confirming that richer representations do not improve cross\-country generalisation under the frozen, single\-frame protocol\.
Figure [9](https://arxiv.org/html/2605.08113#S5.F9) shows that the large between\-fold variance renders all nine conditions statistically indistinguishable\. Figure [10](https://arxiv.org/html/2605.08113#S5.F10) further shows that a single NDVI feature \(Ridge $R^{2}=-0.170$\) performs comparably to the full 768\-dimensional Prithvi\-EO embeddings \(Ridge $R^{2}=-0.027$\), a gap well within one per\-fold standard deviation\. Together these results confirm that the bottleneck is distributional, not representational\.
## VI Discussion
### VI\-A Why Foundation Models Do Not Close the Gap
The core finding, that frozen Prithvi\-EO embeddings applied to single growing\-season composites are not meaningfully superior to spectral features under LOCO, requires precise qualification before drawing broader conclusions about geospatial foundation models\.
Design mismatch\. Prithvi\-EO is a multi\-temporal masked autoencoder pre\-trained on six\-channel HLS time series\. Its inductive biases \(temporal self\-attention, multi\-frame reconstruction\) are optimised for time\-series input\. By feeding a single annual composite we discard the phenological signal the model was trained to exploit\. This is a deliberate experimental choice: it isolates the question of whether frozen spatial representations alone carry cross\-country invariance\. The answer is no, but this should not be interpreted as a general failure of Prithvi\-EO; multi\-temporal fine\-tuning may yield substantially different results and remains an important open experiment\.
Geographic domain shift\. Prithvi\-EO was pre\-trained predominantly on North American HLS tiles\. African smallholder landscapes, with fragmented parcels, mixed cropping, and informal field boundaries, differ substantially from the large\-field Corn Belt imagery that dominates the pre\-training corpus\. This geographic domain gap adds a second source of distribution mismatch beyond the temporal one\.
Label shift as the primary bottleneck\. Section [V](https://arxiv.org/html/2605.08113#S5) showed that the countries hardest to predict under LOCO \(Nigeria, Rwanda\) are exactly those with the largest KL divergence from the pooled training set \(Figure [8](https://arxiv.org/html/2605.08113#S5.F8)\), while the easiest \(Kenya, Tanzania\) have near\-zero mutual divergence\. This pattern holds regardless of feature set or regressor, and persists even after removing Nigeria from both training and test folds \(Table [VI](https://arxiv.org/html/2605.08113#S5.T6)\)\. The conclusion is that no feature representation, however rich or domain\-specific, can recover accurate yield magnitudes for a held\-out country whose yield distribution occupies a fundamentally different range from the training pool\. Addressing label shift requires country\-level yield normalisation, richer auxiliary covariates \(soil fertility, management inputs, variety adoption\), or explicit distributional alignment, none of which can be derived from satellite imagery alone\.
### VI\-B Implications for Benchmarking
Our results demonstrate concretely that within\-country random CV inflates performance by 0\.22–0\.38 $R^{2}$ units relative to LOCO\. Practitioners and reviewers should treat any yield prediction accuracy figures reported under random within\-country CV with commensurate scepticism when the intended application is cross\-country or cross\-region deployment\. LOCO, or at minimum a spatial block CV scheme that prevents geographic leakage, should be the default evaluation protocol for operational yield prediction\.
### VI\-CLimitations
Several limitations constrain the scope of our conclusions:
1. Frozen embeddings only\. Fine\-tuning Prithvi\-EO on African maize data may recover generalisation that frozen inference cannot\. This remains an important open experiment\.
2. Single\-timestamp composites\. Prithvi\-EO was designed for multi\-temporal input; using a single growing\-season composite discards phenological information the model was trained to exploit\.
3. Nigeria label quality\. Admin\-centroid proxy labels for Nigeria introduce spatial mismatch error relative to GPS\-tagged fields\.
4. Five\-country sample\. LOCO over five countries produces five test folds\. Per\-country $R^2$ standard deviations of 0\.32–0\.70 \(Table [VII](https://arxiv.org/html/2605.08113#S5.T7)\) far exceed the between\-condition differences \($<0.07$\); no formal significance test is applicable at this fold count, and all differences should be treated as directional only\.
5. No domain adaptation\. Techniques such as domain\-adversarial training, country\-level normalisation, or meta\-learning were not evaluated and may substantially improve cross\-country transfer\.
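The fold-count constraint in limitation 4 can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that two conditions would be compared with an exact paired sign test across the LOCO folds; the paper runs no such test, and the function names are hypothetical. It shows that with five folds even a clean sweep cannot reach the conventional 0.05 threshold.

```python
from math import comb


def sign_test_p(wins: int, n_folds: int) -> float:
    """Exact two-sided sign test p-value for `wins` out of `n_folds` (no ties)."""
    k = max(wins, n_folds - wins)
    # Probability of an outcome at least this lopsided under a fair coin.
    tail = sum(comb(n_folds, j) for j in range(k, n_folds + 1)) / 2 ** n_folds
    return min(1.0, 2 * tail)


def min_two_sided_sign_test_p(n_folds: int) -> float:
    """Smallest p-value the test can return: one condition wins every fold."""
    return sign_test_p(n_folds, n_folds)


for n in (5, 6, 10):
    print(n, min_two_sided_sign_test_p(n))
```

With five folds the smallest attainable two-sided p-value is 0.0625, so the test has no power at the 0.05 level regardless of how consistent the differences are.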
## VII Conclusion
We evaluated 18 combinations of feature representation, regression algorithm, and cross\-validation scheme for smallholder maize yield prediction across five sub\-Saharan African countries\. Within\-country performance is moderate \($R^2$ up to 0\.30\) but cross\-country generalisation is uniformly poor \(all LOCO $R^2 < 0$\)\. Frozen Prithvi\-EO embeddings applied to single\-frame composites provide no meaningful advantage over 10\-band Sentinel\-2 spectral features\. This result should be understood as an evaluation of *frozen, single\-frame* inference rather than a general indictment of geospatial foundation models: multi\-temporal fine\-tuning on African data remains untested and is a promising direction\. The evidence suggests that geospatial domain\-specific pre\-training alone \(without fine\-tuning, multi\-temporal input, or domain adaptation\) is insufficient to overcome country\-level yield distribution shift\. Our work establishes a reproducible LOCO benchmark against which future domain adaptation and meta\-learning approaches for African crop yield prediction should be compared\.
## Data and Code Availability
The GROW\-Africa yield labels are publicly available at [doi:10\.5281/zenodo\.14961637](https://doi.org/10.5281/zenodo.14961637)\. All preprocessing, embedding extraction, model training, and figure generation code is available at [https://github\.com/yoadjei/yield\-africa](https://github.com/yoadjei/yield-africa)\. Processed results \(`results_all.csv`, `results_loco_country.csv`\) and paper figures are included in the repository\.
## Acknowledgements
The author thanks the GROW\-Africa consortium for curating and releasing the smallholder yield dataset, and IBM Research and NASA for releasing the Prithvi\-EO\-1\.0\-100M model weights under an open licence\. Sentinel\-2 imagery was accessed free of charge through the Google Earth Engine research programme\.
AI assistance disclosure: Claude \(Anthropic\) was used as a coding assistant to help develop and debug the Python pipeline scripts\. All experimental design, analysis, interpretation, and written text are the author’s own work\.
## Appendix A Reproducibility Checklist
This appendix provides the information required to exactly reproduce all reported numbers\.
### A\-A Software Versions
- Python 3\.10
- NumPy 1\.26, pandas 2\.2, scikit\-learn 1\.4, XGBoost 2\.0
- PyTorch 2\.2 \(embedding extraction only\), timm 0\.9
- earthengine\-api 0\.1 \(GEE patch export only\)
### A\-B Random Seeds
All stochastic operations use `random_state=42`: Random Forest \(`n_estimators=100`\), XGBoost \(`random_state=42`\), and KFold shuffling for random CV \(`KFold(shuffle=True, random_state=42)`\)\. Ridge regression \(`RidgeCV`\) is deterministic given fixed features\.
### A\-C Train/Test Split Logic
LOCO: For each held\-out country $c$, the training set is all observations with `country` $\neq c$; the test set is all observations with `country` $= c$\. No shuffling\. No stratification\. Splits are fully determined by the `country` column in `master_dataset.parquet`\.
Random CV: Scikit\-learn `KFold(n_splits=5, shuffle=True, random_state=42)` applied to the full pooled dataset after dropping NaN yield rows\.
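The LOCO rule above can be sketched in a few lines of plain Python. This is an illustrative reconstruction rather than the repository's code: the function name `loco_splits` and the toy labels are hypothetical, and a comparable split could likely be obtained with scikit-learn's `LeaveOneGroupOut` using the country column as the group key.

```python
from typing import Iterator, List, Sequence, Tuple


def loco_splits(
    countries: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_country, train_idx, test_idx) for each country.

    Rows are assigned purely by their country label: no shuffling and
    no stratification, so the splits are fully deterministic.
    """
    for held_out in sorted(set(countries)):
        train_idx = [i for i, c in enumerate(countries) if c != held_out]
        test_idx = [i for i, c in enumerate(countries) if c == held_out]
        yield held_out, train_idx, test_idx


# Toy labels standing in for the `country` column (hypothetical values).
toy_countries = [
    "Kenya", "Kenya", "Malawi", "Nigeria", "Rwanda", "Tanzania", "Tanzania",
]
folds = list(loco_splits(toy_countries))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Because no observation from the held-out country ever appears in its own training set, any country-specific yield level has to be inferred from the other four countries, which is what separates this protocol from the shuffled five-fold split.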
### A\-D Preprocessing
NaN features are imputed with the training\-set column median \(computed after the country split for LOCO; computed per fold for random CV\)\. Columns that are all\-NaN in the training set are set to 0\. The yield target is log\-transformed \(`np.log1p`\) prior to model fitting; all reported RMSE and MAE values are back\-transformed via `np.expm1`\.
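The preprocessing steps described above might look roughly like the following NumPy sketch. The helper names (`impute_with_train_median`, `to_log`, `from_log`, `rmse_original_units`) and the toy arrays are hypothetical. Two readings of the text are assumed here: that an all-NaN training column is zeroed in both splits, and that errors are computed after back-transforming predictions to the original yield scale.

```python
import numpy as np


def impute_with_train_median(X_train, X_test):
    """Fill NaNs in both splits with medians computed on the training split only."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    all_nan = np.isnan(X_train).all(axis=0)
    fill = np.zeros(X_train.shape[1])
    if (~all_nan).any():
        fill[~all_nan] = np.nanmedian(X_train[:, ~all_nan], axis=0)
    X_train_f = np.where(np.isnan(X_train), fill, X_train)
    X_test_f = np.where(np.isnan(X_test), fill, X_test)
    # Assumption: a column with no training values is zeroed in both splits.
    X_train_f[:, all_nan] = 0.0
    X_test_f[:, all_nan] = 0.0
    return X_train_f, X_test_f


def to_log(y):
    """Forward target transform applied before fitting."""
    return np.log1p(np.asarray(y, dtype=float))


def from_log(z):
    """Inverse transform applied to predictions before computing errors."""
    return np.expm1(np.asarray(z, dtype=float))


def rmse_original_units(y_true, pred_log):
    """RMSE on the original yield scale, given predictions in log1p space."""
    err = np.asarray(y_true, dtype=float) - from_log(pred_log)
    return float(np.sqrt(np.mean(err ** 2)))


# Toy data: column 2 has no training values at all.
X_train = [[1.0, np.nan, np.nan], [3.0, 4.0, np.nan], [np.nan, 8.0, np.nan]]
X_test = [[np.nan, np.nan, 5.0]]
X_train_f, X_test_f = impute_with_train_median(X_train, X_test)
y = np.array([0.0, 1.5, 3.2])
print(X_train_f)
print(X_test_f)
print(rmse_original_units(y, to_log(y)))
```

Computing the medians on the training split alone keeps test-set statistics out of the features, which matters most under LOCO, where the held-out country's feature distribution may differ from the pool.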
### A\-E Reproducing Each Table
- Tables [II](https://arxiv.org/html/2605.08113#S5.T2) and [III](https://arxiv.org/html/2605.08113#S5.T3): `python scripts/04_train_eval.py` → `data/processed/results_all.csv`
- Table [IV](https://arxiv.org/html/2605.08113#S5.T4): `python scripts/05_figures.py` \(naive baseline is computed inline during figure generation\)
- Tables [V](https://arxiv.org/html/2605.08113#S5.T5), [VI](https://arxiv.org/html/2605.08113#S5.T6), and [VII](https://arxiv.org/html/2605.08113#S5.T7): `python scripts/04b_sensitivity.py` → `data/processed/results_ndvi_only.csv`, `data/processed/results_loco_no_nigeria.csv`, `data/processed/results_loco_fold_std.csv`
### A\-F Compute Requirements
- Steps 1–3 \(data download, GEE export, CHIRPS\): network\-bound, <2 hours
- Step 4 \(preprocessing\): CPU, <5 minutes
- Step 5 \(embedding extraction\): GPU recommended \(NVIDIA A100: ≈2 h; CPU: ≈8 h\)
- Step 6 \(train and evaluate, all conditions\): CPU, ≈20 minutes
- Step 7 \(sensitivity analyses\): CPU, ≈25 minutes
## References
- \[1\] L\. Breiman \(2001\) Random forests\. Machine Learning 45\(1\), pp\. 5–32\. Cited by: [§IV\-B](https://arxiv.org/html/2605.08113#S4.SS2.p3.1)\.
- \[2\] M\. Burke and D\. B\. Lobell \(2017\) Satellite\-based assessment of yield variation and its determinants in smallholder African systems\. Proceedings of the National Academy of Sciences 114\(9\), pp\. 2189–2194\. Cited by: [§I](https://arxiv.org/html/2605.08113#S1.p1.1), [§II\-A](https://arxiv.org/html/2605.08113#S2.SS1.p1.1)\.
- \[3\] T\. Chen and C\. Guestrin \(2016\) XGBoost: a scalable tree boosting system\. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp\. 785–794\. Cited by: [§IV\-B](https://arxiv.org/html/2605.08113#S4.SS2.p4.1)\.
- \[4\] A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, et al\. \(2021\) An image is worth 16x16 words: Transformers for image recognition at scale\. International Conference on Learning Representations\. Cited by: [§I](https://arxiv.org/html/2605.08113#S1.p3.1), [§II\-B](https://arxiv.org/html/2605.08113#S2.SS2.p1.1), [§IV\-A3](https://arxiv.org/html/2605.08113#S4.SS1.SSS3.p1.1)\.
- \[5\] M\. Drusch, U\. Del Bello, S\. Carlier, O\. Colin, V\. Fernandez, F\. Gascon, B\. Hoersch, C\. Isola, P\. Laberinti, P\. Martimort, et al\. \(2012\) Sentinel\-2: ESA’s optical high\-resolution mission for GMES operational services\. Remote Sensing of Environment 120, pp\. 25–36\. Cited by: [§III\-C](https://arxiv.org/html/2605.08113#S3.SS3.p1.1)\.
- \[6\] C\. Funk, P\. Peterson, M\. Landsfeld, D\. Pedreros, J\. Verdin, S\. Shukla, G\. Husak, J\. Rowland, L\. Harrison, A\. Hoell, and J\. Michaelsen \(2015\) The climate hazards infrared precipitation with stations — a new environmental record for monitoring extremes\. Scientific Data 2, pp\. 150066\. Cited by: [§III\-E](https://arxiv.org/html/2605.08113#S3.SS5.p1.1)\.
- \[7\] E\. Hachborn et al\. \(2024\) GROW\-Africa: a multi\-country smallholder yield dataset for sub\-Saharan Africa\. Scientific Data 11\. External Links: [Document](https://dx.doi.org/10.5281/zenodo.14961637)\. Cited by: [§III\-B](https://arxiv.org/html/2605.08113#S3.SS2.p2.1)\.
- \[8\] K\. He, X\. Chen, S\. Xie, Y\. Li, P\. Dollár, and R\. Girshick \(2022\) Masked autoencoders are scalable vision learners\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 16000–16009\. Cited by: [§I](https://arxiv.org/html/2605.08113#S1.p3.1)\.
- \[9\] A\. E\. Hoerl and R\. W\. Kennard \(1970\) Ridge regression: biased estimation for nonorthogonal problems\. Technometrics 12\(1\), pp\. 55–67\. Cited by: [§IV\-B](https://arxiv.org/html/2605.08113#S4.SS2.p2.1)\.
- \[10\] J\. Jakubik, S\. Roy, C\. E\. Phillips, P\. Fraccaro, D\. Godwin, B\. Zadrozny, D\. Szwarcman, C\. Gomes, G\. Nyirjesy, B\. Edwards, et al\. \(2023\) Foundation models for generalist geospatial artificial intelligence\. arXiv preprint arXiv:2310\.18660\. Cited by: [§I](https://arxiv.org/html/2605.08113#S1.p3.1), [§II\-B](https://arxiv.org/html/2605.08113#S2.SS2.p1.1), [§IV\-A2](https://arxiv.org/html/2605.08113#S4.SS1.SSS2.p1.1)\.
- \[11\] Z\. Jiang, A\. R\. Huete, K\. Didan, and T\. Miura \(2008\) Development of a two\-band enhanced vegetation index without a blue band\. Remote Sensing of Environment 112\(10\), pp\. 3833–3845\. Cited by: [§III\-D](https://arxiv.org/html/2605.08113#S3.SS4.p1.1)\.
- \[12\] H\. Kerner, G\. Tseng, I\. Becker\-Reshef, B\. Barker, B\. Munshell, M\. Paliyam, and M\. Hosseini \(2020\) Rapid response crop maps in data sparse regions\. arXiv preprint arXiv:2006\.16866\. Cited by: [§II\-C](https://arxiv.org/html/2605.08113#S2.SS3.p1.1)\.
- \[13\] D\. Lee et al\. \(2025\) HarvestStat Africa: a subnational crop production dataset for sub\-Saharan Africa\. Scientific Data 12\. External Links: [Document](https://dx.doi.org/10.1038/s41597-025-05001-z)\. Cited by: [§III\-B](https://arxiv.org/html/2605.08113#S3.SS2.p3.1)\.
- \[14\] D\. B\. Lobell, G\. Azzari, M\. Burke, S\. Gourlay, Z\. Jin, T\. Kilic, and S\. Murray \(2020\) Eyes in the sky, boots on the ground: assessing satellite\- and ground\-based approaches to crop yield measurement and analysis\. American Journal of Agricultural Economics 102\(1\), pp\. 202–219\. Cited by: [§I](https://arxiv.org/html/2605.08113#S1.p1.1), [§II\-A](https://arxiv.org/html/2605.08113#S2.SS1.p1.1)\.
- \[15\] G\. Mai, N\. Lao, Y\. He, J\. Song, and S\. Ermon \(2023\) On the opportunities and challenges of foundation models for geospatial artificial intelligence\. arXiv preprint arXiv:2304\.06798\. Cited by: [§II\-B](https://arxiv.org/html/2605.08113#S2.SS2.p1.1)\.
- \[16\] O\. Mañas, A\. Lacoste, X\. Giro\-i\-Nieto, D\. Vazquez, and P\. Rodriguez \(2021\) Seasonal contrast: unsupervised pre\-training from uncurated remote sensing data\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 9414–9423\. Cited by: [§II\-C](https://arxiv.org/html/2605.08113#S2.SS3.p1.1)\.
- \[17\] J\. W\. Rouse, R\. H\. Haas, J\. A\. Schell, and D\. W\. Deering \(1974\) Monitoring vegetation systems in the Great Plains with ERTS\. NASA Special Publication 351, pp\. 309–317\. Cited by: [§III\-D](https://arxiv.org/html/2605.08113#S3.SS4.p1.1)\.
- \[18\] G\. Tseng, I\. Zvonkov, C\. L\. Nakalembe, and H\. Kerner \(2021\) CropHarvest: a global dataset for crop\-type classification\. Advances in Neural Information Processing Systems Datasets and Benchmarks\. Cited by: [§II\-C](https://arxiv.org/html/2605.08113#S2.SS3.p1.1)\.
- \[19\] D\. Tuia, C\. Persello, and L\. Bruzzone \(2016\) Domain adaptation for the classification of remote sensing data: an overview of recent advances\. IEEE Geoscience and Remote Sensing Magazine 4\(2\), pp\. 41–57\. Cited by: [§II\-C](https://arxiv.org/html/2605.08113#S2.SS3.p1.1)\.
- \[20\] Y\. Wang, N\. A\. A\. Braham, Z\. Xiong, C\. Liu, C\. M\. Albrecht, and X\. X\. Zhu \(2023\) SSL4EO\-S12: a large\-scale multimodal, multitemporal dataset for self\-supervised learning in earth observation\. IEEE Geoscience and Remote Sensing Magazine 11\(3\), pp\. 98–106\. Cited by: [§II\-B](https://arxiv.org/html/2605.08113#S2.SS2.p1.1)\.
- \[21\] A\. Wolanin, G\. Mateo\-García, G\. Camps\-Valls, L\. Gómez\-Chova, M\. Meroni, G\. Duveiller, Y\. Liangzhi, and L\. Guanter \(2020\) Estimating and understanding crop yields with explainable deep learning in the Indian Wheat Belt\. Environmental Research Letters 15\(2\), pp\. 024019\. Cited by: [§II\-C](https://arxiv.org/html/2605.08113#S2.SS3.p1.1)\.
- \[22\] J\. You, X\. Li, M\. Low, D\. Lobell, and S\. Ermon \(2017\) Deep Gaussian process for crop yield prediction based on remote sensing data\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 31\. Cited by: [§II\-A](https://arxiv.org/html/2605.08113#S2.SS1.p1.1)\.