Can Machine Learning Forecast Rice Yields in Data-Constrained Settings? Satellite Climate Data, National Crop Statistics, and Lessons from Sierra Leone

arXiv cs.LG 06/15/26, 04:00 AM Papers

Summary

This paper presents the first machine learning study for crop yield forecasting in Sierra Leone, finding that combining freely available satellite climate data (CHIRPS, NASA POWER) with national crop statistics reduces forecast error by a third compared to persistence, though crop statistics alone are insufficient.

arXiv:2606.13959v1 Announce Type: new Abstract: Sierra Leone's agriculture operates with almost no data-driven decision support, and no published machine learning study has examined the country's crop yields. We ask whether rice yield can be forecast from data Sierra Leone currently has. Using 25 years of FAOSTAT production data (2000-2024) for nine major crops, we train XGBoost, Gradient Boosting, and Random Forest under a strict anti-leakage protocol with expanding-window walk-forward evaluation across seven held-out years, benchmarked against naive persistence. No model trained on crop statistics alone outperforms persistence. Augmenting with free satellite climate data (CHIRPS rainfall, NASA POWER temperature) reverses this result: a climate-only XGBoost reduces forecast error by one third (RMSE 284 vs 428 kg/ha), a gain that holds for a linear model and is robust to excluding the anomalous 2018 season. Early-season (May-June) rainfall is the dominant predictor, implying seasonal yield risk is observable months before harvest. No model anticipated the 2018 collapse, whose origins were institutional rather than climatic. We translate the findings into policy recommendations for Sierra Leone's Feed Salone Strategy, with a fully open-source pipeline.

Original Article

View Cached Full Text

Cached at: 06/15/26, 09:08 AM

# Can Machine Learning Forecast Rice Yields in Data-Constrained Settings? Satellite Climate Data, National Crop Statistics, and Lessons from Sierra Leone Evidence for the Feed Salone Strategy 2023–2030
Source: [https://arxiv.org/html/2606.13959](https://arxiv.org/html/2606.13959)
Ibrahim Denis FofanahSeidenberg School of Computer Science & Information Systems Pace University, New York, USARiseAfrica Foundation for STEM and Innovation Sierra Leone, West Africa

###### Abstract

Agriculture employs the majority of Sierra Leone’s rural population, yet the sector operates with almost no data\-driven decision support, and no published machine learning study has examined Sierra Leonean crop yields specifically\. This paper provides the first such evidence base, and asks a question with direct consequences for the country’s Feed Salone Strategy \(2023–2030\): can rice yield be forecast from the data Sierra Leone currently has?

Using 25 years of FAOSTAT production data \(2000–2024\) for nine major crops, we train three ensemble algorithms \(XGBoost, Gradient Boosting, Random Forest\) under a strict anti\-leakage protocol: lagged predictors only, and an expanding\-window walk\-forward evaluation across seven held\-out years \(2018–2024\), benchmarked against naive persistence\. The answer is no: no model trained on crop statistics alone outperforms simply carrying forward the previous year’s yield\.

We then augment the models with freely available satellite climate data: CHIRPS rainfall and NASA POWER temperature, aggregated to national growing\-season features\. This reverses the result\. A climate\-only XGBoost model reduces forecast error by a third relative to persistence \(RMSE 284 vs\. 428 kg/ha\), a gain that holds for a linear model as well and is robust to excluding the anomalous 2018 season\. Early\-season \(May–June\) rainfall is the dominant predictor, implying that seasonal yield risk is observable months before harvest\. Two boundaries are documented\. No model anticipated the 2018 yield collapse, whose origins were institutional rather than climatic, and the 2020–2022 record yields occurred despite below\-average rainfall, consistent with input\-driven gains under recent policy programs\.

The findings carry a clear message for Feed Salone: Sierra Leone’s existing agricultural statistics cannot support yield forecasting, but combining them with free satellite climate data already can, and household\-level microdata would extend prediction to the district level where decisions are made\. The full pipeline is open\-source and replicable for other data\-constrained agricultural economies\.

Keywords:machine learning, crop yield prediction, data leakage, walk\-forward validation, CHIRPS, Sierra Leone, Feed Salone, sub\-Saharan Africa, food security

## 1\. Introduction

### 1\.1\. Agriculture, Data, and the Intelligence Gap

In the twenty\-first century, agriculture has become as much an information problem as an agronomic one\. Across the world’s most productive farming systems, data science has changed how farmers make decisions\. Satellite imagery informs planting schedules\. Machine learning models forecast seasonal yields before a single seed enters the ground\. Real\-time price information tells producers where to sell and when\. The result is a farming sector that is not merely productive but intelligent: responsive to information, adaptive to risk, and capable of generating wealth at scale\.

Sub\-Saharan Africa has not shared equally in this transformation\. Smallholder farmers, who operate plots typically smaller than two hectares and account for the majority of food production across the region, continue to make consequential agricultural decisions without the information those decisions require\[[1](https://arxiv.org/html/2606.13959#bib.bib1)\]\. What to plant this season? When will the rains arrive? What price will the market offer at harvest? These are not abstract questions\. They determine whether a farming household eats well or faces hunger, and they are answered not with models or forecasts but with memory, tradition, and approximation\. The consequences of that gap fall most heavily on those least equipped to absorb them\.

This paper is concerned with one country where that gap is particularly acute, and where the stakes of closing it are particularly high: Sierra Leone\.

### 1\.2\. Sierra Leone: Agricultural Potential and Persistent Underperformance

Sierra Leone possesses an estimated 5\.4 million hectares of fertile arable land, a tropical climate that supports year\-round cultivation, and river systems that feed inland valley swamps across every region\[[2](https://arxiv.org/html/2606.13959#bib.bib2)\]\. Rice, the country’s staple food, grows naturally and abundantly across both lowland and upland ecologies\. Cassava, sweet potato, groundnut, oil palm, cocoa, and coffee round out an agricultural portfolio that could, in principle, feed the nation and generate substantial export revenue\.

In practice, the sector has chronically underperformed\. Sierra Leone produced approximately 1\.4 billion kilograms of rice in 2023, yet this was insufficient to meet national demand\[[3](https://arxiv.org/html/2606.13959#bib.bib3)\]\. The country imported 480,000 metric tons of rice in 2022 alone, a dependency that drains foreign exchange reserves and leaves food security vulnerable to global price shocks\[[2](https://arxiv.org/html/2606.13959#bib.bib2)\]\. More troublingly, 75 percent of arable land remains uncultivated\[[2](https://arxiv.org/html/2606.13959#bib.bib2)\], which points not to a scarcity of land or labor but to a scarcity of the enabling conditions that make cultivation worthwhile: infrastructure, inputs, market access, and decision support\.

A further dimension of the problem, which motivates the broader research agenda of which this paper is the first step, is post\-harvest loss\. Smallholder farmers across sub\-Saharan Africa lose between 30 and 50 percent of their harvest to spoilage, poor storage, inadequate processing, and transport failures\[[7](https://arxiv.org/html/2606.13959#bib.bib7),[8](https://arxiv.org/html/2606.13959#bib.bib8)\]\. In Sierra Leone, threshing and winnowing are performed by hand, post\-harvest drying occurs on mud floors and tarmac roads, and access to even basic concrete drying infrastructure is limited to a small share of farming communities\[[9](https://arxiv.org/html/2606.13959#bib.bib9)\]\. The economic loss is substantial, but the deeper consequence is food insecurity: as of 2023, 39 percent of Sierra Leone’s population lived below the poverty line\[[10](https://arxiv.org/html/2606.13959#bib.bib10)\]\.

### 1\.3\. The Policy Context: Feed Salone

In October 2023, President Julius Maada Bio launched the Feed Salone Strategy as the flagship initiative of the Medium\-Term National Development Plan 2024–2030, with a mandate to transform Sierra Leone’s food systems from subsistence and dependency toward a resilient, commercially oriented, and technology\-driven sector\[[4](https://arxiv.org/html/2606.13959#bib.bib4)\]\. The strategy targets, among other goals, a doubling of rice production, a reduction in post\-harvest losses, and the integration of digital and data\-driven tools into agricultural extension and planning\[[4](https://arxiv.org/html/2606.13959#bib.bib4)\]\.

Yet the Ministry of Agriculture and Food Security’s own assessment identifies a critical constraint: MAFS lacks the data infrastructure and technical capacity needed to monitor, evaluate, and guide its own policies\[[5](https://arxiv.org/html/2606.13959#bib.bib5)\]\. Field staff in rural areas lack training in even basic analytical tools\. The Planning, Evaluation, Monitoring, and Statistics Division manages a decentralized monitoring system that struggles to produce timely, district\-level intelligence\[[5](https://arxiv.org/html/2606.13959#bib.bib5)\]\. The ambition of Feed Salone runs ahead of the data systems available to support it\.

This paper responds to that gap, not by assuming that data\-driven forecasting is possible in Sierra Leone, but by testing rigorously whether it is, with which data, and under what limits\.

### 1\.4\. The Research Gap

Machine learning applications in agricultural prediction have grown substantially over the past decade\. Ensemble methods, particularly Random Forest, XGBoost, and Gradient Boosting, have consistently demonstrated strong predictive performance for crop yield estimation across diverse agricultural environments\[[15](https://arxiv.org/html/2606.13959#bib.bib15),[16](https://arxiv.org/html/2606.13959#bib.bib16)\]\. Explainable AI frameworks have made these models increasingly useful to non\-specialist users\[[17](https://arxiv.org/html/2606.13959#bib.bib17)\]\.

Yet the geographic distribution of this work is skewed\. Systematic reviews of the crop yield prediction literature find that while South Africa, Ghana, and East Africa are increasingly well represented, West Africa, and Sierra Leone specifically, remains virtually absent from the published literature\[[12](https://arxiv.org/html/2606.13959#bib.bib12)\]\. The only published study directly examining post\-harvest losses in Sierra Leone used traditional statistical methods on a sample of 232 rice farmers across eight districts\[[13](https://arxiv.org/html/2606.13959#bib.bib13)\]\. No machine learning study has been built specifically on Sierra Leonean yield data\.

There is a second, methodological gap that this paper addresses directly\. Much of the small\-sample agricultural ML literature reports very high in\-sample or randomly cross\-validated accuracy \(R2\>0\.95R^\{2\}\>0\.95is common\) on national annual time series of only a few dozen observations\. As Section[4\.4](https://arxiv.org/html/2606.13959#S4.SS4)discusses, such designs are highly vulnerable to target leakage and to temporal information bleeding across random train–test splits\[[23](https://arxiv.org/html/2606.13959#bib.bib23),[24](https://arxiv.org/html/2606.13959#bib.bib24)\]\. This study adopts a strict anti\-leakage protocol and a walk\-forward evaluation, and benchmarks every model against naive baselines, a discipline that, as the results show, changes the conclusions entirely\.

### 1\.5\. Research Questions

Three research questions guide this study:

1. 1\.Can rice yield in Sierra Leone be forecast from the country’s existing agricultural statistics alone, using 25 years of FAOSTAT crop data \(2000–2024\), when evaluated against naive baselines under walk\-forward validation?
2. 2\.Does augmenting those statistics with freely available satellite climate data \(CHIRPS rainfall, NASA POWER temperature\) change the answer, and which climate signals carry the predictive weight?
3. 3\.What do the results imply for the implementation of the Feed Salone Strategy 2023–2030, and for the data\-infrastructure investments needed to make data\-driven agricultural governance sustainable in Sierra Leone?

### 1\.6\. Contributions

Four contributions emerge from this work\. First, it provides the first machine learning evidence base built specifically for Sierra Leonean agriculture, establishing under rigorous evaluation what the country’s existing public data can and cannot support\. Second, it demonstrates that free satellite climate data, requiring no accounts, licenses, or fees, is sufficient to move national rice yield forecasting from impossible to useful, reducing out\-of\-sample error by one third relative to naive persistence\. Third, it documents, as a methodological case study, how standard but flawed designs \(same\-year features, random train–test splits\) produce illusory accuracy on exactly this kind of small national time series\. Fourth, it translates the findings into concrete, data\-grounded policy recommendations aligned with the Feed Salone Strategy, and archives the full open\-source pipeline for replication in other data\-constrained agricultural economies\.

### 1\.7\. Paper Organization

Section 2 reviews the relevant literature\. Section 3 describes the study context and the three data sources\. Section 4 presents the methodological framework, including the anti\-leakage protocol and validation design\. Section 5 reports results\. Section 6 discusses findings, limits, and policy implications\. Section 7 concludes with policy recommendations\.

## 2\. Literature Review

### 2\.1\. Machine Learning in Agricultural Prediction

The application of machine learning to crop yield prediction has matured considerably over the past decade\. Early approaches relied on linear regression and simple statistical models that estimated yield as a function of rainfall and temperature\[[22](https://arxiv.org/html/2606.13959#bib.bib22)\]\. The limitations of those approaches, chiefly their inability to capture non\-linear relationships, interaction effects, and spatial variation, drove a shift toward ensemble methods and deep learning architectures that now dominate the field\.

Among supervised learning algorithms, Random Forest, XGBoost, and Gradient Boosting Machines have emerged as the most consistently reliable for crop yield prediction across diverse agricultural environments\[[15](https://arxiv.org/html/2606.13959#bib.bib15),[16](https://arxiv.org/html/2606.13959#bib.bib16)\]\. Deep learning architectures, particularly LSTM networks and CNNs, have shown strong results when integrating satellite\-derived vegetation indices and remote sensing data\[[12](https://arxiv.org/html/2606.13959#bib.bib12)\]\. However, these architectures demand large training datasets and computational resources that are rarely available in data\-constrained settings\. For smallholder agriculture in sub\-Saharan Africa, where short administrative time series rather than dense satellite stacks constitute the primary structured data source, ensemble methods offer a more practical and interpretable path to prediction\[[17](https://arxiv.org/html/2606.13959#bib.bib17)\]\.

The interpretability question deserves emphasis\. A recurring critique in the agricultural machine learning literature is that black\-box models, however accurate, are of limited use to farmers, extension officers, and policymakers who need to understand and trust what a model is telling them\[[12](https://arxiv.org/html/2606.13959#bib.bib12)\]\. SHAP \(SHapley Additive exPlanations\) has largely addressed this by providing feature\-level attribution that makes model outputs readable to non\-specialist users\[[21](https://arxiv.org/html/2606.13959#bib.bib21),[17](https://arxiv.org/html/2606.13959#bib.bib17)\]\. The integration of SHAP analysis into this study reflects that priority\.

### 2\.2\. Validation Pitfalls in Small\-Sample Yield Prediction

A growing methodological literature cautions that predictive accuracy claims in time\-series settings are only as credible as the validation design behind them\. Two failure modes are especially relevant to national annual yield series\. The first istarget leakage: the inclusion of predictors that are functions of, or near\-duplicates of, the outcome variable\[[23](https://arxiv.org/html/2606.13959#bib.bib23)\]\. In national crop statistics this arises easily\. An aggregate “cereals” indicator in a country where one cereal dominates production, for example, is arithmetically almost identical to that crop’s own series\. The second istemporal leakage through random splitting: when annual observations are assigned randomly to train and test sets, test\-year information enters training through lagged and rolling features of neighboring years, and the evaluation no longer simulates genuine forecasting\[[24](https://arxiv.org/html/2606.13959#bib.bib24),[25](https://arxiv.org/html/2606.13959#bib.bib25)\]\. Both failure modes inflateR2R^\{2\}dramatically on small samples\. The appropriate remedies, namely strictly lagged or exogenous predictors, expanding\-window walk\-forward evaluation, and comparison against naive baselines, are adopted throughout this study and described in Section[4\.4](https://arxiv.org/html/2606.13959#S4.SS4)\.

### 2\.3\. Machine Learning Applications in Sub\-Saharan African Agriculture

Research applying machine learning to agricultural prediction in sub\-Saharan Africa has grown meaningfully in recent years, though it remains concentrated in a handful of countries: Kenya, Rwanda, Nigeria, South Africa, and Ethiopia account for the majority of published work, while West Africa, and the smaller economies of the Mano River Union in particular, is conspicuously underrepresented\[[12](https://arxiv.org/html/2606.13959#bib.bib12)\]\.

A country\-level study on West African crop yield prediction trained models on FAOSTAT data from 1990 to 2020 across fourteen countries, including Sierra Leone\[[14](https://arxiv.org/html/2606.13959#bib.bib14)\]\. While that work showed that ML\-based yield prediction is feasible in the West African context, its multi\-country pooling and national\-level aggregation obscure the heterogeneity that matters most for agricultural policy\. A farmer in Kailahun District and a farmer in Port Loko District face different soil conditions, rainfall patterns, and market access constraints; a national average prediction is not directly useful to either of them\. This study responds by developing a Sierra Leone\-specific analytical pipeline calibrated to local data and policy context, while being explicit, in its limitations, that national aggregation remains a binding constraint that only household\- and district\-level microdata can relax\.

### 2\.4\. Post\-Harvest Loss Research in West Africa

Post\-harvest losses represent one of the most significant and least addressed threats to food security in sub\-Saharan Africa\. Meta\-analyses across the region estimate that smallholder farmers lose between 30 and 50 percent of their production annually to physical spoilage, quality degradation, and market failures\[[7](https://arxiv.org/html/2606.13959#bib.bib7)\]\. The causes are well documented: inadequate storage infrastructure, poor temperature and humidity management, delayed threshing and drying, pest and fungal contamination, and distance from processing facilities consistently appear as primary contributors\[[18](https://arxiv.org/html/2606.13959#bib.bib18),[8](https://arxiv.org/html/2606.13959#bib.bib8)\]\. Socioeconomic factors, including household poverty, limited credit access, and gender dynamics in post\-harvest labor, add further complexity\[[13](https://arxiv.org/html/2606.13959#bib.bib13)\]\.

What is missing is the application of machine learning to post\-harvest lossprediction: models that could identify high\-risk communities before losses happen, enabling proactive rather than reactive intervention\. Such models require household\-level data on storage, transport, and handling practices\. These data exist in Sierra Leone’s 2023 national agricultural survey but are not yet accessible \(Section[3](https://arxiv.org/html/2606.13959#S3)\)\. Post\-harvest loss modeling is therefore positioned in this paper as the next step of the research agenda rather than a delivered result\.

### 2\.5\. The Sierra Leone Agricultural Data Landscape

Sierra Leone’s agricultural data environment has improved in recent years, driven largely by the 50x2030 Initiative, a partnership between FAO, IFAD, the World Bank, and national statistical agencies aimed at transforming agricultural data systems in low\-income countries\[[6](https://arxiv.org/html/2606.13959#bib.bib6)\]\. The 2023 Sierra Leone Annual Agricultural Sample Survey \(SLAASS\), conducted by Statistics Sierra Leone in partnership with MAFS, represents the most comprehensive nationally representative agricultural household survey in the country’s history, covering 5,200 households across 520 enumeration areas\[[3](https://arxiv.org/html/2606.13959#bib.bib3)\]\.

Despite this progress, important structural challenges remain\. MAFS’s Planning, Evaluation, Monitoring, and Statistics Division manages a decentralized data collection system that faces real limitations in rural areas, where field staff lack training in analytical tools\[[5](https://arxiv.org/html/2606.13959#bib.bib5)\]\. A Data Ecosystem Mapping published in May 2024 under the 50x2030 Initiative concluded that a large gap between data collection and data use for decision\-making persists\[[6](https://arxiv.org/html/2606.13959#bib.bib6)\]\.

This paper engages directly with that gap\. By establishing empirically what existing publicly available data can and cannot support, it replaces a rhetorical case for data\-infrastructure investment with a measured one\.

## 3\. Study Context and Data

### 3\.1\. Agricultural Profile of Sierra Leone

Sierra Leone is a small coastal nation in West Africa with a population of approximately 8\.4 million people, of whom roughly 60 percent live in rural areas and depend on agriculture for their livelihoods\[[10](https://arxiv.org/html/2606.13959#bib.bib10)\]\. Agriculture contributes approximately 25 percent of GDP as of 2024\[[11](https://arxiv.org/html/2606.13959#bib.bib11)\], a figure that understates the sector’s importance to rural welfare, given that agricultural income is the primary livelihood source for the majority of the country’s poor\.

The agricultural system is dominated by smallholders\. According to the 2023 SLAASS, 92\.4 percent of agricultural households have between one and three economically active members, with most operating plots smaller than two hectares\[[3](https://arxiv.org/html/2606.13959#bib.bib3)\]\. Rice is the dominant staple crop, produced across all five administrative regions\. Sweet potato, cassava, groundnut, and oil palm are also widely cultivated\. The Eastern Region holds the largest share of agricultural plots at 31\.8 percent, followed by the Southern Region at 20\.4 percent and the North\-Western Region at 19\.6 percent\[[3](https://arxiv.org/html/2606.13959#bib.bib3)\]\.

### 3\.2\. Crop Statistics: FAOSTAT, 2000–2024

The first data source is the FAOSTAT Crops and Livestock Products database, maintained by the Food and Agriculture Organization of the United Nations\[[8](https://arxiv.org/html/2606.13959#bib.bib8)\]\. FAOSTAT provides standardized, nationally aggregated agricultural statistics compiled from official government submissions, FAO estimates, and imputation procedures designed to ensure time\-series consistency\. The database is freely accessible and requires no approval to download, which makes it a practical and fully reproducible foundation for research in data\-constrained settings\.

For Sierra Leone, the data cover the years 2000 through 2024, yielding a 25\-year annual time series\. Three indicators are extracted for each crop: area harvested \(hectares\), yield \(kg/ha\), and production quantity \(tonnes\)\. Nine crops with complete 25\-year coverage on all three indicators are included: rice, cassava, maize, groundnuts, oil palm fruit, sweet potato, sorghum, cocoa beans, and plantains\. Yams, although agriculturally relevant, are reported in FAOSTAT for Sierra Leone only from 2022 onward \(3 of 25 years\) and are therefore excluded, as are FAOSTAT aggregate items \(e\.g\., “Cereals, primary”\) and processed products, for reasons developed in Section[3\.4](https://arxiv.org/html/2606.13959#S3.SS4)\. The raw dataset and all processing code are archived in the study’s public GitHub repository \([https://github\.com/Denis060/sierraleone\-agri\-ml](https://github.com/Denis060/sierraleone-agri-ml)\) to ensure full reproducibility\.

Two properties of this source deserve explicit acknowledgment\. First, FAOSTAT provides national\-level aggregates only; it contains no household, plot, or district observations, and therefore cannot support the post\-harvest loss modeling component of this research agenda, which requires microdata on storage, transport, and handling\. The 2023 SLAASS would provide that microdata but was not accessible during the study period despite registration on the FAO microdata portal; it is treated as the primary direction for future research \(Section[6\.3](https://arxiv.org/html/2606.13959#S6.SS3)\)\. Second, a substantial share of the FAOSTAT values for Sierra Leone are not official national figures: across the full extract, only 467 of 2,593 records carry FAO’s “official figure” flag, with the remainder estimated or imputed by FAO or receiving agencies\. For the rice yield outcome series specifically, 14 of 25 annual values are flagged official, and the most recent years \(2022–2024\) are FAO estimates\. Some of the smoothness that any model exploits in this series therefore originates in FAO’s imputation procedures rather than in the underlying agricultural system, which is itself evidence of the thinness of the national statistical record that this paper’s policy recommendations address\.

### 3\.3\. Satellite Climate Data: CHIRPS and NASA POWER

The second and third data sources are free, globally available satellite climate products requiring no registration or license\.

CHIRPS rainfall\.Monthly precipitation for 2000–2024 is taken from the Climate Hazards Group InfraRed Precipitation with Station data \(CHIRPS v2\.0\), a quasi\-global 0\.05∘gridded product blending satellite imagery with in\-situ station data\[[26](https://arxiv.org/html/2606.13959#bib.bib26)\]\. Monthly Africa rasters are spatially averaged over Sierra Leone’s national boundary \(geoBoundaries ADM0\[[27](https://arxiv.org/html/2606.13959#bib.bib27)\]\) to produce a national monthly rainfall series\.

NASA POWER temperature\.Monthly 2\-meter air temperature \(T2M\) for 2000–2024 is taken from the NASA POWER agroclimatology API, sampled on a five\-point grid spanning the country \(approximately 7\.0–9\.9∘N, 10\.3–13\.3∘W\) and averaged into a national series\. NASA POWER’s precipitation product is retained solely as an independent cross\-check on CHIRPS; the two monthly rainfall series correlate atr=0\.73r=0\.73, providing reassurance that the national rainfall signal is not an artifact of a single product\.

### 3\.4\. Variable Construction

#### 3\.4\.1\. Outcome Variable: Rice Yield

The outcome variable is rice yield in kilograms per hectare, extracted from the FAOSTAT yield element for Sierra Leone across the full 25\-year study period\. The resulting annual series provides a continuous outcome variable suitable for supervised regression\.

#### 3\.4\.2\. Predictor Variables and the Anti\-Leakage Protocol

Predictors are organized into three groups, governed by one rule:no predictor may contain same\-year information about agricultural outcomes\.Every crop\-derived feature must be knowable before the prediction year’s harvest\. Exogenous climate observations are exempt from the lag requirement because they are measured independently of the outcome and are available during the growing season, before yield is realized\.

Group 1: lagged crop indicators \(35 features\)\.One\-year lags of area harvested, yield, and production for the nine crops, excluding rice’s own contemporaneous values\. Three classes of variables used in earlier, flawed versions of this pipeline are excluded by construction: \(i\) rice production per hectare, which is arithmetically the target; \(ii\) FAOSTAT aggregates such as “Cereals, primary,” of which rice constitutes 87–93 percent of production in every year of the series \(correlation with rice yield:r=0\.998r=0\.998\), making them near\-duplicates of the outcome; and \(iii\) all same\-year values of any crop variable\.

Group 2: autoregressive features\.Rice yield at lags of one, two, and three years, plus three\-year and five\-year rolling means computed strictly over past years \(the series is shifted before the rolling window is applied, so the window for yearttnever touches yeartt\)\.

Group 3: climate features \(used in the climate configurations\)\.Growing\-season total rainfall \(May–October\), early\-season rainfall \(May–June\), peak\-season rainfall \(July–September\), a standardized growing\-season rainfall anomaly whose climatology is computed only from pre\-test\-window years \(2000–2017\), the maximum one\-month rainfall deficit within the growing season, mean growing\-season temperature, and prior\-year growing\-season rainfall\. Structural indicators \(linear year trend; Ebola 2014–2016, COVID\-19 2020–2021, and Feed Salone 2023\+ dummies\) complete the set\.

A runtime assertion in the published pipeline fails the build if any same\-year crop variable reaches the predictor matrix\.

## 4\. Methodology

### 4\.1\. Research Design

Rice yield forecasting is treated as a supervised regression problem with an explicit forecasting simulation: in every evaluation, the model may use only information that would have been available before the year being predicted\. Three feature configurations are compared \(lagged crop statistics only, climate features only, and their combination\) across three ensemble learners and two naive baselines\. This factorial design separates two questions that single\-model studies conflate: whether thedatacontain a forecastable signal, and whethermachine learningextracts more of it than trivial rules do\.

### 4\.2\. Models

Three ensemble algorithms are compared\.Random Forest\[[19](https://arxiv.org/html/2606.13959#bib.bib19)\]averages decorrelated decision trees grown on bootstrap samples\.XGBoost\[[20](https://arxiv.org/html/2606.13959#bib.bib20)\]andGradient Boosting Machinesbuild trees sequentially, each correcting the residuals of the last; including both allows comparison of two boosting implementations\. Given the small sample, hyperparameters are fixed at deliberately conservative values rather than tuned: for XGBoost, maximum depth 2, learning rate 0\.05, 100 estimators, subsample 0\.8, and L2 regularizationλ=1\\lambda=1; the scikit\-learn GBM and Random Forest use comparably shallow, regularized settings\. No grid search is performed, because with roughly 13–18 training observations per fold, search procedures select for noise\. As a robustness check on whether tree ensembles are necessary at all, aRidge regressionon the climate features is evaluated under the identical protocol\.

### 4\.3\. Baselines

Two naive baselines are evaluated in the same walk\-forward loop as the learned models\.Persistencepredicts that this year’s yield equals last year’s\.Rolling meanpredicts the mean of the previous three years\. In a series with strong autocorrelation, persistence is a demanding benchmark; any model that cannot beat it has no forecasting value regardless of its in\-sample fit\.

### 4\.4\. Validation: Expanding\-Window Walk\-Forward

Evaluation uses an expanding\-window walk\-forward design: the models are trained on 2000–2017 and predict 2018; retrained on 2000–2018 and predict 2019; and so on through 2024\. This yields seven genuinely out\-of\-sample annual forecasts, each made using only prior information, exactly as an operational forecasting system would\. Random train–test splits are not used anywhere: with lagged and rolling features, randomly held\-out years exchange information with neighboring training years and the evaluation ceases to simulate forecasting\[[24](https://arxiv.org/html/2606.13959#bib.bib24),[25](https://arxiv.org/html/2606.13959#bib.bib25)\]\.

### 4\.5\. Evaluation Metrics

Performance is summarized by root mean squared error \(RMSE\) and mean absolute error \(MAE\), both in kg/ha, computed over the seven held\-out forecasts, with the coefficient of determination \(R2R^\{2\}\) reported for completeness\. With only seven test points spanning the most volatile period of the series \(including the 2018 collapse and 2021–2022 peak\),R2R^\{2\}against the test\-set mean is unstable and frequently negative; RMSE and MAE, and especially their comparison against the persistence baseline, are the primary metrics\. A per\-year error table is reported alongside the aggregates so that no single year’s result is hidden inside an average\.

### 4\.6\. Interpretability

For the best\-performing configuration, SHAP values\[[21](https://arxiv.org/html/2606.13959#bib.bib21)\]are computed on a model refit to the full series \(for interpretation only, after all evaluation is complete\) to identify which features drive predictions and in which direction\.

## 5\. Results

### 5\.1\. Agricultural Trends in Sierra Leone, 2000–2024

Understanding how crop yields and production have moved over the past 25 years provides context for the forecasting results\.

#### 5\.1\.1\. Rice Yield Dynamics

Figure[1](https://arxiv.org/html/2606.13959#S5.F1)shows rice yield in Sierra Leone from 2000 to 2024\. The 25\-year average stands at 1,441 kg/ha, well below the global average of approximately 4,600 kg/ha and far behind comparable tropical rice\-producing economies such as Thailand \(∼\\sim3,000 kg/ha\) and Vietnam \(∼\\sim5,900 kg/ha\)\[[8](https://arxiv.org/html/2606.13959#bib.bib8)\]\.

![Refer to caption](https://arxiv.org/html/2606.13959v1/figures/fig1_rice_yield_trend.png)Figure 1:Rice yield in Sierra Leone \(2000–2024\) with key agricultural and economic shock periods marked\. The dotted line is the 25\-year average of 1,441 kg/ha\. Source: FAOSTAT \(2025\)\.Three phases characterize the trend\. From 2000 to 2005, yields stagnated near 1,000–1,135 kg/ha\. From 2006 to 2013 they climbed, reaching 1,780 kg/ha by 2009 and a pre\-crisis plateau of 1,870 kg/ha over 2010–2013, likely reflecting the Smallholder Commercialization Programme and associated input subsidy interventions\[[1](https://arxiv.org/html/2606.13959#bib.bib1)\]\. The period from 2014 onward has been defined by volatility and repeated shocks\. The Ebola crisis disrupted agricultural labor supply and rural markets across all producing regions\. A more severe shock followed in 2018, when rice yield collapsed to 786 kg/ha, the lowest value in the series and a 58 percent decline from the 2013 plateau; lingering post\-Ebola disorganization, declining fertilizer access, and a documented contraction in area harvested converged\. Yields then recovered sharply to a series peak of 2,110 kg/ha in 2022 before contracting to 1,409 kg/ha in 2024, the first full implementation year of the Feed Salone Strategy\. The forecasting analysis below returns to both the 2018 collapse and the 2020–2022 recovery, because they set the limits of what any forecasting model can claim\.

#### 5\.1\.2\. Multi\-Crop Yield Patterns

Figure[2](https://arxiv.org/html/2606.13959#S5.F2)extends the picture across five key crops\.

![Refer to caption](https://arxiv.org/html/2606.13959v1/figures/fig2_multi_crop_yields.png)Figure 2:Crop yield trends in Sierra Leone \(2000–2024\) for key staple and cash crops\. Dashed lines show linear trend fits\. Source: FAOSTAT \(2025\)\.Cassava shows the most dramatic long\-run improvement, rising from approximately 6,500 kg/ha in 2000 to a peak of nearly 20,000 kg/ha in 2021\. Sweet potato shows the strongest and most consistent positive trend, growing from 2,600 kg/ha in 2000 to over 8,000 kg/ha by 2024\. Maize and groundnuts display high year\-to\-year volatility, with swings of 50 percent or more between consecutive years, reflecting rainfall sensitivity and the absence of irrigation in the communities where they are predominantly cultivated\. The groundnut spike to 2,400 kg/ha in 2019 followed by a collapse to 500 kg/ha the following year is particularly striking and warrants district\-level investigation once microdata become available\.

#### 5\.1\.3\. Production Overview

Figure[3](https://arxiv.org/html/2606.13959#S5.F3)shows 2024 production for the ten largest individual crops \(FAOSTAT aggregate items are excluded\)\.

![Refer to caption](https://arxiv.org/html/2606.13959v1/figures/fig3_production_overview.png)Figure 3:Agricultural production by individual crop, Sierra Leone, 2024\. Rice, the primary dietary staple, ranks second at 1,391,000 tonnes, behind cassava \(2,855,000 t\)\. Source: FAOSTAT \(2025\)\.Despite being the country’s primary dietary staple, rice ranks second in production volume at 1,391,000 tonnes, less than half of cassava’s 2,855,000 tonnes\. This imbalance between what Sierra Leone eats and what it grows most is a direct explanation for the country’s persistent rice import dependency, documented at 480,000 metric tons in 2022, and underscores the urgency of the rice yield agenda at the center of Feed Salone\.

### 5\.2\. Forecasting Performance

Table[1](https://arxiv.org/html/2606.13959#S5.T1)reports walk\-forward forecasting performance for all model–feature\-set combinations and both baselines over the seven held\-out years \(2018–2024\)\. Figure[4](https://arxiv.org/html/2606.13959#S5.F4)shows the same comparison visually\.

Table 1:Walk\-forward forecasting performance, rice yield, 2018–2024 \(7 out\-of\-sample years\)Feature setModelR2RMSE \(kg/ha\)MAE \(kg/ha\)Climate onlyXGBoost0\.542284\.1216\.3Climate onlyGradient Boosting0\.465306\.8242\.7Climate onlyRidge \(linear\)0\.222370\.2304\.0Climate onlyRandom Forest0\.162384\.2304\.6—Persistence \(t−1t\{\-\}1\)−0\.039\-0\.039427\.8341\.3Crop \+ climateRandom Forest−0\.097\-0\.097439\.5420\.5Crop \+ climateGradient Boosting−0\.181\-0\.181456\.1427\.2Crop \+ climateXGBoost−0\.191\-0\.191458\.0427\.0Crop lags onlyRandom Forest−0\.247\-0\.247468\.5454\.7—3\-yr rolling mean−0\.565\-0\.565524\.9511\.4Crop lags onlyXGBoost−0\.591\-0\.591529\.4497\.2Crop lags onlyGradient Boosting−0\.630\-0\.630535\.8496\.7
Expanding\-window walk\-forward evaluation: train 2000–2017→\\rightarrowpredict 2018, retrain through 2024\. Bold marks the best value per column\. With seven test points spanning the series’ most volatile years,R2R^\{2\}is unstable and frequently negative; RMSE and MAE relative to the persistence baseline are the primary metrics\.

![Refer to caption](https://arxiv.org/html/2606.13959v1/figures/fig4_model_comparison.png)Figure 4:Out\-of\-sample RMSE by model and feature set, walk\-forward 2018–2024\. Only the climate\-only configurations beat the persistence baseline \(dashed line\); models given lagged crop statistics, alone or in combination, do not\.Three results structure everything that follows\.

First, crop statistics alone cannot forecast rice yield\.Every model trained on the 35 lagged crop features performsworsethan naive persistence \(RMSE 468–536 vs\. 428 kg/ha\)\. The information in last year’s production statistics about this year’s yield is already fully captured by last year’s yield itself; the additional crop features contribute noise that the ensembles, with 13–18 training observations per fold, overfit\.

Second, free satellite climate data reverses the result\.The climate\-only XGBoost cuts forecast error by one third relative to persistence \(RMSE 284\.1 vs\. 427\.8 kg/ha; MAE 216\.3 vs\. 341\.3\)\. Critically, this is not a single\-model artifact: every learner improves when given climate features in place of crop lags, and even the linear Ridge model beats persistence \(RMSE 370\.2\)\. The climate signal is real and partly linear; the XGBoost–Ridge gap of 86 kg/ha suggests additional non\-linearity, though with seven test points that comparison rests heavily on two turning\-point years and is treated as suggestive rather than definitive\.

Third, more features hurt\.Adding the 35 crop lagstothe climate features degrades every model \(e\.g\., XGBoost RMSE rises from 284 to 458\)\. At this sample size, feature parsimony is not a stylistic preference but a binding statistical constraint, a finding with direct relevance to the many small\-sample agricultural ML studies that maximize feature counts\.

### 5\.3\. Where the Gains Come From: Per\-Year Analysis

Aggregate error statistics can conceal as much as they reveal with seven test points, so Table[2](https://arxiv.org/html/2606.13959#S5.T2)reports the per\-year picture for the decisive years \(the full per\-year table for all models is archived with the pipeline\)\. Figure[5](https://arxiv.org/html/2606.13959#S5.F5)plots actual versus predicted yield by year\.

Table 2:Per\-year absolute forecast error \(kg/ha\), climate\-only XGBoost vs\. persistence, selected yearsBold marks the lower error\. The climate model’s aggregate advantage is earned at turning points \(2019, 2023\), where persistence by construction lags one year behind\. Neither approach anticipated the 2018 collapse\. Full table:outputs/per\_year\_errors\.csvin the project repository\.

![Refer to caption](https://arxiv.org/html/2606.13959v1/figures/fig5_actual_vs_predicted.png)Figure 5:Actual versus walk\-forward predicted rice yield by year, 2018–2024, all models and baselines\. The climate\-only XGBoost tracks the turning points \(2019, 2023\) that the persistence baseline misses by construction; no model anticipates the 2018 collapse\.The mechanism behind the climate model’s advantage is visible in the table: persistence, by construction, lags one year behind at every turning point, while the climate model tracks the yield level\. In 2019 the climate model’s error is 171 kg/ha against persistence’s 789; in 2023, 33 against 601\. Robustness checks confirm the advantage is not an artifact of any single year: excluding 2018 entirelywidensthe gap \(climate RMSE 182 vs\. persistence 438 kg/ha\)\.

Equally important is the year both approaches failed\. In 2018, when yield collapsed 32 percent to 786 kg/ha, the climate model predicted 1,391 kg/ha, an error of 606 kg/ha, worse than persistence\. The 2018 collapse was not climatic in origin; it reflected institutional stressors \(post\-Ebola disorganization, fertilizer access, area contraction\) invisible to rainfall and temperature data\. The model’s value is therefore tracking normal\-year variation, not anticipating structural crises\. This boundary is made explicit here because it determines the appropriate policy use of such a system \(Section[6](https://arxiv.org/html/2606.13959#S6)\)\.

### 5\.4\. What Drives the Predictions: SHAP Analysis

![Refer to caption](https://arxiv.org/html/2606.13959v1/figures/fig6_shap_summary.png)Figure 6:SHAP feature attribution for the climate\-only XGBoost model \(refit on the full series for interpretation after evaluation\)\. Each point is one year; the horizontal axis is the feature’s contribution to predicted yield in kg/ha, colored by feature value \(low = blue, high = red\)\.SHAP attribution identifies early\-season rainfall \(May–June\) as the dominant driver of the model’s predictions, followed by total growing\-season rainfall \(May–October\) and prior\-year growing\-season rainfall\. The prominence of the May–June window is the single most policy\-relevant pattern in the analysis: it corresponds to the planting and establishment period for rice in Sierra Leone’s rain\-fed systems, and it is fully observable in CHIRPS by early July, months before harvest\. A season’s yield risk, to the extent it is climatic, is therefore knowable in time to act on it\.

Figure[7](https://arxiv.org/html/2606.13959#S5.F7)sets the rainfall–yield relationship against the full series, including its instructive exceptions: the 2018 collapse occurred without a commensurate rainfall anomaly, and the record yields of 2020–2022 occurred in below\-average rainfall years, consistent with input\-driven gains under the policy programs of that period rather than favorable weather\.

![Refer to caption](https://arxiv.org/html/2606.13959v1/figures/fig7_rainfall_vs_yield.png)Figure 7:Growing\-season rainfall and rice yield by year, 2000–2024\. The association is positive in typical years; the 2018 collapse and the dry\-but\-high 2020–2022 period mark its limits\.

## 6\. Discussion

### 6\.1\. What the Results Establish, and What They Rule Out

Read together, the results draw a precise map of what is currently possible\. Sierra Leone’s existing crop statistics, used alone, contain no forecastable signal beyond what last year’s yield already provides; a Ministry analyst equipped only with FAOSTAT extracts cannot build a yield forecasting system, no matter how sophisticated the algorithm\. Adding free satellite climate data changes that conclusion: forecast error falls by one third, the gain is robust across model classes and to the exclusion of the most anomalous year, and the dominant signal, early\-season rainfall, is available months before harvest\. At the same time, two boundaries emerge clearly from the analysis\. The model does not anticipate structural collapses whose causes are institutional rather than climatic, as 2018 demonstrates; and the dry\-but\-record\-high years of 2020–2022 show that policy and input effects can dominate weather, which is, from a development perspective, encouraging\.

A methodological lesson accompanies the substantive one\. An earlier version of this pipeline, using same\-year features, FAOSTAT aggregate items, and a random 70/30 split, produced an apparentR2R^\{2\}of 0\.96\. Every component of that result dissolved under the anti\-leakage protocol and walk\-forward evaluation: the aggregate “cereals” feature was a near\-duplicate of the target \(r=0\.998r=0\.998\), same\-year features answered a question no forecaster can be asked, and random splitting let test\-year information leak through lagged features\. The corrected headline numbers are far smaller and far more useful\. Given how common the flawed design is in the small\-sample agricultural ML literature\[[23](https://arxiv.org/html/2606.13959#bib.bib23),[24](https://arxiv.org/html/2606.13959#bib.bib24)\], the before\-and\-after contrast documented in this study’s public repository may be of independent value to the field\.

### 6\.2\. Policy Implications for Feed Salone

#### 6\.2\.1\. A Rainfall\-Based Early Warning Capability Is Available Now

The most actionable finding is the dominance of May–June rainfall in the model’s predictions\. CHIRPS data for the planting window is published with short latency and is free; a monitoring workflow that flags anomalously dry planting seasons by early July would give MAFS several months of lead time for targeted extension support, input mobilization, and import planning, before harvest confirms the problem\. No new data collection is required to stand up this capability; it is an analytical product of data that already exists\.

#### 6\.2\.2\. Forecasting Complements, but Cannot Replace, Institutional Monitoring

The 2018 lesson cuts the other way: the largest single\-year food security shock in the series was invisible to climate data because its causes were institutional\. A yield intelligence system for Sierra Leone therefore needs two layers: climate\-based seasonal forecasting for weather\-driven variation, and administrative monitoring of input markets, fertilizer availability, and area planted for the institutional shocks that climate cannot see\. The 2020–2022 period offers the optimistic corollary: input\-driven gains overrode unfavorable weather, implying that the policy levers Feed Salone controls are powerful enough to dominate climate variation in normal years\.

#### 6\.2\.3\. The Data\-Infrastructure Case, Restated Empirically

This study’s central institutional finding is no longer rhetorical\. The country’s official agricultural statistics are thin: only 467 of 2,593 FAOSTAT records for Sierra Leone are flagged as official figures, and the most recent rice yields are FAO estimates\. That thinness has a measurable cost: the statistics cannot support forecasting\. The marginal value of better data is equally measurable: one free satellite product moved the system from useless to useful\. District\-level microdata such as the 2023 SLAASS is the next increment, and the one that would move prediction to the administrative level where extension decisions are actually made\. MAFS’s PEMSD division currently lacks the technical capacity to operate such analytics\[[5](https://arxiv.org/html/2606.13959#bib.bib5)\]; the entire pipeline used here is openly archived to lower that barrier\.

### 6\.3\. Limitations

Five limitations should be acknowledged\. First, national aggregation masks the district\-level heterogeneity in soils, rainfall, and practices that matters most for targeting; only microdata can relax this\. Second, the evaluation rests on seven out\-of\-sample years, the maximum the series allows under honest walk\-forward design, but few enough that all aggregate metrics, and especially the XGBoost–Ridge comparison, should be read as indicative\. Third, roughly half the FAOSTAT records are FAO estimates rather than official figures, including the 2022–2024 rice yields; measurement error in the outcome bounds achievable accuracy\. Fourth, the climate features are national averages of products with their own uncertainties, partially mitigated by the CHIRPS–NASA POWER cross\-check \(r=0\.73r=0\.73\)\. Fifth, post\-harvest loss modeling, a co\-equal objective of this research agenda, requires household\-level data on storage, transport, and handling that exist in the 2023 SLAASS but were not accessible during the study period despite registration on the FAO microdata portal; it is the immediate next step of this work rather than a delivered result\.

## 7\. Conclusion and Policy Recommendations

This paper set out to answer a question that Sierra Leone’s Feed Salone Strategy implicitly assumes has a positive answer: can the country’s agricultural outcomes be forecast from available data? The answer is conditional, and the conditions themselves are informative\. Twenty\-five years of national crop statistics, evaluated under a strict anti\-leakage protocol and walk\-forward validation, cannot outperform the naive rule that next year will resemble this one\. Free satellite climate data, CHIRPS rainfall above all, changes the answer: forecast error falls by one third, the gain holds across model classes, and the dominant predictor, May–June planting\-season rainfall, is observable months before harvest\. The boundaries are equally clear: institutional shocks like the 2018 collapse are invisible to climate data, and input\-driven policy effects can dominate weather, as 2020–2022 showed\.

Five policy recommendations follow from the empirical findings\.

Recommendation 1: Stand up a CHIRPS\-based seasonal rainfall monitor within MAFS\.The single strongest predictor of rice yield, May–June rainfall, is free, published with short latency, and observable by early July\. A planting\-season rainfall dashboard flagging anomalously dry starts would give MAFS months of lead time for extension support and import planning, at essentially zero data cost\.

Recommendation 2: Build a two\-layer yield intelligence system\.Pair climate\-based seasonal forecasting with administrative monitoring of input markets, fertilizer availability, and area planted\. The 2018 collapse demonstrates that the largest shocks can be institutional, not climatic, and no climate model will see them coming\.

Recommendation 3: Front\-load post\-harvest infrastructure investment\.Cassava, the country’s largest crop by volume at 2\.9 million tonnes, is also its most spoilage\-prone\[[18](https://arxiv.org/html/2606.13959#bib.bib18)\]\. If Feed Salone’s productivity agenda succeeds, bumper harvests will strain storage and processing capacity precisely when output is highest; that capacity should be built before the gains arrive, not in response to them\.

Recommendation 4: Prioritize SLAASS microdata access and district\-level modeling\.National aggregates are a binding ceiling on what forecasting can deliver\. The 2023 SLAASS microdata, covering 5,200 households across 520 enumeration areas, would extend prediction and post\-harvest loss modeling to the district level where extension decisions are made\. Streamlining researcher access to this data is among the cheapest high\-return actions available to MAFS and Statistics Sierra Leone\.

Recommendation 5: Invest in Sierra Leonean data science capacity\.Every result in this paper was produced with free data and open\-source tools; the constraint is not data or technology but analytical capacity within the country’s institutions\[[5](https://arxiv.org/html/2606.13959#bib.bib5)\]\. The most durable path to data\-driven agricultural governance is a generation of Sierra Leonean researchers and analysts who understand both the tools and the context\. Programs like RiseAfrica Foundation’s RiseLab represent exactly that investment, and the full pipeline behind this study is archived publicly to support it\.

Agriculture has always been how Sierra Leone has fed itself\. This study shows precisely where data science can now help, and where it cannot yet\. The rainfall signal is free and waiting to be used\. The deeper gains wait on the data the country has already collected but not yet opened, and on the people trained to use it\. What remains is the institutional commitment to bring them together\.

## Data Availability Statement

All data used in this study are publicly available\. FAOSTAT crop statistics for Sierra Leone \(2000–2024\) are available from[https://www\.fao\.org/faostat](https://www.fao.org/faostat)\. CHIRPS v2\.0 monthly precipitation rasters are available from the Climate Hazards Center, University of California, Santa Barbara \([https://www\.chc\.ucsb\.edu/data/chirps](https://www.chc.ucsb.edu/data/chirps)\)\. NASA POWER agroclimatology data are available from[https://power\.larc\.nasa\.gov](https://power.larc.nasa.gov/)\. Administrative boundaries are from geoBoundaries \([https://www\.geoboundaries\.org](https://www.geoboundaries.org/)\)\. The complete analytical pipeline, including all preprocessing, feature engineering, model training, evaluation code, derived datasets, and per\-year results, is archived at[https://github\.com/Denis060/sierraleone\-agri\-ml](https://github.com/Denis060/sierraleone-agri-ml)\.

## Conflict of Interest Statement

The author declares no conflict of interest\. The author is the founder of the RiseAfrica Foundation for STEM and Innovation, a non\-profit organization referenced in Recommendation 5; this affiliation is disclosed on the title page\.

## Acknowledgments

The author thanks the Food and Agriculture Organization of the United Nations, the Climate Hazards Center at UC Santa Barbara, and the NASA POWER project for maintaining the open data resources on which this study depends\.

## References

- \[1\]World Bank\. \(2022\)\.Agricultural Data and Smallholder Decision\-Making in Sub\-Saharan Africa\. Washington, DC: World Bank Group\.
- \[2\]United States Department of Agriculture\. \(2023\)\.Sierra Leone: Agriculture Sector Country Commercial Guide\. Washington, DC: USDA Foreign Agricultural Service\.
- \[3\]Statistics Sierra Leone & Ministry of Agriculture and Food Security\. \(2023\)\.2023 Sierra Leone Annual Agricultural Sample Survey Report\. Freetown: Stats SL/MAFS\. Available at:[https://www\.statistics\.sl](https://www.statistics.sl/)
- \[4\]Government of Sierra Leone\. \(2023\)\.Feed Salone Strategy 2023–2030: A Blueprint for Agricultural Transformation in Sierra Leone\. Freetown: Office of the President / Ministry of Agriculture and Food Security\.
- \[5\]Fund for Innovation in Development\. \(2024\)\.Building Data\-Driven Agricultural Policy in Sierra Leone\. Project Documentation\. Available at:[https://fundinnovation\.dev](https://fundinnovation.dev/)
- \[6\]50x2030 Initiative\. \(2024\)\.One Year of Feed Salone: Progress Toward Food Security Objectives in Sierra Leone with 50x2030 Support\. Rome: FAO/IFAD\.
- \[7\]Affognon, H\., Mutungi, C\., Sanginga, P\., & Borgemeister, C\. \(2015\)\. Unpacking post\-harvest losses in sub\-Saharan Africa: A meta\-analysis\.World Development, 66, 49–68\.
- \[8\]Food and Agriculture Organization of the United Nations\. \(2019\)\.The State of Food and Agriculture: Moving Forward on Food Loss and Waste Reduction\. Rome: FAO\.
- \[9\]Government of Sierra Leone\. \(2022\)\.National Rice Development Strategy II \(NRDS2\)\. Freetown: Ministry of Agriculture and Food Security\.
- \[10\]World Bank\. \(2023\)\.Sierra Leone Overview: Development News, Research, Data\. Washington, DC: World Bank Group\.
- \[11\]World Bank\. \(2024\)\.World Development Indicators: Agriculture, forestry, and fishing, value added \(% of GDP\) – Sierra Leone\. Washington, DC: World Bank Group\. Available at:[https://data\.worldbank\.org](https://data.worldbank.org/)
- \[12\]Frontiers in Artificial Intelligence\. \(2026\)\. Maize yield prediction using machine learning: A systematic literature review\.Frontiers in Artificial Intelligence, 9\.[https://doi\.org/10\.3389/frai\.2026\.1735157](https://doi.org/10.3389/frai.2026.1735157)
- \[13\]Tandason, A\.F\., Bah, A\.S\., George, D\.R\., Musa, M\.F\., & Sheriff, H\.M\. \(2023\)\. Assessing post\-harvest losses of rice processing at Agricultural Business Centers \(ABCs\) in rice production in Sierra Leone\.International Journal of Agricultural Economics, 8\(2\), 48–67\.
- \[14\]Crops yield prediction based on machine learning models: Case of West African countries\. \(2022\)\.Smart Agricultural Technology, 2, 100168\.
- \[15\]Van Klompenburg, T\., Kassahun, A\., & Catal, C\. \(2020\)\. Crop yield prediction using machine learning: A systematic literature review\.Computers and Electronics in Agriculture, 177, 105709\.
- \[16\]Liakos, K\.G\., Busato, P\., Moshou, D\., Pearson, S\., & Bochtis, D\. \(2018\)\. Machine learning in agriculture: A review\.Sensors, 18\(8\), 2674\.
- \[17\]Arrieta, A\.B\., Díaz\-Rodríguez, N\., Del Ser, J\., et al\. \(2020\)\. Explainable Artificial Intelligence \(XAI\): Concepts, taxonomies, opportunities and challenges toward responsible AI\.Information Fusion, 58, 82–115\.
- \[18\]Sheahan, M\., & Barrett, C\.B\. \(2017\)\. Review: Food loss and waste in sub\-Saharan Africa\.Food Policy, 70, 1–12\.
- \[19\]Breiman, L\. \(2001\)\. Random forests\.Machine Learning, 45\(1\), 5–32\.
- \[20\]Chen, T\., & Guestrin, C\. \(2016\)\. XGBoost: A scalable tree boosting system\.Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794\.
- \[21\]Lundberg, S\.M\., & Lee, S\.I\. \(2017\)\. A unified approach to interpreting model predictions\.Advances in Neural Information Processing Systems, 30\.
- \[22\]Lobell, D\.B\., & Burke, M\.B\. \(2010\)\. On the use of statistical models to predict crop yield responses to climate change\.Agricultural and Forest Meteorology, 150\(11\), 1443–1452\.
- \[23\]Kaufman, S\., Rosset, S\., Perlich, C\., & Stitelman, O\. \(2012\)\. Leakage in data mining: Formulation, detection, and avoidance\.ACM Transactions on Knowledge Discovery from Data, 6\(4\), 1–21\.
- \[24\]Bergmeir, C\., & Benítez, J\.M\. \(2012\)\. On the use of cross\-validation for time series predictor evaluation\.Information Sciences, 191, 192–213\.
- \[25\]Roberts, D\.R\., Bahn, V\., Ciuti, S\., et al\. \(2017\)\. Cross\-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure\.Ecography, 40\(8\), 913–929\.
- \[26\]Funk, C\., Peterson, P\., Landsfeld, M\., et al\. \(2015\)\. The climate hazards infrared precipitation with stations — a new environmental record for monitoring extremes\.Scientific Data, 2, 150066\.
- \[27\]Runfola, D\., Anderson, A\., Baier, H\., et al\. \(2020\)\. geoBoundaries: A global database of political administrative boundaries\.PLoS ONE, 15\(4\), e0231866\.

Can Machine Learning Forecast Rice Yields in Data-Constrained Settings? Satellite Climate Data, National Crop Statistics, and Lessons from Sierra Leone

Similar Articles

Do Foundation Model Embeddings Improve Cross-Country Crop Yield Generalisation? A Leave-One-Country-Out Evaluation in Sub-Saharan Africa

Measuring Poverty and Inequality with Reduced Data: A Machine Learning Approach Using Nigerian Household Data

Physics-Informed Machine Learning for Short-Term Flood Prediction

A Survey on Data-Driven Models for Soil Moisture Regression and Classification

Measuring the impact of learning with AI in Sierra Leone and beyond

Submit Feedback

Similar Articles

Do Foundation Model Embeddings Improve Cross-Country Crop Yield Generalisation? A Leave-One-Country-Out Evaluation in Sub-Saharan Africa

Measuring Poverty and Inequality with Reduced Data: A Machine Learning Approach Using Nigerian Household Data

Physics-Informed Machine Learning for Short-Term Flood Prediction

A Survey on Data-Driven Models for Soil Moisture Regression and Classification

Measuring the impact of learning with AI in Sierra Leone and beyond