We want high-quality malaria prevalence maps using:
The main idea:
\[ \text{Observed data} \approx \underbrace{\text{Complex covariate signal}}_{\text{ML}} + \underbrace{\text{Residual spatial structure}}_{\text{SPDE/INLA}} \]
We will:
In practice, malaria risk depends on covariates in complicated ways:
ML models handle these patterns well.
But ML typically does not provide:
So we combine them.
By the end of Day 3 you should be able to:
At locations \(s_i\), we observe:
\[Y_i \mid p(s_i) \sim \text{Binomial}(N_i,p(s_i))\]
We work on the linear predictor scale:
\[\eta(s)=\text{logit}\{p(s)\}\]
We decompose the linear predictor as:
\[\eta(s)=m(x(s))+S(s)+ϵ(s)\]
We will estimate:
Directly fitting ML on \(Y_i/N_i\) is possible, but:
We use an empirical logit transform:
\[\tilde{\eta}_i =\log\left(\frac{Y_i+0.5}{(N_i-Y_i)+0.5}\right)\]
This stabilises extremes and matches the binomial model scale.
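A minimal sketch of the empirical logit, assuming counts arrive as (positives, number tested) pairs. The function name `empirical_logit` and the example clusters are hypothetical, for illustration only.

```python
import math

def empirical_logit(y, n, offset=0.5):
    """Empirical logit of y positives out of n tested, with a continuity offset."""
    if not 0 <= y <= n:
        raise ValueError("need 0 <= y <= n")
    return math.log((y + offset) / ((n - y) + offset))

# Hypothetical survey clusters: (positives, number tested)
clusters = [(0, 20), (5, 20), (20, 20)]
eta_tilde = [empirical_logit(y, n) for y, n in clusters]
```

Note that clusters with 0 or all positives still get a finite value, which is the point of the 0.5 offset.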
We covered MBG/SPDE on Day 1, so the baseline model is restated here for reference:
\[Y_i \mid p(s_i) \sim \text{Binomial}(N_i,p(s_i))\]
\[\text{logit}\{p(s)\} = \beta_0+\beta^\top x(s)+S(s)+\epsilon(s) \]
(We will compare this to the hybrid later.)
We treat ML as a flexible estimator of the systematic component:
\[m(x) \approx E\{\eta(s) \mid x(s)=x\}\]
ML learns:
Then we compute residuals:
\[r_i=\tilde{\eta}_i-\hat{m}(x_i)\]
If \(r_i\) is spatially correlated → we need a spatial residual model. If not → the ML prediction alone may be adequate, and a spatial field adds little.
A Random Forest is an ensemble of regression trees.
Each tree partitions covariate space into regions and predicts a constant:
\[\hat{m}_b(x)= \sum_{k=1}^{K_b} c_{bk} 1 (x \in R_{bk})\]
Forest prediction is the average:
\[\hat{m}_{RF}(x)= \frac{1}{B} \sum_{b=1}^{B} \hat{m}_b(x)\]
Two sources of randomness:
This reduces variance and stabilises predictions.
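To make the averaging formula concrete, here is a toy sketch in plain Python. It is not a full Random Forest: each tree is a depth-1 stump, and the randomness is assumed to come from bootstrap resampling of the data plus one randomly chosen covariate per tree. The names `fit_stump`, `fit_forest`, `predict_forest` and the toy data are hypothetical.

```python
import random

def fit_stump(X, y, feature):
    """Depth-1 regression tree on one feature: two regions, one constant each."""
    values = sorted(set(row[feature] for row in X))
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [yi for row, yi in zip(X, y) if row[feature] <= thr]
        right = [yi for row, yi in zip(X, y) if row[feature] > thr]
        c_left, c_right = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - c_left) ** 2 for v in left)
               + sum((v - c_right) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, thr, c_left, c_right)
    if best is None:  # feature is constant in this sample: predict the mean
        mean = sum(y) / len(y)
        return lambda x: mean
    _, thr, c_left, c_right = best
    return lambda x: c_left if x[feature] <= thr else c_right

def fit_forest(X, y, n_trees=200, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
        feature = rng.randrange(p)                  # random covariate for this tree
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return trees

def predict_forest(trees, x):
    """Forest prediction: the average of the B tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)

# Toy data: the response jumps when covariate 0 exceeds 1.5; covariate 1 is noise.
X = [[t / 10, (7 * t % 10) / 10] for t in range(30)]
y = [1.0 if row[0] > 1.5 else -1.0 for row in X]
trees = fit_forest(X, y)
```

Calling `predict_forest(trees, [0.2, 0.5])` and `predict_forest(trees, [2.8, 0.5])` shows the averaged prediction moving with covariate 0. A real RF grows deeper trees and samples candidate covariates at every split.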
Example intuition:
No explicit interaction terms required.
RF offers some uncertainty tools:
But a standard RF fit does not produce a coherent spatial posterior.
In this workshop:
XGBoost uses boosting rather than bagging.
Model is additive:
\[\hat{m}(x)= \sum_{t=1}^{T}f_t(x)\]
where each \(f_t\) is a tree that corrects errors of the previous ensemble.
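The "each tree corrects the previous ensemble" idea can be sketched with plain gradient boosting under squared error. This is a simplification, not XGBoost itself (which adds regularisation and second-order information). The stump learner, the learning rate of 0.3, and the toy sine data are assumptions for illustration.

```python
import math

def fit_stump(x, r):
    """Best single-split stump (threshold, left constant, right constant) for targets r."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= thr]
        right = [ri for xi, ri in zip(x, r) if xi > thr]
        c_left, c_right = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - c_left) ** 2 for v in left)
               + sum((v - c_right) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, thr, c_left, c_right)
    return best[1:]

def predict(trees, x0):
    """Additive model: sum the contributions f_t(x0) of all trees."""
    return sum(cl if x0 <= thr else cr for thr, cl, cr in trees)

def boost(x, y, n_rounds=50, learning_rate=0.3):
    trees, pred = [], [0.0] * len(y)
    for _ in range(n_rounds):
        # Each new tree is fitted to the errors of the current ensemble.
        resid = [yi - pi for yi, pi in zip(y, pred)]
        thr, c_left, c_right = fit_stump(x, resid)
        tree = (thr, learning_rate * c_left, learning_rate * c_right)
        trees.append(tree)
        pred = [pi + predict([tree], xi) for xi, pi in zip(x, pred)]
    return trees

def mse(trees, x, y):
    return sum((yi - predict(trees, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Toy data: a smooth nonlinear signal on a scalar covariate.
x = [i / 10 for i in range(40)]
y = [math.sin(xi) for xi in x]
trees = boost(x, y)
```

Comparing `mse(boost(x, y, n_rounds=1), x, y)` with `mse(trees, x, y)` shows the training error shrinking as trees are added, which is also why boosting needs care to avoid overfitting.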
It often achieves higher accuracy but:
Note that RF is often more transparent.
We pick the ML model based on:
Hybrid modelling is not “RF vs XGB” as a philosophy:
After ML, compute residuals:
\[r_i=\tilde{\eta}_i-\hat{m}(x_i)\]
If \(r_i\) is spatially correlated, ML has not captured all structure.
Then we fit:
\[r_i=S(s_i)+\epsilon_i, \quad \epsilon_i∼N(0,\sigma^2_\epsilon)\]
This is a clean and interpretable hybrid decomposition.
Empirical variogram:
\[\hat{\gamma}(u)= \frac{1}{2|N(u)|} \sum_{(i,j) \in N(u)} (r_i-r_j)^2\]
Interpretation:
Procedure:
If the observed variogram lies outside the envelope → evidence of spatial correlation.
This justifies adding the spatial residual field \(S(s)\).
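A rough sketch of the variogram check in plain Python. It assumes the envelope is built by randomly permuting residuals across locations (which breaks any spatial structure) and taking the pointwise min and max over permutations. The binning, the 99 permutations, and the toy grid are assumptions, and the function names are hypothetical.

```python
import math
import random

def empirical_variogram(coords, resid, bin_edges):
    """gamma-hat per distance bin; None where a bin contains no pairs."""
    n_bins = len(bin_edges) - 1
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            for k in range(n_bins):
                if bin_edges[k] <= d < bin_edges[k + 1]:
                    sums[k] += (resid[i] - resid[j]) ** 2
                    counts[k] += 1
                    break
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]

def permutation_envelope(coords, resid, bin_edges, n_perm=99, seed=1):
    """Pointwise min/max variogram over random reassignments of residuals to sites."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_perm):
        shuffled = list(resid)
        rng.shuffle(shuffled)
        sims.append(empirical_variogram(coords, shuffled, bin_edges))
    lower, upper = [], []
    for k in range(len(bin_edges) - 1):
        vals = [s[k] for s in sims if s[k] is not None]
        lower.append(min(vals) if vals else None)
        upper.append(max(vals) if vals else None)
    return lower, upper

# Toy example: residuals with a smooth spatial trend on an 8 x 8 grid.
coords = [(i, j) for i in range(8) for j in range(8)]
resid = [0.5 * (i + j) for i, j in coords]
bin_edges = [0.0, 1.5, 3.0, 4.5, 6.0]

observed = empirical_variogram(coords, resid, bin_edges)
lower, upper = permutation_envelope(coords, resid, bin_edges)
outside = [o is not None and (o < lo or o > hi)
           for o, lo, hi in zip(observed, lower, upper)]
```

Here the short-distance bin falls well below the envelope: nearby residuals are far more alike than chance would give, so a spatial residual field is warranted.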
Residuals are approximately continuous:
\[r_i \mid S(s_i) \sim N(S(s_i),\sigma^2_\epsilon)\]
\[S(s) \sim \text{Matérn SPDE}\]
Outputs we get from INLA:
This is where the hybrid gets principled spatial uncertainty.
If ML captures most covariate structure, residual field:
This often improves:
At a new location \(s\):
\[\eta_{hyb}(s)=\hat{m}(x(s))+\hat{S}(s)\]
Convert back to prevalence:
\[\hat{p}_{hyb}(s)=\text{logit}^{-1}(\eta_{hyb}(s))\]
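The prediction step can be sketched as below, assuming the ML prediction and the posterior mean of the spatial field are already available on the logit scale. The function names and the numbers are hypothetical.

```python
import math

def inv_logit(eta):
    """Numerically stable inverse logit."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def hybrid_prevalence(m_hat, s_hat):
    """Add the ML and spatial components on the logit scale, then back-transform."""
    return inv_logit(m_hat + s_hat)

# Hypothetical values at one prediction location.
m_hat = -1.2   # ML prediction, logit scale
s_hat = 0.4    # posterior mean of the spatial residual field
p_hat = hybrid_prevalence(m_hat, s_hat)
```

The addition happens on the logit scale, not the prevalence scale, so the result always stays between 0 and 1.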
In principle there are three components:
In this workshop, we quantify (2) very clearly, and discuss how to extend to (3).
INLA gives:
\[S(s) \mid \text{data} \approx \text{posterior with mean and sd}\]
So we can map:
Propagation to prevalence (approx):
Baseline geostatistical model uncertainty mixes:
Hybrid separates sources more clearly:
So we can say:
If you want full hybrid uncertainty you can add ML uncertainty via:
Then combine:
\[\eta^{(b)}(s)=m^{(b)}(x(s))+S^{(b)}(s)\]
This yields full predictive intervals for prevalence.
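A sketch of the combination step at a single location. It assumes we already hold B draws of the ML component (for example from bootstrap refits, which is an assumption here) and B posterior draws of the spatial field, and that the two sets can be paired independently. The simulated draws below are placeholders, not real model output.

```python
import math
import random
import statistics

def inv_logit(eta):
    """Numerically stable inverse logit."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def prevalence_draws(m_draws, s_draws):
    """Pair the b-th ML draw with the b-th spatial draw, then back-transform."""
    return [inv_logit(m + s) for m, s in zip(m_draws, s_draws)]

def interval_95(draws):
    """Equal-tailed 95% interval from the 2.5% and 97.5% sample quantiles."""
    cuts = statistics.quantiles(draws, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return cuts[0], cuts[-1]

# Placeholder draws standing in for m^(b)(x(s)) and S^(b)(s).
rng = random.Random(42)
B = 2000
m_draws = [rng.gauss(-1.0, 0.2) for _ in range(B)]
s_draws = [rng.gauss(0.3, 0.4) for _ in range(B)]

p_draws = prevalence_draws(m_draws, s_draws)
lo, hi = interval_95(p_draws)
```

Repeating this at every prediction location gives lower and upper prevalence maps that reflect both ML and spatial uncertainty.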
Baseline:
\[\eta(s)=\beta_0+ \beta^\top x(s)+S(s)\]
Hybrid:
\[\eta(s)=m(x(s))+S(s)\]
So differences are entirely in how we model the systematic component:
Hybrid often improves:
But hybrid can fail if:
Full Bayesian hybrid:
Spatio-temporal hybrids (Bamidele Toba):
\[\eta(s,t)=m(x(s,t))+S(s,t)\]