Across all areas of data science there is huge demand for innovative modeling solutions aimed at forecasting and elucidating dynamic phenomena. High profile use cases of modeling and forecasting dynamic phenomena include:
- Finance — prediction of share price movements or commodity price fluctuations
- Biomedical science — prediction of biological trajectories, e.g. growth, evolution, determination of drug treatment impact
- Epidemiology — assessing the impact of intervention strategies such as vaccination on epidemic waves of infection, e.g. covid-19
- Energy — predicting the output of wind or wave turbines, electricity demand forecasting
- Manufacturing — optimizing the output of continuous industrial processes such as evaporation in food production
Machine learning methods, e.g. LSTM networks or random forests, have been used to tackle these and similar problems, but such approaches are not without challenges. This article presents a different modeling approach — Bayesian diffusion modeling — which is a white-box method capable of describing rich dynamic behavior using only a few parameters. This methodology is somewhat advanced, at least in a mathematical sense, as diffusion processes involve stochastic calculus, but real-world application can be relatively painless using existing R/Python/Julia libraries, plus some programming knowledge and a bit of previous exposure to Bayesian inference.
An end-to-end example is presented which demonstrates Bayesian diffusion modeling applied to an open-source dataset. R is used, specifically the R-nimble library, but the modeling could equally well be implemented in Python using, e.g., Stan or TensorFlow Probability. R-nimble is a general computing framework and has an active community on Google Groups.
It is hoped that this short case study will stimulate interest in this rich and elegant methodology. It is not new, but neither is it yet part of many data scientists’ modeling toolbox, and it could be a worthy addition. For forecasting and prediction use cases in particular, because this is a continuous-time methodology, handling irregularly spaced data comes built-in, whereas modeling serial correlation with unevenly spaced observations can be rather difficult in other approaches.
1. Bayesian diffusion modeling — building blocks
Bayesian diffusion modeling comprises two key building blocks:
Step 1 - define a stochastic process that is flexible enough to describe the dynamic phenomenon of interest. This is encapsulated in a stochastic differential equation (SDE); many standard formulations exist.
Step 2 - estimate the parameters of the SDE(s) defined in step 1 from the data using Markov chain Monte Carlo (MCMC) simulation, including, if applicable, additional parameters needed to address complexities such as measurement error and hierarchical random effects structures.
SDEs are a highly flexible and concise way to describe dynamic phenomena and are widely used in quantitative finance, with some examples in medicine and energy. The model’s parameters can be estimated from data using MCMC, which, while computationally expensive, is accessible and has the considerable advantage of being almost universally applicable regardless of model complexity. Combining these two methodologies gives Bayesian diffusion modeling.
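For reference, in standard notation an SDE describes the evolution of a process X(t) via

dX(t) = a(X(t), t) dt + b(X(t), t) dW(t)

where a(·) is the drift term, b(·) is the diffusion term and W(t) is a Brownian motion. The Gompertz model used later in this article is one particular choice of drift and diffusion.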
The use case example presented below is somewhat simplified, to make it as easy to follow as possible, but the same modeling steps are present in more realistic situations. The main difference is that here the model formulation has been chosen a priori, whereas in practice it should be decided by some form of model selection process to ensure the chosen model is optimally parameterized given the available data.
2. Forecasting cancer tumor growth — case study
This comprises three sections:
- the data;
- the model formulation;
- the model fitting and forecasting.
The key modeling code snippet is provided below as a gist. The full data set is available at the author’s GitHub repo along with the full code needed to reproduce the results and plots.
2.1 The data
The data comprise a single growth trajectory of a breast cancer tumor, and a snippet of these data is given in the gist below. The original data are available in Zenodo here, but as they comprise only 15 time points over 23 days, simulation was used to generate more densely sampled data sets, giving n=231 rather than n=15 over the same time period. Several different simulated datasets are used. A specific point of note about the data format is that each row is a transition — start time (Time1) and stop time (Time2), with starting point (Observation1) and ending point (Observation2). This format is chosen to match how the model likelihood will be specified (see later).
# data snippet first 15 data points
Time1 Observation1 Time2 Observation2
0.0 109.0800 0.1 114.9240
0.1 114.9240 0.2 122.3569
0.2 122.3569 0.3 129.1326
0.3 129.1326 0.4 135.7717
0.4 135.7717 0.5 135.8355
0.5 135.8355 0.6 144.2088
0.6 144.2088 0.7 138.3619
0.7 138.3619 0.8 141.7520
0.8 141.7520 0.9 144.8600
0.9 144.8600 1.0 148.1573
1.0 148.1573 1.1 160.3782
1.1 160.3782 1.2 158.7539
1.2 158.7539 1.3 160.8774
1.3 160.8774 1.4 158.3105
1.4 158.3105 1.5 166.6299
Gist 1. Snippet of data. Full dataset available on author’s GitHub (see text for link)
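As an aside, preparing data in the transition format described in section 2.1 is mechanical. The short sketch below is illustrative only (it is not taken from the author’s repo) and converts a plain series of (time, observation) pairs into that format.

## Illustrative sketch only (not from the author's repo): convert a plain series of
## (time, observation) pairs into the transition format of Gist 1, where each row
## holds the start and end of one time step.
make_transitions <- function(time, obs) {
  n <- length(time)
  data.frame(Time1 = time[-n], Observation1 = obs[-n],
             Time2 = time[-1], Observation2 = obs[-1])
}

## example using the first four observations from the snippet above -
## this reproduces the first three rows of Gist 1
make_transitions(time = c(0.0, 0.1, 0.2, 0.3),
                 obs  = c(109.0800, 114.9240, 122.3569, 129.1326))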
2.2 Model formulation
A Bayesian diffusion model probabilistically describes how a process evolves from a given starting position, e.g. (Time1, Observation1), through to a later position, (Time2, Observation2). This evolution is described using a stochastic differential equation (SDE). The SDE chosen for this case study is given below in equation 1. It is a stochastic form of the well-known Gompertz function, chosen here because Gompertz models have previously been used to model cancer tumor growth (e.g. see this publication).
In equation 1 the multiplicative factor before the dt term is called the drift and the term before the dW is called the diffusion. The dW is a Brownian motion process (or Wiener process, hence the “W” in dW) which continually perturbs the evolution of the process X(t) through time. In the specific context of tumor growth, such perturbations are intuitively reasonable given the natural ebbs and flows which occur in any complex dynamical biological system, e.g. due to competition between the tumor and other organisms for essential nutrients.
In this example, a Gompertz-like SDE is a reasonable choice a priori. There are a number of standard SDE models, mostly used or developed initially in a financial modeling context. For example, the Ornstein-Uhlenbeck model, which is a mean-reverting process, and the Vasicek model, which was developed to model interest rates. In general, an SDE can have any terms for drift and diffusion, including explicit dependence on time t. Polynomial forms of drift have been used in biological applications to model child growth trajectories (see here). That said, a major practical caveat is that an SDE needs to have a tractable slice density (see later), and so drift and diffusion terms are typically chosen with this in mind.
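For concreteness, the Ornstein-Uhlenbeck process mentioned above can be written as

dX(t) = θ (μ − X(t)) dt + σ dW(t)

where θ > 0 controls the speed of mean reversion towards the long-run level μ and σ scales the Brownian perturbations; the Vasicek interest-rate model has the same mathematical form.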
A final remark on the model in equation 1 is that the diffusion process (dW) is scaled by X(t). This is a convenient way to ensure that the diffusion process cannot perturb the overall process so far that X(t) becomes negative — which would make no biological sense here, as X(t) is the size of the tumor.
2.3 Model fitting using MCMC
To understand how to work with SDE models it is likely necessary to look at the code of a full example in some detail. A snippet of this follows in a gist, but first it is necessary to complete the specification of the model in equation 1 and determine its likelihood by:
- determining the slice density of the SDE (also called the transition probability density) and
- defining any additional complexities that are needed in the model, e.g. measurement or observation error.
2.3.1 SDE slice density derivation
From equation 1 it is straightforward to derive the slice density using the Wolfram Language sandbox (no sign-in required), or else directly in Mathematica or in the Wolfram Engine (free for developers). See figure 2, which shows how to compute the slice density for the SDE in equation 1; this will be used in the later MCMC code in R-nimble.
The slice density defines how likely it is that the model (a stochastic process) will pass through a particular point in the future, based on what is known of the current state of the process. Hence this density function describes a probabilistic “slice” across all paths, or realizations, of the model between any two time points. In figure 2 an expression for the log slice density is also computed — this is what is needed in the R-nimble model specification.
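In the absence of measurement error, the Markov property of the diffusion means that the likelihood of the full dataset is simply the product of slice densities, one per transition, i.e. one per row of the data in Gist 1:

L(parameters) = product over i = 1, …, n of p( Observation2[i] | Observation1[i], Time2[i] − Time1[i] )

which is why the data were arranged in the transition format of section 2.1. When measurement error is added (next section), the same slice density enters the model through latent states rather than through the observations directly.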
2.3.2 Additional model features in MCMC
An optimal model for a given dataset may comprise only the SDE (drift and diffusion) part; however, it is also typical to include an additional random error component, such as independent and identically distributed (iid) Normal errors, in the model. This accounts for measurement error: for example, suppose the instrument used to measure tumor volume suffers from a small but non-ignorable measurement error. This corresponds to iid Normal errors in addition to the stochastic variability present in the underlying dynamical system.
Measurement error is included in the model given in the following gist. Note that by including this form of additional random error, the dynamic phenomenon being modeled is now only latently observed — what is actually observed is Y(t) = X(t) + Z(t), where X(t) is the true tumor growth and Z(t) is measurement error (say).
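Written out, and assuming the measurement error is iid Normal as described above, the observation model is

Y(t) = X(t) + Z(t), with Z(t) ~ Normal(0, σ²) independently at each measurement time,

so the true tumor size X(t) becomes a latent state that the MCMC must estimate alongside the model parameters (the Observation1_latent and Observation2_latent nodes in the gist below).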
2.3.3 Model formulation in R-nimble
The gist below contains the key snippet of code that defines a Bayesian diffusion model (with measurement error) in R-nimble. The full code is available here. Some remarks:
- the full code uses the CRAN parallel library for efficiency, which adds slightly more complexity (the main modeling code needs to be wrapped in a function)
- defining a custom probability distribution is necessary for the slice density; this is straightforward — see the R-nimble manual
- to generate forecasts/fitted values the input data set needs to contain dummy records with NA for the forecasts/fitted values to be estimated
- as currently coded, the full MCMC estimation can take 60 minutes across 8 cores (Intel Xeon 2.2 GHz), so some patience is required (or reduce the length of the chains if only debugging).
#########################################################################
############ MODEL DEFINITION in R-NIMBLE #################
## (note: this has lines missing - see the full code at the GitHub repo.)
#########################################################################
## First set up the custom density - this is the slice density.
## How to set up custom densities is documented in the R-nimble manual.
## Note the formula below is for the log of the density.
## This function does not need to be registered.
## The "d" prefix (dXXXX) denotes a density function.
dmysde <- nimbleFunction(
    run = function(x = double(0),                   # random variable value
                   x0 = double(0, default = 0),
                   t = double(0, default = 0),
                   alpha = double(0, default = 0),
                   K = double(0, default = 0),
                   sigma = double(0, default = 1),  # standard deviation
                   log = integer(0, default = 0)) {
        returnType(double(0))
        # log pdf
        logProb <- (1/2)*(
            -log(2*pi*t) - 2*log(x) - ((
                ((t*sigma^2)/2) - log(K) + log(x) +
                (exp(-t*alpha))*(log(K) - log(x0)))^2)/(t*sigma^2) -
            2*log(sigma)
        )
        if(log) return(logProb)
        else return(exp(logProb))
    })
## This is not strictly needed but is included as a check - the MCMC sampler should generate
## the variates itself, so a stop() is used here to catch the case where this function is
## called, in which case the sampler type needs to be switched to RW.
rmysde <- nimbleFunction(
    run = function(n = integer(0),
                   x0 = double(0, default = 0),
                   t = double(0, default = 0),
                   alpha = double(0, default = 0),
                   K = double(0, default = 0),
                   sigma = double(0, default = 1)) {
        stop('rmysde should never be run')
        returnType(double(0))
        return(0)
    })
## This is the model definition - a BUGS model, similar in format to JAGS and OpenBUGS
code <- nimbleCode({
    sigma1 ~ dunif(0, 1000)    # flat prior for sd
    sigma2 ~ dunif(0, 1000)    # flat prior for sd
    sigma3 ~ dunif(0, 1000)    # flat prior for sd
    a ~ dunif(0, 1)            # flat prior for Gompertz parameter - must be non-negative
    K ~ dnorm(0, sd = 1000)    # diffuse Gaussian prior; K must be non-negative and is a large value
    mu_t ~ dnorm(0, sd = 1000) # diffuse Gaussian prior for mean
    sigma_t ~ dunif(0, 1000)   # flat prior for sd
    for(i in 1:n) {  # for each observation - each transition from X(t) to X(t+1) given X(t)
        # the model includes iid Normal errors - measurement error;
        # without measurement error only the dmysde line below is needed
        Observation1_latent[i] ~ dnorm(mu_t, sd = sigma_t)  # a prior, as Observation1_latent
                                                            # is a random variable
        Observation1[i] ~ dnorm(Observation1_latent[i], sd = sigma1)  # data likelihood with sd = sigma1
        # this is the SDE part; note that Observation2_latent is a random variable,
        # not data, as it is latent due to measurement error. sigma3 is the sd for the
        # Brownian motion diffusion
        Observation2_latent[i] ~ dmysde(x0 = Observation1_latent[i],
                                        t = Time2[i] - Time1[i],
                                        alpha = a,
                                        K = K,
                                        sigma = sigma3)
        # data likelihood with sd = sigma2
        Observation2[i] ~ dnorm(Observation2_latent[i], sd = sigma2)
    }
})
Gist 2. Model code snippet. Full code available on GitHub (see text for link).
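For orientation, a minimal single-chain sketch of how a model like the one above might be compiled and run in R-nimble is given below. The file name, initial values and MCMC settings are assumptions for illustration; the full script on GitHub wraps the equivalent steps in a function and runs chains in parallel.

## Minimal single-chain sketch (file name, inits and MCMC settings are assumptions);
## the full GitHub script wraps the equivalent steps in a function and runs chains in parallel.
library(nimble)

df <- read.csv("tumor_transitions.csv")  # hypothetical file name; columns as in Gist 1

constants <- list(n = nrow(df), Time1 = df$Time1, Time2 = df$Time2)
dataList  <- list(Observation1 = df$Observation1, Observation2 = df$Observation2)
inits     <- list(a = 0.5, K = 500, sigma1 = 1, sigma2 = 1, sigma3 = 1,
                  mu_t = mean(df$Observation1), sigma_t = 10,
                  Observation1_latent = df$Observation1,
                  Observation2_latent = df$Observation2)

model   <- nimbleModel(code, constants = constants, data = dataList, inits = inits)
cmodel  <- compileNimble(model)
## monitor the parameters plus the latent nodes so fitted values/forecasts can be extracted
conf    <- configureMCMC(model, monitors = c("a", "K", "sigma1", "sigma2", "sigma3",
                                             "Observation2_latent"))
mcmc    <- buildMCMC(conf)
cmcmc   <- compileNimble(mcmc, project = model)
samples <- runMCMC(cmcmc, niter = 20000, nburnin = 5000)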
For completeness the R session information from a run of the full code is:
2.3.4 Results
A Bayesian diffusion model using the Gompertz formulation given in equation 1 (plus the addition of iid Normal errors as per the above gist) was fitted to two different data sets. Each data set comprised n=231 data points and was a different realization derived from the original cancer tumor data set (see section 2.1). Figures 4 and 5 show forecasts and their associated uncertainty from the fitted models, along with a subset of fitted values to illustrate the model fit.
The results in figure 4 look intuitively reasonable — especially for such a simple model, as the Gompertz formulation has only two parameters — and the fitted values are close to the observed values. The forecasts also look plausible and behave as forecasts should: the farther into the future the forecast, the greater the uncertainty around it. Note that these are Bayesian posterior predictive forecasts, i.e. taken from quantiles of the posterior predictive distribution, and so the uncertainty estimates surrounding the forecasts also include parameter estimation uncertainty (as opposed to making predictions conditional on a single set of parameter estimates, which is typical in non-Bayesian modeling).
Figure 5 is similar to figure 4 in that the results, both fitted values and forecasts, look intuitively reasonable. The qualitative behavior of this trajectory is rather different from that in figure 4 — here the tumor initially grows roughly linearly and then plateaus. The same form of stochastic Gompertz model is used, but the parameter estimates are very different, reflecting the different shape of the data.
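As a practical note, because the forecasts are posterior predictive quantiles they can be read straight off the MCMC output. A minimal sketch is given below, assuming the relevant prediction nodes (the NA records described in section 2.3.3) were included in the monitors; the node names follow Gist 2 but are otherwise assumptions.

## Illustrative only: extract posterior predictive forecast intervals, assuming the
## prediction nodes were added to the MCMC monitors.
pred_cols <- grep("Observation2_latent", colnames(samples))
forecasts <- apply(samples[, pred_cols, drop = FALSE], 2, quantile,
                   probs = c(0.025, 0.5, 0.975))  # 95% interval and median per node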
3. Bayesian diffusion modeling — building models
As with any modeling methodology, the real skill is often in the art of how to build and parameterize a model so that it’s optimal — at least in some sense — for a specific problem or dataset. This is particularly true with Bayesian diffusion models as there is both the SDE formulation itself and then the MCMC formulation, each of which can be customized. The following two sections contain some general advice and remarks on model building with SDEs.
3.1 General SDE formulations
Section 2 demonstrated the application of a Bayesian diffusion model based on a Gompertz formulation. There are many other possibilities, for example equation 2 shows an SDE which is linear in the drift term but also has an explicit dependence on time. Figure 6 shows how to compute the log slice density for the SDE in equation 2. Again, this is a relatively simple process (as Mathematica does all the hard work here!).
3.2 Transformations and nesting
Perturbation of a dynamic process by Brownian motion assumes that the process being modeled can both increase and decrease, which was appropriate for the above tumor example. For some dynamic phenomena, however, diffusion can realistically only happen in one direction. A good example is height: the growth in height of a child/animal/plant is a non-decreasing process — growth in height may slow or stop, but the entity cannot (typically at least) become smaller. A solution to this problem is not to model the dynamic process directly with an SDE/diffusion model, but rather to incorporate a diffusion model into the parameters of a model which does enforce non-decreasing behavior, e.g.,
- consider the growth model Y(t) = a + b X(t) + c X(t)² — a simple polynomial regression. Now, if modeling a non-decreasing process, then simply include a transformation of the parameters such as
- Y(t) = exp(a) + exp(b) X(t) + exp(c) X(t)² — this is now a non-linear model, which is not a problem for MCMC estimation, and it forces the model to be non-decreasing no matter whether a, b and c take positive or negative values.
If the process being modeled is dynamic, then assuming that the parameters a, b, and c are static over time might not be realistic, or at least this possibility should be considered in the model selection process. To allow the regression parameters to fluctuate stochastically over time, these parameters, either individually or jointly, can be modeled using Bayesian diffusion; for example, see equation 3.
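Purely as an illustration of the idea (and not necessarily the formulation used in equation 3), the coefficient b could itself be given a mean-reverting Ornstein-Uhlenbeck diffusion,

db(t) = θ_b (μ_b − b(t)) dt + σ_b dW(t),

so that the effective coefficient exp(b(t)) fluctuates stochastically around exp(μ_b) rather than being held fixed over the whole trajectory.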
If it’s desired that the regression parameters are mutually correlated over time then they could be parameterized as a joint drift and diffusion process (which would involve a system of SDEs and multiple Brownian motion processes). Such approaches can provide an incredibly rich modeling framework.
3.3 Tractability
A practical consideration when model building is tractability. Mathematically speaking, SDEs in general are not analytically tractable, and so a formulation needs to be chosen that has a slice density available in closed form. Strictly speaking, a closed form is not essential, as it is possible to estimate the slice density numerically: simulation can be used (a number of SDE simulation libraries are available for R and also for Julia), or the Fokker-Planck equation can be solved numerically. But both of these approaches involve considerable effort, technically and computationally, as the numerical estimation would also need to be nested inside the MCMC sampling process.
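To give a flavor of the simulation route, the sketch below implements a generic Euler-Maruyama scheme (the simplest numerical approximation of an SDE) and applies it to the Ornstein-Uhlenbeck process mentioned in section 2.2; it is illustrative only and is not tied to equation 1.

## Generic Euler-Maruyama path simulator (illustrative sketch, not from the author's repo)
euler_maruyama <- function(x0, t0, t1, n_steps, drift, diffusion) {
  dt <- (t1 - t0) / n_steps
  t  <- seq(t0, t1, length.out = n_steps + 1)
  x  <- numeric(n_steps + 1)
  x[1] <- x0
  for (i in 1:n_steps) {
    # one Euler-Maruyama step: deterministic drift plus a Brownian increment
    x[i + 1] <- x[i] + drift(x[i], t[i]) * dt +
                       diffusion(x[i], t[i]) * sqrt(dt) * rnorm(1)
  }
  data.frame(t = t, x = x)
}

## example: an Ornstein-Uhlenbeck process dX(t) = θ(μ − X(t)) dt + σ dW(t)
path <- euler_maruyama(x0 = 1, t0 = 0, t1 = 10, n_steps = 1000,
                       drift     = function(x, t) 0.5 * (2 - x),  # θ = 0.5, μ = 2
                       diffusion = function(x, t) 0.2)            # σ = 0.2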
4. Final remarks
This article has provided a brief introduction to Bayesian diffusion modeling and has hopefully demonstrated that it can be a powerful tool for tackling challenging forecasting problems. The best way to learn how to apply Bayesian diffusion models to real problems is to work through an end-to-end coding example. This can be found on GitHub here.
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.