You can use Stan to write a statistical model, compile it to efficient C++ code, and load it into a running Python (or R) process. That gives you access not just to the usual HMC sampling methods, but also to the underlying log posterior density $\log \pi(\theta \mid y)$ and its gradient, via the log_prob and grad_log_prob methods, respectively. This document shows how to do that in Python.
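Here's a minimal, self-contained sketch of the pattern using a toy standard-normal model (the model and parameter value are stand-ins for the real example that follows):
import pystan

# Toy model with a single unconstrained parameter, just to show the calls.
toy = pystan.StanModel(model_code="parameters { real mu; } model { mu ~ normal(0, 1); }")
toy_fit = toy.sampling(iter=200)        # any fit object will do
print(toy_fit.log_prob([1.5]))          # log density at mu = 1.5, up to a constant
print(toy_fit.grad_log_prob([1.5]))     # its gradient: -1.5 for a standard normal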
On the Stan team, we get a lot of questions from people looking to use Stan for algorithm development. Stan is opinionated software: it treats models as largely separate from inference mechanisms, and most people who want to develop algorithms with us buy into that. They are typically looking to add some mechanism for handling discrete parameters, or another approximation method as an alternative to ADVI, for a given statistical model.
I claim that this kind of algorithm development on top of Stan is actually pretty easy in practice, but we haven't written much about it, because the Stan team values robustness and generality first and foremost and we're still waiting for something that beats the No-U-Turn Sampler (which our implementation has improved on since the original paper). That said, a few of us would like to support the broader research community in using Stan for algorithm development, even if none of these algorithms ship with Stan itself. This document illustrates how that can work.
import pystan
import numpy as np
# Construct an arbitrary model
model_code = """
data {
  int numObs;
  int numGroups;
  matrix[numObs, numGroups] group;
  vector[numObs] y;
}
parameters {
  vector[numGroups] beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 20);
  sigma ~ normal(0, 5);
  y ~ normal(group * beta, sigma);
}
"""
model = pystan.StanModel(model_code=model_code)
# fake data
def gen_fake_data(N=100, K=3):
    group_ = np.random.choice(K, N)
    # one-hot encode the group assignments into an N x K matrix
    group = np.zeros((N, K))
    group[np.arange(N), group_] = 1
    beta = np.random.normal(0, 20, K)
    sigma = np.abs(np.random.normal(0, 5))
    print("sigma: {}".format(sigma))
    y = np.random.normal(group.dot(beta), sigma)
    # beta is included only for reference; Stan ignores unused data entries
    return dict(group=group, y=y, numObs=N, numGroups=K, beta=beta)
fd = gen_fake_data()
fd
# You must first run one of the methods that gives you a fit object:
fit = model.sampling(data=fd)
fit
# The fit object then exposes log_prob and grad_log_prob methods:
list(filter(lambda x: "prob" in x, dir(fit)))
help(fit.grad_log_prob)
fit.log_prob(upar=[-24.4, -23.4, -20.7, np.log(8.8)])  # sigma is passed on the log scale; adjust_transform (the Jacobian adjustment) defaults to True
# We can pass unconstrained parameter values to `log_prob` in the order the
# parameters were defined, which is shown in `flatnames` (sigma enters as log(sigma)):
print(fit.flatnames)
fit.log_prob(upar=[-50, 40, 32, 2.8])
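Rather than taking logs by hand to get onto the unconstrained scale, we can let the fit object do the transform for us. Here's a small sketch assuming PyStan 2's unconstrain_pars helper, which takes a dict of constrained parameter values:
# Let the fit object apply the unconstraining transform (log, for sigma)
# instead of doing it by hand.
upar = fit.unconstrain_pars(dict(beta=[-50, 40, 32], sigma=8.8))
print(upar)                     # the last entry should be log(8.8)
print(fit.log_prob(upar))       # same as passing the hand-transformed values
print(fit.grad_log_prob(upar))  # gradient on the unconstrained scale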
Stochastic gradient descent is basically just gradient descent run on a random data subset at each iteration. Stan does not implement it, but we can easily build it ourselves using the log_prob and grad_log_prob methods of the fit object PyStan returns (here it's really gradient ascent, since we are maximizing the log posterior). Concretely, the algorithm will: randomly initialize the parameters, evaluate grad_log_prob at that position in the unconstrained parameter space on a random data subset, adjust the parameters along that gradient, and repeat the last two steps until the parameters change by less than some tolerance.
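Before wiring in the subsampling, here's the core update in isolation as a sketch: a single gradient-ascent step on the log posterior, using the full-data fit object from above.
# One plain gradient-ascent step on the log posterior (no subsampling yet).
# theta lives on the unconstrained scale: (beta[1], beta[2], beta[3], log(sigma)).
alpha = 1e-4
theta = np.random.normal(0, 1, 4)
theta = theta + alpha * np.asarray(fit.grad_log_prob(theta, adjust_transform=True))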
# Generate a ton of data (or this would be less fun)
sgd_data = gen_fake_data(N=1000000)
# Assume the observations are always called 'y' and there are always 3 beta parameters,
# since this is just an illustration and not production-ready code.
def sgd(model, data, numParams, size=10000, alpha=0.0001, tol=1e-4):
    params = np.arange(numParams)  # dummy values so the first convergence check passes
    new_params = np.random.normal(0, 20, numParams)
    new_params[-1] = np.log(8.8)  # hack: the last unconstrained parameter is log(sigma)
    print("new_params: {}".format(new_params))
    i = 0
    while np.sum(np.abs(new_params - params)) > tol:
        params = new_params.copy()
        subset = np.random.choice(np.arange(len(data['y'])), size)
        # Hack to get a fit object conditioned on the subsampled data:
        # run a single Fixed_param "sampling" iteration.
        model_with_data = model.sampling(iter=1, algorithm='Fixed_param',
                                         data={**data,
                                               'y': data['y'][subset],
                                               'group': data['group'][subset],
                                               'numObs': size})
        grad = model_with_data.grad_log_prob(params, adjust_transform=True)
        new_params += alpha * grad
        i += 1
        if i % 50 == 0:
            print("new params: {}".format(new_params))
    return new_params
print(sgd_data['beta'])  # the true beta used to simulate the data
sgd(model, sgd_data, 4)
np.exp(0.96846103)  # convert the recovered log(sigma) back to the constrained scale
As you can see, SGD takes forever and seems to severely underestimate sigma. Oh well, at least you can farm it out to millions of computers at once.
Expectation maximization can be thought of as generalizing k-means, with its discrete cluster assignments, to models that have additional parameters (which may depend on the cluster assignments). Bob Carpenter has a good write-up; I'll summarize the algorithm briefly here.
The algorithm is traditionally broken into two steps: an expectation step, which computes the expected values of the cluster assignments (or other latent parameters) given the current parameter estimates, and a maximization step, which maximizes over the remaining parameters given those assignments. One way to accomplish this with Stan is to provide two Stan models, one for each step; I'll call these expectation.stan and maximization.stan. I'll use Markov chain Monte Carlo to approximate the integral in the expectation step and Stan's BFGS optimizer for the maximization step. The model we already defined can be used as-is for the maximization step.
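In symbols, with $z$ the latent group assignments and $\theta = (\beta, \sigma)$ the remaining parameters, one iteration of the scheme below looks roughly like this (note that, as a simplification of textbook EM, it plugs the posterior-mean assignments into the maximization rather than averaging the complete-data log density over the whole posterior):
$$\text{E step:}\quad \hat{z}^{(t)} \approx \frac{1}{S}\sum_{s=1}^{S} z^{(s)}, \qquad z^{(s)} \sim p(z \mid y, \theta^{(t)}) \quad \text{(MCMC on expectation.stan)}$$
$$\text{M step:}\quad \theta^{(t+1)} = \arg\max_{\theta}\ \log\left[\, p(y \mid \hat{z}^{(t)}, \theta)\, p(\theta)\,\right] \quad \text{(BFGS on maximization.stan)}$$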
maxi_model = model
expectation_model = pystan.StanModel(model_code="""
data {
  int numObs;
  int numGroups;
  row_vector[numGroups] beta;
  vector[numObs] y;
}
parameters {
  simplex[numGroups] group[numObs];
  real<lower=0> sigma;
}
model {
  vector[numObs] mu;
  for (n in 1:numObs)
    mu[n] = beta * group[n];
  // beta is data in this model, so it gets no prior here
  sigma ~ normal(0, 5);
  y ~ normal(mu, sigma);
}
""")
efit = expectation_model.sampling(fd)
efit
def em_algo(maxi_model, expectation_model, data, num_iter=10):
    print("Original beta: {}".format(data['beta']))
    # Random initialization for the parameters we're maximizing over
    data['beta'] = np.random.normal(0, 20, len(data['beta']))
    for i in range(num_iter):
        print("Iteration {}; beta: {}".format(i, data['beta']))
        # E step: sample the group assignments given the current beta
        efit = expectation_model.sampling(data)
        # Use the posterior mean as the group-assignment simplex for each observation
        data['group'] = np.mean(efit.extract()['group'], axis=0)
        # M step: optimize beta (and sigma) given the soft assignments
        mfit = maxi_model.optimizing(data)
        data['beta'] = mfit['beta']
    return data['beta']
em_algo(maxi_model, expectation_model, fd)