Introduction to Bayesian Inference with Stan

2018.10.16

Potentially relevant bits of my story

Very lazy software engineer; like to automate things
Worked with some Expert Systems folks for a couple years at a place known for automated decision-making
The inability of these systems to deal with uncertainty bothered me to no end
Took time off to do Recurse Center and dive in
Realized the majority of people trying to solve problems like this were actaully doing Bayesian modeling
Joined Stan Core Development team

How do we make decisions?

Try to understand the state of the world
Guess at the processes through which the world reacts to stimuli
Associate utility scores with possible world states, and model relative chance
Often, communication of findings is key

Why is it hard to assign utilities to actions?

Hard to represent the state of our knowledge about the world and how it changes
- We have uncertainty and ambiguity around even the most deterministic of processes
- And limited resources to discover the true process
Many possible models for a process
- We will never know the True Model
- Nor the world’s True State

Concrete example - mail delivery

I get a few items of mail most days, but some days I get 0
How often should I be experiencing no-mail days?
I could use statistical analysis to make a case to a reporter that it is worth investigating
…imagine that I care a lot about the mail

Mail delivery uncertainty

Mail is easy to measure, but the process that generates mail is incredibly complex
- For example, in a moment of weakness I used some expiring airline miles to subscribe to Vanity Fair
- No one has time to model my weaknesses
Assume that the rate of mail creation is constant over time

The Poisson distribution

Modeling mail delivery with a Poisson

A Poisson distribution captures much of the behavior we care about, but
- Exactly how well does it describe our data?
- Want to keep track of our uncertainty when making predictions or decisions
- We suspect someone is messing with our Poisson
Visually explain to our reporter friend in terms they find compelling

Frequentism, unfairly summarized

Lots of totally valid frequentist math and procedures
- Typically quite difficult
Instead, set up your experiment correctly and then use some off-the-shelf test or estimator to get The Best Answer (point estimate)
- Oh, and make sure this list of assumptions all hold
Alternatively, do a ton of math to prove asymptotic gaurantees specific to your model
No parameter uncertainty, because probability only represents long-run frequency
- This actually matters when thinking about and communicating results
“No” priors; incorporate minimal expert knowledge

Machine learning, fairly summarized

We have computers now, calculus can GTFO
- fair enough
Great success with algorithms that have only later had statistical interpretations ascribed to them
Progress = predictions 0.01% better than the last paper according to cross validation
- Let’s ignore that the noise around cross-validated accuracy scores is much higher than 0.01%
Why would we need to interpret the model if its predictions are accurate?

What does Bayesianism preach?

Philosophically, inference without a prior is impossible
- In frequentism it’s buried in an assumption
- In ML you are not really sure if there is a deeper meaning under the code in the first place, but there is some implied prior
Retain a distribution (as opposed to point estimates like the mean) as long as possible before making a decision
We also (only recently?) have computers!
- We can actually express an accurate, bespoke model for our data generating process without worrying about an analytic integral
Construct many models
Always Be Graphing

What are we modeling?

“If we knew our parameter values \(\theta\), how would data \(y\) be generated?” \[\pi(y \, | \, \theta)\]
But we want to learn what our data implies about \(\theta\)
\[\pi(\theta \, | \, y) = \frac{\pi(y \, | \, \theta) \pi(\theta)}{\pi(y)}\]
In our case, \(\theta\) might be the average amount of mail we get on any given day, or the percentage of days the carrier skips

Practical modeling with Stan

Stan is an imperative probabilistic programming language
- in the vein of BUGS, but can do arbitrary computation
A Stan program defines a probability model
- declares data inputs and parameter variables
- defines a log posterior density function
Stan comes with a few inference algorithms
- Hamiltonian Monte Carlo for full Bayesian inference
- Variational Inference
- penalized Maximum Likelihood Estimation and optimization

Why choose Stan?

Expressive
- Stan is a full imperative programming language
- Easier to communicate models
Robust
- usually works; signals when it doesn’t
Efficient
- effective sample size per unit time winner
- 3 forms of parallelism coming out this year (CPU, GPU, cluster)
Great documentation, case studies, and community!

Anatomy of a Stan model

data and parameters just declare inputs to a density function
model block defines that function and, given some data and parameters, returns a scalar real value density of the posterior at that location in parameter space
This defines \(\pi(y \, | \, \theta)\) from a few slides ago
Stan provides inference mechanisms to learn \(\pi(\theta \, | \, y)\) from a model’s \(\pi(y \, | \, \theta)\)

Normal mail delivery Stan model

data {
  int<lower=0> num_days;
  int<lower=0> mail[num_days];
}
parameters {
  real<lower=0> avg_mail_per_day;
}
model {
  avg_mail_per_day ~ normal(0, 5);
  mail ~ poisson(avg_mail_per_day);
}

Fitting a Stan model with PyStan

import pystan, numpy as np
data = dict(mail = np.array(r["mail"], dtype="int"), num_days=int(r["num_days"]))
model = pystan.StanModel("usps1.stan")
fit = model.sampling(data)

4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean     sd   2.5%      50%  97.5%  n_eff   Rhat
avg_mail_per_day   4.96  2.3e-3   0.08    4.8    4.96   5.11   1316    1.0

For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Plotting the fit

from matplotlib import pyplot as plt
plt.hist(fit.extract()["avg_mail_per_day"], bins=20)

Visually communicating model fit

from matplotlib import pyplot as plt
mail_rate = fit.extract()["avg_mail_per_day"] # 4000 posterior samples for this param
fake_data = np.random.poisson(mail_rate) #4000 fake data simulations
plt.hist([fake_data, data["mail"]], bins=15, density=True, color=["red", "blue"])

Posterior Predictive Check (PPC) plot
Posterior distribution is a distribution over parameter space incorporating what we’ve learned about our parameters
Simulate data based on the posterior and plot it against the actual observed data

Visually communicating model fit

from matplotlib import pyplot as plt
mail_rate = fit.extract()["avg_mail_per_day"] # 4000 posterior samples for this param
fake_data = np.random.poisson(mail_rate) #4000 fake data simulations
plt.hist([fake_data, data["mail"]], bins=15, density=True, color=["red", "blue"])

Add skip_percentage to model skipped days

data {
  int<lower=0> num_days;
  int<lower=0> mail[num_days];
}
parameters {
  real<lower=0> avg_mail_per_day;
  real<lower=0, upper=1> skip_percentage;
}
model {
  avg_mail_per_day ~ normal(0, 5);
  for (day in 1:num_days) {
    real without_skipping_density = poisson_lpmf(mail[day] | avg_mail_per_day);
    if (mail[day] == 0)
      target += log_mix(skip_percentage, 0, without_skipping_density);
    else
      target += without_skipping_density;
  }
}

Fit model with skip_percentage

Inference for Stan model: anon_model_7abb4b18761050015bf05aee642e2a30.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean     sd   2.5%    50%  97.5%  n_eff   Rhat
avg_mail_per_day   5.77  2.0e-3   0.09   5.58   5.76   5.96   2328    1.0
skip_percentage    0.99  2.2e-4   0.01   0.96   0.99    1.0   2791    1.0
lp__              -1599    0.03   1.06  -1602  -1598  -1598   1336    1.0

Samples were drawn using NUTS at Tue Oct 16 22:41:45 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Simulate data implied by parameter draws

mail_rate15 = fit15.extract()["avg_mail_per_day"]
skip_percentage = fit15.extract()["skip_percentage"]
fake_data15 = np.zeros(len(mail_rate))
for i in range(len(mail_rate15)):
    todays_mail =  np.random.poisson(mail_rate15[i])
    if np.random.binomial(1, skip_percentage[i]):
        fake_data15[i] = 0
    else:
        fake_data15[i] = todays_mail

Visually compare simulations with observed data

Fix our math; intro log scales

parameters {
  real<lower=0> avg_mail_per_day;
  real<lower=0, upper=1> skip_percentage;
}
model {
  avg_mail_per_day ~ normal(0, 5);
  for (day in 1:num_days) {
    real without_skipping_density = poisson_lpmf(mail[day] | avg_mail_per_day);
    if (mail[day] == 0)
      target += log_mix(skip_percentage, 0, without_skipping_density);
    else
      target += without_skipping_density
                + bernoulli_lpmf(0 | skip_percentage);
  }
}

Fixed fit

Inference for Stan model: anon_model_387876dd9af8a07353d37466d2bde6fb.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean     sd    25%    50%  97.5%  n_eff   Rhat
avg_mail_per_day   5.52  1.6e-3   0.09   5.46   5.52    5.7   3350    1.0
skip_percentage     0.1  2.0e-4   0.01    0.1    0.1   0.13   3072    1.0
lp__              -1821    0.02   0.97  -1821  -1821  -1820   1884    1.0

Samples were drawn using NUTS at Sat Oct 13 13:45:07 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

PPC plot for corrected model

Pure poisson (left) vs. with skipped days (right)

A typical non-Bayesian approach

Plug the model (sans prior) and data into an estimator like maximum likelihood (w/ implied uniform prior)
This means using an optimizer to find the mode (maximum) of the distribution, and then
(Frequentist) Perform some test to get a p-value against some “null hypothesis” to argue for their arbitrary alternative
(Machine Learning) Show that with certain initial values, cross-validation scores on test set were higher at least one time

What is wrong with optimization?

It’s kind of fine here with just two parameters and a lot of data (N=730)
With less data, uncertainty starts to matter
In higher dimensions, the mode of a distribution characterizes it very poorly
Good resources on the typical set and inference in high dimensions:

Area in higher dimensions is weird

(from Betancourt’s Conceptual Introduction to HMC)

Curse of dimensionality

“Somewhat counterintuitively, the average member of a population is an outlier. How could an item with every feature being average be unusual? Precisely because it is unusual to have so many features that close to average.” -Carpenter
Recall the Airforce’s attempt to design a cockpit for the average pilot

Curse of dimensionality

(from Mackay’s book)

How Stan’s sampling differs from optimizing

Hamiltonian Monte Carlo (HMC) draws parameters from the typical set
Scales appropriately with dimension
No U-Turn Sampler (NUTS) provides an adaptation routine on top of HMC

How does HMC work?

(from Betancourt’s Conceptual Introduction to HMC)

Bayesian methods are intuitive and general

Retain uncertainty up to the point of a decision
Capture more interesting data generating processes and relationships
Understand how your model works
- enables fairness, accountability, and transparency
Can intuitively use draws to answer ad-hoc questions about inferences
- “Given this model, in what percentage of worlds is the mail carrier skipping >0 days”
- Replaces frequentist statistical tests
Sensitivity analysis (“What could this model/data tell me?”)
- Tomorrow in Betancourt’s class, in the case studies, or our course in November!

Potentially relevant bits of my story

How do we make decisions?

Why is it hard to assign utilities to actions?

Concrete example - mail delivery

Mail delivery uncertainty

The Poisson distribution

Modeling mail delivery with a Poisson

Frequentism, unfairly summarized

Machine learning, fairly summarized

What does Bayesianism preach?

What are we modeling?

Practical modeling with Stan

Why choose Stan?

Anatomy of a Stan model

Normal mail delivery Stan model

Fitting a Stan model with PyStan

Plotting the fit

Visually communicating model fit

Visually communicating model fit

Add skip_percentage to model skipped days

Fit model with skip_percentage

Simulate data implied by parameter draws

Visually compare simulations with observed data

Visually compare simulations with observed data

Fix our math; intro log scales

Fixed fit

PPC plot for corrected model

PPC plot for corrected model

Pure poisson (left) vs. with skipped days (right)

A typical non-Bayesian approach

What is wrong with optimization?

Area in higher dimensions is weird

Curse of dimensionality

Curse of dimensionality

Curse of dimensionality

How Stan’s sampling differs from optimizing

How does HMC work?

Bayesian methods are intuitive and general

Additional resources