Hierarchical Regression

Radon

radioactive gas that enters homes through the ground
radon levels vary greatly from household to household
carcinogenic
mediocre yelp reviews

Household radon contamination data

80,000 houses
Two predictors:
- measument taken in the basement?
- county uranium level (positive corr with radon levels)
We’ll focus on Minnesota, which has 919 households in 85 counties

Getting to know the data

df = read.csv("radon/data.csv")
str(df)

'data.frame':   919 obs. of  4 variables:
 $ log_radon  : num  0.8329 0.8329 1.0986 0.0953 1.1632 ...
 $ basement   : int  0 1 1 1 1 1 1 1 1 1 ...
 $ county     : int  1 1 1 1 2 2 2 2 2 2 ...
 $ log_uranium: num  -0.689 -0.689 -0.689 -0.689 -0.847 ...

Distribution of radon levels in MN (log scale)

Start simple: complete pooling

Estimate a radon level for basement vs non-basement readings, $y_i \sim \mathcal{N}(\alpha + \beta x_i, \sigma)$
$\sigma$ is the standard deviation of the error, capturing all of the variance we aren’t modeling

Complete pooling Stan model

$y_i \sim \mathcal{N}(\alpha + \beta x_i, \sigma)$ translates to:

data {
  int<lower=1> N;
  vector[N] basement;
  vector[N] log_radon;
}
parameters {
  real basement_effect;
  real<lower=0> sigma;
  real alpha;
}
model {
  log_radon ~ normal(alpha + basement_effect * basement, sigma);
}

Intermezzo: vectorization

With vector[N] basement;

  log_radon ~ normal(alpha + basement_effect * basement, sigma);

is equivalent to, and faster than

  for (i in 1:N)
    log_radon[i] ~ normal(basement_effect * basement[i] + a, sigma);

The default priors are not ideal

By default, Stan will act as if there is a uniform prior on any parameters not given explicit distributions. We can do better.

log_radon is between -3 and 4 ish
Seems reasonable to think the effect of a basement measurement is somewhere in that range
Same for intercept
$\sigma$ should capture unexplained variance, so we can use the same range here and be only weakly informative

Pooled model with priors

Let’s choose $\mathcal{N}(0, 5)$ for all of these. This gives us

model {
  log_radon ~ normal(alpha + basement_effect * basement, sigma);
  alpha ~ normal(0, 5);
  basement_effect ~ normal(0, 5);
  sigma ~ normal(0, 5);
}

Pooled fit in Stan

Inference for Stan model: pooled.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                mean se_mean   sd 2.5%  50% 97.5% n_eff Rhat
basement_effect 0.59       0 0.07 0.46 0.59  0.72  1626    1
sigma           0.79       0 0.02 0.76 0.79  0.83  2085    1
alpha           0.78       0 0.06 0.66 0.78  0.90  1655    1

Samples were drawn using NUTS(diag_e) at Tue May  1 17:50:16 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Pooled fit, visualized

What does this teach us?

basement_effect
$\alpha$ + 0 = distribution of non-basement measurements
$\alpha$ + basement_effect = distribution of basement measurements
$\sigma$ = variation
What about counties?

Learning about counties

Adding county-level intercepts is another conventional approach
$y_i \sim \mathcal{N}(\alpha_{j[i]} + \beta x_i, \sigma)$, where $j = 1, ..., 85$
As before both $\sigma$ and $\beta$ are scalars, so they are shared among the entire population (here, all homes in MN)
Each intercept $\alpha_j$ is independent and inferred based on data from that county alone

Complete pooling:

No pooling:

Exercise: create the unpooled model

Separate intercept for each county
Still sharing beta and sigma at the population level
Will now need to feed in J and county to the sampling call.

$y_i \sim ~ \mathcal{N}(\alpha_{j[i]} + \beta x_i, \, \sigma)$

data {
  int<lower=0> N;                 // number of houses
  int<lower=0> J;                 // number of counties
  vector[N] basement;             // 0 = basement, 1 = not basement
  int<lower=1,upper=J> county[N]; // county id for each house
  vector[N] log_radon;            // radon measurement
}

Unpooled Stan model

data {
  int<lower=0> N;                 // number of houses
  int<lower=0> J;                 // number of counties
  vector[N] basement;             // 0 = basement, 1 = not basement
  int<lower=1,upper=J> county[N]; // county id for each house
  vector[N] log_radon;            // radon measurement
}
parameters {
  real basement_effect;
  real<lower=0> sigma;
  vector[J] a;
}
model {
  log_radon ~ normal(basement_effect * basement + a[county], sigma);
  a ~ normal(0, 5);
  basement_effect ~ normal(0, 5);
  sigma ~ normal(0, 5);
}

The most and least certain county estimates

   county         sd
67     67 0.09091059
26     26 0.09507768
43     43 0.73486469
82     82 0.73924232

Distribution of measurements by county

Estimates with county measurement density

How can we share insight across counties?

Adding hierarchy

Principled way of determining how much to pool
Pull estimates towards population (state) mean based on amount of variation in population
- low variance population: more pooling
- high variance population: less pooling
In limit:
- as variance goes to 0, complete pooling
- as variance goes to $\infty$, no pooling

No pooling:

Hierarchical:

Exercise: adding hierarchy

Let’s take our current model and add hierarchical priors on alpha:

$y_i \sim ~ \mathcal{N}(\alpha_{j[i]} + \beta x_i, \, \sigma)$

$\alpha_{j} \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha)$

$\mu_\alpha, \sigma_\alpha \sim \mathcal{N}(0, 5)$

data {
  int<lower=0> N;                 // number of houses
  int<lower=0> J;                 // number of counties
  vector[N] basement;             // 0 = basement, 1 = not basement
  int<lower=1,upper=J> county[N]; // county id for each house
  vector[N] log_radon;            // radon measurement
}

Hierarchical model on counties

parameters {
  real basement_effect;
  real<lower=0> sigma;
  real<lower=0> sigma_a;
  real mu_a;
  vector[J] a;
}
model {
  mu_a ~ normal(0, 5);
  sigma_a ~ normal(0, 5);
  a ~ normal(mu_a, sigma_a);
  basement_effect ~ normal(0, 5);
  sigma ~ normal(0, 5);
  log_radon ~ normal(a[county] + basement_effect * basement, sigma);
}

Hierarchical model fit

Inference for Stan model: partial_cp0.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                mean se_mean   sd 2.5%  50% 97.5% n_eff Rhat
a[67]           0.28    0.00 0.09 0.10 0.29  0.46  1469    1
a[82]           0.96    0.01 0.32 0.33 0.96  1.59  4000    1
a[50]           1.00    0.00 0.29 0.44 0.99  1.59  4000    1
sigma           0.73    0.00 0.02 0.69 0.73  0.76  4000    1
basement_effect 0.66    0.00 0.07 0.53 0.66  0.79   828    1

Samples were drawn using NUTS(diag_e) at Tue May  1 17:50:47 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Unpooled fit for reference:

Inference for Stan model: unpooled.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                mean se_mean   sd 2.5%  50% 97.5% n_eff Rhat
a[67]           0.22    0.00 0.09 0.04 0.22  0.40  2108    1
a[82]           1.51    0.01 0.74 0.10 1.50  2.96  4000    1
a[50]           1.76    0.01 0.72 0.36 1.76  3.15  4000    1
sigma           0.73    0.00 0.02 0.69 0.73  0.76  4000    1
basement_effect 0.70    0.00 0.07 0.56 0.70  0.84  1403    1

Samples were drawn using NUTS(diag_e) at Tue May  1 17:50:35 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Fit lines for each county

Hierarchical fit, visualized

Unpooled fit for reference

Hierarchical vs unpooled

Posterior predictive checks

Generate fake measurements to see how well our model reproduces the actual measurements
Can use these to do visual model comparison - add this snippet to the end of both the unpooled and hierarchical models

generated quantities {
  vector[N] yppc;
  for (i in 1:N)
    yppc[i] = normal_rng(a[county[i]] + basement_effect * basement[i], sigma);
}

For each measurement, we obtain a distribution

Hierarchical vs unpooled PPC - min house

Hierarchical vs unpooled PPC - max house

Shinystan is useful for PPCs

launch_shinystan(unpooled.fit)

launch_shinystan(partial.cp0.fit)

PPC model comparison with densities

Improving on the model

We also have a county-level uranium measurement available
This can potentially give us much more information on counties with few measurements
Better estimates save money and lives

Group + individual covariates = “multilevel”

Adding a county-level uranium covariate gives us

$y_i \sim ~ \mathcal{N}(\alpha_{j[i]} + \beta x_i + \gamma u_j, \, \sigma)$

We can use the same weak prior:

$\gamma \sim \mathcal{N}(0, 5)$

Exercise: add uranium data and parameters

$y_i \sim ~ \mathcal{N}(\alpha_{j[i]} + \beta x_i + \gamma u_j, \, \sigma)$
Add vector[N] u; to your data block
Pass in u = df$log_uranium in your call to sampling like the rest of the data

data {
  int<lower=0> N;
  int<lower=0> J;
  vector[N] log_radon;
  vector[N] basement;
  int county[N];
  vector[N] u;
}

Now with a group-level predictor, `u_effect`

model {
  mu_a ~ normal(0, 100);
  sigma ~ normal(0, 5);
  sigma_a ~ normal(0, 20);
  a ~ normal(mu_a, sigma_a);
  u_effect ~ normal(0, 5);
  basement_effect ~ normal(0, 2);
  log_radon ~ normal(a[county] + basement_effect * basement + u_effect * u, sigma);
}

Warning: There were 58 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup

Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low

Warning: Examine the pairs() plot to diagnose sampling problems

pairs()

pairs(group.hier.fit, pars=c('sigma', 'sigma_a', 'mu_a', 'u_effect', 'a[1]'))

Shinystan

launch_shinystan(group.hier.fit)

Neal’s funnel in small-data hierarchical params

$\mathrm{log\, y} \sim \mathcal{N}(0, 1.5)$

$\mathrm{x}_i \sim \mathcal{N}(0, \mathrm{y})$

`adapt_delta` makes HMC more careful

Increasing adapt_delta ends up decreasing step size
Helps HMC deal with funnels and other sharp geometric areas
But, makes it take much longer to explore larger and less dense areas of the typical set (like the mouth of our funnel)
control=list(adapt_delta=0.99) or control=dict(adapt_delta=0.99)
In this case, we can get rid of the divergences with adapt_delta but we still get BFMI warnings

Centered vs. non-centered parameterizations

Our funnel is caused by tight coupling between hierarchical scale and individual mean in small-data regime
Can attack this directly with a “non-centered parameterization,” which decouples the two

Remember z-score tables?

Can convert any normal to a standardized normal

$\mathrm{X} \sim \mathcal{N}(\mu, \sigma)$

$\mathrm{Z} = \frac{\mathrm{X} - \mu}{\sigma}$

$\mathrm{Z} \sim \mathcal{N}(0, 1)$

Exercise: non-center $\alpha$ our model

$\sigma_\alpha \sim \mathcal{N}(0, 5)$

$\mu_\alpha \sim \mathcal{N}(0, 5)$

$\alpha_j^{std} \sim \mathcal{N}(0, 1)$

$\alpha_j = \mu_\alpha + \sigma_\alpha \cdot \alpha_j^{std}$

parameters {
  real mu_a;
  real<lower=0> sigma_a;
  vector[J] a_std;
  real basement_effect;
  real u_effect;

Non-centered parameterization of $\alpha$

  real<lower=0> sigma;
}
transformed parameters {
  vector[J] a = mu_a + sigma_a * a_std;
}
model {
  log_radon ~ normal(a[county] + basement_effect * basement + u_effect * u, sigma);
  mu_a ~ normal(0, 5);
  sigma_a ~ normal(0, 5);
  a_std ~ normal(0, 1);
  sigma ~ normal(0, 5);