tensorflow 2.16.1
tensorflow_probability 0.24.0
Dataset: Palmer Penguins
https://github.com/allisonhorst/palmerpenguins
I was trying to implement a simple bayesian linear regression model to predict body mass from flipper length using Tensorflow/ Tensorflow Probability. When I sampled from the posterior, I realized that a small change in the input observed data (e.g. a parallel shift in body mass) will lead to very weird result in the posterior samples.
The Adaptive NUTS sampler is used.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
run_mcmc = tf.function(
tfp.experimental.mcmc.windowed_adaptive_nuts,
autograph=False,
jit_compile=True,
)
penguins = pd.read_csv("./penguins.csv")
missing_data = penguins.isnull()[
["bill_length_mm", "flipper_length_mm", "sex", "body_mass_g"]
].any(axis=1)
penguins = penguins.loc[~missing_data]
adelie_mask = (penguins["species"] == "Adelie")
adelie_flipper_length_obs = penguins.loc[adelie_mask, "flipper_length_mm"].values
adelie_mass_obs = penguins.loc[adelie_mask, "body_mass_g"].values
adelie_mass_obs += 0
def gen_model(flipper_length):
flipper_length = tf.constant(flipper_length, tf.float32)
jds_ab = tfd.JointDistributionNamedAutoBatched(
dict(
sigma=tfd.HalfStudentT(df=[tf.cast(100.0, tf.float32)], loc=[tf.cast(0.0, tf.float32)], scale=[tf.cast(2000.0, tf.float32)], name="sigma"),
beta_1=tfd.Normal(loc=[tf.cast(0.0, tf.float32)], scale=[tf.cast(4000.0, tf.float32)], name="beta_1"),
beta_0=tfd.Normal(loc=[tf.cast(0.0, tf.float32)], scale=[tf.cast(4000.0, tf.float32)], name="beta_0"),
mass=lambda beta_0, beta_1, sigma:
tfd.Normal(
loc=beta_0 + beta_1 * flipper_length,
scale=sigma,
)
)
)
return jds_ab
flipper_reg_model = gen_model(adelie_flipper_length_obs)
mcmc_samples, sampler_stats = run_mcmc(
1000,
flipper_reg_model,
n_chains=4,
num_adaptation_steps=1000,
mass=tf.constant(adelie_mass_obs, tf.float32),
)
print(mcmc_samples["beta_0"].numpy().mean())
print(mcmc_samples["beta_1"].numpy().mean())
Now using the original data, posterior mean estimate of (beta_0, beta_1) are around (-2323, 32).
But if I parallel shifted the observed body mass by 1000
adelie_mass_obs += 1000
The posterior mean estimate of (beta_0, beta_1) are (-0.46, -0.20) which do not make sense at all.
Not sure if anyone can reproduce the same result, and advise on how I can go about investigating what might have been wrong.
Thanks.