I thought that the scipy.stats.binom.log* family of functions could return a greater range of values because they kept all computations in log-space. But this example shows that both binom.logsf and binom.sf fail due to precision for the same inputs, so the log* functions appear to be mere convenience functions, i.e. equivalent to log(binom.sf()).
QUESTION
Is this the expected behavior or is there a configuration step I missed?
from math import exp
from scipy.stats import binom

# Values taken from the output below
N = 100000
p_let = 4e-10
n_min_max = 42

for n in range(n_min_max, n_min_max + 20):
    p_trial_log = binom.logsf(n, N, p_let)
    p_trial = binom.sf(n, N, p_let)
    print(f"args = {n}, {N}, {p_let}")
    print(f"p_trial_log={p_trial_log:.1e}")
    print(f"p_trial=    {p_trial:.4e}")
    print(f"p_trial_exp={exp(p_trial_log):.4e}")
    print("")
Output:
args = 42, 100000, 4e-10
p_trial_log=-5.6e+02
p_trial= 1.2691e-242
p_trial_exp=1.2691e-242
args = 43, 100000, 4e-10
p_trial_log=-5.7e+02
p_trial= 1.1532e-248
p_trial_exp=1.1532e-248
args = 44, 100000, 4e-10
p_trial_log=-5.8e+02
p_trial= 1.0246e-254
p_trial_exp=1.0246e-254
args = 45, 100000, 4e-10
p_trial_log=-6.0e+02
p_trial= 8.9059e-261
p_trial_exp=8.9059e-261
args = 46, 100000, 4e-10
p_trial_log=-6.1e+02
p_trial= 7.5760e-267
p_trial_exp=7.5760e-267
args = 47, 100000, 4e-10
p_trial_log=-6.3e+02
p_trial= 6.3104e-273
p_trial_exp=6.3104e-273
args = 48, 100000, 4e-10
p_trial_log=-6.4e+02
p_trial= 5.1488e-279
p_trial_exp=5.1488e-279
args = 49, 100000, 4e-10
p_trial_log=-6.5e+02
p_trial= 4.1171e-285
p_trial_exp=4.1171e-285
args = 50, 100000, 4e-10
p_trial_log=-6.7e+02
p_trial= 3.2274e-291
p_trial_exp=3.2274e-291
args = 51, 100000, 4e-10
p_trial_log=-6.8e+02
p_trial= 2.4814e-297
p_trial_exp=2.4814e-297
args = 52, 100000, 4e-10
p_trial_log=-7.0e+02
p_trial= 1.8718e-303
p_trial_exp=1.8718e-303
args = 53, 100000, 4e-10
p_trial_log=-7.1e+02
p_trial= 1.3858e-309
p_trial_exp=1.3858e-309
args = 54, 100000, 4e-10
p_trial_log=-7.3e+02
p_trial= 1.0073e-315
p_trial_exp=1.0073e-315
args = 55, 100000, 4e-10
p_trial_log=-7.4e+02
p_trial= 7.2134e-322
p_trial_exp=7.2134e-322
args = 56, 100000, 4e-10
p_trial_log=-inf
p_trial= 0.0000e+00
p_trial_exp=0.0000e+00
args = 57, 100000, 4e-10
p_trial_log=-inf
p_trial= 0.0000e+00
p_trial_exp=0.0000e+00
args = 58, 100000, 4e-10
p_trial_log=-inf
p_trial= 0.0000e+00
p_trial_exp=0.0000e+00
args = 59, 100000, 4e-10
p_trial_log=-inf
p_trial= 0.0000e+00
p_trial_exp=0.0000e+00
args = 60, 100000, 4e-10
p_trial_log=-inf
p_trial= 0.0000e+00
p_trial_exp=0.0000e+00
args = 61, 100000, 4e-10
p_trial_log=-inf
p_trial= 0.0000e+00
p_trial_exp=0.0000e+00
It depends on the distribution. logsf() is implemented within either rv_discrete or rv_continuous, depending on whether you are dealing with a discrete or continuous distribution. logsf() does some error checking, then calls _logsf() to actually calculate the probability.

In some SciPy distributions, _logsf() is not implemented. In those cases, the superclass implements it by calling the distribution-specific _sf() (which is defined for all distributions) and taking the logarithm of the result. For very small probabilities, this runs into floating-point underflow.

For other distributions, the risk of underflow has been recognized and mitigated by an alternate calculation. For example, the normal distribution has a _logsf() implementation which avoids taking the logarithm of the survival function. Here’s what that looks like.
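That contrast is easy to see numerically; a minimal sketch (the printed values are approximate):

```python
from scipy.stats import norm

# norm has a dedicated _logsf (based on scipy.special.log_ndtr),
# so it stays accurate far past the point where sf() underflows.
print(norm.sf(50))     # underflows to 0.0 (true value is about e**-1254.8)
print(norm.logsf(50))  # finite: roughly -1254.8
```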
Because the class for the binomial distribution lacks a _logsf() method, it instead calls _sf() and takes the logarithm of that value.

Finally, patches are welcome. If you know a way of calculating the logged survival function for a distribution with better accuracy than what is currently done, you should open an issue with SciPy.
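In the meantime, one possible workaround for the binomial case is to stay in log-space yourself: binom.logpmf is computed directly from gammaln (not as log of the pmf), so summing it with logsumexp avoids the underflow. The helper logsf_logspace below is hypothetical, not part of SciPy:

```python
import numpy as np
from scipy.stats import binom
from scipy.special import logsumexp

def logsf_logspace(n, N, p):
    """Hypothetical helper: log P(X > n), computed entirely in log-space.

    Sums the log-pmf over k = n+1 .. N with logsumexp, so the result
    stays finite even where binom.sf() underflows to 0.
    """
    k = np.arange(n + 1, N + 1)
    return logsumexp(binom.logpmf(k, N, p))

N, p_let = 100000, 4e-10
print(binom.logsf(55, N, p_let), logsf_logspace(55, N, p_let))  # both finite, ~ -739
print(binom.logsf(61, N, p_let), logsf_logspace(61, N, p_let))  # -inf vs. a finite value
```

This is O(N) per call, so it is only a stopgap for moderate N, but it reproduces logsf wherever sf is still representable and keeps returning finite values beyond that.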
Tracing through the docs and [source], for class rv_generic, _logsf is related to _sf by

def _logsf(self, x, *args):
    with np.errstate(divide='ignore'):
        return log(self._sf(x, *args))

See also the source of logsf for rv_generic, which does some additional input processing but ends up calling the generic _logsf above.
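So for binom, logsf is literally log(sf); a quick check, reusing the question's arguments:

```python
import numpy as np
from scipy.stats import binom

# Reproduce the generic fallback by hand and compare with binom.logsf.
with np.errstate(divide='ignore'):
    generic = np.log(binom.sf(61, 100000, 4e-10))

# The two agree for these arguments (here both are -inf).
print(generic, binom.logsf(61, 100000, 4e-10))
```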