Note
This page was generated from docs/source/notebooks/tutorial.ipynb. An interactive version of this notebook can be run using Binder.
Note that if you wish to run this notebook on your own, you’ll need to have matplotlib and seaborn installed.
PyUnfold Tutorial¶
PyUnfold uses the iterative_unfold function to perform iterative unfoldings. In addition, we'll use the Logger callback (see the Callbacks documentation for more information) to display the unfolding status at each iteration.
In [1]:
from pyunfold import iterative_unfold
from pyunfold.callbacks import Logger
In [2]:
import numpy as np
np.random.seed(2)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [3]:
sns.set_context(context='poster')
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['lines.markeredgewidth'] = 2
True distribution¶
Before we can perform an unfolding, we'll need some true and observed data. In this tutorial, we'll use the numpy.random module to generate some example data. Let's assume that we are measuring some observable, \(X\), that is given by a normal (Gaussian) distribution.
In [4]:
num_samples = int(1e5)
true_samples = np.random.normal(loc=0.0, scale=1.0, size=num_samples)
true_samples
Out[4]:
array([-0.41675785, -0.05626683, -2.1361961 , ..., 2.60578854,
-0.21578792, -0.89995693])
These samples can then be binned to form a histogram of the true distribution.
In [5]:
bins = np.linspace(-3, 3, 21)
num_bins = len(bins) - 1
In [6]:
data_true, _ = np.histogram(true_samples, bins=bins)
data_true
Out[6]:
array([ 218, 478, 977, 1814, 3058, 4898, 6878, 9226, 10712,
11782, 11789, 10814, 8886, 6892, 4759, 3013, 1804, 1009,
497, 228])
The true distribution of \(X\) looks like:
In [7]:
fig, ax = plt.subplots()
ax.step(np.arange(num_bins), data_true, where='mid', lw=3,
alpha=0.7, label='True distribution')
ax.set(xlabel='X bins', ylabel='Counts')
ax.legend()
plt.show()
Observed distribution¶
Generally speaking, measurement devices aren't perfect. This means that the measured value of \(X\) isn't always exactly the same as the true value. That is, our measurement device has some inherent bias and finite resolution.
We’ll model these detection imperfections by adding some random Gaussian noise to our true values of \(X\) to get our observed (measured) values.
In [8]:
random_noise = np.random.normal(loc=0.3, scale=0.5, size=num_samples)
observed_samples = true_samples + random_noise
The observed values can be binned into a histogram. This array, data_observed, is the distribution we want to unfold.
In [9]:
data_observed, _ = np.histogram(observed_samples, bins=bins)
data_observed
Out[9]:
array([ 201, 423, 805, 1409, 2361, 3546, 5203, 7122, 8360,
9842, 10669, 10602, 9908, 8474, 6925, 5054, 3549, 2342,
1454, 759])
We can see how the true and observed distributions of \(X\) differ.
In [10]:
fig, ax = plt.subplots()
ax.step(np.arange(num_bins), data_true, where='mid', lw=3,
alpha=0.7, label='True distribution')
ax.step(np.arange(num_bins), data_observed, where='mid', lw=3,
alpha=0.7, label='Observed distribution')
ax.set(xlabel='X bins', ylabel='Counts')
ax.legend()
plt.show()
In addition to the observations themselves, the uncertainty of the observations is also needed to perform an unfolding. Here, we'll assume the errors are simply Poisson counting errors (i.e. \(\mathrm{error}_i = \sqrt{\mathrm{counts}_i}\)).
In [11]:
data_observed_err = np.sqrt(data_observed)
data_observed_err
Out[11]:
array([ 14.17744688, 20.5669638 , 28.37252192, 37.53664876,
48.59012245, 59.54829972, 72.13182377, 84.39194274,
91.43303561, 99.2068546 , 103.29085148, 102.96601381,
99.53893711, 92.05433178, 83.21658489, 71.09149035,
59.57348403, 48.39421453, 38.13135193, 27.54995463])
Detection Efficiencies¶
Not every sample from our true distribution will lead to a measured observation. The probability that a sample from our true distribution is observed is quantified via our detection efficiencies. For now, we’ll assume uniform efficiencies of 1.
In [12]:
efficiencies = np.ones_like(data_observed, dtype=float)
efficiencies
Out[12]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1.])
Likewise, let’s assume an uncertainty on our efficiencies of 0.1.
In [13]:
efficiencies_err = np.full_like(efficiencies, 0.1, dtype=float)
efficiencies_err
Out[13]:
array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
Response matrix¶
The next step is to construct a response matrix that encapsulates our detector bias and resolution. This can be done by making a 2-dimensional histogram of the observed vs. true samples.
In [14]:
response_hist, _, _ = np.histogram2d(observed_samples, true_samples, bins=bins)
As with the uncertainties on our observed data, we’ll assume Poisson counting errors for our response matrix.
In [15]:
response_hist_err = np.sqrt(response_hist)
In [16]:
fig, ax = plt.subplots()
im = ax.imshow(response_hist, origin='lower')
cbar = plt.colorbar(im, label='Counts')
ax.set(xlabel='Cause bins', ylabel='Effect bins')
plt.show()
However, there's still one more step for the response matrix. What's needed for unfolding is not this 2-d counts histogram, but a matrix of conditional probabilities, \(P(E_i|C_j)\). This matrix, called the normalized response matrix, is obtained by normalizing our 2-d counts matrix. That is, scale each column of response_hist such that it sums to the corresponding detection efficiency.
This normalization factor can be easily found via:
In [17]:
column_sums = response_hist.sum(axis=0)
normalization_factor = efficiencies / column_sums
In [18]:
response = response_hist * normalization_factor
response_err = response_hist_err * normalization_factor
As a check, we can verify that the columns of response sum to our detection efficiencies.
In [19]:
response.sum(axis=0)
Out[19]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1.])
We can see how the structure of the 2-d response histogram differs from the normalized response matrix.
In [20]:
fig, ax = plt.subplots()
im = ax.imshow(response, origin='lower')
cbar = plt.colorbar(im, label=r'$P(E_i|C_j)$')
ax.set(xlabel='Cause bins', ylabel='Effect bins',
       title='Normalized response matrix')
plt.show()
This difference in structure arises because each column of the normalized response matrix has been scaled by the number of samples in that column. So only the relative structure of each column of the 2-d response histogram carries over to the normalized response matrix.
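Since our efficiencies are all 1 here, this normalization amounts to dividing each column by its sum. As a quick sanity check (a small sketch, not part of the original notebook), we can verify this directly:

# With unit efficiencies, the normalized response matrix equals each column
# of response_hist divided by that column's sum, so only the relative
# within-column structure survives the normalization.
np.testing.assert_allclose(response, response_hist / response_hist.sum(axis=0))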
Iterative unfolding¶
Now we have everything we need to perform our unfolding:
- Observed distribution to unfold (with uncertainties): data_observed, data_observed_err
- Detection efficiencies (with uncertainties): efficiencies, efficiencies_err
- Normalized response matrix (with uncertainties): response, response_err

These are all input parameters to the PyUnfold iterative_unfold function.
In [21]:
unfolded_results = iterative_unfold(data=data_observed,
data_err=data_observed_err,
response=response,
response_err=response_err,
efficiencies=efficiencies,
efficiencies_err=efficiencies_err,
callbacks=[Logger()])
Iteration 1: ts = 0.1405, ts_stopping = 0.01
Iteration 2: ts = 0.0370, ts_stopping = 0.01
Iteration 3: ts = 0.0122, ts_stopping = 0.01
Iteration 4: ts = 0.0045, ts_stopping = 0.01
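The unfolding iterates until the test statistic (ts) comparing successive unfolded distributions drops below the stopping condition. If a stricter convergence criterion is desired, iterative_unfold also accepts ts and ts_stopping keyword arguments (check the PyUnfold API documentation for the options available in your release). A sketch with otherwise identical inputs:

# Sketch: tighten the stopping condition (and explicitly select the default
# Kolmogorov-Smirnov test statistic) via the ts and ts_stopping keywords.
unfolded_strict = iterative_unfold(data=data_observed,
                                   data_err=data_observed_err,
                                   response=response,
                                   response_err=response_err,
                                   efficiencies=efficiencies,
                                   efficiencies_err=efficiencies_err,
                                   ts='ks',
                                   ts_stopping=0.001)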
What's returned from iterative_unfold is a dictionary that contains:
- Unfolded distribution (unfolded key)
- Statistical uncertainties on the unfolded distribution (stat_err key)
- Systematic uncertainties on the unfolded distribution based on the response matrix uncertainties (sys_err key)
- Number of unfolding iterations to reach the test statistic stopping condition (num_iterations key)
- Test statistic value for the final iteration (ts_iter key)
- Test statistic stopping condition (ts_stopping key)
- Unfolding matrix (unfolding_matrix key)
In [22]:
unfolded_results.keys()
Out[22]:
dict_keys(['num_iterations', 'stat_err', 'sys_err', 'ts_iter', 'ts_stopping', 'unfolded', 'unfolding_matrix'])
In [23]:
unfolded_results['unfolded']
Out[23]:
array([ 310.86521388, 510.24984975, 975.93115866, 1820.17860556,
3102.11778019, 4879.85122145, 7015.52494748, 8981.07735716,
10707.88612751, 11730.53337574, 11728.54653455, 10756.91277901,
8984.92352888, 6837.92414987, 4606.18470207, 2858.77526528,
1512.64260778, 868.22035667, 500.28580382, 319.36863468])
In [24]:
unfolded_results['sys_err']
Out[24]:
array([ 46.20897611, 65.68431793, 120.33554668, 201.02566916,
300.50353578, 396.6598952 , 487.03457737, 534.88778388,
579.5283416 , 592.75528923, 582.1886126 , 580.96624808,
549.87854794, 493.35713996, 405.72190552, 300.92531524,
186.67615767, 118.274722 , 73.06780024, 51.66847962])
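PyUnfold returns the statistical and systematic uncertainties separately. If a single total uncertainty per bin is wanted, one common convention (our choice here, not something PyUnfold prescribes) is to add the two in quadrature:

# Combine statistical and systematic uncertainties in quadrature
# (a common convention; PyUnfold itself keeps them separate).
total_err = np.sqrt(unfolded_results['stat_err']**2
                    + unfolded_results['sys_err']**2)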
Finally, we can see how the true, observed, and unfolded distributions compare to one another.
In [25]:
fig, ax = plt.subplots()
ax.step(np.arange(num_bins), data_true, where='mid', lw=3,
alpha=0.7, label='True distribution')
ax.step(np.arange(num_bins), data_observed, where='mid', lw=3,
alpha=0.7, label='Observed distribution')
ax.errorbar(np.arange(num_bins), unfolded_results['unfolded'],
yerr=unfolded_results['sys_err'],
alpha=0.7,
elinewidth=3,
capsize=4,
ls='None', marker='.', ms=10,
label='Unfolded distribution')
ax.set(xlabel='X bins', ylabel='Counts')
plt.legend()
plt.show()
Note that the unfolded distribution is consistent with the true distribution!
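To go beyond the by-eye comparison (a minimal sketch, not part of the original notebook), we can count how many bins of the true distribution fall within the unfolded systematic uncertainty band:

# Count the bins where the true counts lie within the quoted systematic
# uncertainty of the unfolded distribution.
within_band = np.abs(unfolded_results['unfolded'] - data_true) <= unfolded_results['sys_err']
print(f'{within_band.sum()} of {num_bins} bins agree within uncertainties')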