Note

This page was generated from docs/source/notebooks/multivariate.ipynb. An interactive version of this notebook can be run using Binder by clicking Binder badge

Note that if you wish to run this notebook on your own, you’ll need to have matplotlib and seaborn installed.

Multivariate unfolding

In [1]:
from pyunfold import iterative_unfold
from pyunfold.callbacks import Logger
In [2]:
import numpy as np
np.random.seed(2)
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
In [3]:
sns.set_context(context='poster')
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['lines.markeredgewidth'] = 2

Example dataset

We’ll generate the same example dataset that is used in the Getting Started tutorial, i.e. a Gaussian sample that is smeared by some noise.

In [4]:
# True distribution
num_samples = int(1e5)
true_samples = np.random.normal(loc=10.0, scale=4.0, size=num_samples)
bins = np.linspace(0, 20, 21)
num_causes = len(bins) - 1
data_true, _ = np.histogram(true_samples, bins=bins)

# Observed distribution
random_noise = np.random.normal(loc=0.3, scale=0.5, size=num_samples)
observed_samples = true_samples + random_noise
data_observed, _ = np.histogram(observed_samples, bins=bins)
data_observed_err = np.sqrt(data_observed)

# Efficiencies
efficiencies = np.ones_like(data_observed, dtype=float)
efficiencies_err = np.full_like(efficiencies, 0.1, dtype=float)

# Response matrix
response_hist, _, _ = np.histogram2d(observed_samples, true_samples, bins=bins)
response_hist_err = np.sqrt(response_hist)

# Normalized response
column_sums = response_hist.sum(axis=0)
normalization_factor = efficiencies / column_sums

response = response_hist * normalization_factor
response_err = response_hist_err * normalization_factor

We can see what the true and observed distributions look like for this example dataset

In [5]:
fig, ax = plt.subplots()
ax.step(np.arange(len(data_true)), data_true, where='mid', lw=3,
        alpha=0.7, label='True distribution')
ax.step(np.arange(len(data_observed)), data_observed, where='mid', lw=3,
        alpha=0.7, label='Observed distribution')
ax.set(xlabel='Cause bins', ylabel='Counts')
ax.legend()
plt.show()
../_images/notebooks_multivariate_7_0.png

as well as the normalized response matrix

In [6]:
fig, ax = plt.subplots()
im = ax.imshow(response, origin='lower')
cbar = plt.colorbar(im, label='$P(E_i|C_{\mu})$')
ax.set(xlabel='Cause bins', ylabel='Effect bins',
       title='Normalized response matrix')
plt.show()
../_images/notebooks_multivariate_9_0.png

Cause Groups

To illustrate the generalization to multivariate unfolding, we consider a set of effects originating from two different cause types having their own ranges. We can assign a new superscript \(i\) to denote these types:

\[C_{\mu}^{i} \text{ where } i \in [1,2]\]

and the subscript \(\mu\) runs over the respective number of causes in each type \(\, n_{C1}\) and \(\, n_{C2}\).

But since the PyUnfold doesn’t care how we label the bins, we can simply redefine our cause index \(\mu\) to run over a larger index range. Hence,

\[C_{\mu}^i \rightarrow C_{\mu} \text{ where } \mu \in [1, \, n_{C1} + n_{C2}]\]

Thus, given a general multidimensional (\(i>1\)) response matrix, we can effectively unroll it onto a two dimensional array and still use PyUnfold.

Example—two identical cause groups

Here we use two identical cause groups, simply copying our original response matrix along the cause axis.

In [7]:
# Response with two groups
response_hist_groups = np.concatenate((response_hist, response_hist), axis=1)
response_hist_groups_err = np.sqrt(response_hist_groups)

# Efficiencies with two groups
efficiencies_groups = np.ones(response_hist_groups.shape[1], dtype=float)
efficiencies_groups_err = np.full_like(efficiencies_groups, 0.1, dtype=float)
num_causes = response_hist_groups.shape[1]

# Scale by efficiency
column_sums_groups = response_hist_groups.sum(axis=0)
normalization_factor_groups = efficiencies_groups / column_sums_groups

# Response matrix with two groups
response_groups = response_hist_groups * normalization_factor_groups
response_groups_err = response_hist_groups_err * normalization_factor_groups
In [8]:
fig, ax = plt.subplots(figsize=(16, 8))
im = ax.imshow(response_groups, origin='lower')
plt.colorbar(im, label='$P(E_i|C_{\mu})$')
ax.set(xlabel='Cause bins', ylabel='Effect bins',
       title='Normalized response matrix - 2 groups')
plt.show()
../_images/notebooks_multivariate_13_0.png

Now the two (identical) cause groups are unrolled clearly along the abscissa. Since the unfolding method is cause agnostic, we can perform an unfolding, remembering that we’ve kept the same observed data.

In [9]:
unfolded_results_groups = iterative_unfold(data=data_observed,
                                           data_err=data_observed_err,
                                           response=response_groups,
                                           response_err=response_groups_err,
                                           efficiencies=efficiencies_groups,
                                           efficiencies_err=efficiencies_groups_err,
                                           ts='ks',
                                           ts_stopping=0.01,
                                           callbacks=[Logger()])
Iteration 1: ts = 0.0730, ts_stopping = 0.01
Iteration 2: ts = 0.0030, ts_stopping = 0.01
In [10]:
fig, ax = plt.subplots()
ax.errorbar(np.arange(num_causes), unfolded_results_groups['unfolded'],
            yerr=unfolded_results_groups['sys_err'],
            alpha=0.7,
            elinewidth=3,
            capsize=4,
            ls='None', marker='.', ms=10)
ax.set(xlabel='Cause bins', ylabel='Counts',
        title='Group Unfolding')
plt.show()
../_images/notebooks_multivariate_16_0.png

So the result is two equal copies of the causes. This makes sense because in our example we have simply considered two identical groups of causes, so they should contribute identically to producing the measured effects.

User Priors with Groups

What if we want to use the a user-defined prior (e.g. Jeffreys’ prior)? In this case, we have to setup the prior input to contain the priors we want for each group.

In [11]:
from pyunfold.priors import jeffreys_prior
In [12]:
# Cause limits
cause_lim = np.logspace(0, 3, num_causes // 2)
In [13]:
# Setup group Jeffreys' prior
prior_jeff_groups = np.concatenate([jeffreys_prior(cause_lim), jeffreys_prior(cause_lim)])
prior_jeff_groups = prior_jeff_groups / prior_jeff_groups.sum()
In [14]:
fig, ax = plt.subplots()
ax.step(np.arange(num_causes), prior_jeff_groups, where='mid', lw=3,
        alpha=0.8)
ax.set(xlabel='Cause bins', ylabel='$P(C_{\mu})$',
       title='Group Jeffreys Priors')
ax.set_yscale('log')
plt.show()
../_images/notebooks_multivariate_22_0.png
In [15]:
unfolded_results_groups_jeff = iterative_unfold(data=data_observed,
                                                data_err=data_observed_err,
                                                response=response_groups,
                                                response_err=response_groups_err,
                                                efficiencies=efficiencies_groups,
                                                efficiencies_err=efficiencies_groups_err,
                                                prior=prior_jeff_groups,
                                                ts='ks',
                                                ts_stopping=0.01,
                                                callbacks=[Logger()])
Iteration 1: ts = 0.3615, ts_stopping = 0.01
Iteration 2: ts = 0.0085, ts_stopping = 0.01
In [16]:
fig, ax = plt.subplots()
plt.errorbar(np.arange(num_causes), unfolded_results_groups_jeff['unfolded'],
             yerr=unfolded_results_groups_jeff['sys_err'],
             alpha=0.7,
             elinewidth=3,
             capsize=4,
             ls='None', marker='.', ms=10)

ax.set(xlabel='Cause bins', ylabel='Counts',
       title='Group Unfolding - Jeffreys Prior')
plt.show()
plt.show()
../_images/notebooks_multivariate_24_0.png

Again, we recover two copies of the causes.

Regularization with Groups

In general, groups of causes do not share continuity of the cause axis. In the above examples, the cause arrays are stacked next to each other, so while the unfolding method doesn’t care about the cause definitions, the default regularization smooths over all cause bins without regard to types.

PyUnfold implements group regularization, where smoothing of the unfolded distributions is performed only within designated cause types (i.e. each cause type is regularized independently). This is done by providing the SplineRegularizer function a groups list, defining a group identification for each cause bin.

In [17]:
from pyunfold.callbacks import SplineRegularizer
In [18]:
# Here we know we have two copies of the causes
n_c1 = response_groups.shape[1] // 2
n_c2 = response_groups.shape[1] // 2

groups = [0]*n_c1 + [1]*n_c2
print("Group definitions: \n{}".format(groups))
Group definitions:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
In [19]:
group_reg = SplineRegularizer(degree=3, smooth=1.25, groups=groups)
In [20]:
unfolded_results_groups_reg = iterative_unfold(data=data_observed,
                                               data_err=data_observed_err,
                                               response=response_groups,
                                               response_err=response_groups_err,
                                               efficiencies=efficiencies_groups,
                                               efficiencies_err=efficiencies_groups_err,
                                               ts='ks',
                                               ts_stopping=0.01,
                                               callbacks=[Logger(), group_reg])
Iteration 1: ts = 0.0730, ts_stopping = 0.01
Iteration 2: ts = 0.0030, ts_stopping = 0.01
In [21]:
fig, ax = plt.subplots()
ax.errorbar(np.arange(num_causes), unfolded_results_groups_reg['unfolded'],
            yerr=unfolded_results_groups_reg['sys_err'],
            alpha=0.7,
            elinewidth=3,
            capsize=4,
            ls='None', marker='.', ms=10)

ax.set(xlabel='Cause bins', ylabel='Counts',
       title='Group Unfolding - regularized')
plt.show()
../_images/notebooks_multivariate_31_0.png

Again, for this simple example, everything looks fine.

However, if we consider a more complicated example, we’ll see the power of these implementations.

Pitfalls of multivariate unfolding

Let’s generate some more example data by keeping our two-type response matrix.

In [22]:
# True distribution
group_data_true = np.concatenate((np.linspace(10, 100, n_c1), np.linspace(1, 10, n_c2)))

# Observed data, no smearing
group_data_observed = response_groups.dot(group_data_true)
group_data_observed_err = np.sqrt(group_data_observed)
In [23]:
fig, (ax_1, ax_2) = plt.subplots(ncols=2, figsize=(20, 8))
# Cause distribution
ax_1.step(np.arange(num_causes), group_data_true, where='mid', lw=3,
         alpha=0.7, label='True distribution')
ax_1.set(xlabel='Cause bins', ylabel='Counts',
         title='Cause Distribution')

# Effects distribution
ax_2.step(np.arange(len(group_data_observed)), group_data_observed, where='mid', lw=3,
          alpha=0.7, label='Observed distribution')
ax_2.set(xlabel='Effect bins', ylabel='Counts',
         title='Effects Distribution')

plt.show()
../_images/notebooks_multivariate_35_0.png
In [24]:
# Setup all cause spline
spline_reg = SplineRegularizer(degree=3, smooth=5e6)

# Setup group spline
group_reg = SplineRegularizer(degree=3, smooth=1.25, groups=groups)
In [25]:
unfolded_results_groups_default_reg = iterative_unfold(data=group_data_observed,
                                                       data_err=group_data_observed_err,
                                                       response=response_groups,
                                                       response_err=response_groups_err,
                                                       efficiencies=efficiencies_groups,
                                                       efficiencies_err=efficiencies_groups_err,
                                                       ts='ks',
                                                       ts_stopping=0.01,
                                                       callbacks=[Logger(), spline_reg])
Iteration 1: ts = 0.0919, ts_stopping = 0.01
Iteration 2: ts = 0.0479, ts_stopping = 0.01
Iteration 3: ts = 0.0246, ts_stopping = 0.01
Iteration 4: ts = 0.0125, ts_stopping = 0.01
Iteration 5: ts = 0.0064, ts_stopping = 0.01
In [26]:
unfolded_results_groups_group_reg = iterative_unfold(data=group_data_observed,
                                                     data_err=group_data_observed_err,
                                                     response=response_groups,
                                                     response_err=response_groups_err,
                                                     efficiencies=efficiencies_groups,
                                                     efficiencies_err=efficiencies_groups_err,
                                                     ts='ks',
                                                     ts_stopping=0.01,
                                                     callbacks=[Logger(), group_reg])
Iteration 1: ts = 0.1070, ts_stopping = 0.01
Iteration 2: ts = 0.0011, ts_stopping = 0.01
In [27]:
fig, ax = plt.subplots()
ax.step(np.arange(num_causes), group_data_true, where='mid', lw=3,
        alpha=0.7, label='True Distribution')

ax.errorbar(np.arange(num_causes), unfolded_results_groups_group_reg['unfolded'],
            yerr=unfolded_results_groups_group_reg['sys_err'],
            alpha=0.7,
            elinewidth=3,
            capsize=4,
            ls='None', marker='.', ms=10,
            label='Group Regularization')

ax.errorbar(np.arange(num_causes), unfolded_results_groups_default_reg['unfolded'],
            yerr=unfolded_results_groups_default_reg['sys_err'],
            alpha=0.7,
            elinewidth=3,
            capsize=4,
            ls='None', marker='.', ms=10,
            label='Default Regularization')

ax.set(xlabel='Cause bins', ylabel='Counts',
       title='Group Unfolding')
ax.legend()
plt.show()
../_images/notebooks_multivariate_39_0.png

So, what’s going on here? There are actually two problems here.

  1. It’s clear that the default regularization tries to connect the two cause groups in a smooth manner.
  2. And while the group regularization is doing its best to smooth each group individually, it’s clear we are getting copies again. This is due to the prior assumption that all causes have equal probabilities.

Let’s redo this by giving a strong preference in the prior to one of the groups.

In [28]:
# Flat prior to each group, but one has stronger preference.
flat_pref_prior = np.concatenate((8.*np.ones(n_c1), np.ones(n_c1)))
flat_pref_prior = flat_pref_prior / flat_pref_prior.sum()
In [29]:
unfolded_results_groups_group_reg_prior = iterative_unfold(data=group_data_observed,
                                                           data_err=group_data_observed_err,
                                                           response=response_groups,
                                                           response_err=response_groups_err,
                                                           efficiencies=efficiencies_groups,
                                                           efficiencies_err=efficiencies_groups_err,
                                                           prior=flat_pref_prior,
                                                           ts='ks',
                                                           ts_stopping=0.01,
                                                           callbacks=[Logger(), group_reg])
Iteration 1: ts = 0.1900, ts_stopping = 0.01
Iteration 2: ts = 0.0018, ts_stopping = 0.01
In [30]:
fig, ax = plt.subplots()
ax.step(np.arange(num_causes), group_data_true, where='mid', lw=3,
        alpha=0.7, label='True Distribution')

ax.errorbar(np.arange(num_causes), unfolded_results_groups_group_reg_prior['unfolded'],
            yerr=unfolded_results_groups_group_reg_prior['sys_err'],
            alpha=0.7,
            elinewidth=3,
            capsize=4,
            ls='None', marker='.', ms=10,
            label='Group Regularization')

ax.set(xlabel='Cause bins', ylabel='Counts',
       title='Group Unfolding')
ax.legend()
plt.show()
../_images/notebooks_multivariate_43_0.png

Now we’re starting to get something that looks right.

However, we are still using two identical copies of the response matrix which means that we’re considering the same sets of causes, just split up into two groups.

This also demonstrates another potential issue with doing group unfolding: if there is high degree of degeneracy between the respective response functions, then the unfolding results will be strongly dependent on intial priors.

The solution to this is to ensure that the cause group response matrices have different structure, for example by having different normalizations (efficiency). We demonstrate this below to finish.

In [31]:
# Response with two groups having different structures
n_c1 = len(response_hist)
n_c2 = 5
response_hist_groups = np.concatenate((response_hist, response_hist[:,0:n_c2]), axis=1)
response_hist_groups_err = np.sqrt(response_hist_groups)

# Efficiencies with two groups
efficiencies_groups = np.ones(response_hist_groups.shape[1], dtype=float)
efficiencies_groups_err = np.full_like(efficiencies_groups, 0.1, dtype=float)
# Make the second group 25% efficient
efficiencies_groups[0:n_c1] *= 1
efficiencies_groups[n_c1:-1] *= 0.25
efficiencies_groups_err = np.full_like(efficiencies_groups, 0.1, dtype=float)

# Scale by efficiency
column_sums_groups = response_hist_groups.sum(axis=0)
normalization_factor_groups = efficiencies_groups / column_sums_groups

# Response matrix with two groups
response_groups = response_hist_groups * normalization_factor_groups
response_groups_err = response_hist_groups_err * normalization_factor_groups
In [32]:
groups = [0]*n_c1 + [1]*n_c2
print("Group definitions: {}".format(groups))
Group definitions: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
In [33]:
group_reg = SplineRegularizer(degree=3, smooth=1.25, groups=groups)
In [34]:
fig, ax = plt.subplots(figsize=(16, 8))
im = ax.imshow(response_groups, origin='lower')
plt.colorbar(im, label='$P(E_i|C_{\mu})$')
ax.set(xlabel='Cause bins', ylabel='Effect bins',
       title='Normalized response matrix - groups')
plt.show()
../_images/notebooks_multivariate_48_0.png

Regenerate some example data

In [35]:
# True distribution
group_data_true = np.concatenate((np.linspace(10, 100, n_c1), np.linspace(1, 10, n_c2)))
num_causes = len(group_data_true)

# Observed data, no smearing
group_data_observed = response_groups.dot(group_data_true)
group_data_observed_err = np.sqrt(group_data_observed)
In [36]:
unfolded_results_groups_group_reg = iterative_unfold(data=group_data_observed,
                                                     data_err=group_data_observed_err,
                                                     response=response_groups,
                                                     response_err=response_groups_err,
                                                     efficiencies=efficiencies_groups,
                                                     efficiencies_err=efficiencies_groups_err,
                                                     ts='ks',
                                                     ts_stopping=0.01,
                                                     callbacks=[Logger(), group_reg])
Iteration 1: ts = 0.1548, ts_stopping = 0.01
Iteration 2: ts = 0.0021, ts_stopping = 0.01
In [37]:
# Extend range of bins
fig, ax = plt.subplots()
ax.step(np.arange(num_causes), group_data_true, where='mid', lw=3,
        alpha=0.7, label='True Distribution')

ax.errorbar(np.arange(num_causes), unfolded_results_groups_group_reg['unfolded'],
            yerr=unfolded_results_groups_group_reg['sys_err'],
            alpha=0.7,
            elinewidth=3,
            capsize=4,
            ls='None', marker='.', ms=10,
            label='Group Regularization')

ax.set(xlabel='Cause bins', ylabel='Counts',
       title='Group Unfolding')
ax.legend()
plt.show()
../_images/notebooks_multivariate_52_0.png

Since the groups are differentiable in terms of response structure and efficiencies, we obtain reasonable unfolding results even using the default uniform prior across all bins!

This simply illustrates the need for proper understanding of one’s data.