comptools API

comptools.LDFfunctions module

comptools.LDFfunctions.DLP(dists, log_s125, beta)[source]

Double Logarithmic Parabola (DLP) function to parameterize air showers

For a reference, see IceCube internal report 200702001, ‘A Lateral Distribution Function and Fluctuation Parametrisation for IceTop’ by Stefan Klepser.

Parameters:
dists : float, array-like

Tank distance(s), in meters, from the shower core.

log_s125 : float

Base-10 logarithm of the signal deposited 125 m from the shower core.

beta : float

Slope of the DLP function. Related to the air shower age.

Returns:
float, numpy.array

Returns the base-10 logarithm of the signal deposited at distance(s) dists from the shower core.
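
Examples

A minimal usage sketch (the dists, log_s125, and beta values below are illustrative, not from a real shower):

>>> import numpy as np
>>> from comptools.LDFfunctions import DLP
>>> dists = np.array([50.0, 125.0, 300.0])
>>> log_signal = DLP(dists, log_s125=0.5, beta=3.0)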

comptools.LDFfunctions.fit_DLP_params(charges, distances, lap_log_s125, lap_beta)[source]
comptools.LDFfunctions.top_ldf_sigma(r, logq)[source]

comptools.base module

exception comptools.base.ComputingEnvironemtError[source]

Bases: exceptions.Exception

Custom exception that should be raised when a problem related to the computing environment is found

comptools.base.check_output_dir(outfile, makedirs=True)[source]

Function to check if the directory for an output file exists

This function checks whether the directory containing the specified outfile exists. If the output directory doesn’t exist, it will be created when makedirs is True; otherwise this function will raise an IOError.

Parameters:
outfile : str

Path to output file.

makedirs : bool, optional

Option to create the output directory containing the output file if it doesn’t already exist (default is True).

Returns:
None
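
Examples

A minimal usage sketch (the output path is hypothetical):

>>> from comptools.base import check_output_dir
>>> check_output_dir('/data/user/example/processed/data.hdf', makedirs=True)
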
comptools.base.get_config_paths()[source]

Function to return paths used in this analysis

Specifically,

metaproject : Path to the IceCube metaproject being used
comp_data_dir : Path to where data and simulation are stored
condor_data_dir : Path to where HTCondor error and output files are stored
condor_scratch_dir : Path to where HTCondor log and submit files are stored
figures_dir : Path to where figures are saved
project_root : Path to where the cr-composition project is located

Returns:
paths : collections.namedtuple

Namedtuple containing relevant paths (e.g. figures_dir is where figures will be saved, condor_data_dir is where data/simulation will be saved to / loaded from, etc.).
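
Examples

A minimal usage sketch showing attribute access on the returned namedtuple:

>>> from comptools.base import get_config_paths
>>> paths = get_config_paths()
>>> figures_dir = paths.figures_dir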

comptools.base.get_energybins(config='IC86.2012')[source]
comptools.base.get_training_features(feature_list=None)[source]
comptools.base.partition(seq, size, max_batches=None)[source]

Generates partitions of length size from the iterable seq

Parameters:
seq : iterable

Iterable object to be partitioned.

size : int

Number of items to have in each partition.

max_batches : int, optional

Limit the number of partitions to yield (default is to yield all partitions).

Yields:
batch : tuple

Partition of seq that is (at most) size items long.

Examples

>>> from comptools import partition
>>> list(partition(range(10), 3))
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9,)]
comptools.base.requires_icecube(func)[source]

Decorator to wrap functions that require any icecube software
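
Examples

A minimal usage sketch (process_i3_file is a hypothetical function that needs the icecube software environment):

>>> from comptools.base import requires_icecube
>>> @requires_icecube
... def process_i3_file(filename):
...     from icecube import dataio  # only available inside an IceCube environment
...     ...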

comptools.composition_encoding module

comptools.composition_encoding.comp_to_label(composition)[source]
comptools.composition_encoding.composition_group_labels(compositions, num_groups=2)[source]
comptools.composition_encoding.decode_composition_groups(labels, num_groups=2)[source]
comptools.composition_encoding.encode_composition_groups(groups, num_groups=2)[source]
comptools.composition_encoding.get_comp_list(num_groups=2)[source]
comptools.composition_encoding.label_to_comp(label)[source]
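
Examples

A minimal usage sketch (assumes comp_to_label and label_to_comp are inverse mappings between composition-group names and integer labels):

>>> from comptools.composition_encoding import get_comp_list, comp_to_label, label_to_comp
>>> comp_list = get_comp_list(num_groups=2)
>>> labels = [comp_to_label(comp) for comp in comp_list]
>>> recovered = [label_to_comp(label) for label in labels]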

comptools.data_functions module

comptools.data_functions.averaging_error(values, errors)[source]
comptools.data_functions.get_bin_mids(bins, infvalue=None)[source]
comptools.data_functions.get_cumprob_sigma(values)[source]
comptools.data_functions.get_difference_error(errors)[source]
comptools.data_functions.get_median_std(x, y, bins)[source]

Function that returns the median and standard deviation of y binned in x, computed using scipy.stats.binned_statistic.

comptools.data_functions.get_medians(x, y, bins)[source]
comptools.data_functions.get_ratio_error(num, num_err, den, den_err)[source]
comptools.data_functions.get_resolution(x, y, bins)[source]
comptools.data_functions.get_summation_error(errors)[source]
comptools.data_functions.product_error(term1, term1_err, term2, term2_err)[source]
comptools.data_functions.ratio_error(num, num_err, den, den_err, nan_to_num=False)[source]
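
Examples

A minimal sketch of error propagation for a ratio (assumes ratio_error returns a (ratio, error) pair; the values below are made up):

>>> import numpy as np
>>> from comptools.data_functions import ratio_error
>>> num, num_err = np.array([10.0, 20.0]), np.array([1.0, 2.0])
>>> den, den_err = np.array([5.0, 8.0]), np.array([0.5, 0.8])
>>> ratio, err = ratio_error(num, num_err, den, den_err)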

comptools.datafunctions module

comptools.datafunctions.get_data_configs()[source]
comptools.datafunctions.get_level3_livetime_hist(config=None, month=None)[source]
comptools.datafunctions.get_run_list(config=None)[source]
comptools.datafunctions.it_stream(config)[source]
comptools.datafunctions.level3_data_GCD_file(config, run)[source]
comptools.datafunctions.level3_data_file_batches(config, run, size, max_batches=None)[source]

Generates level3 data file paths in batches

Parameters:
config : str

Detector configuration

run : str

Run number of the data-taking run.

size : int

Number of files in each batch.

max_batches : int, optional

Option to only yield max_batches number of file batches (default is to yield all batches)

Returns:
generator

Generator that yields batches of data files

Examples

Basic usage:

>>> from comptools.datafunctions import level3_data_file_batches
>>> list(level3_data_file_batches(config='IC86.2012', run='00122174', size=3, max_batches=2))
[('/data/ana/CosmicRay/IceTop_level3/exp/IC86.2012/2013/0413/Run00122174/Level3_IC86.2012_data_Run00122174_Subrun00000050.i3.bz2',
  '/data/ana/CosmicRay/IceTop_level3/exp/IC86.2012/2013/0413/Run00122174/Level3_IC86.2012_data_Run00122174_Subrun00000110.i3.bz2',
  '/data/ana/CosmicRay/IceTop_level3/exp/IC86.2012/2013/0413/Run00122174/Level3_IC86.2012_data_Run00122174_Subrun00000000.i3.bz2'),
 ('/data/ana/CosmicRay/IceTop_level3/exp/IC86.2012/2013/0413/Run00122174/Level3_IC86.2012_data_Run00122174_Subrun00000330.i3.bz2',
  '/data/ana/CosmicRay/IceTop_level3/exp/IC86.2012/2013/0413/Run00122174/Level3_IC86.2012_data_Run00122174_Subrun00000150.i3.bz2',
  '/data/ana/CosmicRay/IceTop_level3/exp/IC86.2012/2013/0413/Run00122174/Level3_IC86.2012_data_Run00122174_Subrun00000240.i3.bz2')]
comptools.datafunctions.level3_data_files(config=None, run=None)[source]
comptools.datafunctions.null_stream(config)[source]
comptools.datafunctions.reco_pulses()[source]
comptools.datafunctions.run_generator(config=None)[source]

comptools.effective_area module

comptools.effective_area.calculate_effective_area_vs_energy(*args, **kwargs)[source]

Calculates effective area vs. energy from simulation

Parameters:
df_sim : pandas.DataFrame

Simulation DataFrame returned from comptools.load_sim.

energy_bins : array-like

Energy bins (in GeV) that will be used for calculation.

verbose : bool, optional

Option for verbose output (default is True).

Returns:
eff_area : numpy.ndarray

Effective area for each bin in energy_bins

eff_area_error : numpy.ndarray

Statistical uncertainty on the effective area for each bin in energy_bins.

energy_midpoints : numpy.ndarray

Midpoints of energy_bins. Useful for plotting effective area versus energy.
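
Examples

A minimal usage sketch (assumes df_sim comes from comptools.load_sim with no train/test split; the energy bins are illustrative):

>>> import numpy as np
>>> from comptools import load_sim
>>> from comptools.effective_area import calculate_effective_area_vs_energy
>>> df_sim = load_sim(config='IC86.2012', test_size=0)
>>> energy_bins = np.logspace(6.0, 8.0, 21)  # GeV
>>> eff_area, eff_area_err, energy_midpoints = calculate_effective_area_vs_energy(
...     df_sim, energy_bins)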

comptools.effective_area.effective_area(df, log_energy_bins)[source]
comptools.effective_area.get_effective_area_fit(config='IC86.2012', fit_func=<function sigmoid_slant>, energy_points=None)[source]

Calculates effective area from simulation

Parameters:
config : str, optional

Detector configuration (default is IC86.2012).

Returns:
avg : float

Effective area

avg_err : float

Statistical error on effective area

comptools.effective_area.sigmoid_flat(energy, p0, p1, p2)[source]
comptools.effective_area.sigmoid_slant(energy, p0, p1, p2, p3)[source]

comptools.io module

comptools.io.add_convenience_variables(df, datatype='sim')[source]
comptools.io.apply_quality_cuts(df, datatype='sim', return_cut_dict=False, verbose=True)[source]
comptools.io.dataframe_to_X_y(df, feature_list, target='comp_target_2', drop_null=True)[source]
comptools.io.dataframe_to_array(df, columns, drop_null=True)[source]
comptools.io.load_data(df_file=None, config='IC86.2012', energy_reco=True, energy_cut_key='reco_log_energy', log_energy_min=6.0, log_energy_max=8.0, columns=None, n_jobs=1, verbose=False, compute=True, processed=True)[source]

Function to load processed data DataFrame

Parameters:
df_file : path, optional

If specified, the given path to a pandas.DataFrame will be loaded (default is None, so the file path will be determined from the datatype and config).

config : str, optional

Detector configuration (default is ‘IC86.2012’).

energy_reco : bool, optional

Option to perform energy reconstruction for each event (default is True).

energy_cut_key : str, optional

Energy key to apply energy range cuts to (default is ‘reco_log_energy’).

log_energy_min : int, float, optional

Option to set a lower limit on the reconstructed log energy in GeV (default is 6.0).

log_energy_max : int, float, optional

Option to set an upper limit on the reconstructed log energy in GeV (default is 8.0).

columns : array_like, optional

Option to specify the columns that should be in the returned DataFrame(s) (default is None, all columns are returned).

n_jobs : int, optional

Number of chunks to load in parallel (default is 1).

verbose : bool, optional

Option for verbose progress bar output (default is False).

processed : bool, optional

Whether to load processed (quality + energy cuts applied) or pre-processed data (default is True).

Returns:
pandas.DataFrame

Return a DataFrame with processed data
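
Examples

A minimal usage sketch:

>>> from comptools.io import load_data
>>> df_data = load_data(config='IC86.2012',
...                     log_energy_min=6.0,
...                     log_energy_max=8.0)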

comptools.io.load_sim(df_file=None, config='IC86.2012', test_size=0.5, energy_reco=True, energy_cut_key='reco_log_energy', log_energy_min=6.0, log_energy_max=8.0, columns=None, n_jobs=1, verbose=False, compute=True)[source]

Function to load processed simulation DataFrame

Parameters:
df_file : path, optional

If specified, the given path to a pandas.DataFrame will be loaded (default is None, so the file path will be determined from the datatype and config).

config : str, optional

Detector configuration (default is ‘IC86.2012’).

test_size : int, float, optional

Fraction or number of events to be split off into a separate testing set (default is 0.5). test_size will be passed to sklearn.model_selection.ShuffleSplit.

energy_reco : bool, optional

Option to perform energy reconstruction for each event (default is True).

energy_cut_key : str, optional

Energy key to apply energy range cuts to (default is ‘reco_log_energy’).

log_energy_min : int, float, optional

Option to set a lower limit on the reconstructed log energy in GeV (default is 6.0).

log_energy_max : int, float, optional

Option to set an upper limit on the reconstructed log energy in GeV (default is 8.0).

columns : array_like, optional

Option to specify the columns that should be in the returned DataFrame(s) (default is None, all columns are returned).

n_jobs : int, optional

Number of chunks to load in parallel (default is 1).

verbose : bool, optional

Option for verbose progress bar output (default is False).

Returns:
pandas.DataFrame, tuple of pandas.DataFrame

Returns a single DataFrame if test_size is 0; otherwise returns a 2-tuple of training and testing DataFrames.
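
Examples

A minimal usage sketch of the train/test split behavior:

>>> from comptools.io import load_sim
>>> df_train, df_test = load_sim(config='IC86.2012', test_size=0.5)
>>> df_sim = load_sim(config='IC86.2012', test_size=0)  # single DataFrame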

comptools.io.load_tank_charges(config='IC79.2010', datatype='sim', return_dask=False)[source]
comptools.io.load_trained_model(pipeline_str='BDT', config='IC86.2012', return_metadata=False)[source]

Function to load pre-trained model to avoid re-training

Parameters:
pipeline_str : str, optional

Name of model to load (default is ‘BDT’).

config : str, optional

Detector configuration (default is ‘IC86.2012’).

return_metadata : bool, optional

Option to return metadata associated with saved model (e.g. list of training features used, scikit-learn version, etc) (default is False).

Returns:
pipeline : sklearn.Pipeline

Trained scikit-learn pipeline.

model_dict : dict

Dictionary containing trained model as well as relevant metadata.
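
Examples

A minimal usage sketch (assumes the ‘BDT’ model has already been trained and saved, and that return_metadata=True returns the metadata dictionary described above):

>>> from comptools.io import load_trained_model
>>> pipeline = load_trained_model(pipeline_str='BDT', config='IC86.2012')
>>> model_dict = load_trained_model(pipeline_str='BDT', config='IC86.2012',
...                                 return_metadata=True)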

comptools.io.validate_dataframe(df)[source]
comptools.io.validate_datatype(datatype)[source]

comptools.livetime module

comptools.livetime.get_detector_livetime(config=None, months=None)[source]
comptools.livetime.get_livetime_file()[source]

comptools.model_selection module

comptools.model_selection.cross_validate_comp(df_train, df_test, pipeline_str, param_name, param_values, feature_list=None, target='comp_target_2', scoring='accuracy', num_groups=2, n_splits=10, n_jobs=1, verbose=False)[source]

Calculates stratified k-fold CV scores for each value of a given hyperparameter

Similar to sklearn.model_selection.cross_validate, but returns results for individual composition groups as well as the combined CV result.

Parameters:
df_train : pandas.DataFrame

Training DataFrame (see comptools.load_sim()).

df_test : pandas.DataFrame

Testing DataFrame (see comptools.load_sim()).

pipeline_str : str

Name of pipeline to use (e.g. ‘BDT’, ‘RF_energy’, etc.).

param_name : str

Name of hyperparameter (e.g. ‘max_depth’, ‘learning_rate’, etc.).

param_values : array-like

Values to set hyperparameter to.

feature_list : list, optional

List of training feature columns to use (default is to use comptools.get_training_features()).

target : str, optional

Training target to use (default is ‘comp_target_2’).

scoring : str, optional

Scoring metric to calculate for each CV fold (default is ‘accuracy’).

num_groups : int, optional

Number of composition class groups to use (default is 2).

n_splits : int, optional

Number of folds to use in (KFold) cross-validation (default is 10).

n_jobs : int, optional

Number of jobs to run in parallel (default is 1).

verbose : bool, optional

Option to print a progress bar (default is False).

Returns:
df_cv : pandas.DataFrame

Returns a DataFrame with average scores as well as CV errors on those scores for each composition.
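
Examples

A minimal usage sketch (assumes df_train and df_test come from comptools.load_sim; the hyperparameter values are illustrative):

>>> from comptools.io import load_sim
>>> from comptools.model_selection import cross_validate_comp
>>> df_train, df_test = load_sim(config='IC86.2012')
>>> df_cv = cross_validate_comp(df_train, df_test,
...                             pipeline_str='BDT',
...                             param_name='max_depth',
...                             param_values=[2, 4, 6, 8],
...                             n_splits=10)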

comptools.model_selection.get_CV_frac_correct(df_train, feature_list, target, pipeline_str, num_groups, log_energy_bins, n_splits=10, n_jobs=1)[source]
comptools.model_selection.get_param_grid(pipeline_name=None)[source]

Returns dictionary with hyperparameter values to search

Parameters:
pipeline_name : str, optional

Pipeline name. Should be formatted as <name>_comp_<config>_<num_groups>-groups. For example, pipeline_name=BDT_comp_IC86.2012_2-groups (default is None).

Returns:
param_grid : dict

Dictionary with hyperparameter names / values to be passed to GridSearchCV.

comptools.model_selection.gridsearch_optimize(pipeline, param_grid, X_train, y_train, scoring='accuracy', n_jobs=1, return_gridsearch=False)[source]

Runs a grid search to optimize hyperparameters

Parameters:
pipeline : sklearn.pipeline.Pipeline

Pipeline to fit.

param_grid : dict

Dictionary with hyperparameter names / values to be passed to GridSearchCV.

X_train : array_like

Training features.

y_train : array_like

Training labels.

scoring : str

Scoring metric to use (default is ‘accuracy’).

n_jobs : int, optional

Number of jobs to run in parallel (default is 1).

return_gridsearch : bool, optional

Whether to return the fitted GridSearchCV object, or the best_estimator_ object (default is False, so will return the best_estimator_).

Returns:
best_pipeline : sklearn.pipeline.Pipeline

Pipeline with optimal hyperparameter values that has been trained on the entire training dataset (X_train, y_train).

gridsearch : sklearn.model_selection.GridSearchCV

Fitted GridSearchCV object.
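
Examples

A minimal end-to-end sketch (assumes dataframe_to_X_y returns an (X, y) pair and that the pipeline and parameter-grid names below exist for your configuration):

>>> from comptools.base import get_training_features
>>> from comptools.io import load_sim, dataframe_to_X_y
>>> from comptools.model_selection import get_param_grid, gridsearch_optimize
>>> from comptools.pipelines import get_pipeline
>>> df_train, df_test = load_sim(config='IC86.2012')
>>> X_train, y_train = dataframe_to_X_y(df_train, get_training_features())
>>> pipeline = get_pipeline('BDT')
>>> param_grid = get_param_grid('BDT_comp_IC86.2012_2-groups')
>>> best_pipeline = gridsearch_optimize(pipeline, param_grid, X_train, y_train)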

comptools.pipelines module

class comptools.pipelines.CustomClassifier(p=0.8, neighbor_weight=2.0, num_groups=4, random_state=2)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

fit(X, y)[source]
predict(y)[source]

Performs random composition classification

class comptools.pipelines.LineCutClassifier(demo_param='demo')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

fit(X, y)[source]
predict(X)[source]
comptools.pipelines.get_pipeline(classifier_name='BDT')[source]

Function to get classifier pipeline.

comptools.pipelines.line(x, x1, y1, x2, y2)[source]

comptools.plotting module

comptools.plotting.colorbar(mappable, label=None)[source]
comptools.plotting.get_color(composition)[source]
comptools.plotting.get_color_dict()[source]
comptools.plotting.get_colormap(composition)[source]
comptools.plotting.histogram_2D(x, y, bins, weights=None, log_counts=False, make_prob=False, colorbar=True, logx=False, logy=False, vmin=None, vmax=None, cmap='viridis', ax=None, **opts)[source]
comptools.plotting.make_comp_frac_histogram(x, y, proton_mask, iron_mask, bins, ax)[source]
comptools.plotting.plot_steps(edges, y, yerr=None, color=None, fillcolor=None, lw=1, ls='-', alpha=1.0, fillalpha=0.2, label=None, ax=None)[source]
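
Examples

A minimal usage sketch for plot_steps (the histogram values are made up):

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from comptools.plotting import plot_steps
>>> edges = np.linspace(6.0, 8.0, 11)
>>> y = np.random.uniform(10, 100, size=len(edges) - 1)
>>> fig, ax = plt.subplots()
>>> plot_steps(edges, y, yerr=np.sqrt(y), color='C0', label='data', ax=ax)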

comptools.serialize module

comptools.serialize.deserialize_SFS(infile)[source]
comptools.serialize.serialize_SFS(sfs, outfile)[source]

comptools.simfunctions module

comptools.simfunctions.comp2mass(composition)[source]
comptools.simfunctions.config_to_sim(config)[source]
comptools.simfunctions.filter_mask(config)[source]
comptools.simfunctions.get_level3_sim_files_iterator(sim_list)[source]

Function to return an iterable of simulation files

Parameters:
sim_list : int, array-like

Simulation(s) sets to get i3 files for (e.g. 12360 or [12360, 12362, 12630, 12631]).

Returns:
file_iter : itertools.chain

Iterable of simulation i3 files.
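
Examples

A minimal usage sketch:

>>> from comptools.simfunctions import get_level3_sim_files_iterator
>>> file_iter = get_level3_sim_files_iterator([12360, 12362, 12630, 12631])
>>> for i3_file in file_iter:
...     pass  # process each simulation i3 file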

comptools.simfunctions.get_sim_configs()[source]
comptools.simfunctions.get_sim_dict()[source]
comptools.simfunctions.it_stream(config)[source]
comptools.simfunctions.level3_sim_GCD_file(sim)[source]
comptools.simfunctions.level3_sim_file_batches(sim, size, max_batches=None)[source]

Generates level3 simulation file paths in batches

Parameters:
sim : int

Simulation dataset (e.g. 7006, 7241)

size : int

Number of files in each batch

max_batches : int, optional

Option to only yield max_batches number of file batches (default is to yield all batches)

Returns:
generator

Generator that yields batches of simulation files

Examples

Basic usage:

>>> from comptools.simfunctions import level3_sim_file_batches
>>> list(level3_sim_file_batches(7241, size=3, max_batches=2))
[('/data/ana/CosmicRay/IceTop_level3/sim/IC79/7241/Level3_IC79_7241_Run005347.i3.gz',
  '/data/ana/CosmicRay/IceTop_level3/sim/IC79/7241/Level3_IC79_7241_Run005393.i3.gz',
  '/data/ana/CosmicRay/IceTop_level3/sim/IC79/7241/Level3_IC79_7241_Run009678.i3.gz'),
 ('/data/ana/CosmicRay/IceTop_level3/sim/IC79/7241/Level3_IC79_7241_Run001015.i3.gz',
  '/data/ana/CosmicRay/IceTop_level3/sim/IC79/7241/Level3_IC79_7241_Run002597.i3.gz',
  '/data/ana/CosmicRay/IceTop_level3/sim/IC79/7241/Level3_IC79_7241_Run007939.i3.gz')]
comptools.simfunctions.level3_sim_files(sim, files_per_batch=None, max_batches=None)[source]
comptools.simfunctions.null_stream(config)[source]
comptools.simfunctions.reco_pulses()[source]
comptools.simfunctions.run_to_energy_bin(run, sim)[source]

Gives the CORSIKA energy bin for a given simulation run

Parameters:
run : int

Run number for a simulation set.

sim : int

Simulation dataset the run belongs to (e.g. 7006, 7241).

Returns:
energy_bin : float

Corresponding CORSIKA energy bin for run.

comptools.simfunctions.sim_to_comp(*args, **kwargs)[source]
comptools.simfunctions.sim_to_config(sim)[source]
comptools.simfunctions.sim_to_energy_bins(sim)[source]
comptools.simfunctions.sim_to_thinned(sim)[source]

comptools.spectrumfunctions module

comptools.spectrumfunctions.broken_power_law_flux(energy, gamma_before=-2.7, gamma_after=-3.1, energy_break=3000000.0)[source]

Broken power law flux

This is a “realistic” flux model (a simple broken power law with a knee at 3 PeV) to weight simulation to. More information can be found on the IT73-IC79 data-MC comparison wiki page https://wiki.icecube.wisc.edu/index.php/IT73-IC79_Data-MC_Comparison

Parameters:
energy : array_like

Energy values (in GeV) to calculate the flux for.

gamma_before : float, optional

Spectral index before break point (default is -2.7).

gamma_after : float, optional

Spectral index after break point (default is -3.1).

energy_break : float, optional

Energy (in GeV) at which the spectral index break occurs (default is 3e6, or 3 PeV).

Returns:
flux : array_like

Broken power law evaluated at energy points.
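
Examples

A minimal usage sketch:

>>> import numpy as np
>>> from comptools.spectrumfunctions import broken_power_law_flux
>>> energy = np.logspace(6.0, 8.0, 50)  # GeV
>>> flux = broken_power_law_flux(energy,
...                              gamma_before=-2.7,
...                              gamma_after=-3.1,
...                              energy_break=3e6)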

comptools.spectrumfunctions.counts_to_flux(counts, counts_err=None, energybins=array([ 1258925.41179417, 1584893.19246111, 1995262.31496887, 2511886.43150957, 3162277.66016837, 3981071.70553495, 5011872.33627269, 6309573.44480189, 7943282.34724276, 9999999.99999992, 12589254.11794156, 15848931.92461098, 19952623.14968858, 25118864.31509551, 31622776.6016834, 39810717.0553492, 50118723.36272653, 63095734.44801839, 79432823.47242692, 99999999.99999836]), eff_area=156390.673059, eff_area_err=None, livetime=27114012.0, livetime_err=1, solid_angle=1.0, scalingindex=None)
comptools.spectrumfunctions.get_flux(counts, counts_err=None, energybins=array([ 1258925.41179417, 1584893.19246111, 1995262.31496887, 2511886.43150957, 3162277.66016837, 3981071.70553495, 5011872.33627269, 6309573.44480189, 7943282.34724276, 9999999.99999992, 12589254.11794156, 15848931.92461098, 19952623.14968858, 25118864.31509551, 31622776.6016834, 39810717.0553492, 50118723.36272653, 63095734.44801839, 79432823.47242692, 99999999.99999836]), eff_area=156390.673059, eff_area_err=None, livetime=27114012.0, livetime_err=1, solid_angle=1.0, scalingindex=None)[source]
comptools.spectrumfunctions.model_flux(*args, **kwargs)[source]

comptools.unfolding module

comptools.unfolding.column_normalize(res, res_err, efficiencies, efficiencies_err)[source]
comptools.unfolding.response_hist(true_energy, reco_energy, true_target, pred_target, energy_bins=None)[source]

Computes energy-composition response matrix

Parameters:
true_energy : array_like

Array of true (MC) energies.

reco_energy : array_like

Array of reconstructed energies.

true_target : array_like

Array of true compositions that are encoded to numerical values.

pred_target : array_like

Array of predicted compositions that are encoded to numerical values.

energy_bins : array_like, optional

Energy bins to be used for constructing response matrix (default is to use energy bins from comptools.get_energybins() function).

Returns:
res : numpy.ndarray

Response matrix.

res_err : numpy.ndarray

Uncertainty of the response matrix.

comptools.unfolding.response_matrix(true_energy, reco_energy, true_target, pred_target, efficiencies, efficiencies_err, energy_bins=None)[source]

Computes normalized energy-composition response matrix

Parameters:
true_energy : array_like

Array of true (MC) energies.

reco_energy : array_like

Array of reconstructed energies.

true_target : array_like

Array of true compositions that are encoded to numerical values.

pred_target : array_like

Array of predicted compositions that are encoded to numerical values.

efficiencies : array_like

Detection efficiencies (should be in a PyUnfold-compatible form).

efficiencies_err : array_like

Uncertainties on the detection efficiencies (should be in a PyUnfold-compatible form).

energy_bins : array_like, optional

Energy bins to be used for constructing response matrix (default is to use energy bins from comptools.get_energybins() function).

Returns:
res_normalized : numpy.ndarray

Normalized response matrix.

res_normalized_err : numpy.ndarray

Uncertainty of the normalized response matrix.
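
Examples

A minimal sketch with synthetic inputs (the toy data, the energy bins, and the shape of the efficiencies arrays — composition groups times energy bins — are all assumptions made for illustration):

>>> import numpy as np
>>> from comptools.unfolding import response_matrix
>>> rng = np.random.RandomState(2)
>>> true_energy = 10**rng.uniform(6.0, 8.0, size=1000)
>>> reco_energy = true_energy * rng.normal(1.0, 0.1, size=1000)
>>> true_target = rng.randint(2, size=1000)  # encoded compositions (2 groups)
>>> pred_target = rng.randint(2, size=1000)
>>> energy_bins = np.logspace(6.0, 8.0, 21)
>>> n_bins = 2 * (len(energy_bins) - 1)
>>> efficiencies = np.ones(n_bins)
>>> efficiencies_err = 0.01 * np.ones(n_bins)
>>> res_norm, res_norm_err = response_matrix(true_energy, reco_energy,
...                                          true_target, pred_target,
...                                          efficiencies, efficiencies_err,
...                                          energy_bins=energy_bins)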

comptools.unfolding.unfolded_counts_dist(unfolding_df, iteration=-1, num_groups=4)[source]

Converts the PyUnfold counts arrays in an unfolded-distributions DataFrame into dictionaries containing a counts array for each composition.

Parameters:
unfolding_df : pandas.DataFrame

Unfolding DataFrame returned from PyUnfold.

iteration : int, optional

Specific unfolding iteration to retrieve unfolded counts (default is -1, the last iteration).

num_groups : int, optional

Number of composition groups (default is 4).

Returns:
counts : dict

Dictionary with composition-counts key-value pairs.

counts_sys_err : dict

Dictionary with composition-systematic error key-value pairs.

counts_stat_err : dict

Dictionary with composition-statistical error key-value pairs.
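
Examples

A minimal usage sketch (assumes unfolding_df is the iterations DataFrame returned from PyUnfold, as described under Parameters above):

>>> from comptools.unfolding import unfolded_counts_dist
>>> counts, counts_sys_err, counts_stat_err = unfolded_counts_dist(
...     unfolding_df, iteration=-1, num_groups=2)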