comptools API
comptools.LDFfunctions module
- comptools.LDFfunctions.DLP(dists, log_s125, beta)
Double Logarithmic Parabola (DLP) function to parameterize air showers.
For a reference, see IceCube internal report 200702001, ‘A Lateral Distribution Function and Fluctuation Parametrisation for IceTop’ by Stefan Klepser.
Parameters:
- dists : float, array-like
Tank distance(s), in meters, from the shower core.
- log_s125 : float
Base-10 logarithm of the signal deposited 125 m away from the shower core.
- beta : float
Slope of the DLP function. Related to the air shower age.
Returns:
- float, numpy.array
Base-10 logarithm of the signal deposited at distance(s) dists from the shower core.
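The DLP is a parabola in log10(r / 125 m). A minimal sketch of this parameterization, assuming the standard IceTop curvature constant kappa ≈ 0.30264 (the constant used in the actual implementation may differ):

import numpy as np

def dlp_sketch(dists, log_s125, beta, kappa=0.30264):
    # log10(S(r)) = log10(S125) - beta * log10(r/125) - kappa * log10(r/125)**2
    x = np.log10(np.asarray(dists, dtype=float) / 125.0)
    return log_s125 - beta * x - kappa * x ** 2

At r = 125 m both distance-dependent terms vanish, so the function returns log_s125 by construction.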
comptools.base module
- exception comptools.base.ComputingEnvironemtError
Bases: exceptions.Exception
Custom exception that should be raised when a problem related to the computing environment is found.
- comptools.base.check_output_dir(outfile, makedirs=True)
Function to check whether the directory for an output file exists.
This function checks whether the output directory containing the specified outfile exists. If the output directory doesn’t exist, there is an option to create it; otherwise, an IOError is raised.
Parameters:
- outfile : str
Path to output file.
- makedirs : bool, optional
Option to create the output directory containing the output file if it doesn’t already exist (default is True).
Returns:
- None
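A usage sketch (the output path below is hypothetical):

>>> from comptools.base import check_output_dir
>>> check_output_dir('/tmp/analysis/output/results.hdf5', makedirs=True)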
- comptools.base.get_config_paths()
Function to return paths used in this analysis.
Specifically:
- metaproject: Path to the IceCube metaproject being used
- comp_data_dir: Path to where data and simulation are stored
- condor_data_dir: Path to where HTCondor error and output files are stored
- condor_scratch_dir: Path to where HTCondor log and submit files are stored
- figures_dir: Path to where figures are saved
- project_root: Path to where the cr-composition project is located
Returns:
- paths : collections.namedtuple
Namedtuple containing relevant paths (e.g. figures_dir is where figures will be saved, condor_data_dir is where data/simulation will be saved to / loaded from, etc.).
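For example, individual paths can be read off as namedtuple attributes (the printed value is machine-specific):

>>> from comptools.base import get_config_paths
>>> paths = get_config_paths()
>>> print(paths.figures_dir)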
- comptools.base.partition(seq, size, max_batches=None)
Generates partitions of length size from the iterable seq.
Parameters:
- seq : iterable
Iterable object to be partitioned.
- size : int
Number of items to have in each partition.
- max_batches : int, optional
Limit the number of partitions to yield (default is to yield all partitions).
Yields:
- batch : list
Partition of seq that is (at most) size items long.
Examples
>>> from comptools import partition
>>> list(partition(range(10), 3))
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9,)]
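The behavior above can be reproduced with a short generator built on itertools.islice; a sketch, not necessarily the package's actual implementation:

from itertools import islice

def partition_sketch(seq, size, max_batches=None):
    # Repeatedly slice `size` items off a shared iterator until it is
    # exhausted, optionally stopping after `max_batches` partitions.
    iterator = iter(seq)
    n_batches = 0
    while max_batches is None or n_batches < max_batches:
        batch = tuple(islice(iterator, size))
        if not batch:
            return
        yield batch
        n_batches += 1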
comptools.composition_encoding module
comptools.data_functions module
comptools.datafunctions module
- comptools.datafunctions.level3_data_file_batches(config, run, size, max_batches=None)
Generates level3 data file paths in batches.
Parameters:
- config : str
Detector configuration
- run : str
Number of the data-taking run.
- size : int
Number of files in each batch
- max_batches : int, optional
Option to only yield max_batches number of file batches (default is to yield all batches).
Returns:
- generator
Generator that yields batches of data files
Examples
Basic usage:
>>> from comptools.datafunctions import level3_data_file_batches
>>> list(level3_data_file_batches(config='IC86.2012', run='00122174', size=3, max_batches=2))
[('/data/ana/CosmicRay/IceTop_level3/exp/IC86.2012/2013/0413/Run00122174/Level3_IC86.2012_data_Run00122174_Subrun00000050.i3.bz2',
  '/data/ana/CosmicRay/IceTop_level3/exp/IC86.2012/2013/0413/Run00122174/Level3_IC86.2012_data_Run00122174_Subrun00000110.i3.bz2',
  '/data/ana/CosmicRay/IceTop_level3/exp/IC86.2012/2013/0413/Run00122174/Level3_IC86.2012_data_Run00122174_Subrun00000000.i3.bz2'),
 ('/data/ana/CosmicRay/IceTop_level3/exp/IC86.2012/2013/0413/Run00122174/Level3_IC86.2012_data_Run00122174_Subrun00000330.i3.bz2',
  '/data/ana/CosmicRay/IceTop_level3/exp/IC86.2012/2013/0413/Run00122174/Level3_IC86.2012_data_Run00122174_Subrun00000150.i3.bz2',
  '/data/ana/CosmicRay/IceTop_level3/exp/IC86.2012/2013/0413/Run00122174/Level3_IC86.2012_data_Run00122174_Subrun00000240.i3.bz2')]
comptools.effective_area module
- comptools.effective_area.calculate_effective_area_vs_energy(*args, **kwargs)
Calculates effective area vs. energy from simulation.
Parameters:
- df_sim : pandas.DataFrame
Simulation DataFrame returned from comptools.load_sim.
- energy_bins : array-like
Energy bins (in GeV) that will be used for calculation.
- verbose : bool, optional
Option for verbose output (default is True).
Returns:
- eff_area : numpy.ndarray
Effective area for each bin in energy_bins.
- eff_area_error : numpy.ndarray
Statistical uncertainty on the effective area for each bin in energy_bins.
- energy_midpoints : numpy.ndarray
Midpoints of energy_bins. Useful for plotting effective area versus energy.
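A usage sketch, assuming a simulation DataFrame from comptools.load_sim and 20 log-spaced energy bins (the bin choice here is illustrative):

import numpy as np

from comptools import load_sim
from comptools.effective_area import calculate_effective_area_vs_energy

df_sim, _ = load_sim(config='IC86.2012', test_size=0.5)
energy_bins = np.logspace(6.0, 8.0, 21)  # GeV
eff_area, eff_area_err, energy_midpoints = calculate_effective_area_vs_energy(
    df_sim=df_sim, energy_bins=energy_bins, verbose=False)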
- comptools.effective_area.get_effective_area_fit(config='IC86.2012', fit_func=<function sigmoid_slant>, energy_points=None)
Calculates effective area from simulation.
Parameters:
- config : str, optional
Detector configuration (default is IC86.2012).
- fit_func : callable, optional
Function used to fit the effective area (default is sigmoid_slant).
- energy_points : array_like, optional
Energy values at which to evaluate the fit (default is None).
Returns:
- avg : float
Effective area.
- avg_err : float
Statistical error on effective area.
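A usage sketch following the Returns description above (the actual return may depend on energy_points):

>>> from comptools.effective_area import get_effective_area_fit
>>> avg, avg_err = get_effective_area_fit(config='IC86.2012')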
comptools.io module
- comptools.io.load_data(df_file=None, config='IC86.2012', energy_reco=True, energy_cut_key='reco_log_energy', log_energy_min=6.0, log_energy_max=8.0, columns=None, n_jobs=1, verbose=False, compute=True, processed=True)
Function to load processed data DataFrame.
Parameters:
- df_file : path, optional
If specified, the given path to a pandas.DataFrame will be loaded (default is None, so the file path will be determined from the datatype and config).
- config : str, optional
Detector configuration (default is ‘IC86.2012’).
- energy_reco : bool, optional
Option to perform energy reconstruction for each event (default is True).
- energy_cut_key : str, optional
Energy key to apply energy range cuts to (default is ‘reco_log_energy’).
- log_energy_min : int, float, optional
Option to set a lower limit on the reconstructed log energy in GeV (default is 6.0).
- log_energy_max : int, float, optional
Option to set an upper limit on the reconstructed log energy in GeV (default is 8.0).
- columns : array_like, optional
Option to specify the columns that should be in the returned DataFrame(s) (default is None, all columns are returned).
- n_jobs : int, optional
Number of chunks to load in parallel (default is 1).
- verbose : bool, optional
Option for verbose progress bar output (default is False).
- processed : bool, optional
Whether to load processed (quality + energy cuts applied) or pre-processed data (default is True).
Returns:
- pandas.DataFrame
Return a DataFrame with processed data.
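A usage sketch (the column names here are hypothetical):

>>> from comptools.io import load_data
>>> df_data = load_data(config='IC86.2012',
...                     columns=['lap_cos_zenith', 'log_s125'],
...                     n_jobs=4)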
- comptools.io.load_sim(df_file=None, config='IC86.2012', test_size=0.5, energy_reco=True, energy_cut_key='reco_log_energy', log_energy_min=6.0, log_energy_max=8.0, columns=None, n_jobs=1, verbose=False, compute=True)
Function to load processed simulation DataFrame.
Parameters:
- df_file : path, optional
If specified, the given path to a pandas.DataFrame will be loaded (default is None, so the file path will be determined from the datatype and config).
- config : str, optional
Detector configuration (default is ‘IC86.2012’).
- test_size : int, float, optional
Fraction or number of events to be split off into a separate testing set (default is 0.5). test_size will be passed to sklearn.model_selection.ShuffleSplit.
- energy_reco : bool, optional
Option to perform energy reconstruction for each event (default is True).
- energy_cut_key : str, optional
Energy key to apply energy range cuts to (default is ‘reco_log_energy’).
- log_energy_min : int, float, optional
Option to set a lower limit on the reconstructed log energy in GeV (default is 6.0).
- log_energy_max : int, float, optional
Option to set an upper limit on the reconstructed log energy in GeV (default is 8.0).
- columns : array_like, optional
Option to specify the columns that should be in the returned DataFrame(s) (default is None, all columns are returned).
- n_jobs : int, optional
Number of chunks to load in parallel (default is 1).
- verbose : bool, optional
Option for verbose progress bar output (default is False).
Returns:
- pandas.DataFrame, tuple of pandas.DataFrame
Return a single DataFrame if test_size is 0, otherwise return a 2-tuple of training and testing DataFrames.
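For example, with a nonzero test_size the call returns a train/test pair:

>>> from comptools.io import load_sim
>>> df_train, df_test = load_sim(config='IC86.2012', test_size=0.5)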
- comptools.io.load_trained_model(pipeline_str='BDT', config='IC86.2012', return_metadata=False)
Function to load a pre-trained model to avoid re-training.
Parameters:
- pipeline_str : str, optional
Name of model to load (default is ‘BDT’).
- config : str, optional
Detector configuration (default is ‘IC86.2012’).
- return_metadata : bool, optional
Option to return metadata associated with the saved model (e.g. list of training features used, scikit-learn version, etc.) (default is False).
Returns:
- pipeline : sklearn.pipeline.Pipeline
Trained scikit-learn pipeline.
- model_dict : dict
Dictionary containing trained model as well as relevant metadata.
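A usage sketch:

>>> from comptools.io import load_trained_model
>>> pipeline = load_trained_model(pipeline_str='BDT', config='IC86.2012')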
comptools.livetime module
comptools.model_selection module
- comptools.model_selection.cross_validate_comp(df_train, df_test, pipeline_str, param_name, param_values, feature_list=None, target='comp_target_2', scoring='accuracy', num_groups=2, n_splits=10, n_jobs=1, verbose=False)
Calculates stratified k-fold CV scores for a given hyperparameter value.
Similar to sklearn.model_selection.cross_validate, but returns results for individual composition groups as well as the combined CV result.
Parameters:
- df_train : pandas.DataFrame
Training DataFrame (see comptools.load_sim()).
- df_test : pandas.DataFrame
Testing DataFrame (see comptools.load_sim()).
- pipeline_str : str
Name of pipeline to use (e.g. ‘BDT’, ‘RF_energy’, etc.).
- param_name : str
Name of hyperparameter (e.g. ‘max_depth’, ‘learning_rate’, etc.).
- param_values : array-like
Values to set hyperparameter to.
- feature_list : list, optional
List of training feature columns to use (default is to use comptools.get_training_features()).
- target : str, optional
Training target to use (default is ‘comp_target_2’).
- scoring : str, optional
Scoring metric to calculate for each CV fold (default is ‘accuracy’).
- num_groups : int, optional
Number of composition class groups to use (default is 2).
- n_splits : int, optional
Number of folds to use in (KFold) cross-validation (default is 10).
- n_jobs : int, optional
Number of jobs to run in parallel (default is 1).
- verbose : bool, optional
Option to print a progress bar (default is False).
Returns:
- df_cv : pandas.DataFrame
Returns a DataFrame with average scores as well as CV errors on those scores for each composition.
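A usage sketch scanning max_depth values for the BDT pipeline (the hyperparameter name follows the examples in the Parameters list above):

>>> from comptools.io import load_sim
>>> from comptools.model_selection import cross_validate_comp
>>> df_train, df_test = load_sim(config='IC86.2012', test_size=0.5)
>>> df_cv = cross_validate_comp(df_train, df_test,
...                             pipeline_str='BDT',
...                             param_name='max_depth',
...                             param_values=[2, 3, 4, 5],
...                             n_splits=10,
...                             n_jobs=4)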
- comptools.model_selection.get_CV_frac_correct(df_train, feature_list, target, pipeline_str, num_groups, log_energy_bins, n_splits=10, n_jobs=1)
- comptools.model_selection.get_param_grid(pipeline_name=None)
Returns dictionary with hyperparameter values to search.
Parameters:
- pipeline_name : str, optional
Pipeline name. Should be formatted as <name>_comp_<config>_<num_groups>-groups. For example, pipeline_name=BDT_comp_IC86.2012_2-groups (default is None).
Returns:
- param_grid : dict
Dictionary with hyperparameter names / values to be passed to GridSearchCV.
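For example, using the naming convention described above:

>>> from comptools.model_selection import get_param_grid
>>> param_grid = get_param_grid(pipeline_name='BDT_comp_IC86.2012_2-groups')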
- comptools.model_selection.gridsearch_optimize(pipeline, param_grid, X_train, y_train, scoring='accuracy', n_jobs=1, return_gridsearch=False)
Runs a grid search to optimize hyperparameters.
Parameters:
- pipeline : sklearn.pipeline.Pipeline
Pipeline to fit.
- param_grid : dict
Dictionary with hyperparameter names / values to be passed to GridSearchCV.
- X_train : array_like
Training features.
- y_train : array_like
Training labels.
- scoring : str, optional
Scoring metric to use (default is ‘accuracy’).
- n_jobs : int, optional
Number of jobs to run in parallel (default is 1).
- return_gridsearch : bool, optional
Whether to return the fitted GridSearchCV object, or the best_estimator_ object (default is False, so will return the best_estimator_).
Returns:
- best_pipeline : sklearn.pipeline.Pipeline
Pipeline with optimal hyperparameter values that has been trained on the entire training dataset (X_train, y_train).
- gridsearch : sklearn.model_selection.GridSearchCV
Fitted GridSearchCV object.
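A self-contained sketch with a toy pipeline and random data, just to illustrate the call signature (the real analysis would use a comptools pipeline and training DataFrames):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from comptools.model_selection import gridsearch_optimize

# Toy pipeline and data to illustrate the call
pipeline = make_pipeline(StandardScaler(), GradientBoostingClassifier())
X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, size=100)

param_grid = {'gradientboostingclassifier__max_depth': [2, 3, 4]}
best_pipeline = gridsearch_optimize(pipeline, param_grid, X_train, y_train,
                                    scoring='accuracy', n_jobs=1)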
comptools.pipelines module
- class comptools.pipelines.CustomClassifier(p=0.8, neighbor_weight=2.0, num_groups=4, random_state=2)
Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
- class comptools.pipelines.LineCutClassifier(demo_param='demo')
Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
comptools.plotting module
comptools.serialize module
comptools.simfunctions module
- comptools.simfunctions.get_level3_sim_files_iterator(sim_list)
Function to return an iterable of simulation files.
Parameters:
- sim_list : int, array-like
Simulation set(s) to get i3 files for (e.g. 12360 or [12360, 12362, 12630, 12631]).
Returns:
- file_iter : itertools.chain
Iterable of simulation i3 files.
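A usage sketch, iterating over the files for two of the simulation sets named above:

>>> from comptools.simfunctions import get_level3_sim_files_iterator
>>> for i3_file in get_level3_sim_files_iterator([12360, 12362]):
...     print(i3_file)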
- comptools.simfunctions.level3_sim_file_batches(sim, size, max_batches=None)
Generates level3 simulation file paths in batches.
Parameters:
- sim : int
Simulation dataset (e.g. 7006, 7241).
- size : int
Number of files in each batch
- max_batches : int, optional
Option to only yield max_batches number of file batches (default is to yield all batches).
Returns:
- generator
Generator that yields batches of simulation files
Examples
Basic usage:
>>> from comptools.simfunctions import level3_sim_file_batches
>>> list(level3_sim_file_batches(7241, size=3, max_batches=2))
[('/data/ana/CosmicRay/IceTop_level3/sim/IC79/7241/Level3_IC79_7241_Run005347.i3.gz',
  '/data/ana/CosmicRay/IceTop_level3/sim/IC79/7241/Level3_IC79_7241_Run005393.i3.gz',
  '/data/ana/CosmicRay/IceTop_level3/sim/IC79/7241/Level3_IC79_7241_Run009678.i3.gz'),
 ('/data/ana/CosmicRay/IceTop_level3/sim/IC79/7241/Level3_IC79_7241_Run001015.i3.gz',
  '/data/ana/CosmicRay/IceTop_level3/sim/IC79/7241/Level3_IC79_7241_Run002597.i3.gz',
  '/data/ana/CosmicRay/IceTop_level3/sim/IC79/7241/Level3_IC79_7241_Run007939.i3.gz')]
comptools.spectrumfunctions module
- comptools.spectrumfunctions.broken_power_law_flux(energy, gamma_before=-2.7, gamma_after=-3.1, energy_break=3000000.0)
Broken power law flux.
This is a “realistic” flux (a simple broken power law with a knee at 3 PeV) to weight simulation to. More information can be found on the IT73-IC79 data-MC comparison wiki page: https://wiki.icecube.wisc.edu/index.php/IT73-IC79_Data-MC_Comparison
Parameters:
- energy : array_like
Energy values (in GeV) to calculate the flux for.
- gamma_before : float, optional
Spectral index before break point (default is -2.7).
- gamma_after : float, optional
Spectral index after break point (default is -3.1).
- energy_break : float, optional
Energy (in GeV) at which the spectral index break occurs (default is 3e6, or 3 PeV).
Returns:
- flux : array_like
Broken power law evaluated at energy points.
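Continuity at the break fixes the relative normalization of the two power-law segments. A minimal sketch of that piecewise form, with the overall normalization omitted (the actual function's normalization may differ):

import numpy as np

def broken_power_law_sketch(energy, gamma_before=-2.7, gamma_after=-3.1,
                            energy_break=3e6):
    # flux ~ E**gamma_before below the break; above it, the second segment
    # is rescaled so the two pieces agree at E = energy_break
    energy = np.asarray(energy, dtype=float)
    below = energy ** gamma_before
    above = energy_break ** (gamma_before - gamma_after) * energy ** gamma_after
    return np.where(energy < energy_break, below, above)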
- comptools.spectrumfunctions.counts_to_flux(counts, counts_err=None, energybins=<20 log-spaced bin edges from 10**6.1 to 10**8 GeV>, eff_area=156390.673059, eff_area_err=None, livetime=27114012.0, livetime_err=1, solid_angle=1.0, scalingindex=None)
- comptools.spectrumfunctions.get_flux(counts, counts_err=None, energybins=<20 log-spaced bin edges from 10**6.1 to 10**8 GeV>, eff_area=156390.673059, eff_area_err=None, livetime=27114012.0, livetime_err=1, solid_angle=1.0, scalingindex=None)
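Neither function is documented above, but the shared signature suggests the standard counts-to-differential-flux conversion: divide counts by energy bin width, effective area, livetime, and solid angle. A sketch under that assumption (not the package's actual implementation):

import numpy as np

def counts_to_flux_sketch(counts, energybins, eff_area=156390.673059,
                          livetime=27114012.0, solid_angle=1.0):
    # Differential flux: dN / (dE * A_eff * T * dOmega)
    counts = np.asarray(counts, dtype=float)
    bin_widths = np.diff(np.asarray(energybins, dtype=float))
    exposure = bin_widths * eff_area * livetime * solid_angle
    flux = counts / exposure
    # Poisson uncertainty on the counts, propagated to the flux
    flux_err = np.sqrt(counts) / exposure
    return flux, flux_err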
comptools.unfolding module
- comptools.unfolding.response_hist(true_energy, reco_energy, true_target, pred_target, energy_bins=None)
Computes energy-composition response matrix.
Parameters:
- true_energy : array_like
Array of true (MC) energies.
- reco_energy : array_like
Array of reconstructed energies.
- true_target : array_like
Array of true compositions that are encoded to numerical values.
- pred_target : array_like
Array of predicted compositions that are encoded to numerical values.
- energy_bins : array_like, optional
Energy bins to be used for constructing response matrix (default is to use energy bins from comptools.get_energybins() function).
Returns:
- res : numpy.ndarray
Response matrix.
- res_err : numpy.ndarray
Uncertainty of the response matrix.
- comptools.unfolding.response_matrix(true_energy, reco_energy, true_target, pred_target, efficiencies, efficiencies_err, energy_bins=None)
Computes normalized energy-composition response matrix.
Parameters:
- true_energy : array_like
Array of true (MC) energies.
- reco_energy : array_like
Array of reconstructed energies.
- true_target : array_like
Array of true compositions that are encoded to numerical values.
- pred_target : array_like
Array of predicted compositions that are encoded to numerical values.
- efficiencies : array_like
Detection efficiencies (should be in a PyUnfold-compatible form).
- efficiencies_err : array_like
Detection efficiency uncertainties (should be in a PyUnfold-compatible form).
- energy_bins : array_like, optional
Energy bins to be used for constructing response matrix (default is to use energy bins from comptools.get_energybins() function).
Returns:
- res_normalized : numpy.ndarray
Normalized response matrix.
- res_normalized_err : numpy.ndarray
Uncertainty of the normalized response matrix.
- comptools.unfolding.unfolded_counts_dist(unfolding_df, iteration=-1, num_groups=4)
Converts an unfolded distributions DataFrame from PyUnfold into a dictionary containing a counts array for each composition.
Parameters:
- unfolding_df : pandas.DataFrame
Unfolding DataFrame returned from PyUnfold.
- iteration : int, optional
Specific unfolding iteration from which to retrieve unfolded counts (default is -1, the last iteration).
- num_groups : int, optional
Number of composition groups (default is 4).
Returns:
- counts : dict
Dictionary with composition-counts key-value pairs.
- counts_sys_err : dict
Dictionary with composition-systematic error key-value pairs.
- counts_stat_err : dict
Dictionary with composition-statistical error key-value pairs.
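A usage sketch, assuming unfolding_df is a DataFrame returned from PyUnfold:

>>> from comptools.unfolding import unfolded_counts_dist
>>> counts, counts_sys_err, counts_stat_err = unfolded_counts_dist(
...     unfolding_df, iteration=-1, num_groups=4)
>>> for composition in counts:
...     print(composition, counts[composition])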