REST-for-Physics  v2.3
Rare Event Searches ToolKit for Physics
Data Structures | Public Member Functions | Protected Member Functions | Private Member Functions | Private Attributes
TRestDataSet Class Reference

Detailed Description

It allows to group a number of runs that satisfy given metadata conditions.

This class allows to make a selection of ROOT data files that fulfill certain metadata conditions allowing to create a group of files that define a particular dataset. The files will be searched in a relative or absolute path that is given together the filePattern parameter.

Basic file selection

We will be able to define the dates range where files will be accepted, using startTime and endTime parameters. The run start time and end time stored inside TRestRun will be evaluated to decide if the file should be considered.

A summary of the basic parameters follows:

Metadata rules

We may add rules for any metadata class existing inside our ROOT datafiles. For such, we use the filter key where we define the metadata member name where we want to evaluate the rules. We need to define the metadata field where we specify the class name or metadata user given name, together with the metadata member we want to access, the metadata member must be named using the coventions defined inside the methods TRestRun::ReplaceMetadataMember and TRestRun::ReplaceMetadataMembers.

Three optional fields can be used to apply the rule:

Example of metadata rule:

<filter metadata="TRestRun::fRunTag" contains="Baby" />

Branches (or observables) selection

Once the files that fulfill the given dates, filename pattern and metadata rules have been identified, the initialization will produce an instance to a ROOT::RDataFrame and an instance to a ROOT::TTree that will give access to the unified analysis tree. The available columns or branches at those instances will be defined by the user inside this metadata class, through the special keywords observables and processObservables.

Their use can be seen in the following example:

<TRestDataSet name="DummySet">
<parameter name="startTime" value = "2022/04/28 00:00" />
<parameter name="endTime" value = "2022/04/28 23:59" />
<parameter name="filePattern" value="test*.root"/>
<filter metadata="TRestRun::fRunTag" contains="Baby" />
// Will add to the final tree only the specific observables
<observables list="g4Ana_totalEdep:hitsAna_energy" />
// Will apply a cut to the observables
<cut name="c1" variable="rawAna_NumberOfGoodSignals" condition=">1" />
<cut name="c1" variable="rawAna_NumberOfGoodSignals" condition="<100" />
// Will add all the observables from the process `rawAna`
<processObservables list="rate:rawAna" />
A class to help on cuts definitions. To be used with TRestAnalysisTree.
Definition: TRestCut.h:31
It allows to group a number of runs that satisfy given metadata conditions.
Definition: TRestDataSet.h:34

Basic usage

The basic usage of this class is by loading the metadata class as any other metadata class. After initialization, the user will get access to the internal RDataFrame and TTree data members, as shown in the following example:

restRoot
[0] TRestDataSet d("dataset");
[1] d.GenerateDataSet();
[2] d.GetTree()->GetEntries()
[3] d.GetDataFrame().GetColumnNames()

We can then use our favorite ROOT::RDataFrame or TTree methods.

Exporting datasets

On top of performing a compilation of runs to construct a dataset and access the data in a unified way, we may also save the generated dataset to disk. This feature might be used to generate easier to handle data compilations that have been extracted from an official data repository.

Different output formats are supported:

Example 1 Generate DataSet from config file:

restRoot
[0] TRestDataSet d("dataset", "dataSetName");
[1] d.GenerateDataSet();
[2] d.Export("mydataset.csv");
[3] d.Export("mydataset.root");

Example 2 Import existing DataSet:

restRoot
[0] TRestDataSet d();
[1] d.Import("myDataSet.root");
[2] d.GetTree()->GetEntries()

Example 3 Automatically importing a dataset using restRoot

restRoot Dataset_FinalBabyIAXO_XMM_mM_P14.root
REST dataset file identified. It contains a valid TRestDataSet.
Importing dataset /nfs/dust/iaxo/user/jgalan//Dataset_FinalBabyIAXO_XMM_mM_P14.root as `dSet`
The dataset is ready. You may now access the dataset using:
- dSet->PrintMetadata()
- dSet->GetDataFrame().GetColumnNames()
- dSet->GetTree()->GetEntries()
- dSet->GetDataFrame().Display("")->Print()
- dSet->GetDataFrame().Display({"colName1,colName2"})->Print()
[0]
void PrintMetadata() override
Prints on screen the information about the metadata members of TRestDataSet.
void Print()
Implementing TObject::Print() method.

Relevant quantities

Sometimes we will be willing that our dataset contains few variables that are extremelly meaningful for the data compilation, and that will be required for further calculations or for the proper interpretation of the data. The key <quantity will allow the user to define relevant quantities that will be stored together with the dataset. These quantitites must be extracted from existing metadata members that are present at the original files. There are different fields allowed inside, such as: name, metadata, strategy and description.

Example:

<quantity name="Nsim" metadata="[TRestProcessRunner::fEventsToProcess]"
strategy="accumulate" description="The total number of simulated events."/>

The name field will be the user given name of the quantity. The metadata field inside the <quantity definition will allow to include a metadata member or a calculation based on a formula where metadata members intervine. The method TRestRun::ReplaceMetadataMembers is the responsible to translate the given metadata formula into a numeric value, check the documentation inside that method to find out the proper format of metadata members inside this field.

There are also different strategies for extracting the quantity value, which are defined by the user using the field strategy, the different options available are:

Adding a new column based on relevant quantities

Using the method TRestDataSet::Define method we can implement a formula based on column names and relevant quantities. Then, the relevant quantities will be sustituted by their dataset value.

dataset.GetColumnNames()
dataset.Define("newColumnName", "QuantityName * column1" )
dataset.GetColumnNames()
dataset.GetDataFrame().Display({"column1", "newColumnName"})->Print();

Adding a new column on-the-fly during the dataset generation

It is also possible to add new column definitions inside the RML so that the new column will be already pre-generated and included in the new dataset when we invoke TRestDataSet::GenerateDataSet.

We can use a valid mathematical expression where we may include any already existing column or observable, and/or any relevant quantity we introduced in our dataset definition.

We may use the addColumn keyword to add new columns as follows:

<addColumn name="NormalFlux" expression="SolarFlux*GeneratorArea/Nsim" />

where SolarFlux,GeneratorArea and Nsim are the given names of the relevant quantities inside the dataset.

Adding cuts

It is also possible to add cuts used to filter the data that will be stored inside the dataset. We can do that including a TRestCut definition inside the TRestDataSet.

For example, the following cut definition would discard entries with unexpected values inside the specified column, process_status.

<cut name="goodData" value="!TMath::IsNaN(process_status)"/>

REST-for-Physics - Software for Rare Event Searches Toolkit

History of developments:

2022-November: First implementation of TRestDataSet Javier Galan

Author
: Javier Galan (javie.nosp@m.r.ga.nosp@m.lan.l.nosp@m.acar.nosp@m.ra@ce.nosp@m.rn.c.nosp@m.h)

Definition at line 34 of file TRestDataSet.h.

#include <TRestDataSet.h>

Inheritance diagram for TRestDataSet:
TRestMetadata

Data Structures

struct  RelevantQuantity
 

Public Member Functions

ROOT::RDF::RNode ApplyRange (size_t from, size_t to)
 This method reduces the number of samples inside the dataset by selecting a range.
 
 ClassDefOverride (TRestDataSet, 8)
 
ROOT::RDF::RNode DefineColumn (const std::string &columnName, const std::string &formula)
 This function will add a new column to the RDataFrame using the same scheme as the usual RDF::Define method, but it will on top of that evaluate the values of any relevant quantities used. More...
 
void EnableMultiThreading (Bool_t enable=true)
 
void Export (const std::string &filename, std::vector< std::string > excludeColumns={})
 It will generate an output file with the dataset compilation. Only the selected branches and the files that fulfill the metadata filter conditions will be included. More...
 
void GenerateDataSet ()
 This function generates the data frame with the filelist and column names (or observables) that have been defined by the user.
 
auto GetAddedColumns () const
 
auto GetCut () const
 
ROOT::RDF::RNode GetDataFrame () const
 Gives access to the RDataFrame.
 
auto GetEndTime () const
 
size_t GetEntries ()
 It returns the number of entries found inside fDataFrame and prints out a warning if the number of entries inside the tree is not the same.
 
auto GetFilePattern () const
 
std::vector< std::string > GetFileSelection ()
 It returns a list of the files that have been finally selected.
 
auto GetFileSelection () const
 
auto GetFilterContains () const
 
auto GetFilterEndTime () const
 
auto GetFilterEqualsTo () const
 
auto GetFilterGreaterThan () const
 
auto GetFilterLowerThan () const
 
auto GetFilterMetadata () const
 
auto GetFilterStartTime () const
 
size_t GetNumberOfBranches ()
 Number of variables (or observables)
 
size_t GetNumberOfColumns ()
 Number of variables (or observables)
 
auto GetObservablesList () const
 
auto GetProcessObservablesList () const
 
auto GetQuantity () const
 
auto GetStartTime () const
 
Double_t GetTotalTimeInSeconds () const
 It returns the accumulated run time in seconds.
 
TTree * GetTree () const
 Gives access to the tree.
 
void Import (const std::string &fileName)
 This function imports metadata from a root file it import metadata info from the previous dataSet while it opens the analysis tree.
 
void Import (std::vector< std::string > fileNames)
 This function initializes the chained tree and the RDataFrame using as input several root files that should contain TRestDataSet metadata information. The values of the first dataset will be considered to be stored in this new instance. More...
 
void Initialize () override
 This function initialize different parameters from the TRestDataSet.
 
auto IsMergedDataSet () const
 
ROOT::RDF::RNode MakeCut (const TRestCut *cut)
 This function applies a TRestCut to the dataframe and returns a dataframe with the applied cuts. Note that the cuts are not applied directly to the dataframe on TRestDataSet, to do so you should do fDataFrame = MakeCut(fCut);.
 
Bool_t Merge (const TRestDataSet &dS)
 This function merge different TRestDataSet metadata in current dataSet.
 
TRestDataSetoperator= (TRestDataSet &dS)
 Operator to copy TRestDataSet metadata.
 
void PrintMetadata () override
 Prints on screen the information about the metadata members of TRestDataSet.
 
ROOT::RDF::RNode Range (size_t from, size_t to)
 This method returns a RDataFrame node with the number of samples inside the dataset by selecting a range. It will not modify internally the dataset. See ApplyRange to modify internally the dataset.
 
void SetDataFrame (const ROOT::RDF::RNode &dS)
 
void SetFilePattern (const std::string &pattern)
 
void SetObservablesList (const std::vector< std::string > &obsList)
 
void SetQuantity (const std::map< std::string, RelevantQuantity > &quantity)
 
void SetTotalTimeInSeconds (Double_t seconds)
 
 TRestDataSet ()
 Default constructor.
 
 TRestDataSet (const char *cfgFileName, const std::string &name="")
 Constructor loading data from a config file. More...
 
 ~TRestDataSet ()
 Default destructor.
 

Protected Member Functions

virtual std::vector< std::string > FileSelection ()
 Function to determine the filenames that satisfy the dataset conditions.
 
void RegenerateTree (std::vector< std::string > finalList={})
 It regenerates the tree so that it is an exact copy of the present DataFrame.
 

Private Member Functions

void InitFromConfigFile () override
 Initialization of specific TRestDataSet members through an RML file. More...
 

Private Attributes

std::vector< std::pair< std::string, std::string > > fColumnNameExpressions
 A list of new columns together with its corresponding expressions added to the dataset.
 
TRestCutfCut = nullptr
 Parameter cuts over the selected dataset.
 
ROOT::RDF::RNode fDataFrame = ROOT::RDataFrame(0)
 The resulting RDF::RNode object after initialization.
 
Double_t fEndTime = REST_StringHelper::StringToTimeStamp(fFilterStartTime)
 TimeStamp for the end time of the last file.
 
Bool_t fExternal = false
 
std::string fFilePattern = ""
 A glob file pattern that must be satisfied by all files.
 
std::vector< std::string > fFileSelection
 A list populated by the FileSelection method using the conditions of the dataset.
 
std::vector< std::string > fFilterContains
 If not empty it will check if the metadata member contains the string.
 
std::string fFilterEndTime = "3000/12/31"
 All the selected runs will have an ending date before fEndTime.
 
std::vector< Double_t > fFilterEqualsTo
 If the corresponding element is not empty it will check if the metadata member is equal.
 
std::vector< Double_t > fFilterGreaterThan
 If the corresponding element is not empty it will check if the metadata member is greater.
 
std::vector< Double_t > fFilterLowerThan
 If the corresponding element is not empty it will check if the metadata member is lower.
 
std::vector< std::string > fFilterMetadata
 A list of metadata members where filters will be applied.
 
std::string fFilterStartTime = "2000/01/01"
 All the selected runs will have a starting date after fStartTime.
 
std::vector< std::string > fImportedFiles
 The list of dataset files imported.
 
Bool_t fMergedDataset = false
 It keeps track if the generated dataset is a pure dataset or a merged one.
 
Bool_t fMT = false
 A flag to enable Multithreading during dataframe generation.
 
std::vector< std::string > fObservablesList
 It contains a list of the observables that will be added to the final tree or exported file.
 
std::vector< std::string > fProcessObservablesList
 It contains a list of the process where all observables should be added.
 
std::map< std::string, RelevantQuantityfQuantity
 The properties of a relevant quantity that we want to store together with the dataset.
 
Double_t fStartTime = REST_StringHelper::StringToTimeStamp(fFilterEndTime)
 TimeStamp for the start time of the first file.
 
Double_t fTotalDuration = 0
 The total integrated run time of selected files.
 
TChain * fTree = nullptr
 A pointer to the generated tree.
 

Additional Inherited Members

Constructor & Destructor Documentation

◆ TRestDataSet()

TRestDataSet::TRestDataSet ( const char *  cfgFileName,
const std::string &  name = "" 
)

Constructor loading data from a config file.

If no configuration path is defined using TRestMetadata::SetConfigFilePath the path to the config file must be specified using full path, absolute or relative.

The default behaviour is that the config file must be specified with full path, absolute or relative.

Parameters
cfgFileNameA const char* giving the path to an RML file.
nameThe name of the specific metadata. It will be used to find the corresponding TRestAxionMagneticField section inside the RML.

Definition at line 319 of file TRestDataSet.cxx.

Member Function Documentation

◆ DefineColumn()

ROOT::RDF::RNode TRestDataSet::DefineColumn ( const std::string &  columnName,
const std::string &  formula 
)

This function will add a new column to the RDataFrame using the same scheme as the usual RDF::Define method, but it will on top of that evaluate the values of any relevant quantities used.

For example, the following code line would create a new column named test replacing the relevant quantity Nsim and the previously existing column probability.

d.Define("test", "Nsim * probability");

Definition at line 618 of file TRestDataSet.cxx.

◆ Export()

void TRestDataSet::Export ( const std::string &  filename,
std::vector< std::string >  excludeColumns = {} 
)

It will generate an output file with the dataset compilation. Only the selected branches and the files that fulfill the metadata filter conditions will be included.

For the moment we produce two different output files.

  • A plain-text (file recognized with csv or txt extension): It will produce a table with observable values, including a header of the dataset conditions.
  • A root file (recognized through its root extension): It will write to disk an Snapshot of the current dataset, i.e. in standard TTree format, together with a copy of the TRestDataSet instance that contains the conditions used to generate the dataset.

Definition at line 861 of file TRestDataSet.cxx.

◆ Import()

void TRestDataSet::Import ( std::vector< std::string >  fileNames)

This function initializes the chained tree and the RDataFrame using as input several root files that should contain TRestDataSet metadata information. The values of the first dataset will be considered to be stored in this new instance.

The metadata member fMergedDataset will be set to true to understand this dataset is the combination of several datasets, and not a pure original one.

Definition at line 1097 of file TRestDataSet.cxx.

◆ InitFromConfigFile()

void TRestDataSet::InitFromConfigFile ( )
overrideprivatevirtual

Initialization of specific TRestDataSet members through an RML file.

Reading filters

Reading observables

Reading process observables

Reading relevant quantities

Reading new dataset columns

Reimplemented from TRestMetadata.

Definition at line 732 of file TRestDataSet.cxx.


The documentation for this class was generated from the following files: