# Quickstart Tutorial
## 1. Install
See instructions at Install.
## 2. Get Dataset
For now, let's keep our data in a directory, with its location saved to an environment variable:

```bash
export EXATRKX_DATA=/my/data/path
```

(You can hard-code these paths into your custom configs later.)
The easiest way to get the TrackML dataset is to use the Kaggle API. Install it with

```bash
pip install kaggle
```

and grab a small toy dataset with

```bash
kaggle competitions download \
  -c trackml-particle-identification \
  -f train_sample.zip \
  -p $EXATRKX_DATA
```
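The download arrives as a zip archive. Here is a minimal sketch of unpacking it, assuming the file landed at `$EXATRKX_DATA/train_sample.zip` (adjust the path if your layout differs):

```python
import os
import zipfile

# Assumes the Kaggle CLI saved the archive as $EXATRKX_DATA/train_sample.zip.
data_dir = os.environ["EXATRKX_DATA"]
with zipfile.ZipFile(os.path.join(data_dir, "train_sample.zip")) as archive:
    archive.extractall(data_dir)
```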
## 3. Running the Pipeline
### Configuration
A pipeline runs at three layers of configuration, to allow for as much flexibility as possible. To get running immediately, however, you needn't change any of the defaults. From the `Pipelines/TrackML_Example/` directory, we run

```bash
traintrack configs/pipeline_quickstart.yaml
```
While it's running, get a cup of tea and a Tim-Tam, and let's see what it's doing:
### Default behaviour
Our quickstart pipeline is running three stages, with a single configuration for each. You can see in `configs/pipeline_quickstart.yaml` that the three stages are:

- A Processing stage, with the class `FeatureStore` and config `prepare_small_feature_store.yaml`;
- An Embedding stage, with the class `LayerlessEmbedding` and config `train_small_embedding.yaml`; and
- A GNN stage, with the class `VanillaFilter` and config `train_small_filter.yaml`.
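For intuition, here is a minimal sketch of how such a stage list might be declared and read. The field names (`stage_list`, `set`, `name`, `config`) are assumptions for illustration, not the verified traintrack schema; check `configs/pipeline_quickstart.yaml` itself for the real structure.

```python
import yaml

# Illustrative guess at the pipeline config structure -- the real schema
# lives in configs/pipeline_quickstart.yaml.
example = """
stage_list:
  - {set: Processing, name: FeatureStore,       config: prepare_small_feature_store.yaml}
  - {set: Embedding,  name: LayerlessEmbedding, config: train_small_embedding.yaml}
  - {set: GNN,        name: VanillaFilter,      config: train_small_filter.yaml}
"""

for stage in yaml.safe_load(example)["stage_list"]:
    print(f'{stage["set"]:>10} stage: {stage["name"]} ({stage["config"]})')
```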
The Processing stage is exactly that: data processing. It is not "trainable", so the pipeline treats it differently from a trainable stage. Under the hood, it is a `LightningDataModule`, rather than one of the trainable models, which inherit from `LightningModule`. In this case, `FeatureStore` performs some calculations on the cell information in the detector and constructs the truth graphs that will later be used for training. These calculations are computationally expensive, so it doesn't make sense to run them on-the-fly during training.
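To make the distinction concrete, here is a hypothetical processing-style stage written as a `LightningDataModule`. This is a sketch of the pattern only, not the actual `FeatureStore` code: expensive, one-off work goes in `prepare_data()`, and the results are written to disk so training never pays that cost.

```python
import os
import torch
from pytorch_lightning import LightningDataModule

class ToyFeatureStore(LightningDataModule):
    """Hypothetical processing stage -- not the real FeatureStore."""

    def __init__(self, input_dir, output_dir):
        super().__init__()
        self.input_dir = input_dir
        self.output_dir = output_dir

    def prepare_data(self):
        # One-off, expensive preprocessing: e.g. building truth graphs
        # from raw hit and cell information, saved for later stages.
        os.makedirs(self.output_dir, exist_ok=True)
        for name in os.listdir(self.input_dir):
            event = torch.load(os.path.join(self.input_dir, name))
            event["truth_graph"] = self.build_truth_graph(event)
            torch.save(event, os.path.join(self.output_dir, name))

    def build_truth_graph(self, event):
        # Placeholder for the expensive truth-graph construction.
        raise NotImplementedError("illustrative sketch only")
```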
The trainable models, Embedding and GNN, learn the non-linear metric of the truth graphs and the pairwise likelihood of two hits sharing a truth-graph edge, respectively. The details are not so important at this stage; what matters is that these stages are modular: each one can be run alone, but by adding a callback to the end, it can prepare the dataset for the next stage. Looking at `LightningModules/Embedding/train_small_embedding.yaml`, you will see that a callback is given as `callbacks: EmbeddingInferenceCallback`. Any number of callbacks can be added, and they adhere to the Lightning callback system. The one referred to here runs the best version of the trained Embedding model on each directory of the data split (train, val, test) and saves the results for the next stage. We could also add a telemetry callback, e.g. the `EmbeddingTelemetry` callback in `LightningModules/Embedding/Models/inference.py`. This callback saves a PDF plot of transverse momentum vs. efficiency for the metric learning model in the `output_dir`. It "hooks" into the testing phase, which is run after every training phase.
The default settings of this run pull from the three configuration files given at each stage. You can look at them