Slide 1: Anomaly detection for ECAL DQM
Nabarun Dev (1), Colin Jessop (1), Nancy Marinelli (1), Maurizio Pierini (2)
08/11/2017
(1) University of Notre Dame, (2) CERN
DQM-ML meeting
Slide 2: Outline
N.Dev, University of Notre Dame
DQM-ML Meeting, 08/11/17
- Introduction
- Dataset
- Models
- Preliminary results
Slide 3: Introduction
The DQM (Data Quality Monitoring) system is an important tool for ensuring high-quality data-taking for analysis purposes. It is used both online and offline.
In real time, data quality is currently assessed by looking at a dashboard containing a set of histograms, which are compared to a reference set of histograms according to a certain set of instructions.
By spotting anomalies in these images (plots/histograms), it is possible to identify problems that appear in the detector, flag poor-quality data, and/or take steps toward fixing these issues.
Slide 4: ECAL DQM
The ECAL DQM consists of a set of several histograms that help us monitor the running condition of the ECAL subdetector. The DQM GUI is shown below.
The histograms are usually redrawn every lumisection. By monitoring these, the shifter is able to spot problems in real time.
Slide 5: Avenues of improvement
Although the DQM system has performed well over the years, there are areas that can potentially be improved:
- The number of plots to monitor can be overwhelmingly large, which can delay the spotting of a problem or cause a transient problem to be overlooked.
- A lot of manpower is needed to constantly monitor these systems during data-taking.
- Monitoring decisions can vary from shifter to shifter.
Machine learning techniques can help reduce these issues. The aim of this project is to study the feasibility of automating the ECAL DQM system. This falls under the umbrella of the ML-for-DQM effort for the entire CMS detector.
Slide 6: DISCLAIMER: WORK IN PROGRESS
The following is very much a work in progress, and all models and strategies discussed are preliminary. Kindly chime in with suggestions.
Slide 7: Current strategy
- Assume that normal instances occur much more frequently than anomalous instances.
- Use only normal instances (semi-supervised learning) to train an autoencoder (input is mapped to output; the system learns to reconstruct the input with minimum loss) [1], minimizing the loss function.
- The loss then also serves as a metric: good instances should be reconstructed with low loss, while bad instances should be reconstructed with higher loss.

[1]: http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/ , http://www.deeplearningbook.org/contents/autoencoders.html
Slide 8: Dataset
The current dataset consists of ~40000 samples from 2016 data, from lumisections marked as good (2016 golden JSON, CMSSW_9_2_11). [Thanks to Tanmay & Michael from the ECAL DQM team]
The dataset (SingleElectron/RAW) is processed to emulate the online DQM running conditions and produce one sample (set of images) per lumisection. (Most of the RAW data required for this has been moved to tape; only 40k images could be acquired. The plan is to add more images using 2017 data in the near future.)
The current image set per sample consists of rechit occupancy and timing plots: one for the barrel and one for each of the two endcaps.
Slide 9: Dataset
The following study uses only the rechit occupancy images for the barrel.
PREPROCESSING: The only preprocessing done was to normalize the histograms to their integral.
The holes in the plot are usually due to permanently masked channels/towers; the network can be expected to learn that they are OK.
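The normalization step can be sketched as follows (a minimal numpy illustration; the 170x360 barrel image size and the function name are assumptions, not taken from the slides):

```python
import numpy as np

def normalize_to_integral(occupancy):
    """Divide an occupancy histogram by its integral so that every
    lumisection image has unit sum, independent of luminosity."""
    total = occupancy.sum()
    if total == 0:
        return occupancy.astype(float)  # fully empty image: leave as zeros
    return occupancy / total

# Toy stand-in for one barrel rechit occupancy map
occ = np.random.poisson(lam=5.0, size=(170, 360)).astype(float)
occ_norm = normalize_to_integral(occ)
```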
Slide 10: Model 0
Framework used: Keras library with TensorFlow backend.
Autoencoder with convolutional layers:
- Conv2D (8 channels, (3x3) patches)
- MaxPooling2D (2, 2)
- Conv2D (8, (3x3))
- MaxPooling2D (5, 5)
- Conv2D (8, (3x3))
- UpSampling2D (5, 5)
- Conv2D (8, (3x3))
- UpSampling2D (2, 2)
- Conv2D (8, (3x3))
All Conv layers are 'padded' to keep the spatial size of the output the same as the input.
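As a sketch, the layer list above could be written in Keras roughly as follows. The input shape, the ReLU activations, and the 1-channel sigmoid output are assumptions not stated on the slide:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input size (170, 360, 1) is an assumption: the EB occupancy map has
# 170x360 crystals, which divides evenly under the 2x2 and 5x5 poolings.
inputs = keras.Input(shape=(170, 360, 1))
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(inputs)
x = layers.MaxPooling2D((2, 2))(x)   # -> (85, 180, 8)
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = layers.MaxPooling2D((5, 5))(x)   # -> (17, 36, 8)
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = layers.UpSampling2D((5, 5))(x)   # -> (85, 180, 8)
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = layers.UpSampling2D((2, 2))(x)   # -> (170, 360, 8)
# The slide lists a final 8-channel conv; a 1-channel sigmoid output is
# assumed here so the reconstruction matches the single-channel input
# and pairs with the binary cross-entropy loss used later.
outputs = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
autoencoder = keras.Model(inputs, outputs)
```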
Slide 11: Training performance
- Trained on GPU, batch size of 30.
- 60:20:20 split of train:test:validation.
- Patience = 5 (number of epochs to wait, in which the validation loss does not decrease by a minimum threshold (0.05%), before stopping training).
- Optimizer: gradient descent (learning rate 0.01). Loss function: binary cross-entropy.
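The training setup above might look roughly like this in Keras. The `min_delta` reading of the quoted 0.05% threshold (taken as an absolute 5e-4), the epoch cap, and the helper name are assumptions:

```python
from tensorflow import keras

# Early stopping as described: wait 5 epochs for the validation loss to
# improve by at least the minimum threshold before stopping training.
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss', min_delta=5e-4, patience=5)

# Plain gradient descent with learning rate 0.01 and binary
# cross-entropy loss; `model`, `x_train`, and `x_val` are placeholders
# for the autoencoder and the 60:20:20 data splits.
def compile_and_fit(model, x_train, x_val, epochs=100):
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
                  loss='binary_crossentropy')
    return model.fit(x_train, x_train, batch_size=30, epochs=epochs,
                     validation_data=(x_val, x_val),
                     callbacks=[early_stop])
```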
Slide 12: Loss as metric
Evaluate the trained model over each sample of the training and test sets and histogram the reconstruction loss.
Training and test sets have similar performance.
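Computing and histogramming the per-sample reconstruction loss could look like this (numpy-only sketch with toy stand-ins for the model outputs; all names, sizes, and numbers are illustrative):

```python
import numpy as np

def per_sample_bce(x, x_hat, eps=1e-7):
    """Pixel-averaged binary cross-entropy between each image and its
    reconstruction: one reconstruction-loss value per sample."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    bce = -(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
    return bce.mean(axis=tuple(range(1, x.ndim)))

# Toy stand-ins for a batch of normalized images and reconstructions
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(100, 17, 36))
x_hat = np.clip(x + rng.normal(0.0, 0.01, x.shape), 0.0, 1.0)
losses = per_sample_bce(x, x_hat)
counts, edges = np.histogram(losses, bins=20)
```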
Slide 13: Choosing a better optimizer
Tried several other optimizers available within the Keras library: Adam, Adam with Nesterov momentum, RMSprop, Adadelta, etc.
Chose the Adam optimizer based on validation loss; RMSprop has similar performance.
Training and test sets have similar performance.
Slide 14: Adding more layers: model 1
Had to decrease the batch size to 20; was hitting a GPU memory bottleneck (I think).
No decrease in loss; trains faster, probably due to the smaller batch size.
Slide 15: Even more layers, less pooling: model 2
The final train/val loss is worse with respect to model 1, with the same patience (patience = 5).
Increasing the patience (to 20) helps decrease the loss to a value similar to model 1.
THIS IS WHERE I AM AT RIGHT NOW
Slide 16: Next steps
- Gather some anomalous examples and evaluate the model on them; compare their loss spectrum to that of normal examples.
- Increase the training set size.
- Try more sophisticated networks: bigger autoencoders with sparsity constraints, etc.
- Use other images besides occupancy (e.g. timing) as input.
- Try other (supervised) learning techniques (e.g. SVMs) and compare performance.
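The first step above, comparing loss spectra of anomalous and normal examples, can be reduced to a simple quantile-threshold test. A hypothetical numpy sketch, with toy loss spectra standing in for real measurements:

```python
import numpy as np

def anomaly_flags(losses, good_losses, quantile=0.99):
    """Flag samples whose reconstruction loss exceeds a high quantile
    of the loss distribution observed on known-good data."""
    threshold = np.quantile(good_losses, quantile)
    return losses > threshold

# Toy loss spectra: a narrow 'good' distribution plus a shifted tail of
# anomalies (all numbers are illustrative, not measured)
rng = np.random.default_rng(1)
good = rng.normal(0.10, 0.01, size=1000)
mixed = np.concatenate([rng.normal(0.10, 0.01, size=95),
                        rng.normal(0.30, 0.02, size=5)])
flags = anomaly_flags(mixed, good)
```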
Slide 17: BACKUP