A first look at CitusDB


Presentation Transcript

1. A first look at CitusDB & in-database physics analysis
M. Limper, 19/06/2014

2. Introduction
Physics analysis is currently file-based
Scanning through large datasets can be cumbersome
The idea: send jobs to the computing grid
In practice: bored of waiting for grid jobs to finish, scientists filter datasets, throwing away data until it fits on the physicist's laptop
What if we could provide access to large datasets via a database?

3. Introduction
In-database physics analysis: SQL goes in, results come out!
[Plot: invariant-mass spectrum showing the J/ψ and Ψ(3686) peaks]
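To make "SQL goes in, results come out" concrete: a minimal sketch (not from the slides) of the kind of query that could produce such an invariant-mass spectrum, assuming a per-muon table named mu_staco keyed by (RunNumber, EventNumber, mu_i) with columns mu_E, mu_px, mu_py, mu_pz and mu_charge; all of these names are assumptions, not the presenter's actual schema.

  -- Hypothetical sketch: invariant mass of each opposite-charge muon pair,
  -- m^2 = (E1+E2)^2 - (px1+px2)^2 - (py1+py2)^2 - (pz1+pz2)^2
  SELECT m1.RunNumber, m1.EventNumber,
         sqrt(greatest(
               power(m1.mu_E  + m2.mu_E , 2)
             - power(m1.mu_px + m2.mu_px, 2)
             - power(m1.mu_py + m2.mu_py, 2)
             - power(m1.mu_pz + m2.mu_pz, 2), 0)) AS m_mumu
  FROM mu_staco m1
  JOIN mu_staco m2
    ON  m2.RunNumber   = m1.RunNumber
    AND m2.EventNumber = m1.EventNumber
    AND m2.mu_i        > m1.mu_i             -- count each pair only once
  WHERE m1.mu_charge * m2.mu_charge < 0;     -- opposite-charge pairs

Histogramming the resulting m_mumu values is what would give a spectrum with peaks like those on the slide.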

4. My data
Test sample of 127 ntuple files of collision data recorded by the ATLAS experiment => a subset representing 3 fat 'LHC runs', ~0.5% of the total dataset
7.1 million events total
6022 "branches" per event:
  2053 "scalar"-type branches
  3527 "vector"-type branches
  379 "vector-of-vector"-type branches
  63 "vector-of-vector-of-vector"-type branches
~200 GB of data

5. Ntuple branch examples: 2053 scalar-type variables
Missing Energy: one value per branch per event
  Float_t MET_RefFinal_em_etx;
  Float_t MET_RefFinal_em_ety;
  Float_t MET_RefFinal_em_phi;
  Float_t MET_RefFinal_em_et;
  Float_t MET_RefFinal_em_sumet;
  Float_t MET_RefFinal_etx;
  Float_t MET_RefFinal_ety;
  Float_t MET_RefFinal_phi;
Event Filter: one value per branch per event
  Bool_t EF_2b55_loose_j145_j55_a4tchad;
  Bool_t EF_2e12Tvh_loose1;
  Bool_t EF_2e5_tight1_Jpsi;
  Bool_t EF_2e7T_loose1_mu6;
  Bool_t EF_2e7T_medium1_mu6;
  Bool_t EF_2g15vh_medium_g10_medium;
  Bool_t EF_2g20vh_medium;
Lots of variables, but a relatively small fraction of the total dataset

6. Ntuple branch examples: 3527 vector-type variables
One value per electron per event
  vector<float> *el_E;
  vector<float> *el_Et;
  vector<float> *el_pt;
  vector<float> *el_m;
  vector<float> *el_eta;
  vector<float> *el_phi;
  vector<float> *el_px;
  vector<float> *el_py;
  vector<float> *el_pz;
  vector<float> *el_charge;
  vector<int> *el_author;
One value per muon per event
  vector<unsigned short> *mu_allauthor;
  vector<int> *mu_author;
  vector<float> *mu_beta;
  vector<float> *mu_isMuonLikelihood;
  vector<float> *mu_matchchi2;
  vector<int> *mu_matchndof;
  vector<float> *mu_etcone20;
  vector<float> *mu_etcone30;
  vector<float> *mu_etcone40;
  vector<float> *mu_nucone20;
  vector<float> *mu_nucone30;
  vector<float> *mu_nucone40;
One value per photon per event
  vector<float> *ph_CaloPointing_eta;
  vector<float> *ph_CaloPointing_sigma_eta;
  vector<float> *ph_CaloPointing_zvertex;
  vector<float> *ph_CaloPointing_sigma_zvertex;
  vector<float> *ph_HPV_eta;
  vector<float> *ph_HPV_sigma_eta;
  vector<float> *ph_HPV_zvertex;
  vector<float> *ph_HPV_sigma_zvertex;
  vector<int> *ph_NN_passes;
  vector<float> *ph_NN_discriminant;
These represent the bulk of the data (many particles per event!)
Analysis relies heavily on filtering events by selecting particles with certain properties

7. Ntuple branch examples: 379 vector-of-vector type variables
One value per 'SpaceTime' measurement on each muon per event:
  vector<vector<int> > *mu_SpaceTime_detID;
  vector<vector<float> > *mu_SpaceTime_t;
  vector<vector<float> > *mu_SpaceTime_tError;
  vector<vector<float> > *mu_SpaceTime_weight;
One value per vertex per photon per event:
  vector<vector<float> > *ph_vx_px;
  vector<vector<float> > *ph_vx_py;
  vector<vector<float> > *ph_vx_pz;
  vector<vector<float> > *ph_vx_E;
  vector<vector<float> > *ph_vx_m;
  vector<vector<int> > *ph_vx_nTracks;
Used for certain reconstruction performance studies
To be stored in a CLOB or a separate table…
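For the 'separate table' option, one possible layout (a sketch, not from the slides) flattens each inner vector into its own rows, keyed by the event, the muon index and the measurement index; the table and column names below are assumptions.

  -- Hypothetical sketch: one row per SpaceTime measurement per muon per event.
  CREATE TABLE mu_spacetime (
      RunNumber   INTEGER NOT NULL,
      EventNumber INTEGER NOT NULL,
      mu_i        INT NOT NULL,   -- index of the muon within the event
      meas_i      INT NOT NULL,   -- index of the measurement on that muon
      detID       INT,            -- from mu_SpaceTime_detID
      t           FLOAT,          -- from mu_SpaceTime_t
      tError      FLOAT,          -- from mu_SpaceTime_tError
      weight      FLOAT           -- from mu_SpaceTime_weight
  );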

8. Ntuple branch examples: 63 vector-of-vector-of-vector type variables
One value per track per vertex per photon per event:
  vector<vector<vector<int> > > *ph_vx_convTrk_nSiHits;
  vector<vector<vector<float> > > *ph_vx_convTrk_TRTHighTHitsRatio;
  vector<vector<vector<float> > > *ph_vx_convTrk_TRTHighTOutliersRatio;
  vector<vector<vector<float> > > *ph_vx_convTrk_eProbabilityComb;
Not using any of these in my queries; they are typically used for final corrections or certain in-depth studies of reconstruction performance
To be stored in a CLOB or a separate table…

9. Converting ntuples to tables
Self-made program to convert ntuples into database tables
One physics object is represented by one table
Each table still has hundreds of columns!
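As an illustration (a sketch, not the converter's actual output), a per-electron table in this scheme might start like the one below; only a handful of its hundreds of columns are shown, and the layout is modelled on the el_* branches from slide 6.

  -- Hypothetical sketch: one row per electron per event.
  CREATE TABLE el (
      RunNumber   INTEGER NOT NULL,
      EventNumber INTEGER NOT NULL,
      el_i        INT NOT NULL,   -- index of the electron within the event
      el_E    FLOAT, el_Et  FLOAT, el_pt     FLOAT, el_m      FLOAT,
      el_eta  FLOAT, el_phi FLOAT, el_charge FLOAT, el_author INT
      -- ... hundreds more columns in the real table
  );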

10. SQL analysis
SQL analysis involves predicate filtering to select good objects and JOINs to put information from different tables together (see the sketch below).
CitusDB + column-store extension looks interesting:
  Object selection involves only a few out of many columns => would benefit from column storage
  When the preselection passes many objects, JOINs can potentially become huge => would benefit from sharding, with shard distribution based on EventNumber to reduce the JOIN size
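A minimal sketch of such a selection-plus-JOIN query (not from the slides), reusing the hypothetical el and mu_staco tables sketched above; the cut values are purely illustrative.

  -- Hypothetical sketch: events with a high-pT central electron and a high-pT muon,
  -- joined on the event key (RunNumber, EventNumber).
  SELECT e.RunNumber, e.EventNumber, e.el_pt
  FROM el e
  JOIN mu_staco m
    ON  m.RunNumber   = e.RunNumber
    AND m.EventNumber = e.EventNumber
  WHERE e.el_pt > 25000                                    -- illustrative cuts, pT in MeV
    AND abs(e.el_eta) < 2.47
    AND sqrt(power(m.mu_px, 2) + power(m.mu_py, 2)) > 20000;

Only the handful of columns touched by the query needs to be read, which is where the column store helps, and distributing shards by EventNumber limits how much data each JOIN has to look at, as the slide argues.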

11. Storing ntuple data into CitusDB
Re-wrote my program to store the data in CitusDB:
  Read data from all branches with a specific prefix
  Write the data as comma-delimited values to a temporary csv file
  After the csv file passes 5000 lines of data, store the data into CitusDB
  The program triggers psql via a command-line call to execute a psql macro
  The psql macro uses the \STAGE command to load the data
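A sketch of how that plumbing could look (the file name, database name and path are hypothetical; the actual \STAGE syntax used is shown on the next slide).

  -- load_chunk.psql: hypothetical macro written out by the conversion program for
  -- one csv chunk; the converter would then run it with something like:
  --   psql -d physics -f load_chunk.psql
  \STAGE eventdata203779_c FROM '/data_citusdb/csv/chunk_000001.csv' (FORMAT CSV)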

12. Create table statement

CREATE FOREIGN TABLE eventdata203779_c (
    RunNumber INTEGER NOT NULL, EventNumber INTEGER NOT NULL, lbn INTEGER NOT NULL,
    "bunch_configID" INT, "timestamp" INT, "timestamp_ns" INT, "bcid" INT,
    "detmask0" INT, "detmask1" INT, "actualIntPerXing" FLOAT, "averageIntPerXing" FLOAT,
    "pixelFlags" INT, "sctFlags" INT, "trtFlags" INT, "larFlags" INT, "tileFlags" INT,
    "fwdFlags" INT, "coreFlags" INT, "pixelError" INT, "sctError" INT, "trtError" INT,
    "larError" INT, "tileError" INT, "fwdError" INT, "coreError" INT,
    "streamDecision_Egamma" BOOLEAN, "streamDecision_Muons" BOOLEAN,
    "streamDecision_JetTauEtmiss" BOOLEAN, "isSimulation" BOOLEAN,
    "isCalibration" BOOLEAN, "isTestBeam" BOOLEAN,
    "el_n" INT, "v0_n" INT, "ph_n" INT, "mu_n" INT, "tau_n" INT, "trk_n" INT,
    "jet_n" INT, "vxp_n" INT, "top_hfor_type" INT,
    "Muon_Total_Staco_STVF_etx" FLOAT, "Muon_Total_Staco_STVF_ety" FLOAT,
    "Muon_Total_Staco_STVF_phi" FLOAT, "Muon_Total_Staco_STVF_et" FLOAT,
    "Muon_Total_Staco_STVF_sumet" FLOAT, "Muon_Total_Staco_STVF_top_etx" FLOAT,
    "Muon_Total_Staco_STVF_top_ety" FLOAT, "Muon_Total_Staco_STVF_top_phi" FLOAT,
    "Muon_Total_Staco_STVF_top_et" FLOAT, "Muon_Total_Staco_STVF_top_sumet" FLOAT,
    "mb_n" INT, "collcand_passCaloTime" BOOLEAN, "collcand_passMBTSTime" BOOLEAN,
    "collcand_passTrigger" BOOLEAN, "collcand_pass" BOOLEAN)
DISTRIBUTE BY APPEND (EventNumber)
SERVER cstore_server OPTIONS(filename '', compression 'pglz');

Create foreign tables stored using the column-store extension
Distribute shards by EventNumber:
  Keep data from the same event together
  Facilitate joins between different tables
One table per RunNumber: distributing shards by (RunNumber, EventNumber) is not possible

13. \STAGE statement

I found it useful to specify the columns, as different ntuples can contain different branches: if a column is not specified in the csv, it should insert NULL. So I tried:

\STAGE eventdata203779_c (RunNumber, EventNumber, lbn, "bunch_configID", "timestamp",
    "timestamp_ns", "bcid", "detmask0", "detmask1", "actualIntPerXing", "averageIntPerXing",
    "pixelFlags", "sctFlags", "trtFlags", "larFlags", "tileFlags", "fwdFlags", "coreFlags",
    "pixelError", "sctError", "trtError", "larError", "tileError", "fwdError", "coreError",
    "streamDecision_Egamma", "streamDecision_Muons", "streamDecision_JetTauEtmiss",
    "isSimulation", "isCalibration", "isTestBeam", "el_n", "v0_n", "ph_n", "mu_n", "tau_n",
    "trk_n", "jet_n", "vxp_n", "top_hfor_type", "Muon_Total_Staco_STVF_etx",
    "Muon_Total_Staco_STVF_ety", "Muon_Total_Staco_STVF_phi", "Muon_Total_Staco_STVF_et",
    "Muon_Total_Staco_STVF_sumet", "Muon_Total_Staco_STVF_top_etx", "Muon_Total_Staco_STVF_top_ety",
    "Muon_Total_Staco_STVF_top_phi", "Muon_Total_Staco_STVF_top_et", "Muon_Total_Staco_STVF_top_sumet",
    "mb_n", "collcand_passCaloTime", "collcand_passMBTSTime", "collcand_passTrigger",
    "collcand_pass") FROM '/data1/citus_db/csv/NTUP_TOPEL.00872780.NTUP_TOPEL.00872780._000001.root.1.eventdata.csv' (FORMAT CSV)

Gives: \copy: ERROR:  copy column list is not supported

Too bad: I can't define columns when inserting into foreign tables using \STAGE. Similarly, I'd like to have an option to add columns (is this possible? Didn't look at it yet).

Instead I'll use the simple \STAGE command:

\STAGE eventdata203779_c FROM '/data_citusdb/csv/NTUP_TOPEL.00872780.NTUP_TOPEL.00872780._000001.root.1.eventdata.csv' (FORMAT CSV)

14. Primary key issues
I'd like to set a primary key on (RunNumber, EventNumber, ObjectNumber):
  Ntuple files occasionally store the same event twice
  Due to the way the experiment records data from 'streams', there is some overlap between different streams
  Sorting out doubles is yet another hassle for physicists to deal with; a database with a primary key ensures unique events are stored… but:
Currently, a \STAGE insert of the entire .csv file fails when it finds 1 double among the 50000 lines
Work-around: no primary key constraint for now…
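With the constraint dropped, duplicates at least remain easy to spot after loading; a minimal sketch (not from the slides; whether CitusDB's distributed planner accepts the HAVING clause here is untested).

  -- Find event keys that were loaded more than once into the event-level table.
  SELECT RunNumber, EventNumber, count(*) AS n_copies
  FROM eventdata203779_c
  GROUP BY RunNumber, EventNumber
  HAVING count(*) > 1;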

15. Other issues
While testing, I frequently decide to recreate some tables, but how do I drop a FOREIGN table including its shards?
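One possible answer, stated as an assumption rather than something tested here: if this CitusDB build provides the master_apply_delete_command() function described in the CitusDB documentation for append-distributed tables, the shards can be removed first and the foreign table dropped afterwards.

  -- Hypothetical sketch: delete all shards belonging to the table, then drop the table.
  SELECT master_apply_delete_command('DELETE FROM eventdata203779_c');
  DROP FOREIGN TABLE eventdata203779_c;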

16. Storing ntuple data into CitusDB
Example of the insert program churning through the data…

17. Query test
15:34. Do I still have time to test something before the call??

18. To-do
Insert all my data:
  I need to find a good way to use my 18 disks per node (mystery errors were coming from my RAID setup)
  Maybe I just mount each disk separately and run one worker per disk?
Get some queries going!
