UPDATE ON YODA/DROID MPI ISSUES


Uploaded On 2021-07-05






Presentation Transcript

UPDATE ON YODA/DROID MPI ISSUES
Doug Benjamin, ANL HEP

Problem
- Droids send a REQUEST_JOB message to Yoda.
- Yoda would respond with a NEW_JOB message to each Droid.
- Sometimes the message would get through and other times it would not; this was task dependent.
- This behavior was not seen in all of the 1M event validation jobs sent to Yoda/Droid.
- Until Monday the default message buffer size was 1 MB (1000000 bytes).
- Messages larger than 1 MB were not delivered, whether they were non-blocking (the default) or blocking messages.
- As part of debugging, the message was written to the log file, which revealed its actual size.
- Why are we sending such large messages?

Cause
- Very large NEW_JOB message size: 12 MB
- Analysis by Tadashi: here is the list of attributes and their sizes.

  Size (bytes)   Attribute
  3459039        prodDBlocks
  3093179        realDatasetsIn
  1696259        dispatchDblock
  1230619        GUID
  1130839        inFiles
  498899         ddmEndPointIn
  399119         checksum
  365859         scopeIn
  355881         inFilePaths
  332599         fsize
  166299         dispatchDBlockToken
  791            jobPars

- Note: Yoda needs very little of this information.

Fixes
- Short-term patch in the file yoda_droid.py:
  https://github.com/PanDAWMS/panda-yoda/blob/master/pandayoda/yoda_droid.py
  Change line 92
  from: mpi_default_message_buffer_size = 1000000
  to:   mpi_default_message_buffer_size = 20000000
- Proper patch by Tadashi: "One problem is that the same information about one file is appended several times, e.g., which happened due to nEventsPerJob nEventsPerFile. I've fixed the panda server. The sizes of prodDBlocks ~ dispatchDBlockToken can be reduced by 1/10."

Fixes (2)
- Harvester patch by Tadashi: "I've added stripJobParams to shared_file_messenger in the git repo. You need in panda_queues.json"
  "messenger": { ... "stripJobParams": true
- Harvester updated at NERSC and ALCF.
- Testing has begun at ALCF.

Additional Yoda changes
- Need to determine why changes to the Yoda config file (default buffer size change) did not take effect.
- Need to accurately report the size of messages in the log files. New code is needed to handle the nested nature of the messages (code to be added today).

Conclusion
- Problem solved.
- NERSC and ALCF are producing events.
- Now time to measure the efficiency (events per wall clo
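
The follow-up item on reporting message sizes for nested messages could look roughly like the sketch below. It is not the code that was added to Yoda. It assumes the message is a nested, JSON-serializable dict and measures each attribute by its UTF-8 JSON length, which may differ from the size of whatever serialization Yoda actually puts on the wire. The function names (serialized_size, attribute_sizes, fits_in_buffer), the dotted attribute paths, and the sample message are hypothetical; only the 20000000-byte limit is taken from the short-term patch.

```python
import json

# Assumed limit: mirrors the patched default quoted in the slides.
MPI_DEFAULT_MESSAGE_BUFFER_SIZE = 20000000


def serialized_size(value):
    """Return the size in bytes of value when serialized as UTF-8 JSON."""
    return len(json.dumps(value).encode("utf-8"))


def attribute_sizes(message):
    """Return (attribute path, size in bytes) pairs, largest first.

    Nested dicts are walked recursively, so each leaf attribute is
    reported under a dotted path such as "job.inFiles".
    """
    sizes = []

    def walk(node, prefix):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            if isinstance(value, dict):
                walk(value, path)
            else:
                sizes.append((path, serialized_size(value)))

    walk(message, "")
    return sorted(sizes, key=lambda item: item[1], reverse=True)


def fits_in_buffer(message, limit=MPI_DEFAULT_MESSAGE_BUFFER_SIZE):
    """Return True if the whole serialized message is within the limit."""
    return serialized_size(message) <= limit


# Hypothetical NEW_JOB-like message, for illustration only.
sample = {
    "type": "NEW_JOB",
    "job": {"inFiles": "a.root,b.root", "jobPars": "--maxEvents 10"},
}
for path, size in attribute_sizes(sample):
    print(f"{size:>10d}B {path}")
print("fits in buffer:", fits_in_buffer(sample))
```

Logging a per-attribute breakdown like this before each send would show which fields dominate a message, in the same form as the attribute list in the Cause slide, and would flag messages that exceed the configured buffer before they are silently dropped.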