Identifying and Optimizing I/O Bottlenecks in Deep Learning Model Training on Supercomputers
Presented at Argonne National Laboratory's Learning on The Lawn Summer Research Conference, 2023
AI models, particularly neural networks, have become pivotal across many fields. Inspired by the human brain, neural networks excel at processing complex data, recognizing patterns, and solving problems, enabling groundbreaking applications. Training these networks is computationally expensive, leading scientists to distribute the computation across supercomputers. At that scale, I/O (the storage and retrieval of data from a file system) can introduce significant delays. Analyzing these delays helps identify performance bottlenecks and informs optimizations to the system architecture that reduce compute time and cost. With the goal of understanding the I/O behavior of complex AI models and the capabilities of profiling tools, we deployed the "CosmicTagger" model on the Polaris supercomputer. We used NVIDIA Nsight to trace GPU activity and PyDarshan to trace I/O activity during model training. Moving forward, we aim to correlate these two data streams on a common time axis to observe I/O bottlenecks during training, providing insight into opportunities for I/O optimization.
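The correlation step described above can be sketched in Python. This is a minimal illustration, not the project's actual pipeline: it assumes GPU idle gaps (e.g. derived from an Nsight trace, whose timestamps are nanosecond offsets from a session start) and I/O events (e.g. from PyDarshan, typically in epoch seconds) have already been extracted, converts both to a shared epoch-seconds timeline, and reports which I/O events overlap GPU idle periods. All names and data here are hypothetical.

```python
def to_epoch_seconds(session_start_epoch, offset_ns):
    """Convert a trace offset in nanoseconds since session start to epoch seconds."""
    return session_start_epoch + offset_ns / 1e9

def overlapping_io(gpu_gaps, io_events):
    """Pair each GPU idle gap with the I/O events whose intervals overlap it.

    Both inputs are lists of (start_s, end_s) tuples in epoch seconds.
    Two intervals overlap when each starts before the other ends.
    """
    hits = []
    for g_start, g_end in gpu_gaps:
        for io_start, io_end in io_events:
            if io_start < g_end and io_end > g_start:
                hits.append(((g_start, g_end), (io_start, io_end)))
    return hits

# Hypothetical traces placed on a shared epoch timeline.
session_start = 1_690_000_000.0  # assumed Nsight session start, epoch seconds
gpu_gaps = [(to_epoch_seconds(session_start, 2_000_000_000),   # idle 2.0 s - 3.5 s
             to_epoch_seconds(session_start, 3_500_000_000))]
io_events = [(session_start + 2.4, session_start + 3.1),   # read overlapping the gap
             (session_start + 10.0, session_start + 10.2)] # read outside any gap

print(overlapping_io(gpu_gaps, io_events))  # one overlapping pair
```

In a real analysis, the overlap test would likely run over sorted intervals (or an interval tree) rather than the nested loop shown here, but the timeline-alignment idea is the same.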
Download Paper | Download Slides