FENNEC
Detailed Notes | Link to Paper (Soon to be updated)
Aug. 2024 ~ Present
Distributed Learning Simulation with High Network Fidelity
What is FENNEC?
I am researching and developing a framework that automatically synthesizes network traffic workloads, enabling high-fidelity network simulation of distributed machine learning tasks.
The framework will take an unmodified ML workload written in a library such as PyTorch or TensorFlow as input and produce traffic patterns compatible with various network simulators. Because the traffic patterns are generated directly from the input workload, they can be highly accurate, which makes it possible to iterate quickly on network stack designs without access to an actual GPU cluster. I’m collaborating with JeongYoon Moon to design and implement several key components of the project, including a workload profiler and a traffic pattern generator.
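To make the idea of a traffic pattern generator concrete, here is a minimal sketch in Python. It is not FENNEC's actual design: the `Flow` record and the `ring_allreduce_flows` function are hypothetical names chosen for illustration. The sketch assumes the profiled workload contains a single all-reduce, that the collective uses the standard ring algorithm, and that the tensor is split into equal chunks. It expands that one collective into the point-to-point flows a network simulator could replay.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One point-to-point transfer (hypothetical record, for illustration only)."""
    step: int        # communication step within the collective
    src: int         # sending rank
    dst: int         # receiving rank
    size_bytes: int  # payload size of this transfer


def ring_allreduce_flows(num_ranks: int, tensor_bytes: int) -> list[Flow]:
    """Expand one ring all-reduce into per-step point-to-point flows.

    A ring all-reduce over N ranks runs 2 * (N - 1) steps: N - 1 for
    reduce-scatter and N - 1 for all-gather. In every step each rank sends
    one chunk of roughly tensor_bytes / N to its next neighbor on the ring.
    """
    if num_ranks < 2:
        return []  # a single rank has nothing to communicate

    # Ceiling division, so the chunks cover the whole tensor.
    chunk_bytes = -(-tensor_bytes // num_ranks)

    flows = []
    for step in range(2 * (num_ranks - 1)):
        for rank in range(num_ranks):
            flows.append(Flow(step, rank, (rank + 1) % num_ranks, chunk_bytes))
    return flows


if __name__ == "__main__":
    # Example: a 4000-byte gradient tensor all-reduced across 4 ranks.
    for flow in ring_allreduce_flows(num_ranks=4, tensor_bytes=4000)[:4]:
        print(flow)
```

A real generator would also need timing information from the profiler (when each collective is issued relative to compute) and an output format matched to the target simulator. Both are outside the scope of this sketch.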
I will update this page once the project is finished. In the meantime, please check out this link to Notion for detailed notes that I update periodically.
So far, I’ve worked with ns-3 and ASTRA-sim, a distributed workload simulator used by Meta, to identify inaccuracies in how ASTRA-sim represents the network.