`Introduction <ddp_series_intro.html>`__ \|\| **What is DDP** \|\| `Single-Node
Multi-GPU Training <ddp_series_multigpu.html>`__ \|\| `Fault
Tolerance <ddp_series_fault_tolerance.html>`__ \|\| `Multi-Node
training <../intermediate/ddp_series_multinode.html>`__ \|\| `minGPT Training <../intermediate/ddp_series_minGPT.html>`__

What is Distributed Data Parallel (DDP)
========================================

Authors: `Suraj Subramanian `__

.. grid:: 2

   .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn

      * How DDP works under the hood
      * What is the ``DistributedSampler``
      * How gradients are synchronized across GPUs

   .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites

      * Familiarity with `basic non-distributed training `__ in PyTorch

Follow along with the video below or on `youtube `__.

This tutorial is a gentle introduction to `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__ (DDP),
which enables data parallel training in PyTorch. Data parallelism is a way to
process multiple data batches across multiple devices simultaneously
to achieve better performance. In PyTorch, the `DistributedSampler <https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler>`__
ensures each device gets a non-overlapping input batch. The model is replicated on all the devices;
each replica calculates gradients and simultaneously synchronizes with the others using the `ring all-reduce
algorithm `__.
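
To make these pieces concrete, below is a minimal, single-machine sketch (not part of this
tutorial's code) showing where ``DistributedSampler`` and DDP fit in a training loop. The toy
linear model, random dataset, port number, and the CPU-friendly ``gloo`` backend are
illustrative choices; substitute your own model, data, and the ``nccl`` backend on GPUs.

.. code-block:: python

    import os

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler


    def run(rank: int, world_size: int):
        # Each process joins the same process group and manages one model replica.
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        # Toy data; DistributedSampler gives each rank a non-overlapping shard.
        dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
        loader = DataLoader(dataset, batch_size=8, sampler=sampler)

        model = DDP(torch.nn.Linear(10, 1))  # wrap the local replica once
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        for epoch in range(2):
            sampler.set_epoch(epoch)  # reshuffle consistently across ranks
            for inputs, targets in loader:
                optimizer.zero_grad()
                loss = torch.nn.functional.mse_loss(model(inputs), targets)
                loss.backward()   # gradients are all-reduced across replicas here
                optimizer.step()  # every replica applies the same averaged update

        dist.destroy_process_group()


    if __name__ == "__main__":
        world_size = 2  # number of worker processes on this machine
        mp.spawn(run, args=(world_size,), nprocs=world_size)

Because the all-reduce happens during ``loss.backward()``, every replica sees identical
gradients and stays in sync without any extra synchronization code in the training loop.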

Why you should prefer DDP over DataParallel (DP)
-------------------------------------------------

`DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__
is an older approach to data parallelism. DP is trivially simple (with just one extra line of code) but it is much less performant.
DDP improves on this design in a few ways (a short sketch after the table contrasts how the two are applied):

+---------------------------------------+------------------------------+
| DataParallel | DistributedDataParallel |
+=======================================+==============================+
| More overhead; model is replicated | Model is replicated only |
| and destroyed at each forward pass | once |
+---------------------------------------+------------------------------+
| Only supports single-node parallelism | Supports scaling to multiple |
| | machines |
+---------------------------------------+------------------------------+
| Slower; uses multithreading on a | Faster (no GIL contention) |
| single process and runs into GIL | because it uses |
| contention | multiprocessing |
+---------------------------------------+------------------------------+
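
As a rough illustration (not taken from this tutorial's code), the snippet below contrasts how
the two wrappers are applied. ``DataParallel`` is a one-line, single-process wrapper, while DDP is
constructed once inside each worker process, as in the earlier sketch. The toy linear model is a
placeholder.

.. code-block:: python

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)

    # DataParallel: single process, one extra line. On a multi-GPU machine the
    # input batch is split across GPUs and the outputs are gathered back on the
    # default device at every forward pass; with no GPU it runs the module as-is.
    dp_model = nn.DataParallel(model)
    print(dp_model(torch.randn(4, 10)).shape)  # torch.Size([4, 1])

    # DistributedDataParallel: one process per device. Each worker process wraps
    # its own replica once, after torch.distributed.init_process_group(); only
    # gradient buckets travel between processes afterwards, so there is no
    # per-step scatter/gather and no GIL contention between replicas.
    #
    #   from torch.nn.parallel import DistributedDataParallel as DDP
    #   ddp_model = DDP(model.to(rank), device_ids=[rank])  # inside each worker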

Further Reading
---------------

- `Multi-GPU training with DDP <ddp_series_multigpu.html>`__ (next tutorial in this series)
- `DDP API <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
- `DDP Internal Design <https://pytorch.org/docs/stable/notes/ddp.html>`__