
Distributed

For distributed training, TorchX relies on the scheduler’s gang scheduling capabilities to schedule n copies of nodes. Once launched, the application is expected to be written in a way that leverages this topology, for instance, with PyTorch’s DDP. You can express a variety of node topologies with TorchX by specifying multiple torchx.specs.Role in your component’s AppDef. Each role maps to a homogeneous group of nodes that performs a “role” (function) in the overall training. Scheduling-wise, TorchX launches each role as a sub-gang.

A DDP-style training job has a single role: trainers. A training job that uses parameter servers, by contrast, has two roles: parameter server and trainer. You can specify a different entrypoint (executable), number of replicas, resource requirements, and more for each role, as in the sketch below.
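For illustration, here is a minimal sketch of such a two-role component written against the torchx.specs API. The image name, the entrypoint modules, and the resource shapes are placeholders chosen for this example, not something defined by TorchX:

from torchx import specs

def ps_trainer(image: str = "my/train:latest") -> specs.AppDef:
    # parameter server gang: 1 replica, CPU only (illustrative numbers)
    ps = specs.Role(
        name="ps",
        image=image,
        entrypoint="python",
        args=["-m", "my_app.ps"],          # hypothetical module
        num_replicas=1,
        resource=specs.Resource(cpu=4, gpu=0, memMB=8192),
    )
    # trainer gang: 4 replicas, 1 GPU each (illustrative numbers)
    trainer = specs.Role(
        name="trainer",
        image=image,
        entrypoint="python",
        args=["-m", "my_app.train"],       # hypothetical module
        num_replicas=4,
        resource=specs.Resource(cpu=8, gpu=1, memMB=16384),
    )
    # each Role is scheduled as its own sub-gang
    return specs.AppDef(name="ps-trainer", roles=[ps, trainer])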

DDP Builtin

DDP-style trainers are common and easy to templatize since they are homogeneous single-role AppDefs, so there is a builtin: dist.ddp. Assuming your DDP training script is called main.py, launch it as:

# locally, 1 node x 4 workers
$ torchx run -s local_cwd dist.ddp --entrypoint main.py --nproc_per_node 4

# locally, 2 nodes x 4 workers (8 total)
$ torchx run -s local_cwd dist.ddp --entrypoint main.py \
    --rdzv_backend c10d \
    --nnodes 2 \
    --nproc_per_node 4

# remote (requires you to set up an etcd server first!)
$ torchx run -s kubernetes -cfg queue=default dist.ddp \
    --entrypoint main.py \
    --rdzv_backend etcd \
    --rdzv_endpoint etcd-server.default.svc.cluster.local:2379 \
    --nnodes 2 \
    --nproc_per_node 4

There is a lot happening under the hood, so we strongly encourage you to continue reading the rest of this section to understand how everything works. Also note that while dist.ddp is convenient, you’ll find that authoring your own distributed component is not only easy (the simplest way is to copy dist.ddp!) but also leads to better flexibility and maintainability down the road, since builtin APIs are subject to more change than the more stable specs API. The choice is yours; feel free to rely on the builtins if they meet your needs.
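As a starting point for authoring your own component, the rough sketch below mirrors the general idea behind dist.ddp: a single trainer role whose entrypoint wraps torch.distributed.run. The image, defaults, and rendezvous id here are placeholders and the flag set is deliberately minimal; the actual builtin handles more cases:

from torchx import specs

def my_ddp(
    script: str = "main.py",
    image: str = "my/train:latest",      # hypothetical image
    nnodes: int = 2,
    nproc_per_node: int = 4,
    rdzv_backend: str = "c10d",
    rdzv_endpoint: str = "localhost:29500",
) -> specs.AppDef:
    return specs.AppDef(
        name="my-ddp",
        roles=[
            specs.Role(
                name="trainer",
                image=image,
                entrypoint="python",
                args=[
                    "-m", "torch.distributed.run",
                    "--rdzv_backend", rdzv_backend,
                    "--rdzv_endpoint", rdzv_endpoint,
                    "--rdzv_id", "my-ddp",
                    "--nnodes", str(nnodes),
                    "--nproc_per_node", str(nproc_per_node),
                    script,
                ],
                # one node-level replica per node; each runs an elastic agent
                num_replicas=nnodes,
            ),
        ],
    )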

Distributed Training

Local Testing

Note

Please follow the Prerequisites for running examples first.

Running distributed training locally is a quick way to validate your training script. TorchX’s local scheduler creates a process per replica (--nnodes). The example below uses torchelastic as the main entrypoint of each node, which in turn spawns --nproc_per_node trainers. In total you’ll see nnodes*nproc_per_node trainer processes and nnodes elastic agent processes created on your local host.

$ torchx run -s local_cwd ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
     --nnodes 2 \
     --nproc_per_node 2 \
     --rdzv_backend c10d \
     --rdzv_endpoint localhost:29500

Warning

There is a known issue with local_docker (the default scheduler when no -s argument is supplied), hence we use -s local_cwd instead. Please track the progress of the fix in issue-286 and issue-287.

Remote Launching

Note

Please follow the Prerequisites first.

The following example demonstrates launching the same job remotely on Kubernetes.

$ torchx run -s kubernetes -cfg queue=default \
    ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
    --nnodes 2 \
    --nproc_per_node 2 \
    --rdzv_backend etcd \
    --rdzv_endpoint etcd-server.default.svc.cluster.local:2379
torchx 2021-10-18 18:46:55 INFO     Launched app: kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng
torchx 2021-10-18 18:46:55 INFO     AppStatus:
  msg: <NONE>
  num_restarts: -1
  roles: []
  state: PENDING (2)
  structured_error_msg: <NONE>
  ui_url: null

torchx 2021-10-18 18:46:55 INFO     Job URL: None

Note that the only differences compared to the local launch are the scheduler (-s) and --rdzv_backend. etcd would also work in the local case, but we used c10d since it does not require additional setup. Note that this is a torchelastic requirement, not a TorchX one. Read more about rendezvous here.

Note

For GPU training, keep nproc_per_node equal to the number of GPUs on the host and change the resource requirements in the torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist method by setting resource_def to match the number of GPUs on your host.
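For example, a change of the kind the note describes might look like the following; the field names come from torchx.specs.Resource and the numbers are illustrative for a 4-GPU host:

from torchx import specs

# illustrative values for a host with 16 CPUs, 4 GPUs, 64 GB RAM
resource_def = specs.Resource(cpu=16, gpu=4, memMB=64 * 1024)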

Components APIs

torchx.components.dist.ddp(*script_args: str, entrypoint: str, image: str = 'ghcr.io/pytorch/torchx:0.1.0', rdzv_backend: Optional[str] = None, rdzv_endpoint: Optional[str] = None, resource: Optional[str] = None, nnodes: int = 1, nproc_per_node: int = 1, name: str = 'test-name', role: str = 'worker', env: Optional[Dict[str, str]] = None) → torchx.specs.api.AppDef

Distributed data parallel style application (one role, multi-replica). Uses torch.distributed.run to launch and coordinate PyTorch worker processes.

Parameters
  • script_args – Script arguments.

  • image – container image.

  • entrypoint – script or binary to run within the image.

  • rdzv_backend – rendezvous backend to use; allowed values can be found in the rdzv registry docs. The default backend is c10d.

  • rdzv_endpoint – Controller endpoint. If rdzv_backend is etcd, this is an etcd endpoint; in the case of c10d, this is the endpoint of one of the hosts. The default endpoint is localhost:29500.

  • resource – Optional named resource identifier. The resource parameter gets ignored when running on the local scheduler.

  • nnodes – Number of nodes.

  • nproc_per_node – Number of processes per node.

  • name – Name of the application.

  • role – Name of the ddp role.

  • env – Env variables.

Returns

Torchx AppDef

Return type

specs.AppDef
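For reference, the builtin can also be called directly from Python to produce the AppDef that torchx run submits. This sketch uses only the parameters documented above:

from torchx.components import dist

app = dist.ddp(
    entrypoint="main.py",
    nnodes=2,
    nproc_per_node=4,
    rdzv_backend="c10d",
    rdzv_endpoint="localhost:29500",
)
# expect 2 -- one node-level replica per node, each running an elastic agent
print(app.roles[0].num_replicas)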
