eformer.executor.ray.docker_executor
Docker execution utilities for Ray-based distributed computing.
This module provides utilities for running Docker containers on distributed compute resources (TPUs, GPUs) using Ray. It supports single-pod and multi-slice execution patterns, image building, and asynchronous execution.
Key Features:
- Docker container configuration and command generation
- Single-pod Docker execution with accelerator support
- Multi-slice Docker execution for distributed workloads
- Docker image building and registry push
- Asynchronous Docker execution via Ray tasks
Example
Basic Docker execution on a TPU pod:
>>> from eformer.executor.ray import DockerConfig, run_docker_on_pod
>>> from eformer.executor.ray import TpuAcceleratorConfig
>>>
>>> config = DockerConfig(
... image="my-ml-image:latest",
... command="python train.py",
... volumes={"/data": "/data"},
... environment={"MODEL_NAME": "bert-base"}
... )
>>>
>>> tpu_config = TpuAcceleratorConfig(type="v4-8")
>>> output = run_docker_on_pod(config, tpu_config)
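Multi-slice execution (the slice count is set on the accelerator configuration):
>>> from eformer.executor.ray.docker_executor import run_docker_multislice
>>>
>>> multislice_config = TpuAcceleratorConfig(type="v4-32", num_slices=4)
>>> outputs = run_docker_multislice(
... config,
... multislice_config
... )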
- class eformer.executor.ray.docker_executor.DockerConfig(image: str, command: str | list[str], volumes: dict[str, str] = None, environment: dict[str, str] = None, network: str = 'host', privileged: bool = False, gpus: str | None = None, shm_size: str | None = None, remove: bool = True, workdir: str | None = None, user: str | None = None)
Bases: object
Configuration for Docker container execution.
Encapsulates all settings needed to run a Docker container in a distributed environment.
- image
Docker image to use (e.g., “python:3.9”).
- Type
str
- command
Command to run in the container.
- Type
str | list[str]
- volumes
Volume mappings from host to container paths (e.g., {“/host/data”: “/container/data”}).
- Type
dict[str, str] | None
- environment
Environment variables to pass to the container.
- Type
dict[str, str] | None
- network
Network mode for the container. Defaults to “host”.
- Type
str
- privileged
Whether to run in privileged mode. Defaults to False.
- Type
bool
- gpus
GPU configuration (e.g., “all”, “0,1”, or device IDs).
- Type
str | None
- shm_size
Shared memory size (e.g., “2g”, “512m”).
- Type
str | None
- remove
Whether to remove the container after execution. Defaults to True.
- Type
bool
- workdir
Working directory inside the container.
- Type
str | None
- user
User to run the container as (e.g., “1000:1000”).
- Type
str | None
Example
>>> config = DockerConfig(
... image="tensorflow/tensorflow:latest-gpu",
... command=["python", "train.py", "--epochs", "10"],
... volumes={"/data": "/data", "/models": "/models"},
... environment={"TF_CPP_MIN_LOG_LEVEL": "2"},
... gpus="all",
... shm_size="4g"
... )
- command: str | list[str]
- environment: dict[str, str] = None
- gpus: str | None = None
- image: str
- network: str = 'host'
- privileged: bool = False
- remove: bool = True
- shm_size: str | None = None
- user: str | None = None
- volumes: dict[str, str] = None
- workdir: str | None = None
- eformer.executor.ray.docker_executor.build_and_push_docker_image(dockerfile_path: str, image_name: str, registry: str | None = None, build_args: dict[str, str] | None = None) -> str
Build a Docker image and optionally push to a registry.
Builds a Docker image from a Dockerfile and optionally pushes it to a container registry for use in distributed execution.
- Parameters
dockerfile_path (str) – Path to the Dockerfile.
image_name (str) – Name for the Docker image (e.g., “my-app:v1.0”).
registry (str | None) – Optional registry URL to push to (e.g., “gcr.io/my-project” or “docker.io/myuser”).
build_args (dict[str, str] | None) – Optional build arguments to pass to docker build.
- Returns
Full image name, with the registry prefix if applicable (e.g., “gcr.io/my-project/my-app:v1.0”).
- Return type
str
- Raises
RuntimeError – If Docker build or push fails.
Example
>>> image = build_and_push_docker_image(
... "./Dockerfile",
... "training-image:latest",
... registry="gcr.io/my-project",
... build_args={"PYTHON_VERSION": "3.9"}
... )
>>> print(f"Built and pushed: {image}")
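The naming and build steps described for this function can be sketched as follows. This is a minimal illustration under stated assumptions, not eformer's implementation: `full_image_name` and `docker_build_command` are hypothetical helpers, the build context (the directory containing the Dockerfile) is an assumption, and only the standard `docker build` flags `-f`, `-t`, and `--build-arg` are used. The sketch assembles argument lists and does not invoke Docker.

```python
from __future__ import annotations

from pathlib import Path


def full_image_name(image_name: str, registry: str | None = None) -> str:
    """Prefix image_name with the registry, if one is given (hypothetical helper)."""
    if not registry:
        return image_name
    return f"{registry.rstrip('/')}/{image_name}"


def docker_build_command(
    dockerfile_path: str,
    tag: str,
    build_args: dict[str, str] | None = None,
) -> list[str]:
    """Assemble a `docker build` argument list (hypothetical helper)."""
    cmd = ["docker", "build", "-f", dockerfile_path, "-t", tag]
    for key, value in (build_args or {}).items():
        cmd += ["--build-arg", f"{key}={value}"]
    # Assumption: the build context is the directory containing the Dockerfile.
    cmd.append(str(Path(dockerfile_path).parent))
    return cmd


tag = full_image_name("training-image:latest", registry="gcr.io/my-project")
build_cmd = docker_build_command("./Dockerfile", tag, {"PYTHON_VERSION": "3.9"})
push_cmd = ["docker", "push", tag]  # only relevant when a registry is given
```

Building the tag once and reusing it for both the build and the push keeps the returned name and the pushed name identical.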
- eformer.executor.ray.docker_executor.make_docker_run_command(config: DockerConfig) -> list[str]
Construct a docker run command from configuration.
Converts a DockerConfig object into a list of command-line arguments suitable for subprocess execution.
- Parameters
config (DockerConfig) – Docker configuration object containing all container settings.
- Returns
List of command arguments for subprocess execution.
- Return type
list[str]
Example
>>> config = DockerConfig(image="python:3.9", command="python app.py")
>>> cmd = make_docker_run_command(config)
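To make the conversion concrete, here is one way a configuration with the documented fields could be turned into `docker run` arguments. It is a sketch only: `SketchDockerConfig` and `sketch_docker_run_command` are local stand-ins written for this illustration, and the flag order and the handling of string commands (shell-style splitting) are assumptions that may differ from what `make_docker_run_command` actually emits. The flags themselves (`--rm`, `--network`, `--privileged`, `--gpus`, `--shm-size`, `-w`, `-u`, `-v`, `-e`) are standard Docker CLI options.

```python
from __future__ import annotations

import shlex
from dataclasses import dataclass


@dataclass
class SketchDockerConfig:
    """Local stand-in mirroring the documented DockerConfig fields."""

    image: str
    command: str | list[str]
    volumes: dict[str, str] | None = None
    environment: dict[str, str] | None = None
    network: str = "host"
    privileged: bool = False
    gpus: str | None = None
    shm_size: str | None = None
    remove: bool = True
    workdir: str | None = None
    user: str | None = None


def sketch_docker_run_command(config: SketchDockerConfig) -> list[str]:
    """Build a `docker run` argument list (illustrative, not the library's code)."""
    cmd = ["docker", "run"]
    if config.remove:
        cmd.append("--rm")
    cmd += ["--network", config.network]
    if config.privileged:
        cmd.append("--privileged")
    if config.gpus:
        cmd += ["--gpus", config.gpus]
    if config.shm_size:
        cmd += ["--shm-size", config.shm_size]
    if config.workdir:
        cmd += ["-w", config.workdir]
    if config.user:
        cmd += ["-u", config.user]
    for host_path, container_path in (config.volumes or {}).items():
        cmd += ["-v", f"{host_path}:{container_path}"]
    for key, value in (config.environment or {}).items():
        cmd += ["-e", f"{key}={value}"]
    cmd.append(config.image)
    # Assumption: a string command is split shell-style; a list is used as given.
    if isinstance(config.command, str):
        cmd += shlex.split(config.command)
    else:
        cmd += list(config.command)
    return cmd


cmd = sketch_docker_run_command(
    SketchDockerConfig(image="python:3.9", command="python app.py")
)
```

Returning a list rather than a single string lets the result go to `subprocess.run` without a shell, which avoids quoting problems in paths and environment values.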
- eformer.executor.ray.docker_executor.run_docker_multislice(docker_config: DockerConfig, accelerator_config: eformer.executor.ray.resource_manager.TpuAcceleratorConfig | eformer.executor.ray.resource_manager.GpuAcceleratorConfig | eformer.executor.ray.resource_manager.CpuAcceleratorConfig, capture_output: bool = True, **executor_kwargs) -> list[Any]
Run Docker containers across multiple slices.
Executes Docker containers in parallel across all hosts in the compute slices, typically used for distributed training or inference on TPU pods.
- Parameters
docker_config (DockerConfig) – Base Docker container configuration. Each slice will receive a copy with slice-specific environment variables added.
accelerator_config (AcceleratorConfigType) – Accelerator configuration with multi-slice support (e.g., TPU v4-32 with 4 slices).
capture_output (bool) – Whether to capture and return container output. Defaults to True.
**executor_kwargs – Additional arguments passed to RayExecutor.autoscale_execute_resumable().
- Returns
List of outputs from each host’s Docker container. Length equals the number of hosts across all slices. Each element is stdout if capture_output is True, None otherwise.
- Return type
list[Any]
- Raises
RuntimeError – If any Docker container exits with a non-zero status.
Example
>>> from eformer.executor.ray import TpuAcceleratorConfig
>>>
>>> tpu_config = TpuAcceleratorConfig(type="v4-32", num_slices=4)
>>> outputs = run_docker_multislice(
... docker_config,
... tpu_config,
... max_retries=3
... )
>>> print(f"Got {len(outputs)} outputs from all hosts")
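The note that each slice receives a copy of the base configuration with slice-specific environment variables can be pictured with the sketch below. It is an illustration only: `SketchDockerConfig` is a reduced local stand-in, `per_slice_configs` is a hypothetical helper, and the variable names `SLICE_ID` and `NUM_SLICES` are invented for this example. The names eformer actually injects are not stated on this page.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace


@dataclass
class SketchDockerConfig:
    """Reduced stand-in for DockerConfig with only the fields used here."""

    image: str
    command: str
    environment: dict[str, str] = field(default_factory=dict)


def per_slice_configs(
    base: SketchDockerConfig, num_slices: int
) -> list[SketchDockerConfig]:
    """Return one config copy per slice, each with its own environment dict."""
    configs = []
    for slice_id in range(num_slices):
        env = dict(base.environment)  # copy so the base config is not mutated
        # Hypothetical variable names, chosen only for illustration.
        env["SLICE_ID"] = str(slice_id)
        env["NUM_SLICES"] = str(num_slices)
        configs.append(replace(base, environment=env))
    return configs


base = SketchDockerConfig(
    image="my-ml-image:latest",
    command="python train.py",
    environment={"MODEL_NAME": "bert-base"},
)
slice_configs = per_slice_configs(base, num_slices=4)
```

Copying the environment dictionary per slice matters because a shared dictionary would leave every slice with the values written for the last one.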
- eformer.executor.ray.docker_executor.run_docker_on_pod(docker_config: DockerConfig, accelerator_config: eformer.executor.ray.resource_manager.TpuAcceleratorConfig | eformer.executor.ray.resource_manager.GpuAcceleratorConfig | eformer.executor.ray.resource_manager.CpuAcceleratorConfig, capture_output: bool = True, **executor_kwargs) -> Any
Run a Docker container on a compute pod (TPU/GPU).
Executes a Docker container on a specific accelerator-enabled pod using Ray for resource allocation and fault tolerance.
- Parameters
docker_config (DockerConfig) – Docker container configuration.
accelerator_config (AcceleratorConfigType) – Accelerator configuration specifying TPU or GPU resources.
capture_output (bool) – Whether to capture and return container output. Defaults to True.
**executor_kwargs – Additional arguments passed to RayExecutor.execute_resumable(), such as max_retries, retry_exceptions, etc.
- Returns
The stdout output from the container if capture_output is True, None otherwise.
- Return type
Any
- Raises
RuntimeError – If the Docker container exits with a non-zero status.
Example
>>> from eformer.executor.ray import GpuAcceleratorConfig
>>>
>>> gpu_config = GpuAcceleratorConfig(count=2, type="v100")
>>> output = run_docker_on_pod(
... docker_config,
... gpu_config,
... max_retries=3
... )
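The documented return and error behaviour (stdout when capture_output is True, None otherwise, and a RuntimeError on a non-zero exit status) can be mimicked with plain `subprocess`. The sketch below is an assumption about the general pattern, not the function's source: `run_command` is a hypothetical helper, Ray scheduling and retries are left out, and a Python child process stands in for `docker run` so the example runs without Docker installed.

```python
from __future__ import annotations

import subprocess
import sys


def run_command(cmd: list[str], capture_output: bool = True) -> str | None:
    """Run cmd; return stdout if captured, raise RuntimeError on failure."""
    result = subprocess.run(cmd, capture_output=capture_output, text=True)
    if result.returncode != 0:
        # stderr is None when output is not captured.
        detail = result.stderr or ""
        raise RuntimeError(
            f"Command exited with status {result.returncode}: {detail}"
        )
    return result.stdout if capture_output else None


# A Python child process stands in for `docker run ...` in this sketch.
output = run_command([sys.executable, "-c", "print('hello from container')"])
```

Raising on a non-zero status, rather than returning the code, is what lets a retry mechanism such as the documented `max_retries` argument treat a failed container like any other exception.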