eformer.executor.ray.docker_executor

Docker execution utilities for Ray-based distributed computing.

This module provides utilities for running Docker containers on distributed compute resources (TPUs, GPUs) using Ray. It supports single-pod and multi-slice execution patterns, image building, and asynchronous execution.

Key Features:
  • Docker container configuration and command generation

  • Single-pod Docker execution with accelerator support

  • Multi-slice Docker execution for distributed workloads

  • Docker image building and registry push

  • Asynchronous Docker execution via Ray tasks

Example

Basic Docker execution on a TPU pod:

>>> from eformer.executor.ray import DockerConfig, run_docker_on_pod
>>> from eformer.executor.ray import TpuAcceleratorConfig
>>>
>>> config = DockerConfig(
...     image="my-ml-image:latest",
...     command="python train.py",
...     volumes={"/data": "/data"},
...     environment={"MODEL_NAME": "bert-base"}
... )
>>>
>>> tpu_config = TpuAcceleratorConfig(type="v4-8")
>>> output = run_docker_on_pod(config, tpu_config)

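Multi-slice execution (the slice count is set on the accelerator config):

>>> from eformer.executor.ray.docker_executor import run_docker_multislice
>>>
>>> multislice_config = TpuAcceleratorConfig(type="v4-32", num_slices=4)
>>> outputs = run_docker_multislice(config, multislice_config)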
class eformer.executor.ray.docker_executor.DockerConfig(image: str, command: str | list[str], volumes: dict[str, str] = None, environment: dict[str, str] = None, network: str = 'host', privileged: bool = False, gpus: str | None = None, shm_size: str | None = None, remove: bool = True, workdir: str | None = None, user: str | None = None)

Bases: object

Configuration for Docker container execution.

Encapsulates all settings needed to run a Docker container in a distributed environment.

image

Docker image to use (e.g., “python:3.9”).

Type

str

command

Command to run in the container.

Type

str | list[str]

volumes

Volume mappings from host to container paths (e.g., {“/host/data”: “/container/data”}).

Type

dict[str, str] | None

environment

Environment variables to pass to the container.

Type

dict[str, str] | None

network

Network mode for the container. Defaults to “host”.

Type

str

privileged

Whether to run in privileged mode. Defaults to False.

Type

bool

gpus

GPU configuration (e.g., “all”, “0,1”, or device IDs).

Type

str | None

shm_size

Shared memory size (e.g., “2g”, “512m”).

Type

str | None

remove

Whether to remove the container after execution. Defaults to True.

Type

bool

workdir

Working directory inside the container.

Type

str | None

user

User to run the container as (e.g., “1000:1000”).

Type

str | None

Example

>>> config = DockerConfig(
...     image="tensorflow/tensorflow:latest-gpu",
...     command=["python", "train.py", "--epochs", "10"],
...     volumes={"/data": "/data", "/models": "/models"},
...     environment={"TF_CPP_MIN_LOG_LEVEL": "2"},
...     gpus="all",
...     shm_size="4g"
... )
command: str | list[str]
environment: dict[str, str] = None
gpus: str | None = None
image: str
network: str = 'host'
privileged: bool = False
remove: bool = True
shm_size: str | None = None
user: str | None = None
volumes: dict[str, str] = None
workdir: str | None = None
eformer.executor.ray.docker_executor.build_and_push_docker_image(dockerfile_path: str, image_name: str, registry: str | None = None, build_args: dict[str, str] | None = None) -> str

Build a Docker image and optionally push to a registry.

Builds a Docker image from a Dockerfile and optionally pushes it to a container registry for use in distributed execution.

Parameters
  • dockerfile_path (str) – Path to the Dockerfile.

  • image_name (str) – Name for the Docker image (e.g., “my-app:v1.0”).

  • registry (str | None) – Optional registry URL to push to (e.g., “gcr.io/my-project” or “docker.io/myuser”).

  • build_args (dict[str, str] | None) – Optional build arguments to pass to docker build.

Returns

Full image name, including the registry prefix if a registry was given (e.g., “gcr.io/my-project/my-app:v1.0”).

Return type

str

Raises

RuntimeError – If Docker build or push fails.

Example

>>> image = build_and_push_docker_image(
...     "./Dockerfile",
...     "training-image:latest",
...     registry="gcr.io/my-project",
...     build_args={"PYTHON_VERSION": "3.9"}
... )
>>> print(f"Built and pushed: {image}")
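How the returned name relates to image_name and registry can be sketched as follows. This is a minimal illustration, not eformer's implementation: the helper names and the context parameter are hypothetical, and the sketch only assembles docker build and docker push argument lists (using the standard -f, -t, and --build-arg flags) without running them.

```python
from typing import Dict, List, Optional


def sketch_full_image_name(image_name: str, registry: Optional[str] = None) -> str:
    """Prefix the image name with the registry, if one is given (hypothetical helper)."""
    if not registry:
        return image_name
    return f"{registry.rstrip('/')}/{image_name}"


def sketch_build_and_push_commands(
    dockerfile_path: str,
    image_name: str,
    registry: Optional[str] = None,
    build_args: Optional[Dict[str, str]] = None,
    context: str = ".",  # assumption: the build context is not described in the docs
) -> List[List[str]]:
    """Return the docker CLI invocations a build-and-push step might issue."""
    full_name = sketch_full_image_name(image_name, registry)
    build_cmd = ["docker", "build", "-f", dockerfile_path, "-t", full_name]
    for key, value in (build_args or {}).items():
        build_cmd += ["--build-arg", f"{key}={value}"]
    build_cmd.append(context)
    commands = [build_cmd]
    if registry:
        # Push only when a registry was requested.
        commands.append(["docker", "push", full_name])
    return commands


for command in sketch_build_and_push_commands(
    "./Dockerfile",
    "training-image:latest",
    registry="gcr.io/my-project",
    build_args={"PYTHON_VERSION": "3.9"},
):
    print(" ".join(command))
```

Returning argument lists rather than shell strings keeps the sketch safe to pass to subprocess without shell quoting concerns.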
eformer.executor.ray.docker_executor.make_docker_run_command(config: DockerConfig) -> list[str]

Construct a docker run command from configuration.

Converts a DockerConfig object into a list of command-line arguments suitable for subprocess execution.

Parameters

config (DockerConfig) – Docker configuration object containing all container settings.

Returns

List of command arguments for subprocess execution.

Return type

list[str]

Example

>>> config = DockerConfig(image="python:3.9", command="python app.py")
>>> cmd = make_docker_run_command(config)
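To make the mapping from configuration fields to command-line arguments concrete, here is a rough sketch. It is an assumption about how such a translation could look, not the library's actual output: SketchDockerConfig and sketch_docker_run_command are hypothetical stand-ins, the flag order is arbitrary, and splitting a string command with shlex is a choice made for this example. The flags themselves (--rm, --network, --privileged, --gpus, --shm-size, --workdir, --user, -v, -e) are standard docker run options.

```python
import shlex
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class SketchDockerConfig:
    """Hypothetical stand-in mirroring the documented DockerConfig fields."""

    image: str
    command: Union[str, List[str]]
    volumes: Optional[Dict[str, str]] = None
    environment: Optional[Dict[str, str]] = None
    network: str = "host"
    privileged: bool = False
    gpus: Optional[str] = None
    shm_size: Optional[str] = None
    remove: bool = True
    workdir: Optional[str] = None
    user: Optional[str] = None


def sketch_docker_run_command(config: SketchDockerConfig) -> List[str]:
    """Translate the config into a `docker run` argument list (illustrative only)."""
    cmd = ["docker", "run"]
    if config.remove:
        cmd.append("--rm")
    cmd += ["--network", config.network]
    if config.privileged:
        cmd.append("--privileged")
    if config.gpus:
        cmd += ["--gpus", config.gpus]
    if config.shm_size:
        cmd += ["--shm-size", config.shm_size]
    if config.workdir:
        cmd += ["--workdir", config.workdir]
    if config.user:
        cmd += ["--user", config.user]
    for host_path, container_path in (config.volumes or {}).items():
        cmd += ["-v", f"{host_path}:{container_path}"]
    for key, value in (config.environment or {}).items():
        cmd += ["-e", f"{key}={value}"]
    cmd.append(config.image)
    # Assumption: a string command is tokenized shell-style; a list is used as-is.
    if isinstance(config.command, str):
        cmd += shlex.split(config.command)
    else:
        cmd += list(config.command)
    return cmd


print(sketch_docker_run_command(SketchDockerConfig(image="python:3.9", command="python app.py")))
```

A list of arguments of this shape can be passed directly to subprocess.run without a shell.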
eformer.executor.ray.docker_executor.run_docker_multislice(docker_config: DockerConfig, accelerator_config: eformer.executor.ray.resource_manager.TpuAcceleratorConfig | eformer.executor.ray.resource_manager.GpuAcceleratorConfig | eformer.executor.ray.resource_manager.CpuAcceleratorConfig, capture_output: bool = True, **executor_kwargs) -> list[Any]

Run Docker containers across multiple slices.

Executes Docker containers in parallel across all hosts in the compute slices, typically used for distributed training or inference on TPU pods.

Parameters
  • docker_config (DockerConfig) – Base Docker container configuration. Each slice will receive a copy with slice-specific environment variables added.

  • accelerator_config (AcceleratorConfigType) – Accelerator configuration with multi-slice support (e.g., TPU v4-32 with 4 slices).

  • capture_output (bool) – Whether to capture and return container output. Defaults to True.

  • **executor_kwargs – Additional arguments passed to RayExecutor.autoscale_execute_resumable().

Returns

List of outputs from each host’s Docker container.

Length equals the number of hosts across all slices. Each element is stdout if capture_output is True, None otherwise.

Return type

list[Any]

Raises

RuntimeError – If any Docker container exits with a non-zero status.

Example

>>> from eformer.executor.ray import TpuAcceleratorConfig
>>> from eformer.executor.ray.docker_executor import run_docker_multislice
>>>
>>> tpu_config = TpuAcceleratorConfig(type="v4-32", num_slices=4)
>>> outputs = run_docker_multislice(
...     docker_config,
...     tpu_config,
...     max_retries=3
... )
>>> print(f"Got {len(outputs)} outputs from all hosts")
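The per-slice copying described for docker_config can be pictured with the sketch below. It is only an illustration under stated assumptions: SketchConfig is a hypothetical stand-in for DockerConfig, and the environment variable names SKETCH_SLICE_ID and SKETCH_NUM_SLICES are invented for this example; the variables eformer actually injects are not documented here.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass
class SketchConfig:
    """Hypothetical stand-in for DockerConfig; only the fields used here."""

    image: str
    command: str
    environment: Dict[str, str] = field(default_factory=dict)


def per_slice_configs(base: SketchConfig, num_slices: int) -> List[SketchConfig]:
    """Return one config copy per slice, each with slice-specific variables added."""
    configs = []
    for slice_id in range(num_slices):
        env = dict(base.environment)  # copy so the base config is not mutated
        # Made-up variable names, for illustration only.
        env["SKETCH_SLICE_ID"] = str(slice_id)
        env["SKETCH_NUM_SLICES"] = str(num_slices)
        configs.append(replace(base, environment=env))
    return configs


base = SketchConfig(image="my-ml-image:latest", command="python train.py",
                    environment={"MODEL_NAME": "bert-base"})
for cfg in per_slice_configs(base, num_slices=4):
    print(cfg.environment)
```

Copying the environment dictionary per slice keeps the base configuration reusable across calls.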
eformer.executor.ray.docker_executor.run_docker_on_pod(docker_config: DockerConfig, accelerator_config: eformer.executor.ray.resource_manager.TpuAcceleratorConfig | eformer.executor.ray.resource_manager.GpuAcceleratorConfig | eformer.executor.ray.resource_manager.CpuAcceleratorConfig, capture_output: bool = True, **executor_kwargs) -> Any

Run a Docker container on a compute pod (TPU/GPU).

Executes a Docker container on a specific accelerator-enabled pod using Ray for resource allocation and fault tolerance.

Parameters
  • docker_config (DockerConfig) – Docker container configuration.

  • accelerator_config (AcceleratorConfigType) – Accelerator configuration specifying TPU or GPU resources.

  • capture_output (bool) – Whether to capture and return container output. Defaults to True.

  • **executor_kwargs – Additional arguments passed to RayExecutor.execute_resumable(), such as max_retries, retry_exceptions, etc.

Returns

The container’s stdout if capture_output is True; None otherwise.

Return type

Any

Raises

RuntimeError – If the Docker container exits with a non-zero status.

Example

>>> from eformer.executor.ray import GpuAcceleratorConfig
>>> from eformer.executor.ray.docker_executor import run_docker_on_pod
>>>
>>> gpu_config = GpuAcceleratorConfig(count=2, type="v100")
>>> output = run_docker_on_pod(
...     docker_config,
...     gpu_config,
...     max_retries=3
... )
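The documented contract (stdout returned when capture_output is True, None otherwise, RuntimeError on a non-zero exit status) can be sketched with the standard library alone. This is an assumption about the general pattern, not eformer's code: sketch_run_command is a hypothetical helper, and the demo runs the current Python interpreter instead of Docker so it works on any machine.

```python
import subprocess
import sys
from typing import List, Optional


def sketch_run_command(cmd: List[str], capture_output: bool = True) -> Optional[str]:
    """Run a command; return its stdout if captured, raise on non-zero exit."""
    result = subprocess.run(cmd, capture_output=capture_output, text=True)
    if result.returncode != 0:
        # stderr is None when output is not captured, hence the fallback.
        raise RuntimeError(
            f"Command {cmd!r} exited with status {result.returncode}: "
            f"{result.stderr or ''}"
        )
    return result.stdout if capture_output else None


# Demo with the current Python interpreter standing in for `docker run ...`.
print(sketch_run_command([sys.executable, "-c", "print('hello from child')"]))
```

In a real setup the argument list would be the docker run command, and the call would execute inside a Ray task on the target host.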