eformer.executor.ray.resource_manager#
Resource management system for Ray-based distributed computing.
This module provides comprehensive resource management capabilities for Ray execution, including hardware accelerator configurations (CPU, GPU, TPU), resource allocation, and runtime environment management. It serves as the foundation for distributed training and inference workloads in the eFormer framework.
- Key Components:
RayResources: Core resource specification and management
HardwareType: Constants for various accelerator types
Accelerator Configs: CPU, GPU, and TPU configuration classes
Protocols: Interfaces for compute resource configurations
Example
- Basic GPU configuration:
>>> config = GpuAcceleratorConfig( ... device_count=2, ... gpu_model=HardwareType.NVIDIA_A100, ... cpu_count=4 ... ) >>> remote_fn = config.create_remote_decorator()(my_function) >>> result = ray.get(remote_fn.remote())
- TPU configuration:
>>> config = TpuAcceleratorConfig( ... tpu_version=HardwareType.GOOGLE_TPU_V4, ... pod_count=1 ... ) >>> resources = config.to_ray_resources()
- class eformer.executor.ray.resource_manager.ComputeResourceConfig(*args, **kwargs)[source]#
Bases:
ProtocolProtocol defining the interface for hardware resource configurations.
This protocol establishes a standardized contract that all resource configuration classes must implement. It ensures consistency across different accelerator types (CPU, GPU, TPU) and provides the necessary methods for Ray task and actor deployment. The implementations are primarily used for distributed training and inference workloads.
The protocol defines both required attributes and methods that enable: - Resource specification conversion to Ray formats - Runtime environment management - Hardware-specific configuration handling - Remote function decoration with appropriate resources
- execution_env#
Ray runtime environment configuration.
- Type
RuntimeEnv
- head_name#
Optional identifier for the head node.
- Type
str | None
- head_workers#
Number of workers on the head node.
- Type
int
Example
>>> def deploy_model(config: ComputeResourceConfig): ... remote_options = config.get_remote_options() ... @ray.remote(**remote_options) ... class ModelServer: ... def predict(self, data): pass ... return ModelServer
- create_remote_decorator() Callable[[Any], Any][source]#
Create a ray.remote decorator with this resource configuration.
This convenience method creates a pre-configured Ray remote decorator that includes all the resource specifications from this configuration. The decorator can then be applied to functions or classes to make them Ray remote operations.
- Returns
- A ray.remote decorator that can be applied to
functions or classes. The decorator includes all resource requirements from this configuration.
- Return type
Callable[[Any], Any]
Example
>>> config = GpuAcceleratorConfig(device_count=1) >>> remote_decorator = config.create_remote_decorator() >>> @remote_decorator ... def train_model(): ... return "Training complete" >>> result = ray.get(train_model.remote())
- execution_env: RuntimeEnv#
- get_remote_options() dict[str, Any][source]#
Get keyword arguments for ray.remote() based on this resource configuration.
This method converts the resource configuration into a dictionary format that can be directly passed to Ray’s remote decorator. It handles the translation between high-level resource specifications and Ray’s internal resource format.
- Returns
- Dictionary of arguments suitable for passing to ray.remote().
Common keys include num_cpus, num_gpus, resources, runtime_env, and accelerator_type.
- Return type
dict[str, Any]
Example
>>> config = GpuAcceleratorConfig(device_count=2, cpu_count=4) >>> options = config.get_remote_options() >>> @ray.remote(**options) ... def gpu_task(): ... return "Task complete"
- hardware_identifier() str | None[source]#
Get the identifier for the hardware accelerator being used.
This method returns a string identifier that specifies the exact hardware accelerator type or model required for the computation. The identifier typically corresponds to values from the HardwareType class.
- Returns
- String identifier for the hardware accelerator (e.g., “A100”,
”TPU-V4”) or None if no specific accelerator is required.
- Return type
str | None
Example
>>> config = GpuAcceleratorConfig(gpu_model="A100") >>> print(config.hardware_identifier())
- head_name: str | None = None#
- head_workers: int = 1#
- to_ray_resources() RayResources[source]#
Convert this configuration to a RayResources object.
This method transforms the configuration into a RayResources instance, which provides a standardized representation of resource requirements that can be used across different parts of the system.
- Returns
- RayResources instance representing the hardware resources.
Contains all necessary information for Ray task/actor deployment.
- Return type
Example
>>> config = CpuAcceleratorConfig(core_count=4) >>> resources = config.to_ray_resources() >>> print(resources.num_cpus)
- with_environment_variables(env_vars: dict[str, str] | None = None, /, **kwargs) ComputeResourceConfig[source]#
Create a new resource configuration with additional environment variables.
This method allows for adding or overriding environment variables without modifying other aspects of the resource configuration. It creates a new configuration instance with updated environment variables, preserving immutability of the original configuration.
- Parameters
env_vars (dict[str, str] | None) – Dictionary of environment variables to add or override. If None, only kwargs are used.
**kwargs – Additional environment variables as keyword arguments. These are merged with env_vars if provided.
- Returns
- A new ComputeResourceConfig instance with the
combined environment variables from existing config, env_vars, and kwargs.
- Return type
Example
>>> config = CpuAcceleratorConfig() >>> new_config = config.with_environment_variables( ... {'CUDA_VISIBLE_DEVICES': '0,1'}, ... OMP_NUM_THREADS='4' ... )
- class eformer.executor.ray.resource_manager.CpuAcceleratorConfig(core_count: int = <factory>, execution_env: ~ray.runtime_env.runtime_env.RuntimeEnv = <factory>, resource_name: str = 'CPU', runtime_name: str = <factory>, worker_count: int = 1)[source]#
Bases:
ComputeResourceConfigResource configuration for CPU-only workloads.
This configuration is designed for computational tasks that rely solely on CPU processing power. It’s suitable for local development, batch processing, data preprocessing, or any tasks that don’t require specialized hardware acceleration like GPUs or TPUs.
The configuration automatically detects available CPU cores and provides sensible defaults for CPU-bound workloads. It can be used for distributed CPU computing across multiple nodes in a Ray cluster.
- core_count#
Number of CPU cores to allocate. Defaults to all available cores.
- Type
int
- execution_env#
Ray runtime environment for dependencies and setup.
- Type
RuntimeEnv
- resource_name#
Name identifier for the resource type (default: “CPU”).
- Type
str
- runtime_name#
Unique runtime identifier for this configuration.
- Type
str
- worker_count#
Number of worker processes to spawn.
- Type
int
Example
>>> config = CpuAcceleratorConfig(core_count=4, worker_count=2) >>> @config.create_remote_decorator() ... def cpu_intensive_task(data): ... return process_data_on_cpu(data) >>> result = ray.get(cpu_intensive_task.remote(my_data))
- core_count: int#
- execution_env: RuntimeEnv#
- get_remote_options() dict[str, Any][source]#
Get Ray remote options for CPU-only execution.
This method returns the Ray remote options specifically configured for CPU-only tasks, including the CPU core count and runtime environment. GPU allocation is explicitly set to 0.
- Returns
- Dictionary of options for ray.remote() containing:
num_cpus: Number of CPU cores to allocate
runtime_env: Runtime environment configuration
- Return type
dict[str, Any]
Example
>>> config = CpuAcceleratorConfig(core_count=2) >>> options = config.get_remote_options() >>> print(options) {'num_cpus': 2, 'runtime_env': {}}
- hardware_identifier() str | None[source]#
Get the hardware identifier (none for CPU-only configuration).
For CPU-only configurations, there is no specialized hardware accelerator, so this method always returns None to indicate that any available CPU resources can be used.
- Returns
Always None since no specialized hardware accelerator is used.
- Return type
None
Example
>>> config = CpuAcceleratorConfig() >>> print(config.hardware_identifier())
- redecorate_remote_fn_for_call(remote_fn: Union[RemoteFunction, Callable], **extra_envs)[source]#
Prepare a remote function for CPU execution with merged runtime environment.
Wraps a remote function with CPU-specific resource requirements and runtime environment configuration. The function is forkified to run in a separate process for isolation.
- Parameters
remote_fn – The remote function or callable to configure.
**extra_envs – Additional environment variables to merge into the runtime environment.
- Returns
- Configured remote function with CPU resources and
merged runtime environment.
- Return type
RemoteFunction
- resource_name: str = 'CPU'#
- runtime_name: str#
- to_ray_resources() RayResources[source]#
Convert to Ray resource specifications for CPU-only allocation.
This method creates a RayResources object that represents CPU-only resource allocation with no GPU or specialized accelerator requirements.
- Returns
- RayResources object representing CPU-only allocation
with the specified core count, zero GPUs, and runtime environment.
- Return type
Example
>>> config = CpuAcceleratorConfig(core_count=8) >>> resources = config.to_ray_resources() >>> print(f"CPUs: {resources.num_cpus}, GPUs: {resources.num_gpus}") CPUs: 8, GPUs: 0
- worker_count: int = 1#
- class eformer.executor.ray.resource_manager.GpuAcceleratorConfig(device_count: int = 1, execution_env: ~ray.runtime_env.runtime_env.RuntimeEnv = <factory>, gpu_model: str | None = None, cpu_count: int = 1, chips_per_host: int = <factory>, runtime_name: str = <factory>, worker_count: int = 1, resource_name: str = 'GPU')[source]#
Bases:
ComputeResourceConfigResource configuration for GPU-accelerated workloads.
This configuration specifies GPU requirements for computationally intensive tasks such as neural network training, inference, and other parallel computing workloads that benefit from GPU acceleration. It supports both generic GPU allocation and specific GPU model requirements.
The configuration automatically detects available GPU resources on the current node and provides flexible options for multi-GPU setups. It’s designed to work with NVIDIA GPUs through Ray’s GPU resource management system.
- device_count#
Number of GPU devices to allocate per task/actor.
- Type
int
- execution_env#
Ray runtime environment for CUDA/GPU dependencies.
- Type
RuntimeEnv
- gpu_model#
Specific GPU model identifier (e.g., “A100”, “V100”).
- Type
str | None
- cpu_count#
Number of CPU cores to allocate alongside GPUs.
- Type
int
- chips_per_host#
Number of GPU devices available per host node.
- Type
int
- runtime_name#
Unique runtime identifier for this configuration.
- Type
str
- worker_count#
Number of worker processes to spawn.
- Type
int
- resource_name#
Name identifier for the resource type.
- Type
str
Example
>>> config = GpuAcceleratorConfig( ... device_count=2, ... gpu_model=HardwareType.NVIDIA_A100, ... cpu_count=8 ... ) >>> @config.create_remote_decorator() ... def train_model(data): ... return gpu_training_function(data) >>> result = ray.get(train_model.remote(training_data))
- chips_per_host: int#
- cpu_count: int = 1#
- device_count: int = 1#
- execution_env: RuntimeEnv#
- get_remote_options() dict[str, Any][source]#
Get Ray remote options for GPU-accelerated execution.
This method constructs the Ray remote options dictionary for GPU-accelerated tasks, including CPU and GPU allocation, runtime environment, and optionally specific accelerator type requirements.
- Returns
- Dictionary of options for ray.remote() containing:
num_cpus: Number of CPU cores to allocate
num_gpus: Number of GPU devices to allocate
runtime_env: Runtime environment configuration
accelerator_type: Specific GPU model (if specified)
- Return type
dict[str, Any]
Example
>>> config = GpuAcceleratorConfig(device_count=1, cpu_count=4, gpu_model="A100") >>> options = config.get_remote_options() >>> print(options) {'num_cpus': 4, 'num_gpus': 1, 'runtime_env': {}, 'accelerator_type': 'A100'}
- gpu_model: str | None = None#
- hardware_identifier() str | None[source]#
Get the hardware identifier for the GPU model.
This method returns the specific GPU model identifier if one was specified during configuration. This is used by Ray’s accelerator management system to ensure tasks are scheduled on nodes with the required GPU hardware.
- Returns
- String identifier for the GPU model (e.g., “A100”, “V100”)
or None if any available GPU is acceptable.
- Return type
str | None
Example
>>> config = GpuAcceleratorConfig(gpu_model="A100") >>> print(config.hardware_identifier()) >>> generic_config = GpuAcceleratorConfig() >>> print(generic_config.hardware_identifier())
- redecorate_remote_fn_for_call(remote_fn: Union[RemoteFunction, Callable], **extra_envs)[source]#
Prepare a remote function for GPU execution with merged runtime environment.
Wraps a remote function with GPU-specific resource requirements and runtime environment configuration. The function is forkified to run in a separate process for isolation.
- Parameters
remote_fn – The remote function or callable to configure.
**extra_envs – Additional environment variables to merge into the runtime environment.
- Returns
- Configured remote function with GPU resources,
optional accelerator type, and merged runtime environment.
- Return type
RemoteFunction
- resource_name: str = 'GPU'#
- runtime_name: str#
- to_ray_resources() RayResources[source]#
Convert to Ray resource specifications for GPU allocation.
This method creates a RayResources object that represents GPU resource allocation with the specified number of GPUs, CPUs, accelerator type, and runtime environment configuration.
- Returns
- RayResources object representing GPU resource allocation
with specified device count, CPU count, and optional accelerator type.
- Return type
Example
>>> config = GpuAcceleratorConfig(device_count=2, cpu_count=8) >>> resources = config.to_ray_resources() >>> print(f"CPUs: {resources.num_cpus}, GPUs: {resources.num_gpus}") CPUs: 8, GPUs: 2
- worker_count: int = 1#
- class eformer.executor.ray.resource_manager.HardwareType[source]#
Bases:
objectConstants representing known accelerator and hardware types.
This class provides standardized identifiers for various hardware accelerators and compute devices that can be requested in Ray resource configurations. The constants ensure consistent naming across the application and provide a centralized reference for supported hardware types.
The identifiers correspond to actual hardware accelerator names and models that may be available in cloud platforms, data centers, or local systems. They are used in resource configurations to specify hardware requirements for compute-intensive tasks.
- Categories:
NVIDIA GPUs: Tesla series, A-series, H-series (V100, A100, H100, etc.)
Intel: GPU Max series and Gaudi accelerators
AMD: Instinct series and Radeon GPUs
Google TPUs: Various TPU versions (V2, V3, V4, V5, V6)
AWS: Neuron cores for machine learning
Huawei: NPU accelerators (Ascend series)
Example
>>> config = GpuAcceleratorConfig( ... gpu_model=HardwareType.NVIDIA_A100, ... device_count=2 ... ) >>> tpu_config = TpuAcceleratorConfig( ... tpu_version=HardwareType.GOOGLE_TPU_V4 ... )
- AMD_INSTINCT_MI100 = 'AMD-Instinct-MI100'#
- AMD_INSTINCT_MI210 = 'AMD-Instinct-MI210'#
- AMD_INSTINCT_MI250 = 'AMD-Instinct-MI250X-MI250'#
- AMD_INSTINCT_MI250x = 'AMD-Instinct-MI250X'#
- AMD_INSTINCT_MI300x = 'AMD-Instinct-MI300X-OAM'#
- AMD_RADEON_HD_7900 = 'AMD-Radeon-HD-7900'#
- AMD_RADEON_R9_200_HD_7900 = 'AMD-Radeon-R9-200-HD-7900'#
- AWS_NEURON_CORE = 'aws-neuron-core'#
- GOOGLE_TPU_V2 = 'TPU-V2'#
- GOOGLE_TPU_V3 = 'TPU-V3'#
- GOOGLE_TPU_V4 = 'TPU-V4'#
- GOOGLE_TPU_V5LITEPOD = 'TPU-V5LITEPOD'#
- GOOGLE_TPU_V5P = 'TPU-V5P'#
- GOOGLE_TPU_V6E = 'TPU-V6E'#
- HUAWEI_NPU_910B = 'Ascend910B'#
- HUAWEI_NPU_910B4 = 'Ascend910B4'#
- INTEL_GAUDI = 'Intel-GAUDI'#
- INTEL_MAX_1100 = 'Intel-GPU-Max-1100'#
- INTEL_MAX_1550 = 'Intel-GPU-Max-1550'#
- NVIDIA_A100 = 'A100'#
- NVIDIA_A100_40G = 'A100-40G'#
- NVIDIA_A100_80G = 'A100-80G'#
- NVIDIA_H100 = 'H100'#
- NVIDIA_H20 = 'H20'#
- NVIDIA_H200 = 'H200'#
- NVIDIA_L4 = 'L4'#
- NVIDIA_L40S = 'L40S'#
- NVIDIA_TESLA_A10G = 'A10G'#
- NVIDIA_TESLA_K80 = 'K80'#
- NVIDIA_TESLA_P100 = 'P100'#
- NVIDIA_TESLA_P4 = 'P4'#
- NVIDIA_TESLA_T4 = 'T4'#
- NVIDIA_TESLA_V100 = 'V100'#
- class eformer.executor.ray.resource_manager.RayResources(num_cpus: int = 1, num_gpus: int = 0, resources: dict[str, float] = <factory>, runtime_env: ~ray.runtime_env.runtime_env.RuntimeEnv = <factory>, accelerator_type: str | None = None)[source]#
Bases:
objectA representation of resource requirements for Ray tasks and actors.
This dataclass encapsulates all resource specifications needed when creating Ray tasks or actors, allowing for easy conversion between different resource representation formats used by Ray. It provides methods for converting between Ray’s internal resource representation and user-friendly specifications.
- num_cpus#
Number of CPU cores to allocate for the task/actor.
- Type
int
- num_gpus#
Number of GPU devices to allocate for the task/actor.
- Type
int
- resources#
Custom resource requirements as name-value pairs.
- Type
dict[str, float]
- runtime_env#
Ray runtime environment configuration for dependencies.
- Type
ray.runtime_env.runtime_env.RuntimeEnv
- accelerator_type#
Specific accelerator type identifier (e.g., “A100”).
- Type
str | None
Example
>>> resources = RayResources( ... num_cpus=4, ... num_gpus=2, ... accelerator_type="A100" ... ) >>> kwargs = resources.to_kwargs() >>> @ray.remote(**kwargs) ... def my_task(): ... return "Hello from Ray!"
- accelerator_type: str | None = None#
- static cancel_all_futures(futures)[source]#
Cancel all Ray futures in the provided collection.
This utility method attempts to cancel all Ray futures/ObjectRefs in the given iterable, providing error handling for individual cancellation failures. Useful for cleanup operations when a batch of tasks needs to be terminated.
- Parameters
futures (Iterable[ray.ObjectRef]) – Collection of Ray futures to cancel.
Note
Individual cancellation failures are logged but do not stop the cancellation of remaining futures.
Example
>>> futures = [my_remote_fn.remote(i) for i in range(10)] >>> >>> RayResources.cancel_all_futures(futures)
- static forkify_remote_fn(remote_fn: ray.remote_function.RemoteFunction | collections.abc.Callable)[source]#
Wrap a remote function to execute in a separate process.
This method transforms a Ray remote function or callable to execute in an isolated subprocess, providing additional process isolation and error handling capabilities. Useful for functions that may cause memory leaks or require process-level isolation.
- Parameters
remote_fn (RemoteFunction | Callable) – The remote function or callable to be wrapped with process isolation.
- Returns
- The wrapped function that will
execute in a separate process.
- Return type
RemoteFunction | functools.partial
Example
>>> @ray.remote ... def my_function(x): ... return x * 2 >>> forked_fn = RayResources.forkify_remote_fn(my_function) >>> result = ray.get(forked_fn.remote(5))
- static from_resource_dict(resource_spec: dict[str, float]) RayResources[source]#
Create a RayResources instance from a resource dictionary.
This factory method reconstructs a RayResources object from a flattened resource specification dictionary, handling the reverse transformation of to_resource_dict().
- Parameters
resource_spec (dict[str, float]) – Dictionary mapping resource names to quantities. Expected keys include “CPU”, “GPU”, custom resources, and optionally “accelerator_type:<type>” for accelerator specifications.
- Returns
A new RayResources instance representing the specified resources.
- Return type
Example
>>> resource_dict = {'CPU': 4, 'GPU': 2, 'accelerator_type:A100': 0.001} >>> resources = RayResources.from_resource_dict(resource_dict) >>> print(f"CPUs: {resources.num_cpus}, GPUs: {resources.num_gpus}") CPUs: 4, GPUs: 2
- num_cpus: int = 1#
- num_gpus: int = 0#
- resources: dict[str, float]#
- runtime_env: RuntimeEnv#
- static separate_process_fn(underlying_function, args, kwargs)[source]#
Execute a function in a separate subprocess with error handling.
This method runs the specified function in an isolated subprocess, capturing results or exceptions and handling process lifecycle management. It provides robust error handling and timeout protection.
- Parameters
underlying_function (Callable) – The function to execute in subprocess.
args (tuple) – Positional arguments to pass to the function.
kwargs (dict) – Keyword arguments to pass to the function.
- Returns
The return value from the function execution.
- Return type
Any
- Raises
RuntimeError – If the subprocess times out.
ValueError – If the subprocess execution fails with an exception.
Example
>>> def add(x, y): ... return x + y >>> result = RayResources.separate_process_fn(add, (2, 3), {}) >>> print(result)
- to_kwargs() dict[str, Any][source]#
Convert resource specifications to kwargs for ray.remote() decorator.
This method transforms the resource specifications into a format directly compatible with Ray’s remote decorator, handling all the necessary parameter mapping and filtering.
- Returns
- Dictionary of keyword arguments compatible with ray.remote().
Includes num_cpus, num_gpus, resources, runtime_env, and optionally accelerator_type if specified.
- Return type
dict[str, Any]
Example
>>> resources = RayResources(num_cpus=2, num_gpus=1) >>> kwargs = resources.to_kwargs() >>> print(kwargs) {'num_cpus': 2, 'num_gpus': 1, 'resources': {}, 'runtime_env': {}}
- to_resource_dict() dict[str, float][source]#
Convert resource specifications to a dictionary format for resource reporting.
This method creates a flattened view of all resource requirements, suitable for monitoring, logging, and resource visualization tools. It standardizes resource names and handles accelerator type encoding.
Note
This is primarily for resource visualization and reporting, not for direct use with ray.remote(). For ray.remote(), use to_kwargs() instead.
- Returns
- Dictionary mapping resource names to quantities.
Standard keys include “CPU”, “GPU”, and any custom resources. Accelerator types are encoded as “accelerator_type:<type>”.
- Return type
dict[str, float]
Example
>>> resources = RayResources(num_cpus=4, num_gpus=2, accelerator_type="A100") >>> resource_dict = resources.to_resource_dict() >>> print(resource_dict) {'CPU': 4, 'GPU': 2, 'accelerator_type:A100': 0.001}
- static update_fn_resource_env(remote_fn: Union[RemoteFunction, Callable], runtime_env: dict[str, str] | dict[str, dict[str, str]], **extra_env)[source]#
Merge runtime environment configurations for a remote function.
This method combines multiple sources of runtime environment configuration, including the function’s existing environment, provided runtime_env, and additional environment variables. Uses deep merging to handle nested configurations properly.
- Parameters
remote_fn (RemoteFunction | tp.Callable) – The remote function whose runtime environment will be updated.
runtime_env (dict[str, str] | dict[str, dict[str, str]]) – Runtime environment configuration to merge.
**extra_env – Additional environment variables as keyword arguments.
- Returns
Merged runtime environment configuration.
- Return type
dict
Example
>>> @ray.remote ... def my_fn(): ... return os.getenv('MY_VAR') >>> new_env = RayResources.update_fn_resource_env( ... my_fn, ... {'env_vars': {'MY_VAR': 'value1'}}, ... MY_OTHER_VAR='value2' ... )
- class eformer.executor.ray.resource_manager.TpuAcceleratorConfig(tpu_version: str, pod_count: int = 1, execution_env: ~ray.runtime_env.runtime_env.RuntimeEnv = <factory>, cpu_count: int = 2, chips_per_host: int = <factory>, worker_count: int = <factory>, runtime_name: str = <factory>, resource_name: str = 'TPU')[source]#
Bases:
ComputeResourceConfigResource configuration for TPU-accelerated workloads.
This configuration is designed for large-scale machine learning tasks using Google’s Tensor Processing Units (TPUs). TPUs are particularly well-suited for transformer models, large neural networks, and other matrix-heavy computations that benefit from TPU’s specialized architecture.
The configuration handles TPU pod management, resource allocation, and integration with Ray’s distributed computing framework. It supports various TPU versions and pod configurations for different computational requirements.
- tpu_version#
TPU version identifier (e.g., “TPU-V4”, “TPU-V5P”).
- Type
str
- pod_count#
Number of TPU pods to allocate for the task.
- Type
int
- execution_env#
Ray runtime environment for TPU dependencies.
- Type
RuntimeEnv
- cpu_count#
Number of CPU cores to allocate alongside TPUs.
- Type
int
- chips_per_host#
Number of TPU chips available per host.
- Type
int
- worker_count#
Number of worker processes for distributed TPU training.
- Type
int
- runtime_name#
TPU pod name identifier.
- Type
str
- resource_name#
Resource type identifier for TPU resources.
- Type
str
Example
>>> config = TpuAcceleratorConfig( ... tpu_version=HardwareType.GOOGLE_TPU_V4, ... pod_count=1, ... cpu_count=4 ... ) >>> @config.create_remote_decorator() ... def train_transformer(model_config): ... return tpu_training_loop(model_config) >>> result = ray.get(train_transformer.remote(config))
- chips_per_host: int#
- cpu_count: int = 2#
- execution_env: RuntimeEnv#
- get_remote_options() dict[str, Any][source]#
Get Ray remote options for TPU-accelerated execution.
This method constructs the Ray remote options dictionary for TPU-accelerated tasks. TPU resources are specified using Ray’s custom resource system, with the TPU version as the resource name and pod count as the quantity.
- Returns
- Dictionary of options for ray.remote() containing:
num_cpus: Number of CPU cores to allocate
resources: Custom resource specification for TPU (version: count)
runtime_env: Runtime environment configuration
- Return type
dict[str, Any]
Example
>>> config = TpuAcceleratorConfig(tpu_version="TPU-V4", pod_count=1, cpu_count=2) >>> options = config.get_remote_options() >>> print(options) {'num_cpus': 2, 'resources': {'TPU-V4': 1}, 'runtime_env': {}}
- hardware_identifier() str[source]#
Get the hardware identifier for the TPU configuration.
This method returns the TPU version identifier that specifies the exact TPU hardware generation and capabilities required for the computation. This ensures tasks are scheduled on appropriate TPU resources.
- Returns
- String identifier for the TPU version (e.g., “TPU-V4”, “TPU-V5P”).
This corresponds to the tpu_version attribute.
- Return type
str
Example
>>> config = TpuAcceleratorConfig(tpu_version="TPU-V4") >>> print(config.hardware_identifier())
- pod_count: int = 1#
- redecorate_remote_fn_for_call(remote_fn: Union[RemoteFunction, Callable], **extra_envs)[source]#
Redecorate a remote function with TPU-specific resource requirements.
This method applies TPU-specific configuration to a remote function, including process isolation (forkification), TPU pod resource allocation, and runtime environment updates. It’s specifically designed for TPU workloads that require special handling.
- Parameters
remote_fn (RemoteFunction | tp.Callable) – The remote function or callable to be configured for TPU execution.
**extra_envs – Additional environment variables to merge into the runtime environment.
- Returns
- A reconfigured remote function with TPU resource
requirements and updated runtime environment.
- Return type
RemoteFunction
Example
>>> config = TpuAcceleratorConfig(tpu_version="TPU-V4") >>> def my_tpu_function(): ... return "TPU computation" >>> tpu_fn = config.redecorate_remote_fn_for_call( ... my_tpu_function, ... JAX_PLATFORMS='tpu' ... ) >>> result = ray.get(tpu_fn.remote())
- resource_name: str = 'TPU'#
- runtime_name: str#
- to_ray_resources() RayResources[source]#
Convert to Ray resource specifications for TPU resources.
This method creates a RayResources object that represents TPU resource allocation using Ray’s custom resource system. TPU resources are specified as custom resources with the TPU version as the key.
- Returns
- RayResources object representing TPU resource allocation
with specified CPU count, TPU version and pod count as custom resources, and runtime environment.
- Return type
Example
>>> config = TpuAcceleratorConfig(tpu_version="TPU-V4", pod_count=2) >>> resources = config.to_ray_resources() >>> print(resources.resources) {'TPU-V4': 2.0}
- tpu_version: str#
- worker_count: int#
- eformer.executor.ray.resource_manager.available_cpu_cores() int[source]#
Determine the number of logical CPU cores available on the current system.
This function checks for SLURM environment variables first (common in HPC clusters), then falls back to the system’s reported CPU count. It provides a reliable way to determine available compute capacity across different deployment environments.
- Returns
- Number of available logical CPU cores. Returns 1 as fallback
if the system doesn’t support CPU count detection.
- Return type
int
Example
>>> cores = available_cpu_cores() >>> print(f"Available CPU cores: {cores}") Available CPU cores: 8