eformer.optimizers._builders

class eformer.optimizers._builders.AdafactorOptimizer(config: AdafactorConfig)[source]#

Bases: OptimizerBuilder

Builder for Adafactor optimizer.

Adafactor is a memory-efficient adaptive learning rate optimizer designed for training large models. It uses factored second-moment estimation to reduce memory usage while maintaining adaptive learning rate capabilities.

This optimizer is particularly useful for training large language models where memory constraints are significant.

config#

Configuration object containing Adafactor hyperparameters including factorization settings, decay rates, and clipping thresholds.

Type: AdafactorConfig

Example

>>> from eformer.optimizers import AdafactorConfig
>>> from eformer.optimizers._builders import AdafactorOptimizer
>>> import optax
>>> config = AdafactorConfig(decay_rate=0.8, factored=True)
>>> builder = AdafactorOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)

build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) → GradientTransformation[source]#

Build the Adafactor optimizer transformation.

Parameters

scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.

Returns

The Adafactor optimizer transformation, configured with factored second-moment estimation for memory efficiency.

Return type

optax.GradientTransformation

config: AdafactorConfig#
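
The factored second-moment estimate that AdafactorOptimizer relies on can be illustrated with a minimal pure-Python sketch. It is not eformer's or optax's implementation: the function name, the fixed `decay` value, and the `eps` default are assumptions made for illustration, and real implementations typically vary the decay with the step count. For an n × m parameter, only n row statistics and m column statistics are stored instead of n × m entries.

```python
def factored_second_moment(grad, row_acc, col_acc, decay=0.8, eps=1e-30):
    """One factored second-moment update for a 2D gradient (a list of rows).

    Hypothetical helper for illustration only. `row_acc` and `col_acc` are
    running averages of the row sums and column sums of the squared gradient.
    """
    n_rows, n_cols = len(grad), len(grad[0])
    squared = [[g * g + eps for g in row] for row in grad]

    # Reduce the squared gradient along each axis.
    row_sums = [sum(row) for row in squared]
    col_sums = [sum(squared[i][j] for i in range(n_rows)) for j in range(n_cols)]

    # Exponential moving averages of the two reduced statistics.
    new_row = [decay * r + (1 - decay) * s for r, s in zip(row_acc, row_sums)]
    new_col = [decay * c + (1 - decay) * s for c, s in zip(col_acc, col_sums)]

    # Rank-1 reconstruction of the full second-moment matrix.
    total = sum(new_row)
    v_hat = [
        [new_row[i] * new_col[j] / total for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return new_row, new_col, v_hat
```

When the squared gradient is itself rank 1, the reconstruction is exact; otherwise it is an approximation that trades accuracy for memory.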

class eformer.optimizers._builders.AdamWOptimizer(config: AdamWConfig)[source]#

Bases: OptimizerBuilder

Builder for AdamW optimizer.

AdamW is a variant of Adam that decouples weight decay from the gradient update, which often leads to better generalization. It is one of the most widely used optimizers for training transformers and other deep learning models.

config#

Configuration object containing AdamW hyperparameters including momentum coefficients (b1, b2), epsilon values, and data type.

Type: AdamWConfig

Example

>>> from eformer.optimizers import AdamWConfig
>>> from eformer.optimizers._builders import AdamWOptimizer
>>> import optax
>>> config = AdamWConfig(b1=0.9, b2=0.999, eps=1e-8)
>>> builder = AdamWOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)

build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) → GradientTransformation[source]#

Build the AdamW optimizer transformation.

Parameters

scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.

Returns

The AdamW optimizer transformation, whose updates can be applied to model parameters with optax.apply_updates.

Return type

optax.GradientTransformation

config: AdamWConfig#
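
The decoupling of weight decay described for AdamWOptimizer can be sketched for a single scalar parameter. This is an illustrative sketch, not the code that the builder runs: the function name and the `lr` and `weight_decay` defaults are assumptions. The point is that the decay term is added to the update separately and does not pass through the moment estimates.

```python
import math


def adamw_step(param, grad, m, v, step, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter (hypothetical helper).

    `step` starts at 1 so that the bias-correction terms are well defined.
    """
    # First and second moment estimates use the raw gradient only.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad

    # Bias correction for the zero-initialized moments.
    m_hat = m / (1 - b1 ** step)
    v_hat = v / (1 - b2 ** step)

    adam_update = m_hat / (math.sqrt(v_hat) + eps)

    # Decoupled weight decay: applied directly to the parameter,
    # outside the adaptive scaling above.
    param = param - lr * (adam_update + weight_decay * param)
    return param, m, v
```

With a zero gradient the parameter still shrinks by `lr * weight_decay * param`, which is the behaviour that distinguishes decoupled decay from L2 regularization folded into the gradient.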

class eformer.optimizers._builders.ConstantSchedulerBuilder(config: SchedulerConfig)[source]#

Bases: SchedulerBuilder

Builder for constant learning rate schedule.

This builder creates a scheduler that maintains a fixed learning rate throughout training.

config#

Configuration object containing the learning rate.

Type: SchedulerConfig

Example

>>> from eformer.optimizers import SchedulerConfig
>>> from eformer.optimizers._builders import ConstantSchedulerBuilder
>>> config = SchedulerConfig(learning_rate=1e-4)
>>> builder = ConstantSchedulerBuilder(config=config)
>>> scheduler = builder.build()
>>> scheduler(0)  # Returns 1e-4

build() → Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]][source]#

Build a constant learning rate schedule.

Returns

A schedule function that returns the configured learning rate regardless of the step count.

Return type

optax.Schedule

config: SchedulerConfig#

class eformer.optimizers._builders.CosineSchedulerBuilder(config: SchedulerConfig)[source]#

Bases: SchedulerBuilder

Builder for cosine learning rate schedule with optional warmup.

This builder creates a scheduler that decays the learning rate following a cosine curve. This is a popular choice for training neural networks because it decays the learning rate smoothly; the same curve also underlies "warm restart" schemes, in which the decay is repeated periodically.

config#

Configuration object containing learning rate parameters, steps, warmup settings, and cosine decay exponent.

Type: SchedulerConfig

Example

>>> from eformer.optimizers import SchedulerConfig
>>> from eformer.optimizers._builders import CosineSchedulerBuilder
>>> config = SchedulerConfig(
...     scheduler_type="cosine",
...     learning_rate=1e-4,
...     learning_rate_end=1e-6,
...     steps=10000,
...     warmup_steps=1000,
...     exponent=1.0,
... )
>>> builder = CosineSchedulerBuilder(config=config)
>>> scheduler = builder.build()

build() → Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]][source]#

Build a cosine learning rate schedule with optional warmup.

Creates a cosine decay schedule that smoothly decreases the learning rate from the peak value to the end value. If warmup_steps is specified, includes a linear warmup phase from a very small value (1e-8) to the peak learning rate before the cosine decay begins.

Returns

A schedule function that returns the learning rate for a given step count, following a cosine decay pattern.

Return type

optax.Schedule

config: SchedulerConfig#
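
The shape of the schedule that CosineSchedulerBuilder describes (linear warmup from 1e-8, then cosine decay to the end value) can be sketched in plain Python. The function below is a hypothetical stand-in, not the builder's code. In particular, it assumes that the decay spans the `steps - warmup_steps` steps that remain after warmup and that the exponent is applied to the cosine factor; the actual builder may divide the steps differently.

```python
import math


def cosine_with_warmup(step, peak_lr=1e-4, end_lr=1e-6, steps=10_000,
                       warmup_steps=1_000, exponent=1.0, init_lr=1e-8):
    """Learning rate at `step` for warmup followed by cosine decay."""
    # Linear warmup from a very small value to the peak learning rate.
    if warmup_steps > 0 and step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps

    # Assumption: the decay covers the steps that remain after warmup.
    decay_steps = max(steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / decay_steps, 1.0)

    # The cosine factor goes from 1 at the start of decay to 0 at the end.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end_lr + (peak_lr - end_lr) * cosine ** exponent
```

Under these assumptions the schedule starts at 1e-8, reaches the peak at the end of warmup, and stays at the end value once the decay has finished.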

class eformer.optimizers._builders.LinearSchedulerBuilder(config: SchedulerConfig)[source]#

Bases: SchedulerBuilder

Builder for linear learning rate schedule with optional warmup.

This builder creates a scheduler that linearly decays the learning rate from an initial value to an end value over a specified number of steps. Optionally, a warmup phase can be added at the beginning of training.

config#

Configuration object containing learning rate parameters, steps, and optional warmup settings.

Type: SchedulerConfig

Example

>>> from eformer.optimizers import SchedulerConfig
>>> from eformer.optimizers._builders import LinearSchedulerBuilder
>>> config = SchedulerConfig(
...     scheduler_type="linear",
...     learning_rate=1e-4,
...     learning_rate_end=1e-6,
...     steps=10000,
...     warmup_steps=1000,
... )
>>> builder = LinearSchedulerBuilder(config=config)
>>> scheduler = builder.build()

build() → Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]][source]#

Build a linear learning rate schedule with optional warmup.

Creates a linear decay schedule from learning_rate to learning_rate_end. If warmup_steps is specified, prepends a linear warmup phase from a very small value (1e-8) to the initial learning rate.

Returns

A schedule function that returns the learning rate for a given step count.

Return type

optax.Schedule

Raises

ValueError – If learning_rate_end is not specified in the config.

config: SchedulerConfig#
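
The behaviour documented for LinearSchedulerBuilder, including the ValueError for a missing end value, can be sketched as follows. The function name and argument names are hypothetical, and the sketch assumes that the linear decay covers the steps that remain after warmup; the builder itself may split the steps differently.

```python
def linear_with_warmup(step, peak_lr=1e-4, end_lr=None, steps=10_000,
                       warmup_steps=1_000, init_lr=1e-8):
    """Learning rate at `step` for warmup followed by linear decay."""
    # Mirrors the documented error for a missing learning_rate_end.
    if end_lr is None:
        raise ValueError("learning_rate_end must be specified for a linear schedule")

    # Linear warmup from a very small value to the initial learning rate.
    if warmup_steps > 0 and step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps

    # Assumption: the decay covers the steps that remain after warmup.
    decay_steps = max(steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return peak_lr + (end_lr - peak_lr) * progress
```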

class eformer.optimizers._builders.LionOptimizer(config: LionConfig)[source]#

Bases: OptimizerBuilder

Builder for Lion (Evolved Sign Momentum) optimizer.

Lion is an optimizer discovered through symbolic program search that uses sign-based updates with momentum. It often achieves better generalization than Adam with fewer hyperparameters to tune.

Reference: https://arxiv.org/abs/2302.06675

config#

Configuration object containing Lion hyperparameters including momentum coefficients (b1, b2) and data type for momentum.

Type: LionConfig

Example

>>> from eformer.optimizers import LionConfig
>>> from eformer.optimizers._builders import LionOptimizer
>>> import optax
>>> config = LionConfig(b1=0.9, b2=0.99)
>>> builder = LionOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)

build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) → GradientTransformation[source]#

Build the Lion optimizer transformation.

Parameters

scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.

Returns

The Lion optimizer transformation, which uses sign-based updates with momentum for efficient parameter updates.

Return type

optax.GradientTransformation

config: LionConfig#
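
The sign-based update that LionOptimizer builds on can be sketched for a scalar parameter, following the update rule in the referenced paper. The function names and the `lr` and `weight_decay` defaults are assumptions for illustration; this is not the transformation returned by `build`.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(param, grad, m, lr=1e-4, b1=0.9, b2=0.99, weight_decay=0.0):
    """One Lion update for a scalar parameter (hypothetical helper)."""
    # The update direction is the sign of an interpolation between the
    # momentum and the current gradient, so its magnitude is always 0 or 1.
    direction = sign(b1 * m + (1 - b1) * grad)
    param = param - lr * (direction + weight_decay * param)

    # The momentum is updated afterwards, with a different coefficient.
    m = b2 * m + (1 - b2) * grad
    return param, m
```

Because only the sign is used, a gradient of 0.5 and a gradient of 50 move the parameter by the same amount in a given step; only the stored momentum differs.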

class eformer.optimizers._builders.MarsOptimizer(config: MarsConfig)[source]#

Bases: OptimizerBuilder

Builder for Mars (Make vAriance Reduction Shine) optimizer.

Mars extends Adam-style optimizers with a variance reduction step: the current gradient is corrected using the gradient from the previous step before the moment estimates are updated. This can lead to improved convergence and better generalization, particularly for training large language models.

Reference: https://arxiv.org/abs/2411.10438

config#

Configuration object containing Mars hyperparameters including beta coefficients, gamma for gradient momentum, epsilon for numerical stability, and gradient clipping threshold.

Type: MarsConfig

Example

>>> from eformer.optimizers import MarsConfig
>>> from eformer.optimizers._builders import MarsOptimizer
>>> import optax
>>> config = MarsConfig(beta1=0.95, beta2=0.99, gamma=0.025)
>>> builder = MarsOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)

build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) → GradientTransformation[source]#

Build the Mars optimizer transformation.

Parameters

scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.

Returns

The Mars optimizer transformation, which uses variance reduction with gradient momentum for improved convergence.

Return type

optax.GradientTransformation

config: MarsConfig#
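
The variance reduction step mentioned for MarsOptimizer can be sketched as a gradient correction followed by norm clipping. The sketch follows the formulation in the referenced paper as an assumption; the function name and the `max_norm` default are hypothetical, and eformer's implementation may differ in detail. The corrected gradient would then replace the raw gradient in the usual Adam moment updates.

```python
import math


def mars_corrected_gradient(grad, prev_grad, beta1=0.95, gamma=0.025,
                            max_norm=1.0):
    """Variance-reduced gradient for one step (hypothetical helper).

    `grad` and `prev_grad` are flat lists of floats of equal length.
    """
    # Scale applied to the difference between consecutive gradients.
    scale = gamma * beta1 / (1.0 - beta1)
    corrected = [g + scale * (g - pg) for g, pg in zip(grad, prev_grad)]

    # Clip the corrected gradient to a maximum L2 norm.
    norm = math.sqrt(sum(c * c for c in corrected))
    if norm > max_norm:
        corrected = [c * max_norm / norm for c in corrected]
    return corrected
```

When consecutive gradients are identical the correction vanishes, and only the clipping can change the gradient.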

class eformer.optimizers._builders.MuonOptimizer(config: MuonConfig)[source]#

Bases: OptimizerBuilder

Builder for Muon (Momentum Orthogonalized by Newton-Schulz) optimizer.

Muon is designed specifically for 2D parameters (matrices) and uses the Newton-Schulz method to orthogonalize momentum. Non-2D parameters are processed through an Adam optimizer fallback. This makes it particularly effective for training models with large matrix parameters.

The optimizer approximately orthogonalizes the momentum before applying it as an update, which can lead to more stable training and better convergence for certain architectures.

config#

Configuration object containing Muon hyperparameters including Newton-Schulz coefficients, number of steps, momentum parameters, and Adam fallback settings.

Type: MuonConfig

Example

>>> from eformer.optimizers import MuonConfig
>>> from eformer.optimizers._builders import MuonOptimizer
>>> import optax
>>> config = MuonConfig(ns_steps=5, beta=0.95, nesterov=True)
>>> builder = MuonOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)

build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) → GradientTransformation[source]#

Build the Muon optimizer transformation.

Parameters

scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.

Returns

The Muon optimizer transformation, which uses Newton-Schulz orthogonalization for 2D parameters and Adam for all others.

Return type

optax.GradientTransformation

config: MuonConfig#
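
The Newton-Schulz orthogonalization that MuonOptimizer applies to 2D parameters can be illustrated with a small pure-Python sketch. It is not the builder's implementation: the helper names are hypothetical, and the default coefficients `(1.5, -0.5, 0.0)` are the classic cubic iteration, chosen here because it converges to an exactly orthogonal matrix. Muon implementations typically use a tuned coefficient triple with fewer steps, which is faster but only approximately orthogonal.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=15, coeffs=(1.5, -0.5, 0.0), eps=1e-7):
    """Approximate the orthogonal factor of `g` (hypothetical helper).

    Each step maps X to a*X + (b*A + c*A@A) @ X with A = X @ X^T, which
    pushes every singular value of X towards 1.
    """
    a, b, c = coeffs

    # Normalize so that all singular values are at most 1.
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / (norm + eps) for v in row] for row in g]

    for _ in range(steps):
        xxt = matmul(x, transpose(x))
        xxt2 = matmul(xxt, xxt)
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(xxt, xxt2)]
        px = matmul(poly, x)
        x = [[a * xv + pv for xv, pv in zip(rx, rp)] for rx, rp in zip(x, px)]
    return x
```

For a full-rank square input, the result `x` satisfies `x @ x^T ≈ I`, so the update keeps the directions of the momentum matrix while equalizing their scale.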

class eformer.optimizers._builders.QuadOptimizer(config: WhiteKronConfig)[source]#

Bases: OptimizerBuilder

Builder for Quad (White Kron with QUAD update) optimizer.

Quad is a Kronecker-factored preconditioned optimizer that uses the QUAD preconditioner update style. It provides efficient second-order optimization by approximating the inverse Fisher information matrix using Kronecker products.

This optimizer is particularly effective for training deep neural networks, especially transformers, where second-order information can significantly improve convergence.

config#

Configuration object containing Quad optimizer hyperparameters including preconditioner settings, block size, sharding configurations, and numerical stability parameters.

Type: WhiteKronConfig

Example

>>> from eformer.optimizers import WhiteKronConfig
>>> from eformer.optimizers._builders import QuadOptimizer
>>> import optax
>>> config = WhiteKronConfig(b1=0.95, preconditioner_lr=0.7)
>>> builder = QuadOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)

build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) → GradientTransformation[source]#

Build the Quad optimizer transformation.

Parameters

scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.

Returns

The Quad optimizer transformation, which uses QUAD-style Kronecker-factored preconditioning for efficient second-order optimization.

Return type

optax.GradientTransformation

config: WhiteKronConfig#
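
The reason Kronecker factorization makes preconditioning affordable can be shown with a generic identity. The sketch below does not reproduce the QUAD update of the factors, which is not specified here; the helper names and the example factors are hypothetical. It only shows that applying a Kronecker-product preconditioner to a flattened gradient gives the same result as multiplying the gradient matrix by two small factors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ik * b_jl for a_ik in a_row for b_jl in b_row]
        for a_row in a
        for b_row in b
    ]


def precondition_dense(left, right, grad):
    """Apply kron(left, right) to the row-major flattened gradient."""
    flat = [g for row in grad for g in row]
    big = kron(left, right)  # (n*m) x (n*m) entries
    out = [sum(p * g for p, g in zip(row, flat)) for row in big]
    n_cols = len(grad[0])
    return [out[i:i + n_cols] for i in range(0, len(out), n_cols)]


def precondition_factored(left, right, grad):
    """Same result using only the n x n and m x m factors."""
    return matmul(matmul(left, grad), transpose(right))
```

For an n × m gradient, the dense form needs an (n·m) × (n·m) matrix, while the factored form stores only an n × n and an m × m matrix.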

class eformer.optimizers._builders.RMSPropOptimizer(config: RMSPropConfig)[source]#

Bases: OptimizerBuilder

Builder for RMSProp (Root Mean Square Propagation) optimizer.

RMSProp is an adaptive learning rate optimizer that divides the gradient by the square root of a running average of recent squared gradients. It is effective for training recurrent neural networks and other models with non-stationary objectives.

config#

Configuration object containing RMSProp hyperparameters including decay rate, epsilon, momentum, and Nesterov momentum settings.

Type: RMSPropConfig

Example

>>> from eformer.optimizers import RMSPropConfig
>>> from eformer.optimizers._builders import RMSPropOptimizer
>>> import optax
>>> config = RMSPropConfig(decay=0.9, eps=1e-8, momentum=0.9)
>>> builder = RMSPropOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)

build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) → GradientTransformation[source]#

Build the RMSProp optimizer transformation.

Parameters

scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.

Returns

The RMSProp optimizer transformation, which adapts the learning rate based on a moving average of squared gradients.

Return type

optax.GradientTransformation

config: RMSPropConfig#
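
The core RMSProp rule can be sketched for a scalar parameter. The function name and the `lr` default are assumptions, the optional momentum and Nesterov terms are left out, and placing `eps` outside the square root is only one of the conventions in use, so this is an illustration and not the builder's exact computation.

```python
import math


def rmsprop_step(param, grad, avg_sq, lr=1e-4, decay=0.9, eps=1e-8):
    """One RMSProp update for a scalar parameter (hypothetical helper)."""
    # Running average of squared gradients.
    avg_sq = decay * avg_sq + (1 - decay) * grad * grad

    # Scale the gradient by the root of that average.
    param = param - lr * grad / (math.sqrt(avg_sq) + eps)
    return param, avg_sq
```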

class eformer.optimizers._builders.SkewOptimizer(config: WhiteKronConfig)[source]#

Bases: OptimizerBuilder

Builder for Skew (White Kron with skew update) optimizer.

Skew is a Kronecker-factored preconditioned optimizer that uses the skew preconditioner update style. It provides efficient second-order optimization with a different update rule compared to the QUAD variant.

The skew update uses a Procrustes step to maintain orthogonality of the preconditioner, which can lead to more stable training in certain scenarios.

config#

Configuration object containing Skew optimizer hyperparameters including preconditioner settings, block size, sharding configurations, and numerical stability parameters.

Type: WhiteKronConfig

Example

>>> from eformer.optimizers import WhiteKronConfig
>>> from eformer.optimizers._builders import SkewOptimizer
>>> import optax
>>> config = WhiteKronConfig(b1=0.95, preconditioner_lr=0.7)
>>> builder = SkewOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)

build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) → GradientTransformation[source]#

Build the Skew optimizer transformation.

Parameters

scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.

Returns

The Skew optimizer transformation, which uses skew-style Kronecker-factored preconditioning with Procrustes orthogonalization for efficient second-order optimization.

Return type

optax.GradientTransformation

config: WhiteKronConfig#
