eformer.optimizers._builders
- class eformer.optimizers._builders.AdafactorOptimizer(config: AdafactorConfig)
Bases: OptimizerBuilder
Builder for Adafactor optimizer.
Adafactor is a memory-efficient adaptive learning rate optimizer designed for training large models. It uses factored second-moment estimation to reduce memory usage while maintaining adaptive learning rate capabilities.
This optimizer is particularly useful for training large language models where memory constraints are significant.
- config
Configuration object containing Adafactor hyperparameters including factorization settings, decay rates, and clipping thresholds.
- Type
AdafactorConfig
Example
>>> from eformer.optimizers import AdafactorConfig
>>> from eformer.optimizers._builders import AdafactorOptimizer
>>> import optax
>>> config = AdafactorConfig(decay_rate=0.8, factored=True)
>>> builder = AdafactorOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)
- build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) -> GradientTransformation
Build the Adafactor optimizer transformation.
- Parameters
scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.
- Returns
The Adafactor optimizer transformation configured with factored second-moment estimation for memory efficiency.
- Return type
optax.GradientTransformation
- config: AdafactorConfig
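The memory saving from factored second-moment estimation can be illustrated with a plain-Python sketch. It is not eformer's or optax's implementation: the function name `factored_second_moment` is hypothetical, and the exponential averaging that a real Adafactor applies over steps is omitted. The sketch only shows that a matrix of squared gradients can be summarized by one statistic per row and one per column, then rebuilt as a rank-1 estimate.

```python
def factored_second_moment(grad_sq):
    """Summarize a matrix of squared gradients by row and column means.

    Hypothetical helper for illustration only. Returns the two factor
    vectors and the rank-1 reconstruction built from them.
    """
    n_rows, n_cols = len(grad_sq), len(grad_sq[0])
    # One number per row and one per column: n_rows + n_cols values in total.
    row_mean = [sum(row) / n_cols for row in grad_sq]
    col_mean = [sum(grad_sq[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]
    overall_mean = sum(row_mean) / n_rows
    # Rank-1 estimate: v_hat[i][j] = row_mean[i] * col_mean[j] / overall_mean.
    v_hat = [[r * c / overall_mean for c in col_mean] for r in row_mean]
    return row_mean, col_mean, v_hat


# A 2x3 gradient whose squared entries form a rank-1 matrix.
grad = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
grad_sq = [[g * g for g in row] for row in grad]
row_mean, col_mean, v_hat = factored_second_moment(grad_sq)

stored_values = len(row_mean) + len(col_mean)       # 5 numbers kept
full_values = len(grad_sq) * len(grad_sq[0])        # 6 numbers in the full matrix
```

For an n-by-m parameter the factored form keeps n + m numbers instead of n * m, which is where the saving for large weight matrices comes from. The reconstruction is exact here only because the example matrix is rank-1; in general it is an approximation.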
- class eformer.optimizers._builders.AdamWOptimizer(config: AdamWConfig)
Bases: OptimizerBuilder
Builder for AdamW optimizer.
AdamW is a variant of Adam that decouples weight decay from the gradient update, which often leads to better generalization. It is one of the most widely used optimizers for training transformers and other deep learning models.
- config
Configuration object containing AdamW hyperparameters including momentum coefficients (b1, b2), epsilon values, and data type.
- Type
AdamWConfig
Example
>>> from eformer.optimizers import AdamWConfig
>>> from eformer.optimizers._builders import AdamWOptimizer
>>> import optax
>>> config = AdamWConfig(b1=0.9, b2=0.999, eps=1e-8)
>>> builder = AdamWOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)
- build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) -> GradientTransformation
Build the AdamW optimizer transformation.
- Parameters
scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.
- Returns
The AdamW optimizer transformation. The updates it produces can be passed to optax.apply_updates to update model parameters.
- Return type
optax.GradientTransformation
- config: AdamWConfig
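What "decoupled" weight decay means can be seen in a scalar, plain-Python sketch of one AdamW step. This is an illustration under assumptions, not the code that `build()` runs: the function name `adamw_step` and the `weight_decay` argument are hypothetical, and only `b1`, `b2`, and `eps` correspond to hyperparameters named above.

```python
import math


def adamw_step(param, grad, m, v, step, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (illustrative sketch)."""
    # Adam moment estimates use the raw gradient only.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    # Bias correction for the zero-initialized moments (step counts from 1).
    m_hat = m / (1 - b1 ** step)
    v_hat = v / (1 - b2 ** step)
    adam_update = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled decay: the shrinkage term is added to the update directly
    # instead of being folded into the gradient before the moment estimates.
    param = param - lr * (adam_update + weight_decay * param)
    return param, m, v


# With a zero gradient the Adam part vanishes and only the decay acts.
new_param, new_m, new_v = adamw_step(param=1.0, grad=0.0, m=0.0, v=0.0, step=1)
```

Because the decay term bypasses the moment estimates, its strength is not rescaled by the adaptive denominator, which is the difference from adding an L2 penalty to the loss.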
- class eformer.optimizers._builders.ConstantSchedulerBuilder(config: SchedulerConfig)
Bases: SchedulerBuilder
Builder for constant learning rate schedule.
This builder creates a scheduler that maintains a fixed learning rate throughout training.
- config
Configuration object containing the learning rate.
- Type
SchedulerConfig
Example
>>> from eformer.optimizers import SchedulerConfig
>>> from eformer.optimizers._builders import ConstantSchedulerBuilder
>>> config = SchedulerConfig(learning_rate=1e-4)
>>> builder = ConstantSchedulerBuilder(config=config)
>>> scheduler = builder.build()
>>> scheduler(0)  # Returns 1e-4
- build() -> Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]
Build a constant learning rate schedule.
- Returns
A schedule function that returns the configured learning rate regardless of the step count.
- Return type
optax.Schedule
- config: SchedulerConfig
- class eformer.optimizers._builders.CosineSchedulerBuilder(config: SchedulerConfig)
Bases: SchedulerBuilder
Builder for cosine learning rate schedule with optional warmup.
This builder creates a scheduler that decays the learning rate following a cosine curve. This is a popular choice for training neural networks because the learning rate falls smoothly from its peak to its end value, and the decay can be preceded by a warmup phase.
- config
Configuration object containing learning rate parameters, steps, warmup settings, and cosine decay exponent.
- Type
SchedulerConfig
Example
>>> from eformer.optimizers import SchedulerConfig
>>> from eformer.optimizers._builders import CosineSchedulerBuilder
>>> config = SchedulerConfig(
...     scheduler_type="cosine",
...     learning_rate=1e-4,
...     learning_rate_end=1e-6,
...     steps=10000,
...     warmup_steps=1000,
...     exponent=1.0,
... )
>>> builder = CosineSchedulerBuilder(config=config)
>>> scheduler = builder.build()
- build() -> Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]
Build a cosine learning rate schedule with optional warmup.
Creates a cosine decay schedule that smoothly decreases the learning rate from the peak value to the end value. If warmup_steps is specified, includes a linear warmup phase from a very small value (1e-8) to the peak learning rate before the cosine decay begins.
- Returns
A schedule function that returns the learning rate for a given step count, following a cosine decay pattern.
- Return type
optax.Schedule
- config: SchedulerConfig
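The shape described above (linear warmup from 1e-8 to the peak, then cosine decay to the end value) can be sketched in plain Python. The function `warmup_cosine_lr` is hypothetical and is not the schedule object that `build()` returns. In particular, how eformer divides `steps` between warmup and decay is not stated here, so the sketch assumes the decay runs over `total_steps - warmup_steps` steps.

```python
import math


def warmup_cosine_lr(step, peak_lr=1e-4, end_lr=1e-6, total_steps=10_000,
                     warmup_steps=1_000, exponent=1.0, init_lr=1e-8):
    """Learning rate at `step` for a warmup + cosine decay curve (sketch)."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from init_lr up to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # Assumption: the cosine phase covers the steps left after warmup.
    decay_steps = max(total_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    # Cosine factor goes from 1 at the start of decay to 0 at the end.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end_lr + (peak_lr - end_lr) * cosine ** exponent


lr_start = warmup_cosine_lr(0)         # beginning of warmup
lr_peak = warmup_cosine_lr(1_000)      # warmup finished, decay begins
lr_mid = warmup_cosine_lr(5_500)       # halfway through the decay
lr_final = warmup_cosine_lr(10_000)    # end of the schedule
```

With `exponent=1.0` the decay is the standard half-cosine; larger exponents make the curve drop faster early in the decay phase.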
- class eformer.optimizers._builders.LinearSchedulerBuilder(config: SchedulerConfig)
Bases: SchedulerBuilder
Builder for linear learning rate schedule with optional warmup.
This builder creates a scheduler that linearly decays the learning rate from an initial value to an end value over a specified number of steps. Optionally, a warmup phase can be added at the beginning of training.
- config
Configuration object containing learning rate parameters, steps, and optional warmup settings.
- Type
SchedulerConfig
Example
>>> from eformer.optimizers import SchedulerConfig
>>> from eformer.optimizers._builders import LinearSchedulerBuilder
>>> config = SchedulerConfig(
...     scheduler_type="linear",
...     learning_rate=1e-4,
...     learning_rate_end=1e-6,
...     steps=10000,
...     warmup_steps=1000,
... )
>>> builder = LinearSchedulerBuilder(config=config)
>>> scheduler = builder.build()
- build() -> Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]
Build a linear learning rate schedule with optional warmup.
Creates a linear decay schedule from learning_rate to learning_rate_end. If warmup_steps is specified, prepends a linear warmup phase from a very small value (1e-8) to the initial learning rate.
- Returns
A schedule function that returns the learning rate for a given step count.
- Return type
optax.Schedule
- Raises
ValueError – If learning_rate_end is not specified in the config.
- config: SchedulerConfig
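A plain-Python sketch of the same behaviour, including the ValueError for a missing end value, might look as follows. The function `warmup_linear_lr` is hypothetical and only mirrors the description above. As an assumption, the decay is taken to span `steps - warmup_steps` steps; the actual split used by eformer is not documented here.

```python
def warmup_linear_lr(step, learning_rate=1e-4, learning_rate_end=1e-6,
                     steps=10_000, warmup_steps=1_000, init_lr=1e-8):
    """Learning rate at `step` for a warmup + linear decay curve (sketch)."""
    if learning_rate_end is None:
        # Mirrors the documented failure mode of the builder.
        raise ValueError("learning_rate_end must be set for a linear schedule")
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from init_lr up to the initial learning rate.
        return init_lr + (learning_rate - init_lr) * step / warmup_steps
    # Assumption: the decay phase covers the steps left after warmup.
    decay_steps = max(steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return learning_rate + (learning_rate_end - learning_rate) * progress


lr_start = warmup_linear_lr(0)
lr_peak = warmup_linear_lr(1_000)
lr_final = warmup_linear_lr(10_000)
```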
- class eformer.optimizers._builders.LionOptimizer(config: LionConfig)
Bases: OptimizerBuilder
Builder for Lion (EvoLved Sign Momentum) optimizer.
Lion is an optimizer discovered through automated symbolic program search that uses sign-based updates with momentum. It keeps only a momentum buffer as optimizer state, and the reference paper reports that it often generalizes as well as or better than Adam.
Reference: https://arxiv.org/abs/2302.06675
- config
Configuration object containing Lion hyperparameters including momentum coefficients (b1, b2) and data type for momentum.
- Type
LionConfig
Example
>>> from eformer.optimizers import LionConfig
>>> from eformer.optimizers._builders import LionOptimizer
>>> import optax
>>> config = LionConfig(b1=0.9, b2=0.99)
>>> builder = LionOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)
- build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) -> GradientTransformation
Build the Lion optimizer transformation.
- Parameters
scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.
- Returns
The Lion optimizer transformation that uses sign-based updates with momentum for efficient parameter updates.
- Return type
optax.GradientTransformation
- config: LionConfig
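The sign-based update can be illustrated with a scalar, plain-Python sketch of the rule from the reference paper. The names `sign` and `lion_step` are hypothetical, weight decay is left out, and this is not the transformation that `build()` returns.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(param, grad, momentum, lr=1e-4, b1=0.9, b2=0.99):
    """One Lion update for a single scalar parameter (illustrative sketch)."""
    # The step direction is the sign of a b1-interpolation between the
    # momentum and the current gradient, so its magnitude is always lr.
    direction = sign(b1 * momentum + (1 - b1) * grad)
    new_param = param - lr * direction
    # The momentum buffer itself is tracked with the second coefficient b2.
    new_momentum = b2 * momentum + (1 - b2) * grad
    return new_param, new_momentum


# Very different gradient magnitudes produce the same step size.
p_large, m_large = lion_step(param=1.0, grad=250.0, momentum=0.0)
p_small, m_small = lion_step(param=1.0, grad=0.001, momentum=0.0)
```

Since every coordinate moves by exactly `lr` per step, Lion is usually run with a smaller learning rate than Adam-style optimizers.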
- class eformer.optimizers._builders.MarsOptimizer(config: MarsConfig)
Bases: OptimizerBuilder
Builder for Mars (Make vAriance Reduction Shine) optimizer.
Mars builds on Adam by adding a variance reduction technique: before the moment estimates are updated, the current gradient is corrected with a term, scaled by gamma, that uses the gradient from the previous step. This can lead to improved convergence and better generalization, particularly for training large language models.
Reference: https://arxiv.org/abs/2411.10438
- config
Configuration object containing Mars hyperparameters including beta coefficients, gamma for gradient momentum, epsilon for numerical stability, and gradient clipping threshold.
- Type
MarsConfig
Example
>>> from eformer.optimizers import MarsConfig
>>> from eformer.optimizers._builders import MarsOptimizer
>>> import optax
>>> config = MarsConfig(beta1=0.95, beta2=0.99, gamma=0.025)
>>> builder = MarsOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)
- build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) -> GradientTransformation
Build the Mars optimizer transformation.
- Parameters
scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.
- Returns
The Mars optimizer transformation that uses variance reduction with gradient momentum for improved convergence.
- Return type
optax.GradientTransformation
- config: MarsConfig
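A plain-Python sketch of the gradient correction, following the formulation in the reference paper as far as it is understood here, is given below. The function `mars_corrected_gradient` and its `max_norm` argument are hypothetical names, and eformer's implementation may differ in details such as how the previous gradient is obtained or where clipping is applied.

```python
import math


def mars_corrected_gradient(grad, prev_grad, beta1=0.95, gamma=0.025,
                            max_norm=1.0):
    """Variance-reduced gradient fed to the Adam-style moments (sketch)."""
    # Scale on the difference between the current and previous gradients.
    scale = gamma * beta1 / (1.0 - beta1)
    corrected = [g + scale * (g - pg) for g, pg in zip(grad, prev_grad)]
    # Clip the corrected gradient to a maximum L2 norm.
    norm = math.sqrt(sum(c * c for c in corrected))
    if norm > max_norm:
        corrected = [c * max_norm / norm for c in corrected]
    return corrected


# Identical consecutive gradients: the correction term vanishes.
unchanged = mars_corrected_gradient([0.3, 0.4], [0.3, 0.4])
# A large change between steps: the result is clipped to unit norm.
clipped = mars_corrected_gradient([3.0, 4.0], [0.0, 0.0])
clipped_norm = math.sqrt(sum(c * c for c in clipped))
```

The corrected gradient would then replace the raw gradient in the usual first- and second-moment updates governed by beta1 and beta2.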
- class eformer.optimizers._builders.MuonOptimizer(config: MuonConfig)
Bases: OptimizerBuilder
Builder for Muon (MomentUm Orthogonalized by Newton-Schulz) optimizer.
Muon is designed specifically for 2D parameters (matrices) and uses the Newton-Schulz method to orthogonalize momentum. Non-2D parameters are processed through an Adam optimizer fallback. This makes it particularly effective for training models with large matrix parameters.
At each step the momentum matrix is replaced by an approximately orthogonal matrix before it is applied as the update, which can lead to more stable training and better convergence for certain architectures.
- config
Configuration object containing Muon hyperparameters including Newton-Schulz coefficients, number of steps, momentum parameters, and Adam fallback settings.
- Type
MuonConfig
Example
>>> from eformer.optimizers import MuonConfig
>>> from eformer.optimizers._builders import MuonOptimizer
>>> import optax
>>> config = MuonConfig(ns_steps=5, beta=0.95, nesterov=True)
>>> builder = MuonOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)
- build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) -> GradientTransformation
Build the Muon optimizer transformation.
- Parameters
scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.
- Returns
The Muon optimizer transformation that uses Newton-Schulz orthogonalization for 2D parameters and Adam for others.
- Return type
optax.GradientTransformation
- config: MuonConfig
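To make the orthogonalization step concrete, here is a plain-Python sketch that uses the classic cubic Newton-Schulz iteration, X <- 1.5 X - 0.5 X X^T X. This is a substitute chosen for clarity: Muon implementations typically use a tuned higher-order polynomial whose coefficients are configurable, and that variant needs fewer steps. The helper names below are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(matrix, ns_steps=15):
    """Approximate the nearest orthogonal matrix with the cubic iteration."""
    # Normalize by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    fro = math.sqrt(sum(v * v for row in matrix for v in row))
    x = [[v / fro for v in row] for row in matrix]
    for _ in range(ns_steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # Each step pushes every singular value of x toward 1.
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


momentum = [[3.0, 1.0], [0.0, 2.0]]      # stand-in for a 2D momentum buffer
ortho = newton_schulz_orthogonalize(momentum)
gram = matmul(ortho, transpose(ortho))   # close to the identity matrix
```

The iteration uses only matrix multiplications, which is why this family of methods is attractive on accelerators compared with computing an SVD.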
- class eformer.optimizers._builders.QuadOptimizer(config: WhiteKronConfig)
Bases: OptimizerBuilder
Builder for Quad (White Kron with QUAD update) optimizer.
Quad is a Kronecker-factored preconditioned optimizer that uses the QUAD preconditioner update style. It provides efficient second-order optimization by approximating the inverse Fisher information matrix using Kronecker products.
This optimizer is particularly effective for training deep neural networks, especially transformers, where second-order information can significantly improve convergence.
- config
Configuration object containing Quad optimizer hyperparameters including preconditioner settings, block size, sharding configurations, and numerical stability parameters.
- Type
WhiteKronConfig
Example
>>> from eformer.optimizers import WhiteKronConfig
>>> from eformer.optimizers._builders import QuadOptimizer
>>> import optax
>>> config = WhiteKronConfig(b1=0.95, preconditioner_lr=0.7)
>>> builder = QuadOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)
- build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) -> GradientTransformation
Build the Quad optimizer transformation.
- Parameters
scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.
- Returns
The Quad optimizer transformation using QUAD-style Kronecker-factored preconditioning for efficient second-order optimization.
- Return type
optax.GradientTransformation
- config: WhiteKronConfig
- class eformer.optimizers._builders.RMSPropOptimizer(config: RMSPropConfig)
Bases: OptimizerBuilder
Builder for RMSProp (Root Mean Square Propagation) optimizer.
RMSProp is an adaptive learning rate optimizer that divides the gradient by the square root of a running average of recent squared gradients. It is effective for training recurrent neural networks and other models with non-stationary objectives.
- config
Configuration object containing RMSProp hyperparameters including decay rate, epsilon, momentum, and Nesterov momentum settings.
- Type
RMSPropConfig
Example
>>> from eformer.optimizers import RMSPropConfig
>>> from eformer.optimizers._builders import RMSPropOptimizer
>>> import optax
>>> config = RMSPropConfig(decay=0.9, eps=1e-8, momentum=0.9)
>>> builder = RMSPropOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)
- build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) -> GradientTransformation
Build the RMSProp optimizer transformation.
- Parameters
scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.
- Returns
The RMSProp optimizer transformation that adapts the learning rate based on a moving average of squared gradients.
- Return type
optax.GradientTransformation
- config: RMSPropConfig
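The moving-average scaling can be shown with a scalar, plain-Python sketch. The function `rmsprop_step` is hypothetical; it omits the momentum and Nesterov options mentioned above, and it places `eps` outside the square root, which is one common convention and not necessarily the one used by the transformation that `build()` returns.

```python
import math


def rmsprop_step(param, grad, avg_sq, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSProp update for a single scalar parameter (illustrative sketch)."""
    # Exponential moving average of squared gradients.
    avg_sq = decay * avg_sq + (1 - decay) * grad * grad
    # Divide the gradient by the root of that average before stepping.
    param = param - lr * grad / (math.sqrt(avg_sq) + eps)
    return param, avg_sq


# Gradients that differ by a factor of 100 give nearly the same step size.
p_small, v_small = rmsprop_step(param=1.0, grad=2.0, avg_sq=0.0)
p_large, v_large = rmsprop_step(param=1.0, grad=200.0, avg_sq=0.0)
step_small = 1.0 - p_small
step_large = 1.0 - p_large
```

The normalization is what makes the effective step size insensitive to the overall gradient scale, which helps when gradient magnitudes drift during training.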
- class eformer.optimizers._builders.SkewOptimizer(config: WhiteKronConfig)
Bases: OptimizerBuilder
Builder for Skew (White Kron with skew update) optimizer.
Skew is a Kronecker-factored preconditioned optimizer that uses the skew preconditioner update style. It provides efficient second-order optimization with a different update rule compared to the QUAD variant.
The skew update uses a Procrustes step to maintain orthogonality of the preconditioner, which can lead to more stable training in certain scenarios.
- config
Configuration object containing Skew optimizer hyperparameters including preconditioner settings, block size, sharding configurations, and numerical stability parameters.
- Type
WhiteKronConfig
Example
>>> from eformer.optimizers import WhiteKronConfig
>>> from eformer.optimizers._builders import SkewOptimizer
>>> import optax
>>> config = WhiteKronConfig(b1=0.95, preconditioner_lr=0.7)
>>> builder = SkewOptimizer(config=config)
>>> scheduler = optax.constant_schedule(1e-4)
>>> optimizer = builder.build(scheduler)
- build(scheduler: Callable[[Union[Array, ndarray, bool, number, float, int]], Union[Array, ndarray, bool, number, float, int]]) -> GradientTransformation
Build the Skew optimizer transformation.
- Parameters
scheduler (optax.Schedule) – Learning rate schedule to use for the optimizer.
- Returns
The Skew optimizer transformation using skew-style Kronecker-factored preconditioning with Procrustes orthogonalization for efficient second-order optimization.
- Return type
optax.GradientTransformation
- config: WhiteKronConfig