fishy.engine.muon

Muon Optimizer: Orthogonalized gradient descent. Adapted from the “Parameter Golf” challenge (modded-nanogpt).

Classes

class fishy.engine.muon.Muon(params, lr: float = 0.02, momentum: float = 0.95, backend_steps: int = 5, nesterov: bool = True)

Bases: Optimizer

Muon: An optimizer that orthogonalizes updates for matrix parameters. Typically used for Linear layer weights, while AdamW/SGD is used for vectors.
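One way to realize the split described above is to partition parameters by dimensionality. The sketch below is only an illustration: the helper name `split_for_muon` is hypothetical and not part of `fishy.engine.muon`, and it assumes that "matrix parameter" means exactly 2-D. Stand-in objects exposing only `.ndim` are used so that the example runs without PyTorch.

```python
from types import SimpleNamespace


def split_for_muon(named_params):
    """Split (name, param) pairs into 2-D matrix params and everything else.

    Hypothetical helper: it only relies on each param having an `.ndim` attribute.
    """
    matrix_params, other_params = [], []
    for name, p in named_params:
        # Assumption: a "matrix parameter" is exactly 2-D (e.g. a Linear weight).
        if p.ndim == 2:
            matrix_params.append((name, p))
        else:
            other_params.append((name, p))
    return matrix_params, other_params


# Stand-ins for real parameters; only .ndim matters for the split.
fake_params = [
    ("fc1.weight", SimpleNamespace(ndim=2)),
    ("fc1.bias", SimpleNamespace(ndim=1)),
    ("norm.weight", SimpleNamespace(ndim=1)),
    ("fc2.weight", SimpleNamespace(ndim=2)),
]

matrix_params, other_params = split_for_muon(fake_params)
matrix_names = [name for name, _ in matrix_params]
other_names = [name for name, _ in other_params]
```

With a real model, the same helper could be fed `model.named_parameters()`, with the matrix group handed to `Muon` and the remainder to an optimizer such as `torch.optim.AdamW`. Note that a purely shape-based rule also selects embedding tables, which are 2-D but are not Linear layer weights, so a real setup may need to filter by name as well.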

__init__(params, lr: float = 0.02, momentum: float = 0.95, backend_steps: int = 5, nesterov: bool = True)

step(closure=None)

Perform a single optimization step, updating the parameters.

Parameters: closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

Functions

fishy.engine.muon.zeropower_via_newtonschulz5(G: Tensor, steps: int = 10, eps: float = 1e-07) → Tensor

Orthogonalize a 2D update matrix with a fast Newton-Schulz iteration. Muon uses this to normalize matrix-shaped gradients before applying them.
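To make the iteration concrete, here is a minimal NumPy sketch of a quintic Newton-Schulz orthogonalization. It is not the `fishy` implementation: the function name `newton_schulz5_sketch` is hypothetical, the coefficients are assumed from the public modded-nanogpt Muon code, and NumPy in float64 is used only to keep the example dependency-light (the real function operates on a `Tensor`). The `steps` and `eps` parameters mirror the signature above.

```python
import numpy as np


def newton_schulz5_sketch(G, steps=10, eps=1e-7):
    """Approximately orthogonalize a 2-D array with a quintic Newton-Schulz iteration."""
    assert G.ndim == 2
    # Assumed coefficients (from the public modded-nanogpt Muon code).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.astype(np.float64)
    # Frobenius normalization bounds every singular value by 1.
    X = X / (np.linalg.norm(X) + eps)
    # Work in the wide orientation so X @ X.T is the smaller Gram matrix.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        # X <- a*X + b*(X X^T) X + c*(X X^T)^2 X
        X = a * X + B @ X
    return X.T if transposed else X


G = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0]])
X = newton_schulz5_sketch(G)
singular_values = np.linalg.svd(X, compute_uv=False)
```

With these assumed coefficients the iteration does not drive the singular values to exactly 1. They settle in a band around 1 (roughly 0.7 to 1.2 in this sketch), so the result is only approximately orthogonal. Note also that this function defaults to `steps=10`, whereas `Muon` defaults to `backend_steps=5`; presumably the optimizer passes its own value through.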