fishy.engine.muon
Muon Optimizer: Orthogonalized gradient descent. Adapted from the “Parameter Golf” challenge (modded-nanogpt).
Classes
- class fishy.engine.muon.Muon(params, lr: float = 0.02, momentum: float = 0.95, backend_steps: int = 5, nesterov: bool = True)
Bases: Optimizer
Muon: An optimizer that orthogonalizes updates for matrix parameters. Typically used for Linear layer weights, while AdamW or SGD is used for vector parameters.
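The split between matrix and vector parameters can be illustrated with a minimal, dependency-free sketch. The helper `split_params_by_shape` and the parameter names below are hypothetical and not part of `fishy.engine.muon`; the rule "exactly two dimensions means matrix" is an assumption made for illustration, and a real model may need a finer rule.

```python
def split_params_by_shape(named_shapes):
    """Partition parameter names into 2D (matrix) and everything else.

    named_shapes maps a parameter name to its shape tuple.
    Hypothetical helper for illustration only.
    """
    matrix_names, other_names = [], []
    for name, shape in named_shapes.items():
        if len(shape) == 2:
            matrix_names.append(name)   # candidates for Muon
        else:
            other_names.append(name)    # candidates for AdamW or SGD
    return matrix_names, other_names


# Made-up shapes resembling a small MLP with one normalization layer.
shapes = {
    "fc1.weight": (256, 128),
    "fc1.bias": (256,),
    "norm.weight": (128,),
    "fc2.weight": (10, 256),
}
muon_names, fallback_names = split_params_by_shape(shapes)
```

With PyTorch, the same test would likely be `p.ndim == 2` over `model.named_parameters()`, with the matrix group passed to `Muon` and the remaining parameters passed to a second optimizer such as `torch.optim.AdamW`.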
Functions
- fishy.engine.muon.zeropower_via_newtonschulz5(G: Tensor, steps: int = 10, eps: float = 1e-07) → Tensor
Orthogonalize a 2D update matrix with a fast Newton-Schulz iteration. Muon uses this to normalize matrix-shaped gradients before applying them.
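The iteration can be sketched in plain Python on lists of rows. This is not the source of `zeropower_via_newtonschulz5`: it assumes the quintic coefficients published with modded-nanogpt and the usual Frobenius-norm pre-scaling, and the function name `newton_schulz5` and its helpers are hypothetical. A real implementation would operate on tensors.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def newton_schulz5(g, steps=10, eps=1e-7):
    """Approximately orthogonalize the 2D matrix g (illustrative sketch)."""
    # Quintic coefficients as used in modded-nanogpt (assumed here).
    a, b, c = 3.4445, -4.7750, 2.0315

    # Scale by the Frobenius norm so every singular value is at most 1.
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / (norm + eps) for v in row] for row in g]

    # Work in the wide orientation so that X @ X^T is the smaller Gram matrix.
    tall = len(x) > len(x[0])
    if tall:
        x = transpose(x)

    for _ in range(steps):
        gram = matmul(x, transpose(x))                 # A = X X^T
        gram_sq = matmul(gram, gram)                   # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(gram, gram_sq)]      # B = b A + c A^2
        bx = matmul(poly, x)                           # B X
        x = [[a * u + w for u, w in zip(r1, r2)]
             for r1, r2 in zip(x, bx)]                 # X <- a X + B X

    return transpose(x) if tall else x


# A badly conditioned diagonal matrix: singular values 3.0 and 0.5.
result = newton_schulz5([[3.0, 0.0], [0.0, 0.5]])
```

With these coefficients the singular values do not converge exactly to 1. They are pushed into a band around 1, which is the sense in which the result is only approximately orthogonal.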