Igor Shmukler authored
* further lowered LR multiplier, improved LR calculation on resume
* updated config again
* pass RoPE relative position type to the decoder init
* fixed regex
* fixed OneCycle LR internal step counter
* fixed weight decay on biases
* tuning model configuration
* fixed sample length desync bug
* isolated stop-head optimization pressure
* extended training analysis script
* renamed analysis script
* updated README
* minor enhancements for training report
* spread heavy batches more evenly across the epoch to prevent clustering
* fixed a bug in dynamic batching code
* stop token parameters tuning
* lowered encoder FFN spike clip norm
* enhanced training analysis script
* updates to diagnostics script
* training analysis script improvements
* added post-step max weight-norm clamp for decoder.layers.0.ff.linear1.weight
* bumped patch version number
* fixed tests
* removed short analysis script, full one is better