I made a kernel 2.2x faster. It made my training loop 3x slower

Categories: Software Development, Performance Engineering, AI & Machine Learning
Source: kyrieblunders.bearblog.dev
Tags: PyTorch, PyTorch Profiler, CUDA, kernels, attention, KV cache, tensor operations

Author: vishal-padia

Date: 6/2/2026

Article Summary:
The author describes optimizing a PyTorch implementation of GRPO (Group Relative Policy Optimization), a reinforcement learning (RL) algorithm, applied to a language model. They focus on the generate step, which is the slowest part of the RL pipeline, and achieve a 4.8× speedup by using a static KV cache.
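
For readers unfamiliar with the term, here is a minimal sketch of what a static KV cache typically looks like in PyTorch. It is not the author's code: the `StaticKVCache` class, the `attend` helper, and all shapes are hypothetical and chosen only for illustration. The idea shown is that the key/value buffers are allocated once at a fixed maximum length and written in place at each decode step, with a mask hiding the unfilled slots, instead of growing the cache with `torch.cat` on every token.

```python
import torch


class StaticKVCache:
    """Hypothetical fixed-size key/value buffers, written in place."""

    def __init__(self, batch, heads, max_len, head_dim, dtype=torch.float32):
        shape = (batch, heads, max_len, head_dim)
        self.k = torch.zeros(shape, dtype=dtype)  # allocated once
        self.v = torch.zeros(shape, dtype=dtype)
        self.pos = 0  # number of positions filled so far

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, heads, t, head_dim) for the t newest tokens.
        t = k_new.shape[2]
        if self.pos + t > self.k.shape[2]:
            raise ValueError("cache is full")
        self.k[:, :, self.pos:self.pos + t] = k_new  # in-place write
        self.v[:, :, self.pos:self.pos + t] = v_new
        self.pos += t


def attend(q, k, v, valid_len):
    # Scaled dot-product attention over the whole buffer; slots at or
    # beyond valid_len are masked out so they get zero weight.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    mask = torch.arange(k.shape[2]) < valid_len
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
cache = StaticKVCache(batch=1, heads=2, max_len=8, head_dim=4)
ks, vs = [], []
for _ in range(5):  # five single-token decode steps
    q, k_new, v_new = (torch.randn(1, 2, 1, 4) for _ in range(3))

    # Static path: tensor shapes stay the same at every step.
    cache.update(k_new, v_new)
    out_static = attend(q, cache.k, cache.v, cache.pos)

    # Dynamic reference: the cache is reallocated and grows each step.
    ks.append(k_new)
    vs.append(v_new)
    k_dyn, v_dyn = torch.cat(ks, dim=2), torch.cat(vs, dim=2)
    out_dyn = attend(q, k_dyn, v_dyn, k_dyn.shape[2])
```

Both paths compute the same attention output up to floating-point tolerance. The static version trades some wasted compute on masked slots for fixed shapes and no per-step allocation, which is generally what makes it friendlier to `torch.compile` and CUDA graphs. Whether that accounts for the speedup reported in the article is not something this sketch establishes.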