Zhihu Frontier
🚀 "Quantization is not a compromise; it's the next paradigm."

After K2-Thinking's release, many developers have been curious about its native INT4 quantization format. 刘少伟 (Liu Shaowei), infra engineer at @Kimi_Moonshot and Zhihu contributor, shares an insider's view on why this choice matters, and why quantization today isn't just about sacrificing precision for speed.

💡 Key idea
For LLMs, quantization is no longer a trade-off. As parameter scaling and test-time scaling evolve, native low-bit quantization will become a standard paradigm for large-model training.

🤔 Why low-bit quantization matters
Modern LLM inference has two distinct optimization goals:
• High throughput (cost-oriented): maximize GPU utilization via large batch sizes.
• Low latency (user-oriented): minimize per-query response time.

With Kimi K2's MoE structure (1/48 sparsity), decoding is memory-bound: the less weight data each step has to read, the faster it runs. At FP8, the weights (≈1 TB) already hit the limit of what a single high-speed-interconnect GPU node can hold. ⚠️ Switching to W4A16 cuts latency sharply while maintaining quality, a perfect fit for low-latency inference.

🔍 Why QAT over PTQ
Post-training quantization (PTQ) worked well for shorter generations but broke down on long reasoning chains:
• Quantization error accumulated over long decoding, degrading precision.
• Dependence on calibration data caused "expert distortion" in sparse MoE layers.
‼️ K2-Thinking therefore adopted quantization-aware training (QAT) for minimal loss and more stable long-context reasoning.

🧠 How it works
K2-Thinking uses weight-only QAT with fake quantization and a straight-through estimator (STE). The full pipeline (QAT training → INT4 inference → RL rollout) was integrated in just days, enabling near-lossless results without extra tokens or retraining.

⚡ INT4's hidden advantage in RL
Few people mention this: native INT4 doesn't just speed up inference; it accelerates RL training itself.
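The fake-quantization + STE idea can be sketched in a few lines. This is a minimal NumPy illustration of the general technique, not Kimi's actual implementation; the function names and the symmetric per-group scaling are assumptions.

```python
import numpy as np

# Sketch of weight-only fake quantization with a straight-through
# estimator (STE): the forward pass sees INT4-rounded weights, while
# the backward pass treats the rounding as identity.

def fake_quant_int4(w, group=32):
    """Quantize->dequantize each group of weights symmetrically to INT4,
    so training sees the quantization error. Assumes w.size % group == 0."""
    g = w.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0  # map max |w| to level 7
    scale = np.where(scale == 0, 1.0, scale)            # guard all-zero groups
    q = np.clip(np.round(g / scale), -8, 7)             # INT4 codes in [-8, 7]
    return (q * scale).reshape(w.shape), scale

def ste_grad(grad_out, w, scale, group=32):
    """STE backward: pass gradients through the rounding unchanged,
    zeroing them only where the forward pass clipped the value."""
    r = np.round(w.reshape(-1, group) / scale)
    inside = (r >= -8) & (r <= 7)
    return (grad_out.reshape(-1, group) * inside).reshape(w.shape)
```

With a per-group scale, the dequantized weights differ from the originals by at most half a quantization step, which is the error QAT trains the model to absorb.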
Because RL rollouts often suffer from "long-tail" inefficiency, INT4's low-latency profile makes those stages much faster; in practice, each end-to-end RL iteration runs 10–20% faster. Quantized RL also brings stability: the smaller representational space reduces accumulated error, improving learning robustness.

🔩 Why INT4, not MXFP4
Kimi chose INT4 over the "fancier" MXFP4/NVFP4 formats to better support non-Blackwell GPUs, which already have strong INT4 kernel support (e.g., Marlin). At a quantization granularity of 1×32 (one scale per 32 weights), INT4 matches the FP4 formats in expressiveness while being more hardware-adaptable.

🧭 Looking forward
W4A16 is just the beginning; W4A8 and even W4A4 are on the horizon. As new chips ship with FP4-native operators, Kimi's quantization path will keep evolving.

"In the LLM age, quantization stands alongside SOTA and Frontier. It's not a patch; it's how we'll reach the frontier faster."

📖 Full article (in Chinese): https://www.zhihu.com/question/1969558404759544488/answer/1970539327902679960

#KimiK2Thinking #INT4 #Quantization #LLM #Infra #RLHF
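For intuition on the 1×32 granularity and the memory numbers above, here is a back-of-the-envelope sketch. It is illustrative only (the scale dtype and packing details are assumptions, not Kimi's actual format): with one FP16 scale shared by every 32 INT4 codes, storage works out to 4 + 16/32 = 4.5 bits per weight, versus 8 bits for FP8.

```python
import numpy as np

def bits_per_weight(q_bits=4, scale_bits=16, group=32):
    """Effective storage cost of group-quantized weights:
    the code itself plus an amortized share of the group scale."""
    return q_bits + scale_bits / group

def int4_group_quant(w, group=32):
    """Symmetric per-group INT4 quantization with FP16 scales
    (illustrative 1x32 layout). Assumes w.size % group == 0."""
    g = w.reshape(-1, group)
    scale = (np.abs(g).max(axis=1, keepdims=True) / 7.0).astype(np.float16)
    scale = np.where(scale == 0, np.float16(1), scale)
    q = np.clip(np.round(g / scale.astype(np.float64)), -8, 7).astype(np.int8)
    return q, scale  # 4-bit codes (packed two per byte in practice) + scales

# For a ~1T-parameter model: 1e12 * 4.5 bits ≈ 0.56 TB of weights,
# roughly half the ≈1 TB FP8 footprint cited above.
```

This is why the post can claim near-FP4 expressiveness: each 32-weight group gets its own dynamic range, at only 0.5 extra bits per weight of overhead.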