INFO:
练习两天半,从零实现DeepSeek-R1(基于Qwen2.5-0.5B和规则奖励模型,GRPO),从原理讲解到代码实现,解开DeepSeek-R1的神秘面纱