References

  1. Jay Mody. GPT in 60 Lines of NumPy. https://jaykmody.com/blog/gpt-from-scratch/ . Original code: https://github.com/jaymody/picoGPT ; Chinese translation: https://jiqihumanr.github.io/2023/04/13/gpt-from-scratch/ . 2023.
  2. Jay Alammar. The Illustrated GPT-2 (Visualizing Transformer Language Models). http://jalammar.github.io/illustrated-gpt2/ .
  3. Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/ . 2020.
  4. Andrej Karpathy. Let's build GPT. https://www.youtube.com/watch?v=kCc8FmEb1nY . 2023. Code: https://github.com/karpathy/nanoGPT .
  5. Why certain algorithmic choices are used in GPT/Transformer (in Chinese). https://zhuanlan.zhihu.com/p/559495068 .
  6. Quantitative estimates of Transformer memory and compute (a back-of-the-envelope sketch follows this list).
    1. Analyzing the parameter count, compute, intermediate activations, and KV cache of Transformer models (in Chinese). https://zhuanlan.zhihu.com/p/624740065
    2. Transformer Math 101. https://blog.eleuther.ai/transformer-math/
    3. Transformer Inference Arithmetic. https://kipp.ly/transformer-inference-arithmetic/
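
A minimal back-of-the-envelope sketch of the rules of thumb these estimation references cover: parameters ≈ 12 · n_layer · d_model² plus the embedding matrix, ≈ 2N FLOPs per token for a forward pass, ≈ 6N FLOPs per token for a training step, and an fp16 KV cache of 2 · n_layer · d_model values per token. The function name and the example shape (roughly LLaMA-7B-sized) are illustrative, not taken from any of the linked posts.

```python
# Back-of-the-envelope estimates for a decoder-only Transformer, following
# the standard rules of thumb (see "Transformer Math 101" and the Zhihu
# post above).  These are estimates, not exact counts.

def transformer_estimates(n_layer, d_model, vocab_size,
                          batch=1, seq_len=2048, bytes_per_elem=2):
    # Parameters: ~12 * d_model^2 per layer (4*d_model^2 for attention,
    # 8*d_model^2 for a 4x-wide MLP), plus the token embedding matrix.
    n_params = 12 * n_layer * d_model ** 2 + vocab_size * d_model

    # KV cache: one K and one V vector of size d_model per layer per token.
    kv_cache_bytes = 2 * n_layer * d_model * bytes_per_elem * batch * seq_len

    # FLOPs per token: ~2N for a forward pass, ~6N for a training step
    # (forward + backward), ignoring the attention-score term.
    flops_forward = 2 * n_params
    flops_train = 6 * n_params
    return n_params, kv_cache_bytes, flops_forward, flops_train


if __name__ == "__main__":
    # Illustrative, roughly LLaMA-7B-sized shape.
    params, kv, fwd, train = transformer_estimates(
        n_layer=32, d_model=4096, vocab_size=32000, batch=1, seq_len=2048)
    print(f"params            ~ {params / 1e9:.2f} B")
    print(f"KV cache          ~ {kv / 2**30:.2f} GiB (fp16, batch=1, seq=2048)")
    print(f"forward FLOPs/tok ~ {fwd / 1e9:.1f} GFLOPs")
    print(f"train FLOPs/tok   ~ {train / 1e9:.1f} GFLOPs")
```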

Further Reading

  1. The Annotated Transformer. http://nlp.seas.harvard.edu/annotated-transformer/
  2. Transformers Explained Visually (Part 3): Multi-head Attention, deep dive. https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853 .
  3. Why multi-head self attention works. https://theaisummer.com/self-attention/ .
  4. 梁德澎. RoPE: a one-article guide to the rotary position embedding used in LLaMA (in Chinese). https://mp.weixin.qq.com/s/0peSNWN0ypMopPR0Q_pujQ
  5. LLM inference: from model analysis to compute optimization, part 1 (in Chinese). https://mp.weixin.qq.com/s/VaRvrtcNRLzDntE6fPJSIw
  6. LLM inference: from model analysis to compute optimization, part 2 (in Chinese). https://mp.weixin.qq.com/s/tlGtr1fOTFElTuGHKyHKgQ
  7. vllm. https://github.com/vllm-project/vllm
  8. 4-bit Quantization with GPTQ.
    1. https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34
    2. GPTQ paper: https://arxiv.org/abs/2210.17323 . Code: https://github.com/IST-DASLab/gptq
  9. Large Transformer Model Inference Optimization. https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
  10. FlashAttention-2 (faster attention). Paper: https://tridao.me/publications/flash2/flash2.pdf .
    1. How the FlashAttention optimizations are implemented (in Chinese). https://www.zhihu.com/question/611236756/answer/3132304304
    2. Blog: https://princeton-nlp.github.io/flash-atttention-2/ .
    3. Blog (ELI5: FlashAttention): https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad . 2023-07.
  11. Grouped-query attention (GQA). Used in LLaMA 2 as a replacement for MHA; it speeds up inference without degrading quality (see the NumPy sketch after this list). Paper: https://arxiv.org/abs/2305.13245 .
    1. The corresponding implementation in flash-attention: https://github.com/Dao-AILab/flash-attention/blob/d1a3b52f17b914c93bf740654387b566a7330687/flash_attn/flash_attn_interface.py#L385 .
    2. An implementation of GQA in PyTorch: https://github.com/Oneflow-Inc/megatron-lm/pull/73/files
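
A minimal NumPy sketch of the GQA idea referenced in item 11: n_head query heads share n_kv_head key/value heads (n_kv_head < n_head), so the KV cache and the K/V projections shrink by a factor of n_head / n_kv_head while the per-head attention computation is unchanged. Names and shapes here are illustrative; this is not the flash-attention kernel or the Megatron-LM patch linked above.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(q, k, v, n_head, n_kv_head):
    """q: [seq, n_head*d_head]; k, v: [seq, n_kv_head*d_head] (the smaller KV cache)."""
    seq = q.shape[0]
    d_head = q.shape[-1] // n_head
    group = n_head // n_kv_head          # query heads per shared K/V head

    q = q.reshape(seq, n_head, d_head)
    k = k.reshape(seq, n_kv_head, d_head)
    v = v.reshape(seq, n_kv_head, d_head)

    # Broadcast each K/V head to its group of query heads:
    # [seq, n_kv_head, d_head] -> [seq, n_head, d_head].
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)

    # Plain causal scaled-dot-product attention per head.
    mask = np.triu(np.full((seq, seq), -1e10), k=1)
    out = np.empty_like(q)
    for h in range(n_head):
        scores = q[:, h] @ k[:, h].T / np.sqrt(d_head) + mask
        out[:, h] = softmax(scores) @ v[:, h]
    return out.reshape(seq, n_head * d_head)

if __name__ == "__main__":
    seq, n_head, n_kv_head, d_head = 8, 8, 2, 16
    rng = np.random.default_rng(0)
    q = rng.standard_normal((seq, n_head * d_head))
    k = rng.standard_normal((seq, n_kv_head * d_head))
    v = rng.standard_normal((seq, n_kv_head * d_head))
    print(grouped_query_attention(q, k, v, n_head, n_kv_head).shape)  # (8, 128)
```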