References
- [Done] Jay Mody. GPT in 60 Lines of NumPy. https://jaykmody.com/blog/gpt-from-scratch/ . 2023. Original code: https://github.com/jaymody/picoGPT ; Chinese translation: https://jiqihumanr.github.io/2023/04/13/gpt-from-scratch/ .
- [Done] Jay Alammar. The Illustrated GPT-2 (Visualizing Transformer Language Models). http://jalammar.github.io/illustrated-gpt2/ .
- [Done] Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/ . 2020.
- [Done] Andrej Karpathy. Let's build GPT. https://www.youtube.com/watch?v=kCc8FmEb1nY . 2023. Code: https://github.com/karpathy/nanoGPT .
- [Done] Reasons behind some of the algorithmic choices in GPT/Transformer. https://zhuanlan.zhihu.com/p/559495068 .
- Quantitative estimates of Transformer memory and compute (a back-of-the-envelope sketch follows this list).
    - Analyzing a Transformer's parameter count, compute, intermediate activations, and KV cache. https://zhuanlan.zhihu.com/p/624740065
    - Transformer Math 101. https://blog.eleuther.ai/transformer-math/
    - Transformer Inference Arithmetic. https://kipp.ly/transformer-inference-arithmetic/
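
The posts above converge on a few rules of thumb: roughly 12·n_layer·d_model² parameters for a decoder-only model (excluding embeddings), about 2 FLOPs per parameter per token for a forward pass, and 2·n_layer·d_model cached K/V values per token. A minimal sketch of those estimates; the function name `transformer_estimates` and the configuration (roughly GPT-3 13B scale) are illustrative assumptions, not taken from any single post:

```python
# Back-of-the-envelope estimates for a decoder-only Transformer.
def transformer_estimates(n_layer, d_model, n_vocab, seq_len, batch, bytes_per_elem=2):
    # Each block holds ~4*d^2 attention params + ~8*d^2 MLP params (4x hidden),
    # i.e. ~12*d^2 per layer; embeddings add n_vocab*d.
    n_params = 12 * n_layer * d_model ** 2 + n_vocab * d_model

    # Forward pass: ~2 FLOPs per parameter per token (one multiply + one add
    # per weight in each matmul); training costs roughly 3x this per token.
    flops_per_token_fwd = 2 * n_params

    # KV cache: one K and one V vector of size d_model per layer per token.
    kv_cache_bytes = 2 * n_layer * d_model * seq_len * batch * bytes_per_elem

    return n_params, flops_per_token_fwd, kv_cache_bytes


n_params, flops, kv = transformer_estimates(
    n_layer=40, d_model=5120, n_vocab=50257, seq_len=2048, batch=1
)
print(f"params          ~ {n_params / 1e9:.1f} B")
print(f"fwd FLOPs/token ~ {flops / 1e9:.1f} GFLOPs")
print(f"KV cache (fp16) ~ {kv / 2**30:.2f} GiB")
```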
Further reading
- The Annotated Transformer. http://nlp.seas.harvard.edu/annotated-transformer/
- Transformers Explained Visually (Part 3): Multi-head Attention, deep dive. https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853 .
- Why multi-head self attention works. https://theaisummer.com/self-attention/ .
- 梁德澎. RoPE: a one-article guide to the rotary position embedding in LLaMA. https://mp.weixin.qq.com/s/0peSNWN0ypMopPR0Q_pujQ
- Large-model inference: from model analysis to compute optimization (Part 1). https://mp.weixin.qq.com/s/VaRvrtcNRLzDntE6fPJSIw
- Large-model inference: from model analysis to compute optimization (Part 2). https://mp.weixin.qq.com/s/tlGtr1fOTFElTuGHKyHKgQ
- vllm. https://github.com/vllm-project/vllm
- 4-bit Quantization with GPTQ. https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34
    - GPTQ paper: https://arxiv.org/abs/2210.17323 ; code: https://github.com/IST-DASLab/gptq .
- Large Transformer Model Inference Optimization. https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- FlashAttention2 (faster attention). Paper: https://tridao.me/publications/flash2/flash2.pdf .
    - How the FlashAttention optimization is implemented: https://www.zhihu.com/question/611236756/answer/3132304304 .
    - Blog: https://princeton-nlp.github.io/flash-atttention-2/ .
    - Blog 2: https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad . 2023-07.
- Grouped-query attention (GQA). Used in LLaMA 2 as a replacement for MHA; improves inference speed without hurting quality (see the sketch after this list). Paper: https://arxiv.org/abs/2305.13245 .
    - Related implementation in flash-attention: https://github.com/Dao-AILab/flash-attention/blob/d1a3b52f17b914c93bf740654387b566a7330687/flash_attn/flash_attn_interface.py#L385 .
    - A GQA implementation in PyTorch: https://github.com/Oneflow-Inc/megatron-lm/pull/73/files .
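
The GQA entry above only states the idea; here is a minimal PyTorch sketch of it (the helper name `gqa`, the shapes, and the head counts are illustrative assumptions, not code from the linked repos). The inference saving comes from the K/V projections and the KV cache shrinking by a factor of n_head / n_kv_head:

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v):
    # q: (batch, n_head, seq, head_dim); k, v: (batch, n_kv_head, seq, head_dim).
    # Each group of n_head // n_kv_head query heads shares one K/V head.
    n_head, n_kv_head = q.shape[1], k.shape[1]
    group = n_head // n_kv_head
    k = k.repeat_interleave(group, dim=1)   # broadcast K/V heads to query heads
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# MHA is the special case n_kv_head == n_head; MQA is n_kv_head == 1.
q = torch.randn(1, 32, 128, 64)
k = torch.randn(1, 8, 128, 64)   # 8 K/V heads -> 4x smaller KV cache than MHA
v = torch.randn(1, 8, 128, 64)
out = gqa(q, k, v)               # (1, 32, 128, 64)
```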