References

  1. Jay Mody. GPT in 60 Lines of NumPy. https://jaykmody.com/blog/gpt-from-scratch/ . Original code: https://github.com/jaymody/picoGPT ; Chinese translation: https://jiqihumanr.github.io/2023/04/13/gpt-from-scratch/ . 2023.
  2. Jay Alammar. The Illustrated GPT-2 (Visualizing Transformer Language Models). http://jalammar.github.io/illustrated-gpt2/ .
  3. Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/ . 2020.
  4. Andrej Karpathy. Let's build GPT. https://www.youtube.com/watch?v=kCc8FmEb1nY . 2023. Code: https://github.com/karpathy/nanoGPT .
  5. Why certain algorithmic choices are used in GPT/Transformer (in Chinese). https://zhuanlan.zhihu.com/p/559495068 .
  6. Quantitative estimates of Transformer memory and compute (a back-of-the-envelope sketch follows this list).
    1. Analyzing the parameter count, compute, intermediate activations, and KV cache of Transformer models (in Chinese). https://zhuanlan.zhihu.com/p/624740065
    2. Transformer Math 101. https://blog.eleuther.ai/transformer-math/
    3. Transformer Inference Arithmetic. https://kipp.ly/transformer-inference-arithmetic/
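
A minimal back-of-the-envelope sketch of the rules of thumb these estimation references cover: parameters ≈ 12 · n_layer · d_model² plus the embedding matrix, ≈ 2N FLOPs per token for a forward pass, ≈ 6N FLOPs per token for a training step, and an fp16 KV cache of 2 · n_layer · d_model values per token. The function name and the example shape (roughly LLaMA-7B-sized) are illustrative, not taken from any of the linked posts.

```python
# Back-of-the-envelope estimates for a decoder-only Transformer, following
# the standard rules of thumb (see "Transformer Math 101" and the Zhihu
# post above).  These are estimates, not exact counts.

def transformer_estimates(n_layer, d_model, vocab_size,
                          batch=1, seq_len=2048, bytes_per_elem=2):
    # Parameters: ~12 * d_model^2 per layer (4*d_model^2 for attention,
    # 8*d_model^2 for a 4x-wide MLP), plus the token embedding matrix.
    n_params = 12 * n_layer * d_model ** 2 + vocab_size * d_model

    # KV cache: one K and one V vector of size d_model per layer per token.
    kv_cache_bytes = 2 * n_layer * d_model * bytes_per_elem * batch * seq_len

    # FLOPs per token: ~2N for a forward pass, ~6N for a training step
    # (forward + backward), ignoring the attention-score term.
    flops_forward = 2 * n_params
    flops_train = 6 * n_params
    return n_params, kv_cache_bytes, flops_forward, flops_train


if __name__ == "__main__":
    # Illustrative, roughly LLaMA-7B-sized shape.
    params, kv, fwd, train = transformer_estimates(
        n_layer=32, d_model=4096, vocab_size=32000, batch=1, seq_len=2048)
    print(f"params            ~ {params / 1e9:.2f} B")
    print(f"KV cache          ~ {kv / 2**30:.2f} GiB (fp16, batch=1, seq=2048)")
    print(f"forward FLOPs/tok ~ {fwd / 1e9:.1f} GFLOPs")
    print(f"train FLOPs/tok   ~ {train / 1e9:.1f} GFLOPs")
```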

Further Reading

  1. The Annotated Transformer. http://nlp.seas.harvard.edu/annotated-transformer/
  2. Transformers Explained Visually (Part 3): Multi-head Attention, deep dive. https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853 .
  3. Why multi-head self attention works. https://theaisummer.com/self-attention/ .
  4. 梁德澎. RoPE: a one-article guide to the rotary position embedding used in LLaMA (in Chinese). https://mp.weixin.qq.com/s/0peSNWN0ypMopPR0Q_pujQ
  5. LLM inference: from model analysis to compute optimization, part 1 (in Chinese). https://mp.weixin.qq.com/s/VaRvrtcNRLzDntE6fPJSIw
  6. LLM inference: from model analysis to compute optimization, part 2 (in Chinese). https://mp.weixin.qq.com/s/tlGtr1fOTFElTuGHKyHKgQ
  7. vllm. https://github.com/vllm-project/vllm
  8. 4-bit Quantization with GPTQ.
    1. https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34
    2. GPTQ paper: https://arxiv.org/abs/2210.17323 . Code: https://github.com/IST-DASLab/gptq
  9. Large Transformer Model Inference Optimization. https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
  10. FlashAttention-2 (faster attention). Paper: https://tridao.me/publications/flash2/flash2.pdf .
    1. How the FlashAttention optimizations are implemented (in Chinese). https://www.zhihu.com/question/611236756/answer/3132304304
    2. Blog: https://princeton-nlp.github.io/flash-atttention-2/ .
    3. Blog (ELI5: FlashAttention): https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad . 2023-07.
  11. Grouped-query attention (GQA). Used in LLaMA 2 as a replacement for MHA; it speeds up inference without degrading quality (see the NumPy sketch after this list). Paper: https://arxiv.org/abs/2305.13245 .
    1. The corresponding implementation in flash-attention: https://github.com/Dao-AILab/flash-attention/blob/d1a3b52f17b914c93bf740654387b566a7330687/flash_attn/flash_attn_interface.py#L385 .
    2. An implementation of GQA in PyTorch: https://github.com/Oneflow-Inc/megatron-lm/pull/73/files
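
A minimal NumPy sketch of the GQA idea referenced in item 11: n_head query heads share n_kv_head key/value heads (n_kv_head < n_head), so the KV cache and the K/V projections shrink by a factor of n_head / n_kv_head while the per-head attention computation is unchanged. Names and shapes here are illustrative; this is not the flash-attention kernel or the Megatron-LM patch linked above.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(q, k, v, n_head, n_kv_head):
    """q: [seq, n_head*d_head]; k, v: [seq, n_kv_head*d_head] (the smaller KV cache)."""
    seq = q.shape[0]
    d_head = q.shape[-1] // n_head
    group = n_head // n_kv_head          # query heads per shared K/V head

    q = q.reshape(seq, n_head, d_head)
    k = k.reshape(seq, n_kv_head, d_head)
    v = v.reshape(seq, n_kv_head, d_head)

    # Broadcast each K/V head to its group of query heads:
    # [seq, n_kv_head, d_head] -> [seq, n_head, d_head].
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)

    # Plain causal scaled-dot-product attention per head.
    mask = np.triu(np.full((seq, seq), -1e10), k=1)
    out = np.empty_like(q)
    for h in range(n_head):
        scores = q[:, h] @ k[:, h].T / np.sqrt(d_head) + mask
        out[:, h] = softmax(scores) @ v[:, h]
    return out.reshape(seq, n_head * d_head)

if __name__ == "__main__":
    seq, n_head, n_kv_head, d_head = 8, 8, 2, 16
    rng = np.random.default_rng(0)
    q = rng.standard_normal((seq, n_head * d_head))
    k = rng.standard_normal((seq, n_kv_head * d_head))
    v = rng.standard_normal((seq, n_kv_head * d_head))
    print(grouped_query_attention(q, k, v, n_head, n_kv_head).shape)  # (8, 128)
```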