The article also highlights a bug in the `merge` method that causes undesired behavior when a character is repeated more than twice: 'aaa' is not handled properly because the triple contains two overlapping 'aa' pairs. However, the bug does not appear to have a significant effect on training the vocabulary. The author also lists several tasks for future development, including training on Project Gutenberg, adding PyTorch support for the `encode` method, adding MPS device support for MacBooks, and possibly fixing the repeated-characters bug. The code is licensed under MIT.
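The article does not show the fork's actual merge code, but the failure mode is easy to reproduce. The sketch below is a hypothetical vectorized merge (the names and logic are illustrative, not taken from the repository) that marks every occurrence of a pair at once, including overlapping ones; on 'aaa' it collapses the whole run into a single token, whereas the reference sequential merge from minbpe produces the merged token followed by the leftover 'a'.

```python
import torch

def merge_sequential(ids, pair, idx):
    # Reference left-to-right merge (mirrors minbpe's pure-Python logic):
    # after consuming a match, skip past both elements so overlaps are safe.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def merge_vectorized_naive(ids, pair, idx):
    # Hypothetical GPU-friendly merge: find every position where the pair
    # starts, overwrite the left element with the new token, drop the right.
    # Overlapping matches (e.g. in 'aaa') are all marked at once, so an
    # entire run of a repeated character collapses into one token.
    is_pair = (ids[:-1] == pair[0]) & (ids[1:] == pair[1])
    out = ids.clone()
    out[:-1][is_pair] = idx          # write the merged token over the left half
    keep = torch.ones_like(ids, dtype=torch.bool)
    keep[1:] &= ~is_pair             # drop the right half of every match
    return out[keep]

ids = torch.tensor([ord(c) for c in "aaa"])
print(merge_sequential(ids.tolist(), (97, 97), 256))        # [256, 97]
print(merge_vectorized_naive(ids, (97, 97), 256).tolist())  # [256]
```

Because BPE merges are greedy and left to right, the sequential version is the ground truth; any batched implementation has to resolve overlapping matches (for example, by keeping only the first match in each run) to reproduce it exactly.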
Key takeaways:
- The minbpe-pytorch code provides a minimal, clean implementation of the Byte Pair Encoding (BPE) algorithm used in LLM tokenization, with added PyTorch/CUDA training support.
- It offers a significant speedup in training time compared to the original code, taking only 2min 28sec on an RTX 4090 to train the BasicTokenizer with a vocab_size of 512 on 307MB of Enron emails (see the usage sketch after this list).
- The `merge` step has a bug where runs of a repeated character are not handled properly, collapsing an entire run into a single token, but this doesn't seem to significantly affect vocab training.
- Future improvements include training on Project Gutenberg, adding PyTorch support for the encode method, adding MPS device support for MacBooks, and potentially fixing the repeated characters bug.
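For context, here is a minimal training sketch in the style of minbpe's README. It assumes the fork keeps the upstream BasicTokenizer API (`train`, `encode`, `decode`, `save`) and exposes the training device via a `device` keyword; that keyword, the file paths, and the Enron file name are illustrative assumptions, not details confirmed by the article.

```python
# Minimal sketch, assuming the fork keeps minbpe's BasicTokenizer API and
# accepts a device argument for training; the device keyword and file paths
# below are illustrative assumptions.
import torch
from minbpe import BasicTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# e.g. the 307MB Enron email dump mentioned in the article (path is assumed)
with open("enron_emails.txt", "r", encoding="utf-8") as f:
    text = f.read()

tokenizer = BasicTokenizer()
tokenizer.train(text, vocab_size=512, device=device)  # 256 byte tokens + 256 merges

print(tokenizer.encode("hello world"))
print(tokenizer.decode(tokenizer.encode("hello world")))
tokenizer.save("enron512")  # upstream minbpe writes enron512.model / enron512.vocab
```

Note that, per the to-do list above, `encode` does not yet have a PyTorch path, so encoding large corpora still runs on the CPU even when training uses CUDA.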