The article also highlights a bug in the `merge` method that causes undesired behavior when a character is repeated more than twice: 'aaa' is not handled properly because the triple contains two overlapping 'aa' pairs. However, the bug does not appear to have a significant effect on training the vocabulary. The author also lists several tasks for future development, including training on Project Gutenberg, adding PyTorch support for the `encode` method, adding MPS device support for MacBooks, and possibly fixing the repeated-characters bug. The code is licensed under MIT.
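The article does not show the fork's actual merge code, but the failure mode is easy to reproduce. The sketch below is a hypothetical vectorized merge (the names and logic are illustrative, not taken from the repository) that marks every occurrence of a pair at once, including overlapping ones; on 'aaa' it collapses the whole run into a single token, whereas the reference sequential merge from minbpe produces the merged token followed by the leftover 'a'.

```python
import torch

def merge_sequential(ids, pair, idx):
    # Reference left-to-right merge (mirrors minbpe's pure-Python logic):
    # after consuming a match, skip past both elements so overlaps are safe.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def merge_vectorized_naive(ids, pair, idx):
    # Hypothetical GPU-friendly merge: find every position where the pair
    # starts, overwrite the left element with the new token, drop the right.
    # Overlapping matches (e.g. in 'aaa') are all marked at once, so an
    # entire run of a repeated character collapses into one token.
    is_pair = (ids[:-1] == pair[0]) & (ids[1:] == pair[1])
    out = ids.clone()
    out[:-1][is_pair] = idx          # write the merged token over the left half
    keep = torch.ones_like(ids, dtype=torch.bool)
    keep[1:] &= ~is_pair             # drop the right half of every match
    return out[keep]

ids = torch.tensor([ord(c) for c in "aaa"])
print(merge_sequential(ids.tolist(), (97, 97), 256))        # [256, 97]
print(merge_vectorized_naive(ids, (97, 97), 256).tolist())  # [256]
```

Because BPE merges are greedy and left to right, the sequential version is the ground truth; any batched implementation has to resolve overlapping matches (for example, by keeping only the first match in each run) to reproduce it exactly.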
Key takeaways:
- The minbpe-pytorch code provides a minimal, clean implementation of the Byte Pair Encoding (BPE) algorithm used in LLM tokenization, with added PyTorch/CUDA training support.
- It offers a significant speedup in training time compared to the original code, taking only 2min 28sec on an RTX 4090 to train the BasicTokenizer with a vocab_size of 512 on 307MB of Enron emails (see the usage sketch after this list).
- The `merge` step has a bug where runs of a repeated character are not handled properly, collapsing an entire run into a single token, but this doesn't seem to significantly affect vocab training.
- Future improvements include training on Project Gutenberg, adding PyTorch support for the encode method, adding MPS device support for MacBooks, and potentially fixing the repeated characters bug.
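For context, here is a minimal training sketch in the style of minbpe's README. It assumes the fork keeps the upstream BasicTokenizer API (`train`, `encode`, `decode`, `save`) and exposes the training device via a `device` keyword; that keyword, the file paths, and the Enron file name are illustrative assumptions, not details confirmed by the article.

```python
# Minimal sketch, assuming the fork keeps minbpe's BasicTokenizer API and
# accepts a device argument for training; the device keyword and file paths
# below are illustrative assumptions.
import torch
from minbpe import BasicTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# e.g. the 307MB Enron email dump mentioned in the article (path is assumed)
with open("enron_emails.txt", "r", encoding="utf-8") as f:
    text = f.read()

tokenizer = BasicTokenizer()
tokenizer.train(text, vocab_size=512, device=device)  # 256 byte tokens + 256 merges

print(tokenizer.encode("hello world"))
print(tokenizer.decode(tokenizer.encode("hello world")))
tokenizer.save("enron512")  # upstream minbpe writes enron512.model / enron512.vocab
```

Note that, per the to-do list above, `encode` does not yet have a PyTorch path, so encoding large corpora still runs on the CPU even when training uses CUDA.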