minGPT
This is a walkthrough of Karpathy's minGPT implementation.
bpe.py
This file contains code to implement a Byte Pair Encoding encoder. It does not contain code for training; it just loads OpenAI's GPT-2 BPE encoding for inference.
Most text is represented in UTF-8 encoding, which is just a sequence of bytes (values 0 to 255). For example, `0x21` corresponds to 33 in decimal, which corresponds to the character `!`. This means that all text can be treated as a sequence of byte values.
As a fallback, we first need token representations for individual byte values (in case we encounter out of vocab tokens with unknown byte sequences).
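A quick way to see this byte-level view, in plain Python (nothing minGPT-specific is assumed here):

```python
text = "Hi! 🚀"
raw = text.encode("utf-8")   # the UTF-8 bytes behind the string
print(list(raw))             # [72, 105, 33, 32, 240, 159, 154, 128]
print(hex(raw[2]))           # 0x21 -> 33 -> '!'
# Even an emoji the BPE merges have never seen can still be encoded,
# one byte-level token at a time.
```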
First, two files are downloaded in `get_encoder`, from https://openaipublic.blob.core.windows.net/gpt-2/models/124M/:
`encoder.json` is a dict of len `50257` mapping from a token to its index. This represents the entirety of the vocabulary.
- The first `256` tokens represent the `256` byte values. Each token is some arbitrarily chosen character (it just needs to be printable); see the sketch after this list. These tokens are a fallback to ensure that we can encode any text sequence: for example, if we encounter a new emoji with an unknown byte sequence, at the very least we can encode each byte separately.
- The next `50k` tokens map from a byte sequence of length `2` and above to an index; these are the BPE-mined sequences of merged bytes.
- The last token is `<|endoftext|>`, which is a special token.
`vocab.bpe` is a `\n`-separated list of byte sequences that should be merged (`50k` of them).
- In contrast to the above, these sequences are not merged yet (e.g. a line is `R ocket`).
- We store these as a `list[tuple]` in `bpe_merges`.
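A rough sketch of what loading these two files amounts to (the printable-character trick follows the `bytes_to_unicode` helper from OpenAI's original GPT-2 encoder, which minGPT reuses; the local file names and the handling of `vocab.bpe`'s version header are assumptions based on that upstream code):

```python
import json

def bytes_to_unicode():
    # Map each byte value 0-255 to some printable unicode character.
    # Bytes that are already printable map to themselves; the rest are
    # shifted past 255 (e.g. the space byte 32 becomes chr(288) = 'Ġ').
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

# encoder.json: token string -> integer index, the full 50,257-entry vocab
with open("encoder.json") as f:
    encoder = json.load(f)
assert len(encoder) == 50257  # 256 byte tokens + 50,000 merges + <|endoftext|>

# vocab.bpe: one merge rule per line, e.g. "R ocket"; the first line is a
# version header and the last line is empty, so both are dropped
with open("vocab.bpe", encoding="utf-8") as f:
    bpe_data = f.read()
bpe_merges = [tuple(line.split()) for line in bpe_data.split("\n")[1:-1]]
assert len(bpe_merges) == 50000
```

The `Ġ` mapping for the space byte is why GPT-2 vocabulary entries for words preceded by a space show up as e.g. `Ġworld`.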
These two pieces of data are passed into the `Encoder` main class.
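Assuming the upstream layout (where `bpe.py` lives in the `mingpt` package and `Encoder` exposes the usual `encode`/`decode` methods), end-to-end usage looks roughly like:

```python
from mingpt.bpe import get_encoder

enc = get_encoder()               # downloads and caches encoder.json / vocab.bpe, returns an Encoder
ids = enc.encode("Hello world!")  # list of integer indices into the 50,257-token vocab
text = enc.decode(ids)            # inverse mapping; recovers "Hello world!"
```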