Line-By-Line, Let’s Reproduce GPT-2: Section 1

This blog post will go line-by-line through the code in Section 1 of Andrej Karpathy’s “Let’s reproduce GPT-2 (124M)”

[Image by Author (SDXL)]

Andrej Karpathy is one of the foremost Artificial Intelligence (AI) researchers out there. He is a founding member of OpenAI, previously led AI at Tesla, and continues to be at the forefront of the AI community. He recently released an incredible 4 hour video walking through how to build a high-quality LLM from scratch.

In that video, we go through all of the major parts of training an LLM, from coding the architecture to speeding up its training time to adjusting the hyperparameters for better results. There’s an incredible amount of knowledge there, so I wanted to expand upon it by going line-by-line through the code Karpathy creates and explaining how it works. This blog post is part of a series covering each section of Karpathy’s video.

In section one, we focus on implementing the architecture of GPT-2. While GPT-2 was open-sourced by OpenAI in 2019, it was written in TensorFlow, which is a harder framework to debug than PyTorch. Consequently, we are going to recreate GPT-2 using more commonly used tools. Using only the code we are going to create today, you can create an LLM of your own!

Let’s dive in!

High Level Vocabulary

Before we begin, let’s get on the same page about some terminology. While there may be some naming collisions with other sources, I’ll try to be consistent within these blog posts.

Block Size: tells us how many positions in the input length our Transformer can process. Once you go over this limit, performance degrades as you have to wrap around (you can learn more about how we expand this without training a new model from scratch in my Long RoPE Blog)

Vocabulary Size: tells us how many unique tokens the model will be able to understand and use.
In general, researchers have found that larger vocabulary sizes allow models to be more precise with their language and to capture more nuances in their responses.

Layer: part of the hidden layers of our neural network. Specifically here we refer to how many times we repeat the calculations shown in the grey box below:

[Figure: A layer in our model from “Attention is All You Need”]

Embedding: a vector representation of data we pass to the model.

Multi-Head Attention: rather than running attention once, we run it n times and then concatenate all of the results together to get the final result.

Let’s go into the code!

GPT Class & Its Parameters

    from dataclasses import dataclass
    import math

    import torch
    import torch.nn as nn
    from torch.nn import functional as F

    @dataclass
    class GPTConfig:
        block_size: int = 1024   # maximum sequence length
        vocab_size: int = 50257  # number of tokens in the vocabulary
        n_layer: int = 12        # number of Blocks
        n_head: int = 12         # number of attention heads
        n_embd: int = 768        # embedding dimension

To begin, we set 5 hyperparameters in the GPTConfig class. block_size, n_layer, and n_head may look somewhat arbitrary; put differently, these values were chosen empirically based on what the researchers saw had the best performance. Likewise, we choose 768 for n_embd as this is the value used in the GPT-2 paper, which we’ve decided to emulate.

However, vocab_size is set based on the tiktoken gpt2 tokenizer that we will use. The GPT-2 tokenizer was created using the Byte-Pair Encoding algorithm (read more here). This starts off with an initial vocabulary (in our case the 256 possible byte values) and then goes through the training data, creating new tokens by merging the pairs it sees most frequently. It keeps doing this until it has hit a limit (in our case 50,000 merges). Finally, we have vocabulary set aside for internal use (in our case the end-of-text token). Adding these up, we get 256 + 50,000 + 1 = 50,257.

    class GPT(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.config = config
            # …

With our configs set, we create a GPT class which is a subclass of the torch nn.Module class.
This is the base class for all PyTorch neural networks, and by using it we get access to everything PyTorch provides for these types of models, such as parameter registration and moving the model between devices. Each nn.Module will have a forward function that defines what happens during a forward pass of the model (more on these in a moment).

We begin by running the super constructor of the base class and then create a transformer object as a ModuleDict. We use a ModuleDict because it allows us to index into transformer by name, which will come in handy both when we want to load in weights from HuggingFace and when we want to debug and quickly go through our model.

    class GPT(nn.Module):
        def __init__(self, config):
            # …
            self.transformer = nn.ModuleDict(dict(
                wte = nn.Embedding(config.vocab_size, config.n_embd),   # token embeddings
                wpe = nn.Embedding(config.block_size, config.n_embd),   # positional embeddings
                h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),  # hidden layers
                ln_f = nn.LayerNorm(config.n_embd),                     # final layer norm
            ))

Our transformer here has 4 major pieces we are going to load in: the weights of the token embeddings (wte), the weights of the positional encodings (wpe), the hidden layers (h), and the final layer normalization (ln_f). This setup mostly follows the decoder part of the Transformer architecture from “Attention is All You Need” (output embeddings ~ wte, positional encoding ~ wpe, hidden layers ~ h). One key difference is that our architecture has an additional normalization layer, ln_f, applied after all of the hidden layers have finished.

[Figure: Decoder Half of the Architecture shown in “Attention is All You Need”]

The wte and the wpe are both embeddings, so naturally we use the nn.Embedding class to represent them. Our hidden layers are where we will have most of the logic for the Transformer, so I will go into this more later. For now, just note that we loop over the Block object so that we end up with n_layer of them.
Finally, we use the built-in nn.LayerNorm for ln_f, which will normalize our output based on the equation below (where x and y are input and output, E[x] is the mean value, Var[x] is the variance, ε is a small constant for numerical stability, and γ and β are learnable weights).

    y = (x - E[x]) / sqrt(Var[x] + ε) * γ + β

[Equation for Layer Normalization in PyTorch]

    class GPT(nn.Module):
        def __init__(self, config):
            # …
            self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

            # weight sharing scheme
            self.transformer.wte.weight = self.lm_head.weight

            # initialize weights
            self.apply(self._init_weights)

Next, we set up the final linear layer of our network, which will generate the logits of the model. Here we are projecting from the embedding dimension of our model (768) to the vocabulary size of our model (50,257). The idea is that we take the hidden state and expand it to map onto our vocabulary, so that we can use the score for each vocabulary entry to figure out what the next token should be.

Finally in our constructor, we have an interesting optimization where we tell the model to make the token embedding weights (wte) the same as the linear layer weights. This is done because we want the linear layer and the token embedding to have the same understanding of the tokens (if two tokens are similar when being input into the model, the same two tokens should be similar when being output by the model).
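
To make the weight-sharing step concrete, here is a minimal sketch on a toy-sized model. The sizes (a vocabulary of 10 and an embedding dimension of 4) are made up for illustration and are not GPT-2's. It checks that after the assignment both modules hold the same parameter, so a change made through one is visible through the other.

```python
import torch
import torch.nn as nn

# Toy sizes chosen only for illustration (GPT-2 uses 50257 and 768).
vocab_size, n_embd = 10, 4

wte = nn.Embedding(vocab_size, n_embd)               # weight shape: (vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # weight shape: (vocab_size, n_embd)

# Same assignment as in the GPT constructor: both modules now hold one Parameter.
wte.weight = lm_head.weight

shares_storage = wte.weight.data_ptr() == lm_head.weight.data_ptr()

# An in-place change made through one module is visible through the other.
with torch.no_grad():
    lm_head.weight[0].fill_(1.0)
row_matches = torch.equal(wte.weight[0], torch.ones(n_embd))

print(shares_storage, row_matches)
```

The assignment works without any reshaping because nn.Linear stores its weight as (out_features, in_features), which here is the same (vocab_size, n_embd) layout the embedding table uses.
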
Finally, we initialize the weights for the model so we can start training.

    class GPT(nn.Module):
        # …
        def forward(self, idx, targets=None):
            B, T = idx.size()
            assert T <= self.config.block_size, f"maximum sequence length breached"

            pos = torch.arange(0, T, dtype=torch.long, device=idx.device)  # shape (T)
            pos_emb = self.transformer.wpe(pos)  # shape (T, n_embd)
            tok_emb = self.transformer.wte(idx)  # shape (B, T, n_embd)

            x = tok_emb + pos_emb  # hidden broadcast over the batch dimension

            for block in self.transformer.h:
                x = block(x)
            x = self.transformer.ln_f(x)
            logits = self.lm_head(x)  # shape (B, T, vocab_size)
            loss = None
            if targets is not None:
                loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
            return logits, loss

Our forward function is where we lay out exactly how our model will behave during a forward pass. We start off by verifying that our sequence length is not greater than our configured max value (block_size). Once that’s true, we create a tensor with values of 0 to T-1 (for example if T = 4, we’d have tensor([0, 1, 2, 3])) and run them through our positional embedding weights. Once that’s complete, we run the input tensor through the token embedding weights.

We combine both the token and the positional embeddings into x, which requires a broadcast. tok_emb has shape (B, T, n_embd) while pos_emb has shape (T, n_embd), so the same positional embeddings are added to every sequence in the batch and x ends up with the dimensions of tok_emb. x is now our hidden state, which we will pass through the hidden layers via the for loop. We are careful to update x after each pass through a Block.

Next, we normalize x via our final layer normalization ln_f and then do our linear projection to get the logits necessary to predict the next token. If we are training the model (which we signal via the targets parameter), we then compute the cross entropy between the logits we have just produced and the ground truth values held in our targets variable. We accomplish this via the cross_entropy loss function. To do this right, we need to convert our logits and targets to the right shape via .view().
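
The reshaping before the loss can be tried in isolation. The sketch below uses random logits and targets with made-up toy sizes (B=2, T=3, and a vocabulary of 5) purely to show the shapes involved; it is not part of the model code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, vocab_size = 2, 3, 5   # toy sizes, not GPT-2's

logits = torch.randn(B, T, vocab_size)           # what lm_head would produce
targets = torch.randint(0, vocab_size, (B, T))   # ground-truth next tokens

# cross_entropy expects (N, num_classes) logits and (N,) class indices,
# so the batch and time dimensions are merged into one.
flat_logits = logits.view(-1, logits.size(-1))   # (B*T, vocab_size)
flat_targets = targets.view(-1)                  # (B*T,)

loss = F.cross_entropy(flat_logits, flat_targets)  # scalar, averaged over B*T positions
print(flat_logits.shape, flat_targets.shape, loss.shape)
```
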
We ask PyTorch to infer the correct size when we pass through -1.

There’s one more function in this class, the initialization function, but we’ll get to the initialization logic a little later. For now, let’s dive into the Block logic that will help us implement our multi-head attention and MLPs.

Block Class

    class Block(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.ln_1 = nn.LayerNorm(config.n_embd)
            self.attn = CausalSelfAttention(config)
            self.ln_2 = nn.LayerNorm(config.n_embd)
            self.mlp = MLP(config)
        # …

Block is a subclass of nn.Module, so we also call the super constructor at the beginning. Next, we set up the same calculations as laid out in the “Attention is All You Need” paper: 2 layer normalizations, an attention calculation, and a feed forward layer via MLPs.

[Figure: A Hidden Layer from “Attention is All You Need”]

    class Block(nn.Module):
        # …
        def forward(self, x):
            x = x + self.attn(self.ln_1(x))  # first residual addition
            x = x + self.mlp(self.ln_2(x))   # second residual addition
            return x

We then define our forward function, which PyTorch will call for every forward pass of the model. Note that this is where we do something different from “Attention is All You Need”. We set up the layer normalizations to happen before the attention and the feed forward respectively, rather than after them. This is one of the insights from the GPT-2 paper, and you can see how making little changes like this can make a big difference. Note that the addition to the original tensor (the residual connection) stays in the same position as in the original architecture.
These 2 additions will be important when we set up our weight initialization function.

This class is a nice abstraction, as it lets us swap out implementations of attention or choose another type of feed forward function other than MLP without having to majorly refactor the code.

CausalSelfAttention Class

    class CausalSelfAttention(nn.Module):
        def __init__(self, config):
            super().__init__()
            assert config.n_embd % config.n_head == 0
            # query, key, and value projections for all heads, in one layer
            self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
            # output projection
            self.c_proj = nn.Linear(config.n_embd, config.n_embd)
            self.c_proj.NANOGPT_SCALE_INIT = 1
            self.n_head = config.n_head
            self.n_embd = config.n_embd
            # causal mask, stored as a buffer rather than a parameter
            self.register_buffer('bias', torch.tril(torch.ones(config.block_size, config.block_size))
                                 .view(1, 1, config.block_size, config.block_size))
        # …

Attention is an important part of our model, so naturally there are a number of configurations here. We have the assert statement as a debugging tool to make sure that the configuration dimensions we pass through are compatible (the embedding dimension has to divide evenly across the heads). Then we create some layers that will assist us when we do our self-attention. First, we have c_attn and c_proj, which are linear projections: c_attn maps our hidden state to the query, key, and value vectors (hence 3 times the embedding size), and c_proj maps the attention output back to the embedding size. The c_proj.NANOGPT_SCALE_INIT is a flag we set here and in the MLP that will help us with the weight initialization later (in truth this could be named anything).

Finally, we tell torch to create a buffer called bias that will not be updated during training. The name follows the OpenAI/HuggingFace naming, but it is really a mask: a lower triangular matrix of dimensions block_size x block_size that we then turn into a 4D tensor with dimensions 1 x 1 x block_size x block_size. The 1 x 1 is there so that the mask broadcasts across the batch and head dimensions of the attention scores.
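
As a small illustration of what this buffer holds, the sketch below builds the same mask for a made-up block_size of 4 (the config above uses 1024).

```python
import torch

block_size = 4  # toy value for illustration; the config above uses 1024

# Lower-triangular matrix of ones: row i has ones in columns 0..i.
mask = torch.tril(torch.ones(block_size, block_size))

# Two leading singleton dimensions so the mask can broadcast against
# attention scores of shape (B, n_head, T, T).
mask4d = mask.view(1, 1, block_size, block_size)

print(mask)
print(mask4d.shape)
```

Row i of the mask marks which positions token i is allowed to look at: itself and everything before it.
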
This buffer will be used to apply a mask on our multi-headed attention.

    class CausalSelfAttention(nn.Module):
        # …
        def forward(self, x):
            B, T, C = x.size()  # batch size, sequence length, channels
            qkv = self.c_attn(x)
            q, k, v = qkv.split(self.n_embd, dim=2)
            # split channels into heads, then move the head dimension next to batch:
            # (B, T, C) -> (B, T, n_head, head_size) -> (B, n_head, T, head_size)
            k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
            att = F.softmax(att, dim=-1)
            y = att @ v
            y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble the heads

            y = self.c_proj(y)
            return y

Now comes the implementation of attention, with a focus on making this performant in torch. Going line by line, we begin by finding the batch size, sequence length, and channels in our input tensor x. We then call our c_attn from before to project our hidden state into the dimensions we’ll need. We then split that result into 3 tensors of (B, T, C) shape (specifically one for query, one for key, and one for value).

We then adjust the dimensions of q, k, and v so that we can do multi-head attention on these performantly. By changing the dimensions from (B, T, C) to (B, T, self.n_head, C // self.n_head), we are dividing up the channels so that each head gets its own slice of the data to operate on. We then transpose so that self.n_head becomes the second dimension and T the third. This makes the heads behave like an extra batch dimension, so a single matrix multiplication handles every head at once.

    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

[Attention equation from “Attention is All You Need”]

Now that we have our values, we can start to calculate. We perform a matrix multiplication between query and key (making sure to transpose the last two dimensions of key so that the shapes line up), then divide by the square root of the head size (the last dimension of k).
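
The reshaping and the scaled score computation described so far can be checked on random data. The sizes below (B=2, T=5, C=12, n_head=3) are made up for illustration, and the tensors stand in for the q and k that c_attn would produce.

```python
import math
import torch

torch.manual_seed(0)
B, T, C, n_head = 2, 5, 12, 3   # toy sizes; C must be divisible by n_head
head_dim = C // n_head

q = torch.randn(B, T, C)
k = torch.randn(B, T, C)

# (B, T, C) -> (B, T, n_head, head_dim) -> (B, n_head, T, head_dim)
q_heads = q.view(B, T, n_head, head_dim).transpose(1, 2)
k_heads = k.view(B, T, n_head, head_dim).transpose(1, 2)

# One (T, T) score matrix per batch element and per head.
att = (q_heads @ k_heads.transpose(-2, -1)) * (1.0 / math.sqrt(head_dim))

# Head 0 only sees the first head_dim channels of the original tensors.
manual_head0 = (q[0, :, :head_dim] @ k[0, :, :head_dim].T) / math.sqrt(head_dim)

print(att.shape)
print(torch.allclose(att[0, 0], manual_head0, atol=1e-5))
```

The comparison against manual_head0 shows that the view call simply hands each head a contiguous slice of the channel dimension.
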
After this calculation, we apply the mask from our bias buffer so that tokens in the future cannot impact tokens in the present. Because the current sequence may be shorter than block_size, we slice the mask down to its first T rows and columns. Wherever the mask is 0, the score is set to negative infinity. Once that is complete, we apply the softmax, which turns each row of scores into weights that sum to 1 and gives the masked positions a weight of exactly 0.

Once the softmax is done, we multiply the attention weights by v, and then transpose the result back to the (B, T, self.n_head, C // self.n_head) layout. We call .contiguous() to ensure that in memory all of the data is laid out next to each other, and finally convert our tensor back to the (B, T, C) dimensions it came in with (thus concatenating our attention heads in this step).

Finally, we apply our linear projection c_proj, which keeps the dimensions of the hidden state.

MLP Class

    class MLP(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
            self.gelu = nn.GELU(approximate="tanh")
            self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
            self.c_proj.NANOGPT_SCALE_INIT = 1
        # …

Like all the classes before, MLP inherits from nn.Module. We begin by setting up some layers, specifically the c_fc and c_proj linear projection layers, which expand from our embedding size to 4 times that size and then back again respectively. Next, we have GELU. Karpathy makes a point to say that the approximate parameter here is only set so that we can closely match the GPT-2 paper. While the approximation of GELU was necessary at the time, nowadays we no longer need to approximate and can calculate it precisely.

    class MLP(nn.Module):
        # …
        def forward(self, x):
            x = self.c_fc(x)
            x = self.gelu(x)
            x = self.c_proj(x)
            return x

Our forward pass then is relatively straightforward. We call each layer on our input tensor in turn and return the final result.

Hugging Face Connection Code

Because GPT-2 is open-source, it is available on Hugging Face.
While our goal here is to train our own model, it is nice to be able to compare our results with the ones OpenAI found in their training. To allow us to do so, we have the below function that pulls in the weights and populates them into our GPT class.

With some modifications, this code could also be reused to pull in other foundation models from Hugging Face and fine-tune them (right now it only handles gpt-2).

    class GPT(nn.Module):
        # …
        @classmethod
        def from_pretrained(cls, model_type):
            """Loads pretrained GPT-2 model weights from huggingface"""
            assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
            from transformers import GPT2LMHeadModel
            print("loading weights from pretrained gpt: %s" % model_type)

            # n_layer, n_head and n_embd are determined from model_type
            config_args = {
                'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
                'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
                'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
                'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
            }[model_type]
            config_args['vocab_size'] = 50257 # always 50257 for GPT model checkpoints
            config_args['block_size'] = 1024 # always 1024 for GPT model checkpoints
            # create a from-scratch initialized minGPT model
            config = GPTConfig(**config_args)
            model = GPT(config)
            sd = model.state_dict()
            sd_keys = sd.keys()
            sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')] # discard this mask / buffer, not a param
            # …

Starting from the top, we bring in HuggingFace’s transformers library and set up the hyperparameters that vary between the different variants of the GPT-2 model. As the vocab_size and block_size don’t change, you can see we hard-code them in. We then pass these variables into the GPTConfig class from before, and then instantiate the model object (GPT).
Finally, we remove all keys from the model that end with .attn.bias, as these are not weights, but rather the buffer we set up to help with our attention function before.

    class GPT(nn.Module):
        # …
        @classmethod
        def from_pretrained(cls, model_type):
            # …
            model_hf = GPT2LMHeadModel.from_pretrained(model_type)
            sd_hf = model_hf.state_dict()

            # copy while ensuring all of the parameters are aligned and match in names and shapes
            sd_keys_hf = sd_hf.keys()
            sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')] # ignore these, just a buffer
            sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')] # same, just the mask (buffer)
            transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
            # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla Linear
            # this means that we have to transpose these weights when we import them
            assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"

Next, we load in the model from the HuggingFace class GPT2LMHeadModel. We take the keys out of this model and likewise ignore the attn.masked_bias and attn.bias keys. We then have an assert to make sure that we have the same number of keys in the Hugging Face model as we do in our model.

    class GPT(nn.Module):
        # …
        @classmethod
        def from_pretrained(cls, model_type):
            # …
            for k in sd_keys_hf:
                if any(k.endswith(w) for w in transposed):
                    # special treatment for the Conv1D weights we need to transpose
                    assert sd_hf[k].shape[::-1] == sd[k].shape
                    with torch.no_grad():
                        sd[k].copy_(sd_hf[k].t())
                else:
                    # vanilla copy over the other parameters
                    assert sd_hf[k].shape == sd[k].shape
                    with torch.no_grad():
                        sd[k].copy_(sd_hf[k])

            return model

To round out the function, we loop through every key in the Hugging Face model and copy its weights to the corresponding key in our model. There are certain keys that need to be manipulated so that they fit the data structure we’re using.
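
To see why some matrices need special handling, the sketch below mimics the two storage conventions with plain tensors. It assumes, as the comment in the code says, that the checkpoint's Conv1D-style layers store their weight as (in_features, out_features) and compute x @ weight + bias, while nn.Linear stores (out_features, in_features). The sizes and the random weights are made up; no real checkpoint is loaded.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
in_features, out_features = 4, 6   # toy sizes for illustration

# Stand-in for a checkpoint weight stored Conv1D-style: (in, out).
w_conv1d_style = torch.randn(in_features, out_features)
b = torch.randn(out_features)

linear = nn.Linear(in_features, out_features)   # weight is (out, in)

# The reversed shape matches, which is what the assert in from_pretrained checks.
shapes_match = tuple(w_conv1d_style.shape[::-1]) == tuple(linear.weight.shape)

with torch.no_grad():
    linear.weight.copy_(w_conv1d_style.t())   # transpose on the way in
    linear.bias.copy_(b)                      # biases are copied as-is

x = torch.randn(3, in_features)
same_output = torch.allclose(linear(x), x @ w_conv1d_style + b, atol=1e-5)
print(shapes_match, same_output)
```
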
We run the function .t() to transpose the Hugging Face matrix into the dimensions we need. For the rest, we copy them over directly. You’ll notice we are using torch.no_grad(). This tells torch that it doesn’t need to track these copy operations for a backward pass of the model, which also makes this run faster.

Generating Our First Predictions (Sampling Loop)

With the classes we have now, we can run the model and have it give us output tokens (just make sure, if you’re following this sequentially, that you comment out the _init_weights call in the GPT constructor, as we have not defined that function yet). The below code shows how we would do that.

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        device = "mps"
    print(f"device {device}")

    torch.manual_seed(1337)

    model = GPT(GPTConfig())
    model.eval()
    model.to(device)

We start off by determining what devices we have access to. CUDA is NVIDIA’s platform for running extremely fast GPU calculations, so if we have access to chips that use CUDA we will use them. If we don’t have access but we’re on Apple Silicon, then we will use its GPU through the mps backend. Finally, if we have neither, then we fall back to CPU (this will be the slowest, but every computer has one so we know we can still train on it).

Then, we instantiate our model using the default configurations, and put the model into ‘eval’ mode (this does a number of things, like disabling dropout, but from a high level it makes sure that our model is more consistent during inferencing). Once set, we move the model onto our device.
Note that if we wanted to use the HuggingFace weights instead of our training weights, we would modify the third-to-last line to read: model = GPT.from_pretrained('gpt2')

    import tiktoken

    num_return_sequences = 5  # how many completions to generate (value from the training script later on)
    max_length = 30           # total length in tokens of each completion, prompt included

    enc = tiktoken.get_encoding('gpt2')
    tokens = enc.encode("Hello, I'm a language model,")
    tokens = torch.tensor(tokens, dtype=torch.long)  # shape (T)
    tokens = tokens.unsqueeze(0).repeat(num_return_sequences, 1)  # shape (B, T)
    x = tokens.to(device)

We now bring in tiktoken using the gpt2 encodings and have it tokenize our prompt. We take these tokens and put them into a tensor, which we then convert into a batch in the line below it. unsqueeze(0) adds a new first dimension of size 1 to the tensor, and repeat copies the entire tensor num_return_sequences times along the first dimension and once along the second dimension. What we’ve done here is format our data to fit the batched schema our model is expecting. Specifically we now match the (B, T) format: num_return_sequences x encoded length of prompt. Once we pass the input tensor into the beginning of the model, our wte and wpe will create the C dimension.

    while x.size(1) < max_length:
        with torch.no_grad():
            logits, _ = model(x)                   # (B, T, vocab_size)
            logits = logits[:, -1, :]              # keep only the last position: (B, vocab_size)
            probs = F.softmax(logits, dim=-1)
            topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)
            ix = torch.multinomial(topk_probs, 1)  # sample a position within the top 50
            xcol = torch.gather(topk_indices, -1, ix)
            x = torch.cat((x, xcol), dim=1)

Now that the tokens are ready and on the device, we begin our sampling loop. The loop will be exclusively forward passes, so we wrap it in torch.no_grad to stop it from caching anything for backward propagation. Our logits come out with shape (batch_size, seq_len, vocab_size), so here the last dimension holds one score per vocabulary entry.

We only need the last item in the sequence to predict the next token, so we pull out [:, -1, :]. We then take those logits and run them through a softmax to get the token probabilities.
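
The selection step inside the loop can be tried on a toy example. The sketch below uses a made-up vocabulary of 6 tokens and k=3 instead of 50, with random logits standing in for the model output, so the effect of topk, multinomial, and gather is easy to inspect.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, vocab_size, k = 2, 6, 3   # toy sizes; the loop above uses k=50

logits = torch.randn(B, vocab_size)       # logits for the last position only
probs = F.softmax(logits, dim=-1)

# Keep the k most likely tokens per row: their probabilities and vocabulary ids.
topk_probs, topk_indices = torch.topk(probs, k, dim=-1)   # both (B, k)

# Sample a position in 0..k-1, weighted by the top-k probabilities
# (multinomial does not require the rows to sum to 1).
ix = torch.multinomial(topk_probs, 1)                     # (B, 1)

# Translate the sampled position back into a vocabulary id.
xcol = torch.gather(topk_indices, -1, ix)                 # (B, 1)

print(topk_indices, ix, xcol)
```
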
Taking the top 50, we then sample one of them at random, weighted by its probability, and pick that one as our predicted token. ix is only a position within the top 50, so we use gather to look up the actual token id (xcol) and add it to our tensor x. By concatenating xcol to x, we set ourselves up to predict the next token given what we just predicted. This is how we code up autoregression.

    for i in range(num_return_sequences):
        tokens = x[i, :max_length].tolist()
        decoded = enc.decode(tokens)
        print(f">> {decoded}")

After the sampling loop is done, we can go through each of the generated sequences and decode them, showing the response to the user. We grab the i-th row of our batch and decode its tokens back into text.

If you run the sampling loop on our initial model, you will notice that the output leaves a lot to be desired. This is because we haven’t trained any of the weights. The next few classes show how we can begin a naive training of the model.

DataLoaderLite

All training requires high quality data. For Karpathy’s videos, he likes to use public domain Shakespeare text (find it here).

    class DataLoaderLite:
        def __init__(self, B, T):
            self.B = B
            self.T = T

            with open('shakespeare.txt', "r") as f:
                text = f.read()

            enc = tiktoken.get_encoding('gpt2')
            tokens = enc.encode(text)
            self.tokens = torch.tensor(tokens)

            # parentheses matter: tokens per epoch divided by tokens per batch
            print(f"1 epoch = {len(self.tokens) // (B * T)} batches")

            self.current_position = 0

We begin by simply opening the file and reading in the text. This data source is ASCII only, so we don’t need to worry about any unexpected binary characters. We use tiktoken to get the encodings for the body, and then convert these tokens into a tensor. We then create a variable called current_position, which will let us know where in the token tensor we are currently training from (naturally, this is initialized to the beginning). Note, this class is not inheriting from nn.Module, mainly because we have no need for the forward function here.
Just as with the prompt part of the sampling loop, our DataLoaderLite class only needs to generate tensors of shape (B, T).

    class DataLoaderLite:
        # …
        def next_batch(self):
            B, T = self.B, self.T

            buf = self.tokens[self.current_position: self.current_position + (B*T + 1)]
            x = (buf[:-1]).view(B, T)  # inputs
            y = (buf[1:]).view(B, T)   # targets: the inputs shifted by one token

            self.current_position += B * T

            # if the next batch would run past the end of the data, start over
            if self.current_position + (B*T + 1) > len(self.tokens):
                self.current_position = 0
            return x, y

In the above we define the function next_batch to help with training. To make programs run faster, we like to run the calculations in batches. We use the B and T fields to determine the batch size (B) and sequence length (T) we’ll be training on. Using these variables, we create a flat buffer that holds the tokens we are going to train with, which we will then reshape into B rows and T columns. Note that we read from current_position to current_position + (B*T + 1), where the +1 is to make sure we have all of the ground truth values for our B*T batch.

We then set up our model input (x) and our expected output (y) along the same lines. x is the entire buffer except for the last token, and y is the entire buffer except for the first. The basic idea is that given the first token in the buffer, we expect our model to give us back the second token in the buffer.

Finally, we update the current_position (wrapping back to the start when we run out of data) and return x and y.

Weight Initialization

As we are dealing with probabilities, we’d like to pick initial values for our weights that are likely to require fewer epochs to get right.
Our _init_weights function helps us do so, by initializing the weights with either zeroes or with a normal distribution.

    class GPT(nn.Module):
        # …
        def _init_weights(self, module):
            # layer norm is by default set to what we want, no need to adjust it
            if isinstance(module, nn.Linear):
                std = 0.02
                if hasattr(module, "NANOGPT_SCALE_INIT"):
                    std *= (2 * self.config.n_layer) ** -0.5 # 2 * for 2 additions (attention & mlp)
                torch.nn.init.normal_(module.weight, mean=0.0, std=std)
                # reasonable values are set based off a certain equation
                if module.bias is not None:
                    torch.nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

If you remember from before, the constructor calls self.apply(self._init_weights), which runs _init_weights on every submodule of the GPT class, so the function receives nn.Modules one at a time. The initialization roughly follows the Xavier method, in which the standard deviation of our sampling distribution is on the order of 1 / sqrt(n), where n is the hidden dimension. You will notice that in the code, we often use a hardcoded 0.02 as the standard deviation. While this might seem arbitrary, from the below table you can see that 1 / sqrt(n_embd) for the hidden dimensions GPT-2 uses is in the vicinity of 0.02 (between about 0.025 and 0.036), so this is a fine approximation.

[Embedded table: https://medium.com/media/50d118c8522b2c32f224b3bad3a9e5df/href]

Going through the code, we start off by checking which subtype of nn.Module the module we’re operating on is.

If the module is Linear, then we check if it is one of our output projections from the MLP or CausalSelfAttention classes (by checking if it has the NANOGPT_SCALE_INIT flag set). If it is, then our 0.02 approximation won’t work on its own, because each Block adds the outputs of these projections onto the hidden state, so the variance of the hidden state grows with the number of layers (this is a function of our addition of the tensors in the Block class). Consequently, the GPT-2 paper uses a scaling factor to account for this: 1/sqrt(2 * self.config.n_layer).
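
The numbers behind this reasoning are quick to compute. The sketch below evaluates 1/sqrt(n_embd) and the scaled standard deviation 0.02 * (2 * n_layer) ** -0.5 for the four model sizes listed in from_pretrained. It uses only Python's math module, and the variable names are my own.

```python
import math

# (n_layer, n_embd) per model, taken from the config_args table in from_pretrained.
models = {
    "gpt2": (12, 768),
    "gpt2-medium": (24, 1024),
    "gpt2-large": (36, 1280),
    "gpt2-xl": (48, 1600),
}

results = {}
for name, (n_layer, n_embd) in models.items():
    xavier_like_std = 1 / math.sqrt(n_embd)       # 1/sqrt(hidden dimension)
    scaled_std = 0.02 * (2 * n_layer) ** -0.5     # std used for the flagged projections
    results[name] = (xavier_like_std, scaled_std)
    print(f"{name:12s} 1/sqrt(n_embd)={xavier_like_std:.4f}  scaled std={scaled_std:.5f}")
```

For the 124M model, the projections flagged with NANOGPT_SCALE_INIT end up with a standard deviation of roughly 0.004, about a fifth of the default 0.02.
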
The 2* is because our Block has 2 places where we are adding the tensors.

If we have a bias in the Linear module, we initialize it to all zeroes.

If we have an Embedding module (like the token or positional encoding pieces), we initialize it with the same normal distribution with a standard deviation of 0.02.

If you remember, we have another subtype of module in our model: nn.LayerNorm. By default PyTorch initializes its scale to 1 and its offset to 0, which is what we want, so we don’t need to make any changes.

Training Loop

Now that we have the training fundamentals set up, let’s put together a quick training loop to train our model.

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        device = "mps"
    print(f"device {device}")

    num_return_sequences = 5
    max_length = 30

    torch.manual_seed(1337)

    train_loader = DataLoaderLite(B=4, T=32)

    model = GPT(GPTConfig())
    model.to(device)

You can see that we repeat our device selection to get optimal performance. We then set our data loader to use batch sizes of 4 and sequence lengths of 32 (set arbitrarily, although powers of 2 are best for memory efficiency).

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for i in range(50):
        x, y = train_loader.next_batch()
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad() # have to start with a zero gradient
        logits, loss = model(x, y)
        loss.backward() # adds to the gradient (+=, which is why they must start as 0)
        optimizer.step()
        print(f"loss {loss.item()}, step {i}")

Now we have the optimizer, which will help us train our model. The optimizer is a PyTorch class that takes in the parameters it should be training (in our case the ones given from the GPT class) and the learning rate, a hyperparameter that determines how quickly we adjust the parameters during training: a higher learning rate means more drastic changes to the weights after each run.
We chose our value based off of Karpathy’s recommendation.

We then use 50 training steps to train the model. We start by getting the training batch and moving it onto our device. We set the optimizer’s gradients to zero (gradients in PyTorch are accumulated as sums, so if we don’t zero them out we will be carrying information over from the last batch). We calculate the logits and loss from our model, and then run backward propagation to compute the gradient of the loss with respect to each weight. Finally, we run optimizer.step() to update all of our model parameters using those gradients.

Sanity Check

To see how all of the above code runs, you can check out my Google Colab where I combine all of it and run it on the NVIDIA T4 GPU. Running our training loop, we see that the loss starts off at ~11. To sanity test this, we expect that at the beginning the odds of predicting the right token are (1/vocab_size). Taking this through a simplified loss function of -ln, we get -ln(1/50257) ≈ 10.82, which is just about where we begin!

[Image by Author]

Closing

Thanks for reading through to the end!

I tried to include as much detail as I could in this blog post, but naturally there were some things I had to leave out. If you enjoyed the blog post or see anything you think should be modified / expanded upon, please let me know!

It’s an exciting time to be building!

[1] Karpathy, A., “Let’s reproduce GPT-2 (124M)” (2024), YouTube
[2] Radford, A., et al., “Language Models are Unsupervised Multitask Learners” (2019), Papers With Code
[3] Vaswani, A., et al., “Attention Is All You Need” (2017), arXiv

Line By Line, Let’s Reproduce GPT-2: Section 1 was originally published in Towards Data Science on Medium.