Building your own GPT code assistant

In this article, we have explored how you can build your own GPT code assistant, that is, generate code using the GPT model architecture.

Table of contents:

I. Introduction to GPT
II. Transformers and NLP
III. Build your own Python code generator
IV. References

Introduction to GPT

"GPT" has quickly achieved the buzzword status in the AI field, but what is GPT? Why is the developer community going crazy over it? How has it changed machine learning in recent years? Can you train GPT on your own data? These are a few questions that require our attention before moving on to build a very basic version of your own GPT code assistant.

  • What is GPT, and why is there so much hype surrounding it?
    GPT stands for "Generative Pre-trained Transformer." It is a type of artificial intelligence model designed to understand and generate text much like a human would.

  • Simply put:
    GPT is a "Generative" machine learning model that has been "Pre-trained" on a massive volume of data using a specific type of neural network architecture known as the "Transformer".

  • What do GPTs bring to the table for AI?
    Generative Pre-trained Transformers have become the leading architecture for addressing sequence-to-sequence problems, with widely adopted state-of-the-art models like BERT and GPT-3 utilizing transformers internally. The origins of transformers can be traced back to Google's influential paper "Attention Is All You Need", which introduced the transformer architecture.

  • Now, we need to understand that this is just a new type of machine learning model and does not possess actual understanding or consciousness. So, although disappointing, there won't be a Skynet rising anytime soon. GPT simply uses statistical patterns from the training data it has been given to generate responses, so it might not always provide accurate and reliable information.

Transformers and Natural Language Processing

Now let's see what makes Transformers so effective for natural language processing. We already had pretty good Generative Adversarial Networks, right? Right...? Not really.

  • Generative Adversarial Networks use a generator-discriminator pair in which the generator tries to produce realistic data to fool the discriminator, while the discriminator tries to tell real instances apart from generated ones. This works pretty well for images, but it leads to instability and incoherence when dealing with contextual data such as text. Transformers, on the other hand, introduced an attention-based mechanism that captures long-range dependencies when modelling sequential data (a minimal sketch of this attention computation follows this list). This mechanism was a huge leap for text-based machine learning models, as Transformers produce more coherent output than GANs by focusing attention on the contextual information in the text.

  • Finally, the most important question of all: can we build our own Transformer and train it on our own data? Of course we can! To do that, we first need a basic understanding of the internal workings of a Transformer network.
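
Below is a minimal sketch of the scaled dot-product attention computation at the heart of a Transformer, written in PyTorch; the tensor names and shapes are illustrative, not taken from any particular implementation.

import math
import torch

def scaled_dot_product_attention(query, key, value):
    # query, key, value: tensors of shape (batch, seq_len, d_model)
    d_k = query.size(-1)
    # Score every position against every other position in the sequence
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    # Turn the scores into attention weights that sum to 1 per position
    weights = torch.softmax(scores, dim=-1)
    # Each output position is a weighted mix of all the value vectors
    return torch.matmul(weights, value)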

Python code generator using GPT architecture

Transformers can be conceptualized in terms of three main components: an Encoder, an Attention mechanism, and a Decoder.

  • Encoder: The Encoder component of a Transformer takes an input sequence and encodes it into a set of state representation vectors. These vectors capture the contextual information of the input sequence and the relationships between its elements. The Encoder tries to understand the input and create a meaningful representation that can be processed further.

  • Attention: The Attention mechanism is a fundamental part of Transformers and plays a crucial role in capturing dependencies within the input sequence. It allows the Transformer model to selectively focus on relevant parts of the input sequence during encoding and decoding. By assigning different weights to different positions in the sequence, the Attention mechanism helps the model understand the context between the elements, making the model more effective in processing sequential data.

  • Decoder: The Decoder component of a Transformer takes the encoded state representation vectors and decodes them to generate the desired output sequence. It uses the attention mechanism to selectively attend to relevant parts of the encoded input while generating the output. The Decoder incorporates the information from the Encoder and the attention mechanism to produce a coherent and contextually appropriate output sequence.
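
To make these three components concrete, here is a minimal sketch of a sequence-to-sequence model built on top of PyTorch's nn.Transformer, which bundles the encoder, multi-head attention and decoder; the vocabulary sizes, dimensions and the omission of positional encodings are simplifications for illustration, not values from the project repo.

import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    def __init__(self, src_vocab_size, trg_vocab_size, d_model=256, nhead=8, num_layers=3):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab_size, d_model)   # embeds English tokens
        self.trg_embed = nn.Embedding(trg_vocab_size, d_model)   # embeds Python tokens
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.generator = nn.Linear(d_model, trg_vocab_size)      # projects back to the vocabulary

    def forward(self, src, trg):
        # The encoder reads the English question; the decoder attends to it
        # while generating the Python token sequence
        decoded = self.transformer(self.src_embed(src), self.trg_embed(trg))
        return self.generator(decoded)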

The dataset:

  • The dataset we will be using for this mini project is an open-source dataset built by "The School of AI". It contains around 5000 datapoints, each of which is a question-solution pair: an English problem statement and the corresponding Python code. We will train our model on this dataset so that it learns to understand text queries and generate Python code in response.
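
Purely for illustration, assuming a plain-text file in which each problem statement sits on a line starting with '#' and the lines that follow (up to the next '#') are its Python solution, the pairs could be read like this; the file name and format here are assumptions, so adapt the parsing to the actual layout of the dataset you download.

def load_pairs(path='english_python_data.txt'):   # hypothetical file name
    questions, solutions = [], []
    with open(path, encoding='utf-8') as f:
        current_code = []
        for line in f:
            if line.startswith('#'):               # a new problem statement begins
                if current_code:
                    solutions.append(''.join(current_code))
                    current_code = []
                questions.append(line.lstrip('#').strip())
            else:
                current_code.append(line)
        if current_code:
            solutions.append(''.join(current_code))
    return list(zip(questions, solutions))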

The code assistant:

  • Before we begin, the full project is available here in my Git repo with the complete code and training instructions.
  • I will walk you through the steps here, and you can use the project repo mentioned above as a reference.

Step 1: Converting data into tokens.

  • A machine learning model can only understand data in the form of tokens. Hence, we need to convert the Input (SRC) and the Target (TRG) into tokens that the Transformer can work with. To do this, we can build our own tokenizer or use the spaCy tokenizer through torchtext's Field API:
from torchtext.legacy import data   # use 'from torchtext import data' on older torchtext versions

Input = data.Field(tokenize='spacy',
                   init_token='<sos>',
                   eos_token='<eos>',
                   lower=True)
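
The Python targets need a matching field. One way to set it up, as a sketch, is to pass the augmentation-and-tokenization function defined in Step 2 below as the tokenizer; the exact arguments here are illustrative:

Output = data.Field(tokenize=augment_tokenize_py_code,
                    init_token='<sos>',
                    eos_token='<eos>',
                    lower=False)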

Step 2: Augmenting the data.

  • Creating different versions of the existing data helps us scale the dataset to a larger volume and reduces the chances of overfitting. We will focus on augmenting the variable names in the code, which encourages the Transformer to learn the underlying logic rather than treating variable names as hardcoded values.
  • Go through this code to further understand augmentation and tokenization:
import io
import keyword
import random
from tokenize import tokenize

#Function to augment (randomly mask variables) and tokenize python code
def augment_tokenize_py_code(py_code_string, mask_factor = 0.3):

    var_dict = {} #Dict to store masked vars

    # Creating a list for keywords that are not variables and need to be skipped from variable masking
    skip_list = ['range', 'extend', 'enumerate', 'print', 'input', 'ord', 'int', 'float', 'type', 'zip', 'char', 'list', 'dict', 'tuple', 'set', 'len', 'sum', 'and', 'or', 'min', 'max']
    skip_list.extend(keyword.kwlist)

    var_counter = 1
    py_tokens = list(tokenize(io.BytesIO(py_code_string.encode('utf-8')).readline))
    tokenized_output = []

    for i in range(0, len(py_tokens)):
        if py_tokens[i].type == 1 and py_tokens[i].string not in skip_list:    #type 1 is a NAME token (identifier)

            if i>0 and py_tokens[i-1].string in ['def', '.', 'import', 'raise', 'except', 'class']: #avoiding masking modules
                skip_list.append(py_tokens[i].string)
                tokenized_output.append((py_tokens[i].type, py_tokens[i].string))
            elif py_tokens[i].string in var_dict:                                                   #if variable is already masked
                tokenized_output.append((py_tokens[i].type, var_dict[py_tokens[i].string]))
            elif random.uniform(0,1) > 1-mask_factor:                                               #randomly mask variables that are not masked
                var_dict[py_tokens[i].string] = 'var_' + str(var_counter)
                var_counter += 1
                tokenized_output.append((py_tokens[i].type, var_dict[py_tokens[i].string]))
            else:
                skip_list.append(py_tokens[i].string)
                tokenized_output.append((py_tokens[i].type, py_tokens[i].string))
        
        else:
            tokenized_output.append((py_tokens[i].type, py_tokens[i].string))
    
    return tokenized_output
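
As a quick illustration (the sample string below is made up), the function can be applied to a snippet of Python code and the result converted back to source text with the tokenize module's untokenize function:

from tokenize import untokenize

sample = "total = 0\nfor n in [1, 2, 3]:\n    total += n\nprint(total)"
masked_tokens = augment_tokenize_py_code(sample, mask_factor=0.5)
print(untokenize(masked_tokens).decode('utf-8'))   # variables may now appear as var_1, var_2, ...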

Step 3: Building the Transformer.

  • In order to prepare the data for our model, we utilize PyTorch's torchtext.data.BucketIterator to create batches. This ensures that inputs with similar lengths are grouped together within a single batch, which facilitates the training process. The tokenized English inputs (SRC) are then passed into the encoder, while the tokenized Python target outputs (TRG) are used in the decoder.
  • The goal is to leverage the encoder's understanding of the English inputs to generate predictions for the tokenized Python outputs using the decoder. Finally, the tokenized predictions are converted back to their original form using Python's source code tokenizer's untokenize function.
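
A sketch of how those batches might be built with torchtext's legacy BucketIterator is shown below; the batch size, the device handling and the names train_data, valid_data and src are illustrative assumptions, not taken from the project repo.

import torch
from torchtext.legacy import data

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data),              # datasets built from the Input/Output fields
    batch_size=16,
    sort_key=lambda x: len(x.src),         # group examples of similar length into a batch
    sort_within_batch=True,
    device=device)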

Step 4: Regularisation

  • To enhance the robustness of our dataset, we have applied augmentations that involve masking variable literals. This enables our model to predict various valid values for a specific variable, as long as the predictions remain consistent within the code. Consequently, our training labels become less certain, leading us to employ label smoothing.
  • Label smoothing treats the training labels as correct with a probability of 1 minus the smooth_eps value, and as incorrect otherwise. By incorporating label smoothing into the Cross-Entropy loss function, we ensure that the model does not overly rely on predicting specific variables that can be substituted through augmentations. This approach promotes a more balanced and cautious prediction behavior within the model.
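
A minimal sketch of a label-smoothed cross-entropy loss is given below; the project may well use a ready-made implementation instead, and padding tokens are ignored here for brevity.

import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, target, smooth_eps=0.1):
    # logits: (num_tokens, vocab_size), target: (num_tokens,) of class indices
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    with torch.no_grad():
        # Soft targets: 1 - smooth_eps on the true token,
        # smooth_eps spread evenly over the remaining classes
        soft_targets = torch.full_like(log_probs, smooth_eps / (n_classes - 1))
        soft_targets.scatter_(1, target.unsqueeze(1), 1.0 - smooth_eps)
    return torch.mean(torch.sum(-soft_targets * log_probs, dim=-1))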

Now that all the parts of the model have been explained, we can train the model with backpropagation by splitting the dataset into training and validation data. The model is trained until the validation loss stops improving, and the best model is checkpointed as described by the following piece of code:

if valid_loss < best_valid_loss:
    best_valid_loss = valid_loss
    torch.save(model.state_dict(), 'path to save the model')
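
For context, that check typically sits inside an epoch loop along the lines of the sketch below; train and evaluate are assumed helper functions that run one epoch of training (with backpropagation) and one pass of validation respectively, and the epoch count is illustrative.

best_valid_loss = float('inf')

for epoch in range(25):
    train_loss = train(model, train_iterator, optimizer, criterion)
    valid_loss = evaluate(model, valid_iterator, criterion)
    # ...the checkpointing check shown above goes here...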

Sample output:

  • Sample 1:

Input: “sort a list of dictionaries by key”

Output:

var_1 ={'Name1':{'roll':25 ,'marks':50 },
'Name2':{'roll':26 ,'marks':67 },
'Name3':{'roll':30 },'marks':48 }}
var_key = 'marks'
res = 'marks'
res = var_2 (test_dict .items (),key =lambda x :x [1 ][var_key ])
print ("The sorted dictionary by marks is : " + str(res))
  • Sample 2:

Input: “reverse given string”

Output:

var_str ='Reversing this string'
var_str [::-1 ]

With this article at OpenGenus, you must have the complete idea of building your own GPT code assistant.
