Coding LLMs from scratch!

Today I’m sharing some results of my experiments with coding LLMs from scratch in Python/PyTorch, without any extra libraries.

The purpose of this is obviously educational. For me the true value is in struggling, running into issues and bugs, and fixing them on my own; however, I’m sharing the code because sometimes it helps to resolve a frustrating issue by looking at other implementations.

Like many others, I was probably inspired by Andrej Karpathy’s fantastic tutorial on GPT-2 from scratch. The material he prepared is a true treasure trove, but its biggest value for me was the realisation that I can code and train modern LLM architectures (albeit at a smaller size) fully on my own, without any major problems or costs!

Having worked as a data scientist for many years, I have used multiple Transformer-based models, including most of the LLMs out there; for some I modified or extended the code, and I have implemented the basic Transformer on my own several times to learn how it works. But I had never built an entire language model from start to finish, with all the training, on my own … until now.

I first started by following Karpathy’s tutorial, but then, inspired by it, I moved on to the famous LLMs shared by other major companies. As of the publication date of this post I have implemented:

  1. GPT-2 – an example of a decoder-based LLM
  2. BERT – as opposed to GPT-2, an example of an encoder-based language model
  3. Llama 2 – similar to GPT-2 but with some interesting modifications (like rotary positional embeddings, which pretty much everybody uses nowadays – see the sketch after this list).
  4. T5 – an example of an encoder-decoder model, like the one from the original Attention Is All You Need paper
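To give a flavour of one of those modifications, here is a minimal sketch of rotary positional embeddings in plain PyTorch; the shapes and names are illustrative and not taken from any particular model’s code:

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim); head_dim must be even
    batch, seq_len, n_heads, head_dim = x.shape
    # One frequency per pair of dimensions
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    # Angle for every (position, frequency) pair: (seq_len, head_dim // 2)
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]     # split into even/odd dimensions
    # Rotate each 2D pair by its position-dependent angle
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Usage: applied to queries and keys before the attention dot product
q = torch.randn(2, 16, 8, 64)
k = torch.randn(2, 16, 8, 64)
q, k = rotary_embedding(q), rotary_embedding(k)
```

The nice property is that relative position information ends up encoded directly in the query–key dot products, instead of being added as a separate learned embedding.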

I think I got the most out of my BERT implementation, because there aren’t many open, tutorial-like implementations out there. I found a few, but they usually don’t do the full pre-training, so it’s hard to compare my own results against them. Therefore, I had to solve many problems and issues myself, basically following the same methodology Karpathy used for his GPT-2 tutorial – reading the BERT and subsequent RoBERTa papers to figure out all the tiny implementation details. I also peeked into the HuggingFace implementation but, as noted in the GPT-2 tutorial as well, it’s super convoluted. When I ran into multiple frustrating issues, I basically ripped the entire code apart line by line. Time consuming, but it ended up being quite satisfying (after all, it finally started working ;D).
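One example of such a tiny detail is the masked-language-modelling corruption rule from the BERT paper: about 15% of tokens become prediction targets, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. A rough PyTorch sketch of that step (special-token handling and the exact token IDs are omitted/assumed here) might look like this:

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    # input_ids: (batch, seq_len) of token IDs; special tokens not handled in this sketch
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    # Pick ~15% of positions as prediction targets
    targets = torch.rand(input_ids.shape) < mlm_prob
    labels[~targets] = -100                              # ignore index for cross-entropy
    # 80% of targets -> [MASK]
    replace = targets & (torch.rand(input_ids.shape) < 0.8)
    input_ids[replace] = mask_token_id
    # 10% of targets -> a random token (half of the remaining 20%)
    random_tok = targets & ~replace & (torch.rand(input_ids.shape) < 0.5)
    input_ids[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]
    # The remaining 10% of targets keep their original token
    return input_ids, labels
```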

Example results of pre-training accuracy from my BERT implementation.

Overall, I can recommend the entire exercise even to those who have nothing to do with pre- or post-training those huge LLMs. I don’t expect to do that in my regular everyday work any time soon either; however, I fine-tune all those major models with billions of parameters on smaller datasets pretty much daily. Understanding in great detail how they are implemented and what the differences between them are makes me feel a lot more comfortable working with all of this. Also, doing it all from scratch in PyTorch is quite a good exercise.

For anybody interested, I share all my code and all pre-training results/accuracies/settings in a new repository on my GitHub account.

IMDB text classification: Keras vs TensorFlow 1.x vs TensorFlow 2.0 vs PyTorch

When trying to learn to code some deep learning problems, you can come across quite a few (Python) frameworks. Personally, I started with Keras, it being the easiest, but then I was curious how the others work.

So, as an exercise, I implemented the same problem in several frameworks, trying to make them work exactly the same (i.e. take the same input, give the same output, and implement the same kind of neural network inside).

My starting point with deep learning in general was the quite good introductory book “Deep Learning with Python” by Keras author François Chollet. Therefore, my Keras code follows the instructions from the book, while the subsequent implementations in other frameworks I did myself, trying to mimic what was already done in Keras – a small exercise in getting to know different APIs 🙂

I used the IMDB review classification problem (trying to figure out whether a review is positive or negative given its text), implementing a simple MLP network with 2 hidden layers.
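As a rough idea of what the Keras version looks like (layer sizes and training settings here are illustrative, in the spirit of the book’s IMDB example, and not necessarily identical to my notebooks):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Load IMDB reviews, keeping only the 10,000 most frequent words
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=10000)

def vectorize(sequences, dim=10000):
    # Turn each review (a list of word indices) into a multi-hot vector
    out = np.zeros((len(sequences), dim), dtype="float32")
    for i, seq in enumerate(sequences):
        out[i, seq] = 1.0
    return out

x_train, x_test = vectorize(x_train), vectorize(x_test)

# Simple MLP with 2 hidden layers
model = keras.Sequential([
    layers.Dense(16, activation="relu", input_shape=(10000,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=4, batch_size=512, validation_split=0.2)
model.evaluate(x_test, y_test)
```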

I tested the following frameworks, implementing the same IMDB problem in each:

  1. Keras
  2. TensorFlow 1.x
  3. TensorFlow 2.0 (beta / release candidate at the time)
  4. PyTorch

All source code is available in Jupyter notebooks on GitHub; I ran it using Python 3.7.

My general personal take is that Keras is by far the best for quick prototyping and for running many, many experiments that you might need to revisit later (i.e. the code is simple and readable). At work, I often need to test many cases/architectures/configurations for a client, jump to another project, and maybe a few weeks later come back to my old experiments to continue or to pick something for the final deliverable, so code readability and being able to quickly recall what I actually did is nice.

Of course, I looked at other frameworks for a reason. Keras is high-level, so when I need more flexibility and control to implement something custom, I might need to go elsewhere. The almost-legacy-now TensorFlow 1.x is truly a pain to debug and very unintuitive compared to any other Python framework/library. PyTorch has a more interesting approach, but I found it not as well documented and, after learning both Keras and TensorFlow 1.x, actually quite annoying :p Finally, TensorFlow 2.x is basically trying to steal the best tricks from everybody: it fully integrated Keras, and then for more control it adopted a PyTorch-like approach, defining the model through subclassing, etc. At the moment of writing, TF2 is still in beta / release candidate, so I would say it needs time to mature (e.g. I’m getting lots of warnings when running even the basic TF2 subclassing example under Windows, and the documentation, even though improved, I still find lacking). Overall though, I think it’s a good direction.
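For anybody who hasn’t seen that style, a minimal sketch of the same toy MLP defined through TF2 subclassing (the naming here is mine) looks roughly like this:

```python
import tensorflow as tf

class ImdbMLP(tf.keras.Model):
    # Same two-hidden-layer network, but defined imperatively, PyTorch-style
    def __init__(self):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(16, activation="relu")
        self.dense2 = tf.keras.layers.Dense(16, activation="relu")
        self.out = tf.keras.layers.Dense(1, activation="sigmoid")

    def call(self, inputs, training=False):
        x = self.dense1(inputs)
        x = self.dense2(x)
        return self.out(x)

model = ImdbMLP()
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
```

The layers are created in the constructor and the forward pass lives in `call`, which is essentially what PyTorch does with `__init__` and `forward`.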

PS: my opinion could be a bit biased towards Keras, because by the time I started with TensorFlow/PyTorch I had already gone through the entire Chollet book, implementing everything, and I had used Keras for some problems at work with clients too.