How Do LLMs Work

Disclaimer

This is a page started by me (Pranay) to document my understanding of LLMs. All comments and suggestions are welcome. This document will remain in work-in-progress mode for a substantial amount of time, and will likely contain a lot of errors. You've been warned.

Prior Knowledge

We know that:

LLMs are trained to predict the next word
LLMs can produce erroneous results
LLMs are trained on the public data available on the internet. Hence, they have seen far more pre-formed connections than any human.
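To make "predict the next word" concrete, here is a minimal sketch that uses a much simpler technique than a real LLM: it just counts which word follows which in a tiny corpus and predicts the most frequent follower. The corpus and the names `following` and `predict_next` are made up for illustration. Real LLMs use neural networks rather than lookup counts, but the training objective has the same flavour.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# For every word, count the words that immediately follow it.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = following.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(predict_next("the"))      # most frequent word seen after "the"
print(predict_next("unicorn"))  # never seen in the corpus
```

The sketch also hints at why results can be erroneous: the prediction is only the statistically likely continuation of the training text, not a checked fact.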

What I Understand After the First Pass (31 August 2023)

Each word is converted into a vector of floating-point numbers. A vector lets a word be described along hundreds of dimensions. Numerical vectors also have the advantage that you can perform mathematical operations on them.
The model adjusts these vectors so that similar words lie close to each other in the vector space, i.e. their values are close along many of the hundreds of dimensions.
Humans can't think of words in so many dimensions, but computers can!
There are many complications: the same word can mean completely different things in different contexts (homonyms) or closely related things (polysemes). As you would guess, the vectors for polysemes are closer to each other than the vectors for homonyms.
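The idea of vectors "lying closer to each other" can be sketched with two common measures, Euclidean distance and cosine similarity. The three-dimensional vectors below are hand-picked assumptions purely for illustration; real models learn vectors with hundreds of dimensions from data, and the names `toy_vectors`, `euclidean_distance` and `cosine_similarity` are mine.

```python
import math

# Hypothetical 3-dimensional vectors, hand-picked for illustration.
# Real models learn vectors with hundreds of dimensions from data.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (smaller means closer)."""
    return math.dist(a, b)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (nearer 1 means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

for other in ("dog", "car"):
    d = euclidean_distance(toy_vectors["cat"], toy_vectors[other])
    s = cosine_similarity(toy_vectors["cat"], toy_vectors[other])
    print(f"cat vs {other}: distance={d:.2f}, cosine={s:.2f}")
```

With these made-up numbers, "cat" ends up closer to "dog" than to "car" on both measures, which is the kind of relationship the trained vectors are meant to capture.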

Yella Okay! How Do LLMs Achieve All This Magic?

Abstraction Level 1: Layers of Transformers

WebVectors results for similar words

Sidebar: To see the raw vectors associated with each word, I went to this model's website and looked up the semantic associates of the word "India". The results, based on two different training sets, are in the image on the right-hand side.
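A lookup of "semantic associates" can be sketched as ranking every other word by how similar its vector is to the query word's vector. This is only my assumption about the general approach, not a description of how that website works internally. The vectors below are invented numbers, not real model output, and `toy_vectors`, `cosine` and `nearest` are hypothetical names.

```python
import math

# Invented vectors for illustration; these are NOT real model output.
toy_vectors = {
    "India": [0.9, 0.7, 0.1],
    "Pakistan": [0.8, 0.6, 0.2],
    "Delhi": [0.7, 0.9, 0.1],
    "banana": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    target = toy_vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in toy_vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

for other, score in nearest("India", k=3):
    print(f"{other}: {score:.3f}")
```

Training the same kind of model on a different dataset would give different vectors, and therefore a different ranking, which is presumably why the two result lists in the image differ.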

Notes

I began by reading the article that Mihir Mahajan shared on Takshashila's Mattermost server. This was the first article on the subject that made me feel I could go further in understanding LLMs. By contrast, concise explainers such as these end up confusing rather than illuminating.