How do LLMs Work


Disclaimer

This is a page started by me (Pranay) to document my understanding of LLMs. All comments and suggestions are welcome. This document will remain in work-in-progress mode for a substantial amount of time, and will likely contain a lot of errors. You've been warned.

Prior Knowledge

We know that:

  • LLMs are trained to predict the next word (a toy sketch of this loop follows this list)
  • LLMs can produce erroneous results
  • LLMs are trained on the public data available on the internet. Hence they hold far more potential pre-formed connections than any human.
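
To make the first bullet concrete, here is a toy sketch of the predict-and-append loop. `toy_next_word` is a hypothetical stand-in for a real model, which would instead assign a probability to every word in its vocabulary:

```python
# Toy sketch of the next-word loop: repeatedly predict a word and append
# it. `toy_next_word` is a hypothetical stand-in for a real model, which
# would instead assign a probability to every word in its vocabulary.

def toy_next_word(context):
    lookup = {
        "the capital of france is": "Paris",
        "the capital of france is paris": "a",
        "the capital of france is paris a": "city",
    }
    return lookup.get(context.lower(), "<end>")

text = "The capital of France is"
while True:
    word = toy_next_word(text)
    if word == "<end>":
        break
    text = text + " " + word

print(text)  # -> The capital of France is Paris a city
```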

What I Understand After the First Pass (31 August 2023)

  1. Each word is converted into a floating-point vector. Vectors allow a word to have thousands of dimensions (12,288 in GPT-3), and numerical vectors have the added advantage that you can perform mathematical operations on them (see the toy example after this list).
  2. The model transforms the floating-point vectors of similar words so that they lie closer to each other in the vector space, i.e. many of their dimensions take similar values.
  3. Humans can't think of words in so many dimensions, but computers can!
  4. There are many complications: the same word can mean completely different things in different contexts (homonyms) or closely related things (polysemes). As you would guess, vectors of polysemes lie closer to each other than vectors of homonyms.
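
A toy illustration of points 1, 2 and 4, with made-up four-dimensional vectors. All the numbers below are invented for this sketch; real models use thousands of dimensions:

```python
import numpy as np

# Toy illustration of word vectors with made-up four-dimensional
# vectors (real models use thousands of dimensions); all numbers
# below are invented for this sketch.

words = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
}

def cosine(a, b):
    """Similarity of two vectors: 1.0 means same direction."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similar words lie closer together in the vector space...
print(cosine(words["king"], words["queen"]))   # ~0.67 (related)
print(cosine(words["king"], words["woman"]))   # ~0.25 (less related)

# ...and because vectors are numbers, you can do arithmetic on them:
guess = words["king"] - words["man"] + words["woman"]
print(cosine(guess, words["queen"]))           # ~0.99, lands near "queen"
```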

Yella Okay! How Do LLMs Achieve All This Magic?

Abstraction Level 1: Layers of Transformers

"GPT-3 is organized into multiple layers. Each layer takes a sequence of vectors as inputs—one vector for each word in the input text—and adds information to help clarify the meaning of that word and better predict which word might come next." It's like we are triangulating at each stage and adding more refinements about each word vector at each layer.

Research suggests that the first few layers focus on understanding the syntax of the sentence and resolving ambiguities like the ones discussed above, while later layers work to develop a high-level understanding of the passage as a whole.[1]
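
Schematically, this layered refinement might look like the sketch below. `Layer` is a hypothetical placeholder for a full transformer layer; the point is only the shape of the computation, where each layer reads the current vector for every word and adds information to it:

```python
import numpy as np

class Layer:
    """Hypothetical stand-in for one transformer layer."""
    def __init__(self, dim):
        self.weights = np.random.randn(dim, dim) * 0.1

    def refine(self, vectors):
        # Compute some new information from the current word vectors...
        update = np.tanh(vectors @ self.weights)
        # ...and add it on top of what each vector already carries.
        return vectors + update

dim, num_words, num_layers = 8, 5, 4       # GPT-3: 96 layers, 12,288 dims
vectors = np.random.randn(num_words, dim)  # one vector per input word
for layer in [Layer(dim) for _ in range(num_layers)]:
    vectors = layer.refine(vectors)        # each layer clarifies the vectors
```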

Abstraction Level 2: What's Inside Each Transformer?

The transformer is a neural network architecture.

The transformer has a two-step process for updating the hidden state for each word of the input passage:

  1. In the attention step, words “look around” for other words that have relevant context and share information with one another.
  2. In the feed-forward step, each word “thinks about” information gathered in previous attention steps and tries to predict the next word.[1] (A schematic sketch of both steps follows this list.)
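
Here is a schematic sketch of that two-step update. Real attention uses separate learned query/key/value projections; in this sketch the internals are stubbed out with random placeholder weights, and the point is only the order of operations:

```python
import numpy as np

# Schematic sketch of the two-step update inside one transformer layer.
# The internals are stubbed out with random placeholder weights; every
# word's hidden-state vector is updated by both steps.

def attention_step(hidden):
    """Words "look around": each word's vector is updated using the others."""
    scores = hidden @ hidden.T                 # relevance of word j to word i
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ hidden                    # blend in context from other words

def feed_forward_step(hidden):
    """Each word "thinks about" its own vector, independently of the rest."""
    dim = hidden.shape[1]
    w1 = np.random.randn(dim, 4 * dim) * 0.1   # expand...
    w2 = np.random.randn(4 * dim, dim) * 0.1   # ...and project back
    return np.maximum(hidden @ w1, 0) @ w2     # with a ReLU in between

hidden = np.random.randn(6, 8)                 # 6 words, 8 dimensions each
hidden = hidden + attention_step(hidden)       # step 1: share information
hidden = hidden + feed_forward_step(hidden)    # step 2: process it per word
```
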
Attention Step

The unit of analysis is a "word". That's where GPU parallelisation helps: each of these steps happens at the level of individual words, so they can all run in parallel. This step figures out word contexts: which words are homonyms, which noun a pronoun refers to, whether "Sachin Tendulkar" is one person, and so on.

Simplified example: if I say "Pranay is a waste fellow but a decent waste fellow", the attention heads might work as follows. One head might add the information that "but" is still talking about Pranay, and that "decent" and "waste fellow" both refer to the same noun, Pranay. This matching is actually done through dot products of a query vector and a key vector. The query vector for "but" might say "I’m seeking: a noun that is relevant to this conjunction." The key vector for "Pranay" might say "I am: a noun describing a male person." The network would detect the match (a toy version follows below).
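
Here is that toy version, with invented numbers. The query for "but" and the key for "Pranay" point in similar directions, so their dot product (the attention score) is the largest and "but" attends mostly to "Pranay":

```python
import numpy as np

# Toy query/key matching with invented numbers: the attention score is
# the dot product of a query vector and each key vector, and softmax
# turns the scores into attention weights that sum to 1.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

query_but = np.array([1.0, 0.9, 0.0])        # "I'm seeking: a relevant noun"

keys = {
    "Pranay": np.array([1.0, 1.0, 0.0]),     # "I am: a noun, a male person"
    "waste":  np.array([0.0, 0.1, 1.0]),     # "I am: an adjective"
    "fellow": np.array([0.6, 0.2, 0.3]),
}

scores = np.array([query_but @ k for k in keys.values()])
weights = softmax(scores)
for word, w in zip(keys, weights):
    print(f"{word}: {w:.2f}")   # "Pranay" gets the largest attention weight
```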

Mindboggling: the largest version of GPT-3 has 96 layers with 96 attention heads each, so GPT-3 performs 9,216 attention operations each time it predicts a new word.
The Feed-Forward Step Predicts the Next Word

How exactly this is done, I'm not yet sure. Will have to study it.
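
One piece I can note for future study, from standard descriptions of transformers (not from the article above): after all the layers have refined it, the final word's vector is multiplied by an output matrix and softmaxed into a probability for every possible next word. A minimal sketch, with a placeholder vocabulary and random numbers:

```python
import numpy as np

# A minimal sketch, assuming the standard setup: the final word's vector
# is mapped to one score per vocabulary word, then softmaxed into a
# probability distribution over possible next words. The vocabulary and
# all numbers here are placeholders.

vocab = ["the", "a", "Pranay", "decent", "waste", "fellow"]
dim = 8

final_vector = np.random.randn(dim)               # last word, after all layers
output_matrix = np.random.randn(dim, len(vocab))  # maps vector -> word scores

scores = final_vector @ output_matrix             # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()     # softmax into probabilities

for word, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{word}: {p:.3f}")                     # most likely next word first
```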

How Does Training Happen?

There's a forward pass and a backward pass. Yet to fully comprehend.
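
A toy sketch of that forward-pass/backward-pass idea, using a single weight so the calculus stays visible. Real training does the same thing over billions of weights: predict (forward pass), measure the error, then nudge every weight downhill on the loss (backward pass):

```python
# Toy forward-pass / backward-pass with one weight: predict, measure
# the error, then nudge the weight in the direction that reduces it.

w = 0.0                   # one stand-in model weight
x, target = 2.0, 10.0     # toy input and the "correct" output

for step in range(5):
    prediction = w * x                          # forward pass
    loss = (prediction - target) ** 2           # how wrong were we?
    gradient = 2 * (prediction - target) * x    # backward pass: dloss/dw
    w -= 0.1 * gradient                         # update the weight
    print(f"step {step}: loss = {loss:.2f}")    # loss shrinks each step
```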


[Image: WebVectors results for similar words]

Sidebar: To see the raw vectors associated with each word, I went to the WebVectors website and checked the semantic associates of the word "India". The results, based on two different training sets, are in the image on the right.

Notes

I began by reading the article that Mihir Mahajan shared on Takshashila's Mattermost server. It was the first article on the subject that made me feel I could go further in understanding LLMs; shorter, more concise explainers had ended up confusing rather than illuminating me.

References