Large Language Models (LLMs)

  • What is a large language model?
    • AI models that process language, trained with billions of parameters.
    • Transformers with self-attention are the most successful technique
  • How do they work?
    • Basically matrix multiplication
    • Turn tokens (pieces of words) into vectors that have meaning in a high-dimensional space
      • Word embedding
    • Process those vectors through different neural network layers
    • Convert the output logits back into text (see the sketch below)
      • The model has to output one token at a time
    • Pretraining and fine-tuning
      • Trained to predict the next word given a list of words
      • Then some layers can be replaced and trained on smaller datasets for different use cases
    • Can be part of a multimodal neural network
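A minimal numpy sketch of this pipeline (token IDs → embedding lookup → layers of matrix multiplication → logits → next token). All sizes and weight matrices below are made up for illustration; a real model has trained weights and many transformer layers.

```python
import numpy as np

# Made-up sizes for illustration only
vocab_size, embed_dim, seq_len = 50, 8, 4

rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embed_dim))   # token ID -> vector lookup table
W_hidden  = rng.normal(size=(embed_dim, embed_dim))    # stand-in for the transformer layers
W_output  = rng.normal(size=(embed_dim, vocab_size))   # projects hidden states back to the vocabulary

token_ids = np.array([3, 17, 42, 7])        # the input pieces of words as IDs

x = embedding[token_ids]                    # (seq_len, embed_dim): word embeddings
h = np.tanh(x @ W_hidden)                   # the processing layers are mostly matrix multiplication
logits = h @ W_output                       # (seq_len, vocab_size): scores over the vocabulary

next_token_id = int(np.argmax(logits[-1]))  # one token at a time, from the last position
print(next_token_id)
```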
  • Important concepts
    • Tokenization
      • Token to ID (number) mapping
      • The map is kept in memory, since the same one is used for inputs and outputs
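A toy example of such a token↔ID map (the vocabulary here is invented):

```python
# The same token<->ID map is used on the input side and the output side.
vocab = ["hello", "world", "how", "are", "you", "<endoftext>"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

ids = [token_to_id[t] for t in "hello how are you".split()]
print(ids)                              # [0, 2, 3, 4]
print([id_to_token[i] for i in ids])    # map IDs back to tokens for the output
```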
    • Embedding
      • Token ID to vector, where the vectors carry meaning
        • vec("queen") - vec("female") + vec("male") ≈ vec("king")
        • vec("Paris") - vec("France") + vec("Italy") ≈ vec("Rome")
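A toy sketch of that vector arithmetic. The 2-D embeddings below are hand-made so the analogy holds by construction; real models learn such directions from data.

```python
import numpy as np

# Axis 0 ≈ "royalty", axis 1 ≈ "gender"; values chosen by hand for the demo.
vec = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "male":   np.array([0.0,  1.0]),
    "female": np.array([0.0, -1.0]),
}

result = vec["queen"] - vec["female"] + vec["male"]
closest = min(vec, key=lambda w: np.linalg.norm(vec[w] - result))
print(closest)   # king
```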
    • Attention mechanism
      • Given a list of tokens, generate a table (like a heatmap) of how strongly each token attends to every other token
    • Transformers
        • Use the other words in the context to enrich each token with more meaning
          • Done over many stacked layers (iterations)
      • KQV (key, query, value)
        • Each token is projected into a key, a query, and a value vector
        • The attention score between two tokens is the dot product of one token's query with the other token's key
        • Attention scores are turned into weights by normalization (softmax)
        • To transform a token, multiply every value vector in the context by its attention weight and add them up (see the sketch after this list)
      • Multi-head attention
        • Multiple attention heads run in parallel, each with its own key/query/value projections; this adds another dimension to the tensors
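A single-head, scaled dot-product attention sketch in numpy. The projection matrices are random stand-ins for learned weights, and causal masking is left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8                # made-up sizes

X = rng.normal(size=(seq_len, d_model))            # token embeddings in the context

# Learned projections (random here) give each token a query, key and value vector.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Attention scores: dot product of each token's query with every token's key.
scores = Q @ K.T / np.sqrt(d_head)                 # (seq_len, seq_len) "heatmap"

# Normalize scores into weights, then mix the value vectors with those weights.
weights = softmax(scores, axis=-1)
enriched = weights @ V                             # each token enriched by its context

print(weights.shape, enriched.shape)               # (5, 5) (5, 8)
```

Multi-head attention would repeat this with several independent W_q/W_k/W_v sets (the extra head dimension) and concatenate the per-head outputs.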
  • Types of LLMs
    • Translators
      • String of tokens as input and string of tokens as output
      • Encode the input in one pass with transformers, then decode one token at a time until an end-of-text token
    • Generators
      • Given a starting piece, predict the next token.
        • Done recursively to generate long text
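A minimal sketch of that recursive loop, assuming a hypothetical model(ids) function that returns next-token logits; greedy argmax is used here, though sampling is also common. The same loop covers decoding until an end-of-text token.

```python
import numpy as np

def generate(model, prompt_ids, end_id, max_new_tokens=50):
    """Hypothetical `model(ids)` returns a vector of logits for the next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)               # scores for every token in the vocabulary
        next_id = int(np.argmax(logits))  # greedy pick of the next token
        ids.append(next_id)               # the output becomes part of the new input
        if next_id == end_id:             # stop at the end-of-text token
            break
    return ids
```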
    • Classifiers
      • Like a normal feed-forward network stacked on top of the attention layers
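A sketch of such a classification head; `enriched` stands in for the attention-layer output from the earlier sketch, and mean pooling plus a single linear layer are just one assumed, common choice.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_head, num_classes = 5, 8, 3          # made-up sizes

enriched = rng.normal(size=(seq_len, d_head))   # pretend output of the attention layers
W_cls = rng.normal(size=(d_head, num_classes))  # feed-forward classification weights

pooled = enriched.mean(axis=0)                  # pool the whole sequence into one vector
class_logits = pooled @ W_cls                   # ordinary feed-forward layer on top
print(int(np.argmax(class_logits)))             # predicted class index
```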
    • Chatbots
      • Start from a next-word predictor
      • Then fine-tune on questions and answers formatted as one single text
      • The whole conversation is one text that is fed back in recursively to generate one token at a time
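A made-up example of flattening a conversation into a single training text; the "User:"/"Assistant:" markers are an assumed format, and real chat templates differ between models.

```python
# One conversation flattened into one training string.
conversation = (
    "User: What is an LLM?\n"
    "Assistant: A neural network trained to predict the next token.\n"
    "User: How is it trained?\n"
    "Assistant: By next-token prediction on large amounts of text.\n"
)
# Fine-tuning reuses the same next-token objective on texts like this one;
# at inference the growing conversation is fed back in to generate one token at a time.
```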
  • Pretraining
    • Take a long corpus of text
      • [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q]
    • Make a list of input and output windows of the context length
      • [[a, b, c, d], [b, c, d, e]], [[e, f, g, h], [f, g, h, i]]
    • Get [X, y] pairs for neural network
      • [[a, *, *, *], b], [[a, b, *, *], c], ....
        • * is a padding/mask so that attention stays causal
    • Calculate the loss with something like categorical cross-entropy
    • Backpropagate the loss and adjust the weights
    • The model can then predict the next word given a list of words
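A short sketch of building those training pairs with a sliding window; the letters stand in for real tokens, and the causal mask and loss are only noted in comments.

```python
# Corpus: [a, b, c, ..., q] as in the notes above.
corpus = list("abcdefghijklmnopq")
context_len = 4

pairs = []
for i in range(0, len(corpus) - context_len, context_len):
    x = corpus[i : i + context_len]           # e.g. ['a', 'b', 'c', 'd']
    y = corpus[i + 1 : i + context_len + 1]   # shifted by one: ['b', 'c', 'd', 'e']
    pairs.append((x, y))

print(pairs[0])   # (['a', 'b', 'c', 'd'], ['b', 'c', 'd', 'e'])
print(pairs[1])   # (['e', 'f', 'g', 'h'], ['f', 'g', 'h', 'i'])

# Inside the model, a causal mask plays the role of the "*" padding: position t
# may only attend to positions <= t. The training loss is categorical
# cross-entropy, roughly -log(probability assigned to the true next token).
```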
  • Fine-tuning for downstream tasks
  • Famous models
    • Word2Vec
    • BART
    • GPT
  • Older ideas
    • RNN
    • LSTM