Large Language Models (LLMs)
- — What is a large language model?
- — AI models that process language, trained with billions of parameters
- — Transformers with self-attention are the most successful technique
- — How do they work?
- — Basically matrix multiplication
- — Turn tokens (split words) into vectors that have meaning in a high-dimensional space
- — Process those vectors through different neural network layers
- — The final layer's logits are turned into output text
- — The model has to output one token at a time (sketch below)
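A minimal numpy sketch of that pipeline, with a made-up vocabulary and random weights standing in for a trained model: tokenize, embed, run (here omitted) transformer layers, compute logits, and emit exactly one token.

```python
# Sketch of the forward pass: tokens -> ids -> vectors -> layers -> logits -> one token.
# Vocabulary, sizes, and weights are made up for illustration; a real LLM has learned weights.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["<end>", "the", "cat", "sat", "on", "mat"]      # toy vocabulary
token_to_id = {tok: i for i, tok in enumerate(vocab)}    # tokenization map
d_model = 8                                              # embedding size

W_embed = rng.normal(size=(len(vocab), d_model))         # token id -> vector
W_unembed = rng.normal(size=(d_model, len(vocab)))       # vector -> logits over the vocabulary

def forward(tokens):
    ids = [token_to_id[t] for t in tokens]               # tokenize
    x = W_embed[ids]                                     # embed: (seq_len, d_model)
    # ... transformer layers would transform x here (attention + feed-forward) ...
    logits = x[-1] @ W_unembed                           # logits for the next token
    return vocab[int(np.argmax(logits))]                 # pick the most likely token

print(forward(["the", "cat", "sat"]))                    # emits exactly one token
```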
- — Pretraining and fine-tuning
- — Trained to predict the next word given a list of words
- — Then some of the layers can be changed and trained on a smaller dataset for different use cases
- — Can be part of a multimodal neural network
- — Important concepts
- — Tokenization
- — Token to ID (number) mapping
- — The map is kept in memory, since the same one is used for both input and output (see the sketch below)
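A toy sketch of that shared map; real tokenizers (e.g. BPE) split text into subwords rather than whole words, but the single lookup table used for both encoding inputs and decoding outputs is the same idea.

```python
# One token <-> id table, used in both directions.
vocab = ["<end>", "the", "cat", "sat", "on", "the", "mat"]   # toy word list (with a duplicate)
token_to_id = {}
for tok in vocab:
    token_to_id.setdefault(tok, len(token_to_id))            # each distinct token gets one id
id_to_token = {i: t for t, i in token_to_id.items()}

def encode(tokens):                                          # used on the input side
    return [token_to_id[t] for t in tokens]

def decode(ids):                                             # the same map, used on the output side
    return [id_to_token[i] for i in ids]

ids = encode(["the", "cat", "sat"])
print(ids, decode(ids))                                      # [1, 2, 3] ['the', 'cat', 'sat']
```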
- — Embedding
- — Maps a token ID to a vector, where the vectors carry meaning (sketch below)
- — vec("queen") - vec("female") + vec("male") ≈ vec("king")
- — vec("Paris") - vec("France") + vec("Italy") ≈ vec("Rome")
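A sketch of embedding lookup and the vector-arithmetic analogy. The embedding matrix below is random, so it only shows the mechanics; the analogy actually working (queen - female + male landing near king) requires trained embeddings such as word2vec.

```python
# Embedding lookup: token id -> dense vector, plus nearest-neighbour vector arithmetic.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "male", "female", "Paris", "France", "Italy", "Rome"]
token_to_id = {t: i for i, t in enumerate(vocab)}
E = rng.normal(size=(len(vocab), 16))            # embedding matrix: one row per token (random here)

def vec(tok):
    return E[token_to_id[tok]]

query = vec("queen") - vec("female") + vec("male")

# nearest neighbour by cosine similarity
sims = E @ query / (np.linalg.norm(E, axis=1) * np.linalg.norm(query))
print(vocab[int(np.argmax(sims))])               # with trained embeddings (and the input words
                                                 # excluded) this tends to come out as "king"
```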
- — Attention mechanism
- — Given a list of tokens, generates a table (like a heatmap) of how much each token should attend to every other token
- — Transformers
- — Use the other words in the context to enrich each token with more meaning
- — Done over many iterations (layers)
- — KQV (key, query, value)
- — For each token, a key, a query, and a value vector are computed through learned transformations
- — To find how strongly one token attends to another, take the dot product of the first token's query vector with the other token's key vector
- — Attention scores are turned into weights by normalization (softmax)
- — To transform a token, multiply each value vector in the context by its attention weight and add them up (see the sketch below)
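A single-head self-attention sketch with arbitrary sizes and random weights: per-token query/key/value projections, dot-product scores, softmax normalization into weights, then the weighted sum of value vectors. The division by the square root of the head size is the standard scaling used in transformers.

```python
# Single-head self-attention over a toy sequence.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8

X = rng.normal(size=(seq_len, d_model))          # token embeddings for the context
W_q = rng.normal(size=(d_model, d_head))         # learned query projection
W_k = rng.normal(size=(d_model, d_head))         # learned key projection
W_v = rng.normal(size=(d_model, d_head))         # learned value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v              # per-token query / key / value vectors

scores = Q @ K.T / np.sqrt(d_head)               # (seq_len, seq_len) table of attention scores
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row of scores becomes weights

out = weights @ V                                # each token = weighted sum of the value vectors
print(weights.round(2))                          # the heatmap-like table of who attends to whom
print(out.shape)                                 # (4, 8): one enriched vector per token
```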
- — Multi-head attention
- — Multiple attention heads run in parallel, so another (head) dimension is added (sketch below)
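A sketch of that extra dimension: the same projections as above, split into several heads that attend independently and are concatenated back together afterwards. Sizes are arbitrary.

```python
# Multi-head attention: add a head dimension, attend per head, then recombine.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

def split_heads(M):                                    # (seq, d_model) -> (heads, seq, d_head)
    return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, seq, seq): one score table per head
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)         # softmax per head

out = (weights @ V).transpose(1, 0, 2).reshape(seq_len, d_model)   # concatenate the heads again
print(out.shape)                                       # (4, 8)
```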
- — Types of LLMs
- — Translators
- — String of tokens as input and string of tokens as output
- — Encode the input in one step with transformers, then decode until an end-of-text token
- — Generators
- — Given a starting piece, predict the next token.
- — Done recursively to generate long text
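A sketch of that recursive loop; `predict_next_token` below is a stand-in for a real model's forward pass, and the end-of-text check is where a decoder stops.

```python
# Autoregressive generation: predict one token, append it, feed the longer sequence back in.
def predict_next_token(tokens):
    canned = {"the": "cat", "cat": "sat", "sat": "<end>"}   # fake "model" for illustration
    return canned.get(tokens[-1], "<end>")

def generate(prompt_tokens, max_new_tokens=20, end_token="<end>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = predict_next_token(tokens)     # one forward pass -> exactly one new token
        if nxt == end_token:                 # stop when the end-of-text token appears
            break
        tokens.append(nxt)                   # the output becomes part of the next input
    return tokens

print(generate(["the"]))                     # ['the', 'cat', 'sat']
```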
- — Classifiers
- — A normal feed-forward network (classification head) on top of the transformer output (sketch below)
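A sketch of such a head, assuming we already have a transformer output vector for the sequence: one linear layer plus a softmax over classes, with made-up sizes and weights.

```python
# Feed-forward classification head on top of a transformer output vector.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_classes = 8, 3

hidden = rng.normal(size=(d_model,))            # e.g. the transformer output for the last token
W = rng.normal(size=(d_model, n_classes))       # classification layer weights
b = np.zeros(n_classes)

logits = hidden @ W + b
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the classes
print(probs)                                    # one probability per class
```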
- — Chatbots
- — Make a next-word predictor
- — Then fine-tune with question and answer formatted as one single text
- — The whole conversation is one piece of text that is fed in recursively to generate one token at a time (see the formatting sketch below)
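A sketch of flattening a conversation into one training text. The role markers below are invented for illustration; real chat models each define their own template, but the principle of one concatenated token sequence is the same.

```python
# Turn a question-and-answer conversation into a single flat text for next-token training.
conversation = [
    ("user", "What is a transformer?"),
    ("assistant", "A neural network built around self-attention."),
    ("user", "What does attention compute?"),
    ("assistant", "Weights saying how much each token should use the other tokens."),
]

text = ""
for role, utterance in conversation:
    text += f"<|{role}|> {utterance}\n"          # made-up role markers
text += "<|end|>"

print(text)                                      # this one string is what gets tokenized and trained on
```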
- — Pretraining
- — Take a long corpus of text
- — [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q]
- — Make a list of input and output windows of the context length
- — [[a, b, c, d], [b, c, d, e]], [[e, f, g, h], [f, g, h, i]]
- — Get [X, y] pairs for neural network
- — [[a, *, *, *], b], [[a, b, *, *], c], ....
- — * is a padding (mask) to have causal attention
- — Calculate loss with something like categorical cross-entropy
- — Backpropagate the loss and adjust the weights
- — The model can then predict the next word given a list of words (see the data-preparation sketch below)
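A sketch of that data preparation and loss: slide a context-length window over the corpus, take the same window shifted by one token as the target, and score the model's next-token probability with cross-entropy. The probabilities below are made up for an untrained, uniform model.

```python
# Build (input, target) windows from a corpus, then compute a toy cross-entropy loss.
import numpy as np

corpus = list("abcdefghijklmnopq")                 # stand-in for a long tokenized text
context_len = 4

pairs = []
for i in range(0, len(corpus) - context_len, context_len):
    x = corpus[i : i + context_len]                # e.g. ['a', 'b', 'c', 'd']
    y = corpus[i + 1 : i + context_len + 1]        # the same window shifted by one: ['b', 'c', 'd', 'e']
    pairs.append((x, y))
print(pairs[:2])                                   # matches the example windows above

# Categorical cross-entropy for one position: -log(probability given to the correct next token).
vocab_size = 26
probs = np.full(vocab_size, 1.0 / vocab_size)      # an untrained model: uniform over the vocabulary
correct_id = ord("e") - ord("a")                   # the true next token after [a, b, c, d]
loss = -np.log(probs[correct_id])
print(loss)                                        # ~3.26 = log(26); training pushes this down
```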
- — Finetuning for
- — Famous models
- — Older ideas