Large Language Models (LLMs)
What is a Large Language Model?
An LLM is a type of AI model that excels at understanding and generating human language. LLMs are trained on vast amounts of text data, allowing them to learn patterns, structure, and even nuance in language. These models typically consist of many millions, and often billions, of parameters.
Most LLMs nowadays are built on the Transformer architecture, a deep learning architecture based on the “Attention” mechanism, which has gained significant interest since the release of BERT from Google in 2018.
Transformer Types
The original Transformer architecture consists of an encoder and a decoder. There are three main types of Transformers:
1. Encoders
An encoder-based Transformer takes text (or other data) as input and outputs a dense representation (or embedding) of that text.
- Example: BERT from Google
- Use Cases: Text classification, semantic search, Named Entity Recognition
- Typical Size: Millions of parameters
2. Decoders
A decoder-based Transformer focuses on generating new tokens to complete a sequence, one token at a time.
- Example: Llama from Meta
- Use Cases: Text generation, chatbots, code generation
- Typical Size: Billions (in the US sense, i.e., 10^9) of parameters
3. Seq2Seq (Encoder–Decoder)
A sequence-to-sequence Transformer combines an encoder and a decoder. The encoder first processes the input sequence into a context representation, then the decoder generates an output sequence.
- Example: T5, BART
- Use Cases: Translation, Summarization, Paraphrasing
- Typical Size: Millions of parameters
Popular LLM Models
Although LLMs come in various forms, they are typically decoder-based models with billions of parameters. Here are some of the most well-known LLMs:
| Model | Provider |
|---|---|
| Deepseek-R1 | DeepSeek |
| GPT-4 | OpenAI |
| Llama 3 | Meta (Facebook AI Research) |
| SmolLM2 | Hugging Face |
| Gemma | Google |
| Mistral | Mistral AI |
Understanding Tokens
The underlying principle of an LLM is simple yet highly effective: its objective is to predict the next token, given a sequence of previous tokens. A “token” is the unit of information an LLM works with. While you can think of a “token” as similar to a “word”, LLMs don’t use whole words for efficiency reasons.
For example:
- English has an estimated 600,000 words
- An LLM's vocabulary (Llama 2's, for example) typically contains around 32,000 tokens
- Tokens are often sub-word units that can be combined (e.g., “interest” + “ing” = “interesting”), as the tokenizer sketch below illustrates
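To make this concrete, here is a minimal sketch using the Hugging Face `transformers` tokenizer API. The checkpoint name is just an example (SmolLM2 appears in the model list above); any causal-LM tokenizer you have access to behaves similarly.

```python
# Minimal tokenization sketch with the `transformers` library.
# "HuggingFaceTB/SmolLM2-135M" is an example checkpoint, not the only option.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

print(tokenizer.vocab_size)                                    # size of the token vocabulary
print(tokenizer.tokenize("Tokenization is interesting!"))      # sub-word pieces, not whole words
print(tokenizer("Tokenization is interesting!")["input_ids"])  # the integer ids the model actually sees
```

Words like “interesting” are typically split into smaller pieces the tokenizer has seen often, which is how a vocabulary of roughly 32,000 tokens can cover far more than 32,000 words.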
Special Tokens
Each LLM has specific special tokens that are used to structure its generation. These tokens indicate things like the start or end of a sequence, message, or response. Here are some examples:
| Model | Provider | EOS Token | Functionality |
|---|---|---|---|
| GPT-4 | OpenAI | <\|endoftext\|> | End of message text |
| Llama 3 | Meta (Facebook AI Research) | <\|eot_id\|> | End of sequence |
| Deepseek-R1 | DeepSeek | <\|end_of_sentence\|> | End of message text |
| SmolLM2 | Hugging Face | <\|im_end\|> | End of instruction or message |
| Gemma | Google | <end_of_turn> | End of conversation turn |
| Mistral | Mistral AI | </s> | End of sequence |
| BERT | Google | [SEP] | End of segment |
| T5 | Google | </s> | End of sequence |
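If you want to see these values for a given model, the tokenizer exposes them directly. A minimal sketch, assuming the `transformers` library and an example SmolLM2 instruct checkpoint:

```python
# Inspecting a model's special tokens with `transformers`.
# The checkpoint name is an example; the exact tokens vary from model to model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

print(tokenizer.eos_token)            # the End of Sequence token as a string
print(tokenizer.eos_token_id)         # its integer id in the vocabulary
print(tokenizer.special_tokens_map)   # bos/eos/unk/pad and any additional special tokens
```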
Next Token Prediction
LLMs are autoregressive, meaning that the output from one pass becomes the input for the next one. This process continues until the model predicts an EOS (End of Sequence) token. During each decoding loop:
- The input text is tokenized
- The model computes a representation of the sequence
- The model outputs scores ranking the likelihood of each token in its vocabulary
- A token is selected based on these scores using various decoding strategies
The simplest decoding strategy, known as greedy decoding, is to always select the token with the maximum score, though more sophisticated strategies (such as sampling or beam search) exist for different use cases.
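Below is a minimal greedy-decoding sketch with `transformers` and PyTorch that mirrors the loop described above. The checkpoint name is an example, and in practice `model.generate()` wraps this loop (including EOS handling) for you.

```python
# Manual greedy decoding: score every vocabulary token, keep the highest-scoring one,
# append it to the input, and repeat until the EOS token appears.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HuggingFaceTB/SmolLM2-135M"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                # generate at most 20 new tokens
        logits = model(input_ids).logits               # (batch, sequence, vocab) scores
        next_id = logits[:, -1, :].argmax(dim=-1)      # greedy: pick the top-scoring token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:   # stop at the End of Sequence token
            break

print(tokenizer.decode(input_ids[0]))
```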
LLM Architectures in Detail
Large Language Models leverage various neural network architectures, each with unique strengths for different applications. Here’s a deeper look at the major architectures powering modern LLMs:
Recurrent Neural Networks (RNNs)
RNNs were among the first architectures successfully applied to sequential data like text:
- Mechanism: Uses a feedback loop to maintain information about previous inputs, processing data sequentially (sketched in the code after this list)
- Variants: Simple RNN, LSTM (Long Short-Term Memory), GRU (Gated Recurrent Units)
- Strengths: Effective for sequential processing and pattern recognition in time-series data
- Limitations: Struggles with long-range dependencies due to vanishing gradient problems
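As a rough illustration of that sequential mechanism (a toy model, not any specific LLM), here is a minimal PyTorch LSTM sketch over made-up token ids:

```python
# Minimal LSTM sketch: token ids are embedded, then processed one step at a time,
# with the hidden state carrying information forward through the sequence.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 32, 64
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 12))   # a toy sequence of 12 token ids
outputs, (h_n, c_n) = lstm(embedding(token_ids))

# outputs: (1, 12, 64) hidden state at every step; h_n: final hidden state summarizing the sequence
print(outputs.shape, h_n.shape)
```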
Transformers: The Current Standard
The Transformer architecture revolutionized NLP by enabling parallel processing and effective modeling of long-range dependencies:
- Key Innovation: Self-attention mechanism that weighs the importance of different words in relation to each other (a minimal sketch follows this list)
- Processing: Processes entire sequences simultaneously rather than sequentially
- Components: Multi-head attention, feed-forward networks, residual connections, and normalization layers
- Parallelization: Enables efficient training on massive datasets by processing tokens in parallel
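The self-attention computation itself is compact. Here is a minimal single-head, unmasked sketch in PyTorch with toy dimensions (real models add multiple heads, masking, and learned output projections):

```python
# Scaled dot-product self-attention: every position attends to every other position.
import math
import torch
import torch.nn as nn

d_model = 64
seq = torch.randn(1, 10, d_model)            # (batch, sequence length, hidden size)

w_q, w_k, w_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
q, k, v = w_q(seq), w_k(seq), w_v(seq)

scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)  # (1, 10, 10) similarity scores
weights = torch.softmax(scores, dim=-1)                 # each row sums to 1: how much each token attends to the others
output = weights @ v                                    # (1, 10, 64) context-aware representation per token
```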
Encoder-Only vs. Decoder-Only Architectures
LLMs commonly use variations of the transformer architecture:
| Architecture Type | Example Models | Primary Use Cases | Characteristics |
|---|---|---|---|
| Encoder-Only | BERT, RoBERTa | Text classification, sentiment analysis, NER | Bidirectional attention, understanding context |
| Decoder-Only | GPT, Llama, Mistral | Text generation, creative writing, code completion | Autoregressive generation, unidirectional attention |
| Encoder-Decoder | T5, BART | Translation, summarization, paraphrasing | Combines understanding and generation capabilities |
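A small sketch contrasting the two most common variants through the `transformers` pipeline API (the model names are examples):

```python
# Encoder-only vs. decoder-only, seen through two ready-made pipelines.
from transformers import pipeline

# Encoder-only (BERT-style): bidirectional context, suited to filling in or classifying text.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Paris is the [MASK] of France.")[0]["token_str"])

# Decoder-only (GPT/Llama-style): autoregressive, suited to continuing text.
generate = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-135M")
print(generate("Paris is the capital of", max_new_tokens=10)[0]["generated_text"])
```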
Generative Adversarial Networks (GANs)
While less common for pure language models, GANs have influenced certain aspects of generative AI:
- Architecture: Two neural networks (generator and discriminator) compete in a game-theoretic scenario (sketched in the code after this list)
- Training Process: Generator creates outputs while discriminator evaluates authenticity
- Applications: Primarily used in image generation but has applications in text style transfer and data augmentation
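For intuition only, here is a minimal PyTorch sketch of one adversarial training step on toy vectors (not a language model; the shapes and networks are arbitrary):

```python
# One GAN training step: the discriminator learns to tell real from fake,
# the generator learns to fool the discriminator.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 8
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(64, data_dim)               # stand-in for a batch of real data
fake = generator(torch.randn(64, latent_dim))

# Discriminator step: push real scores toward 1 and fake scores toward 0.
d_loss = bce(discriminator(real), torch.ones(64, 1)) + bce(discriminator(fake.detach()), torch.zeros(64, 1))
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# Generator step: try to make the discriminator score fakes as real.
g_loss = bce(discriminator(fake), torch.ones(64, 1))
g_opt.zero_grad()
g_loss.backward()
g_opt.step()
```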
Variational Autoencoders (VAEs)
VAEs provide another approach to generative modeling:
- Design: Encoder-decoder framework that compresses input data into a latent space (sketched in the code after this list)
- Probabilistic Approach: Models data using probability distributions in latent space
- Applications: Text generation with controlled attributes, semantic manipulation of text
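A minimal PyTorch sketch of the idea on toy vectors, showing the reparameterization trick and the two terms of the training objective:

```python
# Minimal VAE step: encode to a latent distribution, sample with the reparameterization
# trick, decode, and combine reconstruction loss with a KL penalty toward the prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

data_dim, latent_dim = 8, 2
encoder = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 2 * latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))

x = torch.randn(64, data_dim)                   # stand-in for a data batch
mu, logvar = encoder(x).chunk(2, dim=-1)        # parameters of q(z|x)

z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
x_hat = decoder(z)

recon = F.mse_loss(x_hat, x, reduction="sum")                  # reconstruction term
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL to the standard normal prior
loss = recon + kl
loss.backward()
```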
Diffusion Models
Emerging as powerful generative models primarily for images but with growing text applications:
- Process: Gradually introduces noise to data, then learns to reverse this process
- Training: Models learn to denoise or reconstruct distorted examples (sketched in the code after this list)
- Text Applications: Being explored for controlled text generation and editing
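A minimal PyTorch sketch of that denoising objective on toy vectors, assuming a simple linear noise schedule:

```python
# One diffusion-style training step: noise the data at a random timestep,
# then train a small network to predict the noise that was added.
import torch
import torch.nn as nn
import torch.nn.functional as F

data_dim, num_steps = 8, 100
betas = torch.linspace(1e-4, 0.02, num_steps)         # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(data_dim + 1, 64), nn.ReLU(), nn.Linear(64, data_dim))

x0 = torch.randn(32, data_dim)                        # stand-in for clean data
t = torch.randint(0, num_steps, (32,))                # random timestep per example
noise = torch.randn_like(x0)

a_bar = alphas_bar[t].unsqueeze(-1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward (noising) process

t_feat = (t.float() / num_steps).unsqueeze(-1)        # crude timestep conditioning
pred_noise = denoiser(torch.cat([x_t, t_feat], dim=-1))
loss = F.mse_loss(pred_noise, noise)
loss.backward()
```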
Architecture Comparison
| Architecture | Processing Approach | Training Method | Strengths | Limitations |
|---|---|---|---|---|
| RNNs | Sequential | Backpropagation through time | Memory efficient | Limited context window |
| Transformers | Parallel | Self-attention | Captures long-range dependencies | Computationally intensive |
| GANs | Competitive | Adversarial | High-quality outputs | Training instability |
| VAEs | Probabilistic | Variational inference | Controls generation attributes | Less precise than transformers |
| Diffusion | Iterative denoising | Denoising score matching | High-quality generation | Computationally expensive |
LLM Training Approaches
Modern LLMs typically follow a multi-stage training process:
- Pretraining: Learning language patterns from massive text corpora
- Supervised Fine-tuning: Teaching models to follow instructions with labeled examples
- Reinforcement Learning from Human Feedback (RLHF): Refining model outputs based on human preferences
This progression has led to increasingly capable models that can understand context, generate coherent text, and follow nuanced instructions.
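The objective behind the first two stages is still next-token prediction. Here is a minimal sketch with `transformers`, where passing `labels` makes the library compute the shifted cross-entropy loss; the checkpoint and the toy example text are only placeholders.

```python
# One supervised training step on a causal LM: the model predicts each next token,
# and the loss is cross-entropy against the (internally shifted) labels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HuggingFaceTB/SmolLM2-135M"   # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

batch = tokenizer("Instruction: say hello.\nResponse: Hello!", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])   # loss is computed from the shifted labels

outputs.loss.backward()
optimizer.step()
# RLHF then goes further, adjusting the model with a reward signal derived from human preferences.
```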