Large Language Models (LLMs)
What is a Large Language Model?
An LLM is a type of AI model that excels at understanding and generating human language. LLMs are trained on vast amounts of text data, allowing them to learn patterns, structure, and even nuance in language. These models typically consist of many millions, and often billions, of parameters.
Most LLMs nowadays are built on the Transformer architecture, a deep learning architecture based on the “Attention” mechanism, which has gained significant interest since the release of BERT from Google in 2018.
Transformer Types
The original Transformer architecture consists of an encoder and a decoder. There are three main types of Transformers:
1. Encoders
An encoder-based Transformer takes text (or other data) as input and outputs a dense representation (or embedding) of that text.
- Example: BERT from Google
- Use Cases: Text classification, semantic search, Named Entity Recognition
- Typical Size: Millions of parameters
2. Decoders
A decoder-based Transformer focuses on generating new tokens to complete a sequence, one token at a time.
- Example: Llama from Meta
- Use Cases: Text generation, chatbots, code generation
- Typical Size: Billions (in the US sense, i.e., 10^9) of parameters
3. Seq2Seq (Encoder–Decoder)
A sequence-to-sequence Transformer combines an encoder and a decoder. The encoder first processes the input sequence into a context representation, then the decoder generates an output sequence.
- Example: T5, BART
- Use Cases: Translation, Summarization, Paraphrasing
- Typical Size: Millions of parameters
Popular LLM Models
Although LLMs come in various forms, they are typically decoder-based models with billions of parameters. Here are some of the most well-known LLMs:
| Model | Provider |
|---|---|
| Deepseek-R1 | DeepSeek |
| GPT-4 | OpenAI |
| Llama 3 | Meta (Facebook AI Research) |
| SmolLM2 | Hugging Face |
| Gemma | Google |
| Mistral | Mistral AI |
Understanding Tokens
The underlying principle of an LLM is simple yet highly effective: its objective is to predict the next token, given a sequence of previous tokens. A “token” is the unit of information an LLM works with. While you can think of a “token” as similar to a “word”, LLMs don’t use whole words for efficiency reasons.
For example:
- English has an estimated 600,000 words
- An LLM's vocabulary (Llama 2's, for example) typically contains around 32,000 tokens
- Tokens are often sub-word units that can be combined (e.g., “interest” + “ing” = “interesting”), as the tokenizer sketch below illustrates
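To make this concrete, here is a minimal sketch using the Hugging Face `transformers` tokenizer API. The checkpoint name is just an example (SmolLM2 appears in the model list above); any causal-LM tokenizer you have access to behaves similarly.

```python
# Minimal tokenization sketch with the `transformers` library.
# "HuggingFaceTB/SmolLM2-135M" is an example checkpoint, not the only option.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

print(tokenizer.vocab_size)                                    # size of the token vocabulary
print(tokenizer.tokenize("Tokenization is interesting!"))      # sub-word pieces, not whole words
print(tokenizer("Tokenization is interesting!")["input_ids"])  # the integer ids the model actually sees
```

Words like “interesting” are typically split into smaller pieces the tokenizer has seen often, which is how a vocabulary of roughly 32,000 tokens can cover far more than 32,000 words.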
Special Tokens
Each LLM has specific special tokens that are used to structure its generation. These tokens indicate things like the start or end of a sequence, message, or response. Here are some examples:
| Model | Provider | EOS Token | Functionality |
|---|---|---|---|
| GPT-4 | OpenAI | <\|endoftext\|> | End of message text |
| Llama 3 | Meta (Facebook AI Research) | <\|eot_id\|> | End of sequence |
| Deepseek-R1 | DeepSeek | <\|end_of_sentence\|> | End of message text |
| SmolLM2 | Hugging Face | <\|im_end\|> | End of instruction or message |
| Gemma | Google | <end_of_turn> | End of conversation turn |
| Mistral | Mistral AI | </s> | End of sequence |
| BERT | Google | [SEP] | End of segment |
| T5 | Google | </s> | End of sequence |
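If you want to see these values for a given model, the tokenizer exposes them directly. A minimal sketch, assuming the `transformers` library and an example SmolLM2 instruct checkpoint:

```python
# Inspecting a model's special tokens with `transformers`.
# The checkpoint name is an example; the exact tokens vary from model to model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

print(tokenizer.eos_token)            # the End of Sequence token as a string
print(tokenizer.eos_token_id)         # its integer id in the vocabulary
print(tokenizer.special_tokens_map)   # bos/eos/unk/pad and any additional special tokens
```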
Next Token Prediction
LLMs are autoregressive, meaning that the output from one pass becomes the input for the next one. This process continues until the model predicts an EOS (End of Sequence) token. During each decoding loop:
- The input text is tokenized
- The model computes a representation of the sequence
- The model outputs scores ranking the likelihood of each token in its vocabulary
- A token is selected based on these scores using various decoding strategies
The simplest decoding strategy, known as greedy decoding, is to always select the token with the maximum score, though more sophisticated strategies (such as sampling or beam search) exist for different use cases.
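Below is a minimal greedy-decoding sketch with `transformers` and PyTorch that mirrors the loop described above. The checkpoint name is an example, and in practice `model.generate()` wraps this loop (including EOS handling) for you.

```python
# Manual greedy decoding: score every vocabulary token, keep the highest-scoring one,
# append it to the input, and repeat until the EOS token appears.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HuggingFaceTB/SmolLM2-135M"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                # generate at most 20 new tokens
        logits = model(input_ids).logits               # (batch, sequence, vocab) scores
        next_id = logits[:, -1, :].argmax(dim=-1)      # greedy: pick the top-scoring token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:   # stop at the End of Sequence token
            break

print(tokenizer.decode(input_ids[0]))
```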
LLM Architectures in Detail
Large Language Models leverage various neural network architectures, each with unique strengths for different applications. Here’s a deeper look at the major architectures powering modern LLMs:
Recurrent Neural Networks (RNNs)
RNNs were among the first architectures successfully applied to sequential data like text:
- Mechanism: Uses a feedback loop to maintain information about previous inputs, processing data sequentially (sketched in the code after this list)
- Variants: Simple RNN, LSTM (Long Short-Term Memory), GRU (Gated Recurrent Units)
- Strengths: Effective for sequential processing and pattern recognition in time-series data
- Limitations: Struggles with long-range dependencies due to vanishing gradient problems
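As a rough illustration of that sequential mechanism (a toy model, not any specific LLM), here is a minimal PyTorch LSTM sketch over made-up token ids:

```python
# Minimal LSTM sketch: token ids are embedded, then processed one step at a time,
# with the hidden state carrying information forward through the sequence.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 32, 64
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 12))   # a toy sequence of 12 token ids
outputs, (h_n, c_n) = lstm(embedding(token_ids))

# outputs: (1, 12, 64) hidden state at every step; h_n: final hidden state summarizing the sequence
print(outputs.shape, h_n.shape)
```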
Transformers: The Current Standard
The Transformer architecture revolutionized NLP by enabling parallel processing and effective modeling of long-range dependencies:
- Key Innovation: Self-attention mechanism that weighs the importance of different words in relation to each other (a minimal sketch follows this list)
- Processing: Processes entire sequences simultaneously rather than sequentially
- Components: Multi-head attention, feed-forward networks, residual connections, and normalization layers
- Parallelization: Enables efficient training on massive datasets by processing tokens in parallel
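The self-attention computation itself is compact. Here is a minimal single-head, unmasked sketch in PyTorch with toy dimensions (real models add multiple heads, masking, and learned output projections):

```python
# Scaled dot-product self-attention: every position attends to every other position.
import math
import torch
import torch.nn as nn

d_model = 64
seq = torch.randn(1, 10, d_model)            # (batch, sequence length, hidden size)

w_q, w_k, w_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
q, k, v = w_q(seq), w_k(seq), w_v(seq)

scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)  # (1, 10, 10) similarity scores
weights = torch.softmax(scores, dim=-1)                 # each row sums to 1: how much each token attends to the others
output = weights @ v                                    # (1, 10, 64) context-aware representation per token
```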
Encoder-Only vs. Decoder-Only Architectures
LLMs commonly use variations of the transformer architecture:
| Architecture Type | Example Models | Primary Use Cases | Characteristics |
|---|---|---|---|
| Encoder-Only | BERT, RoBERTa | Text classification, sentiment analysis, NER | Bidirectional attention, understanding context |
| Decoder-Only | GPT, Llama, Mistral | Text generation, creative writing, code completion | Autoregressive generation, unidirectional attention |
| Encoder-Decoder | T5, BART | Translation, summarization, paraphrasing | Combines understanding and generation capabilities |
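A small sketch contrasting the two most common variants through the `transformers` pipeline API (the model names are examples):

```python
# Encoder-only vs. decoder-only, seen through two ready-made pipelines.
from transformers import pipeline

# Encoder-only (BERT-style): bidirectional context, suited to filling in or classifying text.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Paris is the [MASK] of France.")[0]["token_str"])

# Decoder-only (GPT/Llama-style): autoregressive, suited to continuing text.
generate = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-135M")
print(generate("Paris is the capital of", max_new_tokens=10)[0]["generated_text"])
```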
Generative Adversarial Networks (GANs)
While less common for pure language models, GANs have influenced certain aspects of generative AI:
- Architecture: Two neural networks (generator and discriminator) compete in a game-theoretic scenario (sketched in the code after this list)
- Training Process: Generator creates outputs while discriminator evaluates authenticity
- Applications: Primarily used in image generation but has applications in text style transfer and data augmentation
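For intuition only, here is a minimal PyTorch sketch of one adversarial training step on toy vectors (not a language model; the shapes and networks are arbitrary):

```python
# One GAN training step: the discriminator learns to tell real from fake,
# the generator learns to fool the discriminator.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 8
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(64, data_dim)               # stand-in for a batch of real data
fake = generator(torch.randn(64, latent_dim))

# Discriminator step: push real scores toward 1 and fake scores toward 0.
d_loss = bce(discriminator(real), torch.ones(64, 1)) + bce(discriminator(fake.detach()), torch.zeros(64, 1))
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# Generator step: try to make the discriminator score fakes as real.
g_loss = bce(discriminator(fake), torch.ones(64, 1))
g_opt.zero_grad()
g_loss.backward()
g_opt.step()
```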
Variational Autoencoders (VAEs)
VAEs provide another approach to generative modeling:
- Design: Encoder-decoder framework that compresses input data into a latent space (sketched in the code after this list)
- Probabilistic Approach: Models data using probability distributions in latent space
- Applications: Text generation with controlled attributes, semantic manipulation of text
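A minimal PyTorch sketch of the idea on toy vectors, showing the reparameterization trick and the two terms of the training objective:

```python
# Minimal VAE step: encode to a latent distribution, sample with the reparameterization
# trick, decode, and combine reconstruction loss with a KL penalty toward the prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

data_dim, latent_dim = 8, 2
encoder = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 2 * latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))

x = torch.randn(64, data_dim)                   # stand-in for a data batch
mu, logvar = encoder(x).chunk(2, dim=-1)        # parameters of q(z|x)

z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
x_hat = decoder(z)

recon = F.mse_loss(x_hat, x, reduction="sum")                  # reconstruction term
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL to the standard normal prior
loss = recon + kl
loss.backward()
```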
Diffusion Models
Emerging as powerful generative models primarily for images but with growing text applications:
- Process: Gradually introduces noise to data, then learns to reverse this process
- Training: Models learn to denoise or reconstruct distorted examples (sketched in the code after this list)
- Text Applications: Being explored for controlled text generation and editing
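A minimal PyTorch sketch of that denoising objective on toy vectors, assuming a simple linear noise schedule:

```python
# One diffusion-style training step: noise the data at a random timestep,
# then train a small network to predict the noise that was added.
import torch
import torch.nn as nn
import torch.nn.functional as F

data_dim, num_steps = 8, 100
betas = torch.linspace(1e-4, 0.02, num_steps)         # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(data_dim + 1, 64), nn.ReLU(), nn.Linear(64, data_dim))

x0 = torch.randn(32, data_dim)                        # stand-in for clean data
t = torch.randint(0, num_steps, (32,))                # random timestep per example
noise = torch.randn_like(x0)

a_bar = alphas_bar[t].unsqueeze(-1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward (noising) process

t_feat = (t.float() / num_steps).unsqueeze(-1)        # crude timestep conditioning
pred_noise = denoiser(torch.cat([x_t, t_feat], dim=-1))
loss = F.mse_loss(pred_noise, noise)
loss.backward()
```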
Architecture Comparison
| Architecture | Processing Approach | Training Method | Strengths | Limitations |
|---|---|---|---|---|
| RNNs | Sequential | Backpropagation through time | Memory efficient | Limited context window |
| Transformers | Parallel | Self-attention | Captures long-range dependencies | Computationally intensive |
| GANs | Competitive | Adversarial | High-quality outputs | Training instability |
| VAEs | Probabilistic | Variational inference | Controls generation attributes | Less precise than transformers |
| Diffusion | Iterative denoising | Denoising score matching | High-quality generation | Computationally expensive |
LLM Training Approaches
Modern LLMs typically follow a multi-stage training process:
- Pretraining: Learning language patterns from massive text corpora
- Supervised Fine-tuning: Teaching models to follow instructions with labeled examples
- Reinforcement Learning from Human Feedback (RLHF): Refining model outputs based on human preferences
This progression has led to increasingly capable models that can understand context, generate coherent text, and follow nuanced instructions.
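The objective behind the first two stages is still next-token prediction. Here is a minimal sketch with `transformers`, where passing `labels` makes the library compute the shifted cross-entropy loss; the checkpoint and the toy example text are only placeholders.

```python
# One supervised training step on a causal LM: the model predicts each next token,
# and the loss is cross-entropy against the (internally shifted) labels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HuggingFaceTB/SmolLM2-135M"   # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

batch = tokenizer("Instruction: say hello.\nResponse: Hello!", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])   # loss is computed from the shifted labels

outputs.loss.backward()
optimizer.step()
# RLHF then goes further, adjusting the model with a reward signal derived from human preferences.
```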