
Large Language Models (LLMs)

What is a Large Language Model?

An LLM is a type of AI model that excels at understanding and generating human language. LLMs are trained on vast amounts of text data, allowing them to learn patterns, structure, and even nuance in language. These models typically consist of many millions, and often billions, of parameters.

Most LLMs nowadays are built on the Transformer architecture—a deep learning architecture based on the attention mechanism, which has gained significant interest since the release of BERT from Google in 2018.

Transformer Types

The original Transformer architecture consists of two parts: an encoder and a decoder. There are three main types of transformers:

1. Encoders

An encoder-based Transformer takes text (or other data) as input and outputs a dense representation (or embedding) of that text.

  • Example: BERT from Google
  • Use Cases: Text classification, semantic search, Named Entity Recognition
  • Typical Size: Millions of parameters
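
To make this concrete, here is a minimal sketch of producing such an embedding with the Hugging Face transformers library; the bert-base-uncased checkpoint and the mean-pooling step are illustrative choices, not the only way to do it.

```python
# Minimal sketch: turn a sentence into a dense embedding with an encoder model.
# Assumes `pip install transformers torch`; "bert-base-uncased" is just an example checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers encode text into vectors.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the per-token vectors into a single sentence embedding (one common choice).
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768]) for this checkpoint
```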

2. Decoders

A decoder-based Transformer focuses on generating new tokens to complete a sequence, one token at a time.

  • Example: Llama from Meta
  • Use Cases: Text generation, chatbots, code generation
  • Typical Size: Billions (in the US sense, i.e., 10^9) of parameters
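
As a minimal sketch of this behaviour, the following uses the transformers text-generation pipeline; gpt2 is chosen only because it is small enough to run quickly, and any decoder-only checkpoint would work the same way.

```python
# Minimal sketch: autoregressive text generation with a decoder-only model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # "gpt2" is an example checkpoint
result = generator("The capital of France is", max_new_tokens=10)
print(result[0]["generated_text"])
```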

3. Seq2Seq (Encoder–Decoder)

A sequence-to-sequence Transformer combines an encoder and a decoder. The encoder first processes the input sequence into a context representation, then the decoder generates an output sequence.

  • Example: T5, BART
  • Use Cases: Translation, Summarization, Paraphrasing
  • Typical Size: Millions of parameters
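
A minimal sketch of a seq2seq model in action, assuming the t5-small checkpoint as an illustrative choice:

```python
# Minimal sketch: summarization with an encoder-decoder (seq2seq) model.
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")  # example checkpoint
text = (
    "The Transformer architecture combines an encoder that builds a representation "
    "of the input with a decoder that generates the output sequence token by token."
)
print(summarizer(text, max_length=30, min_length=5)[0]["summary_text"])
```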

Although Large Language Models come in various forms, they are typically decoder-based models with billions of parameters. Here are some of the most well-known LLMs:

Model | Provider
Deepseek-R1 | DeepSeek
GPT4 | OpenAI
Llama 3 | Meta (Facebook AI Research)
SmolLM2 | Hugging Face
Gemma | Google
Mistral | Mistral

Understanding Tokens

The underlying principle of an LLM is simple yet highly effective: its objective is to predict the next token, given a sequence of previous tokens. A “token” is the unit of information an LLM works with. While you can think of a “token” as similar to a “word”, LLMs don’t use whole words for efficiency reasons.

For example:

  • English has an estimated 600,000 words
  • An LLM (like Llama 2) typically uses a vocabulary of around 32,000 tokens
  • Tokens are often sub-word units that can be combined (e.g., “interest” + “ing” = “interesting”)
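
To see sub-word tokenization in practice, here is a minimal sketch using a Hugging Face tokenizer; gpt2 is an illustrative choice, and the exact splits and vocabulary size vary from model to model.

```python
# Minimal sketch: how a tokenizer splits text into sub-word tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer
print(tokenizer.vocab_size)                                     # size of the token vocabulary
print(tokenizer.tokenize("Tokenization is interesting!"))       # the sub-word pieces
print(tokenizer("Tokenization is interesting!")["input_ids"])   # the IDs the model actually sees
```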

Special Tokens

Each LLM has specific special tokens that are used to structure its generation. These tokens indicate things like the start or end of a sequence, message, or response. Here are some examples:

Model | Provider | EOS Token | Functionality
GPT4 | OpenAI | <|endoftext|> | End of message text
Llama 3 | Meta (Facebook AI Research) | <|eot_id|> | End of sequence
Deepseek-R1 | DeepSeek | <｜end▁of▁sentence｜> | End of message text
SmolLM2 | Hugging Face | <|im_end|> | End of instruction or message
Gemma | Google | <end_of_turn> | End of conversation turn
Mistral | Mistral | </s> | End of sequence
BERT | Google | [SEP] | End of segment
T5 | Google | </s> | End of sequence
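
Rather than memorizing these values, you can inspect a model's special tokens directly from its tokenizer. A minimal sketch, using gpt2 as an illustrative checkpoint:

```python
# Minimal sketch: inspecting a tokenizer's special tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # swap in any checkpoint name
print(tokenizer.eos_token)           # the end-of-sequence token string
print(tokenizer.eos_token_id)        # its integer ID in the vocabulary
print(tokenizer.special_tokens_map)  # all special tokens this tokenizer defines
```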

Next Token Prediction

LLMs are autoregressive, meaning that the output from one pass becomes the input for the next one. This process continues until the model predicts an EOS (End of Sequence) token. During each decoding loop:

  1. The input text is tokenized
  2. The model computes a representation of the sequence
  3. The model outputs scores ranking the likelihood of each token in its vocabulary
  4. A token is selected based on these scores using various decoding strategies

The simplest decoding strategy is to always select the token with the maximum score, though more sophisticated strategies exist for different use cases.
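
The loop described above can be sketched in a few lines of code with greedy selection; gpt2 is an illustrative checkpoint, and real systems typically rely on more elaborate decoding strategies (sampling, beam search, and so on).

```python
# Minimal sketch of the decoding loop: greedy next-token selection until EOS or a length limit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

for _ in range(10):  # generate at most 10 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits          # scores over the vocabulary at each position
    next_id = logits[0, -1].argmax()              # greedy: pick the highest-scoring token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
    if next_id.item() == tokenizer.eos_token_id:  # stop once the EOS token is predicted
        break

print(tokenizer.decode(input_ids[0]))
```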

LLM Architectures in Detail

Large Language Models leverage various neural network architectures, each with unique strengths for different applications. Here’s a deeper look at the major architectures powering modern LLMs:

Recurrent Neural Networks (RNNs)

RNNs were among the first architectures successfully applied to sequential data like text:

  • Mechanism: Uses a feedback loop to maintain information about previous inputs, processing data sequentially
  • Variants: Simple RNN, LSTM (Long Short-Term Memory), GRU (Gated Recurrent Units)
  • Strengths: Effective for sequential processing and pattern recognition in time-series data
  • Limitations: Struggles with long-range dependencies due to vanishing gradient problems
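
A minimal sketch of an LSTM processing a sequence in PyTorch, with arbitrary sizes chosen for illustration:

```python
# Minimal sketch: an LSTM processes a batch of embedded sequences step by step,
# carrying information forward in its hidden state.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
x = torch.randn(2, 20, 64)         # 2 sequences, 20 steps, 64-dim embeddings (arbitrary sizes)
outputs, (h_n, c_n) = lstm(x)      # outputs at every step, plus the final hidden/cell states
print(outputs.shape)               # torch.Size([2, 20, 128])
print(h_n.shape)                   # torch.Size([1, 2, 128]) — final hidden state
```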

Transformers: The Current Standard

The Transformer architecture revolutionized NLP by enabling parallel processing and effective modeling of long-range dependencies:

  • Key Innovation: Self-attention mechanism that weighs the importance of different words in relation to each other
  • Processing: Processes entire sequences simultaneously rather than sequentially
  • Components: Multi-head attention, feed-forward networks, residual connections, and normalization layers
  • Parallelization: Enables efficient training on massive datasets by processing tokens in parallel
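
A minimal sketch of the scaled dot-product attention at the heart of self-attention (single head, no masking, arbitrary sizes):

```python
# Minimal sketch: scaled dot-product attention. Each position attends to every other
# position, weighted by query-key similarity.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # similarity of each query to each key
    weights = torch.softmax(scores, dim=-1)                   # attention weights sum to 1 per query
    return weights @ v                                        # weighted mix of the value vectors

q = k = v = torch.randn(1, 5, 16)   # self-attention: queries, keys, values from the same sequence
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                    # torch.Size([1, 5, 16])
```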

Encoder-Only vs. Decoder-Only Architectures

LLMs commonly use variations of the transformer architecture:

Architecture Type | Example Models | Primary Use Cases | Characteristics
Encoder-Only | BERT, RoBERTa | Text classification, sentiment analysis, NER | Bidirectional attention, understanding context
Decoder-Only | GPT, Llama, Mistral | Text generation, creative writing, code completion | Autoregressive generation, unidirectional attention
Encoder-Decoder | T5, BART | Translation, summarization, paraphrasing | Combines understanding and generation capabilities
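
One way to picture the bidirectional vs. unidirectional attention in the table above is through the attention mask; a minimal sketch:

```python
# Minimal sketch: encoder-style attention lets every token see every other token,
# while decoder-style (causal) attention uses a lower-triangular mask so each token
# only attends to earlier positions.
import torch

seq_len = 5
bidirectional_mask = torch.ones(seq_len, seq_len)       # encoder-only: full visibility
causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # decoder-only: no peeking at future tokens
print(causal_mask)
```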

Generative Adversarial Networks (GANs)

While less common for pure language models, GANs have influenced certain aspects of generative AI:

  • Architecture: Two neural networks (generator and discriminator) compete in a game-theoretic scenario
  • Training Process: Generator creates outputs while discriminator evaluates authenticity
  • Applications: Primarily used in image generation but has applications in text style transfer and data augmentation
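
A minimal sketch of the generator/discriminator pairing, with arbitrary layer sizes and no training loop:

```python
# Minimal sketch: a GAN's two networks. The generator maps noise to candidate samples;
# the discriminator scores how "real" those samples look.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 32  # arbitrary illustration sizes

generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

z = torch.randn(8, latent_dim)   # random noise as generator input
fake = generator(z)              # generator tries to produce realistic samples
score = discriminator(fake)      # discriminator outputs an authenticity score per sample
print(score.shape)               # torch.Size([8, 1])
```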

Variational Autoencoders (VAEs)

VAEs provide another approach to generative modeling:

  • Design: Encoder-decoder framework that compresses input data into a latent space
  • Probabilistic Approach: Models data using probability distributions in latent space
  • Applications: Text generation with controlled attributes, semantic manipulation of text
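
A minimal sketch of the probabilistic bottleneck, showing the reparameterization trick and the KL term (sizes are arbitrary):

```python
# Minimal sketch: a VAE's encoder predicts a mean and log-variance, a latent vector is
# sampled with the reparameterization trick, and the decoder reconstructs the input.
import torch
import torch.nn as nn

data_dim, latent_dim = 32, 8

encoder = nn.Linear(data_dim, 2 * latent_dim)   # outputs mean and log-variance
decoder = nn.Linear(latent_dim, data_dim)       # reconstructs the input from the latent code

x = torch.randn(4, data_dim)
mu, logvar = encoder(x).chunk(2, dim=-1)
z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # sample from N(mu, sigma^2)
reconstruction = decoder(z)

# KL term pushes the latent distribution toward a standard normal prior.
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
print(reconstruction.shape, kl.item())
```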

Diffusion Models

Emerging as powerful generative models primarily for images but with growing text applications:

  • Process: Gradually introduces noise to data, then learns to reverse this process
  • Training: Models learn to denoise or reconstruct distorted examples
  • Text Applications: Being explored for controlled text generation and editing
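
A minimal sketch of the forward noising step, using an arbitrary noise-schedule value:

```python
# Minimal sketch: the forward (noising) process of a diffusion model mixes clean data
# with Gaussian noise; a denoising network is trained to reverse this.
import torch

x0 = torch.randn(4, 32)          # stand-in for clean data
alpha_bar = torch.tensor(0.5)    # cumulative noise-schedule value at some timestep (arbitrary)
noise = torch.randn_like(x0)

# Noisy sample: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise
x_t = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1.0 - alpha_bar) * noise

# A denoising network would be trained to predict `noise` from `x_t` (and the timestep).
print(x_t.shape)
```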

Architecture Comparison

Architecture | Processing Approach | Training Method | Strengths | Limitations
RNNs | Sequential | Backpropagation through time | Memory efficient | Limited context window
Transformers | Parallel | Self-attention | Captures long-range dependencies | Computationally intensive
GANs | Competitive | Adversarial | High-quality outputs | Training instability
VAEs | Probabilistic | Variational inference | Controls generation attributes | Less precise than transformers
Diffusion | Iterative denoising | Denoising score matching | High-quality generation | Computationally expensive

LLM Training Approaches

Modern LLMs typically follow a multi-stage training process:

  1. Pretraining: Learning language patterns from massive text corpora
  2. Supervised Fine-tuning: Teaching models to follow instructions with labeled examples
  3. Reinforcement Learning from Human Feedback (RLHF): Refining model outputs based on human preferences
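
A minimal sketch of the objective behind step 1 (pretraining), using gpt2 as an illustrative checkpoint; the transformers library shifts the labels internally so each position is trained to predict the next token:

```python
# Minimal sketch: the pretraining objective is next-token prediction with a cross-entropy loss.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer("Language models are trained to predict the next token.", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])  # labels are shifted internally
print(outputs.loss.item())   # the quantity minimized during pretraining
```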

This progression has led to increasingly capable models that can understand context, generate coherent text, and follow nuanced instructions.