
34 LLM Training and Fine-tuning Topics

  • Jul 23, 2024
  • 10 min read

01. Principles of common fine-tuning methods for large models: LoRA and P-tuning

LoRA (Low-Rank Adaptation)

The core of the LoRA method is to add trainable low-rank matrices alongside specified weight matrices of a large pre-trained language model (PLM). This essentially adds a bypass next to the original weights that first projects down to a low rank and then projects back up. During training, the PLM parameters stay frozen, and only the down-projection matrix A and the up-projection matrix B are trained.
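The idea can be sketched in a few lines of numpy (names and shapes here are illustrative, not from any specific LoRA library):

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=1.0):
    """Frozen weight W plus a trainable low-rank bypass B @ A.

    x: (d_in,), W: (d_out, d_in) frozen,
    A: (r, d_in) down-projection, B: (d_out, r) up-projection.
    """
    return W @ x + alpha * (B @ (A @ x))

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.normal(size=(d_out, d_in))   # frozen pre-trained weight
A = rng.normal(size=(r, d_in))       # trained: projects down to rank r
B = np.zeros((d_out, r))             # trained: zero-initialized, so the
x = rng.normal(size=d_in)            # bypass starts as a no-op
assert np.allclose(lora_linear(x, W, A, B), W @ x)
```

With B initialized to zero, the adapted layer starts exactly equal to the frozen layer, and only A and B accumulate gradient updates.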

P-tuning

The core of the P-tuning method is to replace the original discrete prompt tokens with differentiable virtual tokens. These are added only at the input layer, and a prompt encoder (a BiLSTM followed by an MLP) encodes and learns the virtual tokens.

02. Introduction to the principles of stable diffusion

Stable Diffusion consists of three main components, each with its own independent neural network:

Architecture of Stable Diffusion:
  1. CLIP text encoder for text encoding

    1. Input: Text

    2. Output: 77 token embedding vectors, each containing 768 dimensions

  2. UNet + Scheduler processes/diffuses information in the latent space step by step

    1. Input: Text embeddings and an initial multidimensional array of noise (structured list of numbers, also called a tensor)

    2. Output: A processed information array

  3. Autoencoder (VAE) decoder uses the processed information array to draw the final image

    1. Input: Processed information matrix, dimensions (4, 64, 64)

    2. Output: Resulting image, dimensions (3, 512, 512)

03. Why are most current large models Decoder-only structures?

Large models are mainly divided into three types of model architectures:

  1. Encoder-only: For example, BERT. It has strong language understanding and representation capabilities through pre-training on large-scale unlabeled text and then fine-tuning on downstream tasks.

  2. Decoder-only: For example, GPT. It has strong generation and language understanding capabilities through pre-training on large-scale unlabeled text, after which it can be fine-tuned or prompted for specific tasks.

  3. Encoder-Decoder: For example, T5 (Text-to-Text Transfer Transformer). It can be used for various natural language processing tasks such as text classification, machine translation, question answering, etc.

The reason why LLMs mainly use Decoder-only architecture:

  • Advantages in training efficiency and engineering implementation

  • Theoretically, the bidirectional attention matrix in encoders tends toward low rank, which may weaken the model's expressive ability

  • For generation tasks, introducing bidirectional attention has no substantial benefit

  • Encoder-Decoder architectures may perform better in some scenarios, but likely only because they have twice as many parameters

  • Under the same number of parameters and inference cost, Decoder-only architecture is the optimal choice
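What makes a model "Decoder-only" in practice is the causal mask: each position may attend only to itself and earlier positions, which is what enables next-token training and cached autoregressive decoding. A minimal numpy sketch:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask out future positions, then softmax row-wise.
    scores: (T, T) raw attention scores."""
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)          # future positions get -inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.zeros((4, 4)))
# each row attends uniformly over positions <= its own index, never beyond
assert np.allclose(w[0], [1, 0, 0, 0])
assert np.allclose(w[3], [0.25, 0.25, 0.25, 0.25])
```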

04. How to alleviate the repetition problem in LLMs

  1. Diverse training data: Use diverse corpora to train the model during the training phase to avoid data bias and repetitive text problems.

  2. Introduce noise: Introduce randomness or noise when generating text, for example, by sampling different words or phrases, or introducing random transformation operations to increase diversity.

  3. Temperature parameter adjustment: Control the diversity of generated text by adjusting the temperature parameter value.

  4. Post-processing and filtering: Perform post-processing and filtering on the generated text, removing repetitive sentences or phrases to improve quality and diversity.

  5. Beam search adjustment: Adjust the parameters of the Beam search algorithm when generating text. Control diversity and creativity by adjusting Beam size and search width.

  6. Human intervention and control: For critical tasks or sensitive scenarios, introduce human review and screening mechanisms to ensure accuracy and diversity of generation results.
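Temperature adjustment (item 3) is the simplest of these to show concretely. A sketch of temperature sampling (the helper name is illustrative):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Higher temperature flattens the distribution -> more diverse output."""
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(logits), p=p)

logits = np.array([3.0, 1.0, 0.5, 0.1])
rng = np.random.default_rng(0)
low_t = [sample_with_temperature(logits, 0.1, rng) for _ in range(200)]
high_t = [sample_with_temperature(logits, 2.0, rng) for _ in range(200)]
# low temperature almost always picks the top token; high temperature varies
assert len(set(low_t)) < len(set(high_t))
```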

05. Why transformer blocks use LayerNorm instead of BatchNorm

Batch Normalization vs Layer Normalization

  • Batch Normalization normalizes the same dimension features for a batch of samples

  • Layer Normalization normalizes all dimension features for a single sample

  • LN doesn't depend on batch size or input sequence length, so it can be used to normalize with a batch size of 1 and over variable-length RNN sequences

Why BN performs poorly in NLP

  1. BN computes feature means and variances across the batch_size dimension, which suits tabular features such as height, weight, or skin color. Applying it to NLP would treat each token position as a fixed feature, as in an MLP, which is counter-intuitive.

  2. In NLP, BN would scale word vectors using batch statistics. Word vectors are learned parameters representing word semantics, not directly observed features, so cross-sample statistics over them are not meaningful.

Why LayerNorm is effective for scaling all words of a sample

  1. Layer-norm scales features for each sample, preserving the N dimension and scaling in C/H/W dimensions.

  2. Layer-norm normalizes elements under the same feature, corresponding to text length rather than batch size.
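The whole BN-vs-LN distinction comes down to which axis the statistics are computed over. A numpy sketch for a (batch, seq_len, hidden) tensor:

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(8, 5, 16))  # (batch, seq_len, hidden)

# BatchNorm: one statistic per hidden feature, pooled over batch and sequence
bn_mean = x.mean(axis=(0, 1))                # shape (16,)
bn = (x - bn_mean) / x.std(axis=(0, 1))

# LayerNorm: one statistic per token, over its own hidden dimension only
ln_mean = x.mean(axis=-1, keepdims=True)     # shape (8, 5, 1)
ln = (x - ln_mean) / x.std(axis=-1, keepdims=True)

# LN of a single sample is identical to LN of that sample inside a batch,
# i.e. LN does not depend on batch size at all
x0 = x[:1]
ln0 = (x0 - x0.mean(axis=-1, keepdims=True)) / x0.std(axis=-1, keepdims=True)
assert np.allclose(ln0, ln[:1])
assert bn.shape == ln.shape == x.shape
```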

06. Why transformer uses multi-head attention mechanism

Multi-head attention ensures that transformer can:

  • Pay attention to information in different subspaces

  • Capture richer feature information

The original authors found this approach to be effective in practice.

07. Reasons for performance degradation of LLM after supervised fine-tuning (SFT)

SFT (Supervised Fine-Tuning) is a common fine-tuning technique that improves model performance by training on labeled data for specific tasks. However, it may lead to decreased generalization ability due to:

  • Overfitting to fine-tuning data

  • Ignoring knowledge learned during pre-training phase

This phenomenon is called catastrophic forgetting. Strategies to mitigate it include:

  1. Use a smaller learning rate for fine-tuning to reduce forgetting of pre-training knowledge

  2. Use regularization techniques like weight decay or early stopping to prevent overfitting

  3. Apply techniques like Elastic Weight Consolidation (EWC) to retain important pre-training knowledge during fine-tuning
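EWC (item 3) adds a quadratic penalty that pulls each parameter toward its pre-trained value, weighted by an estimate of how important that parameter was to the old task (typically the diagonal of the Fisher information). A minimal sketch of just the penalty term:

```python
import numpy as np

def ewc_penalty(theta, theta_pre, fisher, lam=1.0):
    """lam/2 * sum_i F_i * (theta_i - theta_pre_i)^2.
    fisher approximates each parameter's importance to the old task."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_pre) ** 2)

theta_pre = np.array([1.0, -2.0, 0.5])
fisher = np.array([10.0, 0.1, 0.0])   # first weight matters most to old task
# moving an important weight is penalized far more than an unimportant one
assert ewc_penalty(theta_pre + np.array([0.1, 0, 0]), theta_pre, fisher) > \
       ewc_penalty(theta_pre + np.array([0, 0.1, 0]), theta_pre, fisher)
assert ewc_penalty(theta_pre, theta_pre, fisher) == 0.0
```

During fine-tuning this penalty is simply added to the task loss, so weights the old task never relied on (Fisher near zero) remain free to adapt.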

08. OOM errors caused by increased sample size during fine-tuning phase

Full parameter fine-tuning memory requirements depend on:

  • Model size (number of parameters)

  • Batch size

  • Sequence length

  • Use of mixed precision training

For large models like GPT-3, full parameter fine-tuning on a single GPU may need tens or even hundreds of GB of memory.

Solutions for OOM (Out of Memory) errors:

  1. Reduce batch size: Decreases amount of data processed each training step

  2. Use gradient accumulation: keeps the effective batch size while processing smaller micro-batches per step, reducing peak memory

  3. Use model parallelism: Distributes model parts across different devices, reducing per-device memory needs
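Gradient accumulation (item 2) works because summing per-micro-batch gradients gives exactly the full-batch gradient. A framework-agnostic numpy sketch on a linear model:

```python
import numpy as np

def grad_step(w, xs, ys, lr, accum_steps):
    """One optimizer step whose gradient is accumulated over accum_steps
    micro-batches -- same math as one big batch, lower peak memory."""
    g = np.zeros_like(w)
    for xb, yb in zip(np.array_split(xs, accum_steps),
                      np.array_split(ys, accum_steps)):
        pred = xb @ w
        g += xb.T @ (pred - yb) / len(xs)   # accumulate, don't step yet
    return w - lr * g

rng = np.random.default_rng(0)
X, w_true = rng.normal(size=(32, 3)), np.array([1.0, -2.0, 0.5])
y = X @ w_true
w0 = np.zeros(3)
# accumulating 4 micro-batches gives the same update as one full batch
assert np.allclose(grad_step(w0, X, y, 0.1, 1), grad_step(w0, X, y, 0.1, 4))
```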

09. Introduction to the CLIP architecture connecting text and images

CLIP brings natural language-level abstract concepts into computer vision:

  1. Determines a series of queries

  2. Collects images through search engines

  3. Obtains 400 million image-text pairs through 500,000 queries

  4. Matches and trains on semantic features extracted from text by the Text Encoder and from images by the Image Encoder

10. BERT's advantages for classification tasks and subsequent improvements

Advantages of BERT for classification: BERT's structure includes a bidirectional Transformer encoder, enabling better capture of bidirectional context information in text, thus performing better in text classification tasks.

Subsequent improvements:

  1. Improved pre-training models based on BERT: RoBERTa, ALBERT, etc.

  2. Optimized model performance by adjusting BERT's architecture and hyperparameters: Electra, DeBERTa, etc.

  3. Improved BERT's application methods for specific tasks: ERNIE, MT-DNN, etc.

11. Introduction to the transformer algorithm

Transformer is a typical encoder-decoder model:

  • Both Encoder and Decoder have 6 Blocks

  • Encoder Block includes:

  1. Multi-head self-attention module

  2. Feed-forward neural network module

  • Decoder Block includes:

  1. Multi-head self-attention module

  2. Multi-head Encoder-Decoder attention interaction module

  3. Feed-forward neural network module

  • Each module on Encoder and Decoder sides has residual layers and Layer Normalization layers

12. Strategies to reduce hallucinations in large language models (LLMs)

  1. DoLa (Decoding by Contrasting Layers): improve factuality by contrasting the output distributions of different transformer layers during decoding

  2. Fine-tune on high-quality data: has shown promising results in reducing hallucinations

  3. In-context learning: provide better context and examples in the prompt

  4. Output limitations: restrict output to a constrained list of options instead of free-form text

13. Overview of ChatGPT's training process

  1. Pre-training:

    1. Train on extensive internet datasets using Transformer architecture

    2. Goal: Enable model to predict next word in given text sequence

    3. Outcome: Model understands language patterns but not instructions/questions

  2. Supervised fine-tuning or instruction fine-tuning:

    1. Model takes user messages as input

    2. Learns to generate responses by minimizing prediction-response differences

    3. Marks transition from understanding patterns to understanding/responding to instructions

  3. Reinforcement Learning from Human Feedback (RLHF):

    1. Used as a further fine-tuning step: human rankings of model outputs train a reward model, which then guides policy optimization (e.g., with PPO) so responses better match human preferences

14. Tokens in the context of large language models (LLMs)

  • Input text is broken down into multiple fragments

  • Each fragment is typically a word or a piece of a word, called a subword token

  • This process is called tokenization

  • Tokens can be words or character blocks

15. How text controls generation in Stable Diffusion

Stable Diffusion is a latent diffusion model with three core components:

  1. Autoencoder (VAE)

  2. U-Net

  3. Text encoder

Process:

  • The U-Net's Attention module takes the Latent Feature and Context Embedding as inputs

  • Performs Cross Attention operation to fuse image and text information

  • Overall, it follows a classic Transformer workflow

16. Main problem Stable Diffusion solves compared to Diffusion

Diffusion's disadvantage:

  • Requires input of full-sized images into U-Net during reverse diffusion

  • Makes Diffusion very slow when image size and time step t are large enough

Stable Diffusion addresses this by running the diffusion process in a compressed latent space instead of pixel space, greatly improving efficiency.

17. Why Diffusion chooses a random time step for each training sample

Training process includes:

  1. Selecting a random time step for each training sample

  2. Applying Gaussian noise corresponding to the time step to the image

  3. Converting time step into corresponding embedding

Rationale:

  • During training, loss gradually decreases, with smaller changes towards the end

  • If time steps were incremental, the model would focus too much on earlier time steps (due to larger early losses)

  • Random time steps help balance focus across all time steps
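The forward process has a closed form, so a random timestep can be noised in one shot rather than step by step. A sketch assuming a standard linear beta schedule:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)         # cumulative product \bar{alpha}_t

def noisy_sample(x0, t, rng):
    """q(x_t | x_0): sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=(64,))
t = rng.integers(T)                          # a random timestep per sample
xt = noisy_sample(x0, t, rng)
# late timesteps are nearly pure noise: the signal coefficient shrinks
assert alpha_bar[T - 1] < 1e-3 < alpha_bar[0]
assert xt.shape == x0.shape
```

Because any t is reachable directly, each training batch can mix timesteps freely, which is exactly what the balancing argument above relies on.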

18. Mitigating loss of general ability after domain-specific training

Problem: After training on domain data, general ability often declines

Solution: Include general datasets during domain training

Recommended ratio:

  • No definitive answer, depends on domain data amount

  • When data amount is limited, domain to general data ratio of 1:5 to 1:10 is generally appropriate

19. How reinforcement learning works (example)

Reinforcement learning (RL) system components:

  1. Agent

  2. Environment

Process:

  1. Environment sends a state to the agent

  2. Agent takes action based on observations

  3. Environment sends next state and reward to agent

  4. Agent updates knowledge using reward to evaluate last action

  5. Loop continues until environment sends termination state

Example: Learning to play Counter-Strike

  1. The RL agent (Player 1) collects state S0 from the environment (the Counter-Strike game)

  2. Based on S0, the agent takes action A0 (e.g., move left or right). Initially, actions are random

  3. The environment enters a new state S1 (a new game stage)

  4. The agent receives reward R1 (e.g., extra points or coins)

  5. The RL loop continues until the agent dies or reaches the destination, outputting a series of states, actions, and rewards
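The loop above can be sketched with a toy 1-D environment (all names here are illustrative, not from any RL library):

```python
import numpy as np

class WalkEnv:
    """Toy environment: reach position 5 to receive a reward of +1."""
    def __init__(self):
        self.pos = 0
    def step(self, action):             # action: -1 (left) or +1 (right)
        self.pos += action
        done = self.pos == 5
        reward = 1.0 if done else 0.0
        return self.pos, reward, done   # next state, reward, termination

rng = np.random.default_rng(1)
env, total_reward, done, steps = WalkEnv(), 0.0, False, 0
while not done and steps < 1000:        # the agent-environment loop
    action = rng.choice([-1, 1])        # initially, actions are random
    state, reward, done = env.step(action)
    total_reward += reward
    steps += 1
assert steps >= 5                       # reaching +5 takes at least 5 steps
```

A learning agent would replace the random `choice` with a policy updated from the collected rewards.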

20. Understanding reward maximization in reinforcement learning

Key concept: AI agent tries to maximize long-term rewards from its environment

Process:

  • Agent learns from rewards/punishments

  • Changes behavior to better complete tasks

  • Goal: Make agent smart enough to learn from experience and make choices that achieve long-term goals quickly

21. Improving prompt generalization in large language models

Strategies:

  1. Use multiple different prompts to increase sample diversity for model learning

  2. Add random noise or transformations to prompts to enrich dataset and improve generalization

  3. Apply transfer learning or meta-learning to extract knowledge from previous tasks and apply to new ones

22. Difference between Instruction Tuning and Prompt tuning

Prompt tuning:

  • Generates prompt template (hard/soft prompt) for each task

  • Performs full-shot fine-tuning and evaluation on each task

  • Pre-trained model parameters are frozen

  • Aims to stimulate language model's completion ability (e.g., generating second half of sentence, cloze tasks)

Instruction Tuning:

  • Generates instruction (hard token) for each task

  • Fine-tunes on several full-shot tasks

  • Evaluates generalization ability on specific tasks (zero-shot)

  • Pre-trained model parameters are unfrozen

  • Aims to stimulate language model's understanding ability through clear instructions

23. Improvements for knowledge distillation

Knowledge distillation transfers knowledge from complex to simple models. Improvement points:

  1. Use different types of loss functions and temperature parameters for better distillation effects

  2. Introduce additional information to improve distillation, e.g., adding similarity constraints to model training

  3. Combine distillation with other techniques, e.g., multi-task learning and transfer learning

24. Attention calculation complexity in Transformers and improvements

Standard Transformer attention complexity: O(N^2), where N is input sequence length

Improvement methods:

  1. Use sparse attention mechanisms that compute only a subset of the pairwise interactions instead of all N^2 of them

  2. Use a local attention mechanism, only calculating interactions within a window around the current position

  3. Adopt approximate methods like randomization and sampling

  4. Use compressed attention mechanisms by mapping input vectors to low-dimensional space (e.g., hash attention, low-rank attention)
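Local attention (item 2) restricts each position to a fixed window, dropping the score computation from O(N^2) to O(N·w). The mask can be sketched as:

```python
import numpy as np

def local_attention_mask(n, window):
    """True where position i may attend to position j: |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

m = local_attention_mask(8, 2)
# each row allows at most 2*window + 1 positions instead of all n
assert m.sum(axis=1).max() == 5
assert m.sum() < 8 * 8        # fewer score computations than full attention
```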

25. Choosing between Chat and Base models for SFT (Supervised Fine-Tuning)

Decision based on SFT data amount:

  • Less than ~10k samples: use the Chat model as the base for fine-tuning

  • On the order of 100k samples or more: use the Base model for fine-tuning

26. Organizing and processing book and paper data for large model pre-training

For pre-training:

  • Split book and paper text into paragraphs

For supervised fine-tuning:

  • Convert to instruction format dataset

  • Use titles or key phrases as prompts

27. Examples of alignment problems in large language models

Alignment: Degree of consistency between model's goals/behaviors and human values/expectations

Examples of misalignment:

  1. Lack of helpfulness: Model not following explicit user instructions

  2. Hallucinations: Model fabricating non-existent or incorrect facts

  3. Lack of interpretability: Difficulty for humans to understand model's decision-making process

  4. Generating biased or toxic outputs: Model reproducing biased/toxic data in outputs, even without explicit instructions

28. Use of Adaptive Softmax in large language models

Benefits:

  • Enables efficient training and inference when handling large vocabularies

  • Reduces computational cost as vocabulary size grows

How it works:

  • Groups words into clusters based on their frequency

  • Reduces computation required for vocabulary probability distribution

Result:

  • More efficient training and running of large language models

  • Faster experimentation and development

29. Difference in Layer Normalization between GPT3 and LLAMA

GPT3:

  • Uses standard Layer Normalization (LayerNorm with learnable gain and bias)

  • Following GPT-2, normalization is applied at the input of each sub-block (Pre-LN); the original GPT-1 applied it after each sub-block (Post-LN)

  • Benefits: stabilizes the training process of deep Transformer stacks

LLAMA:

  • Uses RMSNorm, a simplified variant of Layer Normalization that rescales by the root mean square without re-centering (no mean subtraction), also in a Pre-LN arrangement

  • Process: apply RMSNorm to the sub-block input, then perform the self-attention or feed-forward network calculations

  • Benefits: cheaper to compute while remaining as stable in practice as full LayerNorm

30. Difference between MHA (Multi-Head Attention) and MQA (Multi-Query Attention)

MQA (Multi-Query Attention):

  • All heads share a single Key head and a single Value head

  • Each head keeps its own Query parameters

  • Significantly reduces Key and Value parameters (and KV cache size) compared to MHA
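The savings are easy to quantify. Under illustrative dimensions (not any specific model's):

```python
# Illustrative dimensions for a single layer
d_model, n_heads = 4096, 32
d_head = d_model // n_heads                 # 128

# KV projection parameters per layer (K and V weight matrices)
mha_kv = 2 * d_model * (n_heads * d_head)   # every head has its own K and V
mqa_kv = 2 * d_model * (1 * d_head)         # one shared K head, one V head

assert mha_kv // mqa_kv == n_heads          # MQA shrinks KV params 32x here
# The KV cache shrinks by the same factor, which is what speeds up decoding
```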

31. Role of Flash Attention in inference optimization

Flash Attention:

  • Efficient, exact implementation of the attention mechanism

  • Tiles the computation so intermediate results stay in fast on-chip memory (SRAM) instead of materializing the full attention matrix in GPU HBM

  • Reduces memory consumption and improves computation speed

  • Particularly suitable for scenarios with long sequences and large model parameters (e.g., natural language processing, recommendation systems)

32. Three stages of ZeRO (Zero Redundancy Optimizer)

  1. Stage 1: Partition optimizer states across different devices to reduce memory usage

  2. Stage 2: Additionally partition gradients across different devices

  3. Stage 3: Additionally partition the model parameters themselves, for maximum memory savings

33. Core ideas of KV Cache and GQA optimization in multi-head attention mechanism (MHA)

KV Cache (Key-Value Cache):

  • Optimization technique used in autoregressive generation models

  • Caches historical input Keys (K) and Values (V) to reduce repeated computations

  • Improves inference efficiency in generative models

  • Each newly generated token is fed back as input together with previous tokens; the KV Cache avoids recomputing K and V for tokens already processed

  • Memory usage grows linearly with input and output sequence lengths, requiring optimization for longer sequences
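The mechanism can be sketched in numpy, projecting only the newest token each step (names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))

K_cache, V_cache = [], []

def decode_step(x_new):
    """Project only the newest token; reuse cached K/V for all old tokens."""
    K_cache.append(Wk @ x_new)
    V_cache.append(Wv @ x_new)
    K, V = np.stack(K_cache), np.stack(V_cache)   # (t, d): grows linearly
    q = x_new                                      # query for the new position
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                                   # attention output, (d,)

for _ in range(5):
    out = decode_step(rng.normal(size=d))
assert len(K_cache) == 5 and out.shape == (d,)    # one K/V entry per token
```

Each step does O(d^2) projection work for one token instead of re-projecting the whole prefix, at the cost of the linearly growing cache noted above.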

Grouped Query Attention (GQA):

  • A compromise between Multi-Head Attention (MHA) and Multi-Query Attention (MQA)

  • Groups Query Heads and shares one Key Head and one Value Head within each group

  • Retains some of the expressive power of multi-head attention

  • Accelerates inference speed by reducing memory access pressure

  • Can be implemented by fine-tuning already trained models

  • Uses mean pooling to generate shared KV

  • Maintains high model quality while improving inference speed
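The mean-pooling conversion mentioned above, turning an MHA checkpoint's KV heads into grouped ones, can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, n_groups, d_head = 8, 2, 16
K_heads = rng.normal(size=(n_heads, d_head))      # per-head K for one token

# GQA: heads 0-3 share group 0, heads 4-7 share group 1;
# the shared K is the mean of the original heads in each group
K_grouped = K_heads.reshape(n_groups, n_heads // n_groups, d_head).mean(axis=1)

assert K_grouped.shape == (n_groups, d_head)      # 8 KV heads -> 2 KV heads
assert np.allclose(K_grouped[0], K_heads[:4].mean(axis=0))
```

With `n_groups = 1` this degenerates to MQA, and with `n_groups = n_heads` it is ordinary MHA, which is the compromise the bullets above describe.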

34. Causes and solutions for loss spikes in pre-training of 100B+ large models

Causes: The occurrence of loss spikes in large-scale model training can be attributed to instability issues, particularly with adaptive optimization algorithms like Adam.

Solutions:

  1. Adjust learning rate: Carefully tune the learning rate to find a balance between training speed and stability.

  2. Gradient clipping: Implement gradient clipping to prevent extremely large gradients from causing instability.

  3. Warm-up strategy: Use a learning rate warm-up strategy to gradually increase the learning rate at the beginning of training.

  4. Optimizer modifications: Consider using modified versions of Adam or alternative optimizers that are more stable for large-scale training.

  5. Mixed precision training: Utilize mixed precision training techniques to improve numerical stability.

  6. Layer-wise Adaptive Rate Scaling (LARS): Implement LARS to adjust learning rates for different layers based on their gradient statistics.

  7. Monitor and adjust: Continuously monitor training metrics and be prepared to adjust hyperparameters if instability occurs.

  8. Regularization techniques: Apply appropriate regularization methods to prevent overfitting and improve stability.

  9. Batch size adjustment: Experiment with different batch sizes to find an optimal balance between computational efficiency and training stability.

  10. Model architecture considerations: Consider architectural choices that promote stability in very large models, such as careful initialization strategies or normalization techniques.

These strategies can help mitigate the occurrence of loss spikes and improve the overall stability of training for extremely large language models.
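Gradient clipping (item 2 above) is most often done by global norm: all gradients are rescaled together so their joint L2 norm never exceeds a threshold, capping a spike without changing the update direction. A numpy sketch:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm <= max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped, norm = clip_by_global_norm(grads, 1.0)
assert np.isclose(norm, 13.0)
new_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
assert np.isclose(new_norm, 1.0)          # the spike is capped, direction kept
```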

