34 LLM Training and Fine-tuning Topics
- Jul 23, 2024
- 10 min read
01. Principles of common fine-tuning methods for large models: LoRA and P-tuning
LoRA (Low-Rank Adaptation)
The core of the LoRA method is to add trainable low-rank matrices alongside specified weight matrices of a large language model. This essentially adds a bypass next to the original pre-trained language model (PLM): the input is projected down to a low rank and then projected back up. During training, the PLM parameters are frozen, and only the down-projection matrix A and up-projection matrix B are trained.
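The bypass can be sketched in a few lines of numpy; the sizes, rank, and scaling factor here are illustrative, not values from any particular paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 16, 4                        # hidden size and low rank (r << d)
W0 = rng.normal(size=(d, d))        # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialized
alpha = 8.0                         # scaling hyperparameter

def lora_forward(x):
    # Original (frozen) path plus the low-rank bypass, scaled by alpha / r.
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# With B initialized to zero, the bypass contributes nothing at first,
# so the adapted model initially matches the frozen model exactly.
assert np.allclose(lora_forward(x), W0 @ x)
```

Note that the trainable parameter count is only 2·r·d (here 128) instead of d² (here 256), which is where the savings come from at realistic model sizes.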
P-tuning
The core of the P-tuning method is to replace the original discrete prompt tokens with differentiable virtual tokens. These are added only at the input layer, and a prompt encoder (BiLSTM + MLP) is used to encode and learn the virtual tokens.
02. Introduction to the principles of stable diffusion
Stable Diffusion consists of three main components, each with its own independent neural network:
Clip Text for text encoding
Input: Text
Output: 77 token embedding vectors, each containing 768 dimensions
UNet + Scheduler processes/diffuses information in the latent space step by step
Input: Text embeddings and an initial multidimensional array of noise (structured list of numbers, also called a tensor)
Output: A processed information array
Autoencoder Decoder uses the processed information matrix to draw the final image
Input: Processed information matrix, dimensions (4, 64, 64)
Output: Resulting image, dimensions (3, 512, 512)
03. Why are most current large models Decoder-only structures?
Large models are mainly divided into three types of model architectures:
Encoder-only: For example, BERT. It has strong language understanding and representation capabilities through pre-training on large-scale unlabeled text and then fine-tuning on downstream tasks.
Decoder-only: For example, GPT. It has strong generation and language understanding capabilities through pre-training on large-scale unlabeled text and then fine-tuning on specific tasks.
Encoder-Decoder: For example, T5 (Text-to-Text Transfer Transformer). It can be used for various natural language processing tasks such as text classification, machine translation, question answering, etc.
Reasons why LLMs mainly use the Decoder-only architecture:
Advantages in training efficiency and engineering implementation
Theoretically, the bidirectional attention in Encoders has a low-rank problem, which may weaken the model's expressive ability
For generation tasks, introducing bidirectional attention has no substantial benefit
Encoder-Decoder architecture may perform better in some scenarios probably just because it has twice as many parameters
Under the same number of parameters and inference cost, Decoder-only architecture is the optimal choice
04. How to alleviate the repetition problem in LLMs
Diverse training data: Use diverse corpora to train the model during the training phase to avoid data bias and repetitive text problems.
Introduce noise: Introduce randomness or noise when generating text, for example, by sampling different words or phrases, or introducing random transformation operations to increase diversity.
Temperature parameter adjustment: Control the diversity of generated text by adjusting the temperature parameter value.
Post-processing and filtering: Perform post-processing and filtering on the generated text, removing repetitive sentences or phrases to improve quality and diversity.
Beam search adjustment: Adjust the parameters of the Beam search algorithm when generating text. Control diversity and creativity by adjusting Beam size and search width.
Human intervention and control: For critical tasks or sensitive scenarios, introduce human review and screening mechanisms to ensure accuracy and diversity of generation results.
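The temperature adjustment above is easy to demonstrate concretely; a minimal numpy sketch with made-up logits, showing that low temperature collapses sampling onto the argmax token while high temperature spreads it out:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    # Lower temperature sharpens the distribution (more repetitive, near-greedy);
    # higher temperature flattens it (more diverse output).
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                       # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.2]                         # toy vocabulary of 3 tokens
cold = [sample_with_temperature(logits, 0.1, rng) for _ in range(100)]
hot = [sample_with_temperature(logits, 2.0, rng) for _ in range(100)]
# cold almost always picks token 0; hot mixes in tokens 1 and 2 as well.
```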
05. Why transformer blocks use LayerNorm instead of BatchNorm
Batch Normalization vs Layer Normalization
Batch Normalization normalizes the same dimension features for a batch of samples
Layer Normalization normalizes all dimension features for a single sample
LN doesn't depend on batch size or input sequence length, so it can be used with a batch size of 1 and for RNN sequences
Why BN performs poorly in NLP
BN calculates feature means and variances across the batch dimension, which suits tabular features like height, weight, and skin color. Applying this to NLP would treat each word position as an independent feature, as in an MLP, which is counter-intuitive.
BN scales words, essentially scaling word vectors in NLP. Word vectors are learned parameters representing word semantics, not real existing features.
Why LayerNorm is effective for scaling all words of a sample
Layer-norm scales features for each sample, preserving the N dimension and scaling in C/H/W dimensions.
Layer-norm normalizes elements under the same feature, corresponding to text length rather than batch size.
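The contrast can be shown directly; a numpy sketch over a toy (batch, seq_len, hidden) tensor, where LN normalizes each token's features independently and BN pools statistics across samples:

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(4, 10, 8))  # (batch, seq_len, hidden)

def layer_norm(x, eps=1e-5):
    # Normalize over the feature (hidden) dimension of each token independently:
    # no dependence on batch size or on other sequence positions.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Normalize each feature across the batch (and positions): the statistics
    # mix different samples, which is awkward for variable-length text.
    mu = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# LN output for one sample is unchanged when other samples join the batch;
# BN output for the same sample shifts, because its statistics are shared.
assert np.allclose(layer_norm(x[:1]), layer_norm(x)[:1])
assert not np.allclose(batch_norm(x[:1]), batch_norm(x)[:1])
```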
06. Why transformer uses multi-head attention mechanism
Multi-head attention ensures that transformer can:
Pay attention to information in different subspaces
Capture richer feature information
The original authors found this approach to be effective in practice.
07. Reasons for performance degradation of LLM after supervised fine-tuning (SFT)
SFT (Supervised Fine-Tuning) is a common fine-tuning technique that improves model performance by training on labeled data for specific tasks. However, it may lead to decreased generalization ability due to:
Overfitting to fine-tuning data
Ignoring knowledge learned during pre-training phase
This phenomenon is called catastrophic forgetting. Strategies to mitigate it include:
Use a smaller learning rate for fine-tuning to reduce forgetting of pre-training knowledge
Use regularization techniques like weight decay or early stopping to prevent overfitting
Apply techniques like Elastic Weight Consolidation (EWC) to retain important pre-training knowledge during fine-tuning
08. OOM errors caused by increased sample size during fine-tuning phase
Full parameter fine-tuning memory requirements depend on:
Model size (number of parameters)
Batch size
Sequence length
Use of mixed precision training
For large models like GPT-3, full parameter fine-tuning on a single GPU may need tens or even hundreds of GB of memory.
Solutions for OOM (Out of Memory) errors:
Reduce batch size: Decreases amount of data processed each training step
Use gradient accumulation: Reduces memory usage without reducing batch size
Use model parallelism: Distributes model parts across different devices, reducing per-device memory needs
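Gradient accumulation in particular is easy to illustrate; a minimal numpy sketch with a toy linear model, showing that averaging micro-batch gradients recovers the full-batch gradient while only one micro-batch needs to be resident at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))           # full batch of 32 samples
y = rng.normal(size=32)
w = np.zeros(5)

def grad(w, Xb, yb):
    # Gradient of mean squared error for a linear model (mean over the mini-batch).
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient (needs all 32 samples' activations in memory at once).
g_full = grad(w, X, y)

# Gradient accumulation: 4 micro-batches of 8, gradients summed then averaged,
# so only 8 samples' activations live in memory at any moment.
g_accum = np.zeros_like(w)
for i in range(0, 32, 8):
    g_accum += grad(w, X[i:i + 8], y[i:i + 8])
g_accum /= 4

assert np.allclose(g_full, g_accum)    # same effective batch, less memory
```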
09. Introduction to the CLIP architecture connecting text and images
CLIP brings natural language-level abstract concepts into computer vision:
Determines a series of queries
Collects images through search engines
Obtains 400 million image-text pairs through 500,000 queries
Matches and contrastively trains the semantic features extracted by a Text Encoder from text and an Image Encoder from images
10. BERT's advantages for classification tasks and subsequent improvements
Advantages of BERT for classification: BERT's structure includes a bidirectional Transformer encoder, enabling better capture of bidirectional context information in text, thus performing better in text classification tasks.
Subsequent improvements:
Improved pre-training models based on BERT: RoBERTa, ALBERT, etc.
Optimized model performance by adjusting BERT's architecture and hyperparameters: Electra, DeBERTa, etc.
Improved BERT's application methods for specific tasks: ERNIE, MT-DNN, etc.
11. Introduction to the transformer algorithm
Transformer is a typical encoder-decoder model:
Both Encoder and Decoder have 6 Blocks
Encoder Block includes:
Multi-head self-attention module
Feed-forward neural network module
Decoder Block includes:
Multi-head self-attention module
Multi-head Encoder-Decoder attention interaction module
Feed-forward neural network module
Each module on Encoder and Decoder sides has residual layers and Layer Normalization layers
12. Strategies to reduce hallucinations in large language models (LLMs)
DoLa: Improve authenticity through contrast layer decoding
Fine-tune on high-quality data: has shown promising results in reducing hallucinations
Context learning: Provide better context for the model
Output limitations: Restrict output to constrained lists instead of free-form text
13. Overview of ChatGPT's training process
Pre-training:
Train on extensive internet datasets using Transformer architecture
Goal: Enable model to predict next word in given text sequence
Outcome: Model understands language patterns but not instructions/questions
Supervised fine-tuning or instruction fine-tuning:
Model takes user messages as input
Learns to generate responses by minimizing prediction-response differences
Marks transition from understanding patterns to understanding/responding to instructions
Reinforcement Learning with Human Feedback (RLHF):
Used as a subsequent fine-tuning step
14. Tokens in the context of large language models (LLMs)
Input text is broken down into multiple fragments
Each fragment is roughly word-sized or smaller; these pieces are called subword tokens
This process is called tokenization
Tokens can be words or character blocks
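A toy greedy longest-match tokenizer makes the idea concrete; this is a simplified stand-in for real schemes like BPE, and the vocabulary here is made up for illustration:

```python
def toy_tokenize(text, vocab):
    # Greedy longest-match subword tokenization: repeatedly take the longest
    # prefix of each word that appears in the vocabulary, falling back to
    # single characters when nothing matches.
    tokens = []
    for word in text.split():
        while word:
            for end in range(len(word), 0, -1):
                piece = word[:end]
                if piece in vocab or end == 1:
                    tokens.append(piece)
                    word = word[end:]
                    break
    return tokens

vocab = {"token", "ization", "is", "fun"}
print(toy_tokenize("tokenization is fun", vocab))
# → ['token', 'ization', 'is', 'fun']
```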
15. How text controls generation in Stable Diffusion
Stable Diffusion is a latent diffusion model with three core components:
Autoencoder (VAE)
U-Net
Text encoder
Process:
Unet's Attention module takes Latent Feature and Context Embedding as inputs
Performs Cross Attention operation to fuse image and text information
Overall, it follows a classic Transformer workflow
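The cross-attention step above can be sketched in numpy; dimensions and projection matrices here are illustrative (a real U-Net learns these), but the direction of the fusion is the point: Queries come from the image latents, Keys and Values from the text embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
latent = rng.normal(size=(64, d))    # 64 flattened latent-feature positions
context = rng.normal(size=(77, d))   # 77 text-token embeddings (context)

# Illustrative projection matrices (learned parameters in a real model).
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def cross_attention(latent, context):
    # Queries from the image latents; Keys/Values from the text:
    # this is how text information steers the image generation.
    Q, K, V = latent @ Wq, context @ Wk, context @ Wv
    scores = Q @ K.T / np.sqrt(d)                     # (64, 77)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over text tokens
    return weights @ V                                # (64, d) fused features

out = cross_attention(latent, context)
assert out.shape == (64, d)
```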
16. Main problem Stable Diffusion solves compared to Diffusion
Diffusion's disadvantage:
Requires input of full-sized images into U-Net during reverse diffusion
This makes Diffusion very slow when the image size and the number of time steps are large
Stable Diffusion addresses this by running the diffusion process in a compressed latent space instead of pixel space, greatly improving efficiency.
17. Why Diffusion chooses a random time step for each training sample
Training process includes:
Selecting a random time step for each training sample
Applying Gaussian noise corresponding to the time step to the image
Converting time step into corresponding embedding
Rationale:
During training, loss gradually decreases, with smaller changes towards the end
If time steps were incremental, the model would focus too much on earlier time steps (due to larger early losses)
Random time steps help balance focus across all time steps
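The per-sample random time step relies on the closed-form forward process, which lets training jump straight to any step t; a numpy sketch with an illustrative linear noise schedule (not the exact Stable Diffusion schedule):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
# Simple linear noise schedule; beta values here are illustrative.
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

def noisy_sample(x0, t):
    # Closed form of the forward process:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise,
    # so any time step can be sampled directly without iterating.
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * noise

x0 = rng.normal(size=(8, 8))          # stands in for a (latent) image
# Each training sample draws its own random time step, so updates cover
# early and late steps evenly instead of overweighting the early ones.
t = rng.integers(0, T)
x_t = noisy_sample(x0, t)
assert x_t.shape == x0.shape
```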
18. Mitigating loss of general ability after domain-specific training
Problem: After training on domain data, general ability often declines
Solution: Include general datasets during domain training
Recommended ratio:
No definitive answer, depends on domain data amount
When data amount is limited, domain to general data ratio of 1:5 to 1:10 is generally appropriate
19. How reinforcement learning works (example)
Reinforcement learning (RL) system components:
Agent
Environment
Process:
Environment sends a state to the agent
Agent takes action based on observations
Environment sends next state and reward to agent
Agent updates knowledge using reward to evaluate last action
Loop continues until environment sends termination state
Example: Learning to play Counter-Strike
RL agent (Player 1) collects state S⁰ from environment (Counter-Strike game)
Based on S⁰, agent takes action A⁰ (e.g., move left/right). Initially, actions are random
Environment enters new state S¹ (new game stage)
Agent receives reward R¹ (e.g., extra points or coins)
RL loop continues until agent dies or reaches destination, outputting series of states, actions, and rewards
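The loop above can be sketched with a made-up toy environment (a 1-D corridor rather than Counter-Strike); the environment name and reward values are invented for illustration:

```python
import random

class ToyEnv:
    # Made-up 1-D corridor: the agent starts at 0 and the episode ends
    # when it reaches +5 (win, reward 1.0) or -5 (lose, reward 0.0).
    def __init__(self):
        self.pos = 0

    def step(self, action):              # action: -1 (left) or +1 (right)
        self.pos += action
        done = abs(self.pos) >= 5
        reward = 1.0 if self.pos >= 5 else 0.0
        return self.pos, reward, done    # (next state, reward, terminated)

random.seed(0)
env = ToyEnv()
state, total_reward, done = 0, 0.0, False
while not done:
    action = random.choice([-1, 1])      # initially the agent acts randomly
    state, reward, done = env.step(action)
    total_reward += reward               # a real agent would use this to update its policy
```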
20. Understanding reward maximization in reinforcement learning
Key concept: AI agent tries to maximize long-term rewards from its environment
Process:
Agent learns from rewards/punishments
Changes behavior to better complete tasks
Goal: Make agent smart enough to learn from experience and make choices that achieve long-term goals quickly
21. Improving prompt generalization in large language models
Strategies:
Use multiple different prompts to increase sample diversity for model learning
Add random noise or transformations to prompts to enrich dataset and improve generalization
Apply transfer learning or meta-learning to extract knowledge from previous tasks and apply to new ones
22. Difference between Instruction Tuning and Prompt tuning
Prompt tuning:
Generates prompt template (hard/soft prompt) for each task
Performs full-shot fine-tuning and evaluation on each task
Pre-trained model parameters are frozen
Aims to stimulate language model's completion ability (e.g., generating second half of sentence, cloze tasks)
Instruction Tuning:
Generates instruction (hard token) for each task
Fine-tunes on several full-shot tasks
Evaluates generalization ability on specific tasks (zero-shot)
Pre-trained model parameters are unfrozen
Aims to stimulate language model's understanding ability through clear instructions
23. Improvements for knowledge distillation
Knowledge distillation transfers knowledge from complex to simple models. Improvement points:
Use different types of loss functions and temperature parameters for better distillation effects
Introduce additional information to improve distillation, e.g., adding similarity constraints to model training
Combine distillation with other techniques, e.g., multi-task learning and transfer learning
24. Attention calculation complexity in Transformers and improvements
Standard Transformer attention complexity: O(N^2), where N is input sequence length
Improvement methods:
Use sparse attention patterns, computing only a subset of the N² token-pair interactions
Use local attention mechanism, only calculating interactions of subsequences related to current position
Adopt approximate methods like randomization and sampling
Use compressed attention mechanisms by mapping input vectors to low-dimensional space (e.g., hash attention, low-rank attention)
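The saving from local attention can be quantified by counting scored query-key pairs; a sketch with illustrative sizes, comparing the quadratic full cost against a fixed-window local cost:

```python
import numpy as np

def attention_pair_counts(n, window):
    # Full self-attention scores all n*n query-key pairs; local attention
    # scores only keys within +/- window of each query: O(n * window).
    full_pairs = n * n
    local_pairs = sum(min(i + window, n) - max(i - window, 0) for i in range(n))
    return full_pairs, local_pairs

full, local = attention_pair_counts(n=4096, window=128)
print(full, local)   # local cost grows linearly in n, full cost quadratically
```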
25. Choosing between Chat and Base models for SFT (Supervised Fine-Tuning)
Decision based on SFT data amount:
Less than 10k data: Use Chat model as base for fine-tuning
100k data: Use Base model for fine-tuning
26. Organizing and processing book and paper data for large model pre-training
For pre-training:
Split book and paper text into paragraphs
For supervised fine-tuning:
Convert to instruction format dataset
Use titles or key phrases as prompts
27. Examples of alignment problems in large language models
Alignment: Degree of consistency between model's goals/behaviors and human values/expectations
Examples of misalignment:
Lack of helpfulness: Model not following explicit user instructions
Hallucinations: Model fabricating non-existent or incorrect facts
Lack of interpretability: Difficulty for humans to understand model's decision-making process
Generating biased or toxic outputs: Model reproducing biased/toxic data in outputs, even without explicit instructions
28. Use of Adaptive Softmax in large language models
Benefits:
Enables efficient training and inference when handling large vocabularies
Reduces computational cost as vocabulary size grows
How it works:
Groups words into clusters based on their frequency
Reduces computation required for vocabulary probability distribution
Result:
More efficient training and running of large language models
Faster experimentation and development
29. Difference in Layer Normalization between GPT3 and LLAMA
GPT3:
Uses Post-Layer Normalization
Process: Perform self-attention or feed-forward network calculations, then Layer Normalization
Benefits: Stabilizes training process, improves model performance
LLAMA:
Uses Pre-Layer Normalization
Process: Perform Layer Normalization, then self-attention or feed-forward network calculations
Benefits: Improves model's generalization ability and robustness
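The two orderings differ only in where the normalization sits relative to the residual connection; a numpy sketch with a trivial stand-in sublayer, making no claim about which real model uses which beyond the text above:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def sublayer(x):
    # Trivial stand-in for self-attention or the feed-forward network.
    return 0.5 * x

def post_ln_block(x):
    # Post-LN: residual add first, then normalize the sum.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x):
    # Pre-LN: normalize the input, then add the sublayer output back onto
    # the raw residual stream (commonly reported as more stable when deep).
    return x + sublayer(layer_norm(x))

x = np.random.default_rng(0).normal(size=(4, 8))
assert post_ln_block(x).shape == pre_ln_block(x).shape == (4, 8)
```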
30. Difference between MHA (Multi-Head Attention) and MQA (Multi-Query Attention)
MQA (Multi-Query Attention):
All heads share a single Key head and a single Value head
Each head keeps its own Query parameters
Significantly reduces the parameters (and KV cache size) of the Key and Value projections compared to MHA
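The saving can be made concrete by counting projection parameters; the model dimensions below are illustrative (LLaMA-like sizes), and biases are ignored:

```python
def attention_projection_params(d_model, n_heads, n_kv_heads):
    # Parameter count of the Q/K/V projection matrices (biases ignored).
    d_head = d_model // n_heads
    q = d_model * (n_heads * d_head)          # every head has its own queries
    kv = 2 * d_model * (n_kv_heads * d_head)  # keys and values
    return q + kv

d_model, n_heads = 4096, 32
mha = attention_projection_params(d_model, n_heads, n_kv_heads=n_heads)
mqa = attention_projection_params(d_model, n_heads, n_kv_heads=1)
print(mha, mqa)   # MQA shrinks the K/V projections by a factor of n_heads
```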
31. Role of Flash Attention in inference optimization
Flash Attention:
An exact, IO-aware implementation of the attention mechanism
Tiles the computation so intermediates stay in fast on-chip memory, avoiding materializing the full attention matrix
Reduces memory consumption and improves computation speed
Particularly suitable for scenarios with long sequences and large model parameters (e.g., natural language processing, recommendation systems)
32. Three stages of ZeRO (Zero Redundancy Optimizer)
Stage 1: Partition optimizer states across different devices to reduce memory usage
Stage 2: Partition gradients across devices in addition to optimizer states
Stage 3: Partition model parameters as well, for maximum memory savings
33. Core ideas of KV Cache and GQA optimization in multi-head attention mechanism (MHA)
KV Cache (Key-Value Cache):
Optimization technique used in autoregressive generation models
Caches the Keys (K) and Values (V) of previously processed tokens to reduce repeated computation
Improves inference efficiency in generative models
Each new generated token is used as input along with previous tokens, so KV Cache avoids recomputing for the same tokens
Memory usage grows linearly with input and output sequence lengths, requiring optimization for longer sequences
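A minimal numpy sketch of the caching pattern; the projection matrices and sizes are illustrative, and only the K/V reuse is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
k_cache, v_cache = [], []

def decode_step(new_token_emb):
    # Compute K/V only for the newly generated token and append to the cache;
    # earlier tokens' K/V are reused instead of recomputed at every step.
    k_cache.append(new_token_emb @ Wk)
    v_cache.append(new_token_emb @ Wv)
    return np.stack(k_cache), np.stack(v_cache)

for step in range(5):                    # five decoding steps
    K, V = decode_step(rng.normal(size=d))

# The cache (and its memory) grows linearly with the generated sequence.
assert K.shape == (5, d) and V.shape == (5, d)
```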
Grouped Query Attention (GQA):
A compromise between Multi-Head Attention (MHA) and Multi-Query Attention (MQA)
Groups Query Heads and shares one Key Head and one Value Head within each group
Retains some of the expressive power of multi-head attention
Accelerates inference speed by reducing memory access pressure
Can be implemented by fine-tuning already trained models
Uses mean pooling to generate shared KV
Maintains high model quality while improving inference speed
34. Causes and solutions for loss spikes in pre-training of 100B+ large models
Causes: The occurrence of loss spikes in large-scale model training can be attributed to instability issues, particularly with adaptive optimization algorithms like Adam.
Solutions:
Adjust learning rate: Carefully tune the learning rate to find a balance between training speed and stability.
Gradient clipping: Implement gradient clipping to prevent extremely large gradients from causing instability.
Warm-up strategy: Use a learning rate warm-up strategy to gradually increase the learning rate at the beginning of training.
Optimizer modifications: Consider using modified versions of Adam or alternative optimizers that are more stable for large-scale training.
Mixed precision training: Utilize mixed precision training techniques to improve numerical stability.
Layer-wise Adaptive Rate Scaling (LARS): Implement LARS to adjust learning rates for different layers based on their gradient statistics.
Monitor and adjust: Continuously monitor training metrics and be prepared to adjust hyperparameters if instability occurs.
Regularization techniques: Apply appropriate regularization methods to prevent overfitting and improve stability.
Batch size adjustment: Experiment with different batch sizes to find an optimal balance between computational efficiency and training stability.
Model architecture considerations: Consider architectural choices that promote stability in very large models, such as careful initialization strategies or normalization techniques.
These strategies can help mitigate the occurrence of loss spikes and improve the overall stability of training for extremely large language models.
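Two of the mechanical pieces, learning-rate warm-up and global-norm gradient clipping, fit in a few lines; the base learning rate, warm-up length, and clipping threshold below are illustrative defaults, not prescriptions:

```python
import numpy as np

def lr_with_warmup(step, base_lr=3e-4, warmup_steps=2000):
    # Linear warm-up from 0 to base_lr, then constant
    # (a cosine decay often follows in practice).
    return base_lr * min(1.0, step / warmup_steps)

def clip_gradient(g, max_norm=1.0):
    # Rescale the gradient if its global norm exceeds max_norm, preventing
    # a single huge gradient from destabilizing training.
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

g = np.array([30.0, 40.0])               # norm 50, far above the threshold
clipped = clip_gradient(g)
assert np.isclose(np.linalg.norm(clipped), 1.0)
assert np.isclose(lr_with_warmup(1000), 1.5e-4)
```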

