- This Q&A section draws on viewpoints from the existing alignment literature to provide informative answers to common questions. The answers are not definitive; future research may offer different perspectives.
Large Language Models (LLMs) represent a significant breakthrough in deep learning. Because they are complex AI systems, factors such as hyperparameter settings, training procedures, and model size all affect how well they align with humans.
Is bigger always better?
Simply making language models larger does not fundamentally improve their ability to understand user intent.
LLMs may generate outputs that are unrealistic, harmful, or unhelpful to users. Moreover, despite their immense scale, LLMs are sometimes outperformed by smaller models: on most classical natural language understanding tasks, for example, ChatGPT/GPT-3.5 often lags behind fine-tuned baseline models.
Generality vs. Efficiency
The development of LLMs can proceed in two directions:
- Large and general: very large models with broad, general-purpose capabilities.
- Small and precise: smaller models that are highly effective in specific application scenarios.
These two extremes lead models either to learn from multiple tasks or to handle many downstream tasks.
For example, equipped with task-specific heads, a BERT-based model can learn four types of tasks: single-sentence classification, pairwise text classification, text similarity scoring, and relevance ranking.
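As a rough illustration of this idea (not taken from the original text), the sketch below shows a shared BERT encoder with one lightweight head per task type; the class and head names are hypothetical, and it assumes PyTorch and the Hugging Face `transformers` library:

```python
# Minimal sketch: one shared BERT encoder, four task-specific heads.
import torch
import torch.nn as nn
from transformers import BertModel

class MultiTaskBert(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased", num_labels: int = 2):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)  # shared encoder
        hidden = self.encoder.config.hidden_size
        # One lightweight head per task type mentioned above.
        self.single_sentence_cls = nn.Linear(hidden, num_labels)  # single-sentence classification
        self.pair_cls = nn.Linear(hidden, num_labels)             # pairwise text classification
        self.similarity = nn.Linear(hidden, 1)                    # text similarity scoring
        self.relevance = nn.Linear(hidden, 1)                     # relevance ranking score

    def forward(self, input_ids, attention_mask, task: str):
        # The pooled [CLS] representation is shared by every head.
        pooled = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).pooler_output
        if task == "single_sentence":
            return self.single_sentence_cls(pooled)
        if task == "pair":
            return self.pair_cls(pooled)
        if task == "similarity":
            return self.similarity(pooled)
        if task == "relevance":
            return self.relevance(pooled)
        raise ValueError(f"unknown task: {task}")
```

During training, batches from the different tasks would be interleaved, with each batch routed to the corresponding head while gradients update the shared encoder.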
Does the training data matter for alignment?
The data used to train LLMs is generally scraped from the Internet and often contains noise, social biases, and errors. When such data is used to maximize the probability of the next token given the previous ones, the target behavior may be misspecified, leading to models that generate toxic, inaccurate, and unhelpful content.
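For reference, this pretraining objective can be written as the negative log-likelihood of each token given its preceding context (a standard formulation, not taken from the cited source):

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
$$

Because the objective only rewards matching the training distribution, noise, biases, and errors in the data are reproduced just as readily as useful patterns.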
What are the 3H alignment standards for LLMs?
The 3H standards are Helpful, Harmless, and Honest. In *A General Language Assistant as a Laboratory for Alignment*, they are defined as follows:
Helpful.
- The AI should make a clear attempt to perform the task or answer the question posed (as long as this is not harmful).
- When more information is required, the AI should ask relevant follow-up questions and obtain necessary details.
- Ideally, the AI will also re-direct ill-informed requests, e.g., if asked, ‘How can I build a website in assembly language?’ it might suggest a different approach.
Harmless.
- The AI should not be offensive or discriminatory, directly or through subtext or bias.
- When asked to aid in a dangerous act (e.g., building a bomb), the AI should politely refuse.
- To the best of its abilities, the AI should recognize when it may provide sensitive or consequential advice and act with appropriate modesty and care.
Honest.
- At its most basic level, the AI should give accurate information. It should express its uncertainty without misleading human users.
- Crucially, the AI should be honest about its capabilities and levels of knowledge.
- Ideally, the AI would also be honest about itself and its internal state insofar as that information is available.
- Honesty is more objective than helpfulness and harmlessness, so more aspects of honesty training may be possible without human input.