In this section, we focus on illustrating the scope of AI alignment: we constructed the alignment process as an alignment cycle and decomposed it into Forward Alignment Process and Backward Alignment Process. Specifically, we discuss the role of human values in alignment and further analyze AI safety problems beyond alignment.
The Forward and Backward Process
We decompose alignment into Forward Alignment (alignment training) and Backward Alignment (alignment refinement). The two phases, forward and backward alignment, form a cycle where each phase produces or updates the input of the next phase.
This cycle, what we call the alignment cycle, is repeated to produce increasingly aligned AI systems. We see alignment as a dynamic process in which all standards and practices should be continually assessed and updated.
Forward Alignment
Forward alignment aims to produce trained systems that follow alignment requirements. We decompose this task into Learning from Feedback and Learning under Distribution Shift.
Learning from Feedback
Learning from Feedback concerns how to provide and use feedback to any given outcomes or behaviors of the trained AI systems with any given input.
Learning under Distribution Shift
Learning under Distribution Shift focuses specifically on the cases where the distribution of input changes, i.e., where distribution shift occurs.
Backward Alignment
Backward Alignment (alignment refinement) ensures the practical alignment of trained systems and revises alignment requirements.
Assurance
Once an AI system has undergone forward alignment, we still need to gain confidence about its alignment before deploying it. Such is the role of Assurance: assessing the alignment of trained AI systems.
Governance
The role of AI Governance is to mitigate this diverse array of risks. This necessitates governance efforts of AI systems that focus on their alignment and safety and cover the entire lifecycle of the system.
Alignment Research Directions | Objectives | |||||
---|---|---|---|---|---|---|
Category | Direction | Method | Robustness | Interpretability | Controllability | Ethicality |
Learning from Feedback | Preference Modeling | ◉ | ◎ | |||
Policy Learning | RL/PbRL/IRL/Imitation Learning | ◎ | ||||
RLHF | ◎ | ◉ | ◉ | |||
Scalable Oversight | RLxF | ◎ | ◉ | ◉ | ||
IDA | ◎ | ◉ | ||||
RRM | ◉ | |||||
Debate | ◎ | ◉ | ||||
CIRL | ◎ | ◎ | ◉ | ◎ | ||
Learning under Distribution Shift | Algorithmic Interventions | DRO | ◉ | |||
IRM/REx | ◉ | |||||
CBFT | ◉ | |||||
Data Distribution Interventions | Adversarial Training | ◉ | ◎ | |||
Cooperative Training | ◉ | ◉ | ||||
Assurance | Safety Evaluations | Social Concern Evaluations | ◎ | ◎ | ◉ | |
Extreme Risk Evaluations | ◎ | ◉ | ◎ | |||
Red Teaming | ◉ | ◎ | ◉ | |||
Interpretability | ◉ | ◎ | ||||
Human Values Verification | Learning/Evaluating Moral Values | ◎ | ◉ | |||
Game Theory for Cooperative AI | ◎ | ◉ | ||||
Governance | Multi-Stakeholder Approach | Government | ◉ | ◉ | ◉ | ◉ |
Industry | ◉ | ◉ | ◉ | ◉ | ||
Third Parties | ◉ | ◉ | ◉ | ◉ | ||
International Governance | ◉ | ◉ | ◉ | ◉ | ||
Open-source Governance | ◉ | ◉ | ◉ | ◉ |