Alignment Scope | AI Alignment

In this section, we focus on illustrating the scope of AI alignment: we constructed the alignment process as an alignment cycle and decomposed it into Forward Alignment Process and Backward Alignment Process. Specifically, we discuss the role of human values in alignment and further analyze AI safety problems beyond alignment.

The Forward and Backward Process

We decompose alignment into Forward Alignment (alignment training) and Backward Alignment (alignment refinement). The two phases, forward and backward alignment, form a cycle where each phase produces or updates the input of the next phase.

**The Alignment Cycle.**
Summarized by Alignment Survey Team. For more details, please refer to our paper.

This cycle, what we call the alignment cycle, is repeated to produce increasingly aligned AI systems. We see alignment as a dynamic process in which all standards and practices should be continually assessed and updated.

Forward Alignment

Forward alignment aims to produce trained systems that follow alignment requirements. We decompose this task into Learning from Feedback and Learning under Distribution Shift.

Learning from Feedback

Learning from Feedback concerns how to provide and use feedback to any given outcomes or behaviors of the trained AI systems with any given input.

Learning under Distribution Shift

Learning under Distribution Shift focuses specifically on the cases where the distribution of input changes, i.e., where distribution shift occurs.

Backward Alignment

Backward Alignment (alignment refinement) ensures the practical alignment of trained systems and revises alignment requirements.

Assurance

Once an AI system has undergone forward alignment, we still need to gain confidence about its alignment before deploying it. Such is the role of Assurance: assessing the alignment of trained AI systems.

Governance

The role of AI Governance is to mitigate this diverse array of risks. This necessitates governance efforts of AI systems that focus on their alignment and safety and cover the entire lifecycle of the system.

Table: Relationships between alignment research directions covered in the survey and the RICE principles, featuring the individual objectives each research direction aims to achieve. Filled circles stand for primary objectives, and unfilled circles stand for secondary objectives.

Summarized by Alignment Survey Team. For more details, please refer to our paper.

Alignment Research Directions			Objectives
Alignment Research Directions			Objectives
Category	Direction	Method	Robustness	Interpretability	Controllability	Ethicality
Category	Direction	Method	Robustness	Interpretability	Controllability	Ethicality
Learning from Feedback	Preference Modeling			◉	◎
	Policy Learning	RL/PbRL/IRL/Imitation Learning			◎
	Policy Learning	RLHF	◎		◉	◉
	Scalable Oversight	RLxF	◎		◉	◉
		IDA		◎	◉
		RRM			◉
		Debate		◎	◉
		CIRL	◎	◎	◉	◎
Learning under Distribution Shift	Algorithmic Interventions	DRO	◉
		IRM/REx	◉
		CBFT	◉
	Data Distribution Interventions	Adversarial Training	◉		◎
	Data Distribution Interventions	Cooperative Training	◉			◉
Assurance	Safety Evaluations	Social Concern Evaluations	◎	◎		◉
		Extreme Risk Evaluations		◎	◉	◎
		Red Teaming	◉		◎	◉
	Interpretability			◉	◎
	Human Values Verification	Learning/Evaluating Moral Values			◎	◉
	Human Values Verification	Game Theory for Cooperative AI	◎			◉
Governance	Multi-Stakeholder Approach	Government	◉	◉	◉	◉
		Industry	◉	◉	◉	◉
		Third Parties	◉	◉	◉	◉
	International Governance		◉	◉	◉	◉
	Open-source Governance		◉	◉	◉	◉