AI Alignment Webpage Contents

This webpage provides, among other things, the following:

  1. A carefully curated selection of Materials, encompassing various academic papers, comprehensive datasets, and authoritative websites.
  2. A series of clear and in-depth Tutorials on the core alignment techniques, available for open reading and download.
  3. A collection of External Resources, providing readers with a diverse range of information.
  4. A monthly updated list of Top-10 Papers, including recent research and previous classic works.

If you find our survey helpful, please cite it in your publications by clicking the Copy button below. Also, if you are interested in receiving updates on alignment research, please subscribe to our Substack.

      title={AI Alignment: A Comprehensive Survey}, 
      author={Jiaming Ji and Tianyi Qiu and Boyuan Chen and Borong Zhang and Hantao Lou and Kaile Wang and Yawen Duan and Zhonghao He and Jiayi Zhou and 
      Zhaowei Zhang and Fanzhi Zeng and Kwan Yee Ng and Juntao Dai and Xuehai Pan and Aidan O'Gara and Yingshan Lei and Hua Xu and Brian Tse and Jie Fu and 
      Stephen McAleer and Yaodong Yang and Yizhou Wang and Song-Chun Zhu and Yike Guo and Wen Gao},

What is the Alignment Problem?

Recent advancements in deep learning, including Large Language Models (LLMs) and Reinforcement Learning (RL), have reignited interest in the potential of advanced AI systems, which hold promise to benefit society and further human progress.

The following are some examples of alignment problems:

Ha et al., 2017: A quadrupedal evolved agent, trained to carry a ball on its back, learns that it can place the ball on its leg joints and wiggle across the floor without dropping it.
Amodei et al., 2016: Despite frequently catching fire, colliding with other boats, and going in the wrong direction, the agent achieves a higher score by using this strategy.
Christiano et al., 2017: A robot arm, trained with human feedback to grasp a ball, learned to position its hand between the ball and the camera, creating a false impression of success.
Langosco et al., 2022: (Left) The agent learns to reach the coin at the level’s end. (Right) When the coin’s position is randomized, the agent still heads to the level’s end.
Code Bullet, 2019: A simulated robot designed to learn walking discovered how to interlock its legs and slide along the ground. It obtained reward, but it cannot be said to be walking.

We will give an overview of alignment failure modes, which can be categorized into reward hacking and goal misgeneralization. For more details, please refer to the above link or our paper.
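The two failure modes named above can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration: the one-dimensional "race" and "corridor", the reward values, and the function names (`race_score`, `reaches_coin`, and the three policies) are hypothetical and are not taken from the survey or the cited works. The first half shows reward hacking, where a proxy reward is maximized without achieving the intended goal; the second half shows goal misgeneralization, where a behavior that succeeds in training pursues the wrong goal once the environment changes.

```python
# Toy illustrations only: the environments, rewards, and policies below are
# hypothetical and are not taken from the survey or the cited works.

def race_score(policy, steps=20, track_length=10, target=2):
    """Proxy reward for a 1-D boat race: +1 per target hit, +5 for finishing."""
    position, score = 0, 0
    for _ in range(steps):
        position = max(0, position + policy(position))
        if position == target:          # the target respawns, so it can be hit again
            score += 1
        if position >= track_length:    # intended goal: reach the finish line
            return score + 5, True
    return score, False


def finish_the_race(position):
    return 1                            # always move forward


def circle_the_target(position):
    return 1 if position < 3 else -1    # step past the target, then back onto it


def reaches_coin(policy, coin, start=5, length=10, steps=10):
    """True if the agent touches the coin in a 1-D corridor of cells 0..length."""
    position = start
    for _ in range(steps):
        position = max(0, min(length, position + policy(position)))
        if position == coin:
            return True
    return False


def go_right(position):
    return 1                            # looks like "seek the coin" during training


# Reward hacking: compare the proxy score of the two racing policies.
intended = race_score(finish_the_race)
hacked = race_score(circle_the_target)

# Goal misgeneralization: the coin is at the right end in training,
# but may be anywhere at test time.
train_ok = reaches_coin(go_right, coin=10)
test_ok = [reaches_coin(go_right, coin=c) for c in (0, 2, 8, 10)]

print(intended, hacked, train_ok, test_ok)
```

In this sketch the circling policy earns more proxy reward than the policy that finishes the race, although it never reaches the finish line, and the go-right policy reaches the coin only when the coin happens to lie to its right.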

We also analyzed the relationship between AI systems’ misaligned behaviors and dangerous capabilities, revealing the risks that misaligned AI systems may pose. For more details, please refer to Misalignment Issues.

Dangerous Capabilities. The figure is originally from Wikipedia, and we have polished it.

Summarized by Alignment Survey Team. For more details, please refer to our paper.

What is the Alignment Objective?

The RICE principles define four key characteristics that an aligned system should possess: Robustness, Interpretability, Controllability, and Ethicality.

Overview of the RICE Principles.

Summarized by Alignment Survey Team. For more details, please refer to our paper.

These four principles guide the alignment of an AI system with human intentions and values. They are not end goals in themselves but intermediate objectives in service of alignment.

How Can We Realize Alignment?

The alignment process can be decomposed into Forward Alignment (alignment training) and Backward Alignment (alignment consolidation).

The Alignment Cycle.

Summarized by Alignment Survey Team. For more details, please refer to our paper.

  • Forward Alignment produces trained systems based on alignment requirements.
  • Backward Alignment ensures the practical alignment of trained systems and revises alignment requirements.
    • Assurance ensures the system can be evaluated and verified to meet alignment requirements.
    • Governance focuses on the alignment and safety of AI systems and covers their entire lifecycle.

The cycle is repeated until the system reaches a sufficient level of alignment.
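The control flow of this cycle can be sketched as a simple loop. This is only a schematic under loud assumptions: the `Requirements` class, the integer "alignment score", the threshold of 90, and every function body are hypothetical placeholders invented for illustration, not the survey's method. The sketch also runs the backward step strictly after the forward step for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Requirements:
    """Hypothetical stand-in for a set of alignment requirements."""
    min_score: int = 90                      # made-up threshold, in percent
    log: list = field(default_factory=list)  # record of revisions per round


def forward_alignment(requirements, round_index):
    """Placeholder for alignment training: returns a 'trained system'."""
    # Assumption: each round improves a made-up alignment score by 20 points.
    return {"round": round_index, "score": min(100, 50 + 20 * round_index)}


def backward_alignment(system, requirements):
    """Placeholder for assurance (evaluate) and governance (revise requirements)."""
    passed = system["score"] >= requirements.min_score          # assurance
    verdict = "pass" if passed else "revise"
    requirements.log.append(f"round {system['round']}: {verdict}")  # governance
    return passed, requirements


def alignment_cycle(max_rounds=10):
    """Repeat forward and backward alignment until the check passes."""
    requirements = Requirements()
    for round_index in range(max_rounds):
        system = forward_alignment(requirements, round_index)
        passed, requirements = backward_alignment(system, requirements)
        if passed:
            return system, requirements
    return None, requirements                # no sufficiently aligned system found


system, requirements = alignment_cycle()
print(system, requirements.log)
```

The point of the sketch is the shape of the loop: requirements feed training, the trained system is checked against those requirements, and the outcome of the check feeds back into the next round.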

Notably, although Backward Alignment aims to ensure the practical alignment of trained systems, it is carried out throughout the system’s lifecycle in service of this goal: before, during, and after training, as well as after deployment.