This webpage provides, among other things, the following content:
If you find our survey helpful, please cite it in your publications using the BibTeX entry below. If you are interested in receiving updates on alignment research, please subscribe to our Substack.
@misc{ji2023ai,
title={AI Alignment: A Comprehensive Survey},
author={Jiaming Ji and Tianyi Qiu and Boyuan Chen and Borong Zhang and Hantao Lou and Kaile Wang and Yawen Duan and Zhonghao He and Jiayi Zhou and
Zhaowei Zhang and Fanzhi Zeng and Kwan Yee Ng and Juntao Dai and Xuehai Pan and Aidan O'Gara and Yingshan Lei and Hua Xu and Brian Tse and Jie Fu and
Stephen McAleer and Yaodong Yang and Yizhou Wang and Song-Chun Zhu and Yike Guo and Wen Gao},
year={2023},
eprint={2310.19852},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
Recent advancements in deep learning, including Large Language Models (LLMs) and Reinforcement Learning (RL), have reignited interest in the potential of advanced AI systems, which hold promise to benefit society and further human progress.
The following are some examples of alignment problems:
Examples shown (images omitted): Ha et al., 2017; Amodei et al., 2016; Christiano et al., 2017; Langosco et al., 2022; Code Bullet, 2019.
We will give an overview of alignment failure modes, which can be categorized into reward hacking and goal misgeneralization. For more details, please refer to the above link or our paper.
We also analyze the relationship between AI systems’ misalignment behaviors and dangerous capabilities, revealing the risks that misaligned AI systems may pose. For more details, please refer to Misalignment Issues.
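As a toy illustration of the reward hacking failure mode mentioned above, the sketch below compares what an optimizer of a proxy reward picks with what the designer intended. It is not taken from the survey: the action names, the reward numbers, and the helper `best_action` are all made up for illustration.

```python
# Toy illustration only: action names and reward numbers are invented.
# Each action maps to (proxy_reward, intended_utility).
ACTIONS = {
    "finish_race": (10.0, 10.0),
    "circle_for_bonus_points": (25.0, 1.0),  # scores well, but is not what we want
    "do_nothing": (0.0, 0.0),
}


def best_action(column: int) -> str:
    """Return the action with the highest value in the given column.

    column 0 = proxy reward (what the system is trained on),
    column 1 = intended utility (what the designer actually wants).
    """
    return max(ACTIONS, key=lambda name: ACTIONS[name][column])


proxy_choice = best_action(0)      # choice of a pure proxy-reward maximizer
intended_choice = best_action(1)   # choice the designer had in mind

# Reward hacking, in this toy sense: the proxy-optimal action differs from
# the intended one and yields lower intended utility.
hacked = (
    proxy_choice != intended_choice
    and ACTIONS[proxy_choice][1] < ACTIONS[intended_choice][1]
)

print(proxy_choice, intended_choice, hacked)
```

Goal misgeneralization is not captured by this sketch, since it concerns behavior under distribution shift rather than a flawed reward signal.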
The RICE principles define four key characteristics that an aligned system should possess: Robustness, Interpretability, Controllability, and Ethicality.
These four principles guide the alignment of an AI system with human intentions and values. They are not end goals in themselves but intermediate objectives in service of alignment.
The alignment process can be decomposed into Forward Alignment (alignment training) and Backward Alignment (alignment consolidation).
The cycle of Forward Alignment and Backward Alignment is repeated until the system reaches a sufficient level of alignment.
Notably, although Backward Alignment aims to ensure the practical alignment of trained systems, it is carried out throughout the system’s lifecycle in service of this goal: before, during, and after training, as well as after deployment.
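The repeated cycle described above can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the survey's procedure: `alignment_cycle`, `train_step`, `assurance_score`, `threshold`, and `max_rounds` are hypothetical names, the "system" is reduced to a single number, and Backward Alignment is collapsed into one check after each training round even though, as noted above, it actually spans the whole lifecycle.

```python
from typing import Callable, List, Tuple


def alignment_cycle(
    system: float,
    train_step: Callable[[float], float],
    assurance_score: Callable[[float], float],
    threshold: float = 0.9,
    max_rounds: int = 10,
) -> Tuple[float, List[float]]:
    """Alternate training and evaluation until the score reaches the threshold.

    All arguments are hypothetical stand-ins:
    - train_step plays the role of Forward Alignment (alignment training).
    - assurance_score plays the role of Backward Alignment (checking the result).
    """
    scores: List[float] = []
    for _ in range(max_rounds):
        system = train_step(system)        # Forward Alignment
        score = assurance_score(system)    # Backward Alignment (simplified)
        scores.append(score)
        if score >= threshold:             # "sufficient level of alignment"
            break
    return system, scores


# Toy usage: each round improves the made-up alignment level by 0.25.
final_system, scores = alignment_cycle(
    system=0.0,
    train_step=lambda s: min(1.0, s + 0.25),
    assurance_score=lambda s: s,
)
print(final_system, scores)
```

The `max_rounds` cap is a design choice of this sketch so that the loop terminates even if the threshold is never reached; the source only states that the cycle repeats until alignment is sufficient.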