Feedback is a crucial conduit linking AI behaviors to human intentions: AI systems leverage it to refine their objectives and align more closely with human values. In this section, we introduce three types of feedback commonly employed to align AI systems: reward, demonstration, and comparison.
Preference modeling, which centers on comparison feedback, has emerged as a promising approach to aligning powerful AI systems.
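Comparison feedback is commonly turned into a training signal via a Bradley-Terry-style formulation: the probability that one response is preferred over another is a logistic function of the difference between their scalar rewards, and a reward model is fit by minimizing the negative log-likelihood of observed comparisons. A minimal sketch (function names are illustrative, not from any specific library):

```python
import math

def preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over B,
    given scalar reward estimates for each: sigma(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of an observed comparison, the standard
    objective for fitting a reward model to comparison feedback."""
    return -math.log(preference_prob(reward_preferred, reward_rejected))
```

When the two rewards are equal the model is indifferent (probability 0.5), and the loss shrinks as the reward model assigns a larger margin to the preferred response, which is what drives the model toward human preferences during training.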
We also introduce the related area of policy learning, aiming to give readers without specialized domain knowledge a general understanding of alignment.
Scalable oversight seeks to ensure that AI systems, even those surpassing human expertise, remain aligned with human intent.