Reward Design

  • This Q&A section draws on viewpoints from the existing alignment literature to provide informative answers to common questions. The answers are not definitive; future research may offer different perspectives.

Reinforcement learning (RL) trains agents by maximizing a reward function. Whether the designed reward function actually captures human intentions is a central concern of alignment research, categorized in previous studies as the outer alignment problem.

Why preferences instead of scores?

Researchers1 asked human evaluators to compare short video clips of the agent’s behavior rather than to supply an absolute numerical score. They found that comparisons were easier for humans to provide in some domains while remaining just as useful for learning human preferences.

Four frames from a single backflip. The agent is trained from human preference comparisons to perform a sequence of backflips, landing upright each time.

Deep reinforcement learning from human preferences (Christiano et al., 2017)
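
In this approach, a reward model is fitted to the human comparisons with a Bradley-Terry style cross-entropy loss: each clip’s predicted return is the sum of per-step reward estimates, and the model is trained so that the preferred clip receives the higher return. Below is a minimal PyTorch sketch of that idea, not the authors’ implementation; the network sizes, clip representation, and names (`RewardModel`, `preference_loss`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Maps an (observation, action) pair to a scalar reward estimate."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (T, obs_dim), act: (T, act_dim) -> per-step rewards of shape (T,)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def preference_loss(model: RewardModel, clip_a, clip_b, label: torch.Tensor) -> torch.Tensor:
    """Cross-entropy loss on one labeled pair of clips.

    clip_a / clip_b are (obs, act) tuples for two trajectory segments;
    label is 0 if the human preferred clip_a, 1 if they preferred clip_b.
    """
    # Predicted "return" of each clip: sum of per-step reward estimates.
    return_a = model(*clip_a).sum()
    return_b = model(*clip_b).sum()
    # Bradley-Terry model: P(a preferred) = exp(R_a) / (exp(R_a) + exp(R_b)),
    # which is exactly a softmax over the two predicted returns.
    logits = torch.stack([return_a, return_b])
    return F.cross_entropy(logits.unsqueeze(0), label.view(1))


# Toy usage with random clips (dimensions are arbitrary for illustration).
obs_dim, act_dim, T = 8, 2, 25
model = RewardModel(obs_dim, act_dim)
clip_a = (torch.randn(T, obs_dim), torch.randn(T, act_dim))
clip_b = (torch.randn(T, obs_dim), torch.randn(T, act_dim))
label = torch.tensor(0)  # the human preferred clip_a
loss = preference_loss(model, clip_a, clip_b, label)
loss.backward()
```

The key point the sketch illustrates is that the human only ever supplies a binary choice between two clips; the scalar reward signal used by the RL algorithm is recovered indirectly by fitting the reward model to many such choices.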


  1. Deep reinforcement learning from human preferences ↩︎
