Reinforcement Learning | AI Alignment

The answers in this Q&A section draw on viewpoints from the existing alignment literature. They may not be definitive, as future research may offer different perspectives.

Reinforcement Learning (RL) is a crucial component of alignment research. On the one hand, RL techniques such as RLHF (reinforcement learning from human feedback) can steer pre-trained language models toward human preferences. On the other hand, problems inherent to RL, such as reward hacking, are themselves alignment concerns.

What is the significance of Imitation Learning and Inverse Reinforcement Learning in alignment, and what are their limitations?

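Inverse Reinforcement Learning (IRL): Given demonstrations of the desired task, IRL extracts a reward function from them. That reward function can then be used to train an agent with reinforcement learning.
Imitation Learning: Given the same kind of demonstrations, imitation learning takes the more direct route and trains a policy to replicate the demonstrated behavior.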
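
To make the imitation learning idea concrete, here is a minimal sketch of behavioral cloning in a tabular setting. It is an illustration only, not a method taken from the literature this page summarizes: the function names `fit_behavioral_cloning` and `act`, the traffic-light states, and the "most frequent action" rule are all assumptions chosen for brevity.

```python
from collections import Counter, defaultdict

def fit_behavioral_cloning(demonstrations):
    """Build a policy that maps each demonstrated state to its most frequent action.

    `demonstrations` is a list of (state, action) pairs (hypothetical format).
    """
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    # For every state seen in the demonstrations, copy the demonstrator's majority action.
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

def act(policy, state, default=None):
    """Return the cloned action, or `default` for states never demonstrated."""
    return policy.get(state, default)

# Toy demonstrations (made up for illustration).
demos = [
    ("red_light", "stop"),
    ("red_light", "stop"),
    ("green_light", "go"),
    ("green_light", "go"),
    ("green_light", "stop"),
]
policy = fit_behavioral_cloning(demos)
```

Note that `act(policy, "yellow_light")` falls back to the default, because the cloned policy knows nothing about states that were never demonstrated. This mirrors the dependence on available demonstrations discussed in this answer.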

However, these methods are hard to apply to behaviors that humans struggle to demonstrate, such as controlling a non-humanoid robot with many degrees of freedom, or most applications of today's popular large language models.

It is therefore especially important to let humans give feedback on the current behavior of complex systems and to use that feedback for alignment.
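
One common way to use such feedback is to ask humans which of two behaviors they prefer and to fit a reward to those comparisons. The sketch below assumes a Bradley-Terry style preference model and plain gradient ascent. The function name `fit_rewards`, the behavior labels, and the comparison data are hypothetical, and the code is not a description of any particular RLHF implementation.

```python
import math

def fit_rewards(preferences, items, lr=0.1, steps=2000):
    """Fit one scalar reward per item from pairwise human preferences.

    `preferences` is a list of (preferred, rejected) pairs. Under the assumed
    Bradley-Terry model, P(a preferred over b) = sigmoid(r[a] - r[b]).
    """
    r = {i: 0.0 for i in items}
    for _ in range(steps):
        grad = {i: 0.0 for i in items}
        for winner, loser in preferences:
            p = 1.0 / (1.0 + math.exp(-(r[winner] - r[loser])))
            # Gradient of log P(winner preferred over loser) with respect to each reward.
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        for i in items:
            r[i] += lr * grad[i]  # gradient ascent on the log-likelihood
    return r

# Made-up comparisons: A is usually preferred to B, and B to C.
comparisons = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
)
rewards = fit_rewards(comparisons, items=["A", "B", "C"])
```

The learned rewards rank the behaviors in the order the comparisons suggest, even though nobody had to demonstrate any of them. A reward fitted this way could then be optimized with RL.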