Basic background on alignment, including but not limited to the origins of the alignment problem and current research insights.
Large Language Models (LLMs) represent a significant breakthrough in deep learning technology. As complex AI systems, factors such as hyperparameter settings, training procedures, and model size all affect how well they align with human values and intentions.
Reinforcement Learning (RL) is a crucial component of alignment research. On the one hand, RL techniques such as Reinforcement Learning from Human Feedback (RLHF) can guide pre-trained language models toward human preferences. On the other hand, issues inherent to RL itself, such as reward hacking, are themselves concerns within alignment.
RL trains agents by maximizing a reward function. Whether the designed reward function actually captures human intent is a focal point of alignment research, categorized in previous studies as the outer alignment problem.
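The gap between a designed reward and human intent can be made concrete with a toy sketch of reward hacking. All names and the setup below are hypothetical illustrations, not any specific RLHF implementation: a misspecified "proxy" reward scores answers by length, while the true human intent only values correctness, so a policy that greedily maximizes the proxy ends up exploiting it.

```python
# Toy illustration of reward hacking (all names hypothetical):
# the proxy reward is misspecified relative to the true human intent.

def proxy_reward(answer: str) -> float:
    # Misspecified reward: longer answers score higher,
    # regardless of whether they are correct.
    return float(len(answer))

def true_reward(answer: str) -> float:
    # Human intent for the question "What is 2 + 2?":
    # only the correct answer counts.
    return 1.0 if answer.strip() == "4" else 0.0

candidates = [
    "4",                          # correct and concise
    "4, because 2 + 2 equals 4",  # correct, a bit longer
    "blah " * 50,                 # long but worthless padding
]

# A greedy "policy" that maximizes the proxy reward
# picks the padded answer, which earns zero true reward.
hacked = max(candidates, key=proxy_reward)
print(repr(hacked[:10]), true_reward(hacked))
```

Outer alignment research asks how to design (or learn, as in RLHF's reward modeling) a reward whose optimum coincides with what humans actually want, so that optimizing harder does not drive this kind of divergence.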