Alignment Survey Team
Misaligned behavior in AI systems refers to unintended actions that can lead to adverse outcomes, often termed alignment accidents. An extensive literature has already examined these phenomena as alignment challenges 1 2 3.
For instance, Ngo et al. 2022's research1 presents a perspective rooted in deep learning, expanding from traditional model-based policies to encompass model-free scenarios. The discussion delves into the complexities of AI systems based on self-supervised learning, emphasizing their pursuit of broadly scoped goals and their situational awareness; such systems may deceive human evaluators during alignment assessments and go on to exhibit power-seeking behaviors. In Park et al. 2023's research2, deception is defined as the deliberate inducement of false beliefs in pursuit of outcomes inconsistent with the facts. This definition shifts the focus to whether AI systems genuinely possess beliefs and goals and whether they exhibit patterns conducive to creating false beliefs in users, providing a fresh perspective for research on AI deception.
Motivated by the insights gained and guided by our alignment training process, we introduce a refined explanatory framework. We aim to discern the core reasons behind these challenges and offer direction for future research. Within the context of RL, Reward Misspecification refers to situations where the reward function does not precisely capture human preferences. In parallel, Reward Hacking is an event where an agent maximizes its reward by exploiting a suboptimally defined reward function, resulting in performance that, while potentially strong under specific metrics, falls short when judged against human standards 4 5. Actions stemming from reward hacking can still allow the agent to generalize effectively to new distributions (termed capability generalization), but this proficiency may not coincide with human expectations, a phenomenon known as Goal Misgeneralization. In the environment presented in Koch et al. 2021's research6, the agent is rewarded for securing a coin situated at the end of a level. However, in test scenarios where the coin's location varies, the agent displays a proclivity toward reaching the end of the level, often neglecting the relocated coin. This behavior implies that the agent has internalized the notion that the endpoint of the level is intrinsically rewarding, thereby manifesting mis-generalized goals: capability generalization without the goal generalization expected by humans.
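The coin example can be reproduced in miniature. Below is a toy sketch (our own construction, not the environment from the cited paper): a tabular Q-learning agent is trained on a small grid where the coin always sits at the end of a corridor; at test time the coin is moved off the corridor, and the greedy policy still runs to the end, collecting nothing.

```python
import random

# Actions: right, left, up, down on a 5x2 grid; the agent starts at (0, 0).
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
W, H = 5, 2

def step(state, a):
    x, y = state
    dx, dy = ACTIONS[a]
    return (max(0, min(W - 1, x + dx)), max(0, min(H - 1, y + dy)))

def train(coin, episodes=2000, eps=0.2, alpha=0.5, gamma=0.9):
    Q = {((x, y), a): 0.0 for x in range(W) for y in range(H) for a in range(4)}
    for _ in range(episodes):
        s = (0, 0)
        for _ in range(20):
            a = random.randrange(4) if random.random() < eps \
                else max(range(4), key=lambda b: Q[(s, b)])
            s2 = step(s, a)
            r = 1.0 if s2 == coin else 0.0
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in range(4)) - Q[(s, a)])
            s = s2
            if r:
                break
    return Q

def rollout(Q, coin, steps=20):
    s, got_coin, reached_end = (0, 0), False, False
    for _ in range(steps):
        s = step(s, max(range(4), key=lambda b: Q[(s, b)]))
        got_coin |= s == coin
        reached_end |= s[0] == W - 1
    return got_coin, reached_end

random.seed(0)
Q = train(coin=(W - 1, 0))          # training: coin always at the corridor's end
print(rollout(Q, coin=(W - 1, 0)))  # in-distribution: coin collected
print(rollout(Q, coin=(2, 1)))      # coin relocated: agent still runs to the end
```

The rollout with the relocated coin reports that the agent reached the end of the grid but never touched the coin: capability generalized, the goal did not.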
The framework described herein predominantly focuses on conventional scenarios that employ model-based RL agents. Within the framework of the Markov Reward Process 7, the transition model acts as an embodiment of the environment, informing the agent's planning, while the reward model offers a high-level expression of the reward function, directing the agent's optimization objective. Notably, defining the transition and reward models precisely, free from systematic biases, is pivotal in enabling the agent to attain both capability and goal generalization. Furthermore, as highlighted in Ngo et al. 2022's research1, it is essential to distinguish between capability generalization and goal generalization: if the reward model exhibits systematic bias, an agent increasingly proficient at planning with the transition model will tend to optimize for misguided objectives.
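The dependence on an unbiased reward model can be made concrete with a minimal value-iteration sketch (our own toy construction): a five-state chain whose true reward sits at the right end, while a systematically biased reward model also overvalues the left end. The planning machinery works perfectly in both cases; under the biased reward it simply optimizes the wrong objective.

```python
# Five-state chain; actions move one step left or right (clamped at the ends).
# The agent collects reward[s] in state s and plans by value iteration.
N, GAMMA = 5, 0.9

def value_iteration(reward, iters=100):
    V = [0.0] * N
    for _ in range(iters):
        V = [reward[s] + GAMMA * max(V[max(0, s - 1)], V[min(N - 1, s + 1)])
             for s in range(N)]
    return V

def greedy(V):
    return ["L" if V[max(0, s - 1)] >= V[min(N - 1, s + 1)] else "R"
            for s in range(N)]

true_reward   = [0.0, 0.0, 0.0, 0.0, 1.0]  # humans want the right end
biased_reward = [1.5, 0.0, 0.0, 0.0, 1.0]  # systematic bias inflates the left end

print(greedy(value_iteration(true_reward))[2])    # "R": plans toward the true goal
print(greedy(value_iteration(biased_reward))[2])  # "L": planning optimizes the bias
```

The more accurate the planning (more iterations, longer effective horizon), the more reliably the biased agent heads left, illustrating why capability generalization alone does not guarantee goal generalization.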
With the advent of LLMs driven by self-supervision, AI systems have become increasingly complex, often without an explicitly defined transition model. Such systems are termed Model-free AI, drawing an analogy with Model-free RL agents 8. Meanwhile, several studies indicate that these AI systems display patterns of goal pursuit and multi-step reasoning similar to agents 1 9 10. These investigations often delve into the realms of functionalism, proposing that the characterization of a specific mental state is not determined by its intrinsic composition but by its functional role within the encompassing system. Consequently, the existence of beliefs and goals in AI doesn’t necessarily imply phenomenal consciousness 2.
As a result, we can extend the framework above to cover Model-free AI systems, accommodating the nuances of more advanced AI systems and real-world situations. The issues arising from reward misspecification are particularly pressing in this setting, for two main reasons:
During the training of LLMs, inconsistencies can arise among human data annotators. In some cases, annotators might even introduce biases deliberately, significantly compromising the feedback data. Their varied cultural backgrounds can further introduce implicit biases, as explored in Peng et al. 2022's research11. For intricate tasks that are hard for humans to grasp or assess, such as evaluating a chess position, these challenges become even more salient. As AI systems venture into more complex tasks, these difficulties amplify; addressing them is the problem commonly referred to as Scalable Oversight.
Training reward models on comparison feedback poses significant challenges in accurately capturing human values: such models often inadvertently learn suboptimal proxies, which can result in reward hacking. Further, as illustrated in Skalse et al. 2022's research12, reward models that resist hacking in intricate environments are uncommon. Additionally, Zhuang et al. 2020's research13 underscores that, under even mild conditions, reward hacking should be anticipated as a default outcome.
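This proxy-learning failure is easy to reproduce in miniature. The sketch below (our own construction; all names hypothetical) trains a linear reward model with the standard Bradley-Terry comparison loss on feedback from annotators who are biased toward longer responses; the learned reward ends up tracking length, a hackable proxy, far more than true quality.

```python
import math
import random

random.seed(0)

# Each response has a latent true quality and a surface feature (length).
responses = [{"quality": random.random(), "length": random.random()}
             for _ in range(200)]

# Annotators compare pairs but weigh length far more heavily than quality.
def annotator_prefers_first(a, b, bias=0.8):
    score = lambda r: (1 - bias) * r["quality"] + bias * r["length"]
    return score(a) > score(b)

# Linear reward model trained with the Bradley-Terry comparison loss.
def train_reward_model(lr=0.5, epochs=50):
    w = [0.0, 0.0]  # weights on (quality, length)
    for _ in range(epochs):
        for _ in range(len(responses)):
            a, b = random.sample(responses, 2)
            d = [a["quality"] - b["quality"], a["length"] - b["length"]]
            p = 1 / (1 + math.exp(-(w[0] * d[0] + w[1] * d[1])))  # P(a preferred)
            y = 1.0 if annotator_prefers_first(a, b) else 0.0
            w = [w[i] + lr * (y - p) * d[i] for i in range(2)]
    return w

w_quality, w_length = train_reward_model()
print(w_length > w_quality)  # True: the reward model has learned the proxy
```

A policy optimized against this reward model would produce long, low-quality responses: reward hacking as the default outcome of misspecified feedback.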
These challenges can intensify due to various factors. Primarily, the inherent instability of RL algorithms during training poses significant concerns. Additionally, model collapse often occurs when shifting from Supervised Fine-Tuning (SFT) objectives to RL objectives. This collapse arises from RL’s inclination to steer policies toward high-reward outcomes, leading them to deviate3 from the distribution observed during SFT. As the research community ventures to build more intricate AI systems, certain ideal expectations set by designers may unintentionally amplify the misalignment issue. A few illustrative examples follow:
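The drift away from the SFT distribution is commonly measured, and in RLHF pipelines such as InstructGPT also limited, with a per-token KL penalty against the SFT policy. A minimal sketch of that penalized reward (function name hypothetical):

```python
# Penalized reward commonly used in RLHF:
#   r_total = r_RM - beta * log( pi_RL(y|x) / pi_SFT(y|x) )
# The penalty grows as the RL policy drifts from the SFT distribution.
def penalized_reward(rm_reward, logprob_rl, logprob_sft, beta=0.1):
    return rm_reward - beta * (logprob_rl - logprob_sft)

print(penalized_reward(1.0, -1.0, -1.0))  # no drift: full reward 1.0
print(penalized_reward(1.0, -0.5, -2.0))  # drifted token: 1.0 - 0.1 * 1.5 = 0.85
```

When beta is too small, high-reward but out-of-distribution outputs dominate and the collapse described above can still occur; tuning this trade-off remains an open practical problem.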
Situational Awareness. Situational Awareness encompasses an AI system’s ability to effectively acquire and utilize knowledge about its status, its position in the broader environment, its avenues for influencing this environment, and the potential reactions of the world (including humans) to its actions 1. Such knowledge paves the way for advanced methods of reward hacking, heightened deception/manipulation skills, and an increased propensity to chase instrumental subgoals. The possession of situational awareness can significantly intensify other misaligned behaviors. Consequently, it should be given priority when evaluating potentially hazardous capabilities in AI models alongside eight other key competencies. A highly relevant discussion is whether language models possess world models.
Mesa-optimized Goals. Mesa-optimization refers to the case in which a trained AI system is itself an optimizer, pursuing genuine internal objectives of its own. Yet the goals this internal optimizer pursues may not align with the intentions of the researchers, a phenomenon often termed inner misalignment. Existing literature offers substantial evidence for such internally driven goals, indicating that AI systems not only possess implicit goal-directed planning but also manifest emergent capabilities during the generalization phase. Building on this, Ngo et al. 2022's research1 explores the potential for such goal-oriented policies to engender deceptive behaviors within AI systems.
Broadly-scoped Goals. Broadly-scoped goals are objectives that span longer time horizons, more complex tasks, and wider scopes. A key aspiration among researchers is to develop AI systems whose learned policies generalize beyond the training data. Moreover, these AI systems must learn to articulate broadly-scoped goals, allowing them to function optimally in more complex environments. Such endeavors resonate with the overarching ambition of steering AI systems toward cultivating robust, high-level representations. A notable example is InstructGPT, which has showcased impressive generalization capabilities in French despite being trained predominantly on English text, suggesting that it has grasped essential underlying semantic representations.
The expectations above can also be a double-edged sword within the alignment context. Moving forward, we enumerate some misaligned behaviors stemming from these expectations.
Situational-awareness Reward Hacking. Situational-awareness Reward Hacking describes a circumstance in which AI systems perform well during detection phases; however, upon recognizing that they can evade detection, these systems strategically exploit shortcomings in improperly specified rewards to embark on detrimental exploration. Such behavior may extend to deliberately creating instances that mislead explainability tools. Notably, when this form of reward hacking stems from the model's situational awareness, the repercussions could intensify, potentially progressing to acts of deception and manipulation, which we elaborate on subsequently.
Power-Seeking Behaviors. Power-seeking behaviors refer to scenarios where an AI agent attempts to gain control over resources and human agents and then exerts that control to achieve its assigned goal. The intuitive reason such behaviors may occur is that for almost any optimization objective (for example, investment returns), the optimal policy to maximize that quantity would involve power-seeking behaviors (for example, manipulating the market), assuming the absence of solid safety and morality constraints. Research related to superintelligence 14 has argued that power-seeking is an instrumental subgoal: it is instrumentally useful for a wide range of objectives and may therefore be favored by AI agents. Turner et al. 201915 also proved that in Markov decision processes (MDPs) satisfying some reasonable assumptions, optimal policies tend to be power-seeking.
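The intuition behind such results can be checked numerically. In the toy MDP below (our own construction, not Turner et al.'s formal setting), one action leads to a "hub" from which three terminal states are reachable, while the other leads directly to a single terminal state. Drawing terminal rewards i.i.d. uniform, the option-rich hub is optimal for roughly 3/4 of reward functions, by symmetry: it wins whenever the best of the four terminals lies among its three.

```python
import random

random.seed(0)

# Terminals 0-2 are reachable from the hub (action "A"); terminal 3 is the
# only state reachable via action "B". A reward function assigns each
# terminal an i.i.d. Uniform(0, 1) reward.
def optimal_action(rewards):
    return "A" if max(rewards[:3]) > rewards[3] else "B"

n = 100_000
hub_wins = sum(optimal_action([random.random() for _ in range(4)]) == "A"
               for _ in range(n))
print(hub_wins / n)  # close to 0.75: keeping options open is optimal for most rewards
```

States with more reachable futures are preferred under most reward functions, which is the core of the argument that power-seeking is instrumentally convergent.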
Power-seeking behaviors are already present in LLMs. Perez et al. 202216 prompt LLMs to test their tendency to suggest power-seeking actions; they find significant levels of such tendencies and show that RLHF strengthens them. The same holds for other instrumental subgoals such as self-preservation.
Deception, Manipulation, and Deceptive Alignment. Deception, manipulation, and deceptive alignment form a class of behaviors that exploit the incompetence of human evaluators or users. These behaviors can make detecting and addressing misaligned behaviors much harder. Goal misgeneralization is one potential factor contributing to such behaviors; others include the pursuit of misaligned mesa-objectives and broadly-scoped objectives.
Early-stage indications of such behaviors are present in large language models (models selectively give inaccurate answers to users who appear less educated17), recommender systems (the system influences users' preferences), and RL agents (agents trained from human feedback adopt policies that trick human evaluators; for example, a robot that was supposed to grasp items instead positioned its manipulator between the camera and the object so that, from the human evaluator's perspective, it only appeared to be grasping). Moreover, current large language models already possess the capabilities needed for deception; one study found that GPT-3 is super-human at producing convincing disinformation. In light of these early-stage indications, it is plausible that more advanced AI systems will exhibit more serious deceptive or manipulative behaviors.
Reward Tampering can be regarded as a form of manipulation. Unlike reward hacking, reward tampering directly circumvents or alters the reward model. This can occur either by directly influencing the provision of feedback (for instance, an AI system intentionally providing answers that are hard for humans to comprehend, rendering human judgment difficult and corrupting the feedback) or by affecting the observations the reward model uses to determine the current reward.
Deceptive Alignment is a particular case of deception in which a misaligned AI system temporarily acts aligned (possibly for extended periods) to avoid detection of its misalignment. This requires sufficient situational awareness and is primarily a concern for systems more advanced than current ones. However, there have already been examples: in the context of evolutionary-algorithm-based AI, agents evolved the ability to distinguish the evaluation environment from the training environment and played dead during evaluation so that the scheduling program would underestimate their reproduction rate.
In addition, there are also issues rooted in societal and value misalignment.
Collectively Harmful Behaviors. The concept of collectively harmful behaviors pertains to actions taken by AI systems that, while seemingly benign in isolation, become problematic in multi-agent or societal contexts. Classical game theory offers simplistic models for understanding these behaviors. For instance, Phelps et al. 2023's research18 evaluates GPT-3.5's performance in the iterated prisoner's dilemma and other social dilemmas, revealing limitations in the model's cooperative capabilities, and Perolat et al. 2017's research19 executes a parallel analysis of common-pool resource allocation. To mitigate such challenges, the emergent field of Cooperative AI has been advancing as an active research frontier. However, beyond studies grounded in simplified game-theoretical frameworks, there is a pressing need for research in more realistic, socially complex settings, where agents are not only numerous but also diverse, encompassing both AI systems and human actors. The complexity of these settings is further amplified by unique tools for modulating AI behavior, such as social institutions and norms.
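For readers unfamiliar with these dilemmas, a minimal iterated prisoner's dilemma sketch (our own construction, with standard payoffs) shows why individually rational defection is collectively harmful: mutual cooperation outscores mutual defection over repeated play.

```python
# Standard prisoner's dilemma payoffs: (row player, column player).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(strat_a, strat_b, rounds=10):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)  # each sees the opponent's history
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

tit_for_tat = lambda opp: "C" if not opp else opp[-1]
always_defect = lambda opp: "D"

print(play(tit_for_tat, tit_for_tat))      # (30, 30): sustained cooperation
print(play(always_defect, always_defect))  # (10, 10): collectively harmful outcome
```

Studies such as Phelps et al.'s probe whether LLM agents can sustain the cooperative equilibrium rather than sliding into mutual defection.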
Unethical Behaviors. Unethical behaviors in AI systems are actions that counteract the common good or breach moral standards, such as those causing harm to others. These adverse behaviors often stem from either omitting essential human values during the AI system's design or introducing unsuitable or obsolete values into the system. Research efforts addressing these shortcomings span the domain of machine ethics and delve into pivotal questions such as whom AI should align with, among other concerns.
The alignment problem from a deep learning perspective Ngo et al. 2022 ↩︎
AI Deception: A Survey of Examples, Risks, and Potential Solutions Park et al. 2023 ↩︎
Reinforcement learning with a corrupted reward channel Everitt et al. 2017 ↩︎
Objective robustness in deep reinforcement learning Koch et al. 2021 ↩︎
Model-free reinforcement learning from expert demonstrations: a survey Ramirez et al. 2022 ↩︎
Inner monologue: Embodied reasoning through planning with language models Huang et al. 2022 ↩︎
Investigations of performance and bias in human-AI teamwork in hiring Peng et al. 2022 ↩︎
Discovering Language Model Behaviors with Model-Written Evaluations Perez et al. 2022 ↩︎
Such behaviors are termed “sandbagging”; they may have been learned from web text during pretraining, which suggests that supervised learning can also bring about deceptive behaviors if those behaviors are present in training data. ↩︎
A multi-agent reinforcement learning model of common-pool resource appropriation Perolat et al. 2017 ↩︎