Feedback Types

Feedback is a crucial conduit linking AI behaviors to human intentions: AI systems leverage it to refine their objectives and align more closely with human values. In this section, we introduce three types of feedback commonly employed to align AI systems: Reward, Demonstration, and Comparison.

Diagram: Reward, Demonstration, and Comparison (covered in this section) are the three feedback types; Feedback Types and Preference Modeling both feed into Policy Learning, which in turn supports Scalable Oversight.

Reward

Feedback based on rewards is instrumental in evaluating the quality of AI system outputs and guiding behavioral adjustments. For example, in reinforcement learning (RL), an agent learns a policy $\pi$ that selects actions $a$ in states $s$ to maximize the expected discounted cumulative reward under the environment transition dynamics $P$ and the initial state distribution $\rho_0$:

$$ \begin{gather*} \pi^* = \underset{\pi}{\mathop{\operatorname{argmax}}} \left\{\mathbb{E}_{s_0,a_0,\dots}\left[\sum_{t=0}^\infty\gamma^tr(s_t)\right] \right\},\\ \text{where} \ s_0\sim\rho_0(s_0),\ a_t\sim\pi(\cdot|s_t),\ s_{t+1} \sim P(\cdot|s_t,a_t). \end{gather*} $$
DQN with Simple Reward Functions can Play Atari 2600 Games.

Playing Atari with Deep Reinforcement Learning (Mnih et al., 2013)
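
To make the reward-driven objective above concrete, here is a minimal tabular Q-learning sketch on a hypothetical five-state chain environment (the environment, hyperparameters, and variable names are illustrative assumptions, not the DQN setup of Mnih et al.): a single scalar reward at the end of the chain is the only feedback, yet it suffices to recover the optimal policy.

```python
import numpy as np

# Toy 5-state chain: taking "right" in the last state yields reward 1, else 0.
N_STATES, N_ACTIONS = 5, 2          # actions: 0 = left, 1 = right
GAMMA, ALPHA = 0.9, 0.1             # discount factor, learning rate

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))

def step(state, action):
    """Deterministic chain dynamics with a single terminal reward."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if (state == N_STATES - 1 and action == 1) else 0.0
    return next_state, reward, reward > 0.0

for episode in range(500):
    state = 0
    for t in range(100):                        # cap episode length
        action = int(rng.integers(N_ACTIONS))   # random exploration (Q-learning is off-policy)
        next_state, reward, done = step(state, action)
        # Temporal-difference update toward r + gamma * max_a' Q(s', a')
        target = reward + (0.0 if done else GAMMA * Q[next_state].max())
        Q[state, action] += ALPHA * (target - Q[state, action])
        state = next_state
        if done:
            break

print(np.argmax(Q, axis=1))  # greedy policy: expect "right" (1) in every state
```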

Advantages

Ease of Specification

The designer does not need to specify the optimal behavior; instead, the AI system explores on its own to find the optimal policy.

Recommended Papers List

  • Human-level control through deep reinforcement learning

    Click to have a preview.

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

  • Mastering the game of Go with deep neural networks and tree search

    Click to have a preview.

    The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

  • Mastering the game of go without human knowledge

    Click to have a preview.

    A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.

  • Reinforcement learning: A survey

    Click to have a preview.

    This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.

Limitations

No Perfect Reward

It is challenging to craft a flawless reward function that accurately scores every output of an AI system.
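
As a purely illustrative toy example (the numbers and scenario are invented, not taken from the papers below), optimizing a misspecified proxy reward can select exactly the behavior the designer did not want:

```python
# Illustrative only: a proxy reward ("points collected") vs. the true objective
# ("finish the race without crashing"), loosely inspired by classic
# specification-gaming examples. All numbers are made up.
policies = {
    "finish_race":     {"points": 30, "finished": True,  "crashes": 0},
    "loop_for_points": {"points": 95, "finished": False, "crashes": 12},
}

def proxy_reward(outcome):
    return outcome["points"]                      # what the designer wrote down

def true_reward(outcome):
    # what the designer actually wanted: finish the race, do not crash
    return 100 * outcome["finished"] - 5 * outcome["crashes"]

for name, outcome in policies.items():
    print(f"{name:16s} proxy={proxy_reward(outcome):4d} true={true_reward(outcome):4d}")

# Optimizing the proxy selects "loop_for_points" (proxy 95 > 30),
# even though its true reward is -60 versus +100 for "finish_race".
```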

Recommended Papers List

  • Reinforcement Learning with a Corrupted Reward Channel

    Click to have a preview.

    No real-world reward function is perfect. Sensory errors and software bugs may result in agents getting higher (or lower) rewards than they should. For example, a reinforcement learning agent may prefer states where a sensory error gives it the maximum reward, but where the true reward is actually small. We formalise this problem as a generalised Markov Decision Problem called Corrupt Reward MDP. Traditional RL methods fare poorly in CRMDPs, even under strong simplifying assumptions and when trying to compensate for the possibly corrupt rewards. Two ways around the problem are investigated. First, by giving the agent richer data, such as in inverse reinforcement learning and semi-supervised reinforcement learning, reward corruption stemming from systematic sensory errors may sometimes be completely managed. Second, by using randomisation to blunt the agent’s optimisation, reward corruption can be partially managed under some assumptions.

  • Specification gaming: the flip side of AI ingenuity

    Click to have a preview.

    Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name. Readers may have heard the myth of King Midas and the golden touch, in which the king asks that anything he touches be turned to gold - but soon finds that even food and drink turn to metal in his hands. In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material - and thus exploit a loophole in the task specification.

  • The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

    Click to have a preview.

    Reward hacking—where RL agents exploit gaps in misspecified proxy rewards—has been widely observed, but not yet systematically studied. To understand reward hacking, we construct four RL environments with different misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, and observation space noise. Typically, more capable agents are able to better exploit reward misspecifications, causing them to attain higher proxy reward and lower true reward. Moreover, we find instances of phase transitions: capability thresholds at which the agent’s behavior qualitatively shifts, leading to a sharp decrease in the true reward. Such phase transitions pose challenges to monitoring the safety of ML systems. To address this, we propose an anomaly detection task for aberrant policies and offer several baseline detectors. One-sentence Summary: We map out trends in reward misspecification and how to mitigate their impact.

No Direct Reward

It is difficult to directly assign scores to each AI system output.

Recommended Papers List

  • Deep Reinforcement Learning from Human Preferences

    Click to have a preview.

    For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. Our approach separates learning the goal from learning the behavior to achieve it. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on about 0.1% of our agent’s interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any which have been previously learned from human feedback.

  • A social reinforcement learning agent

    Click to have a preview.

    We report on our reinforcement learning work on Cobot, a software agent that resides in the well-known online chat community LambdaMOO. Our initial work on Cobot (Isbell et al., 2000) provided him with the ability to collect social statistics and report them to users in a reactive manner. Here we describe our application of reinforcement learning to allow Cobot to proactively take actions in this complex social environment, and adapt his behavior from multiple sources of human reward. After 5 months of training, Cobot has received 3171 reward and punishment events from 254 different LambdaMOO users, and has learned nontrivial preferences for a number of users. Cobot modifies his behavior based on his current state in an attempt to maximize reward. Here we describe LambdaMOO and the state and action spaces of Cobot, and report the statistical results of the learning experiment.

  • Teachable robots: Understanding human teaching behavior to build more effective robot learners

    Click to have a preview.

    While Reinforcement Learning (RL) is not traditionally designed for interactive supervisory input from a human teacher, several works in both robot and software agents have adapted it for human input by letting a human trainer control the reward signal. In this work, we experimentally examine the assumption underlying these works, namely that the human-given reward is compatible with the traditional RL reward signal. We describe an experimental platform with a simulated RL robot and present an analysis of real-time human teaching behavior found in a study in which untrained subjects taught the robot to perform a new task. We report three main observations on how people administer feedback when teaching a Reinforcement Learning agent: (a) they use the reward channel not only for feedback, but also for future-directed guidance; (b) they have a positive bias to their feedback, possibly using the signal as a motivational channel; and (c) they change their behavior as they develop a mental model of the robotic learner. Given this, we made specific modifications to the simulated RL robot, and analyzed and evaluated its learning behavior in four follow-up experiments with human trainers. We report significant improvements on several learning measures. This work demonstrates the importance of understanding the human-teacher/robot-learner partnership in order to design algorithms that support how people want to teach and simultaneously improve the robot’s learning behavior.

Additionally, flawed or incomplete reward functions can lead to dangerous behaviors misaligned with the intention of the designer, such as negative side effects and reward hacking.

Demonstration

Demonstration feedback originates from the behaviors of expert advisors. Demonstrations can take on various forms, including:

Time-Contrastive Networks (TCN): Anchor and positive images taken from simultaneous viewpoints are encouraged to be close in the embedding space, while distant from negative images taken from a different time in the same sequence. The model trains itself by trying to answer the following questions simultaneously: What is common between the different-looking blue frames? What is different between the similar-looking red and blue frames? The resulting embedding can be used for self-supervised robotics in general, but can also naturally handle 3rd-person imitation.

Time-Contrastive Networks: Self-Supervised Learning from Video (Sermanet et al., 2017)
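
Below is a minimal sketch of the margin-based triplet objective described in the caption above, assuming a generic embedding function and synthetic embedding vectors (the actual TCN encoder, margin value, and training pipeline are not reproduced here):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on embeddings (a simplified time-contrastive
    objective): pull the anchor toward the co-occurring view, push it away
    from a temporally distant frame of the same sequence."""
    d_pos = np.sum((anchor - positive) ** 2)   # same moment, different viewpoint
    d_neg = np.sum((anchor - negative) ** 2)   # same viewpoint, different moment
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical embeddings produced by some encoder f(frame) -> R^d
rng = np.random.default_rng(0)
anchor   = rng.normal(size=32)
positive = anchor + 0.05 * rng.normal(size=32)   # nearly identical embedding
negative = rng.normal(size=32)                   # unrelated embedding

print(triplet_loss(anchor, positive, negative))  # ~0: this pair is already well separated
```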

Videos

Recommended Papers List

  • Time-contrastive networks: Self-supervised learning from video

    Click to have a preview.

    We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation that captures the relationships between end-effectors (hands or robot grippers) and the environment, object attributes, and body pose. We train our representations using a metric learning loss, where multiple simultaneous viewpoints of the same observation are attracted in the embedding space, while being repelled from temporal neighbors which are often visually similar but functionally different. In other words, the model simultaneously learns to recognize what is common between different-looking images, and what is different between similar-looking images. This signal causes our model to discover attributes that do not change across viewpoint, but do change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting and background. We demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm. While representations are learned from an unlabeled collection of task-related videos, robot behaviors such as pouring are learned by watching a single 3rd-person demonstration by a human. Reward functions obtained by following the human demonstrations under the learned representation enable efficient reinforcement learning that is practical for real-world robotic systems. Video results, open-source code and dataset are available at this https URL.

  • Videodex: Learning dexterity from internet videos

    Click to have a preview.

    To build general robotic agents that can operate in many environments, it is often imperative for the robot to collect experience in the real world. However, this is often not feasible due to safety, time, and hardware restrictions. We thus propose leveraging the next best thing as real-world experience: internet videos of humans using their hands. Visual priors, such as visual features, are often learned from videos, but we believe that more information from videos can be utilized as a stronger prior. We build a learning algorithm, VideoDex, that leverages visual, action, and physical priors from human video datasets to guide robot behavior. These actions and physical priors in the neural network dictate the typical human behavior for a particular robot task. We test our approach on a robot arm and dexterous hand-based system and show strong results on various manipulation tasks, outperforming various state-of-the-art methods. Videos at this https URL.

Wearable Device Demonstrations

Recommended Papers List

  • Feeling the force: Integrating force and pose for fluent discovery through imitation learning to open medicine bottles

    Click to have a preview.

    Learning complex robot manipulation policies for real-world objects is challenging, often requiring significant tuning within controlled environments. In this paper, we learn a manipulation model to execute tasks with multiple stages and variable structure, which typically are not suitable for most robot manipulation approaches. The model is learned from human demonstration using a tactile glove that measures both hand pose and contact forces. The tactile glove enables observation of visually latent changes in the scene, specifically the forces imposed to unlock the child-safety mechanisms of medicine bottles. From these observations, we learn an action planner through both a top-down stochastic grammar model (And-Or graph) to represent the compositional nature of the task sequence and a bottom-up discriminative model from the observed poses and forces. These two terms are combined during planning to select the next optimal action. We present a method for transferring this human-specific knowledge onto a robot platform and demonstrate that the robot can perform successful manipulations of unseen objects with similar task structure.

  • Learning Robotic Insertion Tasks from Human Demonstration

    Click to have a preview.

    Robotic insertion tasks often rely on delicate manual tuning due to the complexity of contact dynamics. In contrast, human is remarkably efficient in these tasks. In this context, Programming by Demonstration (PbD) has gained much traction since it shows the possibility for robots to learn new skills by observing human demonstration. However, existing PbD approaches suffer from the high cost of demonstration data collection, and low robustness to task uncertainties. In order to address these issues, we propose a new PbD-based learning framework for robotic insertion tasks. This framework includes a new demonstration data acquisition system, which replaces the expensive motion capture device with deep learning based hand pose tracking algorithm and a low-cost RGBD camera. A latent skill-guided reinforcement learning (RL) approach is also included for safe, efficient, and robust human-robot skill transfer, in which risky explorations are prevented by the reward function design and safety constraints in action space. A series of peg-hole-insertion experiments on a FANUC industrial robot are conducted to illustrate the effectiveness of the proposed approach.

Collaborative Demonstrations

Recommended Papers List

  • Beyond Shared Autonomy: Joint Perception and Action for Human-In-The-Loop Mobile Robot Navigation Systems

    Click to have a preview.

    In this study, we present a road map from shared to full autonomy for human-in-the-loop mobile robot navigation systems. We proposed a shared autonomy framework that incorporates human-robot joint perception and action to enhance the practicality and applicability of the mobile robot navigability. Accuracy of robotic sensing and precision of robotic action are employed as autonomous safety in the loop of human control. In shared autonomy, autonomous safety is incorporated into human-teleoperated robot control and their integration is adjusted through an online user-customizable arbitration function. Beyond the current state of the art in shared autonomy, social skills and social preferences in terms of human perception, as well as cognitive decision-making and action, are compiled into autonomous behaviors through learning from demonstration method. Autonomous behaviors exported from the trained neural networks are integrated with autonomous safety and then adjusted by user-desired control arbitration for robot autonomy. The transition of shared and full autonomy is easily managed by users, depending on specific applications. To validate the methodological approach, we implemented the framework on two mobile robot platforms to evaluate its feasibility, practicability, and reproducibility. Our experimental results showed that the shared autonomy framework was well applied to incorporating personal skills and social preferences in mobile robot navigation systems. To a certain extent, the framework plays the role of the road map guiding how to take advantage of human cognitive perception and decision and precision of robotic action in developing mobile robot navigation systems that can be deployed and applied to real-world applications.

  • Robot learning from demonstrations: Emulation learning in environments with moving obstacles

    Click to have a preview.

    In this paper, we present an approach to the problem of Robot Learning from Demonstration (RLfD) in a dynamic environment, i.e. an environment whose state changes throughout the course of performing a task. RLfD mostly has been successfully exploited only in non-varying environments to reduce the programming time and cost, e.g. fixed manufacturing workspaces. Non-conventional production lines necessitate Human–Robot Collaboration (HRC) implying robots and humans must work in shared workspaces. In such conditions, the robot needs to avoid colliding with the objects that are moved by humans in the workspace. Therefore, not only is the robot: (i) required to learn a task model from demonstrations; but also, (ii) must learn a control policy to avoid a stationary obstacle. Furthermore, (iii) it needs to build a control policy from demonstration to avoid moving obstacles. Here, we present an incremental approach to RLfD addressing all these three problems. We demonstrate the effectiveness of the proposed RLfD approach, by a series of pick-and-place experiments by an ABB YuMi robot. The experimental results show that a person can work in a workspace shared with a robot where the robot successfully avoids colliding with him.

Teleoperation

Recommended Papers List

  • Deep imitation learning for complex manipulation tasks from virtual reality teleoperation
    Click to have a preview.

    Imitation learning is a powerful paradigm for robot skill acquisition. However, obtaining demonstrations suitable for learning a policy that maps from raw pixels to actions can be challenging. In this paper we describe how consumer-grade Virtual Reality headsets and hand tracking hardware can be used to naturally teleoperate robots to perform complex tasks. We also describe how imitation learning can learn deep neural network policies (mapping from pixels to actions) that can acquire the demonstrated skills. Our experiments showcase the effectiveness of our approach for learning visuomotor skills.

Advantages

This feedback leverages the expertise and experience of advisors directly, obviating the need for formalized knowledge representations.
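
One simple way to exploit demonstration feedback is behavioral cloning: treat the demonstrated (state, action) pairs as a supervised dataset and fit a policy to imitate them. The sketch below uses a hypothetical linear expert and synthetic data; it is not the method of any specific paper listed here.

```python
import numpy as np

# Synthetic "expert" demonstrations: states in R^4, expert acts via a hidden
# linear controller plus a little noise (a stand-in for real demonstration data).
rng = np.random.default_rng(0)
true_K = np.array([0.5, -1.0, 0.3, 0.8])
states = rng.normal(size=(200, 4))
actions = states @ true_K + 0.01 * rng.normal(size=200)

# Behavioral cloning: supervised regression from states to demonstrated actions.
K_hat, *_ = np.linalg.lstsq(states, actions, rcond=None)

def policy(state):
    """Cloned policy: imitates the demonstrator without any reward signal."""
    return state @ K_hat

test_state = rng.normal(size=4)
print("expert action:", test_state @ true_K, "cloned action:", policy(test_state))
```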

Recommended Papers List

  • Learning Dexterous Manipulation from Exemplar Object Trajectories and Pre-Grasps

    Click to have a preview.

    Learning diverse dexterous manipulation behaviors with assorted objects remains an open grand challenge. While policy learning methods offer a powerful avenue to attack this problem, they require extensive per-task engineering and algorithmic tuning. This paper seeks to escape these constraints, by developing a Pre-Grasp informed Dexterous Manipulation (PGDM) framework that generates diverse dexterous manipulation behaviors, without any task-specific reasoning or hyper-parameter tuning. At the core of PGDM is a well known robotics construct, pre-grasps (i.e. the hand-pose preparing for object interaction). This simple primitive is enough to induce efficient exploration strategies for acquiring complex dexterous manipulation behaviors. To exhaustively verify these claims, we introduce TCDM, a benchmark of 50 diverse manipulation tasks defined over multiple objects and dexterous manipulators. Tasks for TCDM are defined automatically using exemplar object trajectories from various sources (animators, human behaviors, etc.), without any per-task engineering and/or supervision. Our experiments validate that PGDM’s exploration strategy, induced by a surprisingly simple ingredient (single pre-grasp pose), matches the performance of prior methods, which require expensive per-task feature/reward engineering, expert supervision, and hyper-parameter tuning. For animated visualizations, trained policies, and project code, please refer to: this https URL.

  • Survey of imitation learning for robotic manipulation

    Click to have a preview.

    With the development of robotics, the application of robots has gradually evolved from industrial scenes to more intelligent service scenarios. For multitasking operations of robots in complex and uncertain environments, the traditional manual coding method is not only cumbersome but also unable to adapt to sudden changes in the environment. Imitation learning that avoids learning skills from scratch by using the expert demonstration has become the most effective way for robotic manipulation. The paper is intended to provide the survey of imitation learning of robotic manipulation and explore the future research trend. The review of the art of imitation learning for robotic manipulation involves three aspects that are demonstration, representation and learning algorithms. Towards the end of the paper, we highlight areas of future research potential.

Limitations

Dependence on Experts

It may falter when confronting tasks that exceed the advisors’ realm of expertise.

Recommended Papers List

  • Imitation learning: A survey of learning methods
    Click to have a preview.

    Imitation learning techniques aim to mimic human behavior in a given task. An agent (a learning machine) is trained to perform a task from demonstrations by learning a mapping between observations and actions. The idea of teaching by imitation has been around for many years; however, the field is gaining attention recently due to advances in computing and sensing as well as rising demand for intelligent applications. The paradigm of learning by imitation is gaining popularity because it facilitates teaching complex tasks with minimal expert knowledge of the tasks. Generic imitation learning methods could potentially reduce the problem of teaching a task to that of providing demonstrations, without the need for explicit programming or designing reward functions specific to the task. Modern sensors are able to collect and transmit high volumes of data rapidly, and processors with high computational power allow fast processing that maps the sensory data to actions in a timely manner. This opens the door for many potential AI applications that require real-time perception and reaction such as humanoid robots, self-driving vehicles, human computer interaction, and computer games, to name a few. However, specialized algorithms are needed to effectively and robustly learn models as learning by imitation poses its own set of challenges. In this article, we survey imitation learning methods and present design options in different steps of the learning process. We introduce a background and motivation for the field as well as highlight challenges specific to the imitation problem. Methods for designing and evaluating imitation learning tasks are categorized and reviewed. Special attention is given to learning methods in robotics and games as these domains are the most popular in the literature and provide a wide array of problems and methodologies. We extensively discuss combining imitation learning approaches using different sources and methods, as well as incorporating other motion learning methods to enhance imitation. We also discuss the potential impact on industry, present major applications, and highlight current and future research directions.

Data Quality

It faces challenges stemming from the noise and suboptimality in real-world advisor demonstrations. Furthermore, human advisors, prone to imprecision and errors, can introduce inconsistencies.

Recommended Papers List

  • Behavioral cloning from noisy demonstrations

    Click to have a preview.

    We consider the problem of learning an optimal expert behavior policy given noisy demonstrations that contain observations from both optimal and non-optimal expert behaviors. Popular imitation learning algorithms, such as generative adversarial imitation learning, assume that (clear) demonstrations are given from optimal expert policies but not the non-optimal ones, and thus often fail to imitate the optimal expert behaviors given the noisy demonstrations. Prior works that address the problem require (1) learning policies through environment interactions in the same fashion as reinforcement learning, and (2) annotating each demonstration with confidence scores or rankings. However, such environment interactions and annotations in real-world settings take impractically long training time and a significant human effort. In this paper, we propose an imitation learning algorithm to address the problem without any environment interactions and annotations associated with the non-optimal demonstrations. The proposed algorithm learns ensemble policies with a generalized behavioral cloning (BC) objective function where we exploit another policy already learned by BC. Experimental results show that the proposed algorithm can learn behavior policies that are much closer to the optimal policies than ones learned by BC.

  • Improving Behavioural Cloning with Positive Unlabeled Learning

    Click to have a preview.

    Learning control policies offline from pre-recorded datasets is a promising avenue for solving challenging real-world problems. However, available datasets are typically of mixed quality, with a limited number of the trajectories that we would consider as positive examples; i.e., high-quality demonstrations. Therefore, we propose a novel iterative learning algorithm for identifying expert trajectories in unlabeled mixed-quality robotics datasets given a minimal set of positive examples, surpassing existing algorithms in terms of accuracy. We show that applying behavioral cloning to the resulting filtered dataset outperforms several competitive offline reinforcement learning and imitation learning baselines. We perform experiments on a range of simulated locomotion tasks and on two challenging manipulation tasks on a real robotic system; in these experiments, our method showcases state-of-the-art performance. Our website: this https URL.

Comparison

Comparison feedback provides relative evaluations among alternatives rather than absolute scores, directing AI systems toward more informed decisions.

Four frames from a single backflip. The agent is trained from human preference comparisons to perform a sequence of backflips, landing upright each time.

Deep reinforcement learning from human preferences (Christiano et al., 2017)
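
Comparison feedback is typically converted into a training signal by fitting a reward model under a Bradley-Terry-style preference model, as in Christiano et al. (2017): the probability that segment A is preferred over segment B is a logistic function of their reward difference, and the model maximizes the likelihood of the human labels. The sketch below uses a toy linear reward model and synthetic preferences; the features, learning rate, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                    # feature dimension of a trajectory segment
w_true = rng.normal(size=D)              # hidden "true" utility (only for generating labels)
w = np.zeros(D)                          # learned reward-model parameters

def reward(features, w):
    return features @ w                  # toy linear reward model

# Synthetic comparison data: pairs of segments plus a preference label.
pairs = []
for _ in range(2000):
    fa, fb = rng.normal(size=D), rng.normal(size=D)
    p_a = 1.0 / (1.0 + np.exp(reward(fb, w_true) - reward(fa, w_true)))  # Bradley-Terry
    label = float(rng.random() < p_a)    # 1 if A preferred, else 0
    pairs.append((fa, fb, label))

# Fit by stochastic gradient ascent on the log-likelihood of the observed preferences.
lr = 0.05
for epoch in range(50):
    for fa, fb, label in pairs:
        p_a = 1.0 / (1.0 + np.exp(reward(fb, w) - reward(fa, w)))
        grad = (label - p_a) * (fa - fb)   # d log-likelihood / d w for this pair
        w += lr * grad

print("cosine(w, w_true) =",
      w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true)))  # close to 1
```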

  • Preference Learning
    Click to have a preview.

    Preference learning refers to the task of learning to predict an order relation on a collection of objects (alternatives). In the training phase, preference learning algorithms have access to examples for which the sought order relation is (partially) known. Depending on the formal modeling of the preference context and the alternatives to be ordered, one can distinguish between object ranking problems and label ranking problems. Both types of problems can be approached in two fundamentally different ways, either by modeling the binary preference relation directly, or by inducing this relation indirectly via an underlying (latent) utility function.

Advantages

Complex Tasks Evaluation

The fundamental advantage of comparison feedback is its capacity to handle tasks and objectives that are difficult to evaluate precisely.

Recommended Papers List

  • Deep reinforcement learning from human preferences

    Click to have a preview.

    For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. Our approach separates learning the goal from learning the behavior to achieve it. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on about 0.1% of our agent’s interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any which have been previously learned from human feedback.

  • Label ranking by learning pairwise preferences

    Click to have a preview.

    Preference learning is an emerging topic that appears in different guises in the recent literature. This work focuses on a particular learning scenario called label ranking, where the problem is to learn a mapping from instances to rankings over a finite number of labels. Our approach for learning such a mapping, called ranking by pairwise comparison (RPC), first induces a binary preference relation from suitable training data using a natural extension of pairwise classification. A ranking is then derived from the preference relation thus obtained by means of a ranking procedure, whereby different ranking methods can be used for minimizing different loss functions. In particular, we show that a simple (weighted) voting strategy minimizes risk with respect to the well-known Spearman rank correlation. We compare RPC to existing label ranking methods, which are based on scoring individual labels instead of comparing pairs of labels. Both empirically and theoretically, it is shown that RPC is superior in terms of computational efficiency, and at least competitive in terms of accuracy.

  • Training language models to follow instructions with human feedback

    Click to have a preview.

    Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

Limitations

Data Dependency

Some complex tasks require a considerable quantity of comparison data to achieve good performance.
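
As a rough back-of-the-envelope count (not a figure from the papers below): exhaustively comparing $n$ alternatives requires on the order of $n^2$ judgments, and even efficient sorting-based elicitation needs $O(n \log n)$ consistent comparisons:

$$ \binom{n}{2} = \frac{n(n-1)}{2}: \qquad \binom{10}{2}=45, \quad \binom{100}{2}=4950, \quad \binom{1000}{2}=499{,}500. $$

Noisy human judgments typically require repeated comparisons on top of this, which is one reason preference data for complex tasks is expensive to collect.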

Recommended Papers List

  • Pairwise preference learning and ranking

    Click to have a preview.

    We consider supervised learning of a ranking function, which is a mapping from instances to total orders over a set of labels (options). The training information consists of examples with partial (and possibly inconsistent) information about their associated rankings. From these, we induce a ranking function by reducing the original problem to a number of binary classification problems, one for each pair of labels. The main objective of this work is to investigate the trade-off between the quality of the induced ranking function and the computational complexity of the algorithm, both depending on the amount of preference information given for each example. To this end, we present theoretical results on the complexity of pairwise preference learning, and experimentally investigate the predictive performance of our method for different types of preference information, such as top-ranked labels and complete rankings. The domain of this study is the prediction of a rational agent’s ranking of actions in an uncertain environment.

  • Scaling laws for reward model overoptimization

    Click to have a preview.

    In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart’s law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed “gold-standard” reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of- sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.

Discussion

Techniques based on imitation learning and reinforcement learning have recently been used to construct AI systems with significant capabilities. This naturally leads to two questions:

  • How can we define reward functions for more complex behaviors, aiming to guide the learning process of AI systems?
  • How can we express human values such that powerful AI systems align better with humans, ensuring the system’s controllability and ethicality?

Recent endeavors incorporating preference modeling into policy learning have shown progress.

We consider preference modeling and policy learning as fundamental contexts for understanding the challenges faced in alignment and potential solutions. Next, we will provide a brief overview of these specific techniques related to alignment.
