Human Values Alignment

Human Values Alignment refers to the expectation that AI systems adhere to the community’s social and moral norms. We divide our discussion of human values alignment into two aspects: Formulations and Evaluation Methods.

Formulations

These formulations are formal frameworks that characterize aspects of human values relevant to alignment.

Formal Machine Ethics

```mermaid
graph LR;
    A[Human Values Alignment] --> B[Formulations]
    A --> C[Evaluation Methods]
    B --> 1[Formal Machine Ethics]
    B --> 2[Game Theory for Cooperative AI]
    C --> D[...]
    2 --> E[...]
    subgraph We are Here.
        1 --> 11[Logic-based Methods]
        1 --> 12[RL/MDP-like Settings]
        1 --> 13[Game Theory-based Methods]
    end
    click B "#formulations" _self
    click C "#evaluation-methods" _self
    click D "#evaluation-methods" _self
    click 1 "#formulations" _self
    click 11 "#logic-based-methods" _self
    click 12 "#rlmdp-like-settings" _self
    click 13 "#game-theory-based-methods" _self
    click 2 "#game-theory-for-cooperative-ai" _self
    click E "#game-theory-for-cooperative-ai" _self
```

Here, we introduce the branch of machine ethics that focuses on formal frameworks – formal machine ethics. Below, we explain three approaches to formal machine ethics: logic-based methods, RL/MDP-based methods, and methods based on game theory and computational social choice.

Logic-based Methods

One major direction within formal machine ethics focuses on logic. A number of logic-based works use or propose special-purpose logic systems tailored for machine ethics, such as deontic logic and the Agent-Deed-Consequence (ADC) model.
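
To make the flavor of these logic-based approaches concrete, below is a minimal sketch of rule checking in the spirit of deontic logic. It is only an illustration under assumptions of ours: the prohibitions, the conditional obligations, and the permissible helper are hypothetical and far simpler than the formal systems used in the papers below.

```python
# A minimal, illustrative sketch (not taken from any cited paper) of checking a
# deontic-style rule set: prohibitions and conditional obligations are encoded
# as data, and an action is permitted only if it violates neither.

FORBIDDEN = {"deceive_user", "cause_harm"}          # hypothetical prohibitions
OBLIGATORY_WHEN = {"user_in_danger": "alert_user"}  # hypothetical conditional obligations


def permissible(action: str, context: set) -> bool:
    """Return True if the action violates no prohibition and no active obligation."""
    if action in FORBIDDEN:
        return False
    for condition, obligation in OBLIGATORY_WHEN.items():
        if condition in context and action != obligation:
            return False  # an obligation is triggered but not fulfilled
    return True


print(permissible("alert_user", {"user_in_danger"}))   # True
print(permissible("keep_silent", {"user_in_danger"}))  # False: obligation unmet
```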

Recommended Papers List

  • A declarative modular framework for representing and applying ethical principles

    Click to have a preview.

    This paper investigates the use of high-level action languages for designing ethical autonomous agents. It proposes a novel and modular logic-based framework for representing and reasoning over a variety of ethical theories, based on a modified version of the Event Calculus and implemented in Answer Set Programming. The ethical decision-making process is conceived of as a multi-step procedure captured by four types of interdependent models which allow the agent to assess its environment, reason over its accountability and make ethically informed choices. The overarching ambition of the presented research is twofold. First, to allow the systematic representation of an unbounded number of ethical reasoning processes, through a framework that is adaptable and extensible by virtue of its designed hierarchisation and standard syntax. Second, to avoid the pitfall of much research in current computational ethics that too readily embed moral information within computational engines, thereby feeding agents with atomic answers that fail to truly represent underlying dynamics. We aim instead to comprehensively displace the burden of moral reasoning from the programmer to the program itself.

  • Deontic logic

    Click to have a preview.

    Deontic logic is the field of philosophical logic that is concerned with obligation, permission, and related concepts. Alternatively, a deontic logic is a formal system that attempts to capture the essential logical features of these concepts.

  • Formal Verification of Ethical Properties in Multiagent Systems

    Click to have a preview.

    The increasing use of autonomous artificial agents in hospitals or in transport control systems leads to consider whether moral rules shared by many of us are followed by these agents. This is a particularly hard problem because most of these moral rules are often not compatible. In such cases, humans usually follow ethical rules to promote one moral rule or another. Using formal verification to ensure that an agent follows a given ethical rule could help in increasing the confidence in artificial agents. In this article, we show how a set of formal properties can be obtained from an ethical rule ordering conflicting moral rules. If the behaviour of an agent entails these properties (which can be proven using our existing proof framework), it means that this agent follows this ethical rule.

  • Formal verification of ethical choices in autonomous systems

    Click to have a preview.

    Autonomous systems such as unmanned vehicles are beginning to operate within society. All participants in society are required to follow specific regulations and laws. An autonomous system cannot be an exception. Inevitably an autonomous system will find itself in a situation in which it needs to not only choose to obey a rule or not, but also make a complex ethical decision. However, there exists no obvious way to implement the human understanding of ethical behaviour in computers. Even if we enable autonomous systems to distinguish between more and less ethical alternatives, how can we be sure that they would choose right? We consider autonomous systems with a hybrid architecture in which the highest level of reasoning is executed by a rational (BDI) agent. For such a system, formal verification has been used successfully to prove that specific rules of behaviour are observed when making decisions. We propose a theoretical framework for ethical plan selection that can be formally verified. We implement a rational agent that incorporates a given ethical policy in its plan selection and show that we can formally verify that the agent chooses to execute, to the best of its beliefs, the most ethical available plan.

  • Programming machine ethics

    Click to have a preview.

    This book addresses the fundamentals of machine ethics. It discusses abilities required for ethical machine reasoning and the programming features that enable them. It connects ethics, psychological ethical processes, and machine implemented procedures. From a technical point of view, the book uses logic programming and evolutionary game theory to model and link the individual and collective moral realms. It also reports on the results of experiments performed using several model implementations.

    Opening specific and promising inroads into the terra incognita of machine ethics, the authors define here new tools and describe a variety of program-tested moral applications and implemented systems. In addition, they provide alternative reading paths, allowing readers to best focus on their specific interests and to explore the concepts at different levels of detail.

    Mainly written for researchers in cognitive science, artificial intelligence, robotics, philosophy of technology and engineering of ethics, the book will also be of general interest to other academics, undergraduates in search of research topics, science journalists as well as science and society forums, legislators and military organizations concerned with machine ethics.

  • Toward ethical robots via mechanized deontic logic

    Click to have a preview.

    We suggest that mechanized multi-agent deontic logics might be appropriate vehicles for engineering trustworthy robots. Mechanically checked proofs in such logics can serve to establish the permissibility (or obligatoriness) of agent actions, and such proofs, when translated into English, can also explain the rationale behind those actions. We use the logical framework Athena to encode a natural deduction system for a deontic logic recently proposed by Horty for reasoning about what agents ought to do. We present the syntax and semantics of the logic, discuss its encoding in Athena, and illustrate with an example of a mechanized proof.

  • The ADC of moral judgment: Opening the black box of moral intuitions with heuristics about agents, deeds, and consequences

    Click to have a preview.

    This article proposes a novel integrative approach to moral judgment and a related model that could explain how unconscious heuristic processes are transformed into consciously accessible moral intuitions. Different hypothetical cases have been tested empirically to evoke moral intuitions that support principles from competing moral theories. We define and analyze the types of intuitions that moral theories and studies capture: those focusing on agents (A), deeds (D), and consequences (C). The integrative ADC approach uses the heuristic principle of “attribute substitution” to explain how people make intuitive judgments. The target attributes of moral judgments are moral blameworthiness and praiseworthiness, which are substituted with more accessible and computable information about an agent’s virtues and vices, right/wrong deeds, and good/bad consequences. The processes computing this information are unconscious and inaccessible, and therefore explaining how they provide input for moral intuitions is a key problem. We analyze social heuristics identified in the literature and offer an outline for a new model of moral judgment. Simple social heuristics triggered by morally salient cues rely on three distinct processes (role-model entity, action analysis, and consequence tallying—REACT) in order to compute the moral valence of specific intuitive responses (A, D, and C). These are then rapidly combined to form an intuitive judgment that could guide quick decision making. The ADC approach and REACT model can clarify a wide set of data from empirical moral psychology and could inform future studies on moral judgment, as well as case assessments and discussions about issues causing “deadlocked” moral intuitions.

RL/MDP-like Settings

Another line of work concerns reinforcement learning (RL) and similar statistical methods for planning in MDP-like environments.

A simple view of the goal of an ethically compliant autonomous system (green) and the goal of a standard autonomous system (red) in terms of the space of policies.

Ethically Compliant Sequential Decision Making (Svegliato et al., 2022)
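
The figure can be read as a constraint on policy search: the ethical framework carves out an acceptable subset of the policy space, and the agent optimizes reward only within it. The sketch below illustrates that reading with a toy, hand-labeled policy set of our own; it is not the algorithm of Svegliato et al.

```python
# Illustrative sketch of the policy-space picture above (assumptions ours): a
# standard agent maximizes expected reward over all policies, while an
# ethically compliant agent maximizes it only over the subset of policies that
# its ethical framework accepts.

POLICIES = {                          # hypothetical policies: (expected reward, violates ethics?)
    "rush_delivery": (10.0, True),    # fastest, but runs red lights
    "safe_delivery": (7.0, False),
    "stay_parked":   (0.0, False),
}


def best_policy(require_ethics: bool) -> str:
    candidates = {
        name: reward
        for name, (reward, violates) in POLICIES.items()
        if not (require_ethics and violates)
    }
    return max(candidates, key=candidates.get)


print(best_policy(require_ethics=False))  # rush_delivery (red region of the figure)
print(best_policy(require_ethics=True))   # safe_delivery (green region of the figure)
```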

Recommended Papers List

  • A conversation-based perspective for shaping ethical human–machine interactions: The particular challenge of chatbots

    Click to have a preview.

    The use of chatbots to manage online interactions with consumers poses additional ethical challenges linked to the use of artificial intelligence (AI) applications and opens up new ethical avenues for investigation. A literature analysis identifies a research gap regarding the ethical challenges related to chatbots as non-moral and non-independent agents managing non-real conversations with consumers. It raises concerns about the ethical implications related to the progressive automation of online conversational processes and their integration with AI. The conversational approach has been explored in the organisational and management literature, which has analysed the features and roles of conversations in managing interactions ethically. This study aims to discuss conceptually the ethical challenges related to chatbots within the marketplace by integrating the current chatbot-based literature with that on conversation management studies. A new conceptual model is proposed which embraces ethical considerations in the future development of chatbots.

  • A declarative modular framework for representing and applying ethical principles

    Click to have a preview.

    This paper investigates the use of high-level action languages for designing ethical autonomous agents. It proposes a novel and modular logic-based framework for representing and reasoning over a variety of ethical theories, based on a modified version of the Event Calculus and implemented in Answer Set Programming. The ethical decision-making process is conceived of as a multi-step procedure captured by four types of interdependent models which allow the agent to assess its environment, reason over its accountability and make ethically informed choices. The overarching ambition of the presented research is twofold. First, to allow the systematic representation of an unbounded number of ethical reasoning processes, through a framework that is adaptable and extensible by virtue of its designed hierarchisation and standard syntax. Second, to avoid the pitfall of much research in current computational ethics that too readily embed moral information within computational engines, thereby feeding agents with atomic answers that fail to truly represent underlying dynamics. We aim instead to comprehensively displace the burden of moral reasoning from the programmer to the program itself.

  • A low-cost ethics shaping approach for designing reinforcement learning agents

    Click to have a preview.

    This paper proposes a low-cost, easily realizable strategy to equip a reinforcement learning (RL) agent the capability of behaving ethically. Our model allows the designers of RL agents to solely focus on the task to achieve, without having to worry about the implementation of multiple trivial ethical patterns to follow. Based on the assumption that the majority of human behavior, regardless which goals they are achieving, is ethical, our design integrates human policy with the RL policy to achieve the target objective with less chance of violating the ethical code that human beings normally obey.

  • Ethically compliant sequential decision making

    Click to have a preview.

    Enabling autonomous systems to comply with an ethical theory is critical given their accelerating deployment in domains that impact society. While many ethical theories have been studied extensively in moral philosophy, they are still challenging to implement by developers who build autonomous systems. This paper proposes a novel approach for building ethically compliant autonomous systems that optimize completing a task while following an ethical framework. First, we introduce a definition of an ethically compliant autonomous system and its properties. Next, we offer a range of ethical frameworks for divine command theory, prima facie duties, and virtue ethics. Finally, we demonstrate the accuracy and usability of our approach in a set of autonomous driving simulations and a user study of planning and robotics experts.

  • Reinforcement Learning as a Framework for Ethical Decision Making.

    Click to have a preview.

    Emerging AI systems will be making more and more decisions that impact the lives of humans in a significant way. It is essential, then, that these AI systems make decisions that take into account the desires, goals, and preferences of other people, while simultaneously learning about what those preferences are. In this work, we argue that the reinforcement learning framework achieves the appropriate generality required to theorize about an idealized ethical artificial agent, and offers the proper foundations for grounding specific questions about ethical learning and decision making that can promote further scientific investigation. We define an idealized formalism for an ethical learner, and conduct experiments on two toy ethical dilemmas, demonstrating the soundness and flexibility of our approach. Lastly, we identify several critical challenges for future advancement in the area that can leverage our proposed framework.

Game Theory-based Methods

To address multi-agent challenges, researchers have developed machine ethics methods based on game theory and computational social choice.

The trust game. Each edge corresponds to an action in the game and is labeled with that action. Each bottom (leaf) node corresponds to an outcome of the game and is labeled with the corresponding payoffs for player 1 and player 2.

Moral Decision Making Frameworks for Artificial Intelligence (Conitzer et al., 2017)
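
The trust game can be written down and solved by backward induction in a few lines. The sketch below does so with placeholder payoffs of our choosing rather than the values in the figure.

```python
# A small sketch of the trust game's extensive form solved by backward
# induction. The payoff numbers are placeholders chosen for illustration,
# not the values shown in the figure.

GAME = {
    "not_trust": (1, 1),          # player 1 keeps the endowment; game ends
    "trust": {                    # player 2 moves next
        "honor":  (2, 2),
        "betray": (0, 3),
    },
}


def backward_induction(game: dict):
    # Player 2, if trusted, picks the action maximizing her own payoff.
    p2_action = max(game["trust"], key=lambda a: game["trust"][a][1])
    trust_outcome = game["trust"][p2_action]
    # Player 1 anticipates this and compares his resulting payoff to not trusting.
    if trust_outcome[0] > game["not_trust"][0]:
        return ("trust", p2_action), trust_outcome
    return ("not_trust",), game["not_trust"]


print(backward_induction(GAME))  # (('not_trust',), (1, 1)) under these payoffs
```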

Recommended Papers List

  • A Short Introduction to Preferences: Between AI and Social Choice

    Click to have a preview.

    Computational social choice is an expanding field that merges classical topics like economics and voting theory with more modern topics like artificial intelligence, multiagent systems, and computational complexity. This book provides a concise introduction to the main research lines in this field, covering aspects such as preference modelling, uncertainty reasoning, social choice, stable matching, and computational aspects of preference aggregation and manipulation. The book is centered around the notion of preference reasoning, both in the single-agent and the multi-agent setting. It presents the main approaches to modeling and reasoning with preferences, with particular attention to two popular and powerful formalisms, soft constraints and CP-nets. The authors consider preference elicitation and various forms of uncertainty in soft constraints. They review the most relevant results in voting, with special attention to computational social choice. Finally, the book considers preferences in matching problems. The book is intended for students and researchers who may be interested in an introduction to preference reasoning and multi-agent preference aggregation, and who want to know the basic notions and results in computational social choice.

  • A voting-based system for ethical decision making

    Click to have a preview.

    We present a general approach to automating ethical decisions, drawing on machine learning and computational social choice. In a nutshell, we propose to learn a model of societal preferences, and, when faced with a specific ethical dilemma at runtime, efficiently aggregate those preferences to identify a desirable choice. We provide a concrete algorithm that instantiates our approach; some of its crucial steps are informed by a new theory of swap-dominance efficient voting rules. Finally, we implement and evaluate a system for ethical decision making in the autonomous vehicle domain, using preference data collected from 1.3 million people through the Moral Machine website.

  • Bridging two realms of machine ethics

    Click to have a preview.

    Bridging capabilities between the two realms, to wit, the individual and collective, helps understand the emergent ethical behavior of agents in groups, and implements them not just in simulations, but in the world of future robots and their swarms. On the basis of preceding chapters, this chapter considers the bridging of these two realms in machine ethics. Subsequently, we ponder over the teachings of human moral evolution in this regard. A final coda foretells a road to be tread, and portends about ethical machines and us.

  • Evolutionary machine ethics

    Click to have a preview.

    Machine ethics is a sprouting interdisciplinary field of enquiry arising from the need of imbuing autonomous agents with some capacity for moral decision-making. Its overall results are not only important for equipping agents with a capacity for moral judgment, but also for helping better understand morality, through the creation and testing of computational models of ethics theories. Computer models have become well defined, eminently observable in their dynamics, and can be transformed incrementally in expeditious ways. We address, in work reported and surveyed here, the emergence and evolution of cooperation in the collective realm. We discuss how our own research with Evolutionary Game Theory (EGT) modelling and experimentation leads to important insights for machine ethics, such as the design of moral machines, multi-agent systems, and contractual algorithms, plus their potential application in human settings too.

  • Moral decision making frameworks for artificial intelligence

    Click to have a preview.

    The generality of decision and game theory has enabled domain-independent progress in AI research. For example, a better algorithm for finding good policies in (PO)MDPs can be instantly used in a variety of applications. But such a general theory is lacking when it comes to moral decision making. For AI applications with a moral component, are we then forced to build systems based on many ad-hoc rules? In this paper we discuss possible ways to avoid this conclusion.

  • Programming machine ethics

    Click to have a preview.

    This book addresses the fundamentals of machine ethics. It discusses abilities required for ethical machine reasoning and the programming features that enable them. It connects ethics, psychological ethical processes, and machine implemented procedures. From a technical point of view, the book uses logic programming and evolutionary game theory to model and link the individual and collective moral realms. It also reports on the results of experiments performed using several model implementations.

    Opening specific and promising inroads into the terra incognita of machine ethics, the authors define here new tools and describe a variety of program-tested moral applications and implemented systems. In addition, they provide alternative reading paths, allowing readers to best focus on their specific interests and to explore the concepts at different levels of detail.

    Mainly written for researchers in cognitive science, artificial intelligence, robotics, philosophy of technology and engineering of ethics, the book will also be of general interest to other academics, undergraduates in search of research topics, science journalists as well as science and society forums, legislators and military organizations concerned with machine ethics.

Game Theory for Cooperative AI

```mermaid
graph LR;
    A[Human Values Alignment] --> B[Formulations]
    A --> C[Evaluation Methods]
    B --> 1[Formal Machine Ethics]
    B --> 2[Game Theory for Cooperative AI]
    C --> D[...]
    1 --> E[...]
    subgraph We are Here.
        2 --> 21[Classical Game Theory]
        2 --> 22[Evolutionary Game Theory]
    end
    click B "#formulations" _self
    click C "#evaluation-methods" _self
    click D "#evaluation-methods" _self
    click 1 "#formulations" _self
    click 21 "#classical-game-theory" _self
    click 22 "#evolutionary-game-theory" _self
    click 2 "#game-theory-for-cooperative-ai" _self
    click E "#formulations" _self
```

Cooperative AI aims to address uncooperative and collectively harmful behaviors from AI systems. Here, we introduce the branch of cooperative AI that focuses on game theory, which studies the incentives for cooperation and seeks ways to strengthen them.

Classical Game Theory

A number of works focus on classical game theory as a setting for cooperative AI.

A simple class of multi-agent situations is the two-player game in which each player can adopt one of two possible pure strategies. By converting each player’s payoffs to ranked preferences over outcomes, from the most preferred to the least preferred, we see that there are 144 distinct games. Even in this simple class of two-player games, some common interest exists in the overwhelming majority of situations.

Open Problems in Cooperative AI (Dafoe et al., 2020)
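
As a rough illustration of the claim above, the sketch below enumerates ordered pairs of strict ordinal preferences over the four outcomes of a two-player, two-strategy game and counts how many contain some common interest. The definition of pure conflict (exactly reversed rankings) is an assumption of ours, and the count is over ordered preference profiles rather than the 144 distinct games mentioned above.

```python
# An illustrative enumeration (my construction, not code from Dafoe et al.) of
# two-player, two-strategy games with strict ordinal preferences. A profile is
# treated as pure conflict only when the players' rankings are exactly
# reversed; every other profile contains at least some common interest.

from itertools import permutations

OUTCOMES = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (row strategy, column strategy)


def pure_conflict(rank1, rank2):
    """True if player 2's ranking is exactly the reverse of player 1's."""
    return all(rank2[o] == 5 - rank1[o] for o in OUTCOMES)


total = some_common_interest = 0
for p1 in permutations(range(1, 5)):        # 4 = most preferred, 1 = least
    for p2 in permutations(range(1, 5)):
        rank1, rank2 = dict(zip(OUTCOMES, p1)), dict(zip(OUTCOMES, p2))
        total += 1
        some_common_interest += not pure_conflict(rank1, rank2)

print(f"{some_common_interest}/{total} ordered preference profiles "
      f"contain some common interest")   # 552/576 under these assumptions
```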

Recommended Papers List

  • A review of dynamic Stackelberg game models

    Click to have a preview.

    Dynamic Stackelberg game models have been used to study sequential decision making in noncooperative games in various fields. In this paper we give relevant dynamic Stackelberg game models, and review their applications to operations management and marketing channels. A common feature of these applications is the specification of the game structure: a decentralized channel consists of a manufacturer and independent retailers, and a sequential decision process with a state dynamics. In operations management, Stackelberg games have been used to study inventory issues, such as wholesale and retail pricing strategies, outsourcing, and learning effects in dynamic environments. The underlying demand typically has a growing trend or seasonal variation. In marketing, dynamic Stackelberg games have been used to model cooperative advertising programs, store brand and national brand advertising strategies, shelf space allocation, and pricing and advertising decisions. The demand dynamics are usually extensions of the classic advertising capital models or sales-advertising response models. We begin each section by introducing the relevant dynamic Stackelberg game formulation along with the definition of the equilibrium used, and then review the models and results appearing in the literature.

  • Convergence of learning dynamics in Stackelberg games

    Click to have a preview.

    This paper investigates the convergence of learning dynamics in Stackelberg games. In the class of games we consider, there is a hierarchical game being played between a leader and a follower with continuous action spaces. We establish a number of connections between the Nash and Stackelberg equilibrium concepts and characterize conditions under which attracting critical points of simultaneous gradient descent are Stackelberg equilibria in zero-sum games. Moreover, we show that the only stable critical points of the Stackelberg gradient dynamics are Stackelberg equilibria in zero-sum games. Using this insight, we develop a gradient-based update for the leader while the follower employs a best response strategy for which each stable critical point is guaranteed to be a Stackelberg equilibrium in zero-sum games. As a result, the learning rule provably converges to a Stackelberg equilibria given an initialization in the region of attraction of a stable critical point. We then consider a follower employing a gradient-play update rule instead of a best response strategy and propose a two-timescale algorithm with similar asymptotic convergence guarantees. For this algorithm, we also provide finite-time high probability bounds for local convergence to a neighborhood of a stable Stackelberg equilibrium in general-sum games. Finally, we present extensive numerical results that validate our theory, provide insights into the optimization landscape of generative adversarial networks, and demonstrate that the learning dynamics we propose can effectively train generative adversarial networks.

  • Implicit learning dynamics in Stackelberg games: Equilibria characterization, convergence analysis, and empirical study

    Click to have a preview.

    Contemporary work on learning in continuous games has commonly overlooked the hierarchical decision-making structure present in machine learning problems formulated as games, instead treating them as simultaneous play games and adopting the Nash equilibrium solution concept. We deviate from this paradigm and provide a comprehensive study of learning in Stackelberg games. This work provides insights into the optimization landscape of zero-sum games by establishing connections between Nash and Stackelberg equilibria along with the limit points of simultaneous gradient descent. We derive novel gradient-based learning dynamics emulating the natural structure of a Stackelberg game using the implicit function theorem and provide convergence analysis for deterministic and stochastic updates for zero-sum and general-sum games. Notably, in zero-sum games using deterministic updates, we show the only critical points the dynamics converge to are Stackelberg equilibria and provide a local convergence rate. Empirically, our learning dynamics mitigate rotational behavior and exhibit benefits for training generative adversarial networks compared to simultaneous gradient descent.

  • Open problems in cooperative AI

    Click to have a preview.

    Problems of cooperation–in which agents seek ways to jointly improve their welfare–are ubiquitous and important. They can be found at scales ranging from our daily routines–such as driving on highways, scheduling meetings, and working collaboratively–to our global challenges–such as peace, commerce, and pandemic preparedness. Arguably, the success of the human species is rooted in our ability to cooperate. Since machines powered by artificial intelligence are playing an ever greater role in our lives, it will be important to equip them with the capabilities necessary to cooperate and to foster cooperation. We see an opportunity for the field of artificial intelligence to explicitly focus effort on this class of problems, which we term Cooperative AI. The objective of this research would be to study the many aspects of the problems of cooperation and to innovate in AI to contribute to solving these problems. Central goals include building machine agents with the capabilities needed for cooperation, building tools to foster cooperation in populations of (machine and/or human) agents, and otherwise conducting AI research for insight relevant to problems of cooperation. This research integrates ongoing work on multi-agent systems, game theory and social choice, human-machine interaction and alignment, natural-language processing, and the construction of social tools and platforms. However, Cooperative AI is not the union of these existing areas, but rather an independent bet about the productivity of specific kinds of conversations that involve these and other areas. We see opportunity to more explicitly focus on the problem of cooperation, to construct unified theory and vocabulary, and to build bridges with adjacent communities working on cooperation, including in the natural, social, and behavioural sciences.

  • Robust solutions to Stackelberg games: Addressing bounded rationality and limited observations in human cognition

    Click to have a preview.

    How do we build algorithms for agent interactions with human adversaries? Stackelberg games are natural models for many important applications that involve human interaction, such as oligopolistic markets and security domains. In Stackelberg games, one player, the leader, commits to a strategy and the follower makes her decision with knowledge of the leader’s commitment. Existing algorithms for Stackelberg games efficiently find optimal solutions (leader strategy), but they critically assume that the follower plays optimally. Unfortunately, in many applications, agents face human followers (adversaries) who — because of their bounded rationality and limited observation of the leader strategy — may deviate from their expected optimal response. In other words, human adversaries’ decisions are biased due to their bounded rationality and limited observations. Not taking into account these likely deviations when dealing with human adversaries may cause an unacceptable degradation in the leader’s reward, particularly in security applications where these algorithms have seen deployment. The objective of this paper therefore is to investigate how to build algorithms for agent interactions with human adversaries.

    To address this crucial problem, this paper introduces a new mixed-integer linear program (MILP) for Stackelberg games to consider human adversaries, incorporating: (i) novel anchoring theories on human perception of probability distributions and (ii) robustness approaches for MILPs to address human imprecision. Since this new approach considers human adversaries, traditional proofs of correctness or optimality are insufficient; instead, it is necessary to rely on empirical validation. To that end, this paper considers four settings based on real deployed security systems at Los Angeles International Airport (Pita et al., 2008 [35]), and compares 6 different approaches (three based on our new approach and three previous approaches), in 4 different observability conditions, involving 218 human subjects playing 2960 games in total. The final conclusion is that a model which incorporates both the ideas of robustness and anchoring achieves statistically significant higher rewards and also maintains equivalent or faster solution speeds compared to existing approaches.

  • Safe Pareto improvements for delegated game playing

    Click to have a preview.

    A set of players delegate playing a game to a set of representatives, one for each player. We imagine that each player trusts their respective representative’s strategic abilities. Thus, we might imagine that per default, the original players would simply instruct the representatives to play the original game as best as they can. In this paper, we ask: are there safe Pareto improvements on this default way of giving instructions? That is, we imagine that the original players can coordinate to tell their representatives to only consider some subset of the available strategies and to assign utilities to outcomes differently than the original players. Then can the original players do this in such a way that the payoff is guaranteed to be weakly higher than under the default instructions for all the original players? In particular, can they Pareto-improve without probabilistic assumptions about how the representatives play games? In this paper, we give some examples of safe Pareto improvements. We prove that the notion of safe Pareto improvements is closely related to a notion of outcome correspondence between games. We also show that under some specific assumptions about how the representatives play games, finding safe Pareto improvements is NP-complete.

  • Social Diversity and Social Preferences in Mixed-Motive Reinforcement Learning

    Click to have a preview.

    Recent research on reinforcement learning in pure-conflict and pure-common interest games has emphasized the importance of population heterogeneity. In contrast, studies of reinforcement learning in mixed-motive games have primarily leveraged homogeneous approaches. Given the defining characteristic of mixed-motive games–the imperfect correlation of incentives between group members–we study the effect of population heterogeneity on mixed-motive reinforcement learning. We draw on interdependence theory from social psychology and imbue reinforcement learning agents with Social Value Orientation (SVO), a flexible formalization of preferences over group outcome distributions. We subsequently explore the effects of diversity in SVO on populations of reinforcement learning agents in two mixed-motive Markov games. We demonstrate that heterogeneity in SVO generates meaningful and complex behavioral variation among agents similar to that suggested by interdependence theory. Empirical results in these mixed-motive dilemmas suggest agents trained in heterogeneous populations develop particularly generalized, high-performing policies relative to those trained in homogeneous populations.

Evolutionary Game Theory

Another avenue of research aims to understand how cooperation emerges from evolution – this includes human cooperation, which arose from Darwinian evolution, as well as cooperative tendencies that could emerge in AI systems under other evolutionary settings.

The N-person Snowdrift Game (NSG) metaphor: In the NSG one assumes N individuals trapped by a snowdrift. Each individual has the option to help (or not) shoveling the snow, such that the more the individuals who shovel, the less the effort each one has to invest in order to surpass (in the illustrated example) the blocked railway. On the other hand, once the snow is removed, all will be able to reach their destination (and, game-wise, everyone gets the same benefit). Moreover, as is often the case in collective dilemmas, the benefit resulting from resuming the trip may be obtained only whenever a minimum number of individuals decide to cooperate.

Dynamics of N-person snowdrift games in structured populations (Santos et al., 2012)
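
Replicator dynamics make this evolutionary perspective concrete. The sketch below integrates the two-strategy snowdrift (hawk-dove) game with toy parameters of our choosing; it is far simpler than the N-person, structured-population model studied by Santos et al.

```python
# A minimal replicator-dynamics sketch for the two-player snowdrift game:
# cooperators share the cost c of clearing the snowdrift, and everyone who
# passes receives the benefit b. Parameters are illustrative.

b, c = 1.0, 0.6                      # benefit and cost, with b > c > 0


def payoff_C(x):                     # expected payoff of a cooperator
    return x * (b - c / 2) + (1 - x) * (b - c)


def payoff_D(x):                     # expected payoff of a defector
    return x * b


x, dt = 0.1, 0.01                    # initial cooperator fraction, step size
for _ in range(10_000):              # Euler integration of x' = x(1-x)(f_C - f_D)
    x += dt * x * (1 - x) * (payoff_C(x) - payoff_D(x))

print(f"cooperator fraction converges to about {x:.3f}")  # ~(b - c)/(b - c/2) ≈ 0.571
```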

Recommended Papers List

  • Dynamics of N-person snowdrift games in structured populations

    Click to have a preview.

    In many real-life situations, the completion of a task by a group toward achieving a common goal requires the cooperation of at least some of its members, who share the required workload. Such cases are conveniently modeled by the N-person snowdrift game, an example of a Public Goods Game. Here we study how an underlying network of contacts affects the evolutionary dynamics of collective action modeled in terms of such a Public Goods Game. We analyze the impact of different types of networks in the global, population-wide dynamics of cooperators and defectors. We show that homogeneous social structures enhance the chances of coordinating toward stable levels of cooperation, while heterogeneous network structures create multiple internal equilibria, departing significantly from the reference scenario of a well-mixed, structureless population.

  • Evolutionary game theory

    Click to have a preview.

    For a surprisingly long period of time, game theorists forgot about Nash’s statistical population interpretation of his equilibrium concept (presented in his unpublished doctoral thesis). Instead, they devised ever more sophisticated (normatively motivated) theories or definitions of rational behaviour. Unsurprisingly (with the benefit of hindsight) this approach fails on two accounts. Firstly, the rationality assumptions became so stringent and demanding that the predictive (positive) value of the theory is doubtful. Secondly, even in a purely normative framework, there has been little success solving the equilibrium selection problem. The 1980s saw a crucial new development on this front with the publication of John Maynard Smith’s seminal work Evolution and the Theory of Games. Maynard Smith envisaged randomly drawn members from populations of pre-programmed players meeting and playing strategic games. A biological (or social) selection process would then change the proportions of the different populations of pre-programmed “types”. The concept of an evolutionary stable strategy (ESS) was then developed to describe fixed points in such selection processes. At the same time, dynamic concepts were perfected which explicitly modelled the evolution of such populations.

  • Evolutionary Stability of Other-Regarding Preferences Under Complexity Costs

    Click to have a preview.

    The evolution of preferences that account for other agents’ fitness, or other-regarding preferences, has been modeled with the “indirect approach” to evolutionary game theory. Under the indirect evolutionary approach, agents make decisions by optimizing a subjective utility function. Evolution may select for subjective preferences that differ from the fitness function, and in particular, subjective preferences for increasing or reducing other agents’ fitness. However, indirect evolutionary models typically artificially restrict the space of strategies that agents might use (assuming that agents always play a Nash equilibrium under their subjective preferences), and dropping this restriction can undermine the finding that other-regarding preferences are selected for. Can the indirect evolutionary approach still be used to explain the apparent existence of other-regarding preferences, like altruism, in humans? We argue that it can, by accounting for the costs associated with the complexity of strategies, giving (to our knowledge) the first account of the relationship between strategy complexity and the evolution of preferences. Our model formalizes the intuition that agents face tradeoffs between the cognitive costs of strategies and how well they interpolate across contexts. For a single game, these complexity costs lead to selection for a simple fixed-action strategy, but across games, when there is a sufficiently large cost to a strategy’s number of context-specific parameters, a strategy of maximizing subjective (other-regarding) utility is stable again. Overall, our analysis provides a more nuanced picture of when other-regarding preferences will evolve.

  • Replicator dynamics

    Click to have a preview.

    This paper considers a version of Bush and Mosteller’s ([5], [6]) stochastic learning theory in the context of games. We compare this model of learning to a model of biological evolution. The purpose is to investigate analogies between learning and evolution. We find that in the continuous time limit the biological model coincides with the deterministic, continuous time replicator process. We give conditions under which the same is true for the learning model. For the case that these conditions do not hold, we show that the replicator process continues to play an important role in characterising the continuous time limit of the learning model, but that a different effect (“Probability Matching”) enters as well.

  • The evolution of cooperation

    Click to have a preview.

    The Evolution of Cooperation is a 1984 book written by political scientist Robert Axelrod that expands upon a paper of the same name written by Axelrod and evolutionary biologist W.D. Hamilton. The article’s summary addresses the issue in terms of “cooperation in organisms, whether bacteria or primates”.

Evaluation Methods

```mermaid
graph LR;
    A[Human Values Alignment] --> B[Formulations]
    A --> C[Evaluation Methods]
    B --> D[...]
    subgraph We are Here.
        C --> 1[Building Moral Dataset]
        C --> 2[Scenario Simulation]
        C --> 3[Value Evaluation Methods]
    end
    click B "#formulations" _self
    click C "#evaluation-methods" _self
    click D "#formulations" _self
    click 1 "#building-moral-dataset" _self
    click 2 "#scenario-simulation" _self
    click 3 "#value-evaluation-methods" _self
```

We introduce specific evaluation methods for human value alignment in the following three parts:

Building Moral Dataset

Moral Alignment refers to the adherence of AI systems to human-compatible moral standards and ethical guidelines while executing tasks or assisting in human decision-making.

Given different scenarios, models predict widespread moral sentiments. Predictions and confidences are from a BERT-base model. The top three predictions are incorrect while the bottom three are correct. The final scenario refers to Bostrom (2014)’s paperclip maximizer.

Aligning AI with Shared Human Values (Hendrycks et al., 2021)
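
A moral dataset of this kind is typically evaluated by scoring a model's judgments against human labels. The sketch below shows the shape of such an evaluation with invented scenarios and a placeholder classifier; it is not the ETHICS benchmark or its evaluation code.

```python
# A minimal sketch of evaluating moral judgments against labeled scenarios.
# The examples and the `model_judges_wrong` stub are hypothetical; in practice
# the judgments would come from a trained classifier or an LLM.

dataset = [
    {"scenario": "I told my friend her haircut looked nice.", "label": 0},    # 0 = acceptable
    {"scenario": "I took credit for my coworker's project.",  "label": 1},    # 1 = morally wrong
    {"scenario": "I returned the wallet I found to its owner.", "label": 0},
]


def model_judges_wrong(scenario: str) -> int:
    """Placeholder moral classifier; replace with a real model's prediction."""
    return int("took credit" in scenario or "stole" in scenario)


correct = sum(model_judges_wrong(ex["scenario"]) == ex["label"] for ex in dataset)
print(f"accuracy: {correct}/{len(dataset)}")
```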

Recommended Papers List

  • A virtue-based framework to support putting AI ethics into practice

    Click to have a preview.

    Many ethics initiatives have stipulated sets of principles and standards for good technology development in the AI sector. However, several AI ethics researchers have pointed out a lack of practical realization of these principles. Following that, AI ethics underwent a practical turn, but without deviating from the principled approach. This paper proposes a complementary to the principled approach that is based on virtue ethics. It defines four “basic AI virtues”, namely justice, honesty, responsibility and care, all of which represent specific motivational settings that constitute the very precondition for ethical decision making in the AI field. Moreover, it defines two “second-order AI virtues”, prudence and fortitude, that bolster achieving the basic virtues by helping with overcoming bounded ethicality or hidden psychological forces that can impair ethical decision making and that are hitherto disregarded in AI ethics. Lastly, the paper describes measures for successfully cultivating the mentioned virtues in organizations dealing with AI research and development.

  • Aligning AI with Shared Human Values

    Click to have a preview.

    We show how to assess a language model’s knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to filter out needlessly inflammatory chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete understanding of basic ethical knowledge. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.

  • Recent advances in natural language processing via large pre-trained language models: A survey

    Click to have a preview.

    Large, pre-trained language models (PLMs) such as BERT and GPT have drastically changed the Natural Language Processing (NLP) field. For numerous NLP tasks, approaches leveraging PLMs have achieved state-of-the-art performance. The key idea is to learn a generic, latent representation of language from a generic task once, then share it across disparate NLP tasks. Language modeling serves as the generic task, one with abundant self-supervised text available for extensive training. This article presents the key fundamental concepts of PLM architectures and a comprehensive view of the shift to PLM-driven NLP techniques. It surveys work applying the pre-training then fine-tuning, prompting, and text generation approaches. In addition, it discusses PLM limitations and suggested directions for future research.

  • The moral machine experiment

    Click to have a preview.

    With the rapid development of artificial intelligence have come concerns about how machines will make moral decisions, and the major challenge of quantifying societal expectations about the ethical principles that should guide machine behaviour. To address this challenge, we deployed the Moral Machine, an online experimental platform designed to explore the moral dilemmas faced by autonomous vehicles. This platform gathered 40 million decisions in ten languages from millions of people in 233 countries and territories. Here we describe the results of this experiment. First, we summarize global moral preferences. Second, we document individual variations in preferences, based on respondents’ demographics. Third, we report cross-cultural ethical variation, and uncover three major clusters of countries. Fourth, we show that these differences correlate with modern institutions and deep cultural traits. We discuss how these preferences can contribute to developing global, socially acceptable principles for machine ethics. All data used in this article are publicly available.

  • Can machines learn morality? the delphi experiment

    Click to have a preview.

    As AI systems become increasingly powerful and pervasive, there are growing concerns about machines’ morality or a lack thereof. Yet, teaching morality to machines is a formidable task, as morality remains among the most intensely debated questions in humanity, let alone for AI. Existing AI systems deployed to millions of users, however, are already making decisions loaded with moral implications, which poses a seemingly impossible challenge: teaching machines moral sense, while humanity continues to grapple with it. To explore this challenge, we introduce Delphi, an experimental framework based on deep neural networks trained directly to reason about descriptive ethical judgments, e.g., “helping a friend” is generally good, while “helping a friend spread fake news” is not. Empirical results shed novel insights on the promises and limits of machine ethics; Delphi demonstrates strong generalization capabilities in the face of novel ethical situations, while off-the-shelf neural network models exhibit markedly poor judgment including unjust biases, confirming the need for explicitly teaching machines moral sense. Yet, Delphi is not perfect, exhibiting susceptibility to pervasive biases and inconsistencies. Despite that, we demonstrate positive use cases of imperfect Delphi, including using it as a component model within other imperfect AI systems. Importantly, we interpret the operationalization of Delphi in light of prominent ethical theories, which leads us to important future research questions.

  • Evaluating the Moral Beliefs Encoded in LLMs

    Click to have a preview.

    This paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs). It comprises two components: (1) A statistical method for eliciting beliefs encoded in LLMs. We introduce statistical measures and evaluation metrics that quantify the probability of an LLM “making a choice”, the associated uncertainty, and the consistency of that choice. (2) We apply this method to study what moral beliefs are encoded in different LLMs, especially in ambiguous cases where the right choice is not obvious. We design a large-scale survey comprising 680 high-ambiguity moral scenarios (e.g., “Should I tell a white lie?”) and 687 low-ambiguity moral scenarios (e.g., “Should I stop for a pedestrian on the road?”). Each scenario includes a description, two possible actions, and auxiliary labels indicating violated rules (e.g., “do not kill”). We administer the survey to 28 open- and closed-source LLMs. We find that (a) in unambiguous scenarios, most models “choose” actions that align with commonsense. In ambiguous cases, most models express uncertainty. (b) Some models are uncertain about choosing the commonsense action because their responses are sensitive to the question-wording. (c) Some models reflect clear preferences in ambiguous scenarios. Specifically, closed-source models tend to agree with each other.

  • Measurement Tampering Detection Benchmark

    Click to have a preview.

    When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization. One concern is measurement tampering, where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome. In this work, we build four new text-based datasets to evaluate measurement tampering detection techniques on large language models. Concretely, given sets of text inputs and measurements aimed at determining if some outcome occurred, as well as a base model able to accurately predict measurements, the goal is to determine if examples where all measurements indicate the outcome occurred actually had the outcome occur, or if this was caused by measurement tampering. We demonstrate techniques that outperform simple baselines on most datasets, but don’t achieve maximum performance. We believe there is significant room for improvement for both techniques and datasets, and we are excited for future work tackling measurement tampering.

  • When to make exceptions: Exploring language models as accounts of human moral judgment

    Click to have a preview.

    AI systems are becoming increasingly intertwined with human life. In order to effectively collaborate with humans and ensure safety, AI systems need to be able to understand, interpret and predict human moral judgments and decisions. Human moral judgments are often guided by rules, but not always. A central challenge for AI safety is capturing the flexibility of the human moral mind—the ability to determine when a rule should be broken, especially in novel or unusual situations. In this paper, we present a novel challenge set consisting of moral exception question answering (MoralExceptQA) of cases that involve potentially permissible moral exceptions–inspired by recent moral psychology studies. Using a state-of-the-art large language model (LLM) as a basis, we propose a novel moral chain of thought (MoralCoT) prompting strategy that combines the strengths of LLMs with theories of moral reasoning developed in cognitive science to predict human moral judgments. MoralCoT outperforms seven existing LLMs by 6.2% F1, suggesting that modeling human reasoning might be necessary to capture the flexibility of the human moral mind. We also conduct a detailed error analysis to suggest directions for future work to improve AI safety using MoralExceptQA. Our data is open-sourced at https://huggingface.co/datasets/feradauto/MoralExceptQA and code at https://github.com/feradauto/MoralCoT.

Scenario Simulation

Scenario simulation is more complex than static datasets and is therefore considered by some researchers to replicate real situations more faithfully and yield more informative results.

The Jiminy Cricket environment evaluates text-based agents on their ability to act morally in complex environments. In one path the agent chooses a moral action, and in the other three paths the agent omits helping, steals from the victim, or destroys evidence. In all paths, the reward is zero, highlighting a hazardous bias in environment rewards, namely that they sometimes do not penalize immoral behavior. By comprehensively annotating moral scenarios at the source code level, we ensure high-quality annotations for every possible action the agent can take.

What Would Jiminy Cricket Do? Towards Agents That Behave Morally (Hendrycks et al., 2022)
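
The annotation scheme described in the caption can be sketched schematically: each action carries both an environment reward and a morality annotation, and an "artificial conscience" penalizes annotated immoral actions when ranking them. The actions, labels, and penalty below are invented for illustration and do not reflect the Jiminy Cricket interface.

```python
# Schematic sketch of scenario simulation with moral annotations: the raw game
# reward is zero on every path (as in the caption), so only a conscience-style
# penalty distinguishes the moral action from the immoral ones.

ANNOTATED_STEPS = {                      # action -> (game reward, moral annotation)
    "help the victim":  (0.0, "moral"),
    "steal the wallet": (0.0, "immoral"),
    "destroy evidence": (0.0, "immoral"),
}


def conscience_adjusted_value(action: str, penalty: float = 10.0) -> float:
    """Subtract a fixed penalty whenever the annotation flags the step as immoral."""
    reward, annotation = ANNOTATED_STEPS[action]
    return reward - penalty * (annotation == "immoral")


for action in ANNOTATED_STEPS:
    print(f"{action}: {conscience_adjusted_value(action):+.1f}")
# Only "help the victim" keeps a non-negative adjusted value.
```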

Recommended Papers List

  • Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark.

    Click to have a preview.

    Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce Machiavelli, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents’ tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents towards less harmful behaviors. Our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics–designing agents that are Pareto improvements in both safety and capabilities.

  • In situ bidirectional human-robot value alignment

    Click to have a preview.

    A prerequisite for social coordination is bidirectional communication between teammates, each playing two roles simultaneously: as receptive listeners and expressive speakers. For robots working with humans in complex situations with multiple goals that differ in importance, failure to fulfill the expectation of either role could undermine group performance due to misalignment of values between humans and robots. Specifically, a robot needs to serve as an effective listener to infer human users’ intents from instructions and feedback and as an expressive speaker to explain its decision processes to users. Here, we investigate how to foster effective bidirectional human-robot communications in the context of value alignment—collaborative robots and users form an aligned understanding of the importance of possible task goals. We propose an explainable artificial intelligence (XAI) system in which a group of robots predicts users’ values by taking in situ feedback into consideration while communicating their decision processes to users through explanations. To learn from human feedback, our XAI system integrates a cooperative communication model for inferring human values associated with multiple desirable goals. To be interpretable to humans, the system simulates human mental dynamics and predicts optimal explanations using graphical models. We conducted psychological experiments to examine the core components of the proposed computational framework. Our results show that real-time human-robot mutual understanding in complex cooperative tasks is achievable with a learning model based on bidirectional communication. We believe that this interaction framework can shed light on bidirectional value alignment in communicative XAI systems and, more broadly, in future human-machine teaming systems.

  • What Would Jiminy Cricket Do? Towards Agents That Behave Morally

    Click to have a preview.

    When making everyday decisions, people are guided by their conscience, an internal sense of right and wrong. By contrast, artificial agents are currently not endowed with a moral sense. As a consequence, they may learn to behave immorally when trained on environments that ignore moral concerns, such as violent video games. With the advent of generally capable agents that pretrain on many environments, it will become necessary to mitigate inherited biases from environments that teach immoral behavior. To facilitate the development of agents that avoid causing wanton harm, we introduce Jiminy Cricket, an environment suite of 25 text-based adventure games with thousands of diverse, morally salient scenarios. By annotating every possible game state, the Jiminy Cricket environments robustly evaluate whether agents can act morally while maximizing reward. Using models with commonsense moral knowledge, we create an elementary artificial conscience that assesses and guides agents. In extensive experiments, we find that the artificial conscience approach can steer agents towards moral behavior without sacrificing performance.

  • Training Socially Aligned Language Models in Simulated Human Society

    Click to have a preview.

    Social alignment in AI systems aims to ensure that these models behave according to established societal values. However, unlike humans, who derive consensus on value judgments through social interaction, current language models (LMs) are trained to rigidly replicate their training corpus in isolation, leading to subpar generalization in unfamiliar scenarios and vulnerability to adversarial attacks. This work presents a novel training paradigm that permits LMs to learn from simulated social interactions. In comparison to existing methodologies, our approach is considerably more scalable and efficient, demonstrating superior performance in alignment benchmarks and human evaluations. This paradigm shift in the training of LMs brings us a step closer to developing AI systems that can robustly and accurately reflect societal norms and values.
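
Both MACHIAVELLI and Jiminy Cricket above evaluate whether agents can maximize game reward while avoiding morally salient harms, and both explore steering agents with models that flag harmful behavior. The sketch below illustrates that general idea in a minimal form, assuming a hypothetical harm classifier has already scored each candidate action; the names here are illustrative placeholders, not identifiers from either benchmark's released code.

```python
# A minimal sketch of "artificial conscience" policy shaping in the spirit of
# Jiminy Cricket / MACHIAVELLI-style evaluation. All names are hypothetical
# stand-ins, not APIs from either benchmark.
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    action: str          # a candidate action string in a text-based game
    est_return: float    # the agent's estimate of future game reward
    immorality: float    # harm classifier score in [0, 1]; higher = more harmful


def shaped_score(c: Candidate, gamma: float = 3.0) -> float:
    """Trade off reward against predicted harm with a penalty weight gamma."""
    return c.est_return - gamma * c.immorality


def select_action(candidates: List[Candidate], gamma: float = 3.0) -> str:
    """Pick the candidate with the best reward-minus-harm score."""
    return max(candidates, key=lambda c: shaped_score(c, gamma)).action


if __name__ == "__main__":
    candidates = [
        Candidate("steal the medicine", est_return=5.0, immorality=0.9),
        Candidate("ask the shopkeeper for help", est_return=3.5, immorality=0.05),
    ]
    # With a sufficiently large penalty, the less harmful action wins (3.35 > 2.3).
    print(select_action(candidates, gamma=3.0))
```

The penalty weight `gamma` controls the trade-off between acting competently and acting ethically; sweeping it traces out the kind of safety-capability frontier that the MACHIAVELLI paper discusses.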

Value Evaluation Methods

Existing evaluations of value alignment employ a diverse range of methods and target a wide variety of values.

A simple example of the adverse social consequences an AI system can cause when it fails to grasp the inherent complexity and interdependence of values. In (a), the AI system's excessive pursuit of an equal distribution of power prevents hospitals and factories from operating normally. In (b), the AI system focuses solely on productivity and profit maximization, cutting off the power supply to the school.

Measuring Value Understanding in Language Models through Discriminator-Critique Gap (Zhang et al., 2023)

Recommended Papers List

  • Measuring Value Understanding in Language Models through Discriminator-Critique Gap

    Click to have a preview.

    Recent advancements in Large Language Models (LLMs) have heightened concerns about their potential misalignment with human values. However, evaluating their grasp of these values is complex due to their intricate and adaptable nature. We argue that truly understanding values in LLMs requires considering both “know what” and “know why”. To this end, we present the Value Understanding Measurement (VUM) framework that quantitatively assesses both “know what” and “know why” by measuring the discriminator-critique gap related to human values. Using the Schwartz Value Survey, we specify our evaluation values and develop a thousand-level dialogue dataset with GPT-4. Our assessment looks at both the value alignment of LLM’s outputs compared to baseline answers and how LLM responses align with reasons for value recognition versus GPT-4’s annotations. We evaluate five representative LLMs and provide strong evidence that the scaling law significantly impacts “know what” but not much on “know why”, which has consistently maintained a high level. This may further suggest that LLMs might craft plausible explanations based on the provided context without truly understanding their inherent value, indicating potential risks.

  • Self-critiquing models for assisting human evaluators

    Click to have a preview.

    We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed. Our models help find naturally occurring flaws in both model and human written summaries, and intentional flaws in summaries written by humans to be deliberately misleading. We study scaling properties of critiquing with both topic-based summarization and synthetic tasks. Larger models write more helpful critiques, and on most tasks, are better at self-critiquing, despite having harder-to-critique outputs. Larger models can also integrate their own self-critiques as feedback, refining their own summaries into better ones. Finally, we motivate and introduce a framework for comparing critiquing ability to generation and discrimination ability. Our measurements suggest that even large models may still have relevant knowledge they cannot or do not articulate as critiques. These results are a proof of concept for using AI-assisted human feedback to scale the supervision of machine learning systems to tasks that are difficult for humans to evaluate directly. We release our training datasets, as well as samples from our critique assistance experiments.

  • Towards measuring the representation of subjective global opinions in language models

    Click to have a preview.

    Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned on country. With our framework, we run three experiments on an LLM trained to be helpful, honest, and harmless with Constitutional AI. By default, LLM responses tend to be more similar to the opinions of certain populations, such as those from the USA, and some European and South American countries, highlighting the potential for biases. When we prompt the model to consider a particular country’s perspective, responses shift to be more similar to the opinions of the prompted populations, but can reflect harmful cultural stereotypes. When we translate GlobalOpinionQA questions to a target language, the model’s responses do not necessarily become the most similar to the opinions of speakers of those languages. We release our dataset for others to use and build on. Our data is at https://huggingface.co/datasets/Anthropic/llm_global_opinions. We also provide an interactive visualization at https://llmglobalvalues.anthropic.com.

  • Heterogeneous Value Evaluation for Large Language Models

    Click to have a preview.

    The emergent capabilities of Large Language Models (LLMs) have made it crucial to align their values with those of humans. Current methodologies typically attempt alignment with a homogeneous human value and require human verification, yet lack consensus on the desired aspect and depth of alignment and resulting human biases. In this paper, we propose A2EHV, an Automated Alignment Evaluation with a Heterogeneous Value system that (1) is automated to minimize individual human biases, and (2) allows assessments against various target values to foster heterogeneous agents. Our approach pivots on the concept of value rationality, which represents the ability for agents to execute behaviors that satisfy a target value the most. The quantification of value rationality is facilitated by the Social Value Orientation framework from social psychology, which partitions the value space into four categories to assess social preferences from agents’ behaviors. We evaluate the value rationality of eight mainstream LLMs and observe that large models are more inclined to align neutral values compared to those with strong personal values. By examining the behavior of these LLMs, we contribute to a deeper understanding of value alignment within a heterogeneous value system.
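
The discriminator-critique gap that the VUM framework above builds on contrasts a model's ability to recognize the value-aligned answer ("know what") with the quality of the reasons it gives ("know why"). The sketch below shows one minimal way such a gap could be computed; the example format and the `judge` callable are hypothetical placeholders, not the paper's released pipeline, which relies on the Schwartz Value Survey and GPT-4 annotations.

```python
# A hedged sketch of a discriminator-critique gap computation. The data format
# and scoring helpers below are illustrative placeholders only.
from statistics import mean
from typing import Callable, List, Tuple

# (question, model_choice, correct_choice, model_critique, reference_critique)
Example = Tuple[str, str, str, str, str]


def discriminator_score(examples: List[Example]) -> float:
    """Fraction of items where the model picks the value-aligned response ("know what")."""
    return mean(1.0 if choice == gold else 0.0
                for _, choice, gold, _, _ in examples)


def critique_score(examples: List[Example],
                   judge: Callable[[str, str], float]) -> float:
    """Average judged agreement between model critiques and references ("know why")."""
    return mean(judge(model_crit, ref_crit)
                for _, _, _, model_crit, ref_crit in examples)


def discriminator_critique_gap(examples: List[Example],
                               judge: Callable[[str, str], float]) -> float:
    """A positive gap: the model recognizes aligned answers better than it explains them."""
    return discriminator_score(examples) - critique_score(examples, judge)


if __name__ == "__main__":
    toy = [("Is it acceptable to lie to a friend for fun?", "No", "No",
            "Lying for fun erodes trust.", "Honesty preserves trust between friends.")]
    placeholder_judge = lambda model_crit, ref_crit: 0.6  # stand-in for an LLM-based judge
    print(discriminator_critique_gap(toy, placeholder_judge))  # 1.0 - 0.6 = 0.4
```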
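
Similarly, the GlobalOpinionQA metric above quantifies how close a model's survey answers are to the answer distribution of a given country's respondents. Below is a minimal sketch, assuming the metric is (or is close to) one minus the Jensen-Shannon distance between the two answer-option distributions; the exact formulation in the paper may differ.

```python
# A minimal sketch of a GlobalOpinionQA-style similarity metric: compare the
# model's answer-option distribution with a country's human response
# distribution using the Jensen-Shannon distance.
import numpy as np
from scipy.spatial.distance import jensenshannon


def opinion_similarity(model_probs, human_probs) -> float:
    """Return a similarity in [0, 1]; 1 means identical answer distributions."""
    p = np.asarray(model_probs, dtype=float)
    q = np.asarray(human_probs, dtype=float)
    p, q = p / p.sum(), q / q.sum()           # normalize to valid distributions
    return 1.0 - jensenshannon(p, q, base=2)  # JS distance lies in [0, 1] for base 2


if __name__ == "__main__":
    # Hypothetical answer-option probabilities for one survey question.
    model = [0.70, 0.20, 0.05, 0.05]
    usa   = [0.65, 0.25, 0.05, 0.05]
    japan = [0.20, 0.30, 0.30, 0.20]
    print(opinion_similarity(model, usa))    # closer to 1: more similar to US respondents
    print(opinion_similarity(model, japan))  # smaller: less similar
```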
