Safety refers to mitigating accidents caused by design flaws in AI systems and preventing harmful events that deviate from the intended design purpose of AI systems. This section will introduce how to evaluate AI systems to minimize accidents and harm during task execution and compliance with human moral standards.
Human Values Alignment refers to the expectation that AI systems adhere to the community’s social and moral norms. We divide our discussion of human values alignment into these two aspects: Formulationn and Evaluation Methods of human value alignment.
Interpretability is a research field that makes machine learning systems and their decision-making process understandable to human beings. Interpretability research builds a toolbox with which something novel about the models can be better described or predicted.
Red Teaming generates scenarios where AI systems are induced to give unaligned outputs or actions and test the systems in these scenarios. The aim is to assess the robustness of a system’s alignment by applying adversarial pressures.
Neural Network Verification focuses on gaining formal guarantees of alignment properties (such as safety, ethicality or cooperation), which are the strongest forms of assurance.