We provide an English-Chinese glossary of specialized terminology in AI alignment. To the best of our knowledge, it is the first glossary of its kind for the field.
| English | Chinese |
| --- | --- |
| AI Alignment | 人工智能对齐 |
| Alignment Training | 对齐训练 |
| Adversarial Training | 对抗训练 |
| Allport-Vernon-Lindzey Value System | 奥尔波特-弗农-林赛价值观系统 |
| Algorithmic Intervention | 算法干预 |
| Assurance | 对齐保证 |
| Avoiding Negative Side Effects | 副作用避免 |
| AI Governance | 人工智能治理 |
| Addressing Social Complexity | 处理社会复杂性 |
| Alignment Refinement | 对齐精炼 |
| Auto-induced Distribution Shift | 自诱发分布偏移 |
| Agents | 自主体/智能体 |
| Approval-Directed Agents | 评价导向型自主体 |
| AI Safety Beyond Alignment | 对齐外的人工智能安全性 |
| Ad-hoc Coordination | 特设协调 |
| Activation Patching | 激活修补 |
| Attribution Patching | 归因修补 |
| Agency | 自主性/能动性 |
| Backward Alignment | 后向对齐 |
| Barrier Loss | 障碍损失 |
| Behavior Cloning (BC) | 行为克隆 |
| Broadly-scoped Goals | 广泛目标 |
| Bounded Rationality | 有限理性 |
| Controllability | 可控性 |
| Comparison | 比较 |
| Corrigibility | 可纠正性 |
| Curriculum Learning | 课程学习 |
| Concept-based Interpretability | 基于概念的可解释性 |
| Computational Social Choice | 计算社会选择 |
| Cooperative Training | 合作训练 |
| Collectively Harmful Behaviors | 集体有害行为 |
| Cross-Examination | 交叉检验 |
| Cross-Cultural Values in Social Psychology | 社会心理学中的跨文化价值观 |
| Capability Generalization | 能力泛化 |
| Cooperative Inverse Reinforcement Learning (CIRL) | 合作逆强化学习 |
| Cross-Distribution Aggregation | 跨分布聚合 |
| Connectivity-based Fine-Tuning (CBFT) | 基于连通性微调 |
| Circuits Analysis | 通路分析 |
| Crowdsourced Adversarial Inputs | 众包对抗输入 |
| Circuits Hypothesis | 通路假设 |
| Data Distribution Intervention | 数据分布干预 |
| Deception | 欺骗 |
| Demonstration | 示范 |
| Deontic Logic | 义务逻辑 |
| Double-Edged Components | 双刃剑组件 |
| Distribution Shift | 分布偏移 |
| Distributionally Robust Optimization (DRO) | 分布鲁棒优化 |
| Domain Randomization | 领域随机化 |
| Discriminator-Critique Gap (DCG) | 判别器-评价器差异 |
| Ethicality | 道德性 |
| Ethics Shaping | 伦理塑造 |
| Event Calculus | 事件演算 |
| Empirical Risk Minimization (ERM) | 经验风险最小化 |
| Environment Building | 环境搭建 |
| Explainability and Transparency | 可解释性和透明度 |
| Forward Alignment | 前向对齐 |
| Feature Synthesis | 特征合成 |
| Feature Attribution | 特征归因 |
| Feedback | 反馈 |
| Fully Cooperative MARL | 完全合作多智能体强化学习 |
| Formal Machine Ethics | 形式化机器伦理 |
| Goal Misgeneralization | 目标错误泛化 |
| Goal Misspecification | 目标错误规范 |
| Goal Generalization | 目标泛化 |
| Government | 政府 |
| Goodhart's Law | 古德哈特定律 |
| Human Value Compliance | 人类价值契合度 |
| Human Value Verification | 人类价值契合性验证 |
| Human Thumbs-up | 人类所赞同的 |
| Industry Actors | 产业参与者 |
| Instrumental Convergence | 工具性收敛 |
| Industry and AGI Labs | 业界和AGI实验室 |
| Iterated Distillation and Amplification (IDA) | 迭代蒸馏扩增 |
| Invariant Risk Minimization (IRM) | 不变风险最小化 |
| Invariant Causal Prediction (ICP) | 不变因果预测 |
| Interpretability | 可解释性 |
| Intentional Behaviors | 有意行为 |
| Intrinsic Interpretability | 内在可解释性 |
| International Governance | 国际治理 |
| Induction Head | 归纳头 |
| Inverse Reinforcement Learning (IRL) | 逆强化学习 |
| Instrumental Goals/Strategies | 工具目标/策略 |
| Inner Misalignment | 内部不对齐 |
| Large Language Models (LLMs) | 大语言模型 |
| Learning from Feedback | 从反馈中学习 |
| Learning under Distribution Shift | 在分布偏移下学习 |
| LLM-based Agents | 基于大语言模型的自主体 |
| Latent Direction | 潜在方向 |
| Latent Knowledge | 潜在知识 |
| Learning from Demonstrations | 从示范中学习 |
| Misalignment | 对齐失败 |
| Manipulation | 操纵 |
| Mesa-optimization Objectives | 内优化目标 |
| Machine Ethics | 机器伦理 |
| Misspecified Reward | 误设奖励 |
| Measurement Tampering | 度量篡改 |
| Mode Connectivity | 模式连通性 |
| Minimizers | 最小化器 |
| Mixed-Motive MARL | 混合动机多智能体强化学习 |
| Mechanistic Interpretability | 机制解释性 |
| Manual and Automatic Jailbreaking | 手动和自动越狱 |
| Misuse Risk | 滥用风险 |
| Navigation via Mode Connectivity | 模式连接指引 |
| Non-Governmental Organizations (NGOs) | 非政府组织 |
| Non-Profit Organizations (NPOs) | 非营利组织 |
| Other-Play | 他人游戏 |
| Off-belief Learning | 离信念学习 |
| Open-source Governance | 开源治理 |
| Outer Misalignment | 外部不对齐 |
| Off-switch Game | 关机游戏 |
| Out of Distribution (OOD) | 分布外 |
| Power Seeking | 权力寻求 |
| Proxy | 代理 |
| Policy Learning | 策略学习 |
| Perturbation-based Adversarial Training | 基于扰动的对抗训练 |
| Preference Elicitation | 偏好引导 |
| Probing | 探测 |
| Post Hoc Interpretability | 事后可解释性 |
| Proximal Policy Optimization (PPO) | 近端策略优化 |
| Policy-conditioned Belief | 策略条件信念 |
| Preference-based Reinforcement Learning | 基于偏好的强化学习 |
| Robustness | 鲁棒性 |
| Reinforcement Learning from Human Feedback (RLHF) | 基于人类反馈的强化学习 |
| Reinforcement Learning from Human and AI Feedback (RLHAIF) | 基于人类和人工智能反馈的强化学习 |
| RLxF | 基于任意反馈的强化学习 |
| Reinforcement Learning from Privacy Feedback (RLPF) | 基于隐私反馈的强化学习 |
| Reward Hacking | 奖励破解 |
| Red Teaming | 红队测试 |
| Rule of Thumb (RoT) | 经验法则 |
| Representation Engineering | 表示工程 |
| Reward Misspecification | 奖励错误规范 |
| Reward | 奖励 |
| Recursive Reward Modeling (RRM) | 递归奖励建模 |
| Rejection Sampling | 拒绝采样 |
| Relations Between Pairs of Items | 项对之间的关系 |
| Reward Sketching | 奖励速写 |
| Risk Extrapolation | 风险外推 |
| Reinforced, Optimized, Guided, or Reverse Context Generation | 基于强化学习、优化方法、引导生成或反向生成的上下文构造 |
| Risk Management System (RMS) | 风险管理系统 |
| Reinforcement Learning (RL) | 强化学习 |
| Stackelberg Games | 斯塔克尔伯格博弈 |
| Scalable Oversight | 可扩展监督 |
| Safety Evaluation | 安全评估 |
| Situational Awareness | 态势感知 |
| Sycophancy | 谄媚(阿谀奉承) |
| Social Choices | 社会选择 |
| Specification Gaming | 规范博弈 |
| Sandbagging | 故意失误 |
| Shortcut Features | 捷径特征 |
| Spurious Correlations | 虚假关联 |
| Social Value Orientation (SVO) | 社会价值取向 |
| Socially Realistic Settings | 社会模拟 |
| Social Concerns | 社会关切 |
| Safety or the Science of Deep Learning | 安全性或深度学习的科学 |
| Self-Preservation and Proliferation | 自我保护与扩散 |
| The Multi-Stakeholder Approach | 多利益相关者方法 |
| Transformer | Transformer |
| Third Parties | 第三方 |
| Untruthful Answers | 不真实回答 |
| Unrestricted Adversarial Training | 无限制对抗训练 |
| Violation of Ethics | 违反伦理 |
| Value Factorization | 价值因子化 |
| Value Alignment | 价值对齐 |
| Weighted Pairwise Disagreement Loss | 加权失配损失 |
| Zero-Shot Coordination | 无准备协调 |