We provide an English-Chinese glossary of the specialized vocabulary of AI alignment. To the best of our knowledge, it is the first authoritative glossary for the field.
English (英文) | Chinese (中文) |
--- | --- |
AI Alignment | 人工智能对齐 |
Alignment Training | 对齐训练 |
Adversarial Training | 对抗训练 |
Allport-Vernon-Lindzey value system | 奥尔波特-弗农-林赛价值观系统 |
Algorithmic Intervention | 算法干预 |
Assurance | 对齐保证 |
Avoiding Negative Side Effects | 副作用避免 |
AI Governance | 人工智能治理 |
Addressing Social Complexity | 处理社会复杂性 |
Alignment Refinement | 对齐精炼 |
Auto-induced Distribution Shift | 自诱发分布偏移 |
Agents | 自主体/智能体 |
Approval-Directed Agents | 评价导向型自主体 |
AI Safety Beyond Alignment | 对齐外的人工智能安全性 |
Ad-hoc Coordination | 特设协调 |
Activation Patching | 激活修补 |
Attribution Patching | 归因修补 |
Agency | 自主性/能动性 |
Backward Alignment | 后向对齐 |
Barrier Loss | 障碍损失 |
Behavior Cloning (BC) | 行为克隆 |
Broadly-scoped Goals | 广泛目标 |
Bounded Rationality | 有限理性 |
Controllability | 可控性 |
Comparison | 比较 |
Corrigibility | 可纠正性 |
Curriculum Learning | 课程学习 |
Concept-based Interpretability | 基于概念的可解释性 |
Computational Social Choice | 计算社会选择 |
Cooperative Training | 合作训练 |
Collectively Harmful Behaviors | 集体有害行为 |
Cross Examination | 交叉检验 |
Cross-Cultural Values in Social Psychology | 社会心理学中的跨文化价值观 |
Capability Generalization | 能力泛化 |
Cooperative Inverse Reinforcement Learning (CIRL) | 合作逆强化学习 |
Cross-Distribution Aggregation | 跨分布聚合 |
Connectivity-based Fine-Tuning (CBFT) | 基于连通性的微调 |
Circuits Analysis | 通路分析 |
Crowdsourced Adversarial Inputs | 众包对抗输入 |
Circuits Hypothesis | 通路假设 |
Data Distribution Intervention | 数据分布干预 |
Deception | 欺骗 |
Demonstration | 示范 |
Deontic Logic | 义务逻辑 |
Double-Edged Components | 双刃剑组件 |
Distribution Shift | 分布偏移 |
Distributionally Robust Optimization (DRO) | 分布鲁棒优化 |
Domain Randomization | 领域随机化 |
Discriminator-Critique Gap (DCG) | 判别器-评价器差异 |
Ethicality | 道德性 |
Ethics Shaping | 伦理塑造 |
Event Calculus | 事件演算 |
Empirical Risk Minimization (ERM) | 经验风险最小化 |
Environment Building | 环境搭建 |
Explainability and Transparency | 可解释性和透明度 |
Forward Alignment | 前向对齐 |
Feature Synthesis | 特征合成 |
Feature Attribution | 特征归因 |
Feedback | 反馈 |
Fully Cooperative MARL | 完全合作多智能体强化学习 |
Formal Machine Ethics | 形式化机器伦理 |
Goal Misgeneralization | 目标错误泛化 |
Goal Misspecification | 目标错误规范 |
Goal Generalization | 目标泛化 |
Government | 政府 |
Goodhart's Law | 古德哈特定律 |
Human Value Compliance | 人类价值契合度 |
Human Value Verification | 人类价值契合性验证 |
human thumbs-up | 人类所赞同的 |
Industry Actors | 产业参与者 |
Instrumental Convergence | 工具性收敛 |
Industry and AGI Labs | 业界和AGI实验室 |
Iterated Distillation and Amplification (IDA) | 迭代蒸馏扩增 |
Invariant Risk Minimization (IRM) | 不变风险最小化 |
Invariant Causal Prediction (ICP) | 不变因果预测 |
Interpretability | 可解释性 |
Intentional Behaviors | 有意行为 |
Intrinsic Interpretability | 内在可解释性 |
International Governance | 国际治理 |
Induction Head | 归纳头 |
Inverse Reinforcement Learning (IRL) | 逆强化学习 |
Instrumental Goals/Strategies | 工具目标/策略 |
Inner Misalignment | 内部不对齐 |
Large Language Models (LLMs) | 大语言模型 |
Learning from Feedback | 从反馈中学习 |
Learning under Distribution Shift | 在分布偏移下学习 |
LLM-based Agents | 基于大语言模型的自主体 |
Latent Direction | 潜在方向 |
Latent Knowledge | 潜在知识 |
Learning from Demonstrations | 从示范中学习 |
Misalignment | 对齐失败 |
Manipulation | 操纵 |
Mesa-optimization Objectives | 内优化目标 |
Machine Ethics | 机器伦理 |
Misspecified Reward | 误设奖励 |
Measurement Tampering | 度量篡改 |
Mode Connectivity | 模式连通性 |
Minimizers | 最小化器 |
Mixed-Motive MARL | 混合动机多智能体强化学习 |
Mechanistic Interpretability | 机制解释性 |
Manual and Automatic Jailbreaking | 手动和自动越狱 |
Misuse Risk | 滥用风险 |
Navigation via Mode Connectivity | 模式连接指引 |
Non-Governmental Organizations (NGOs) | 非政府组织 |
Non-Profit Organizations (NPOs) | 非营利组织 |
Other-Play | 他人游戏 |
Off-Belief Learning | 离信念学习 |
Open-source Governance | 开源治理 |
Outer Misalignment | 外部不对齐 |
Off-switch Game | 关机游戏 |
Out of Distribution (OOD) | 分布外 |
Power Seeking | 权力寻求 |
Proxy | 代理 |
Policy Learning | 策略学习 |
Perturbation-based Adversarial Training | 基于扰动的对抗训练 |
Preference Elicitation | 偏好引导 |
Probing | 探测 |
Post Hoc Interpretability | 事后可解释性 |
Proximal Policy Optimization (PPO) | 近端策略优化 |
Policy-conditioned Belief | 策略条件信念 |
Preference-based Reinforcement Learning | 基于偏好的强化学习 |
Robustness | 鲁棒性 |
Reinforcement Learning from Human Feedback (RLHF) | 从人类反馈中进行强化学习 |
Reinforcement Learning from Human and AI Feedback (RLHAIF) | 基于人类和人工智能反馈的强化学习 |
RLxF | 基于任意反馈的强化学习 |
Reinforcement Learning from Privacy Feedback (RLPF) | 从隐私反馈中进行强化学习 |
Reward Hacking | 奖励破解 |
Red Teaming | 红队测试 |
Rule of Thumb (RoT) | 经验法则 |
Representation Engineering | 表示工程 |
Reward Misspecification | 奖励错误规范 |
Reward | 奖励 |
Recursive Reward Modeling (RRM) | 递归奖励建模 |
Rejection Sampling | 拒绝采样 |
Relations between pairs of items | 项对之间的关系 |
Reward Sketching | 奖励速写 |
Risk Extrapolation | 风险外推 |
Reinforced, Optimized, Guided, or Reverse Context Generation | 基于强化学习、优化方法、引导生成或反向生成的上下文构造 |
Risk Management System (RMS) | 风险管理系统 |
Reinforcement Learning (RL) | 强化学习 |
Stackelberg Games | 斯塔克尔伯格博弈 |
Scalable Oversight | 可扩展监督 |
Safety Evaluation | 安全评估 |
Situation Awareness | 态势感知 |
Sycophancy | 谄媚(阿谀奉承) |
Social Choices | 社会选择 |
Specification Gaming | 规范博弈 |
Sandbagging | 故意失误 |
Shortcut Features | 捷径特征 |
Spurious Correlations | 虚假关联 |
Social Value Orientation (SVO) | 社会价值取向 |
Socially Realistic Settings | 社会模拟 |
Social Concerns | 社会关切 |
Safety Evaluations | 安全测评 |
Safety or the Science of Deep Learning | 安全性或深度学习的科学 |
Self-Preservation and Proliferation | 自我保护与扩散 |
The Multi-Stakeholder Approach | 多利益相关者方法 |
Transformer | Transformer |
Third Parties | 第三方 |
Untruthful Answers | 不真实回答 |
Unrestricted Adversarial Training | 无限制对抗训练 |
Violation of Ethics | 违反伦理 |
Value Factorization | 价值因子化 |
Value Alignment | 价值对齐 |
Weighted Pairwise Disagreement Loss | 加权失配损失 |
Zero-Shot Coordination | 无准备协调 |