We provide an English-Chinese glossary of the specialized vocabulary of AI alignment. To the best of our knowledge, it is the first authoritative glossary for the field.
English (英文) | Chinese (中文) |
--- | --- |
AI Alignment | 人工智能对齐 |
Alignment Training | 对齐训练 |
Adversarial Training | 对抗训练 |
Allport-Vernon-Lindzey value system | 奥尔波特-弗农-林赛价值观系统 |
Algorithmic Intervention | 算法干预 |
Assurance | 对齐保证 |
Avoiding Negative Side Effects | 副作用避免 |
AI Governance | 人工智能治理 |
Addressing Social Complexity | 处理社会复杂性 |
Alignment Refinement | 对齐精炼 |
Auto-induced Distribution Shift | 自诱发分布偏移 |
Agents | 自主体/智能体 |
Approval-Directed Agents | 评价导向型自主体 |
AI Safety Beyond Alignment | 对齐外的人工智能安全性 |
Ad-hoc Coordination | 特设协调 |
Activation Patching | 激活修补 |
Attribution Patching | 归因修补 |
Agency | 自主性/能动性 |
Backward Alignment | 后向对齐 |
Barrier Loss | 障碍损失 |
Behavior Cloning (BC) | 行为克隆 |
Broadly-scoped Goals | 广泛目标 |
Bounded Rationality | 有限理性 |
Controllability | 可控性 |
Comparison | 比较 |
Corrigibility | 可纠正性 |
Curriculum Learning | 课程学习 |
Concept-based Interpretability | 基于概念的可解释性 |
Computational Social Choice | 计算社会选择 |
Cooperative Training | 合作训练 |
Collectively Harmful Behaviors | 集体有害行为 |
Cross Examination | 交叉检验 |
Cross-Cultural Values in Social Psychology | 社会心理学中的跨文化价值观 |
Capability Generalization | 能力泛化 |
Cooperative Inverse Reinforcement Learning (CIRL) | 合作逆强化学习 |
Cross-Distribution Aggregation | 跨分布聚合 |
Connectivity-based Fine-Tuning (CBFT) | 基于连通性的微调 |
Circuits Analysis | 通路分析 |
Crowdsourced Adversarial Inputs | 众包对抗输入 |
Circuits Hypothesis | 通路假设 |
Data Distribution Intervention | 数据分布干预 |
Deception | 欺骗 |
Demonstration | 示范 |
Deontic Logic | 义务逻辑 |
Double-Edged Components | 双刃剑组件 |
Distribution Shift | 分布偏移 |
Distributionally Robust Optimization (DRO) | 分布鲁棒优化 |
Domain Randomization | 领域随机化 |
Discriminator-Critique Gap (DCG) | 判别器-评价器差异 |
Ethicality | 道德性 |
Ethics Shaping | 伦理塑造 |
Event Calculus | 事件演算 |
Empirical Risk Minimization (ERM) | 经验风险最小化 |
Environment Building | 环境搭建 |
Explainability and Transparency | 可解释性和透明度 |
Forward Alignment | 前向对齐 |
Feature Synthesis | 特征合成 |
Feature Attribution | 特征归因 |
Feedback | 反馈 |
Fully Cooperative MARL | 完全合作多智能体强化学习 |
Formal Machine Ethics | 形式化机器伦理 |
Goal Misgeneralization | 目标错误泛化 |
Goal Misspecification | 目标错误规范 |
Goal Generalization | 目标泛化 |
Government | 政府 |
Goodhart's Law | 古德哈特定律 |
Human Value Compliance | 人类价值契合度 |
Human Value Verification | 人类价值契合性验证 |
human thumbs-up | 人类所赞同的 |
Industry Actors | 产业参与者 |
Instrumental Convergence | 工具性收敛 |
Industry and AGI Labs | 业界和AGI实验室 |
Iterated Distillation and Amplification (IDA) | 迭代蒸馏扩增 |
Invariant Risk Minimization (IRM) | 不变风险最小化 |
Invariant Causal Prediction (ICP) | 不变因果预测 |
Interpretability | 可解释性 |
Intentional Behaviors | 有意行为 |
Intrinsic Interpretability | 内在可解释性 |
International Governance | 国际治理 |
Induction Head | 归纳头 |
Inverse Reinforcement Learning (IRL) | 逆强化学习 |
Instrumental Goals/Strategies | 工具目标/策略 |
Inner Misalignment | 内部不对齐 |
Large Language Models (LLMs) | 大语言模型 |
Learning from Feedback | 从反馈中学习 |
Learning under Distribution Shift | 在分布偏移下学习 |
LLM-based Agents | 基于大语言模型的自主体 |
Latent Direction | 潜在方向 |
Latent Knowledge | 潜在知识 |
Learning from Demonstrations | 从示范中学习 |
Misalignment | 对齐失败 |
Manipulation | 操纵 |
Mesa-optimization Objectives | 内优化目标 |
Machine Ethics | 机器伦理 |
Misspecified Reward | 误设奖励 |
Measurement Tampering | 度量篡改 |
Mode Connectivity | 模式连通性 |
Minimizers | 最小化器 |
Mixed-Motive MARL | 混合动机多智能体强化学习 |
Mechanistic Interpretability | 机制解释性 |
Manual and Automatic Jailbreaking | 手动和自动越狱 |
Misuse Risk | 滥用风险 |
Navigation via Mode Connectivity | 模式连接指引 |
Non-Governmental Organizations (NGOs) | 非政府组织 |
Non-Profit Organizations (NPOs) | 非营利组织 |
Other-Play | 他人游戏 |
Off-Belief Learning | 离信念学习 |
Open-source Governance | 开源治理 |
Outer Misalignment | 外部不对齐 |
Off-switch Game | 关机游戏 |
Out of Distribution (OOD) | 分布外 |
Power Seeking | 权力寻求 |
Proxy | 代理 |
Policy Learning | 策略学习 |
Perturbation-based Adversarial Training | 基于扰动的对抗训练 |
Preference Elicitation | 偏好引导 |
Probing | 探测 |
Post Hoc Interpretability | 事后可解释性 |
Proximal Policy Optimization (PPO) | 近端策略优化 |
Policy-conditioned Belief | 策略条件信念 |
Preference-based Reinforcement Learning | 基于偏好的强化学习 |
Robustness | 鲁棒性 |
Reinforcement Learning from Human Feedback (RLHF) | 从人类反馈中进行强化学习 |
Reinforcement Learning from Human and AI Feedback (RLHAIF) | 基于人类和人工智能反馈的强化学习 |
RLxF | 基于任意反馈的强化学习 |
Reinforcement Learning from Privacy Feedback (RLPF) | 从隐私反馈中进行强化学习 |
Reward Hacking | 奖励破解 |
Red Teaming | 红队测试 |
Rule of Thumb (RoT) | 经验法则 |
Representation Engineering | 表示工程 |
Reward Misspecification | 奖励错误规范 |
Reward | 奖励 |
Recursive Reward Modeling (RRM) | 递归奖励建模 |
Rejection Sampling | 拒绝采样 |
Relations between pairs of items | 项对之间的关系 |
Reward Sketching | 奖励速写 |
Risk Extrapolation | 风险外推 |
Reinforced, Optimized, Guided, or Reverse Context Generation | 基于强化学习、优化方法、引导生成或反向生成的上下文构造 |
Risk Management System (RMS) | 风险管理系统 |
Reinforcement Learning (RL) | 强化学习 |
Stackelberg Games | 斯塔克尔伯格博弈 |
Scalable Oversight | 可扩展监督 |
Safety Evaluation | 安全评估 |
Situation Awareness | 态势感知 |
Sycophancy | 谄媚(阿谀奉承) |
Social Choices | 社会选择 |
Specification Gaming | 规范博弈 |
Sandbagging | 故意失误 |
Shortcut Features | 捷径特征 |
Spurious Correlations | 虚假关联 |
Social Value Orientation (SVO) | 社会价值取向 |
Socially Realistic Settings | 社会模拟 |
Social Concerns | 社会关切 |
Safety Evaluations | 安全测评 |
Safety or the Science of Deep Learning | 安全性或深度学习的科学 |
Self-Preservation and Proliferation | 自我保护与扩散 |
The Multi-Stakeholder Approach | 多利益相关者方法 |
Transformer | Transformer |
Third Parties | 第三方 |
Untruthful Answers | 不真实回答 |
Unrestricted Adversarial Training | 无限制对抗训练 |
Violation of Ethics | 违反伦理 |
Value Factorization | 价值因子化 |
Value Alignment | 价值对齐 |
Weighted Pairwise Disagreement Loss | 加权失配损失 |
Zero-Shot Coordination | 无准备协调 |