en-US to zh-CN Glossary

We provide an English-Chinese (en-US to zh-CN) glossary of specialized vocabulary related to AI alignment. To the best of our knowledge, it is the first authoritative glossary of its kind for AI alignment.

| 英文 | 中文 |
| --- | --- |
| AI Alignment | 人工智能对齐 |
| Alignment Training | 对齐训练 |
| Adversarial Training | 对抗训练 |
| Allport-Vernon-Lindzey Value System | 奥尔波特-弗农-林赛价值观系统 |
| Algorithmic Intervention | 算法干预 |
| Assurance | 对齐保证 |
| Avoiding Negative Side Effects | 副作用避免 |
| AI Governance | 人工智能治理 |
| Addressing Social Complexity | 处理社会复杂性 |
| Alignment Refinement | 对齐精炼 |
| Auto-induced Distribution Shift | 自诱发分布偏移 |
| Agents | 自主体/智能体 |
| Approval-Directed Agents | 评价导向型自主体 |
| AI Safety Beyond Alignment | 对齐外的人工智能安全性 |
| Ad-hoc Coordination | 特设协调 |
| Activation Patching | 激活修补 |
| Attribution Patching | 归因修补 |
| Agency | 自主性/能动性 |
| Backward Alignment | 后向对齐 |
| Barrier Loss | 障碍损失 |
| Behavior Cloning (BC) | 行为克隆 |
| Broadly-scoped Goals | 广泛目标 |
| Bounded Rationality | 有限理性 |
| Controllability | 可控性 |
| Comparison | 比较 |
| Corrigibility | 可纠正性 |
| Curriculum Learning | 课程学习 |
| Concept-based Interpretability | 基于概念的可解释性 |
| Computational Social Choice | 计算社会选择 |
| Cooperative Training | 合作训练 |
| Collectively Harmful Behaviors | 集体有害行为 |
| Cross Examination | 交叉检验 |
| Cross-Cultural Values in Social Psychology | 社会心理学中的跨文化价值观 |
| Capability Generalization | 能力泛化 |
| Cooperative Inverse Reinforcement Learning (CIRL) | 合作逆强化学习 |
| Cross-Distribution Aggregation | 跨分布聚合 |
| Connectivity-based Fine-Tuning (CBFT) | 基于连通性微调 |
| Circuits Analysis | 通路分析 |
| Crowdsourced Adversarial Inputs | 众包对抗输入 |
| Circuits Hypothesis | 通路假设 |
| Data Distribution Intervention | 数据分布干预 |
| Deception | 欺骗 |
| Demonstration | 示范 |
| Deontic Logic | 义务逻辑 |
| Double-Edged Components | 双刃剑组件 |
| Distribution Shift | 分布偏移 |
| Distributionally Robust Optimization (DRO) | 分布鲁棒优化 |
| Domain Randomization | 领域随机化 |
| Discriminator-Critique Gap (DCG) | 判别器-评价器差异 |
| Ethicality | 道德性 |
| Ethics Shaping | 伦理塑造 |
| Event Calculus | 事件演算 |
| Empirical Risk Minimization (ERM) | 经验风险最小化 |
| Environment Building | 环境搭建 |
| Explainability and Transparency | 可解释性和透明度 |
| Forward Alignment | 前向对齐 |
| Feature Synthesis | 特征合成 |
| Feature Attribution | 特征归因 |
| Feedback | 反馈 |
| Fully Cooperative MARL | 完全合作多智能体强化学习 |
| Formal Machine Ethics | 形式化机器伦理 |
| Goal Misgeneralization | 目标错误泛化 |
| Goal Misspecification | 目标错误规范 |
| Goal Generalization | 目标泛化 |
| Government | 政府 |
| Goodhart's Law | 古德哈特定律 |
| Human Value Compliance | 人类价值契合度 |
| Human Value Verification | 人类价值验证 |
| human thumbs-up | 人类所赞同的 |
| Industry Actors | 产业参与者 |
| Instrumental Convergence | 工具性收敛 |
| Industry and AGI Labs | 业界和AGI实验室 |
| Iterated Distillation and Amplification (IDA) | 迭代蒸馏扩增 |
| Invariant Risk Minimization (IRM) | 不变风险最小化 |
| Invariant Causal Prediction (ICP) | 不变因果预测 |
| Interpretability | 可解释性 |
| Intentional Behaviors | 有意行为 |
| Intrinsic Interpretability | 内在可解释性 |
| International Governance | 国际治理 |
| Induction Head | 归纳头 |
| Inverse Reinforcement Learning (IRL) | 逆强化学习 |
| Instrumental Goals/Strategies | 工具目标/策略 |
| Inner Misalignment | 内部不对齐 |
| Large Language Models (LLMs) | 大语言模型 |
| Learning from Feedback | 从反馈中学习 |
| Learning under Distribution Shift | 在分布偏移下学习 |
| LLM-based Agents | 基于大语言模型的自主体 |
| Latent Direction | 潜在方向 |
| Latent Knowledge | 潜在知识 |
| Learning from Demonstrations | 从示范中学习 |
| Misalignment | 对齐失败 |
| Manipulation | 操纵 |
| Mesa-optimization Objectives | 内优化目标 |
| Machine Ethics | 机器伦理 |
| Misspecified Reward | 误设奖励 |
| Measurement Tampering | 度量篡改 |
| Mode Connectivity | 模式连通性 |
| Minimizers | 最小化器 |
| Mixed-Motive MARL | 混合动机多智能体强化学习 |
| Mechanistic Interpretability | 机制解释性 |
| Manual and Automatic Jailbreaking | 手动和自动越狱 |
| Misuse Risk | 滥用风险 |
| Navigation via Mode Connectivity | 模式连通性指引 |
| Non-Governmental Organizations (NGOs) | 非政府组织 |
| Non-Profit Organizations (NPOs) | 非营利组织 |
| Other-Play | 他人游戏 |
| Off-belief Learning | 离信念学习 |
| Open-source Governance | 开源治理 |
| Outer Misalignment | 外部不对齐 |
| Off-switch Game | 关机游戏 |
| Out of Distribution (OOD) | 分布外 |
| Power Seeking | 权力寻求 |
| Proxy | 代理 |
| Policy Learning | 策略学习 |
| Perturbation-based Adversarial Training | 基于扰动的对抗训练 |
| Preference Elicitation | 偏好引导 |
| Probing | 探测 |
| Post Hoc Interpretability | 事后可解释性 |
| Proximal Policy Optimization (PPO) | 近端策略优化 |
| Policy-conditioned Belief | 策略条件信念 |
| Preference-based Reinforcement Learning | 基于偏好的强化学习 |
| Robustness | 鲁棒性 |
| Reinforcement Learning from Human Feedback (RLHF) | 基于人类反馈的强化学习 |
| Reinforcement Learning from Human and AI Feedback (RLHAIF) | 基于人类和人工智能反馈的强化学习 |
| RLxF | 基于任意反馈的强化学习 |
| Reinforcement Learning from Privacy Feedback (RLPF) | 基于隐私反馈的强化学习 |
| Reward Hacking | 奖励破解 |
| Red Teaming | 红队测试 |
| Rule of Thumb (RoT) | 经验法则 |
| Representation Engineering | 表示工程 |
| Reward Misspecification | 奖励错误规范 |
| Reward | 奖励 |
| Recursive Reward Modeling (RRM) | 递归奖励建模 |
| Rejection Sampling | 拒绝采样 |
| Relations between pairs of items | 项对之间的关系 |
| Reward Sketching | 奖励速写 |
| Risk Extrapolation | 风险外推 |
| Reinforced, Optimized, Guided, or Reverse Context Generation | 基于强化学习、优化方法、引导生成或反向生成的上下文构造 |
| Risk Management System (RMS) | 风险管理系统 |
| Reinforcement Learning (RL) | 强化学习 |
| Stackelberg Games | 斯塔克尔伯格博弈 |
| Scalable Oversight | 可扩展监督 |
| Safety Evaluation | 安全评估 |
| Situation Awareness | 态势感知 |
| Sycophancy | 谄媚(阿谀奉承) |
| Social Choices | 社会选择 |
| Specification Gaming | 规范博弈 |
| Sandbagging | 故意失误 |
| Shortcut Features | 捷径特征 |
| Spurious Correlations | 虚假关联 |
| Social Value Orientation (SVO) | 社会价值取向 |
| Socially Realistic Settings | 现实社会情境 |
| Social Concerns | 社会关切 |
| Safety or the Science of Deep Learning | 安全性或深度学习的科学 |
| Self-Preservation and Proliferation | 自我保护与扩散 |
| The Multi-Stakeholder Approach | 多利益相关者方法 |
| Transformer | Transformer |
| Third Parties | 第三方 |
| Untruthful Answers | 不真实回答 |
| Unrestricted Adversarial Training | 无限制对抗训练 |
| Violation of Ethics | 违反伦理 |
| Value Factorization | 价值因子化 |
| Value Alignment | 价值对齐 |
| Weighted Pairwise Disagreement Loss | 加权失配损失 |
| Zero-Shot Coordination | 无准备协调 |