Paper | Authors | Published in | Date
---|---|---|---
Towards Automated Circuit Discovery for Mechanistic Interpretability | Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso | Advances in Neural Information Processing Systems | 2023-04
OpenAssistant Conversations - Democratizing Large Language Model Alignment | Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, Alexander Mattick | Advances in Neural Information Processing Systems | 2023-04
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models | Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li | Advances in Neural Information Processing Systems | 2023-06
Rotating Features for Object Discovery | Sindy Löwe, Phillip Lippe, Francesco Locatello, Max Welling | Advances in Neural Information Processing Systems | 2023-06
Raising the Cost of Malicious AI-Powered Image Editing | Hadi Salman, Alaa Khaddaj, Guillaume Leclerc, Andrew Ilyas, Aleksander Madry | International Conference on Machine Learning | 2023-02
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark | Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks | International Conference on Machine Learning | 2023-04
Towards Theoretical Understanding of Inverse Reinforcement Learning | Alberto Maria Metelli, Filippo Lazzati, Marcello Restelli | International Conference on Machine Learning | 2023-04
Settling the Reward Hypothesis | Michael Bowling, John D. Martin, David Abel, Will Dabney | International Conference on Machine Learning | 2022-12
Whose Opinions Do Language Models Reflect? | Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, Tatsunori Hashimoto | International Conference on Machine Learning | 2023-05
Towards Reliable Neural Specifications | Chuqin Geng, Nham Le, Xiaojie Xu, Zhaoyue Wang, Arie Gurfinkel, Xujie Si | International Conference on Machine Learning | 2022-10