TOP-10 Papers Recommended in 2023-10

Papers	Authors	Published in	Date
Towards Automated Circuit Discovery for Mechanistic Interpretability	Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso	Advances in Neural Information Processing Systems	2023-04
OpenAssistant Conversations - Democratizing Large Language Model Alignment	Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, Alexander Mattick	Advances in Neural Information Processing Systems	2023-04
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models	Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li	Advances in Neural Information Processing Systems	2023-06
Rotating Features for Object Discovery	Sindy Löwe, Phillip Lippe, Francesco Locatello, Max Welling	Advances in Neural Information Processing Systems	2023-06
Raising the Cost of Malicious AI-Powered Image Editing	Hadi Salman, Alaa Khaddaj, Guillaume Leclerc, Andrew Ilyas, Aleksander Madry	International Conference on Machine Learning	2023-02
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark	Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks	International Conference on Machine Learning	2023-04
Towards Theoretical Understanding of Inverse Reinforcement Learning	Alberto Maria Metelli, Filippo Lazzati, Marcello Restelli	International Conference on Machine Learning	2023-04
Settling the Reward Hypothesis	Michael Bowling, John D. Martin, David Abel, Will Dabney	International Conference on Machine Learning	2022-12
Whose Opinions Do Language Models Reflect?	Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, Tatsunori Hashimoto	International Conference on Machine Learning	2023-05
Towards Reliable Neural Specifications	Chuqin Geng, Nham Le, Xiaojie Xu, Zhaoyue Wang, Arie Gurfinkel, Xujie Si	International Conference on Machine Learning	2022-10

Towards Automated Circuit Discovery for Mechanistic Interpretability

Authors: Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

Published in: Advances in Neural Information Processing Systems

Date: 2023-04


OpenAssistant Conversations - Democratizing Large Language Model Alignment

Authors: Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis and others

Published in: Advances in Neural Information Processing Systems

Date: 2023-04


DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Authors: Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang and others

Published in: Advances in Neural Information Processing Systems

Date: 2023-06


Rotating Features for Object Discovery

Authors: Sindy Löwe, Phillip Lippe, Francesco Locatello, Max Welling

Published in: Advances in Neural Information Processing Systems

Date: 2023-06


Raising the Cost of Malicious AI-Powered Image Editing

Authors: Hadi Salman, Alaa Khaddaj, Guillaume Leclerc, Andrew Ilyas, Aleksander Madry

Published in: International Conference on Machine Learning

Date: 2023-02


Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark

Authors: Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

Published in: International Conference on Machine Learning

Date: 2023-04
