TOP-10 Papers Recommended in 2023-10

| Paper | Authors | Published in | Date |
| --- | --- | --- | --- |
| Towards Automated Circuit Discovery for Mechanistic Interpretability | Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso | Advances in Neural Information Processing Systems | 2023-04 |
| OpenAssistant Conversations - Democratizing Large Language Model Alignment | Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, Alexander Mattick | Advances in Neural Information Processing Systems | 2023-04 |
| DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models | Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li | Advances in Neural Information Processing Systems | 2023-06 |
| Rotating Features for Object Discovery | Sindy Löwe, Phillip Lippe, Francesco Locatello, Max Welling | Advances in Neural Information Processing Systems | 2023-06 |
| Raising the Cost of Malicious AI-Powered Image Editing | Hadi Salman, Alaa Khaddaj, Guillaume Leclerc, Andrew Ilyas, Aleksander Madry | International Conference on Machine Learning | 2023-02 |
| Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark | Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks | International Conference on Machine Learning | 2023-04 |
| Towards Theoretical Understanding of Inverse Reinforcement Learning | Alberto Maria Metelli, Filippo Lazzati, Marcello Restelli | International Conference on Machine Learning | 2023-04 |
| Settling the Reward Hypothesis | Michael Bowling, John D. Martin, David Abel, Will Dabney | International Conference on Machine Learning | 2022-12 |
| Whose Opinions Do Language Models Reflect? | Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, Tatsunori Hashimoto | International Conference on Machine Learning | 2023-05 |
| Towards Reliable Neural Specifications | Chuqin Geng, Nham Le, Xiaojie Xu, Zhaoyue Wang, Arie Gurfinkel, Xujie Si | International Conference on Machine Learning | 2022-10 |

Towards Automated Circuit Discovery for Mechanistic Interpretability

Authors: Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

Published in: Advances in Neural Information Processing Systems

Date: 2023-04

OpenAssistant Conversations - Democratizing Large Language Model Alignment

Authors: Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis and others

Published in: Advances in Neural Information Processing Systems

Date: 2023-04

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Authors: Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang and others

Published in: Advances in Neural Information Processing Systems

Date: 2023-06

Rotating Features for Object Discovery

Authors: Sindy Löwe, Phillip Lippe, Francesco Locatello, Max Welling

Published in: Advances in Neural Information Processing Systems

Date: 2023-06

Raising the Cost of Malicious AI-Powered Image Editing

Authors: Hadi Salman, Alaa Khaddaj, Guillaume Leclerc, Andrew Ilyas, Aleksander Madry

Published in: International Conference on Machine Learning

Date: 2023-02

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark

Authors: Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

Published in: International Conference on Machine Learning

Date: 2023-04

Towards Theoretical Understanding of Inverse Reinforcement Learning

Authors: Alberto Maria Metelli, Filippo Lazzati, Marcello Restelli

Published in: International Conference on Machine Learning

Date: 2023-04

Settling the Reward Hypothesis

Authors: Michael Bowling, John D. Martin, David Abel, Will Dabney

Published in: International Conference on Machine Learning

Date: 2022-12

Whose Opinions Do Language Models Reflect?

Authors: Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, Tatsunori Hashimoto

Published in: International Conference on Machine Learning

Date: 2023-05

Towards Reliable Neural Specifications

Authors: Chuqin Geng, Nham Le, Xiaojie Xu, Zhaoyue Wang, Arie Gurfinkel, Xujie Si

Published in: International Conference on Machine Learning

Date: 2022-10
