Training language models to follow instructions with human feedback L Ouyang, J Wu, X Jiang, D Almeida, C Wainwright, P Mishkin, C Zhang, ... Advances in neural information processing systems 35, 27730-27744, 2022 | 9181 | 2022 |
Concrete problems in AI safety D Amodei, C Olah, J Steinhardt, P Christiano, J Schulman, D Mané arXiv preprint arXiv:1606.06565, 2016 | 2886 | 2016 |
Deep reinforcement learning from human preferences PF Christiano, J Leike, T Brown, M Martic, S Legg, D Amodei Advances in neural information processing systems 30, 2017 | 2771 | 2017 |
Learning to summarize with human feedback N Stiennon, L Ouyang, J Wu, D Ziegler, R Lowe, C Voss, A Radford, ... Advances in Neural Information Processing Systems 33, 3008-3021, 2020 | 1468 | 2020 |
Fine-tuning language models from human preferences DM Ziegler, N Stiennon, J Wu, TB Brown, A Radford, D Amodei, ... arXiv preprint arXiv:1909.08593, 2019 | 1195 | 2019 |
Theano: A Python framework for fast computation of mathematical expressions R Al-Rfou, G Alain, A Almahairi, C Angermueller, D Bahdanau, N Ballas, ... arXiv e-prints, arXiv:1605.02688, 2016 | 1144* | 2016 |
A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models C Finn, P Christiano, P Abbeel, S Levine arXiv preprint arXiv:1611.03852, 2016 | 421 | 2016 |
Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs P Christiano, JA Kelner, A Madry, DA Spielman, SH Teng Proceedings of the forty-third annual ACM symposium on Theory of computing …, 2011 | 417 | 2011 |
Transfer from simulation to real world through learning deep inverse dynamics model P Christiano, Z Shah, I Mordatch, J Schneider, T Blackwell, J Tobin, ... arXiv preprint arXiv:1610.03518, 2016 | 268 | 2016 |
Recursively summarizing books with human feedback J Wu, L Ouyang, DM Ziegler, N Stiennon, R Lowe, J Leike, P Christiano arXiv preprint arXiv:2109.10862, 2021 | 231 | 2021 |
Quantum money from hidden subspaces S Aaronson, P Christiano Proceedings of the forty-fourth annual ACM symposium on Theory of computing …, 2012 | 210 | 2012 |
A cryptographic test of quantumness and certifiable randomness from a single quantum device Z Brakerski, P Christiano, U Mahadev, U Vazirani, T Vidick Journal of the ACM (JACM) 68 (5), 1-47, 2021 | 174 | 2021 |
AI safety via debate G Irving, P Christiano, D Amodei arXiv preprint arXiv:1805.00899, 2018 | 159 | 2018 |
Model evaluation for extreme risks T Shevlane, S Farquhar, B Garfinkel, M Phuong, J Whittlestone, J Leung, ... arXiv preprint arXiv:2305.15324, 2023 | 114 | 2023 |
Unrestricted adversarial examples TB Brown, N Carlini, C Zhang, C Olsson, P Christiano, I Goodfellow arXiv preprint arXiv:1809.08352, 2018 | 105 | 2018 |
Supervising strong learners by amplifying weak experts P Christiano, B Shlegeris, D Amodei arXiv preprint arXiv:1810.08575, 2018 | 84 | 2018 |
Robust cooperation in the prisoner's dilemma: Program equilibrium via provability logic M Barasz, P Christiano, B Fallenstein, M Herreshoff, P LaVictoire, ... arXiv preprint arXiv:1401.5577, 2014 | 50* | 2014 |
Sleeper agents: Training deceptive LLMs that persist through safety training E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ... arXiv preprint arXiv:2401.05566, 2024 | 36 | 2024 |
Evaluating language-model agents on realistic autonomous tasks M Kinniment, LJK Sato, H Du, B Goodrich, M Hasin, L Chan, LH Miles, ... arXiv preprint arXiv:2312.11671, 2023 | 24 | 2023 |
Online local learning via semidefinite programming P Christiano Proceedings of the forty-sixth annual ACM symposium on Theory of computing …, 2014 | 19 | 2014 |