1. S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, et al. "Open problems and fundamental limitations of reinforcement learning from human feedback." arXiv preprint arXiv:2307.15217, 2023. Cited by 166.
2. T Korbak, K Shi, A Chen, RV Bhalerao, C Buckley, J Phang, SR Bowman, et al. "Pretraining language models with human preferences." International Conference on Machine Learning, 17506-17533, 2023. Cited by 83.
3. L Berglund, M Tong, M Kaufmann, M Balesni, AC Stickland, T Korbak, et al. "The Reversal Curse: LLMs trained on 'A is B' fail to learn 'B is A'." arXiv preprint arXiv:2309.12288, 2023. Cited by 82*.
4. IR McKenzie, A Lyzhov, M Pieler, A Parrish, A Mueller, A Prabhu, et al. "Inverse Scaling: When Bigger Isn't Better." arXiv preprint arXiv:2306.09479, 2023. Cited by 69*.
5. J Scheurer, JA Campos, T Korbak, JS Chan, A Chen, K Cho, E Perez. "Training language models with language feedback at scale." arXiv preprint arXiv:2303.16755, 2023. Cited by 63.
6. M Sharma, M Tong, T Korbak, D Duvenaud, A Askell, SR Bowman, et al. "Towards understanding sycophancy in language models." arXiv preprint arXiv:2310.13548, 2023. Cited by 43.
7. A Chen, J Scheurer, T Korbak, JA Campos, JS Chan, SR Bowman, K Cho, et al. "Improving code generation by training with natural language feedback." arXiv preprint arXiv:2303.16749, 2023. Cited by 36.
8. D Go, T Korbak, G Kruszewski, J Rozen, N Ryu, M Dymetman. "Aligning language models with preferences through f-divergence minimization." arXiv preprint arXiv:2302.08215, 2023. Cited by 30.
9. T Korbak. "Computational enactivism under the free energy principle." Synthese 198 (3), 2743-2763, 2021. Cited by 30.
10. T Korbak, E Perez, CL Buckley. "RL with KL penalties is better viewed as Bayesian inference." arXiv preprint arXiv:2205.11275, 2022. Cited by 29*.
11. T Korbak, H Elsahar, G Kruszewski, M Dymetman. "On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting." Advances in Neural Information Processing Systems 35, 16203-16220, 2022. Cited by 23.
12. L Berglund, AC Stickland, M Balesni, M Kaufmann, M Tong, T Korbak, et al. "Taken out of context: On measuring situational awareness in LLMs." arXiv preprint arXiv:2309.00667, 2023. Cited by 20*.
13. T Korbak, H Elsahar, G Kruszewski, M Dymetman. "Controlling conditional language models without catastrophic forgetting." International Conference on Machine Learning, 11499-11528, 2022. Cited by 19.
14. T Korbak, J Zubek, Ł Kuciński, P Miłoś, J Rączaszek-Leonardi. "Interaction history as a source of compositionality in emergent communication." Interaction Studies 22 (2), 212-243, 2021. Cited by 17*.
15. Ł Kuciński, T Korbak, P Kołodziej, P Miłoś. "Catalytic role of noise and necessity of inductive biases in the emergence of compositional communication." Advances in Neural Information Processing Systems 34, 23075-23088, 2021. Cited by 14.
16. T Korbak. "Scaffolded minds and the evolution of content in signaling pathways." Studies in Logic, Grammar and Rhetoric 41 (1), 89-103, 2015. Cited by 11.
17. T Korbak, J Zubek, J Rączaszek-Leonardi. "Measuring non-trivial compositionality in emergent communication." arXiv preprint arXiv:2010.15058, 2020. Cited by 8.
18. T Korbak, H Elsahar, M Dymetman, G Kruszewski. "Energy-based models for code generation under compilability constraints." arXiv preprint arXiv:2106.04985, 2021. Cited by 7.
19. R Korzeniowski, R Rolczyński, P Sadownik, T Korbak, M Możejko. "Exploiting unsupervised pre-training and automated feature engineering for low-resource hate speech detection in Polish." arXiv preprint arXiv:1906.09325, 2019. Cited by 7.
20. T Korbak, P Żak. "Fine-tuning Tree-LSTM for phrase-level sentiment classification on a Polish dependency treebank." Language and Technology Conference, 31-42, 2017. Cited by 4.