Javier Rando
Other names: Javier Rando Ramirez
Verified email at ai.ethz.ch
Title
Cited by
Year
Open problems and fundamental limitations of reinforcement learning from human feedback
S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...
arXiv preprint arXiv:2307.15217, 2023
Cited by: 216; Year: 2023
Red-Teaming the Stable Diffusion Safety Filter
J Rando, D Paleka, D Lindner, L Heim, F Tramèr
ML Safety Workshop - NeurIPS 2022, 2022
Cited by: 84; Year: 2022
Scalable and transferable black-box jailbreaks for language models via persona modulation
R Shah, S Pour, A Tagade, S Casper, J Rando
arXiv preprint arXiv:2311.03348, 2023
Cited by: 40; Year: 2023
"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks
E Mosca, S Agarwal, J Rando-Ramirez, G Groh
ACL 2022, 2022
Cited by: 22; Year: 2022
Universal jailbreak backdoors from poisoned human feedback
J Rando, F Tramèr
arXiv preprint arXiv:2311.14455, 2023
Cited by: 20; Year: 2023
Foundational challenges in assuring alignment and safety of large language models
U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ...
arXiv preprint arXiv:2404.09932, 2024
Cited by: 19; Year: 2024
Uneven coverage of natural disasters in Wikipedia: The case of flood
V Lorini, J Rando, D Saez-Trumper, C Castillo
ISCRAM 2020, 2020
Cited by: 11; Year: 2020
Personas as a Way to Model Truthfulness in Language Models
N Joshi, J Rando, A Saparov, N Kim, H He
arXiv preprint arXiv:2310.18168, 2023
Cited by: 9; Year: 2023
PassGPT: Password Modeling and (Guided) Generation with Large Language Models
J Rando, F Perez-Cruz, B Hitaj
European Symposium on Research in Computer Security, 164-183, 2023
Cited by: 8; Year: 2023
Competition report: Finding universal jailbreak backdoors in aligned LLMs
J Rando, F Croce, K Mitka, S Shabalin, M Andriushchenko, N Flammarion, ...
arXiv preprint arXiv:2404.14461, 2024
Cited by: 4*; Year: 2024
Attributions toward artificial agents in a modified Moral Turing Test
E Aharoni, S Fernandes, DJ Brady, C Alexander, M Criner, K Queen, ...
Scientific Reports 14 (1), 8458, 2024
Cited by: 2; Year: 2024
Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO
J Rando, N Naimi, T Baumann, M Mathys
AdvML Frontiers Workshop (ICML 2022), 2022
Cited by: 1; Year: 2022
Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
R Hönig, J Rando, N Carlini, F Tramèr
arXiv preprint arXiv:2406.12027, 2024
Year: 2024
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
E Debenedetti, J Rando, D Paleka, F Silaghi, D Albastroiu, N Cohen, ...
arXiv preprint arXiv:2406.07954, 2024
Year: 2024