Javier Rando - Google Scholar

Cited by

	All	Since 2019
Citations	533	533
h-index	9	9
i10-index	9	9

Citations per year: 2020: 3; 2021: 4; 2022: 5; 2023: 130; 2024: 391

Co-authors

Florian Tramèr, Assistant Professor of Computer Science, ETH Zurich (verified email at inf.ethz.ch)
Daniel Paleka, ETH Zurich (verified email at inf.ethz.ch)
Stephen Casper, PhD student, MIT (verified email at mit.edu)
Nitish Joshi, New York University (verified email at nyu.edu)
He He, New York University (verified email at cs.nyu.edu)
Fernando Perez-Cruz, Sr Adviser, Innovation at Bank for International Settlements (verified email at bis.org)

Javier Rando

Other names: Javier Rando Ramirez

Verified email at ai.ethz.ch

Artificial Intelligence, Language Models, Safety, Security, Privacy


Title	Cited by	Year
Open problems and fundamental limitations of reinforcement learning from human feedback. S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... arXiv preprint arXiv:2307.15217, 2023	264	2023
Red-Teaming the Stable Diffusion Safety Filter. J Rando, D Paleka, D Lindner, L Heim, F Tramèr. ML Safety Workshop - NeurIPS 2022, 2022	93	2022
Scalable and transferable black-box jailbreaks for language models via persona modulation. R Shah, S Pour, A Tagade, S Casper, J Rando. arXiv preprint arXiv:2311.03348, 2023	46	2023
Foundational challenges in assuring alignment and safety of large language models. U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ... arXiv preprint arXiv:2404.09932, 2024	34	2024
Universal jailbreak backdoors from poisoned human feedback. J Rando, F Tramèr. arXiv preprint arXiv:2311.14455, 2023	25	2023
"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks. E Mosca, S Agarwal, J Rando-Ramirez, G Groh. ACL 2022, 2022	22	2022
Personas as a Way to Model Truthfulness in Language Models. N Joshi, J Rando, A Saparov, N Kim, H He. arXiv preprint arXiv:2310.18168, 2023	13	2023
PassGPT: password modeling and (guided) generation with large language models. J Rando, F Perez-Cruz, B Hitaj. European Symposium on Research in Computer Security, 164-183, 2023	12	2023
Uneven coverage of natural disasters in Wikipedia: The case of flood. V Lorini, J Rando, D Saez-Trumper, C Castillo. ISCRAM 2020, 2020	11	2020
Competition report: Finding universal jailbreak backdoors in aligned LLMs. J Rando, F Croce, K Mitka, S Shabalin, M Andriushchenko, N Flammarion, ... arXiv preprint arXiv:2404.14461, 2024	8*	2024
Attributions toward artificial agents in a modified Moral Turing Test. E Aharoni, S Fernandes, DJ Brady, C Alexander, M Criner, K Queen, ... Scientific Reports 14 (1), 8458, 2024	3	2024
Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI. R Hönig, J Rando, N Carlini, F Tramèr. arXiv preprint arXiv:2406.12027, 2024	1	2024
Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO. J Rando, N Naimi, T Baumann, M Mathys. AdvML Frontiers Workshop (ICML 2022), 2022	1	2022
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition. E Debenedetti, J Rando, D Paleka, SF Florin, D Albastroiu, N Cohen, ... arXiv preprint arXiv:2406.07954, 2024		2024

