Christian Engelmann
Senior Scientist and Intelligent Systems and Facilities Group Leader, Oak Ridge National Laboratory
Title
Cited by
Year
Proactive fault tolerance for HPC with Xen virtualization
AB Nagarajan, F Mueller, C Engelmann, SL Scott
Proceedings of the 21st annual international conference on Supercomputing, 23-32, 2007
Cited by: 496, Year: 2007
Addressing failures in exascale computing
M Snir, RW Wisniewski, JA Abraham, SV Adve, S Bagchi, P Balaji, J Belak, ...
The International Journal of High Performance Computing Applications 28 (2 …, 2014
Cited by: 416, Year: 2014
Detection and correction of silent data corruption for large-scale high-performance computing
D Fiala, F Mueller, C Engelmann, R Riesen, K Ferreira, R Brightwell
SC'12: Proceedings of the International Conference on High Performance …, 2012
Cited by: 339, Year: 2012
Proactive process-level live migration in HPC environments
C Wang, F Mueller, C Engelmann, SL Scott
SC'08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 1-12, 2008
Cited by: 227, Year: 2008
Combining partial redundancy and checkpointing for HPC
J Elliott, K Kharbas, D Fiala, F Mueller, K Ferreira, C Engelmann
2012 IEEE 32nd International Conference on Distributed Computing Systems …, 2012
Cited by: 182, Year: 2012
Proactive fault tolerance using preemptive migration
C Engelmann, GR Vallee, T Naughton, SL Scott
2009 17th Euromicro International Conference on Parallel, Distributed and …, 2009
Cited by: 124, Year: 2009
A job pause service under LAM/MPI+BLCR for transparent fault tolerance
C Wang, F Mueller, C Engelmann, SL Scott
2007 IEEE International Parallel and Distributed Processing Symposium, 1-10, 2007
Cited by: 115, Year: 2007
Functional partitioning to optimize end-to-end performance on many-core architectures
M Li, SS Vazhkudai, AR Butt, F Meng, X Ma, Y Kim, C Engelmann, ...
SC'10: Proceedings of the 2010 ACM/IEEE International Conference for High …, 2010
Cited by: 113, Year: 2010
The case for modular redundancy in large-scale high performance computing systems
C Engelmann, HH Ong, SL Scott
Proceedings of the 8th IASTED international conference on parallel and …, 2009
Cited by: 102, Year: 2009
Failures in large scale systems: Long-term measurement, analysis, and implications
S Gupta, T Patel, C Engelmann, D Tiwari
Proceedings of the International Conference for High Performance Computing …, 2017
Cited by: 97, Year: 2017
NVMalloc: Exposing an aggregate SSD store as a memory partition in extreme-scale machines
C Wang, SS Vazhkudai, X Ma, F Meng, Y Kim, C Engelmann
2012 IEEE 26th International Parallel and Distributed Processing Symposium …, 2012
Cited by: 93, Year: 2012
System-level virtualization for high performance computing
G Vallee, T Naughton, C Engelmann, H Ong, SL Scott
16th Euromicro Conference on Parallel, Distributed and Network-Based …, 2008
Cited by: 82, Year: 2008
High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development
N DeBardeleben, J Laros, JT Daly, SL Scott, C Engelmann, B Harrod
Whitepaper, Dec, 2009
Cited by: 79, Year: 2009
Super-scalable algorithms for computing on 100,000 processors
C Engelmann, A Geist
International Conference on Computational Science, 313-321, 2005
Cited by: 78, Year: 2005
A framework for proactive fault tolerance
G Vallee, K Charoenpornwattana, C Engelmann, A Tikotekar, ...
2008 Third International Conference on Availability, Reliability and …, 2008
Cited by: 74, Year: 2008
Redundant execution of HPC applications with MR-MPI
C Engelmann, S Böhm
Proceedings of the 10th IASTED International Conference on Parallel and …, 2011
Cited by: 70, Year: 2011
Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale
C Engelmann
Future Generation Computer Systems 30, 59-65, 2014
Cited by: 65, Year: 2014
Hybrid checkpointing for MPI jobs in HPC environments
C Wang, F Mueller, C Engelmann, SL Scott
2010 IEEE 16th International Conference on Parallel and Distributed Systems …, 2010
Cited by: 65, Year: 2010
xSim: The extreme-scale simulator
S Böhm, C Engelmann
2011 International Conference on High Performance Computing & Simulation …, 2011
Cited by: 62, Year: 2011
Development of naturally fault tolerant algorithms for computing on 100,000 processors
A Geist, C Engelmann
Journal of Parallel and Distributed Computing, 2002
Cited by: 58*, Year: 2002