| Title | Authors | Venue | Cited by | Year |
|---|---|---|---|---|
| Adagrad stepsizes: Sharp convergence over nonconvex landscapes | R Ward, X Wu, L Bottou | The Journal of Machine Learning Research 21 (1), 9047-9076 | 296 | 2020 |
| When Do Curricula Work? | X Wu, E Dyer, B Neyshabur | International Conference on Learning Representations | 69 | 2021 |
| Wngrad: Learn the learning rate in gradient descent | X Wu, R Ward, L Bottou | arXiv preprint arXiv:1803.02865 | 66 | 2018 |
| Global convergence of adaptive gradient methods for an over-parameterized neural network | X Wu, SS Du, R Ward | arXiv preprint arXiv:1902.07111 | 53 | 2019 |
| Linear convergence of adaptive stochastic gradient descent | Y Xie, X Wu, R Ward | International Conference on Artificial Intelligence and Statistics, 1475-1485 | 43 | 2020 |
| Hierarchical learning for generation with long source sequences | T Rohde, X Wu, Y Liu | arXiv preprint arXiv:2104.07545 | 27 | 2021 |
| Choosing the Sample with Lowest Loss makes SGD Robust | V Shah, X Wu, S Sanghavi | International Conference on Artificial Intelligence and Statistics 108, 2120 … | 26 | 2020 |
| Value-at-Risk estimation with stochastic interest rate models for option-bond portfolios | X Wang, D Xie, J Jiang, X Wu, J He | Finance Research Letters 21, 10-20 | 18 | 2017 |
| Implicit Regularization and Convergence for Weight Normalization | X Wu, E Dobriban, T Ren, S Wu, Z Li, S Gunasekar, R Ward, Q Liu | Advances in Neural Information Processing Systems 33 | 14* | 2020 |
| ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | Z Yao, RY Aminabadi, M Zhang, X Wu, C Li, Y He | Advances in Neural Information Processing Systems | 12 | 2022 |
| LEAP: Learnable Pruning for Transformer-based Models | Z Yao, X Wu, L Ma, S Shen, K Keutzer, MW Mahoney, Y He | arXiv preprint arXiv:2105.14636 | 11* | 2021 |
| Adaptive differentially private empirical risk minimization | X Wu, L Wang, I Cristali, Q Gu, R Willett | arXiv preprint arXiv:2110.07435 | 5 | 2021 |
| An optimal mortgage refinancing strategy with stochastic interest rate | X Wu, D Xie, DA Edwards | Computational Economics 53, 1353-1375 | 4 | 2019 |
| XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient | X Wu, Z Yao, M Zhang, C Li, Y He | Advances in Neural Information Processing Systems | 3* | 2022 |
| Adaloss: A computationally-efficient and provably convergent adaptive gradient method | X Wu, Y Xie, SS Du, R Ward | Proceedings of the AAAI Conference on Artificial Intelligence 36 (8), 8691-8699 | 3 | 2022 |
| Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers | Z Yao, X Wu, C Li, C Holmes, M Zhang, C Li, Y He | arXiv preprint arXiv:2211.11586 | 1 | 2022 |
| A Comprehensive Study on Post-Training Quantization for Large Language Models | Z Yao, C Li, X Wu, S Youn, Y He | arXiv preprint arXiv:2303.08302 | | 2023 |
| Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases | X Wu, C Li, RY Aminabadi, Z Yao, Y He | arXiv preprint arXiv:2301.12017 | | 2023 |
| DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing | C Li, Z Yao, X Wu, M Zhang, Y He | arXiv preprint arXiv:2212.03597 | | 2022 |
| Optimal exercise frontier of Bermudan options by simulation methods | D Xie, DA Edwards, X Wu | International Journal of Financial Engineering 9 (03), 2250013 | | 2022 |