Senior Software Engineer, AI Resiliency

Redmond, WA, United States • Posted June 20, 2026

Job Type: Full-time
Application Deadline: June 25, 2026

Role Description

We are now looking for a Senior Software Engineer for AI Resiliency!


At NVIDIA, we are pushing the boundaries of what’s possible in AI. We are currently seeking a Senior Software Engineer to lead the development of AI software resiliency for the most powerful AI supercomputers in the world. As a member of our AI Software Resiliency team, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs. Your expertise will be crucial in driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable at all times.


What You’ll Be Doing:
+ Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at massive scale, such as fast checkpoint recovery, error detection, error isolation, and straggler/hang detection.
+ Hands-On Coding & Optimization: Contribute to large-scale distributed syst...
