Cognitive Collective


Senior SRE Engineer

Run:ai

Software Engineering
Tel Aviv-Yafo, Israel · Ra'anana, Israel
Posted on Jul 12, 2025

NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s an outstanding legacy of innovation that’s motivated by extraordinary technology and amazing people. NVIDIA is looking for a highly motivated DevOps/SRE engineer to join the NVIDIA Air team, which builds the digital twin web application for data center simulation. NVIDIA Air enables cloud-scale efficiency by creating identical replicas of real-world data center infrastructure deployments. To learn more, visit NVIDIA Air.

What you'll be doing:

  • Joining the NVIDIA Air team, which is building the SaaS/IaaS platform for digital twins of AI data centers.

  • Owning the infrastructure and Site Reliability Engineering (SRE) requirements for Air.

  • Focus on efficiency by automating repetitive workflows.

  • Working on a microservices-based architecture.

  • Deploying and troubleshooting non-disruptive cloud operations with an emphasis on secure production infrastructure.

  • Continuously evaluating existing systems and driving improvements.

  • Managing deployments and upgrades for operating systems, Kubernetes (k8s) clusters, and/or other orchestration tools.

  • Day-to-day support for engineering activities with CI/CD tools like Git and Jenkins.

  • Multi-tasking across different tracks to efficiently address evolving priorities.

What we need to see:

  • BSc in Engineering, relevant certifications, or equivalent experience.

  • 5+ years of experience in complex microservices-based architectures

  • Proven experience in best practices and discipline of managing and monitoring a highly available and secure production infrastructure

  • Experienced with the latest observability tools, such as the Prometheus stack, Datadog, etc.

  • Experienced with modern deployment architecture for non-disruptive cloud operations, including blue-green and canary rollouts

  • Highly skilled in Kubernetes and Docker

  • Experience in IaaS environments: deploying, configuring, and administering Linux-based bare metal servers

  • Experience with relational databases (MySQL) and SQL.

  • Expert in AWS

Ways to stand out from the crowd:

  • Skills in Linux/Unix Administration

  • Experience with Prometheus/Grafana.

  • Experience with APM tools like Dynatrace, Datadog, AppDynamics, New Relic, etc.

  • Implemented robust metrics collection and alerting infrastructure

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people on the planet working for us. If you're creative, passionate and self-motivated, we want to hear from you! NVIDIA is leading the way in ground-breaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services.