Engineer, Staff

Nuvia

Bengaluru, Karnataka, India
Posted on Jan 9, 2026


Company:

Qualcomm India Private Limited

Job Area:

Engineering Group > Software Engineering

General Summary:

Site Reliability Engineer - AI Infrastructure

About the Role:

Site Reliability Engineering (SRE) is a specialized engineering discipline that combines software development and systems engineering to build and maintain large-scale, highly reliable production systems. At Qualcomm, our SRE team plays a critical role in ensuring our AI infrastructure services deliver maximum reliability, performance, and uptime while enabling rapid innovation and deployment of cutting-edge AI solutions.

As an SRE in our AI Infrastructure team, you will be at the forefront of deploying and managing cloud-based AI inference systems. You'll work with state-of-the-art hardware accelerators and container orchestration platforms to deliver high-performance, scalable AI services. This role demands a unique blend of strong software engineering skills, particularly in Python, deep systems knowledge, and expertise in modern cloud-native technologies.

Our SRE culture emphasizes automation, proactive problem-solving, triaging activities, and continuous improvement. We value intellectual curiosity, collaboration, and a systematic approach to tackling complex distributed systems challenges.

You'll work with a diverse team of engineers who bring varied backgrounds and perspectives to solve some of the most interesting problems in AI infrastructure.

What You'll Be Doing:

  • Design, implement, and maintain large-scale Kubernetes clusters optimized for AI inference workloads, with a focus on performance, reliability, and scalability across cloud environments
  • Deploy and manage containerized AI services using Docker, Kubernetes, and KServe (or similar ML serving platforms), ensuring high availability and optimal resource utilization
  • Write production-quality Python code to build automation tools, frameworks, and infrastructure management solutions that eliminate manual processes and improve operational efficiency
  • Lead triaging efforts for complex production incidents, performing deep-dive analysis to identify root causes and implement permanent fixes
  • Debug sophisticated deployment scenarios at multiple levels - from application layer through container orchestration to Linux OS and hardware interfaces
  • Support the full lifecycle of AI inference services - from design and capacity planning through deployment, operation, optimization, and continuous refinement
  • Develop and maintain Infrastructure as Code (IaC) using tools like Terraform, Ansible, or similar technologies to ensure reproducible and version-controlled infrastructure
  • Collaborate with ML engineers, software developers, and infrastructure teams to optimize AI workload deployment and performance

What We Need to See:

  • Bachelor's or Master's degree in Computer Science, Engineering, or related technical field, or equivalent practical experience
  • 5+ years of experience in Site Reliability Engineering, DevOps, Systems Engineering, or Software Development with a focus on production systems
  • Certifications in Kubernetes (CKA, CKAD), cloud platforms (AWS/Azure/GCP), or related technologies are an added advantage
  • Strong proficiency in Python programming with demonstrated ability to write clean, maintainable, and efficient code for automation and tooling
  • Deep expertise in Linux/Unix systems administration, including kernel concepts, system calls, networking stack, storage systems, and performance tuning
  • Hands-on experience with Kubernetes in production environments, including cluster management, workload orchestration, networking (CNI), storage (CSI), and troubleshooting
  • Solid understanding of containerization technologies (Docker, containerd) and container orchestration patterns
  • Experience with cloud platforms (AWS, Azure, GCP, or private cloud) and cloud-native architectures
  • Proven track record of triaging and resolving complex production issues under pressure, with strong analytical and debugging skills
  • Experience with monitoring and observability tools such as Prometheus, Grafana, ELK Stack, or similar platforms
  • Strong understanding of networking concepts including TCP/IP, DNS, load balancing, and service mesh architectures
  • Excellent problem-solving abilities with a systematic approach to root cause analysis
  • Strong communication skills with the ability to explain technical concepts to both technical and non-technical audiences

Ways to Stand Out from the Crowd:

  • Experience deploying and managing AI/ML inference systems or model serving platforms (KServe, TorchServe, TensorFlow Serving, Triton Inference Server)
  • Knowledge of AI hardware accelerators (GPUs, TPUs, or specialized AI chips) and their integration in cloud environments
  • Familiarity with AI/ML frameworks such as PyTorch, TensorFlow, or ONNX
  • Knowledge of security best practices for containerized environments and cloud infrastructure
  • Contributions to open-source projects related to Kubernetes, cloud-native technologies, or SRE tooling
  • Experience with capacity planning and performance optimization for high-throughput systems
  • Background in implementing and maintaining disaster recovery and business continuity solutions

Key Competencies:

  • Automation-First Mindset: Passion for eliminating repetitive manual work through intelligent automation
  • Systems Thinking: Ability to understand how complex distributed systems interact and impact each other
  • Ownership and Accountability: Taking end-to-end responsibility for services and their reliability
  • Continuous Learning: Growth mindset with eagerness to learn new technologies and methodologies
  • Collaboration: Ability to work effectively in global, cross-functional teams
  • Resilience Under Pressure: Maintaining composure and effectiveness during critical incidents
  • Attention to Detail: Thoroughness in implementation, testing, and documentation

What We Offer:

  • Opportunity to work on cutting-edge AI infrastructure at scale
  • Collaborative environment with world-class engineers
  • Continuous learning and professional development opportunities
  • Exposure to the latest technologies in cloud computing, AI/ML, and distributed systems
  • Competitive compensation and benefits package

Qualcomm is an equal opportunity employer committed to diversity and inclusion in the workplace.

Minimum Qualifications:

• Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 4+ years of Software Engineering or related work experience.
OR
Master's degree in Engineering, Information Systems, Computer Science, or related field and 3+ years of Software Engineering or related work experience.
OR
PhD in Engineering, Information Systems, Computer Science, or related field and 2+ years of Software Engineering or related work experience.

• 2+ years of work experience with a programming language such as C, C++, Java, Python, etc.

Applicants: Qualcomm is an equal opportunity employer. If you are an individual with a disability and need an accommodation during the application/hiring process, rest assured that Qualcomm is committed to providing an accessible process. You may e-mail disability-accomodations@qualcomm.com or call Qualcomm's toll-free number found here. Upon request, Qualcomm will provide reasonable accommodations to support individuals with disabilities to be able to participate in the hiring process. Qualcomm is also committed to making our workplace accessible for individuals with disabilities. (Keep in mind that this email address is used to provide reasonable accommodations for individuals with disabilities. We will not respond here to requests for updates on applications or resume inquiries).

Qualcomm expects its employees to abide by all applicable policies and procedures, including but not limited to security and other requirements regarding protection of Company confidential information and other confidential and/or proprietary information, to the extent those requirements are permissible under applicable law.

To all Staffing and Recruiting Agencies: Our Careers Site is only for individuals seeking a job at Qualcomm. Staffing and recruiting agencies and individuals being represented by an agency are not authorized to use this site or to submit profiles, applications or resumes, and any such submissions will be considered unsolicited. Qualcomm does not accept unsolicited resumes or applications from agencies. Please do not forward resumes to our jobs alias, Qualcomm employees or any other company location. Qualcomm is not responsible for any fees related to unsolicited resumes/applications.

If you would like more information about this role, please contact Qualcomm Careers.