Company:
Qualcomm India Private Limited
Job Area:
Engineering Group, Engineering Group > Software Test Engineering
General Summary:
Site Reliability Engineer - AI Infrastructure
About the Role:
Site Reliability Engineering (SRE) is a specialized engineering discipline that combines software development and systems engineering to build and maintain large-scale, highly reliable production systems. At Qualcomm, our SRE team plays a critical role in ensuring our AI infrastructure services deliver maximum reliability, performance, and uptime while enabling rapid innovation and deployment of cutting-edge AI solutions.
As an SRE in our AI Infrastructure team, you will be at the forefront of deploying and managing cloud-based AI inference systems. You'll work with state-of-the-art hardware accelerators and container orchestration platforms to deliver high-performance, scalable AI services. This role demands a unique blend of strong software engineering skills, particularly in Python, deep systems knowledge, and expertise in modern cloud-native technologies.
Our SRE culture emphasizes automation, proactive problem-solving, triaging activities, and continuous improvement. We value intellectual curiosity, collaboration, and a systematic approach to tackling complex distributed systems challenges.
You'll work with a diverse team of engineers who bring varied backgrounds and perspectives to solve some of the most interesting problems in AI infrastructure.
What You'll Be Doing:
- Design, implement, and maintain large-scale Kubernetes clusters optimized for AI inference workloads, with a focus on performance, reliability, and scalability across cloud environments
- Deploy and manage containerized AI services using Docker, Kubernetes, and KServe (or similar ML serving platforms), ensuring high availability and optimal resource utilization
- Write production-quality Python code to build automation tools, frameworks, and infrastructure management solutions that eliminate manual processes and improve operational efficiency
- Lead triaging efforts for complex production incidents, performing deep-dive analysis to identify root causes and implement permanent fixes
- Debug sophisticated deployment scenarios at multiple levels - from application layer through container orchestration to Linux OS and hardware interfaces
- Support the full lifecycle of AI inference services - from design and capacity planning through deployment, operation, optimization, and continuous refinement
- Develop and maintain Infrastructure as Code (IaC) using tools like Terraform, Ansible, or similar technologies to ensure reproducible and version-controlled infrastructure
- Collaborate with ML engineers, software developers, and infrastructure teams to optimize AI workload deployment and performance
What We Need to See:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience
- 5+ years of experience in Site Reliability Engineering, DevOps, Systems Engineering, or Software Development with a focus on production systems
- Certifications in Kubernetes (CKA, CKAD), cloud platforms (AWS/Azure/GCP), or related technologies will be an added advantage
- Strong proficiency in Python programming with demonstrated ability to write clean, maintainable, and efficient code for automation and tooling
- Deep expertise in Linux/Unix systems administration, including kernel concepts, system calls, networking stack, storage systems, and performance tuning
- Hands-on experience with Kubernetes in production environments, including cluster management, workload orchestration, networking (CNI), storage (CSI), and troubleshooting
- Solid understanding of containerization technologies (Docker, containerd) and container orchestration patterns
- Experience with cloud platforms (AWS, Azure, GCP, or private cloud) and cloud-native architectures
- Proven track record of triaging and resolving complex production issues under pressure, with strong analytical and debugging skills
- Experience with monitoring and observability tools such as Prometheus, Grafana, ELK Stack, or similar platforms
- Strong understanding of networking concepts including TCP/IP, DNS, load ba
• Qualcomm is an equal opportunity employer committed to diversity and inclusion in the workplace.
Minimum Qualifications:
• Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 4+ years of Software Test Engineering or related work experience.
OR
Master's degree in Engineering, Information Systems, Computer Science, or related field and 3+ years of Software Test Engineering or related work experience.
OR
PhD in Engineering, Information Systems, Computer Science, or related field and 2+ years of Software Test Engineering or related work experience.
• 2+ years of work experience with Software Test or System Test, developing and automating test plans, and/or tools (e.g., Source Code Control Systems, Continuous Integration Tools, and Bug Tracking Tools).
Applicants: Qualcomm is an equal opportunity employer. If you are an individual with a disability and need an accommodation during the application/hiring process, rest assured that Qualcomm is committed to providing an accessible process. You may e-mail disability-accomodations@qualcomm.com or call Qualcomm's toll-free number found here. Upon request, Qualcomm will provide reasonable accommodations to support individuals with disabilities to be able to participate in the hiring process. Qualcomm is also committed to making our workplace accessible for individuals with disabilities. (Keep in mind that this email address is used to provide reasonable accommodations for individuals with disabilities. We will not respond here to requests for updates on applications or resume inquiries.)
Qualcomm expects its employees to abide by all applicable policies and procedures, including but not limited to security and other requirements regarding protection of Company confidential information and other confidential and/or proprietary information, to the extent those requirements are permissible under applicable law.
To all Staffing and Recruiting Agencies: Our Careers Site is only for individuals seeking a job at Qualcomm. Staffing and recruiting agencies and individuals being represented by an agency are not authorized to use this site or to submit profiles, applications or resumes, and any such submissions will be considered unsolicited. Qualcomm does not accept unsolicited resumes or applications from agencies. Please do not forward resumes to our jobs alias, Qualcomm employees or any other company location. Qualcomm is not responsible for any fees related to unsolicited resumes/applications.
If you would like more information about this role, please contact Qualcomm Careers.