Job Opportunities at Atlantic Bridge and Atlantic Bridge Portfolio

Engineer, Staff

Nuvia

Bengaluru, Karnataka, India

Posted on Jan 9, 2026

Company:

Qualcomm India Private Limited

Job Area:

Engineering Group, Engineering Group > Software Engineering

General Summary:

Site Reliability Engineer - AI Infrastructure

About the Role:

Site Reliability Engineering (SRE) is a specialized engineering discipline that combines software development and systems engineering to build and maintain large-scale, highly reliable production systems. At Qualcomm, our SRE team plays a critical role in ensuring our AI infrastructure services deliver maximum reliability, performance, and uptime while enabling rapid innovation and deployment of cutting-edge AI solutions.

As an SRE in our AI Infrastructure team, you will be at the forefront of deploying and managing cloud-based AI inference systems. You'll work with state-of-the-art hardware accelerators and container orchestration platforms to deliver high-performance, scalable AI services. This role demands a unique blend of strong software engineering skills, particularly in Python, deep systems knowledge, and expertise in modern cloud-native technologies.

Our SRE culture emphasizes automation, proactive problem-solving, triaging activities, and continuous improvement. We value intellectual curiosity, collaboration, and a systematic approach to tackling complex distributed systems challenges.

You'll work with a diverse team of engineers who bring varied backgrounds and perspectives to solve some of the most interesting problems in AI infrastructure.

What You'll Be Doing:

Design, implement, and maintain large-scale Kubernetes clusters optimized for AI inference workloads, with a focus on performance, reliability, and scalability across cloud environments
Deploy and manage containerized AI services using Docker, Kubernetes, and KServe (or similar ML serving platforms), ensuring high availability and optimal resource utilization
Write production-quality Python code to build automation tools, frameworks, and infrastructure management solutions that eliminate manual processes and improve operational efficiency
Lead triaging efforts for complex production incidents, performing deep-dive analysis to identify root causes and implement permanent fixes
Debug sophisticated deployment scenarios at multiple levels - from application layer through container orchestration to Linux OS and hardware interfaces
Support the full lifecycle of AI inference services - from design and capacity planning through deployment, operation, optimization, and continuous refinement
Develop and maintain Infrastructure as Code (IaC) using tools like Terraform, Ansible, or similar technologies to ensure reproducible and version-controlled infrastructure
Collaborate with ML engineers, software developers, and infrastructure teams to optimize AI workload deployment and performance

What We Need to See:

Bachelor's or Master's degree in Computer Science, Engineering, or related technical field, or equivalent practical experience
5+ years of experience in Site Reliability Engineering, DevOps, Systems Engineering, or Software Development with a focus on production systems
Certifications in Kubernetes (CKA, CKAD), cloud platforms (AWS/Azure/GCP), or related technologies will be an added advantage
Strong proficiency in Python programming with a demonstrated ability to write clean, maintainable, and efficient code for automation and tooling
Deep expertise in Linux/Unix systems administration, including kernel concepts, system calls, networking stack, storage systems, and performance tuning
Hands-on experience with Kubernetes in production environments, including cluster management, workload orchestration, networking (CNI), storage (CSI), and troubleshooting
Solid understanding of containerization technologies (Docker, containerd) and container orchestration patterns
Experience with cloud platforms (AWS, Azure, GCP, or private cloud) and cloud-native architectures
Proven track record of triaging and resolving complex production issues under pressure, with strong analytical and debugging skills
Experience with monitoring and observability tools such as Prometheus, Grafana, ELK Stack, or similar platforms
Strong understanding of networking concepts including TCP/IP, DNS, load balancing, and service mesh architectures
Excellent problem-solving abilities with a systematic approach to root cause analysis
Strong communication skills with the ability to explain technical concepts to both technical and non-technical audiences

Ways to Stand Out from the Crowd:

Experience deploying and managing AI/ML inference systems or model serving platforms (KServe, TorchServe, TensorFlow Serving, Triton Inference Server)
Knowledge of AI hardware accelerators (GPUs, TPUs, or specialized AI chips) and their integration in cloud environments
Familiarity with AI/ML frameworks such as PyTorch, TensorFlow, or ONNX
Knowledge of security best practices for containerized environments and cloud infrastructure
Contributions to open-source projects related to Kubernetes, cloud-native technologies, or SRE tooling
Experience with capacity planning and performance optimization for high-throughput systems
Background in implementing and maintaining disaster recovery and business continuity solutions

Key Competencies:

Automation-First Mindset: Passion for eliminating repetitive manual work through intelligent automation
Systems Thinking: Ability to understand how complex distributed systems interact and impact each other
Ownership and Accountability: Taking end-to-end responsibility for services and their reliability
Continuous Learning: Growth mindset with eagerness to learn new technologies and methodologies
Collaboration: Ability to work effectively in global, cross-functional teams
Resilience Under Pressure: Maintaining composure and effectiveness during critical incidents
Attention to Detail: Thoroughness in implementation, testing, and documentation

What We Offer:

Opportunity to work on cutting-edge AI infrastructure at scale
Collaborative environment with world-class engineers
Continuous learning and professional development opportunities
Exposure to the latest technologies in cloud computing, AI/ML, and distributed systems
Competitive compensation and benefits package

Qualcomm is an equal opportunity employer committed to diversity and inclusion in the workplace.

Minimum Qualifications:

• Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 4+ years of Software Engineering or related work experience.
OR
Master's degree in Engineering, Information Systems, Computer Science, or related field and 3+ years of Software Engineering or related work experience.
OR
PhD in Engineering, Information Systems, Computer Science, or related field and 2+ years of Software Engineering or related work experience.

• 2+ years of work experience with a programming language such as C, C++, Java, Python, etc.

Applicants: Qualcomm is an equal opportunity employer. If you are an individual with a disability and need an accommodation during the application/hiring process, rest assured that Qualcomm is committed to providing an accessible process. You may e-mail disability-accomodations@qualcomm.com or call Qualcomm's toll-free number found here. Upon request, Qualcomm will provide reasonable accommodations to support individuals with disabilities to be able to participate in the hiring process. Qualcomm is also committed to making our workplace accessible for individuals with disabilities. (Keep in mind that this email address is used to provide reasonable accommodations for individuals with disabilities. We will not respond here to requests for updates on applications or resume inquiries.)

Qualcomm expects its employees to abide by all applicable policies and procedures, including but not limited to security and other requirements regarding protection of Company confidential information and other confidential and/or proprietary information, to the extent those requirements are permissible under applicable law.

To all Staffing and Recruiting Agencies: Our Careers Site is only for individuals seeking a job at Qualcomm. Staffing and recruiting agencies and individuals being represented by an agency are not authorized to use this site or to submit profiles, applications or resumes, and any such submissions will be considered unsolicited. Qualcomm does not accept unsolicited resumes or applications from agencies. Please do not forward resumes to our jobs alias, Qualcomm employees or any other company location. Qualcomm is not responsible for any fees related to unsolicited resumes/applications.

If you would like more information about this role, please contact Qualcomm Careers.
