Company:
Qualcomm Incorporated
Job Area:
Engineering Group > Software Engineering
General Summary:
We’re seeking an experienced AI Platforms Leader to own the strategy, architecture, and operation of our end‑to‑end AI Platform—spanning on‑prem GPU clusters and cloud services (AWS/GCP/Azure). You’ll lead a high-caliber engineering team to deliver reliable, secure, and cost‑efficient infrastructure for training, fine‑tuning, inference, retrieval, and agentic orchestration (including A2A patterns and MCP servers). If you love turning complex AI/ML requirements into robust, self‑service platform capabilities for builders across the company, this is your role.
This role requires full-time onsite work in San Diego, CA (5 days per week).
Key Responsibilities
- Own the AI Platform strategy & roadmap
  - Define the multi‑year vision for a multi‑tenant, hybrid (on‑prem + cloud) AI platform, aligned to business needs, developer productivity, and cost efficiency.
  - Establish clear platform SLAs/SLOs, reliability goals, and security/compliance guardrails.
- Run GPU-based compute at scale
  - Operate and optimize on‑prem GPU clusters (e.g., Kubernetes + GPU operator and/or Slurm), including capacity planning, scheduling, partitioning, NCCL, and high‑throughput storage/networking.
  - Drive GPU utilization efficiency, right‑sizing, and cost transparency across training and inference workloads.
- Deliver MLOps & LLMOps as a product
  - Provide golden paths for data prep, training/fine‑tuning, model registry, lineage, governance, evaluation, red‑teaming, and safe deployment (batch, online, streaming).
  - Implement CI/CD for models, prompts, and agents; automate evaluations and rollout/rollback with canaries, A/B, and shadow deployments.
- Agentic AI, A2A, and MCP ecosystem
  - Lead the design and operation of agentic orchestration (A2A patterns), tool integration, and MCP (Model Context Protocol) servers to securely expose enterprise tools and data.
  - Standardize agent capability schemas, guardrails, observability, and policy enforcement.
- Cloud AI/ML platforms
  - Leverage AWS/Azure AI services for training and inference (e.g., Bedrock/SageMaker/EKS; Azure AI Studio/Azure ML/AKS/Azure OpenAI) with robust networking, identity, secrets, and cost controls.
  - Establish multi‑cloud patterns for portability, resilience, and vendor risk management.
- Platform engineering & DevOps excellence
  - Own core platform services: identity/RBAC, secrets, service meshes, observability (logs/metrics/traces), data access controls, vector stores, feature stores, and model gateways (e.g., KServe/Triton/vLLM).
  - Use GitOps/IaC (Terraform/Bicep/Helm) and secure software supply chain practices (SBOMs, image signing, policy as code).
- Operational leadership
  - Lead a ~10‑engineer global team (platform, SRE, MLOps/LLMOps) with global collaboration, 24×7 readiness, and a healthy on‑call rotation.
  - Drive incident response, post‑mortems, and continuous improvement. Partner with Security, Legal, and Compliance for model/data governance.
- Stakeholder & vendor management
  - Partner with product, data, and application teams to enable high‑impact AI use cases.
  - Manage strategic vendors (e.g., cloud, GPU, enterprise AI tooling) and negotiate licenses/SOWs aligned to roadmap and budget.
Required Qualifications
- 15+ years overall engineering/technology experience, including ~10 years building and operating large‑scale platforms (AI/ML, data, or high‑performance computing).
- Leadership: Proven experience leading a team of ~10 engineers for 5+ years, across platform/SRE/MLOps/LLMOps, with coaching, hiring, performance management, and clear execution rhythms.
- GPU cluster expertise: Hands‑on operations for on‑prem GPU clusters (Kubernetes + GPU operator and/or Slurm), scheduling, capacity planning, performance tuning, and reliability.
- MLOps & LLMOps: Strong experience with model lifecycle (data → training → registry → deployment), model/agent evaluation, safety/guardrails, and observability.
- Cloud (AWS/GCP/Azure): Deep experience with AI/ML services and managed Kubernetes (EKS/AKS/GKE), networking, security, identity, and cost management.
- DevOps/Platform Engineering: CI/CD, GitOps, IaC (Terraform/Bicep/Helm), containerization (Docker), Kubernetes, and secure SDLC practices.
- Agentic AI & MCP: Solid understanding of agent orchestration, A2A patterns, tool abstractions, and operating MCP servers in production.
- Operational excellence: Demonstrated success running AI or computing clusters with SLOs, on‑call, incident management, and post‑mortems.
- Global collaboration: Experience leading a distributed engineering team across time zones.
- Education: Bachelor’s degree in Engineering, Computer Science, or related field.
Preferred Qualifications
- Master’s or PhD in CS/EE/Math or related field.
- Experience with:
  - Training & Inference stacks: PyTorch, CUDA/cuDNN, Triton Inference Server, vLLM, KServe, Ray, Slurm.
  - Data & storage: High‑throughput storage (e.g., Lustre, BeeGFS, Ceph), vector databases (e.g., FAISS, Milvus, Pinecone, Azure AI Search), feature stores (e.g., Feast).
  - MLOps toolchain: MLflow/Vertex/Azure ML/SageMaker registries, Airflow/Argo, Weights & Biases, LangSmith, Prompt/version management.
  - Security & governance: OIDC/RBAC, policy as code (OPA), secrets management (AWS Secrets Manager/Azure Key Vault), model governance/risk controls, privacy/PII safeguards.
  - Agentic frameworks: Semantic Kernel, LangChain, CrewAI, AutoGen (or equivalents) and experience integrating enterprise tools via MCP.
- Proven track record shipping platform capabilities that enable multiple product teams (self‑service, docs, SDKs, templates, golden paths).
- Strong communication with executives and technical leaders; clear metrics, dashboards, and business value storytelling.
Minimum Qualifications:
• Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 8+ years of Software Engineering or related work experience.
OR
Master's degree in Engineering, Information Systems, Computer Science, or related field and 7+ years of Software Engineering or related work experience.
OR
PhD in Engineering, Information Systems, Computer Science, or related field and 6+ years of Software Engineering or related work experience.
• 4+ years of work experience with a programming language such as C, C++, Java, or Python.
Qualcomm is an equal opportunity employer. If you are an individual with a disability and need an accommodation during the application/hiring process, rest assured that Qualcomm is committed to providing an accessible process. You may e-mail disability-accomodations@qualcomm.com or call Qualcomm's toll-free number found here. Upon request, Qualcomm will provide reasonable accommodations to support individuals with disabilities to be able to participate in the hiring process. Qualcomm is also committed to making our workplace accessible for individuals with disabilities. (Keep in mind that this email address is used to provide reasonable accommodations for individuals with disabilities. We will not respond here to requests for updates on applications or resume inquiries.)
To all Staffing and Recruiting Agencies: Our Careers Site is only for individuals seeking a job at Qualcomm. Staffing and recruiting agencies and individuals being represented by an agency are not authorized to use this site or to submit profiles, applications or resumes, and any such submissions will be considered unsolicited. Qualcomm does not accept unsolicited resumes or applications from agencies. Please do not forward resumes to our jobs alias, Qualcomm employees or any other company location. Qualcomm is not responsible for any fees related to unsolicited resumes/applications.
EEO Employer: Qualcomm is an equal opportunity employer; all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or any other protected classification.
Qualcomm expects its employees to abide by all applicable policies and procedures, including but not limited to security and other requirements regarding protection of Company confidential information and other confidential and/or proprietary information, to the extent those requirements are permissible under applicable law.
Pay range and Other Compensation & Benefits:
$198,500.00 - $297,700.00
The above pay scale reflects the broad, minimum to maximum, pay scale for this job code for the location for which it has been posted. Even more importantly, please note that salary is only one component of total compensation at Qualcomm. We also offer a competitive annual discretionary bonus program and opportunity for annual RSU grants (employees on sales-incentive plans are not eligible for our annual bonus). In addition, our highly competitive benefits package is designed to support your success at work, at home, and at play. Your recruiter will be happy to discuss all that Qualcomm has to offer – and you can review more details about our US benefits at this link.
If you would like more information about this role, please contact Qualcomm Careers.