The third era of AI has arrived, powered by Generative AI. Generative AI is achieving step-function increases in scale, versatility, and accuracy compared to legacy AI technologies, presenting an opportunity for organizations to fundamentally transform their business and operations.
SambaNova Suite™ is enabling organizations and enterprises to achieve the transformative promise of these new AI technologies with a fully integrated hardware-software system that delivers innovation across the full AI stack, including the most accurate generative AI models, optimized for enterprise and government. This creates the AI backbone for the next 10 years and beyond.
Working at SambaNova
This role presents a unique opportunity to shape the future of AI and the value it can unlock across every aspect of an organization’s business and operations. The Service Operations team at SambaNova Systems is responsible for building and operating the platform and infrastructure that enables us to deliver our groundbreaking capabilities to enterprise customers.
Job Description
SambaNova is hiring a Senior Site Reliability Engineer, Cloud Native Platform. As a site reliability engineer on this team, you will work closely alongside the platform engineering team to deploy and manage our Kubernetes-based platform at a global scale. You will lead multiple initiatives to enhance our capabilities and provide a reliable, scalable service for customers, in a hybrid deployment pattern.
This individual will be responsible for:
- Assume broad responsibilities for the successful delivery of our SambaNova services in a hybrid model, including but not limited to deployment, configuration, integrations, and ongoing operations
- Deploy, administer, and manage multiple Kubernetes clusters, both on-prem and in private cloud environments
- Lead efforts to triage, debug and fix issues related to network, storage, scheduling, applications, and systems, for proactive and reactive incident resolution and root cause analysis.
- Develop and continuously improve platform capabilities for observability, monitoring, notifications, logging, tracing, and continuous delivery with reduced toil
- Develop standard solutions that enable consistency in service delivery and proactively engage with multiple cross-functional teams to solve problems that impact service levels.
- Collaborate with the platform engineers for continuous automation of fleet-wide infrastructure and application deployments
- Determine and set SLOs for the service, build the processes and tools to measure and implement them, and prevent recurring problems and undesirable service conditions.
- Participate in on-call rotation responsibilities
Basic Qualifications
- Bachelor's and/or Master's degree in CS/EE or a related field
- 5+ years of hands-on experience as an SRE with a focus on cloud-native technologies
Additional Required Qualifications
- Hands-on experience deploying, managing, and troubleshooting Kubernetes clusters and components.
- Strong experience configuring and administering Linux systems in cloud/SaaS production environments.
- A systematic problem-solving approach to troubleshooting, and the desire to solve the root cause of common problems in 24x7 environments
- Software programming experience in one or more languages, including Go or Python
- Experience delivering infrastructure as code with tools such as Ansible, Terraform, Git, Jenkins, Helm, and ArgoCD.
- Good understanding of DNS, DHCP, LDAP, NFS, Kerberos, PAM, PXE, SNMP, SSH, HTTP/S, and NTP, and of troubleshooting network performance issues
- Experience with monitoring and logging systems such as Prometheus, Grafana, Nagios, ELK, etc. and the ability to identify new technologies as appropriate
- Experience tuning and optimizing storage solutions including Object Storage and NFS.
- Knowledge of virtualization and multiple hypervisor technologies, as well as cloud computing platforms such as AWS, Azure, and GCP.
- Configuration and maintenance of web servers, load balancers, databases, storage systems, and messaging systems
- Good understanding of test-driven development, continuous integration, and delivery
- A passion to design for high availability and scale, with the discipline and desire for extensive automation.
- Strong communication skills with the ability and willingness to work with diverse teams, and customers, across multiple time zones.
Preferred Qualifications
- Experience working in a high-growth startup
- A team player who demonstrates humility
- Action-oriented with a focus on speed & results
- Ability to thrive in a no-boundaries culture & make an impact on innovation
Submission Guidelines
Please note that in order to be considered an applicant for any position at SambaNova Systems, you must submit an application form for each position for which you believe you are qualified.
If you are a new, recent (within the last two years), or upcoming college graduate and are interested in opportunities with SambaNova Systems, please apply through our University job listings.
EEO Policy
SambaNova Systems is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to age (40 and over), color, disability, gender identity, genetic information, marital status, military or veteran status, national origin/ancestry, race, religion, creed, sex (including pregnancy, childbirth, breastfeeding), sexual orientation, or any other applicable status protected by federal, state, or local laws.
Customers turn to SambaNova to quickly deploy state-of-the-art AI capabilities to meet the demands of the AI-enabled world. Our purpose-built enterprise-scale AI platform is the technology backbone for the next generation of AI computing. We enable customers to unlock the valuable business insights trapped in their data. Our flagship offering, SambaNova Suite™, provides the most accurate generative AI models, optimized for enterprise and government organizations, deployed on-premises or in the cloud, and adapted with an organization’s data for greater accuracy.