The Site Reliability Lead (SRE Lead) consults on resilient architectures and programming best practices. As a subject matter expert in the available compute platforms and programming languages, the SRE Lead collaborates with architects to help teams choose the appropriate options for a given workload within their product family. The SRE Lead conducts localized failure mode and effects analyses (FMEAs) and post-incident reviews, and is responsible for implementing and carrying out chaos experimentation and performance testing for the product family. Because of their breadth of expertise across the products within their purview, SRE Leads may be a natural escalation path for incidents that involve multiple product teams. As incident response experts who also work closely with product teams, they ensure that each product's incident response runbooks are kept up to date. SRE Leads also maintain awareness of the SLIs and SLOs for each product in their product family and ensure logical alignment between observability targets for products that depend on one another. Where the standards set by SRE Champions and/or shared services groups leave more than one option open (e.g., more than one option for observing trace data), the SRE Leads decide for their product family so that related products remain consistent.
Responsibilities:
- Consult on architecture and programming design decisions related to availability and resilience.
- Conduct localized FMEAs when new features and architecture patterns are introduced, and at least annually for each product.
- Facilitate post-incident reviews for any client-impacting events local to the product family.
- Plan and execute chaos experiments regularly for the product family.
- Regularly coordinate performance tests for the product family.
- Assist product teams with triage and troubleshooting during client-impacting incidents.
- Maintain product-level runbooks for incident response, in collaboration with SRE Practitioners on each product team.
- Ensure alignment between SLIs and SLOs within the product family.
- Make final decisions regarding the use of tools, libraries, and standards for SRE where SRE Champions and/or shared services teams have provided multiple options (e.g., observability tools).
Qualifications:
- At least 8 to 10 years of related experience, including at least two years of development experience.
- Undergraduate degree or equivalent combination of training and experience. Graduate degree preferred.
Additional Qualifications that make an impact:
- Completed CTO SRE Curriculum training.
- Completed AWS Cloud training; the AWS Certified Cloud Practitioner certification is recommended.
- Minimum 5 years' experience with infrastructure system administration, including compute hardware and operating systems, storage systems, and networking.
- Experience with observability tools (Splunk, Honeycomb, Tivoli, Grafana, AWS CloudWatch).
- Minimum 5 years' experience with current cloud-based and on-premises compute environments and architectures.
Special Factors
Sponsorship
Vanguard is not offering visa sponsorship for this position.
About Vanguard
At Vanguard, we don't just have a mission; we're on a mission.
To work for the long-term financial wellbeing of our clients. To lead through products and services that transform our clients' lives. To learn and develop our skills as individuals and as a team. From Malvern to Melbourne, our mission drives us forward and inspires us to be our best.
Our commitment to diversity, equity, and inclusion
Vanguard's commitment to diversity, equity, and inclusion (DEI) is central to our ability to deliver on our mission. We aspire to create a work environment that is inclusive, equitable, and diverse, one that enables our employees, whom we call crew, to thrive and bring their best selves to work every day on behalf of our clients.
Cultivating DEI lifts our entire organization, and everyone shares accountability for our progress, from our senior leaders who lay the foundation and set the example for inclusive behaviors to crew who are growing in their personal DEI learning experiences.
Together, we're on a mission. We are fueled by the value of diverse voices and connected through friendships and a culture of care for our clients, our communities, and each other.
Vanguard's DEI journey has no finish line. Our commitment is enduring, and we remain focused on the path ahead. To learn more about Vanguard's goals and progress toward DEI, download our Diversity, Equity, and Inclusion Report.
How We Work
Vanguard has implemented a hybrid working model for the majority of our crew members, designed to capture the benefits of enhanced flexibility while enabling in-person learning, collaboration, and connection. We believe our mission-driven and highly collaborative culture is a critical enabler to support long-term client outcomes and enrich the employee experience.