Site Reliability Engineer (Splunk, Prometheus, Grafana) Hybrid

Chelsoft Solutions Co. • Sunnyvale, CA, United States • $200k - $250k / year • 1m ago

This is a Site Reliability Engineer role for the Sam's Cash Application team.

Role And Responsibilities Include

Production Ticket Handling and Troubleshooting: Requires strong analytical and problem-solving skills and knowledge of root cause analysis (RCA) and root cause corrective action (RCCA). Guides team members in RCA and RCCA to identify the origins of defects and performance gaps and to prevent them. Analyzes complex problems involving multiple parties, networks, hardware, software, and cloud computing technologies.
Weighs immediate restoration against root cause investigation based on consequences and resource requirements. Analyzes issues and plans a series of steps to enhance an application's availability and reliability, potentially including reconfiguration, integration, removal, or addition of application components. Analyzes trends to proactively prevent incidents and provides historical summary reports.
Disaster Recovery Planning: Requires knowledge of: Disaster recovery procedures and processes; Enterprise disaster recovery systems. Coordinates partial and full tests of contingency and disaster recovery plans. Creates and maintains data center contingency documents and action plans. Defines and documents contingency and disaster recovery procedures. Leads the identification of critical functions for the assigned area of responsibility. Creates and tests plans for operating in a remote backup environment. Coordinates the day-to-day activities of control measures used in recovery plans.
Monitoring and Alerting: Requires knowledge of: Monitoring and alerting tools (Splunk, Prometheus, Grafana); Monitoring metrics and key performance indicators (for example, availability, MTBF, MTTR); SLIs and SLOs (for example, request latency, availability, error rates, saturation); Distributed tracing; Alerting logic.
Establishes metrics to monitor network, software, or system performance. Establishes SLOs/SLAs to determine availability goals of systems/services. Sets alerting priorities by identifying the most important systems based on criticality. Oversees daily system monitoring, including verifying the integrity and availability of all hardware and services, reviewing system and application logs, and verifying the completion of scheduled jobs.
Leads end-to-end audits of monitors and alarms based on subsystem knowledge. Provides proactive updates to executive leadership on potential customer-impacting issues. Analyzes systems and makes recommendations to prevent possible incidents using knowledge of complex and company-wide systems.

Data Reporting And Metrics

Advanced SQL skills to pull complex data reports from multiple sources; familiarity with Databricks or GCP BigQuery; the ability to write advanced Splunk queries that join multiple indexes to stitch data together; and a data-driven decision-making process for analyzing the impact of production issues and prioritizing them.

Additional Information

What project or initiative will they be working on?

Sam's Cash Reward Project

Will this role be hybrid?

If hybrid, how many days per week will they need to be in the office?

2-3 days a week

Top 3 Skills Needed Or Required

Strong technical, analytical, and problem-solving skills, with experience triaging and troubleshooting production issues
Monitoring and alerting skills (Splunk, Prometheus, Grafana)
Data reporting and metrics skills (SQL, Python, PySpark, Databricks)

What is the makeup of the team?

Team of 8 engineers, including Java backend engineers, Site Reliability Engineers, and Data Engineers, supporting Sam's Cash Core Application Operations.