About the job
Scaleway is looking for a Site Reliability Engineer to join our teams. Reporting to a Lead SRE, you will be responsible for ensuring that we can reliably serve our products to users around the world. We expect you to have a strong background in development and system administration. Our systems evolve constantly, and the tools used to observe them and keep them resilient must evolve accordingly.
Minimum qualifications
- Previous experience as a developer in Go, Python or Rust
- Experience in systems programming with common scripting languages (Bash, Python)
- Demonstrated ability to troubleshoot production system failures
- A great attitude and desire to work with a team
- Passion for incremental tooling improvements and a love of all things automation
- Experience with Linux systems (Ubuntu/Debian)
- Experience with cloud environment architecture (bare metal, virtual machines, containers, orchestrators)
- Good understanding of computer networks: TCP/IP, DNS, load-balancing, IPv6, BGP and network virtualisation
- Good understanding of written and spoken English, including the ability to write technical documentation in English and to speak English when needed
Preferred qualifications
- Experience with infrastructure as code and continuous deployment
- Experience dealing with physical hardware automation
- Experience with monitoring & logging systems
- Experience administering relational databases
- Knowledge of at least one cloud platform and its related use cases
- Initiative to propose new solutions and defend them
- Team player, willing to share knowledge, opinions, and participate in regular team rituals
- Good communication and coaching skills
Responsibilities
- Create new tools and documentation, or optimize existing ones, to help identify, diagnose and remediate production incidents, automating as much as possible
- Troubleshoot high-impact issues working with multiple engineering teams
- Take on-call responsibilities, mitigate issues encountered in production and provide the best real-time response to our customers
- Ensure a high quality of service for our customers by leveraging observability and monitoring technologies
- Manage the lifecycle of products in production
- Help implement best practices in stability, resiliency, scalability, security and performance across our systems
Technical Stack
- Python, Go, Rust
- RabbitMQ
- PostgreSQL
- HAProxy, Nginx, REST APIs / Flask
- S3 API
- Sentry, Prometheus, Grafana, Elasticsearch, Fluentd, Kibana
- Ansible, AWX, Foreman, Salt
- GitLab, Nexus
- Ubuntu, Debian, CentOS
- Jira, Confluence, Slack, G Suite
Location
This position is based in our offices in Paris or Lille (France).