Site Reliability Engineer (SRE)
Your responsibilities will include:
• System design, configuration, integration, deployment, and operation of Observability systems and tools. These systems collect metrics, logs, and events from gaming services, applications (client, middleware, backend), and infrastructure (Google Cloud, on-premise).
• Design and deploy our Observability infrastructure and systems, taking them to the next level of availability and scale
• Develop metrics and log ingestion pipelines for high volumes of telemetry
• Create build and deployment pipelines for monitoring tools
• Deploy monitoring solutions
• Develop a set of alerts and metrics to keep your own services alive and performing well
• Collaborate with other SRE team members to improve the efficiency and reliability of monitoring solutions
• Collaborate with our Application Development teams to define the standards/APIs that ensure our Applications are emitting the right telemetry (metrics, logs, traces, events)
• Collect, aggregate, and visualize metrics to provide visibility into the health of our most critical systems and to set standards for their key indicators
• Evaluate, choose, and implement the next generation of Observability tools
Who we are looking for:
• As a Senior SRE Observability Engineer, you have extensive working experience building, integrating, and administering systems that leverage open-source monitoring tools at scale (e.g., Prometheus, VictoriaMetrics), the Elastic Stack (Elasticsearch, Logstash, Kibana, Beats), and Grafana. We work with Atlassian products (Jira, etc.), so experience with them is a plus.
• We try to follow best practices in IT operations for an always-up, always-available service, and you are welcome to suggest improvements. Our environment is Agile, so experience working in Agile teams is a plus.
• You are a quick learner who can absorb a lot of information about our in-house framework and systems in a short time. This position calls for good soft skills. You can work under pressure while maintaining accuracy and attention to detail. As a team, we are results-oriented and rely on good communication to achieve success.
You have experience in the following technologies:
• 2+ years of experience with open-source monitoring and observability tooling and integration
• Time Series Databases (TSDB) - InfluxDB, Prometheus, VictoriaMetrics
• Elastic Stack (Elasticsearch, Logstash, Kibana, Beats)
• Full proficiency with the Linux command-line environment
• Scripting experience in Bash; Node.js is a big plus
• Monitoring protocols/formats – Prometheus exposition format, InfluxDB line protocol, SNMP
• Building software using Jenkins, TeamCity, GitLab CI
• Git and version control software
• Virtualization tools (Proxmox, VMware)
• Database experience is a big plus (MariaDB, MySQL)
• Cloud services (Google Cloud, AWS, Azure, etc.)
• Containerisation experience (Docker, Kubernetes)
• Middleware (Kafka)
Our Site Reliability Engineering and DevOps team ensures 24/7 coverage of our systems, and as part of the team you'll take part in the on-call shift schedule.