Job Description:
We are looking for a passionate and experienced Site Reliability Engineer (SRE) with a strong focus on observability to join our Data Center Exit Program. You will be responsible for building and maintaining reliable systems, ensuring high availability, and improving the performance of our infrastructure. You will also play a critical role in designing, implementing, and managing observability solutions that provide deep insight into our systems and applications.
Key Responsibilities:
Design, implement, and maintain observability solutions, including monitoring, logging, and tracing systems.
Develop and maintain dashboards, alerts, and reports to monitor system performance and health.
Collaborate with development and operations teams to integrate observability into the software development lifecycle.
Troubleshoot and resolve issues related to system performance, reliability, and scalability.
Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
Continuously evaluate and improve observability tools and practices to ensure they meet organizational needs.
Document observability processes, best practices, and guidelines for the team.
Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
5+ years of experience as a Site Reliability Engineer or in a similar role with a focus on observability.
Strong experience with Java/J2EE and microservices.
In-depth knowledge of observability tools and technologies such as Prometheus, Grafana, the ELK Stack, and Jaeger.
Proficiency in scripting and automation using languages such as Python or Bash.
Experience with cloud platforms (GCP, Azure) and containerization technologies (Docker, Kubernetes).
Excellent problem-solving skills and the ability to work under pressure.
Strong communication and collaboration skills.
Preferred Qualifications:
Experience with Infrastructure as Code (IaC) tools such as Terraform or Ansible.
Familiarity with CI/CD pipelines and DevOps practices.
Knowledge of security best practices and compliance requirements.
SRE (Site Reliability Engineer)