About the company:
Acclaim is a voice-first AI customer experience (CX) platform purpose-built for regulated industries including banking, fintech, healthcare, and insurance.
Requirements:
- 5+ years of SRE/DevOps experience
- Deep understanding of Docker and Kubernetes
- Experience with Prometheus, Alertmanager, and Grafana
- Knowledge of SLIs/SLOs and incident management
- Python coding skills for automation tasks
- Cloud experience with GCP and/or AWS
- Strong Linux skills and networking knowledge
- Familiarity with CI/CD and infrastructure as code
Responsibilities:
- Own the reliability of our services: define and track SLIs/SLOs, maintain availability, and identify and eliminate bottlenecks across the system.
- Set up service monitoring: metrics, alerts, and dashboards.
- Build and maintain Grafana dashboards.
- Run load testing, analyze results, and provide recommendations.
- Investigate incidents, participate in on-call rotations, and write postmortems.
- Work closely with developers and mentor colleagues.