- Proactively identify performance improvements in areas such as responsiveness, availability, and scalability.
- Establish best practices for observability, monitoring, and incident response, and drive their adoption across the organization.
- Lead incident response efforts and conduct post-mortem analyses to prevent future occurrences.
- Coordinate with Software Engineering and DevOps teams to design, implement, and maintain scalable and reliable systems using Kubernetes, Docker, and Istio.
- Monitor system performance and troubleshoot issues proactively, utilizing Datadog for observability.
- Implement and tune Horizontal Pod Autoscalers (HPAs) to optimize resource utilization.
- Develop and maintain automation tools for deployment, monitoring, and incident response.
- Collaborate with software engineering teams to improve system reliability and performance.
- Implement A/B deployments, canary deployments, and traffic mirroring strategies to ensure critical updates go smoothly and can be rolled back with minimal impact if necessary.
- Mentor junior engineers and contribute to team knowledge sharing.
- Oversee and coordinate with SREs in other parts of the world, ensuring effective collaboration during on-call rotations.
- Establish and enforce best practices for system reliability and performance across the organization.
- Utilize Helm charts for application deployment and management.
- Understand and implement AWS systems, including AWS Load Balancers and routing, to support systems handling millions of requests per hour.
- Participate in on-call rotations and provide support for production systems.
- 5+ years of production experience working as a Site Reliability Engineer, DevOps Engineer, or Software Engineer
- Demonstrated ability to deliver highly available solutions at scale.
- Demonstrated advanced problem-solving, troubleshooting, and decision-making skills
- Expertise in container, orchestration, and service mesh technologies (Docker, Kubernetes, and Istio) to build, package, and deploy optimized container images
- Expertise in AWS
- Experience with Argo CD for continuous delivery and GitOps practices.
- Proficiency in monitoring and alerting tools such as Datadog, AppDynamics, ELK, Grafana, or Prometheus.
- Familiarity with A/B, canary, and blue/green deployments, as well as traffic mirroring techniques.
- Experience with infrastructure-as-code and configuration management tools such as Terraform, Ansible, or equivalent.
- Demonstrated ability to balance cost considerations with performance and reliability.
- Experience delegating tasks to junior engineers
- Experience in leading initiatives under direction
- Ability to apply systems thinking to understand interdependencies and design solutions that achieve results
- Ability to learn and apply new technologies, programming practices, patterns, and methods
- Experience mentoring, providing technical guidance, and training more junior team members
- Ability to work independently and take ownership of tasks/assignments
- Organized and detail-oriented
- Ability to develop healthy working relationships and collaborate with peers and leaders
- Exhibits integrity and high standards in work quality
- Excellent verbal and written communication skills
- Proficiency in Golang or Rust is a plus but not required.
- Values diversity and differences amongst individuals in interactions
Company
Location
Plano, Texas - United States of America
Job type
Full-Time
Golang Job Details
The team is looking for a Contract Senior Site Reliability Engineer to join our dynamic and fast-paced team. The ideal candidate will have extensive experience managing large-scale, microservice-based systems, ensuring high availability, and implementing best practices in reliability engineering. You will work closely with development and operations teams to enhance our infrastructure and improve system performance while remaining mindful of cost-effectiveness.
Responsibilities:
Required Qualifications: