Golang Job: Site Reliability Engineer - Plano, TX Hybrid - Pos

Job added on

Location

Plano, Texas - United States of America

Job type

Full-Time

Golang Job Details

We are looking for a Contract Site Reliability Engineer to join our team. The successful candidate will help ensure the reliability, availability, and performance of our systems. You will work alongside developers and operations teams to implement solutions that enhance our infrastructure and support our applications while being cost-conscious.

Responsibilities:

  • Assist in the design and implementation of reliable and scalable systems using Kubernetes, Docker, and Istio.
  • Proactively identify performance improvements in areas such as responsiveness, availability, and scalability.
  • Monitor system performance and respond to incidents as they arise, utilizing Datadog for observability.
  • Help develop automation scripts for deployment and monitoring.
  • Leverage GitOps to ensure that software can reliably and smoothly be shipped to production.
  • Collaborate with development teams to identify and resolve reliability issues.
  • Conduct load testing to verify that systems can handle expected loads for new products and updates to existing products.
  • Implement A/B deployments, canary deployments, and traffic mirroring strategies to ensure critical updates go smoothly and can be rolled back easily if necessary.
  • Utilize Helm charts for application deployment and management.
  • Understand AWS systems, including AWS Load Balancers, EKS and routing, to support systems handling millions of requests per hour.
  • Ensure that solutions are cost-effective while providing a high-quality customer experience and maintaining very high availability.
  • Participate in on-call rotations and support production systems, collaborating with SREs in other parts of the world.
  • Contribute to documentation and knowledge sharing within the team.
  • Assist in the implementation of best practices for system reliability.

Required Qualifications:

  • 2+ years of experience in Site Reliability Engineering, DevOps, or a related field.
  • Familiarity with AWS.
  • Familiarity with Kubernetes, Docker, and Istio.
  • Basic knowledge of monitoring and alerting tools such as Datadog, AppDynamics, ELK, Grafana, or Prometheus.
  • Ability to implement and tune Horizontal Pod Autoscalers (HPAs) to optimize resource utilization.
  • Understanding of Argo CD for GitOps practices.
  • Familiarity with A/B, canary, and blue/green deployments, as well as traffic mirroring techniques.
  • Understanding of infrastructure automation and configuration management tools such as Terraform, Ansible, or equivalent.
  • Awareness of cost management in cloud environments and the ability to balance cost with performance and reliability.
  • Demonstrates advanced problem-solving, troubleshooting, and decision-making skills.
  • Ability to learn and apply new technologies, programming practices, patterns, and methods.
  • Ability to work independently and take ownership of tasks/assignments while driving them to completion.
  • Organized and detail-oriented.
  • Ability to develop healthy working relationships and collaborate with peers and leaders.
  • Exhibits integrity and high standards in work quality.
  • Excellent verbal and written communication skills.
  • Proficiency in Golang or Rust is a plus but not required.