Balena

Site Reliability Engineer

Worldwide

kubernetes

linux

docker

typescript

prometheus

3 months

Location: Worldwide

Type: Full-time

What you will do


Our customers trust us to provide critical infrastructure for their distributed IoT fleets, and we work hard to continuously improve the availability, resilience, and efficiency of our systems and services. Our reliability team takes an “Infrastructure as Product” approach and plays a key role in shaping the future of the balena platform. They are part operators and part product builders.


As a member of the team, you will ensure the smooth day-to-day running of the infrastructure powering the large and rapidly scaling “balena fleet”. You will facilitate frictionless deployments to production, develop monitoring solutions, create disaster recovery plans, investigate incidents, and manage outages. You will also be empowered to lead initiatives and develop systematic solutions to high-impact, high-complexity challenges such as building our self-service capabilities – enabling the success of both our product development teams and our end-users.


Responsibilities



  • Identify internal user needs, bottlenecks, and failure patterns in production, and build tools, solutions, and features to allow teams to self-serve, deploy and manage services at scale

  • Implement monitoring systems to collect health data, set error alerts, and increase app behavior visibility

  • Leverage data model definitions to automatically generate code for provisioning reliable infrastructure

  • Support developers with seamless, fault-tolerant deployments and production debugging

  • Conduct load tests to ensure applications are ready to handle projected traffic

  • Respond to incidents, drive blameless postmortems, and leverage learnings to prevent future issues

  • Participate in on-call rotation and customer support – be a source of reliability advice for peers


Requirements



  • Background in software development, infrastructure, and/or platform operations

  • Experience working with Docker containers and running production-grade Kubernetes clusters

  • Firm grasp of Linux operating system internals (e.g., filesystems, system calls) and networking including common networking failures and mitigations

  • Proficiency in at least one programming language (we mostly use TypeScript)

  • Desire to make yourself and others more effective through documentation and automation

  • Ability to manage ambiguity, push through friction, and solve complex challenges while clearly explaining the tradeoffs

  • Excellent verbal and written communication skills, and fluency in English


Bonus points



  • Experience designing large-scale, distributed systems and server load balancing architectures

  • Experience with modern SRE practices and the Twelve Factor App methodology

  • Conversant with cloud automation, APM, and log management (we use Grafana, Prometheus, Loki)

  • Contributions to OSS projects and community involvement

  • Familiarity with IoT, embedded computing, developer tools, or the balena platform as a user/contributor

  • Background in leading projects and working across functions to build resilient systems


Make sure to let us know if any of these items apply to you!
