We are a Berlin-based IoT software company. Our IoT platform enables the simple and fast implementation of industrial IoT applications. We have been offering software "Made in Germany" to national and international customers since 2006. Our IoT solutions are in use in Europe, America and China. We are specialists in digitization and IoT, and our TÜV-certified IoT platform is already used in a wide range of industries. Various research projects and long-term OEM partnerships in the industry underline our competence and expertise in solving the most demanding IoT projects. With our open source edge software and brand-new IoT stack, we are going one step further to make the Internet of Things possible for every business.
For more information: www.azeti.net
These strange times demand a lot of everyone. We provide a secure, long-term work environment for our team and are looking for team members who will strengthen it for years to come. Besides having a stable business, we are backed by Aurubis AG and are a secure employer for your next career step.
- Stable and secure work environment
- Flexible working hours with consideration for your family matters (homeschooling, daycare, etc.)
- Up to 100% Home Office possible
- Free choice of hardware (Linux, Mac, Windows) and equipment for your home office
- Above-average salary and awesome career development possibilities
- Agile work in a diverse international team, flat hierarchies
- Direct impact in product development with immediate effect for our customer base
- Top-notch technology stack with exciting developments already lined up
- Urban Sports Club membership, fully paid by us
- Bike leasing by business-bike or jobrad
- Corporate Benefits program with nice discounts
- Continuous learning and 100 € Udemy budget per month
Join our Operations team as our first Site Reliability Engineer (SRE) and help us take the operation of our multi-cloud IoT platform to the next level with regard to resilience, observability, incident management and automation.
You will take ownership of our tools for monitoring, incident management and observability, and advise our engineering teams on best practices for building resilient modern applications. Together with our Ops and DevOps engineers, you'll work on automating and extending our CI/CD pipelines.
50% of your time will be dedicated to new developments necessary for good SRE/DevOps practices, and 50% will be dedicated to our daily operations: deploying infrastructure, solving issues and participating in the on-call rotation. Things will break, and you'll support the team by facilitating post-mortem reviews and optimising our documentation. You will keep a close eye on recurring alerts and create runbooks for common mitigation tasks together with our engineers.
Our tech stack and ongoing initiatives
You'll join us in the midst of our migration from Docker to automated (Ansible) Kubernetes deployments (Helm) with GitLab CI, following CI/CD best practices. Our IoT platform is cloud-agnostic, and we run it as a service for our customers in their cloud environments (Azure, AWS). Our platform includes an RDBMS, a time series database, Java/Kotlin and React apps, plus message broker infrastructure. Our goal is to further automate all deployments, with observability built into our applications, and to extend resilience in our application stack.
Your first challenge
Facilitate a chaos test with our engineers and start a tradition of chaos engineering, so we're on top of potential weak spots and continuously improve our infrastructure and tooling.
Are you an experienced software engineer who wants to move into SRE but doesn't yet have all the required skills (see below)? If you have the right mindset and core knowledge (patterns, methods, tooling), even if you have not yet applied them at large scale, let's talk. SRE is a mindset and we believe that bright minds learn fast. If you're structured, a great communicator, well organised and happy to take on responsibility, then this might be your chance to move into SRE.
You've got experience in SRE best practices and in implementing them. Incident management, observability tooling, and automation with Ansible and Kubernetes are your strong fields. You have a strong mindset for collaboration and for encouraging modern SRE and DevOps best practices within teams. You communicate clearly and advise engineers on using new methods and applying patterns, e.g. chaos testing, runbooks or post-mortem reviews.
- Strong understanding and ideally experience in SRE patterns and best practices
- Experience in incident management and software observability (monitoring); Opsgenie is our tool of choice
- Experience in monitoring of Kubernetes environments with Prometheus
- Extensive experience in infrastructure automation with Terraform and Ansible
- Strong understanding of container orchestration with Kubernetes
- Strong collaboration and communication skills and a structured work style
- Organised workflow that includes sharing knowledge, especially as runbooks
- Experience in infrastructure operations within AWS or Azure
- Good understanding of CI/CD and its respective patterns
- Strong Linux & networking troubleshooting skills
- Fluency in English, especially in writing
These are a plus:
- Existing work permit for Europe, ideally Germany
- Experience with SLI, SLO patterns for engineering teams
- Experience in chaos and resilience testing for modern cloud applications
- Experience with pub/sub broker systems, e.g. VerneMQ, ActiveMQ
- Experience with Grafana, Fluentd and Opsgenie
- Experience with Helm charts for k8s deployments
- Experience with Ansible AWX
- Experience in Scrum-based teams and agile work
- Experience in build automation for Java, Kotlin, Spring, ReactJS applications
It's a plus if you live in or close to Berlin. We encourage remote work but hope to meet again in our Berlin office regularly.
Please submit your application on our career page or send it to [email protected]. Your main contact is Sebastian Koch.