About the Company
Arcules is changing the video surveillance market and moving customers to a smarter, more reliable cloud-based solution. Our company is a technology spin-out of Canon Inc. and has seed technology from the Milestone video surveillance division. We are targeting mid-market enterprises looking to interconnect their business locations. Our technology brings video and smart building elements together, and our analytics help customers make more informed decisions. Our go-to-market strategy leverages systems integration companies to get our solutions to end customers.
Arcules offers excellent benefits, including a top-tier PPO medical plan, four weeks of vacation, three weeks of sick leave, a 401(k) plan after three months of employment (4% company match), an on-site gym and game pavilion, an awesome work environment and more.
Overview of the Job
The Site Reliability Engineering team at Arcules provides leadership, direction and accountability for platform architecture, system design and end-to-end implementation to meet and exceed the product non-functional requirements including quality, security, reliability, availability and performance.
A Site Reliability Engineer (SRE) on our dynamic SRE team will focus on using software engineering to enable automation and efficiency in all aspects of platform change management and operations. The Staff SRE will write, test and deploy software to optimize the day-to-day activities that support product rollout and operational reliability.
You will have opportunities to work across a spectrum of DevOps challenges and implementations to enhance developer workflow and production stability. With guidance and direction from senior team members, you will implement technology solutions to maximize the performance and availability of our environment.
We are open to remote work for this role (US-based).
Responsibilities
- Automation: implement orchestration and tooling solutions to ensure that repetitive administration tasks are performed efficiently and free of defects
- Build and implement monitoring and recovery tools to provide for site high availability (HA) and disaster recovery (DR)
- Measure and monitor availability and overall system and environment health
- Build continuous integration/continuous deployment (CI/CD) pipelines and templates to support the deployment and release process
- Deploy releases to QA and production environments according to the release plan
- Ensure that releases are deployed to production correctly and reliably as planned
- Assist in troubleshooting and root cause analysis
- Fully document a playbook of deployed solutions, and write postmortem reports as part of incident response management, either as an incident responder or when assigned the incident captain role
- Other tasks as assigned
Qualifications
- 2-4 years of experience in software engineering, infrastructure design, system engineering or QA/testing automation
- Demonstrable experience with testing methodologies and test automation frameworks
- Familiar with SRE methodologies and passionate about solving operational problems through automation and software engineering
- Ability to manage competing priorities and work well under pressure
- Ability to communicate effectively vertically and horizontally within the organization via demonstrated written and verbal communication skills
- Intermediate level of Linux/Unix skills
- Experience working with Google Cloud preferred; experience with other public cloud providers will also be considered
- Intermediate coding skills in at least one modern programming language: Python, Go, Ruby or Java
- Familiar with at least one configuration management tool (Ansible, Salt or Puppet) or a Kubernetes configuration tool such as Helm
- Experience with software release tooling (Git, Jenkins, Spinnaker or other cloud-specific release tools)
- Fully comfortable with Kubernetes administration and operation tasks