Senior Site Reliability Engineer (Remote)
Pear Deck is an Ed Tech startup headquartered in Iowa City, Iowa, with remote team members around the country. We're driven by a mission to help teachers deliver powerful learning moments to every student, every day.
We are building a team of individuals who value inclusion, work according to our core values of truth, brilliance, humility, and determination, and are excited to apply their talents to creating something meaningful together. If you like the idea of being part of a mission-driven company working on big problems in education, join us!
We embrace diversity and invite applications from people of all walks of life. We don't discriminate against employees or applicants based on gender identity or expression, sexual orientation, race, religion, age, national origin, citizenship, disability, pregnancy status, veteran status, or any other differences. Also, if you have a disability, please let us know if there's any way we can make the interview process better for you; we're happy to make accommodations.
As a Senior Site Reliability Engineer, you will contribute to Pear Deck's mission by helping us focus our expectations around availability, correctness, and performance, while building tools and sharing expertise with the team to ensure our service continues to meet those expectations as it scales. The work will cover a wide area, from directly improving our core services to on-call duty and incident analysis, education around scaling and resilience, and feedback into the product itself.
- Demonstrate truth, humility, brilliance, and determination in your work
- Forecast demand and plan capacity for continued and/or improved site reliability
- Implement and provision necessary infrastructure changes for continued and/or improved site reliability
- Plan and implement changes to reduce toil
- Read, understand, and review application code to support software development efforts from a reliability / infrastructure perspective
- Monitor health of production infrastructure and investigate/analyse any issues and abnormalities to identify problems or bottlenecks
- Communicate uptime and quality of service issues effectively
- Participate in on-call rotations and incident response during off-hours
- Implement and deploy hotfixes as necessary
- Plan, track, and perform routine system maintenance and software updates to infrastructure
- Track and document reliability-related issues and incidents
- Map business goals to architectural/infrastructure decisions
- Software development experience and understanding of programming languages, data structures and algorithms
- Experience operating Kubernetes clusters
- Experience in large-scale cloud environments
- Excellent troubleshooting/debugging skills
- Willingness to learn about, work with and understand existing systems
- Comfort with a blameless post-mortem culture
- Ability to remain calm and collected under pressure
- Significant experience with the following technologies:
  - Google Cloud Platform
  - Amazon Web Services
  - Kubernetes (CKA or similar certification preferred)
- Experience with the following or related technologies is a plus:
  - Firebase Realtime Database
- 401(k) with company match
- Health, dental, and vision insurance
- Paid holidays and unlimited PTO