Reliability Engineering is not a job.
It’s a culture.
With a long history of operations, automation, orchestration, and scripting, my career is now all about reliability engineering. This starts with engineering for tooling, automation, self-healing, self-scaling, and deep-dive investigations of faults.
From that, it grows to a culture of reliability, strategizing with app developers on production readiness, incident command and incident lifecycle, DevOps philosophy, and blameless culture.
I want to be held accountable for your reliability story and assist everyone from engineering interns to CTOs in creating a culture of reliability.
Recruiters and hiring managers: Please read “Career Path” before contacting me, as it will answer a lot of questions.
Senior Site Reliability Engineer
Slack
San Francisco
August 2018 - August 2022
Reliability by Design and Reliability by Culture focused engineering, writing, and presentation. Highlights include system configuration automation with Chef and Puppet (most recently Chef), with custom roles and modules and feature flagging. AWS management with and without Terraform for EC2, SQS/SNS, S3, and more. Migrating to Docker and Kubernetes and assisting developers in the transition to stateless services. Homemade tooling and scripts (primarily Golang, Python, and shell) for developers to run their own operational tasks. Extensive use of Slack, Slack bots, integrations, and best practices. On-call response for mission-critical production infrastructure. Participating in and conducting production readiness reviews and planned disaster response exercises, and helping fix issues that arose from those exercises. Participation in the global major incident commander pager rotation, which required broad knowledge of system and team responsibilities, assigning tasks, and preparing frequent reports on major incidents for both customer service and executives. Participation in shaping the culture of reliable engineering, including running incident reviews (postmortems) and follow-ups, writing and presenting the blameless culture policy, and mentoring other incident commanders and incident review facilitators.
Software Engineer II (Platform/Datacenter)
Microsoft (Yammer)
San Francisco
May 2014 - May 2018
Software Engineer II (more of a platform/operations role) on the Yammer Compute team. Duties included a wide range of datacenter management tasks, including maintaining systems on a custom in-house containerization layer based on LXC and assisting in migrating the platform to Docker-based (and Mesos/Marathon-orchestrated) systems in Microsoft Azure. Tasks were heavy on debugging, scripting, Puppet, HAProxy, Ubuntu, monitoring and metrics, provisioning and resource management, and building custom tools for automatic volume encryption, DNS synchronization between providers, and all the various nuts and bolts that make large-deployment Linux environments challenging. Very heavy in bash/awk scripting, with lighter solutions in Python (mostly for parsing data from JSON endpoints) and Ruby (mostly within the context of Puppet).
Prior to 2014
Years of Linux systems administration, with an understanding of Linux fundamentals and deep-dive investigations. Principles of Internet operations such as DNS, ARP, firewalling, and load balancing. Analytical skills for project planning, problem solving, vulnerability detection, and decision making. Creative problem solving and an understanding that solutions are a balance of technical needs and people needs.