Reliability Engineering is not a job.

It’s a culture.

With a long history of operations, automation, orchestration, and scripting, my career is now all about reliability engineering. This starts with engineering for tooling, automation, self-healing, self-scaling, and deep-dive investigations of faults.

From that, it grows to a culture of reliability, strategizing with app developers on production readiness, incident command and incident lifecycle, DevOps philosophy, and blameless culture.

I want to be held accountable for your reliability story and assist everyone from engineering interns to CTOs in creating a culture of reliability.

Recruiters and hiring managers: Read “Career Path” before contacting as it will answer a lot of questions.

Senior Site Reliability Engineer

Slack

San Francisco

August 2018 - August 2022

Engineering, writing, and presenting focused on Reliability by Design and Reliability by Culture. Highlights include system configuration automation with Chef and Puppet (most recently Chef), with custom roles, modules, and feature flagging. AWS management with and without Terraform for EC2, SQS/SNS, S3, and more. Migrating to Docker and Kubernetes and assisting developers in the transition to stateless services. Homemade tooling and scripts (primarily Golang, Python, and shell) for developers to run their own operational tasks. Extensive use of Slack, Slack bots, integrations, and best practices. On-call response for mission-critical production infrastructure. Participating in and conducting production readiness reviews and planned disaster response exercises, and helping fix issues that arose from those exercises. Participation in the global major incident commander pager rotation, which required broad knowledge of system and team responsibilities, assigning tasks, and preparing frequent reports on major incidents for both customer service and executives. Participation in shaping the culture of reliable engineering, including running incident reviews (postmortems) and follow-ups, writing and presenting the blameless culture policy, and mentoring other incident commanders and incident review facilitators.

Software Engineer II (Platform/Datacenter)

Microsoft (Yammer)

San Francisco

May 2014 - May 2018

Software Engineer II (more of a platform/operations role) on the Yammer Compute team. Duties included a wide range of datacenter management tasks, such as maintaining systems on a custom in-house containerization layer based on LXC and assisting in migrating the platform to Docker-based (and Mesos/Marathon-orchestrated) systems in Microsoft Azure. Tasks were heavy on debugging, scripting, Puppet, HAProxy, Ubuntu, monitoring and metrics, provisioning and resource management, and building custom tools for automatic volume encryption, DNS synchronization between providers, and all the various nuts and bolts that make large-deployment Linux environments challenging. Very heavy in bash/awk scripting, with lighter solutions in Python (mostly for parsing data from JSON endpoints) and Ruby (mostly within the context of Puppet).

Prior to 2014

Years of Linux Systems Administration with understanding of Linux fundamentals and deep-dive investigations. Principles of Internet operations like DNS, ARP, firewalling, load balancing. Analytical skills for project planning, problem solving, vulnerability detection, and decision making. Creative problem solving and understanding that solutions are a balance of technical needs and people needs.

Great engineering comes from minds that are not hindered by undue stress, distraction, fear, or lack of belonging. To that end, I will advocate for equality and inclusiveness in any workplace. This includes gender equality, respect for every gender identity and orientation, support for all racial backgrounds, support for all disabilities, and recognition that all voices from “fresh intern” to “40 years in tech” have meaningful contributions to make to the team. Mental healthcare is healthcare and should be treated as such.

No team will succeed without the needs of the individuals being met, and no leader should put an OKR goal above a contributor’s health or sense of belonging.