Empowering Users through Site Reliability Engineering

May 23, 2019 · 11 minute read

When..

was the last time you got excited about new technology? What was it?

Was it a programming language?

Rust, Go, Node.js?

Maybe a new framework?

Angular, React?

Or was it something like serverless or containers?

Azure functions, Docker, Kubernetes?

Think back on what it was.

Do you have something in your mind?

Ok, now …

How did you learn about it?

Likely from a friend or colleague introducing it to you, right? Maybe they sent you a link to “check this out”. Or maybe you saw someone on Twitter talking about it. Someone you trust or respect enough to follow. Maybe it came from a blog post you read. ProductHunt? Reddit? Dev.to?

Regardless of where you initially heard about it, it more than likely came from a “trusted” source. A person or brand that you have established a relationship with. You have confidence in their opinions.

In most cases we get excited about something because someone we know and trust is excited about it.

At any given moment, someone is sharing something that they suddenly can’t live without - with someone else.

Whether face to face or in blogs, reviews, discussions, tweets, comments, updates, texts, photos - people talk. There are many different forms of “word of mouth”.

How does that happen? What inspires those recommendations?

What compels you to share something with someone else?

All the Feels

It’s not really about the product, or the company, or the brand.

In fact, a lot of the new stuff we love comes from a startup or unknown company. The underdog. The new kid.

Users haven’t had a chance to formulate an opinion on the company or brand providing the service yet. It’s not about how the user feels about us as the provider.

It has nothing to do with us.

It’s about how the user feels about themself.

Shifting Self Perception

How does the service we are building change the way they perceive themselves? Does it make them better at something?

What does our product or service help them do and be?

How does it transform them? What is enabled?

This is what we need to discover.

Successful Users

“I can’t believe you were able to do that with that thing. You’re amazing!”

We want to build products, services, and support in ways that inspire users to talk about themselves. We want them to be proud of themselves and what they are capable of.

Instead of looking for common attributes across successful products, we must look for common attributes across successful users of those products. - Kathy Sierra (Badass: Making Users Awesome - O’Reilly Media)

Where you find sustained success driven by recommendations, you find smarter, more skillful, more powerful users. Users who know more and can do more in a way that’s personally meaningful.

Fans of new tech don’t share it with their friends because they love a company. They tip off their friends because they like their friends.

It’s about what they can do or be as a result of what our product, service, experience enables. Sustained bestsellers help their users achieve results.

All of my digital devices contain dozens of apps that enable me to be a better version of me. A more creative one. A more rested one. One who can travel around new cities with ease. One who (although often forgetful) always knows which flights and hotels they are supposed to be in.

It’s about the objectives of our users. We must care more about what they are doing, attempting to do, failing at doing, etc.

An Unlikely Ally

Many of these concepts are currently being amplified through Site Reliability Engineering principles.

Reliability means a lot of things but at the end of the day, something is NOT reliable if it can’t deliver on the promise to enable user results.

If a user could communicate what they were thinking or feeling while they used our product, that would be helpful, right?

How do we find out?

We move monitoring closer to the user.

For a deep dive on this idea, I highly recommend reading “Practical Monitoring” (O’Reilly) by Mike Julian.

Monitoring For What Users Care About

Traditionally, we monitored our systems for problems related to the underlying technology. We installed agents on servers to report the health and performance of things such as CPU, memory, and disk space back to a central console.

A problem with any of these (or many other component-level items) would likely affect the applications running on those servers. If apps can’t perform, databases can’t save, and tasks can’t complete, users are impacted.

Known Knowns

It’s a logical cause and effect scenario. It makes perfect sense that if there is a problem with the tech that supports the app, the app won’t work and the user won’t be able to achieve their expected results. Everybody knows that.

Those are the “known knowns” of our system. We know that if a disk is full, the database can no longer write.

In the cloud world, we can easily set up automation to address the known knowns .. since we know about them and we know what countermeasures should be taken to avoid the problem. For example, if disk space runs low, we simply acquire more storage. We can even automate the process of scaling resources back down on demand.
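As a rough sketch of that kind of countermeasure, here’s roughly what the check could look like. The expand_volume() function and the 80% threshold are hypothetical stand-ins for whatever volume-resize API and policy your cloud provider and team actually use:

```python
import shutil

# In practice this would point at the volume your database writes to.
DISK_PATH = "/"
EXPAND_THRESHOLD = 0.80  # assumed policy: grow the disk once it's 80% full


def expand_volume(path, extra_gb=50):
    """Hypothetical stand-in for your cloud provider's volume-resize API."""
    print(f"Requesting {extra_gb} GB more storage for {path} ...")


def check_disk_and_scale():
    usage = shutil.disk_usage(DISK_PATH)  # stdlib call: returns total, used, free bytes
    used_ratio = usage.used / usage.total
    if used_ratio >= EXPAND_THRESHOLD:
        # Known known: a full disk means the database can no longer write,
        # so take the countermeasure before that happens.
        expand_volume(DISK_PATH)


if __name__ == "__main__":
    check_disk_and_scale()
```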

Known Unknowns

But as practitioners and architects building and operating systems, we’ve all seen some weird stuff go down. Stuff we didn’t even know was possible. As a result, we now have the wisdom to continuously remind ourselves (and our teammates):

“Don’t get too comfortable. It’s coming”.

Those are the “known unknowns” of our system. We know that there are things we DO NOT know (i.e. unknown).

Like life below the surface of the ocean, we know so little.

But at least we are aware of our own ignorance.

Anomaly detection and finely tuned monitoring, logging, and eventing tools can tip you off to these types of problems. First responders are alerted to an incident along with whatever context is available at the time.

In many cases, not much is known in terms of possible causality, but the system (or parts of it) is not currently healthy and users are likely experiencing some disruption in service. We may not in fact be sure that they are experiencing downtime, but we often feel it’s safe to assume.

The fact is, we don’t know, because this is something we’ve never seen before. But combined with the information we are able to obtain by querying and investigating different aspects of the system, we have strong reason to believe users know there is a problem.

That leaves us with the trickiest of them all - “unknown unknowns”.

Unknown Unknowns

You don’t know what you don’t know!

There are things about your system, and particularly your users, that you are completely blind to.

Why?

There’s no monitoring for it.

Example: An online retailer missed out on tens of thousands of dollars from orders placed on Black Friday simply because the cart checkout wasn’t accepting American Express. Operations teams were monitoring for failed components of the credit card system but had no idea it “suddenly” couldn’t handle Amex.

Users didn’t complain through support or social media. They just abandoned their cart.

Abandoned carts are pretty common. But are you monitoring for it?

Which has a bigger impact to business value?

Customers abandoning their carts at checkout. Leaving frustrated and without making a purchase …

or …

CPU utilization on the virtual machines running these apps hitting 99%?

How would we monitor for something like this? What type of feedback do we need to give us a better sense of what it’s actually like for a user?

Creating Virtuous Feedback Loops

In site reliability engineering, we use what are known as Service Level Indicators and Service Level Objectives as a way of monitoring and alerting on our systems.

Indicators are time-based values we calculate as simple ratios.

For example:

Availability of a service may be calculated as the number of successful requests received by the load balancer divided by the total number of requests (successful AND failed).

Over the course of an hour, the load balancer may see 100,000 total requests come through. Of those, 1,000 failed. This means 99,000 requests were successful. 99,000 over 100,000 is 99% available - as measured by the load balancer for the last hour.
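That arithmetic is simple enough to show in a few lines of code. A minimal sketch, using the made-up request counts from the example above:

```python
def availability_sli(successful_requests, total_requests):
    """Availability SLI: the fraction of requests that succeeded over a window."""
    if total_requests == 0:
        return 1.0  # no traffic in the window; treat as fully available
    return successful_requests / total_requests


# The made-up numbers from the example above:
total = 100_000   # requests seen by the load balancer in the last hour
failed = 1_000
sli = availability_sli(total - failed, total)
print(f"Availability over the last hour: {sli:.2%}")  # -> 99.00%
```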

The time and source are both key to building a service level indicator (SLI).

Check out: Site Reliability course on Microsoft Learn for more about SLIs and SLOs.

A service level objective is our line in the sand, so to speak. What target are we setting that, if breached, means we need to alert a human?

In our shopping cart example, we may have a service level indicator that measures the total number of abandoned carts (as measured by the credit card processing system) over the course of 30 minutes.

As an objective, we need to determine how many abandoned shopping carts over the course of 30 minutes we are comfortable with.

How many until a person is alerted to an issue to begin troubleshooting?

Hint: It’s NOT 1.

Many would say we don’t want anyone to abandon a cart, no matter what time period you are looking at. And maybe that’s the goal you are setting. The objective.

As someone who has been on-call most of their life, I’ll tell you that the chances of you paging an engineer quickly enough for them to respond and remediate whatever is contributing to this Amex problem BEFORE the user notices and changes their mind… are pretty close to zero.

You might be thinking… “Sure, we lost one person. But I want someone to investigate immediately so that a second one doesn’t happen.”

Just because it happened once doesn’t necessarily mean we have a problem either.

There are lots of reasons why people may abandon a shopping cart, even if coincidentally all of them were attempting to use an American Express credit card.

We need to make data driven decisions.

Remember the unknown unknowns?

We don’t know what the problem is (yet), only that something is wrong enough that we want to get a person involved once our line in the sand has been crossed. That’s our service level objective.
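As a minimal sketch of that line in the sand, here is roughly what the check might look like. The 25-cart threshold is an assumed objective, and count_abandoned_carts() and page_on_call() are hypothetical placeholders for your own data source (the credit card processing system, in this example) and paging integration:

```python
from datetime import datetime, timedelta

# Hypothetical objective: no more than 25 abandoned carts in any 30-minute window.
ABANDONED_CART_SLO = 25
WINDOW = timedelta(minutes=30)


def count_abandoned_carts(since):
    """Hypothetical query against the credit card processing system's data."""
    return 40  # stubbed value so the sketch runs


def page_on_call(message):
    """Stand-in for whatever paging integration you actually use."""
    print("ALERT:", message)


def check_slo():
    abandoned = count_abandoned_carts(since=datetime.now() - WINDOW)
    if abandoned > ABANDONED_CART_SLO:
        # The line in the sand has been crossed: get a human involved.
        page_on_call(f"{abandoned} carts abandoned in the last 30 minutes "
                     f"(objective: no more than {ABANDONED_CART_SLO})")


if __name__ == "__main__":
    check_slo()
```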

This is our feedback loop.

By placing more focus on what it’s like for the end user as they attempt to achieve results (like buying something with an American Express credit card), we have a much better understanding of what it’s like to use our system. Are we delivering on the value? Are we monitoring for the things that actually give us some sort of indication as to what’s going on from their perspective, or are we focusing too much on the minutiae and noise (the old known knowns and known unknowns) that make up the guts of our services?

Increasing Observability

Our goal is to protect the reliability of business value .. whatever that may mean to your business.

In order to do so we must first identify the true value we are aiming to deliver and then seek out ways to give us high fidelity on what state that value is in at any given time (service level indicators).

Once we know why people use our services we then establish a threshold (service level objectives) that when breached informs software engineers to take action.

What engineering effort is needed to fill in the current blindspots?

What does your incident response plan look like? Do you even have one?

Is there urgency in your response to recovering from a degradation in service, or does your process involve creating tickets that are sent to a support group that has never seen, let alone touched, a line of code in your system?

Users DO NOT care about your tech stack. They DO NOT care what cloud you are using or even if the service you provide runs on a Raspberry Pi in the back of a closet in your parents’ basement. They just want to use your service to do a thing. A thing that makes them awesome!

Final Thoughts

Establish service level indicators and objectives that closely match the true value of the service you intend to provide. This may take some engineering effort.

Plan and schedule the work. It should be part of your engineering sprint. It’s just as important as (actually, more important than) any new feature you are working on, so it needs to be treated as such.

Discuss what it will take to increase observability and formulate a standardized response plan for WHEN things aren’t meeting the expectations of your users.

And .. most important .. focus on continuously improving your monitoring efforts. You’ll never be able to have a pulse on every aspect of a complex system, but you can create ways to surface clues that something isn’t quite right, so you can investigate and possibly mitigate problems before users are impacted and take their money to another provider who can “reliably” make them awesome.