Service Management

While product management focuses on what you build, service management focuses on how you deliver and support it. For many organizations, the product is the service, and the distinction blurs. Regardless of whether you think of your offering as a product or a service, someone needs to be responsible for how it operates in the hands of customers.

Service Design

Service design is the practice of planning and organizing the people, processes, and technology behind a service to improve the experience for both customers and the organization delivering it.

Customer journey mapping - Trace the end-to-end experience from the customer's perspective. Where do they interact with the service? Where are the friction points?
Blueprinting - Map the front-stage interactions (what the customer sees) to the back-stage processes (what happens behind the scenes). This reveals dependencies and potential failure points.
Touchpoints - Identify every point of interaction between the customer and the service. Each touchpoint is an opportunity to build trust or erode it.

Good service design makes the complex feel simple. The customer should not need to understand the internal mechanics to have a good experience.

Service Level Management

Service level management is about setting expectations and then meeting them.

SLA (Service Level Agreement) - A formal agreement with customers that defines the expected level of service. SLAs typically cover availability, response times, and resolution times.
SLO (Service Level Objective) - An internal target that is typically stricter than the SLA. The gap between the SLO and the SLA provides a margin of safety: the team can miss its own target and still have room to react before breaching the external commitment.
SLI (Service Level Indicator) - The actual measurement used to evaluate whether the SLO is being met. For example, the percentage of requests served within 200ms.

The relationship between these three is straightforward: SLIs measure performance, SLOs set internal targets, and SLAs set external commitments. This framework is central to site reliability engineering (SRE) practice.
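
To make the relationship concrete, here is a minimal Python sketch that computes a latency SLI and compares it against an SLO and an SLA. The 200ms threshold comes from the example above; the target values (99.5% and 99%), the sample data, and the names ServiceLevels, latency_sli, and evaluate are assumptions chosen for illustration, not recommended settings.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevels:
    slo_target: float  # internal objective (hypothetical value below)
    sla_target: float  # external commitment (hypothetical value below)


def latency_sli(latencies_ms, threshold_ms=200.0):
    """Return the fraction of requests served within threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


def evaluate(sli, levels):
    """Classify a measured SLI against the internal and external targets."""
    if sli >= levels.slo_target:
        return "meeting SLO"
    if sli >= levels.sla_target:
        return "below SLO, within SLA"  # inside the safety margin
    return "SLA breached"


# Illustrative data: 1000 requests, 8 of them slower than 200ms.
levels = ServiceLevels(slo_target=0.995, sla_target=0.99)
sample = [120.0] * 992 + [450.0] * 8
sli = latency_sli(sample)
status = evaluate(sli, levels)
print(f"SLI={sli:.3f}: {status}")
```

With this sample the SLI falls between the two targets, which is the situation the safety margin exists for: the internal objective is missed, but the external commitment still holds.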

An important principle: do not set SLAs that you cannot measure. If you cannot track an SLI reliably, you cannot know whether you are meeting your commitments.

Incident Management

Incidents are unplanned disruptions to a service. Incident management is the process of restoring normal service as quickly as possible while minimizing impact.

Key practices include:

Detection - Monitoring and alerting systems should detect incidents before customers report them. If customers are your primary detection mechanism, your monitoring needs work.
Triage - Quickly assess the severity and impact of an incident. Not all incidents require the same level of response.
Communication - Keep stakeholders and affected customers informed. Silence during an incident is worse than bad news.
Resolution - Focus on restoring service first, then investigate the root cause. The temptation to diagnose during an active incident often delays recovery.
Post-incident review - After the incident is resolved, conduct a blameless review to understand what happened and how to prevent recurrence. The goal is learning, not blame.
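
The triage step can be sketched as a small rule that maps impact to a severity level. The severity names, the inputs, and the thresholds (25% and 5% of customers affected) are hypothetical; a real team would define its own levels and criteria.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "all hands, immediate response"
    SEV2 = "on-call team, urgent response"
    SEV3 = "handle during working hours"


def triage(affected_fraction, core_function_down):
    """Map incident impact to a severity level using illustrative thresholds.

    affected_fraction: share of customers affected, between 0 and 1.
    core_function_down: True if a core capability of the service is unavailable.
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    if core_function_down and affected_fraction >= 0.25:
        return Severity.SEV1
    if core_function_down or affected_fraction >= 0.05:
        return Severity.SEV2
    return Severity.SEV3


# A broad outage of a core function warrants the strongest response.
print(triage(0.40, core_function_down=True).name)
# A minor glitch affecting few customers does not.
print(triage(0.01, core_function_down=False).name)
```

Writing the rule down, even in a form this simple, means the severity decision does not depend on who happens to be on call.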

The measure of a good incident management process is not whether incidents happen, but how quickly and effectively the team responds when they do.
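
One way to quantify how quickly a team responds is to track the mean time to detect and the mean time to restore across incidents. The sketch below assumes each incident record carries three timestamps and measures both durations from the start of the disruption; the Incident type, the field names, and the sample data are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the disruption began
    detected: datetime  # when monitoring (or a customer) raised it
    restored: datetime  # when normal service resumed


def mean_minutes(durations):
    """Average a sequence of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def response_metrics(incidents):
    """Compute mean time to detect and to restore, both from incident start."""
    return {
        "mean_time_to_detect_min": mean_minutes(
            i.detected - i.started for i in incidents
        ),
        "mean_time_to_restore_min": mean_minutes(
            i.restored - i.started for i in incidents
        ),
    }


# Illustrative data: two incidents with made-up timings.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=34)),
    Incident(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=56)),
]
metrics = response_metrics(incidents)
print(metrics)
```

Averages hide outliers, so a team might also look at the slowest incidents individually rather than relying on the mean alone.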

Problem Management

Where incident management focuses on restoring service, problem management focuses on identifying and eliminating root causes to prevent incidents from recurring.

Reactive problem management - Investigating recurring incidents to find common root causes.
Proactive problem management - Analyzing trends, near-misses, and known weaknesses to address problems before they cause incidents.
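
Reactive problem management often starts with a simple count: which causes keep showing up across incidents? The sketch below assumes each past incident has already been tagged with a cause label during its post-incident review; the labels and the recurring_causes helper are hypothetical.

```python
from collections import Counter


def recurring_causes(incident_causes, min_occurrences=2):
    """Return (cause, count) pairs seen at least min_occurrences times,
    most frequent first."""
    counts = Counter(incident_causes)
    return [
        (cause, n)
        for cause, n in counts.most_common()
        if n >= min_occurrences
    ]


# Illustrative cause labels taken from six past incidents.
causes = [
    "expired certificate",
    "disk full",
    "expired certificate",
    "bad deploy",
    "disk full",
    "expired certificate",
]
candidates = recurring_causes(causes)
for cause, n in candidates:
    print(f"{cause}: {n} incidents")
```

Causes that recur are candidates for a problem investigation; one-off causes stay with incident management unless they reappear.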

The distinction matters. Incident management treats the symptoms. Problem management treats the disease.

Change Management

Changes to production services are one of the most common causes of incidents. Change management is the process of coordinating changes to minimize risk and disruption.

Change assessment - Evaluate the risk and impact of each change before implementing it.
Change scheduling - Coordinate the timing of changes to avoid conflicts and reduce risk. Avoid deploying changes on Fridays unless you want to work on weekends.
Rollback planning - Have a plan for reverting a change if something goes wrong. If you cannot roll back, the change is higher risk than you think.
Automation - Automated deployments tend to be more reliable and repeatable than manual ones, because every run follows the same steps. This connects directly to DevOps practices and continuous delivery.
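
The practices above can be combined into a lightweight change assessment. This sketch scores a change higher when it touches more services, lacks a rollback plan, or is deployed by hand, then maps the score to a review level. The Change fields, the weights, and the thresholds are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Change:
    description: str
    services_touched: int    # how many services the change affects
    has_rollback_plan: bool  # can the change be reverted?
    automated_deploy: bool   # deployed by a pipeline rather than by hand?


def risk_score(change):
    """Score a change; higher means riskier. Weights are illustrative."""
    score = change.services_touched
    if not change.has_rollback_plan:
        score += 3  # a change that cannot be rolled back is riskier
    if not change.automated_deploy:
        score += 2  # manual steps are harder to repeat exactly
    return score


def review_level(change):
    """Map a risk score to the amount of review the change should get."""
    score = risk_score(change)
    if score >= 5:
        return "full review"
    if score >= 3:
        return "peer review"
    return "standard change"


routine = Change("bump library version", 1, True, True)
risky = Change("manual schema migration", 2, False, False)
print(review_level(routine))
print(review_level(risky))
```

Routing low-risk changes through a lighter path is one way to keep change management fast while reserving scrutiny for the changes that need it.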

The goal of change management is not to slow things down. It is to ensure that changes are made safely. Organizations that are good at continuous delivery have made change management both fast and safe.

Continual Improvement

No service is ever perfect. Continual improvement is the practice of regularly reviewing service performance and making incremental changes over time.

Service reviews - Regularly review service metrics, customer feedback, and incident trends. Look for patterns, not just individual data points.
Retrospectives - Bring the team together to reflect on what is working and what is not. This practice is borrowed from agile and applies equally well to service management.
Small changes - Prefer small, incremental improvements over large, infrequent overhauls. Small changes are easier to implement, easier to measure, and easier to roll back if they do not work.

Good service management is often invisible. When it works well, customers simply experience a reliable, well-supported service. When it fails, the impact is felt immediately.
