An Overview of the Problem Management Practice in ITIL 4

Written by Erika Flora

Why Problem Management is Important

Great Problem Management can help us pull away from daily firefighting and focus our time, and that of our customers, on more valuable work. Unfortunately, most of us don’t do Problem Management well because we don’t give it the focused attention it deserves. This article covers the highlights of the ITIL 4 Problem Management practice guide, defines some key terminology, reviews some of the steps we can put in place to ensure this practice works well, and talks through some of the new ideas that have been introduced in ITIL 4.

The Purpose of the Problem Management Practice in ITIL 4

Let’s start with the purpose of Problem Management:

Purpose: To reduce the likelihood and impact of incidents by identifying actual and potential causes of incidents, and managing workarounds and known errors.

The ITIL 4 Practice Guide does a nice job of explaining the need for Problem Management:

“No product or service is perfect. Every product or service has errors or flaws that can cause incidents. Errors may originate in any of the four dimensions of service management. For example, a mistake in a third-party contract is as likely to cause an incident as a component failure. Many errors are identified before a product or service goes live and are then resolved during design, development, or testing. However, some errors will remain undiscovered and will proceed to the live environment, and these may cause incidents. To manage errors that have arisen in the live environment, organizations have developed the Problem Management practice. The practice aims to identify and analyse errors in the organization’s products and minimize their negative impacts on the products or services being provided.”

The errors that cause incidents are called problems:

Problem: A cause, or potential cause, of one or more incidents.

The beauty of good Problem Management is that we have a way to identify, understand, and eliminate these errors in our environment quickly and reliably. That way, they don’t continue to return and plague us and our customers, causing outages and other issues.

What is the Difference Between Problem Management and Incident Management?

In short, the Incident Management practice responds to and resolves incidents. It focuses on addressing the “symptoms” and works to get customers up and running again as quickly as possible. The Problem Management practice, however, takes a deeper dive and tries to understand the underlying error.

Three Phases of the Problem Management Practice in ITIL 4

A new Problem Management concept introduced in ITIL 4 is that of the three general phases of activities we go through in this practice, namely:

  • Problem Identification
  • Problem Control
  • Error Control

These phases give a simple way to look at the activities that happen as part of understanding and eliminating problems from our environment, which we will discuss in more detail here.

Phase 1 of the Problem Management Practice in ITIL 4: Problem Identification

In this first phase, we are performing reactive activities like addressing recurring incidents to get at the underlying cause and try to get it resolved. Other reactive Problem Management activities may include:

  • Logging, categorizing, prioritizing, and assigning problem records and linking them to related incident records (as parent/child tickets)
  • Understanding incident symptoms and using that information to determine causality
  • Using correlation tools and people’s knowledge to identify cause

Most organizations tend to do a decent job here. In the Problem Identification phase, however, we should also be looking for ways to do proactive Problem Management, which prevents incidents from happening in the first place. Here, we tend not to do quite as well. Proactive Problem Management activities may include:

  • Working with development teams to understand errors, and making adjustments (implementing workarounds, etc.) to lessen the chances and/or the impact of errors or bugs on customers
  • Working with vendors in the same manner as above
  • Performing trend analysis on incident reports over a period of time
  • Monitoring infrastructure to identify trends, yet-to-be-experienced or reported issues, etc.
  • Identifying ways to prevent future, recurring incidents
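As a rough illustration of the trend analysis mentioned above, here is a minimal Python sketch. The incident data, the category names, and the threshold of three incidents are all made-up assumptions; in practice the records would come from an export of your ITSM tool, and the right threshold depends on your environment.

```python
from collections import Counter
from datetime import date

# Hypothetical incident records: (date opened, category).
incidents = [
    (date(2022, 1, 3), "email"),
    (date(2022, 1, 9), "vpn"),
    (date(2022, 1, 17), "email"),
    (date(2022, 2, 2), "email"),
    (date(2022, 2, 11), "printing"),
    (date(2022, 2, 20), "email"),
    (date(2022, 2, 25), "vpn"),
]


def recurring_categories(records, threshold=3):
    """Return (category, count) pairs with at least `threshold` incidents,
    most frequent first. These are candidates for a problem record."""
    counts = Counter(category for _, category in records)
    return [(cat, n) for cat, n in counts.most_common() if n >= threshold]


def monthly_counts(records, category):
    """Count incidents per (year, month) for one category to show its trend."""
    counts = Counter((d.year, d.month) for d, cat in records if cat == category)
    return dict(sorted(counts.items()))


candidates = recurring_categories(incidents)
print(candidates)
print(monthly_counts(incidents, "email"))
```

A category that keeps showing up month after month is a reasonable candidate for a problem record, even if each individual incident was resolved quickly.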

The more we can stay on top of problems in the Problem Identification phase and find a balance of reactive and proactive Problem Management activities, the better off we will be.

In addition to logging problems in the Problem Identification phase, we will also categorize (or at least determine what category we think a problem falls into) and prioritize them. These steps will help us do better Problem Control (described next) as we want to make sure we’re working on what’s most important to the organization and not wasting time analyzing and working on something that’s not (for example, something that’s impacting one person). There will always be more problems than we have time to handle, so we want to make sure to prioritize our work. Organizing our open problems into a prioritized queue or “product backlog” can help make this step easier.
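One possible way to turn open problems into a prioritized backlog is sketched below in Python. The `Problem` fields, the 1-to-3 importance scale, and the scoring weights are illustrative assumptions, not something prescribed by ITIL; each organization would pick its own characteristics and weights.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    title: str
    service_importance: int  # hypothetical scale: 1 (low) to 3 (critical)
    users_impacted: int
    linked_incidents: int


def priority_score(p):
    """Hypothetical weighting: service importance first, then reach,
    then how often the problem has already caused incidents."""
    reach = 3 if p.users_impacted >= 100 else 2 if p.users_impacted >= 10 else 1
    return p.service_importance * 10 + reach * 3 + min(p.linked_incidents, 10)


def prioritized_backlog(problems):
    """Return problems ordered from highest to lowest score."""
    return sorted(problems, key=priority_score, reverse=True)


backlog = prioritized_backlog([
    Problem("One user's monitor flickers", 1, 1, 2),
    Problem("Payroll batch job fails intermittently", 3, 400, 6),
    Problem("VPN drops for remote staff", 2, 60, 9),
])
for p in backlog:
    print(priority_score(p), p.title)
```

With these made-up numbers, the payroll problem lands at the top of the queue and the single-user issue at the bottom, which matches the intent of not spending analysis time on something that affects one person.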

Phase 2 of the Problem Management Practice in ITIL 4: Problem Control

Once we complete the steps in Problem Identification, we want to look at the problems we have logged a bit more closely. The ITIL Practice Guide states:

“Problem Control focuses on the analysis of problems. In reactive Problem Management, problem analysis uses information about the product architecture and configuration to identify configuration items that are likely to cause the relevant incidents. The analysis is not limited to CIs and includes other factors, such as user behavior, human error, and procedure errors.”

Errors can come from lots of different places, and good Problem Management teams often use a variety of tools and techniques to help identify impact and perform root cause analysis (more on that later). At times, there may not be a single cause — or a clean solution — to a very complex problem.

When a problem has gone through this analysis step, it is now termed a known error.

Known error: A problem that has been analyzed but has not been resolved.

Assigning a “known error” status to a problem record tells us that someone has looked at the problem, tried to understand what happened and the impact (to our other products or services, organization, and customers), hopefully identified the cause, isolated the error (and/or come up with a solution), and put a workaround in place in the meantime. A workaround (defined below) is essentially a Band-Aid we put in place until we can put in the permanent fix. Not all of these activities always happen, but we want to do as many of them as possible and/or feasible.

Workaround: A solution that reduces or eliminates the impact of an incident or problem for which a full resolution is not yet available. Some workarounds reduce the likelihood of incidents.
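The move from a logged problem to a known error, and later to a resolved problem, can be pictured as a small state machine. The Python sketch below is only an illustration: the status names, the `ProblemRecord` fields, and the ticket numbers are assumptions, and real ITSM tools model this with their own (usually richer) workflows.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    LOGGED = "logged"
    KNOWN_ERROR = "known error"
    RESOLVED = "resolved"


@dataclass
class ProblemRecord:
    title: str
    status: Status = Status.LOGGED
    cause: Optional[str] = None       # may stay unknown after analysis
    workaround: Optional[str] = None  # temporary measure until a full fix
    incident_ids: list = field(default_factory=list)

    def mark_analyzed(self, cause=None, workaround=None):
        """Analysis is done but no permanent fix exists yet:
        the problem becomes a known error."""
        if self.status is not Status.LOGGED:
            raise ValueError(f"cannot analyze a problem in status {self.status.value!r}")
        self.cause = cause
        self.workaround = workaround
        self.status = Status.KNOWN_ERROR

    def resolve(self):
        """Permanent fix implemented: close out the known error."""
        if self.status is not Status.KNOWN_ERROR:
            raise ValueError("only a known error can be resolved in this sketch")
        self.status = Status.RESOLVED


record = ProblemRecord("Nightly report times out", incident_ids=["INC-101", "INC-117"])
record.mark_analyzed(cause="Unindexed query", workaround="Run the report per region")
print(record.status.value)
```

Keeping `cause` optional reflects the point made above: a problem can become a known error once it has been analyzed, even if the cause was not fully pinned down.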

Phase 3 of the Problem Management Practice in ITIL 4: Error Control

The third phase within Problem Management includes activities like submitting a change request and scheduling a change that will implement the permanent fix, notifying customers that the problem has been resolved, and following up with customers to ensure that the problem has, in fact, been fixed from their end.

Other Error Control activities include:

  • Reviewing, reprioritizing, refining, and updating problems, workarounds, and known errors
  • Finding solutions, submitting change requests, and writing up change justifications
  • Implementing solutions and closing out problem records and associated incident records
  • Writing up lessons learned (after action reports, etc.), capturing knowledge about known errors and workarounds, and making improvements
  • Capturing and reporting details on the return on investment from Problem Management activities

Here are some quick, practical recommendations for managing problems:

  • Come up with a few simple characteristics to quickly assign priority, such as the importance of the product or service, the number of people impacted, and the amount of time needed to resolve the problem. Together, these help assess things like risk, cost, time, impact, and urgency.

  • Combine problems into a single, prioritized backlog with other work to be done. This will help maintain visibility within and across teams on the work not yet done.

  • Help your teams learn some of the tools and techniques for analyzing problems and identifying root cause

  • Avoid having anybody on the Service Desk “own” the Problem Management practice.

  • Whenever possible, be sure to communicate the return on investment (time/cost savings, etc.) of doing Problem Management to show the value that the practice is bringing to the organization

With that said, there may be times when it’s not feasible to resolve a particular problem. For example, we might be waiting on the next patch or release of software from a vendor, or the cost of resolution is much higher than the benefit of getting rid of it. In these instances, we may periodically review problem records to see whether things in the environment have changed and whether the problem can now be resolved. In other cases, temporary workarounds may become the permanent solution. Per the ITIL Practice Guide, “Known errors are a part of an organization’s technical debt [defined below] and should be removed where reasonably practicable.”

Technical debt: The total rework backlog accumulated by choosing workarounds instead of system solutions that would take longer.

Using Problem, Monitoring and Event, and Incident Management Practices Together

Using Problem Management together with Monitoring and Event Management (which can alert you to significant changes in the environment that have the potential to turn into problems) and Incident Management makes all three practices stronger and leaves us with fewer overall “fires” to deal with on a daily basis.

What are Problem Models in ITIL 4?

One of the concepts that was introduced in ITIL v3 and continues into ITIL 4 is that of problem models, which can help us address certain types of problems more quickly:

Problem model: A repeatable approach to the management of a particular type of problem.

The idea here is that the better we can define a repeatable approach to tackling different types of problems (hardware or software-related errors, etc.), the quicker we can analyze, understand, and hopefully resolve said problems. Problem models may include specific questions to ask, helpful tools or techniques to use, and people or teams to pull into the conversation.

How Problem Management in ITIL 4 Differs from ITIL v3

The overall purpose and definitions in the ITIL 4 Problem Management practice are similar to what was covered in ITIL v3. The ITIL 4 release clarified and refined some of the key terms, like problem and workaround, and introduced the three-phase concept of Problem Management. In the ITIL material, a lot more detail has been added around what a problem manager should do as part of their role, particularly around proactive Problem Management activities and key information that should be included in a problem record.

Root Cause Analysis, 5 Whys, Fault Tree Analysis, Business Impact Analysis, and more

The Problem Control section of the Practice Guide mentions some of the common root cause analysis tools and techniques, like the 5 Whys, Kepner-Tregoe, and Fault Tree Analysis, along with tools and techniques that analyze impact, like Business Impact Analyses (BIAs) and Component Failure Impact Analyses (CFIAs). Unfortunately, the guide does not go into any additional detail beyond the initial mention. For those who are heavily involved in doing Problem Management work, I recommend doing a deeper dive into each of these topics. They are extremely helpful skills for teams to develop and help bolster their critical thinking skills.
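To give a flavor of one of these techniques, here is a minimal Python sketch of the arithmetic behind a simple Fault Tree Analysis. It is not taken from the Practice Guide. It assumes the basic events are independent, and the tree structure and probabilities are invented purely for illustration.

```python
from math import prod


def and_gate(*probabilities):
    """The output event occurs only if ALL inputs occur (independent events)."""
    return prod(probabilities)


def or_gate(*probabilities):
    """The output event occurs if ANY input occurs (independent events)."""
    return 1 - prod(1 - p for p in probabilities)


# Hypothetical tree: the service goes down if power is lost,
# OR the database fails, OR a bad configuration change slips through.
# Power is only lost if the mains supply AND the UPS both fail.
power_loss = and_gate(0.02, 0.05)
service_outage = or_gate(power_loss, 0.01, 0.005)

print(f"P(power loss)     = {power_loss:.4f}")
print(f"P(service outage) = {service_outage:.4f}")  # about 1.6% with these inputs
```

Even with invented numbers, a tree like this shows which branch contributes most to the top-level failure. Here the database failure dominates, so that is where analysis effort would pay off first.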

Practice Success Factors or PSFs for Problem Management in ITIL 4

All of the ITIL 4 practices include ideas around Practice Success Factors, or PSFs (what was referred to in ITIL v3 as Critical Success Factors, or CSFs). The PSFs for Problem Management include:

  • identifying and understanding the problems and their impact on services
  • optimizing problem resolution and mitigation

The better we can get in these two areas, the stronger our Problem Management practice will be and the more benefit we will bring to our organization as a result.

Who Should Do Problem Management?

Problem Management is not usually someone’s full-time job, at least not until you get into very large organizations. The role of a Problem Manager generally falls to a person or a team of people who contribute to the practice as just one of the many “hats” they wear. People who own or are involved in the Problem Management practice should have good technical expertise, be able to pull the right people into the room to discuss problems, effectively manage and prioritize the overall backlog of problems, and help drive problems to resolution. The specific titles and types of people involved, however, often vary across organizations.

The ITIL 4 Practice Guide provides some additional thoughts on the role:

“Where a dedicated problem manager role is defined, it is usually assigned to specialists combining good knowledge of the organization’s products (architecture, configurations, and interdependencies) with solid analytical skills (the ability and authority to coordinate teamwork and provide good risk management). This role is usually responsible for managing and coordinating the specialist activities in the Problem Management processes, including:

  • conducting and coordinating problem registration based on the submitted information
  • initial categorization of the problems
  • coordinating problem investigation and solution implementation control
  • coordinating communication with the teams responsible for incident resolution and change implementation
  • developing and communicating problem models, where applicable
  • coordinating known error monitoring and review
  • formal problem closure

Many organizations find it useful to form temporary teams to investigate high-impact problems and/or to develop solutions.”

One thing worth noting is that those working on the Service Desk are in a good position to help identify problems, but they should not own the practice. The reason for this is twofold. They may not have the deep specialist expertise to resolve problems. Most importantly, however, we don’t want to take their focus away from quickly and effectively managing incidents and service requests at the Service Desk.

Where to learn more about Problem Management

AXELOS’s Practice Guide has additional details on the Problem Management practice and is available for free as part of a MyITIL subscription for those who have taken and passed a 2-day ITIL 4 Foundation course and exam. You can also find guides for the Service Desk, Service Request Management, Incident Management, and Monitoring and Event Management practices as part of the MyITIL subscription. If you don’t currently hold the ITIL 4 Foundation certificate, a 1-year subscription to MyITIL costs $50.

If you do hold the ITIL 4 Foundation credential, you have the necessary prerequisites to take any of the advanced ITIL 4 courses that discuss Problem Management concepts in greater detail.

We also offer a customized 2-day Problem Management and Root Cause Analysis Workshop that provides a deeper dive into specific tools and techniques mentioned in this article and helps develop a team’s critical thinking and analytical skills in a fun, collaborative way.


Originally published June 06 2021, updated April 04 2022