Welcome to episode 4 of our series “From Theory to Practice.” Blameless’s Matt Davis and Kurt Andersen were joined by Joanna Mazgaj, director of production support at Tala, and Laura Nolan, principal software engineer at Stanza Systems. They tackled a tricky and often overlooked aspect of incident management: problem detection.
It can be tempting to gloss over problem detection when building an incident management process. The process might start with classifying and triaging the problem and declaring an incident accordingly, with detection treated as a given: something assumed to have already happened before the process starts. Sometimes detection really is that simple, with your monitoring tools or a customer report drawing your attention to an outage or other anomaly. But there will always be problems that conventional means won’t catch, and those are often the ones that need the most attention.