Part 04
Alert Correlation
Speaker 1: During a major outage, alerts can pile up faster than anyone can read them.
The dashboard becomes a blur of red.
Speaker 2: Right, and you end up jumping between windows hoping one of them tells
you what's really wrong.
Speaker 1: Instead of playing whack-a-mole, we group related alerts so they read like a
story.
Speaker 2: That story forms the timeline—who did what, when it happened, and which
alerts were just copycats.
Speaker 1: Once you can see the sequence, you stop chasing ghosts and start fixing the
real issue.
Speaker 1: Alert correlation groups related notifications so you aren't chasing separate
fires that stem from the same spark.
Speaker 2: Think about that time the database slowed down and suddenly we had ten
different services throwing errors.
Speaker 1: By viewing them together, we realized it was all one issue and ignored the
noise.
Speaker 2: It also cuts down on false positives. If multiple sensors complain but share
the same timestamp, it's probably one real problem, not ten.
Speaker 1: Correlation keeps us focused on fixing what's broken instead of firefighting
every alert in sight.
Speaker 1: Once the storm settles, gather the alerts, logs, and chat messages to create
a single timeline of the incident.
Speaker 2: Aligning timestamps reveals what triggered what—did the database crash
first, or was it a network blip that snowballed?
Speaker 1: Normalizing time zones can be tricky when teams are spread around the
globe, so double-check the clocks on your servers.
Speaker 2: Any gaps in the timeline show where monitoring was missing or people were
slow to respond, which gives us clear improvement targets.
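Speaker 1: To make that concrete, here's a minimal Python sketch of pulling events from different time zones into one UTC-sorted timeline—the events, timestamps, and zone names are invented for illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# Hypothetical events gathered from alerts, logs, and chat, each logged in a local time zone.
events = [
    ("db latency alert fired", "2025-07-14 09:02:11", "America/New_York"),
    ("checkout errors spiked", "2025-07-14 15:03:40", "Europe/Berlin"),
    ("ops chat: restarted pool", "2025-07-14 22:10:05", "Asia/Tokyo"),
]

# Convert everything to UTC before sorting so the order reflects what really happened.
timeline = sorted(
    (
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        .replace(tzinfo=ZoneInfo(tz))
        .astimezone(ZoneInfo("UTC")),
        label,
    )
    for label, ts, tz in events
)

for when, label in timeline:
    print(when.strftime("%H:%M UTC"), "-", label)
```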
Speaker 1: Correlation engines in a SIEM can automatically link alerts by source, host,
or time window, saving hours of manual sorting.
Speaker 2: Tools like Splunk or QRadar let you write rules that spot cascading failures
or repeated login errors across servers.
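Speaker 1: If you don't have a SIEM handy, the core idea still fits in a few lines of Python—this sketch simply groups alerts that fire within a five-minute window, using made-up alert data; real correlation engines layer host and service matching on top of the same principle.

```python
from datetime import datetime, timedelta

# Hypothetical alerts: (timestamp in UTC, host, message).
alerts = [
    (datetime(2025, 7, 14, 13, 2), "db01", "connection pool exhausted"),
    (datetime(2025, 7, 14, 13, 3), "web03", "upstream timeout"),
    (datetime(2025, 7, 14, 13, 4), "web07", "upstream timeout"),
    (datetime(2025, 7, 14, 14, 30), "mail01", "queue backlog"),
]

WINDOW = timedelta(minutes=5)

# Walk the alerts in time order and start a new group whenever the gap exceeds WINDOW.
groups, current = [], []
for alert in sorted(alerts):
    if current and alert[0] - current[0][0] > WINDOW:
        groups.append(current)
        current = []
    current.append(alert)
if current:
    groups.append(current)

for number, group in enumerate(groups, start=1):
    print(f"Incident candidate {number}: {[message for _, _, message in group]}")
```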
Speaker 1: Don't forget the human side. Chat exports from Slack or Teams show who
ran commands and when.
Speaker 2: ServiceNow tickets, GitHub issues, and even quick screenshots all feed the
timeline so post-mortems have solid evidence to reference later.
Speaker 1: When you line up the alerts with actions and outcomes, a clear story
emerges about what actually happened.
Speaker 2: That story shows where your monitoring shines and where it falls short,
setting the stage for better prevention next time.
Speaker 1: Correlation and timelines aren't busywork—they're your map for continuous
improvement and faster resolutions.
Speaker 2: Keep an eye on metrics like mean time to resolution or how often you
mislabel alerts. If those numbers drop, you know your correlation efforts are paying off.
Communicating Outcomes
Speaker 1: Once the dust finally settles—hopefully not literally falling from the ceiling
onto your keyboard—it's tempting to simply move on.
Speaker 2: We all want to forget the chaos, but if we don't discuss what happened,
those same mistakes sneak back up on us.
Speaker 1: This segment walks through why sharing the post-mortem results matters,
how to keep different teams in the loop, and a few ways to make updates stick.
Speaker 2: Think of it as sweeping up the debris, labeling the bags, and setting them
out so everyone can recycle the lessons. Plus, a little openness keeps future dust from
piling up again.
Speaker 1: The first message after a major incident should be short and clear. Think
"Our database server went down at 1 pm, we restored service by 2 pm, here's what
happened."
Speaker 2: Attach or link to the post-mortem document so anyone who needs the gritty
details can read them later.
Speaker 1: Send a similar summary to leadership, but highlight the business impact,
such as how many users were affected, and outline next steps.
Speaker 2: For customers or non-technical audiences, focus on reassurance—we're
monitoring closely and will share updates each week until all fixes are deployed.
Speaker 1: Sharing in different channels—email, Slack, or a status page—keeps
everyone aligned and sets expectations for follow-ups.
Speaker 1: Once a post-mortem wraps, every action item needs a single owner and a
real deadline.
Speaker 2: Stick it in ServiceNow or a GitHub issue—somewhere visible—so it can't
quietly tumble down a crack in the floor.
Speaker 1: A good ticket includes the action description, an assignee and the target
date, like [OUT-123] Update database config – Owner: Lee, Due: 2025-08-01.
Speaker 2: Bring these items to daily stand-ups or weekly meetings. If progress stalls
for a week, escalate early.
Speaker 1: Cross each item off as it's completed, then pat yourself on the back before
the next one tries to slip away. It's amazing how slippery those tasks can be.
Speaker 1: Communication doesn't end once the meeting wraps up.
Speaker 2: Keep posting updates in the same ticket or chat thread so there's a single
place to see progress.
Speaker 1: When a fix is deployed, share a quick note such as: "Patch deployed to
production at 22:00 UTC. Monitoring looks good."
Speaker 2: At the next post-mortem or weekly review, run through any unfinished items
and highlight improvements backed by metrics.
Speaker 1: If tasks stall, escalate or reassign them so they don't linger forever. Closing
the loop shows you're serious about follow-through and prevents lingering action items
from becoming new incidents. A quick thank-you also goes a long way in keeping
momentum high.
Speaker 1: When everyone knows the outcome and who's responsible for each fix,
those improvements actually stick.
Speaker 2: Documenting the steps in a shared place and checking back on them shows
professionalism and builds confidence in the process.
Speaker 1: It also prevents the dreaded "So whatever happened with that outage?"
question from upper management.
Speaker 2: Remind owners about upcoming deadlines and escalate only when
absolutely necessary—a friendly nudge usually does the trick.
Speaker 1: Open communication turns an unpleasant incident into an opportunity to
improve and keeps you from repeating history.
Speaker 2: Soon you'll have a track record of completing action items, which is the best
proof that the post-mortem process works. So keep sharing updates, celebrate
completed tasks and watch the trust in your team grow.
Kaizen and Corrective Actions
Speaker 1: Ever notice how IT teams love saying "continuous improvement" at
meetings? It's usually right after something breaks for the third time this month.
Speaker 2: Exactly. That phrase isn't just buzz—it's rooted in Kaizen, the practice of
making tiny fixes every day so problems never build up.
Speaker 1: We'll explore what Kaizen looks like in real life, from tweaking scripts to
updating documentation before issues escalate.
Speaker 2: Then we'll contrast it with corrective actions, which kick in when an outage
or audit exposes a larger flaw.
Speaker 1: By the end you'll see how Kaizen habits reduce emergencies and how
corrective actions reinforce those habits when bigger issues surface.
Speaker 1: Kaizen is the opposite of a grand overhaul. It shows up in small acts, like
adding a common support question to the FAQ instead of answering it five times a day.
Speaker 2: Because these tweaks are tiny, they carry little risk and rarely need lengthy
approval.
Speaker 1: Sarah used to spend ten minutes each morning checking backups. She
wrote a two-line script to email the status, saving an hour a week for the team.
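Speaker 1: The real thing was only a couple of lines; fleshed out with imports, it looks roughly like this Python sketch—the log path, addresses, and mail server are placeholders for whatever your environment uses.

```python
import smtplib
from email.message import EmailMessage
from pathlib import Path

# Placeholder path: grab the tail of last night's backup log.
status = Path("/var/log/backup/last_run.log").read_text()[-2000:]

msg = EmailMessage()
msg["Subject"] = "Nightly backup status"
msg["From"] = "backups@example.com"      # placeholder sender
msg["To"] = "ops-team@example.com"       # placeholder team alias
msg.set_content(status)

# Placeholder mail relay; swap in your own SMTP host.
with smtplib.SMTP("mail.example.com") as smtp:
    smtp.send_message(msg)
```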
Speaker 2: Kaizen can also mean tidying up scripts after deployments—finally deleting
those "//TODO: fix this horrible hack" comments from 2019.
Speaker 1: Managers gather suggestions during stand-ups and track them on an
improvement list so progress is visible.
Speaker 2: Mature teams see five to fifteen Kaizen wins per person each month, freeing
up time for bigger challenges.
Speaker 1: Corrective actions kick in when something serious happens, like the email
server crashing at 3am or an auditor flagging a missing approval.
Speaker 2: First we investigate to uncover the root cause, then plan the fix, assign an
owner and set a deadline.
Speaker 1: Our last email outage came from an expired certificate. The corrective
action added monitoring and a thirty‑day renewal reminder.
Speaker 2: After the fix goes in, we verify it worked and record the evidence in tools
such as ServiceNow.
Speaker 1: Because these steps are formal, they usually require management sign-off
and extra documentation so nothing slips through the cracks.
Speaker 2: They take more time than Kaizen, but they keep serious problems from
repeating. A common pitfall is treating symptoms instead of root causes or skipping
verification.
Speaker 1: So Kaizen is like eating your vegetables, and corrective actions are like
taking medicine when you're sick?
Speaker 2: Exactly. Kaizen keeps the system healthy day to day, while corrective
actions cure the nasty surprises.
Speaker 1: Teams track Kaizen ideas on an improvement list and review them at
weekly stand‑ups. Corrective actions get their own tickets with deadlines and
verification steps.
Speaker 2: One team posts its "Kaizen wins" on a dashboard—they average a dozen
small improvements each month and cut incident tickets by forty percent over six
months.
Speaker 1: The key is making sure the two approaches complement each other. If a
corrective action reveals a process gap, spin off related Kaizen tasks.
Speaker 2: That workflow lines up with ITIL and DevOps practices: continuous
improvement feeds the pipeline and corrective actions keep it honest.
Speaker 1: We've seen how Kaizen builds momentum through tiny, everyday tweaks,
while corrective actions handle the emergencies that still sneak through.
Speaker 2: The real trick is measuring both: track how many Kaizen ideas are
implemented and check whether each corrective action actually prevents a repeat
incident.
Speaker 1: Mature teams log at least a handful of Kaizen items per person each month
and review them alongside open corrective actions to spot patterns.
Speaker 2: When teams invest a little time each week in improvement, they learn new
skills and spend less effort explaining why the same thing broke again.
Speaker 1: Blending these approaches creates a culture that prizes prevention and
quick recovery—a combination that keeps services reliable and people motivated.
Speaker 2: Stick with it, and that balance turns endless firefighting into predictable
improvement.
Log Analysis and Git Blame
Speaker 1: Logs are the storybook of every application. Each entry marks what
happened and at what severity level.
Speaker 2: Right, without them we'd be guessing whether a failure came from a bad
deployment or a hardware hiccup.
Speaker 1: Remember the checkout bug we chased for days? The logs finally showed
payment timeouts minutes after a database lock warning.
Speaker 2: Once we lined those timestamps up, the cause was obvious, and we avoided
hours of finger-pointing.
Speaker 1: That's why we dig through logs even when it's tedious. They turn hunches
into hard evidence and help us spot patterns early.
Speaker 2: We've all stared at a wall of red ERROR messages at 2 AM wondering where
to even begin.
Speaker 1: When the app is small, a quick grep for "ERROR" usually does the trick.
Speaker 2: But how do you even know where to start looking in a 10GB log file? Once you have several services, tools like Elastic or Splunk become essential.
Speaker 1: They let you search structured fields and follow one request across many
logs.
Speaker 2: We tag every entry with a request ID so we can trace a user's journey end to
end.
Speaker 1: Dashboards help spot patterns too. A jump in WARN logs might reveal
memory pressure before anything crashes.
Speaker 2: Whatever tool you use, keep enough history so you can go back and learn
from incidents.
Speaker 1: `git blame` shows who last touched a line, but that alone doesn't prove
responsibility.
Speaker 2: The author may have been fixing someone else's bug or working with
incomplete specs.
Speaker 1: When we spot a risky change, we message the contributor and ask what problem they were solving.
Speaker 2: Usually we uncover useful context, like a last-minute hotfix that forced a quick decision.
Speaker 1: Remember when we blamed the database for three hours before realizing it was a typo in the config? Asking first would have saved us the detour.
Speaker 1: We also use options like `-w` to ignore whitespace or `-C` to track code
moved between files.
Speaker 2: Used kindly, blame provides insight without turning conversations into witch
hunts.
Speaker 1: Here's a typical investigation. It can feel stressful when production is down, so having a checklist keeps everyone calm. We start by scanning the logs for errors like "database connection timeout".
Speaker 2: If network metrics look normal, we examine recent commits that touched
the connection pool.
Speaker 1: Running git blame on that section shows who adjusted the pool size.
Speaker 2: Instead of blaming them, we ask what issue they were trying to solve and if
it's still relevant.
Speaker 1: Together we test new settings, update the documentation, and note
everything in the ticket.
Speaker 2: Saving those logs and discussions means the next team understands why
we made each change.
Speaker 1: Logs show what happened, and blame hints at who changed the code and
why.
Speaker 2: Used together, they help us resolve issues quickly without turning the
post‑mortem into a witch hunt.
Speaker 1: We also respect privacy, follow retention rules and document lessons
learned.
Speaker 2: That builds trust. People feel safe admitting mistakes, so the whole team
improves.
Speaker 1: The goal isn't to catch someone out; it's to make the system stronger after
each incident.
Speaker 2: Treat logs and blame as tools for insight, not weapons, and they'll guide
your career as much as your code.
Managing Emotions
Speaker 1: Even the calmest engineers can get defensive after a sleepless night
responding to an outage.
Speaker 2: We've all been there—you're exhausted, adrenaline is fading and suddenly
every question feels like an accusation.
Speaker 1: A solid plan for managing emotions keeps the conversation focused on
learning, not finger‑pointing, no matter how stressed the team feels.
Speaker 2: We'll also look at how cultural expectations shape those reactions so you
can lead inclusive post‑mortems that help your career as much as the codebase.
Speaker 1: Picture this—it's Black Friday and the payment gateway crashes right before
thousands of customers hit "buy".
Speaker 2: The pressure is sky‑high and everyone's worried about being singled out.
That fear can lead people to keep quiet about missing monitors or shortcuts taken
during the rush.
Speaker 1: By calling out those emotions early—"I know we're all tense"—you
encourage honesty and stop the blame game before it starts.
Speaker 2: The more open the discussion, the faster you dig up the real causes and
move toward solutions.
Speaker 1: Remember, we're diffusing tension, not defusing bombs—though sometimes
it feels similar!
Speaker 2: If frustration flares, try repeating back what you heard: "So you're worried
the rollback script failed?" That shows you get it without blaming anyone.
Speaker 1: Suggest a quick stretch break when voices rise. People come back calmer
and ready to listen.
Speaker 2: Encourage phrases like "I felt rushed" or "I was confused" instead of "You
messed up". Those small tweaks keep the discussion productive.
Speaker 1: I once worked with a Japanese developer who barely spoke during
post‑mortems, even when he had the missing puzzle piece.
Speaker 2: That's common in cultures where disagreement can feel disrespectful. We
started doing short one‑on‑one chats afterward and paired him with a mentor who
modelled feedback.
Speaker 1: After a few weeks he was comfortable explaining issues in the group. His
insights saved us from repeating mistakes.
Speaker 2: The key is setting ground rules that welcome respectful critique and
adapting your style so everyone feels safe speaking up.
Speaker 1: When voices get loud or the chat blows up, it's tempting to play referee.
Speaker 2: A quick reset works better. Try the NAME framework—Notice what's
happening, Acknowledge the emotion, Move forward to the facts, and Engage everyone
in solutions.
Speaker 1: In remote meetings it can be as simple as "I can see this is frustrating. Let's
take a minute, then focus on what we control." Sometimes a short break is all it takes
to cool heads.
Speaker 1: A good facilitator keeps the group focused on improvement rather than
blame.
Speaker 2: They might start by sharing a quick emotional check‑in—"Green, yellow, or
red?"—so everyone gauges the mood.
Speaker 1: Jot down emotional cues next to the facts. If someone looks uneasy, offer a
one‑on‑one follow‑up.
Speaker 2: Ground rules like "assume good intent" and "speak from your own
experience" help new team members, especially across cultures.
Speaker 1: These skills translate directly to leadership roles where guiding difficult
conversations is part of the job description.
Speaker 1: Handling emotions and cultural differences takes practice, not just theory.
Speaker 2: When you approach post‑mortems with empathy and clear expectations, the
focus stays on learning rather than blame.
Speaker 1: Those habits pay off in your career too—leaders who navigate tough
conversations calmly are trusted with bigger challenges.
Speaker 2: Keep refining these skills and you'll turn every incident into an opportunity
for growth, both for the system and for yourself.
Metrics To Monitor
Speaker 1: Measuring outcomes is how we prove our process works. Without numbers
it's just opinion.
Speaker 2: Exactly. Leadership wants to see trends like recovery times getting shorter,
not just hear that we "did better".
Speaker 1: We'll focus on MTTR, recurrence rates and whether action items actually get
done.
Speaker 2: Tracking these may sound tedious, but it quickly shows if fixes stick or if the
same outages come back.
Speaker 1: MTTR stands for Mean Time to Recovery. It's the average duration from
detection to full service restoration.
Speaker 2: Recurrence rate tracks how often the same type of incident reappears within
a set period.
Speaker 1: Action item completion ratio shows what percentage of agreed fixes were
actually carried out.
Speaker 2: We also watch the age of open tasks so nothing lingers forever.
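Speaker 1: If you want to see the arithmetic, here's a minimal Python sketch over made-up incident records—the field layout and numbers are purely illustrative.

```python
from datetime import datetime

# Hypothetical incidents: (detected, restored, category, actions_done, actions_agreed)
incidents = [
    (datetime(2025, 6, 1, 13, 0),  datetime(2025, 6, 1, 14, 0),  "db-outage",   3, 3),
    (datetime(2025, 6, 9, 2, 15),  datetime(2025, 6, 9, 5, 45),  "cert-expiry", 1, 2),
    (datetime(2025, 6, 20, 9, 30), datetime(2025, 6, 20, 10, 0), "db-outage",   2, 2),
]

# Mean Time to Recovery: average of (restored - detected), shown here in minutes.
mttr_minutes = sum((r - d).total_seconds() for d, r, *_ in incidents) / len(incidents) / 60
print(f"MTTR: {mttr_minutes:.0f} min")

# Recurrence rate: incidents whose category already appeared earlier in the period.
categories = [c for _, _, c, _, _ in incidents]
repeats = sum(1 for i, c in enumerate(categories) if c in categories[:i])
print(f"Recurrence rate: {repeats / len(incidents):.0%}")

# Action item completion ratio across all post-mortems in the period.
done = sum(record[3] for record in incidents)
agreed = sum(record[4] for record in incidents)
print(f"Action items completed: {done / agreed:.0%}")
```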
Speaker 1: ServiceNow and Jira both let you export incident data straight into reports.
Speaker 2: For smaller teams, a shared spreadsheet works fine, as long as someone
updates it each week.
Speaker 1: Grafana or Kibana are great for graphing MTTR trends alongside deployment
metrics.
Speaker 2: Whatever tool you pick, make sure it can export to CSV for audits and
compliance.
Speaker 1: Numbers only help if you act on them. Review trends monthly and ask why
any spike occurred.
Speaker 2: If the same issue recurs, escalate and revisit your fixes—maybe a root cause
was missed.
Speaker 1: Celebrate when MTTR drops or when all action items close on time. Share
those wins widely.
Speaker 2: And when tasks slip, discuss roadblocks in stand-ups and adjust priorities
rather than ignoring them.
Speaker 1: Consistently measured metrics turn one-off fixes into lasting improvements.
Speaker 2: When leadership sees recovery times shrinking and fewer repeat incidents,
support for your process grows.
Speaker 1: Keep tracking completion rates so action items don't fade away once the
spotlight moves on.
Speaker 2: The data speaks for itself; use it to drive decisions and keep refining your
operations.
Post-Mortem Agenda
Speaker 1: Post-mortems can drift into rambling war stories if no one sets a clear
agenda. A simple structure keeps the conversation productive and short.
Speaker 2: In this segment we'll outline a repeatable agenda that teams of any size can
follow. You'll learn who should be in the room and what each person contributes.
Speaker 1: We'll also cover how thorough documentation helps future investigators
understand what went wrong and why. By the end you'll have a framework you can
adapt to your own organisation.
Speaker 2: You might think it's overkill to formalise a meeting about one incident, but
the agenda keeps everyone focused on facts instead of opinions. Without it, people
tend to jump around the timeline or get stuck debating blame.
Speaker 1: Begin every post-mortem by walking through a concise timeline of events.
Note when alerts fired, when the first responder acknowledged them and when service
was restored. This sets a factual foundation and keeps speculation in check.
Speaker 2: After the timeline, cover the business impact—how many users were
affected and what the cost might be. Then move on to root cause analysis using a
structured method like five whys or a fishbone diagram.
Speaker 1: Capture action items as they're discussed and assign owners on the spot.
End the meeting by confirming due dates and when follow-up reviews will happen. A
typical agenda can wrap up in under 30 minutes when everyone sticks to these steps.
Speaker 1: A successful post-mortem needs the right mix of perspectives. Incident
responders bring the technical details and know exactly what they tried in the heat of
the moment.
Speaker 2: Service owners speak to business impact and can decide which fixes are
worth prioritising. A facilitator keeps the conversation moving and makes sure quieter
voices are heard.
Speaker 1: You'll also want a scribe to capture notes and action items in a shared
document or ticketing system. Finally, invite at least one stakeholder from the business
side so the discussion stays grounded in user impact rather than just technical details.
Speaker 1: Documentation is often the most overlooked part of a post-mortem. Use a
central template so every incident report looks the same. Include the timeline, root
cause details, impact assessment and action items.
Speaker 2: Link your ServiceNow tickets or JIRA issues, plus any GitHub commits or pull
requests that contain the fixes. This makes it easy for future teams to trace the history
if something resurfaces.
Speaker 1: Add metrics like MTTR and number of affected users. Summarise lessons
learned in plain language so they can be reused in training or onboarding. Finally, store
the report in a shared location and announce it to the team so knowledge spreads
beyond those who attended the meeting.
Speaker 1: A consistent agenda turns post-mortems into a learning tool instead of a
finger-pointing session. It ensures every incident gets the same level of scrutiny.
Speaker 2: Documenting who attended and what was decided makes follow-up easier
and helps new team members learn from past mistakes. Keep the reports concise and
accessible so they actually get read.
Speaker 1: With clear roles, a repeatable agenda and solid records, post-mortems
become a catalyst for improvement rather than a dreaded meeting.
Speaker 2: Set a calendar reminder to revisit unresolved action items every quarter.
Nothing kills trust faster than open tasks left hanging indefinitely.
Post-Mortem Culture
Speaker 1: Remember when our payment system crashed last month? The
post-mortem felt like a witch hunt.
Speaker 2: Right! That's exactly what we want to avoid. Today we'll learn how to turn
failures into learning opportunities instead of finger-pointing sessions. We'll practice
staying curious about how our process let the issue happen so we can fix it together.
Speaker 1: Psychological safety means no one gets punished for admitting an honest
mistake.
Speaker 2: Exactly! When teams feel safe, they surface the real issues quickly.
Remember how Sarah skipped the deployment checklist? Instead of firing her, we
improved the process so it's impossible to skip.
Speaker 1: Think of it like a confession booth for code—people need to feel safe
admitting their digital sins.
Speaker 1: When something breaks, the first instinct is often "Who messed up?".
Speaker 2: But blaming shuts the conversation down. Instead of asking "Why did you
delete the database?" try "What led you to run that command?".
Speaker 1: Great example! We focus on how the process allowed the mistake, not who
pushed the button. The goal is debugging the system, not making someone cry.
Speaker 1: Invite everyone who was involved—engineers, managers, and even
customer support.
Speaker 2: Right, because each role sees a different part of the picture. Junior devs
might notice missing tests, while support teams capture real user impact.
Speaker 1: And drawing out quiet participants makes sure the action items reflect
reality, not just the loudest voices.
Speaker 1: Let's start every post-mortem by reviewing the ServiceNow ticket and the
exact timeline of events.
Speaker 2: Then we map each step to the ITIL incident-management flow and dig into
root causes with the five-whys technique.
Speaker 1: Document action items as GitHub issues so we can track them. DORA
metrics like MTTR show if our fixes actually work.
Speaker 1: One big pitfall is rushing to solutions before we really understand the
problem.
Speaker 2: Absolutely. Another is letting one "hero" take all the blame or glory. We
need the whole team learning, not just one person.
Speaker 1: And of course the blame game spiral—once finger-pointing starts, people
shut down and hide information.
Speaker 1: Here's a scenario: the website crashes right after a big marketing blast.
Speaker 2: I'd pull in the on-call engineer, the database admin, and support to map the
timeline. Then we'd ask what in our process allowed the traffic spike to take us down.
Speaker 1: Exactly. Keep the questions neutral so we discover the real gaps instead of
assigning blame.
Speaker 1: When tensions rise, try saying, "Help me understand what led up to this"
instead of "Who did it?"
Speaker 2: Right. Redirect accusations toward the workflow. Ask, "What monitoring
failed us?" or "What review step was missing?"
Speaker 1: Inviting quiet voices with "Anything we missed from your side?" keeps
everyone engaged and prevents defensiveness.
Speaker 2: Over time, using phrases like these shows you're ready for leadership roles
because you focus on improving the system, not blaming people.
Speaker 1: To dig deeper, check out Google's post-mortem template and Amy
Edmondson's book *The Fearless Organization*.
Speaker 2: We also have ServiceNow guides and a DORA metrics cheat sheet linked in
the notes. Use them to strengthen your next post-mortem.
RCA Frameworks
Speaker 1: We've spent time exploring how to hold post-mortems without pointing
fingers. Now we need a toolkit for digging into the technical reasons behind failures.
We'll cover two proven methods: the Five Whys and the fishbone diagram. Each
provides a step-by-step path to go beyond symptoms and uncover the underlying
system weaknesses.
Speaker 2: Using a framework keeps the conversation grounded in evidence rather than
opinions. Everyone participates by examining facts, which builds a shared
understanding of the incident. These methods also help with documentation, since the
process itself guides what to record. We'll see how they fit into the overall post-mortem
workflow and how you can apply them in your own environment.
Speaker 1: Without a framework, teams often jump straight to a fix and miss patterns
hiding in the data. When we slow down and follow a structured approach, every step
requires evidence. This keeps the analysis objective and prevents the conversation
from veering into blame or guesses.
Speaker 2: Frameworks also save time over the long run because they are repeatable.
If each incident is analysed differently, new members struggle to learn. By following a
shared checklist, we build a knowledge base of root causes, which in turn improves
reliability metrics like MTTR. Studies show companies using consistent RCA methods cut
repeat incidents by over fifty percent.
Speaker 1: The Five Whys method begins with the problem statement. Let's say a
website went offline. We ask, "Why was the site down?" The first answer might be "The
database became unreachable." We then ask, "Why was the database unreachable?"
Maybe "Because a deployment script changed the network settings." The next why digs
deeper: "Why did the script change them?" Because there was no peer review before
the deploy. "Why wasn't there a review?" Because our automation pipeline doesn't
enforce it. At the fifth why we discover the real issue: the pipeline needs a mandatory
approval step.
Speaker 2: The trick is not to stop after the second or third why. Each answer should be
backed by evidence so we avoid speculation. It's easy to fall into solution mode too
early, but the point is to expose the hidden weakness in the process. Document each
question and answer chain so others can follow the logic later.
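Speaker 1: One lightweight way to record that chain is a simple structure in your post-mortem notes—this Python sketch just mirrors the example above, and the evidence labels are placeholders for whatever artefacts you actually collected.

```python
# Each entry records the question, the answer, and the evidence behind it.
five_whys = [
    ("Why was the site down?", "The database became unreachable.", "uptime probe + DB logs"),
    ("Why was the database unreachable?", "A deployment script changed the network settings.", "deploy log"),
    ("Why did the script change them?", "There was no peer review before the deploy.", "PR history"),
    ("Why wasn't there a review?", "The automation pipeline doesn't enforce it.", "pipeline config"),
    ("Why doesn't the pipeline enforce it?", "No mandatory approval step has been added.", "root cause"),
]

for number, (question, answer, evidence) in enumerate(five_whys, start=1):
    print(f"Why #{number}: {question}")
    print(f"   -> {answer} (evidence: {evidence})")
```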
Speaker 1: A fishbone diagram looks like the skeleton of a fish, with the issue at the
head and major cause categories branching off the spine. Common categories include
People, Process, Technology, Environment, Materials, and Methods. For each branch
you list possible contributing factors. In an IT context, the "Technology" branch might
include network configuration, while "Process" could reveal gaps in change
management.
Speaker 2: This technique works well when failures have several intertwined causes.
The visual layout helps teams brainstorm systematically without losing track of ideas.
As you fill in the branches, patterns emerge that highlight where to investigate first.
Draw the diagram on a whiteboard or in collaboration software so everyone can
contribute. It's also a good way to record findings for future reference.
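Speaker 2: If a whiteboard isn't handy, even a plain dictionary works as a starting fishbone—the categories come from the list above, and the issue and factors here are invented examples.

```python
# The issue sits at the head of the fish; each category branch collects candidate causes.
issue = "Checkout service returned 500 errors for 45 minutes"
fishbone = {
    "People": ["on-call engineer unfamiliar with the payment stack"],
    "Process": ["change shipped without a review step"],
    "Technology": ["DB connection pool sized for last year's traffic"],
    "Environment": ["marketing campaign tripled normal load"],
    "Materials": [],
    "Methods": ["no load test in the release checklist"],
}

print(f"Issue: {issue}")
for category, factors in fishbone.items():
    print(f"  {category}: {', '.join(factors) if factors else '(none identified yet)'}")
```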
Speaker 1: So how do you decide which tool to use? Start with the Five Whys if the
problem seems to follow a single chain of events. It's fast and requires nothing more
than a whiteboard. If the conversation stalls or new branches appear, switch to a
fishbone diagram to capture the wider context. Sometimes you'll use both: the fishbone
to map categories, then a Five Whys on each branch.
Speaker 2: Keep in mind that no tool fits every situation. Highly complex or political
issues may need a formal investigation beyond these techniques. Always document the
questions asked, the evidence gathered, and the conclusions. That record becomes part
of your post-mortem notes, and it helps the next team understand how you arrived at
the root cause.
Speaker 1: Whether you prefer the simplicity of Five Whys or the visual power of a
fishbone diagram, the goal is the same: uncover the underlying cause so you can fix it
for good. Treat each incident as a chance to strengthen your system, not as a failure to
hide. When done consistently, these frameworks build a culture of learning.
Speaker 2: Make it a habit to share your findings with the whole team and to track the
resulting action items. Over time you'll see patterns in your incident trends and you'll
develop a more robust improvement process. Mastering these analysis techniques is a
valuable skill for any IT professional who wants to lead problem management efforts.
RCA in ServiceNow and GitHub
Speaker 1: We've talked about how to run a blameless post‑mortem, but what happens
to those findings afterward? We've all seen post‑mortems that become "post‑mortem"
themselves—dead and buried in someone's email within a week.
Speaker 2: Exactly. The best place for that information is a ticketing system like
ServiceNow. A problem record stores the timeline, contributing factors, and any
workarounds so nothing slips through the cracks.
Speaker 1: From there we link follow‑up work in GitHub. Today we'll walk through that
flow so you can turn lessons learned into real improvements.
Speaker 1: When you create a problem record in ServiceNow, start with a short title
that hints at the business impact. Then lay out a clear timeline, the contributing factors,
and any workarounds discovered during the incident.
Speaker 2: Link related incident tickets and the change request that eventually fixes
the issue. A good record might read, "Checkout failure during Black Friday—DB
connection pool exhausted; manual order processing used until 4:30 PM."
Speaker 1: Assign an owner, set a target date, and capture the final resolution.
Managers appreciate the audit trail, and the team can easily revisit the record when a
similar issue crops up.
Speaker 1: After the problem record is in place, open a GitHub issue for each
improvement task. A helpful title might be "Increase DB connection pool size -
PRB0001234," not just "Fix database."
Speaker 2: In the description, reference the ServiceNow ticket and explain the business
impact. That cross-link lets developers see why the work matters without leaving
GitHub. As code changes move through pull requests, mention the issue number so
everything stays connected.
Speaker 1: Once the fix is deployed and verified, close the GitHub issue and update the
ServiceNow record. Now both systems tell the same story.
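Speaker 2: If you want to automate that cross-link, the GitHub issues API needs only a few lines—this is a sketch rather than a production script, and the repository name, labels, and token variable are placeholders.

```python
import os
import requests

# Placeholder repository and the ServiceNow problem number from the example above.
REPO = "example-org/checkout-service"
PRB = "PRB0001234"

issue = {
    "title": f"Increase DB connection pool size - {PRB}",
    "body": (
        f"Follow-up from ServiceNow problem {PRB} (Black Friday checkout failure).\n"
        "Business impact: checkout unavailable 2:45-4:30 PM; manual order processing used.\n"
        "See the problem record for the full timeline and contributing factors."
    ),
    "labels": ["post-mortem", "reliability"],  # placeholder labels
}

response = requests.post(
    f"https://api.github.com/repos/{REPO}/issues",
    json=issue,
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},  # placeholder token env var
    timeout=10,
)
response.raise_for_status()
print("Created:", response.json()["html_url"])
```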
Speaker 1: Putting it all together starts with a shared document right after the incident.
Capture timelines, logs, and team observations while the details are fresh.
Speaker 2: Within 24 hours, summarise those findings in a ServiceNow problem ticket
so managers have a clear view. Then create GitHub issues for each action item and link
them back to that ticket.
Speaker 1: During improvement meetings, review the problem record and its linked
issues to check progress. For example, the outage occurred at 2:45 PM, was resolved
by 4:30 PM, and the fix was deployed two days later. Keeping that timeline in one place
ensures nothing from the RCA gets forgotten once the incident fades.
Speaker 1: Integrating your RCA notes with ServiceNow and GitHub keeps everyone on
the same page, from engineers fixing code to managers tracking risk.
Speaker 2: It also helps during audits or handovers because every decision and
follow‑up lives in one place with clear links to the code changes.
Speaker 1: When teams consistently link these systems, improvements actually get
implemented instead of disappearing into a folder.
Speaker 2: That habit turns each incident into documented learning rather than another
"we should totally fix that someday" conversation.
Speaker 1: Plus, seeing past action items accomplished builds trust and motivates the
team to keep improving the process.
Tracking Improvement
Speaker 1: If you've ever wondered whether your quick fix actually solved anything or
simply moved the problem somewhere else, this module is for you. We've all attended
post-mortems where action items pile up, yet nobody checks whether those items had
any real impact.
Speaker 2: That's why we track deployment metrics and incident trends. Numbers give
us an unbiased view of progress. We'll look at tools like GitHub Insights, JIRA reports
and monitoring dashboards that make collecting this data easier than it sounds.
Speaker 1: We'll also talk about establishing a baseline before changes and how long it
typically takes to see meaningful trends. By the end you'll know which metrics matter,
common pitfalls to avoid and how these measurements help teams improve week after
week.
Speaker 1: Let's start with the basics—DORA metrics. Track deployment frequency, lead
time for changes, change failure rate and mean time to recovery. These reveal how
smoothly code travels from commit to production.
Speaker 2: To gather them, use GitHub Insights or your CI/CD dashboard for
deployment stats and JIRA or ServiceNow for incident logs. Establish a baseline before
you roll out new processes so you can see the effect over time.
Speaker 1: Good values differ by organisation, but watch for high failure rates or long
recovery times. They often hint at inadequate testing or rushed releases. Connect these
numbers to user experience: slower recovery means customers stuck on error pages
longer.
Speaker 2: Don't forget incident counts and severities. Plot everything on a timeline. If
deployments spike but incident severity climbs with them, it might be time to revisit
your quality gates rather than celebrate extra releases.
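Speaker 1: Here's roughly what that data looks like before it reaches a dashboard—a Python sketch over an invented deployment log, computing two of the DORA numbers we just mentioned.

```python
from datetime import date

# Hypothetical deployment log: (deploy date, did this change cause an incident?)
deployments = [
    (date(2025, 7, 1), False),
    (date(2025, 7, 3), True),
    (date(2025, 7, 8), False),
    (date(2025, 7, 10), False),
    (date(2025, 7, 15), False),
]

period_days = (deployments[-1][0] - deployments[0][0]).days or 1
frequency_per_week = len(deployments) / period_days * 7
failure_rate = sum(1 for _, caused_incident in deployments if caused_incident) / len(deployments)

print(f"Deployment frequency: {frequency_per_week:.1f} per week")
print(f"Change failure rate: {failure_rate:.0%}")
```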
Speaker 1: Once you've collected a few sprints of data, compare it to your baseline. Did
your deployment lead time shrink? Are rollbacks less frequent?
Speaker 2: If the numbers improve, highlight them in a quick dashboard demo during
your post-mortems. Showing a trend line dropping from two‑week deployments to
two‑day cycles can convince leadership to keep investing in automation.
Speaker 1: When the metrics move the wrong way, dig into the timeline around each
spike. Maybe a new testing tool slowed the pipeline or a Friday release pattern
correlated with more incidents. Invite the team to suggest fixes rather than assign
blame.
Speaker 2: Present findings to management in plain language: "Our recovery time
increased last month, likely due to rushed hotfixes. We propose adding a staging step."
Real data helps secure approval for those changes and keeps everyone accountable.
Speaker 1: Metrics turn vague promises into measurable progress. They show whether
you're really improving or just churning through tasks.
Speaker 2: Keep your dashboards visible and review them regularly. Patterns often
emerge after a month or two, so be patient. Remember, correlation doesn't imply
causation—but it sure waves its arms to get your attention.
Speaker 1: Watch for gaming. If someone deploys "fix typo" fifty times on Friday just to
boost counts, you're measuring the wrong thing. Balance speed metrics with quality
indicators like change failure rate.
Speaker 2: Use these numbers to justify resources. Showing a 40% drop in recovery time helped one team secure budget for an extra SRE. With clear data, you can pivot quickly when things don't work and celebrate when they do.