Part 04
Alert Correlation
Speaker 1: During a major outage, alerts can pile up faster than anyone can read them.
The dashboard becomes a blur of red.
Speaker 2: Right, and you end up jumping between windows hoping one of them tells
you what's really wrong.
Speaker 1: Instead of playing whack-a-mole, we group related alerts so they read like a
story.
Speaker 2: That story forms the timeline—who did what, when it happened, and which
alerts were just copycats.
Speaker 1: Once you can see the sequence, you stop chasing ghosts and start fixing the
real issue.
Speaker 1: Alert correlation groups related notifications so you aren't chasing separate
fires that stem from the same spark.
Speaker 2: Think about that time the database slowed down and suddenly we had ten
different services throwing errors.
Speaker 1: By viewing them together, we realized it was all one issue and ignored the
noise.
Speaker 2: It also cuts down on false positives. If multiple sensors complain but share
the same timestamp, it's probably one real problem, not ten.
Speaker 1: Correlation keeps us focused on fixing what's broken instead of firefighting
every alert in sight.
Speaker 1: Once the storm settles, gather the alerts, logs, and chat messages to create
a single timeline of the incident.
Speaker 2: Aligning timestamps reveals what triggered what—did the database crash
first, or was it a network blip that snowballed?
Speaker 1: Normalizing time zones can be tricky when teams are spread around the
globe, so double-check the clocks on your servers.
Speaker 2: Any gaps in the timeline show where monitoring was missing or people were
slow to respond, which gives us clear improvement targets.
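Speaker 1: To make that concrete, here's a minimal Python sketch of pulling events from different time zones into one UTC-sorted timeline—the events, timestamps, and zone names are invented for illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# Hypothetical events gathered from alerts, logs, and chat, each logged in a local time zone.
events = [
    ("db latency alert fired", "2025-07-14 09:02:11", "America/New_York"),
    ("checkout errors spiked", "2025-07-14 15:03:40", "Europe/Berlin"),
    ("ops chat: restarted pool", "2025-07-14 22:10:05", "Asia/Tokyo"),
]

# Convert everything to UTC before sorting so the order reflects what really happened.
timeline = sorted(
    (
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        .replace(tzinfo=ZoneInfo(tz))
        .astimezone(ZoneInfo("UTC")),
        label,
    )
    for label, ts, tz in events
)

for when, label in timeline:
    print(when.strftime("%H:%M UTC"), "-", label)
```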
Speaker 1: Correlation engines in a SIEM can automatically link alerts by source, host,
or time window, saving hours of manual sorting.
Speaker 2: Tools like Splunk or QRadar let you write rules that spot cascading failures
or repeated login errors across servers.
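Speaker 1: If you don't have a SIEM handy, the core idea still fits in a few lines of Python—this sketch simply groups alerts that fire within a five-minute window, using made-up alert data; real correlation engines layer host and service matching on top of the same principle.

```python
from datetime import datetime, timedelta

# Hypothetical alerts: (timestamp in UTC, host, message).
alerts = [
    (datetime(2025, 7, 14, 13, 2), "db01", "connection pool exhausted"),
    (datetime(2025, 7, 14, 13, 3), "web03", "upstream timeout"),
    (datetime(2025, 7, 14, 13, 4), "web07", "upstream timeout"),
    (datetime(2025, 7, 14, 14, 30), "mail01", "queue backlog"),
]

WINDOW = timedelta(minutes=5)

# Walk the alerts in time order and start a new group whenever the gap exceeds WINDOW.
groups, current = [], []
for alert in sorted(alerts):
    if current and alert[0] - current[0][0] > WINDOW:
        groups.append(current)
        current = []
    current.append(alert)
if current:
    groups.append(current)

for number, group in enumerate(groups, start=1):
    print(f"Incident candidate {number}: {[message for _, _, message in group]}")
```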
Speaker 1: Don't forget the human side. Chat exports from Slack or Teams show who
ran commands and when.
Speaker 2: ServiceNow tickets, GitHub issues, and even quick screenshots all feed the
timeline so post-mortems have solid evidence to reference later.
Speaker 1: When you line up the alerts with actions and outcomes, a clear story
emerges about what actually happened.
Speaker 2: That story shows where your monitoring shines and where it falls short,
setting the stage for better prevention next time.
Speaker 1: Correlation and timelines aren't busywork—they're your map for continuous
improvement and faster resolutions.
Speaker 2: Keep an eye on metrics like mean time to resolution or how often you
mislabel alerts. If those numbers drop, you know your correlation efforts are paying off.
Communicating Outcomes
Speaker 1: Once the dust finally settles—hopefully not literally falling from the ceiling
onto your keyboard—it's tempting to simply move on.
Speaker 2: We all want to forget the chaos, but if we don't discuss what happened,
those same mistakes sneak back up on us.
Speaker 1: This segment walks through why sharing the post-mortem results matters,
how to keep different teams in the loop, and a few ways to make updates stick.
Speaker 2: Think of it as sweeping up the debris, labeling the bags, and setting them
out so everyone can recycle the lessons. Plus, a little openness keeps future dust from
piling up again.
Speaker 1: The first message after a major incident should be short and clear. Think
"Our database server went down at 1 pm, we restored service by 2 pm, here's what
happened."
Speaker 2: Attach or link to the post-mortem document so anyone who needs the gritty
details can read them later.
Speaker 1: Send a similar summary to leadership, but highlight the business impact,
such as how many users were affected, and outline next steps.
Speaker 2: For customers or non-technical audiences, focus on reassurance—we're
monitoring closely and will share updates each week until all fixes are deployed.
Speaker 1: Sharing in different channels—email, Slack, or a status page—keeps
everyone aligned and sets expectations for follow-ups.
Speaker 1: Once a post-mortem wraps, every action item needs a single owner and a
real deadline.
Speaker 2: Stick it in ServiceNow or a GitHub issue—somewhere visible—so it can't
quietly tumble down a crack in the floor.
Speaker 1: A good ticket includes the action description, an assignee and the target
date, like [OUT-123] Update database config – Owner: Lee, Due: 2025-08-01.
Speaker 2: Bring these items to daily stand-ups or weekly meetings. If progress stalls
for a week, escalate early.
Speaker 1: Cross each item off as it's completed, then pat yourself on the back before
the next one tries to slip away. It's amazing how slippery those tasks can be.
Speaker 1: Communication doesn't end once the meeting wraps up.
Speaker 2: Keep posting updates in the same ticket or chat thread so there's a single
place to see progress.
Speaker 1: When a fix is deployed, share a quick note such as: "Patch deployed to
production at 22:00 UTC. Monitoring looks good."
Speaker 2: At the next post-mortem or weekly review, run through any unfinished items
and highlight improvements backed by metrics.
Speaker 1: If tasks stall, escalate or reassign them so they don't linger forever. Closing
the loop shows you're serious about follow-through and prevents lingering action items
from becoming new incidents. A quick thank-you also goes a long way in keeping
momentum high.
Speaker 1: When everyone knows the outcome and who's responsible for each fix,
those improvements actually stick.
Speaker 2: Documenting the steps in a shared place and checking back on them shows
professionalism and builds confidence in the process.
Speaker 1: It also prevents the dreaded "So whatever happened with that outage?"
question from upper management.
Speaker 2: Remind owners about upcoming deadlines and escalate only when
absolutely necessary—a friendly nudge usually does the trick.
Speaker 1: Open communication turns an unpleasant incident into an opportunity to
improve and keeps you from repeating history.
Speaker 2: Soon you'll have a track record of completing action items, which is the best
proof that the post-mortem process works. So keep sharing updates, celebrate
completed tasks and watch the trust in your team grow.
Kaizen and Corrective Actions
Speaker 1: Ever notice how IT teams love saying "continuous improvement" at
meetings? It's usually right after something breaks for the third time this month.
Speaker 2: Exactly. That phrase isn't just buzz—it's rooted in Kaizen, the practice of
making tiny fixes every day so problems never build up.
Speaker 1: We'll explore what Kaizen looks like in real life, from tweaking scripts to
updating documentation before issues escalate.
Speaker 2: Then we'll contrast it with corrective actions, which kick in when an outage
or audit exposes a larger flaw.
Speaker 1: By the end you'll see how Kaizen habits reduce emergencies and how
corrective actions reinforce those habits when bigger issues surface.
Speaker 1: Kaizen is the opposite of a grand overhaul. It shows up in small acts, like
adding a common support question to the FAQ instead of answering it five times a day.
Speaker 2: Because these tweaks are tiny, they carry little risk and rarely need lengthy
approval.
Speaker 1: Sarah used to spend ten minutes each morning checking backups. She
wrote a two-line script to email the status, saving an hour a week for the team.
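Speaker 1: The real thing was only a couple of lines; fleshed out with imports, it looks roughly like this Python sketch—the log path, addresses, and mail server are placeholders for whatever your environment uses.

```python
import smtplib
from email.message import EmailMessage
from pathlib import Path

# Placeholder path: grab the tail of last night's backup log.
status = Path("/var/log/backup/last_run.log").read_text()[-2000:]

msg = EmailMessage()
msg["Subject"] = "Nightly backup status"
msg["From"] = "backups@example.com"      # placeholder sender
msg["To"] = "ops-team@example.com"       # placeholder team alias
msg.set_content(status)

# Placeholder mail relay; swap in your own SMTP host.
with smtplib.SMTP("mail.example.com") as smtp:
    smtp.send_message(msg)
```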
Speaker 2: Kaizen can also mean tidying up scripts after deployments—finally deleting
those "//TODO: fix this horrible hack" comments from 2019.
Speaker 1: Managers gather suggestions during stand-ups and track them on an
improvement list so progress is visible.
Speaker 2: Mature teams see five to fifteen Kaizen wins per person each month, freeing
up time for bigger challenges.
Speaker 1: Corrective actions kick in when something serious happens, like the email
server crashing at 3am or an auditor flagging a missing approval.
Speaker 2: First we investigate to uncover the root cause, then plan the fix, assign an
owner and set a deadline.
Speaker 1: Our last email outage came from an expired certificate. The corrective
action added monitoring and a thirty‑day renewal reminder.
Speaker 2: After the fix goes in, we verify it worked and record the evidence in tools
such as ServiceNow.
Speaker 1: Because these steps are formal, they usually require management sign-off
and extra documentation so nothing slips through the cracks.
Speaker 2: They take more time than Kaizen, but they keep serious problems from
repeating. A common pitfall is treating symptoms instead of root causes or skipping
verification.
Speaker 1: So Kaizen is like eating your vegetables, and corrective actions are like
taking medicine when you're sick?
Speaker 2: Exactly. Kaizen keeps the system healthy day to day, while corrective
actions cure the nasty surprises.
Speaker 1: Teams track Kaizen ideas on an improvement list and review them at
weekly stand‑ups. Corrective actions get their own tickets with deadlines and
verification steps.
Speaker 2: One team posts its "Kaizen wins" on a dashboard—they average a dozen
small improvements each month and cut incident tickets by forty percent over six
months.
Speaker 1: The key is making sure the two approaches complement each other. If a
corrective action reveals a process gap, spin off related Kaizen tasks.
Speaker 2: That workflow lines up with ITIL and DevOps practices: continuous
improvement feeds the pipeline and corrective actions keep it honest.
Speaker 1: We've seen how Kaizen builds momentum through tiny, everyday tweaks,
while corrective actions handle the emergencies that still sneak through.
Speaker 2: The real trick is measuring both: track how many Kaizen ideas are
implemented and check whether each corrective action actually prevents a repeat
incident.
Speaker 1: Mature teams log at least a handful of Kaizen items per person each month
and review them alongside open corrective actions to spot patterns.
Speaker 2: When teams invest a little time each week in improvement, they learn new
skills and spend less effort explaining why the same thing broke again.
Speaker 1: Blending these approaches creates a culture that prizes prevention and
quick recovery—a combination that keeps services reliable and people motivated.
Speaker 2: Stick with it, and that balance turns endless firefighting into predictable
improvement.
Log Analysis and Git Blame
Speaker 1: Logs are the storybook of every application. Each entry marks what
happened and at what severity level.
Speaker 2: Right, without them we'd be guessing whether a failure came from a bad
deployment or a hardware hiccup.
Speaker 1: Remember the checkout bug we chased for days? The logs finally showed
payment timeouts minutes after a database lock warning.
Speaker 2: Once we lined those timestamps up, the cause was obvious, and we avoided
hours of finger-pointing.
Speaker 1: That's why we dig through logs even when it's tedious. They turn hunches
into hard evidence and help us spot patterns early.
Speaker 2: We've all stared at a wall of red ERROR messages at 2 AM wondering where
to even begin.
Speaker 1: When the app is small, a quick grep for "ERROR" usually does the trick.
Speaker 2: But how do you even know where to start looking in a 10GB log file? Once you have several services, tools like Elastic or Splunk become essential.
Speaker 1: They let you search structured fields and follow one request across many
logs.
Speaker 2: We tag every entry with a request ID so we can trace a user's journey end to
end.
Speaker 1: Dashboards help spot patterns too. A jump in WARN logs might reveal
memory pressure before anything crashes.
Speaker 2: Whatever tool you use, keep enough history so you can go back and learn
from incidents.
Speaker 1: `git blame` shows who last touched a line, but that alone doesn't prove
responsibility.
Speaker 2: The author may have been fixing someone else's bug or working with
incomplete specs.
Speaker 1: When we spot a risky change, we message the contributor and ask what problem they were solving.
Speaker 2: Usually we uncover useful context, like a last-minute hotfix that forced a quick decision.
Speaker 1: Remember when we blamed the database for three hours before realizing it was a typo in the config? Asking first would have saved us the detour.
Speaker 1: We also use options like `-w` to ignore whitespace or `-C` to track code
moved between files.
Speaker 2: Used kindly, blame provides insight without turning conversations into witch
hunts.
Speaker 1: Here's a typical investigation. It can feel stressful when production is down, so having a checklist keeps everyone calm. We start by scanning the logs for errors like "database connection timeout".
Speaker 2: If network metrics look normal, we examine recent commits that touched
the connection pool.
Speaker 1: Running git blame on that section shows who adjusted the pool size.
Speaker 2: Instead of blaming them, we ask what issue they were trying to solve and if
it's still relevant.
Speaker 1: Together we test new settings, update the documentation, and note
everything in the ticket.
Speaker 2: Saving those logs and discussions means the next team understands why
we made each change.
Speaker 1: Logs show what happened, and blame hints at who changed the code and
why.
Speaker 2: Used together, they help us resolve issues quickly without turning the
post‑mortem into a witch hunt.
Speaker 1: We also respect privacy, follow retention rules and document lessons
learned.
Speaker 2: That builds trust. People feel safe admitting mistakes, so the whole team
improves.
Speaker 1: The goal isn't to catch someone out; it's to make the system stronger after
each incident.
Speaker 2: Treat logs and blame as tools for insight, not weapons, and they'll guide
your career as much as your code.
Managing Emotions
Speaker 1: Even the calmest engineers can get defensive after a sleepless night
responding to an outage.
Speaker 2: We've all been there—you're exhausted, adrenaline is fading and suddenly
every question feels like an accusation.
Speaker 1: A solid plan for managing emotions keeps the conversation focused on
learning, not finger‑pointing, no matter how stressed the team feels.
Speaker 2: We'll also look at how cultural expectations shape those reactions so you
can lead inclusive post‑mortems that help your career as much as the codebase.
Speaker 1: Picture this—it's Black Friday and the payment gateway crashes right before
thousands of customers hit "buy".
Speaker 2: The pressure is sky‑high and everyone's worried about being singled out.
That fear can lead people to keep quiet about missing monitors or shortcuts taken
during the rush.
Speaker 1: By calling out those emotions early—"I know we're all tense"—you
encourage honesty and stop the blame game before it starts.
Speaker 2: The more open the discussion, the faster you dig up the real causes and
move toward solutions.
Speaker 1: Remember, we're diffusing tension, not defusing bombs—though sometimes
it feels similar!
Speaker 2: If frustration flares, try repeating back what you heard: "So you're worried
the rollback script failed?" That shows you get it without blaming anyone.
Speaker 1: Suggest a quick stretch break when voices rise. People come back calmer
and ready to listen.
Speaker 2: Encourage phrases like "I felt rushed" or "I was confused" instead of "You
messed up". Those small tweaks keep the discussion productive.
Speaker 1: I once worked with a Japanese developer who barely spoke during
post‑mortems, even when he had the missing puzzle piece.
Speaker 2: That's common in cultures where disagreement can feel disrespectful. We
started doing short one‑on‑one chats afterward and paired him with a mentor who
modelled feedback.
Speaker 1: After a few weeks he was comfortable explaining issues in the group. His
insights saved us from repeating mistakes.
Speaker 2: The key is setting ground rules that welcome respectful critique and
adapting your style so everyone feels safe speaking up.
Speaker 1: When voices get loud or the chat blows up, it's tempting to play referee.
Speaker 2: A quick reset works better. Try the NAME framework—Notice what's
happening, Acknowledge the emotion, Move forward to the facts, and Engage everyone
in solutions.
Speaker 1: In remote meetings it can be as simple as "I can see this is frustrating. Let's
take a minute, then focus on what we control." Sometimes a short break is all it takes
to cool heads.
Speaker 1: A good facilitator keeps the group focused on improvement rather than
blame.
Speaker 2: They might start by sharing a quick emotional check‑in—"Green, yellow, or
red?"—so everyone gauges the mood.
Speaker 1: Jot down emotional cues next to the facts. If someone looks uneasy, offer a
one‑on‑one follow‑up.
Speaker 2: Ground rules like "assume good intent" and "speak from your own
experience" help new team members, especially across cultures.
Speaker 1: These skills translate directly to leadership roles where guiding difficult
conversations is part of the job description.
Speaker 1: Handling emotions and cultural differences takes practice, not just theory.
Speaker 2: When you approach post‑mortems with empathy and clear expectations, the
focus stays on learning rather than blame.
Speaker 1: Those habits pay off in your career too—leaders who navigate tough
conversations calmly are trusted with bigger challenges.
Speaker 2: Keep refining these skills and you'll turn every incident into an opportunity
for growth, both for the system and for yourself.
Metrics To Monitor
Speaker 1: Measuring outcomes is how we prove our process works. Without numbers
it's just opinion.
Speaker 2: Exactly. Leadership wants to see trends like recovery times getting shorter,
not just hear that we "did better".
Speaker 1: We'll focus on MTTR, recurrence rates and whether action items actually get
done.
Speaker 2: Tracking these may sound tedious, but it quickly shows if fixes stick or if the
same outages come back.
Speaker 1: MTTR stands for Mean Time to Recovery. It's the average duration from
detection to full service restoration.
Speaker 2: Recurrence rate tracks how often the same type of incident reappears within
a set period.
Speaker 1: Action item completion ratio shows what percentage of agreed fixes were
actually carried out.
Speaker 2: We also watch the age of open tasks so nothing lingers forever.
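Speaker 1: If you want to see the arithmetic, here's a minimal Python sketch over made-up incident records—the field layout and numbers are purely illustrative.

```python
from datetime import datetime

# Hypothetical incidents: (detected, restored, category, actions_done, actions_agreed)
incidents = [
    (datetime(2025, 6, 1, 13, 0),  datetime(2025, 6, 1, 14, 0),  "db-outage",   3, 3),
    (datetime(2025, 6, 9, 2, 15),  datetime(2025, 6, 9, 5, 45),  "cert-expiry", 1, 2),
    (datetime(2025, 6, 20, 9, 30), datetime(2025, 6, 20, 10, 0), "db-outage",   2, 2),
]

# Mean Time to Recovery: average of (restored - detected), shown here in minutes.
mttr_minutes = sum((r - d).total_seconds() for d, r, *_ in incidents) / len(incidents) / 60
print(f"MTTR: {mttr_minutes:.0f} min")

# Recurrence rate: incidents whose category already appeared earlier in the period.
categories = [c for _, _, c, _, _ in incidents]
repeats = sum(1 for i, c in enumerate(categories) if c in categories[:i])
print(f"Recurrence rate: {repeats / len(incidents):.0%}")

# Action item completion ratio across all post-mortems in the period.
done = sum(record[3] for record in incidents)
agreed = sum(record[4] for record in incidents)
print(f"Action items completed: {done / agreed:.0%}")
```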
Speaker 1: ServiceNow and Jira both let you export incident data straight into reports.
Speaker 2: For smaller teams, a shared spreadsheet works fine, as long as someone
updates it each week.
Speaker 1: Grafana or Kibana are great for graphing MTTR trends alongside deployment
metrics.
Speaker 2: Whatever tool you pick, make sure it can export to CSV for audits and
compliance.
Speaker 1: Numbers only help if you act on them. Review trends monthly and ask why
any spike occurred.
Speaker 2: If the same issue recurs, escalate and revisit your fixes—maybe a root cause
was missed.
Speaker 1: Celebrate when MTTR drops or when all action items close on time. Share
those wins widely.
Speaker 2: And when tasks slip, discuss roadblocks in stand-ups and adjust priorities
rather than ignoring them.
Speaker 1: Consistently measured metrics turn one-off fixes into lasting improvements.
Speaker 2: When leadership sees recovery times shrinking and fewer repeat incidents,
support for your process grows.
Speaker 1: Keep tracking completion rates so action items don't fade away once the
spotlight moves on.
Speaker 2: The data speaks for itself; use it to drive decisions and keep refining your
operations.
Post-Mortem Agenda
Speaker 1: Post-mortems can drift into rambling war stories if no one sets a clear
agenda. A simple structure keeps the conversation productive and short.
Speaker 2: In this segment we'll outline a repeatable agenda that teams of any size can
follow. You'll learn who should be in the room and what each person contributes.
Speaker 1: We'll also cover how thorough documentation helps future investigators
understand what went wrong and why. By the end you'll have a framework you can
adapt to your own organisation.
Speaker 2: You might think it's overkill to formalise a meeting about one incident, but
the agenda keeps everyone focused on facts instead of opinions. Without it, people
tend to jump around the timeline or get stuck debating blame.
Speaker 1: Begin every post-mortem by walking through a concise timeline of events.
Note when alerts fired, when the first responder acknowledged them and when service
was restored. This sets a factual foundation and keeps speculation in check.
Speaker 2: After the timeline, cover the business impact—how many users were
affected and what the cost might be. Then move on to root cause analysis using a
structured method like five whys or a fishbone diagram.
Speaker 1: Capture action items as they're discussed and assign owners on the spot.
End the meeting by confirming due dates and when follow-up reviews will happen. A
typical agenda can wrap up in under 30 minutes when everyone sticks to these steps.
Speaker 1: A successful post-mortem needs the right mix of perspectives. Incident
responders bring the technical details and know exactly what they tried in the heat of
the moment.
Speaker 2: Service owners speak to business impact and can decide which fixes are
worth prioritising. A facilitator keeps the conversation moving and makes sure quieter
voices are heard.
Speaker 1: You'll also want a scribe to capture notes and action items in a shared
document or ticketing system. Finally, invite at least one stakeholder from the business
side so the discussion stays grounded in user impact rather than just technical details.
Speaker 1: Documentation is often the most overlooked part of a post-mortem. Use a
central template so every incident report looks the same. Include the timeline, root
cause details, impact assessment and action items.
Speaker 2: Link your ServiceNow tickets or JIRA issues, plus any GitHub commits or pull
requests that contain the fixes. This makes it easy for future teams to trace the history
if something resurfaces.
Speaker 1: Add metrics like MTTR and number of affected users. Summarise lessons
learned in plain language so they can be reused in training or onboarding. Finally, store
the report in a shared location and announce it to the team so knowledge spreads
beyond those who attended the meeting.
Speaker 1: A consistent agenda turns post-mortems into a learning tool instead of a
finger-pointing session. It ensures every incident gets the same level of scrutiny.
Speaker 2: Documenting who attended and what was decided makes follow-up easier
and helps new team members learn from past mistakes. Keep the reports concise and
accessible so they actually get read.
Speaker 1: With clear roles, a repeatable agenda and solid records, post-mortems
become a catalyst for improvement rather than a dreaded meeting.
Speaker 2: Set a calendar reminder to revisit unresolved action items every quarter.
Nothing kills trust faster than open tasks left hanging indefinitely.
Post-Mortem Culture
Speaker 1: Remember when our payment system crashed last month? The
post-mortem felt like a witch hunt.
Speaker 2: Right! That's exactly what we want to avoid. Today we'll learn how to turn
failures into learning opportunities instead of finger-pointing sessions. We'll practice
staying curious about how our process let the issue happen so we can fix it together.
Speaker 1: Psychological safety means no one gets punished for admitting an honest
mistake.
Speaker 2: Exactly! When teams feel safe, they surface the real issues quickly.
Remember how Sarah skipped the deployment checklist? Instead of firing her, we
improved the process so it's impossible to skip.
Speaker 1: Think of it like a confession booth for code—people need to feel safe
admitting their digital sins.
Speaker 1: When something breaks, the first instinct is often "Who messed up?".
Speaker 2: But blaming shuts the conversation down. Instead of asking "Why did you
delete the database?" try "What led you to run that command?".
Speaker 1: Great example! We focus on how the process allowed the mistake, not who
pushed the button. The goal is debugging the system, not making someone cry.
Speaker 1: Invite everyone who was involved—engineers, managers, and even
customer support.
Speaker 2: Right, because each role sees a different part of the picture. Junior devs
might notice missing tests, while support teams capture real user impact.
Speaker 1: And drawing out quiet participants makes sure the action items reflect
reality, not just the loudest voices.
Speaker 1: Let's start every post-mortem by reviewing the ServiceNow ticket and the
exact timeline of events.
Speaker 2: Then we map each step to the ITIL incident-management flow and dig into
root causes with the five-whys technique.
Speaker 1: Document action items as GitHub issues so we can track them. DORA
metrics like MTTR show if our fixes actually work.
Speaker 1: One big pitfall is rushing to solutions before we really understand the
problem.
Speaker 2: Absolutely. Another is letting one "hero" take all the blame or glory. We
need the whole team learning, not just one person.
Speaker 1: And of course the blame game spiral—once finger-pointing starts, people
shut down and hide information.
Speaker 1: Here's a scenario: the website crashes right after a big marketing blast.
Speaker 2: I'd pull in the on-call engineer, the database admin, and support to map the
timeline. Then we'd ask what in our process allowed the traffic spike to take us down.
Speaker 1: Exactly. Keep the questions neutral so we discover the real gaps instead of
assigning blame.
Speaker 1: When tensions rise, try saying, "Help me understand what led up to this"
instead of "Who did it?"
Speaker 2: Right. Redirect accusations toward the workflow. Ask, "What monitoring
failed us?" or "What review step was missing?"
Speaker 1: Inviting quiet voices with "Anything we missed from your side?" keeps
everyone engaged and prevents defensiveness.
Speaker 2: Over time, using phrases like these shows you're ready for leadership roles
because you focus on improving the system, not blaming people.
Speaker 1: To dig deeper, check out Google's post-mortem template and Amy
Edmondson's book *The Fearless Organization*.
Speaker 2: We also have ServiceNow guides and a DORA metrics cheat sheet linked in
the notes. Use them to strengthen your next post-mortem.
RCA Frameworks
Speaker 1: We've spent time exploring how to hold post-mortems without pointing
fingers. Now we need a toolkit for digging into the technical reasons behind failures.
We'll cover two proven methods: the Five Whys and the fishbone diagram. Each
provides a step-by-step path to go beyond symptoms and uncover the underlying
system weaknesses.
Speaker 2: Using a framework keeps the conversation grounded in evidence rather than
opinions. Everyone participates by examining facts, which builds a shared
understanding of the incident. These methods also help with documentation, since the
process itself guides what to record. We'll see how they fit into the overall post-mortem
workflow and how you can apply them in your own environment.
Speaker 1: Without a framework, teams often jump straight to a fix and miss patterns
hiding in the data. When we slow down and follow a structured approach, every step
requires evidence. This keeps the analysis objective and prevents the conversation
from veering into blame or guesses.
Speaker 2: Frameworks also save time over the long run because they are repeatable.
If each incident is analysed differently, new members struggle to learn. By following a
shared checklist, we build a knowledge base of root causes, which in turn improves
reliability metrics like MTTR. Studies show companies using consistent RCA methods cut
repeat incidents by over fifty percent.
Speaker 1: The Five Whys method begins with the problem statement. Let's say a
website went offline. We ask, "Why was the site down?" The first answer might be "The
database became unreachable." We then ask, "Why was the database unreachable?"
Maybe "Because a deployment script changed the network settings." The next why digs
deeper: "Why did the script change them?" Because there was no peer review before
the deploy. "Why wasn't there a review?" Because our automation pipeline doesn't
enforce it. At the fifth why we discover the real issue: the pipeline needs a mandatory
approval step.
Speaker 2: The trick is not to stop after the second or third why. Each answer should be
backed by evidence so we avoid speculation. It's easy to fall into solution mode too
early, but the point is to expose the hidden weakness in the process. Document each
question and answer chain so others can follow the logic later.
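Speaker 1: One lightweight way to record that chain is a simple structure in your post-mortem notes—this Python sketch just mirrors the example above, and the evidence labels are placeholders for whatever artefacts you actually collected.

```python
# Each entry records the question, the answer, and the evidence behind it.
five_whys = [
    ("Why was the site down?", "The database became unreachable.", "uptime probe + DB logs"),
    ("Why was the database unreachable?", "A deployment script changed the network settings.", "deploy log"),
    ("Why did the script change them?", "There was no peer review before the deploy.", "PR history"),
    ("Why wasn't there a review?", "The automation pipeline doesn't enforce it.", "pipeline config"),
    ("Why doesn't the pipeline enforce it?", "No mandatory approval step has been added.", "root cause"),
]

for number, (question, answer, evidence) in enumerate(five_whys, start=1):
    print(f"Why #{number}: {question}")
    print(f"   -> {answer} (evidence: {evidence})")
```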
Speaker 1: A fishbone diagram looks like the skeleton of a fish, with the issue at the
head and major cause categories branching off the spine. Common categories include
People, Process, Technology, Environment, Materials, and Methods. For each branch
you list possible contributing factors. In an IT context, the "Technology" branch might
include network configuration, while "Process" could reveal gaps in change
management.
Speaker 2: This technique works well when failures have several intertwined causes.
The visual layout helps teams brainstorm systematically without losing track of ideas.
As you fill in the branches, patterns emerge that highlight where to investigate first.
Draw the diagram on a whiteboard or in collaboration software so everyone can
contribute. It's also a good way to record findings for future reference.
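Speaker 2: If a whiteboard isn't handy, even a plain dictionary works as a starting fishbone—the categories come from the list above, and the issue and factors here are invented examples.

```python
# The issue sits at the head of the fish; each category branch collects candidate causes.
issue = "Checkout service returned 500 errors for 45 minutes"
fishbone = {
    "People": ["on-call engineer unfamiliar with the payment stack"],
    "Process": ["change shipped without a review step"],
    "Technology": ["DB connection pool sized for last year's traffic"],
    "Environment": ["marketing campaign tripled normal load"],
    "Materials": [],
    "Methods": ["no load test in the release checklist"],
}

print(f"Issue: {issue}")
for category, factors in fishbone.items():
    print(f"  {category}: {', '.join(factors) if factors else '(none identified yet)'}")
```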
Speaker 1: So how do you decide which tool to use? Start with the Five Whys if the
problem seems to follow a single chain of events. It's fast and requires nothing more
than a whiteboard. If the conversation stalls or new branches appear, switch to a
fishbone diagram to capture the wider context. Sometimes you'll use both: the fishbone
to map categories, then a Five Whys on each branch.
Speaker 2: Keep in mind that no tool fits every situation. Highly complex or political
issues may need a formal investigation beyond these techniques. Always document the
questions asked, the evidence gathered, and the conclusions. That record becomes part
of your post-mortem notes, and it helps the next team understand how you arrived at
the root cause.
Speaker 1: Whether you prefer the simplicity of Five Whys or the visual power of a
fishbone diagram, the goal is the same: uncover the underlying cause so you can fix it
for good. Treat each incident as a chance to strengthen your system, not as a failure to
hide. When done consistently, these frameworks build a culture of learning.
Speaker 2: Make it a habit to share your findings with the whole team and to track the
resulting action items. Over time you'll see patterns in your incident trends and you'll
develop a more robust improvement process. Mastering these analysis techniques is a
valuable skill for any IT professional who wants to lead problem management efforts.
RCA in ServiceNow and GitHub
Speaker 1: We've talked about how to run a blameless post‑mortem, but what happens
to those findings afterward? We've all seen post‑mortems that become "post‑mortem"
themselves—dead and buried in someone's email within a week.
Speaker 2: Exactly. The best place for that information is a ticketing system like
ServiceNow. A problem record stores the timeline, contributing factors, and any
workarounds so nothing slips through the cracks.
Speaker 1: From there we link follow‑up work in GitHub. Today we'll walk through that
flow so you can turn lessons learned into real improvements.
Speaker 1: When you create a problem record in ServiceNow, start with a short title
that hints at the business impact. Then lay out a clear timeline, the contributing factors,
and any workarounds discovered during the incident.
Speaker 2: Link related incident tickets and the change request that eventually fixes
the issue. A good record might read, "Checkout failure during Black Friday—DB
connection pool exhausted; manual order processing used until 4:30 PM."
Speaker 1: Assign an owner, set a target date, and capture the final resolution.
Managers appreciate the audit trail, and the team can easily revisit the record when a
similar issue crops up.
Speaker 1: After the problem record is in place, open a GitHub issue for each
improvement task. A helpful title might be "Increase DB connection pool size -
PRB0001234," not just "Fix database."
Speaker 2: In the description, reference the ServiceNow ticket and explain the business
impact. That cross-link lets developers see why the work matters without leaving
GitHub. As code changes move through pull requests, mention the issue number so
everything stays connected.
Speaker 1: Once the fix is deployed and verified, close the GitHub issue and update the
ServiceNow record. Now both systems tell the same story.
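Speaker 2: If you want to automate that cross-link, the GitHub issues API needs only a few lines—this is a sketch rather than a production script, and the repository name, labels, and token variable are placeholders.

```python
import os
import requests

# Placeholder repository and the ServiceNow problem number from the example above.
REPO = "example-org/checkout-service"
PRB = "PRB0001234"

issue = {
    "title": f"Increase DB connection pool size - {PRB}",
    "body": (
        f"Follow-up from ServiceNow problem {PRB} (Black Friday checkout failure).\n"
        "Business impact: checkout unavailable 2:45-4:30 PM; manual order processing used.\n"
        "See the problem record for the full timeline and contributing factors."
    ),
    "labels": ["post-mortem", "reliability"],  # placeholder labels
}

response = requests.post(
    f"https://api.github.com/repos/{REPO}/issues",
    json=issue,
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},  # placeholder token env var
    timeout=10,
)
response.raise_for_status()
print("Created:", response.json()["html_url"])
```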
Speaker 1: Putting it all together starts with a shared document right after the incident.
Capture timelines, logs, and team observations while the details are fresh.
Speaker 2: Within 24 hours, summarise those findings in a ServiceNow problem ticket
so managers have a clear view. Then create GitHub issues for each action item and link
them back to that ticket.
Speaker 1: During improvement meetings, review the problem record and its linked
issues to check progress. For example, the outage occurred at 2:45 PM, was resolved
by 4:30 PM, and the fix was deployed two days later. Keeping that timeline in one place
ensures nothing from the RCA gets forgotten once the incident fades.
Speaker 1: Integrating your RCA notes with ServiceNow and GitHub keeps everyone on
the same page, from engineers fixing code to managers tracking risk.
Speaker 2: It also helps during audits or handovers because every decision and
follow‑up lives in one place with clear links to the code changes.
Speaker 1: When teams consistently link these systems, improvements actually get
implemented instead of disappearing into a folder.
Speaker 2: That habit turns each incident into documented learning rather than another
"we should totally fix that someday" conversation.
Speaker 1: Plus, seeing past action items accomplished builds trust and motivates the
team to keep improving the process.
Tracking Improvement
Speaker 1: If you've ever wondered whether your quick fix actually solved anything or
simply moved the problem somewhere else, this module is for you. We've all attended
post-mortems where action items pile up, yet nobody checks whether those items had
any real impact.
Speaker 2: That's why we track deployment metrics and incident trends. Numbers give
us an unbiased view of progress. We'll look at tools like GitHub Insights, JIRA reports
and monitoring dashboards that make collecting this data easier than it sounds.
Speaker 1: We'll also talk about establishing a baseline before changes and how long it
typically takes to see meaningful trends. By the end you'll know which metrics matter,
common pitfalls to avoid and how these measurements help teams improve week after
week.
Speaker 1: Let's start with the basics—DORA metrics. Track deployment frequency, lead
time for changes, change failure rate and mean time to recovery. These reveal how
smoothly code travels from commit to production.
Speaker 2: To gather them, use GitHub Insights or your CI/CD dashboard for
deployment stats and JIRA or ServiceNow for incident logs. Establish a baseline before
you roll out new processes so you can see the effect over time.
Speaker 1: Good values differ by organisation, but watch for high failure rates or long
recovery times. They often hint at inadequate testing or rushed releases. Connect these
numbers to user experience: slower recovery means customers stuck on error pages
longer.
Speaker 2: Don't forget incident counts and severities. Plot everything on a timeline. If
deployments spike but incident severity climbs with them, it might be time to revisit
your quality gates rather than celebrate extra releases.
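Speaker 1: Here's roughly what that data looks like before it reaches a dashboard—a Python sketch over an invented deployment log, computing two of the DORA numbers we just mentioned.

```python
from datetime import date

# Hypothetical deployment log: (deploy date, did this change cause an incident?)
deployments = [
    (date(2025, 7, 1), False),
    (date(2025, 7, 3), True),
    (date(2025, 7, 8), False),
    (date(2025, 7, 10), False),
    (date(2025, 7, 15), False),
]

period_days = (deployments[-1][0] - deployments[0][0]).days or 1
frequency_per_week = len(deployments) / period_days * 7
failure_rate = sum(1 for _, caused_incident in deployments if caused_incident) / len(deployments)

print(f"Deployment frequency: {frequency_per_week:.1f} per week")
print(f"Change failure rate: {failure_rate:.0%}")
```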
Speaker 1: Once you've collected a few sprints of data, compare it to your baseline. Did
your deployment lead time shrink? Are rollbacks less frequent?
Speaker 2: If the numbers improve, highlight them in a quick dashboard demo during
your post-mortems. Showing a trend line dropping from two‑week deployments to
two‑day cycles can convince leadership to keep investing in automation.
Speaker 1: When the metrics move the wrong way, dig into the timeline around each
spike. Maybe a new testing tool slowed the pipeline or a Friday release pattern
correlated with more incidents. Invite the team to suggest fixes rather than assign
blame.
Speaker 2: Present findings to management in plain language: "Our recovery time
increased last month, likely due to rushed hotfixes. We propose adding a staging step."
Real data helps secure approval for those changes and keeps everyone accountable.
Speaker 1: Metrics turn vague promises into measurable progress. They show whether
you're really improving or just churning through tasks.
Speaker 2: Keep your dashboards visible and review them regularly. Patterns often
emerge after a month or two, so be patient. Remember, correlation doesn't imply
causation—but it sure waves its arms to get your attention.
Speaker 1: Watch for gaming. If someone deploys "fix typo" fifty times on Friday just to
boost counts, you're measuring the wrong thing. Balance speed metrics with quality
indicators like change failure rate.
Speaker 2: Use these numbers to justify resources. Showing a 40% drop in recovery time helped one team secure budget for an extra SRE. With clear data, you can pivot quickly when things don't work and celebrate when they do.