Mention “alert fatigue” to a SOC analyst. They would immediately recognize what you are talking about. Now, take your time machine to 2002. Find a SOC analyst (far fewer of those around, to be sure, but there are some!) and ask them about alert fatigue; they would definitely understand what the concern is.
Now, crank up your time machine all the way to 11 and fly to the 1970s where you can talk to some of the original NOC analysts. Say the words “alert fatigue” and it is very likely you will see nods and agreement about this topic.
So the most interesting part is that this problem has immense staying power, even though the cybersecurity industry changes quickly. Yet this one problem would be familiar to people doing a similar job 25 years apart. Are we doomed to suffer from this forever?
I think it is a bit mysterious and worth investigating. Join me as we uncover the dark secrets behind this enduring pain.
Why Do We Still Suffer?
First, let’s talk about people’s least favorite question: WHY.
An easy answer I get from many industry colleagues is that we could have easily solved the problem at 2002 levels of data volumes, environment complexity and threat activity. We had all the tools, we just needed to apply them diligently. Unfortunately, more threats, more data, more environments came in. So we have alert fatigue in 2024.
Personally, I think this is a part of why this has been tricky, but I don’t think that’s the entire answer. Frankly, I don’t recall any year during which this problem was considered close to being solved, shrill vendor marketing notwithstanding. The early SIM/SEM vendors in the late 1990s (!) promised to solve the alert fatigue problem. At the time, these were alerts from firewalls and IDS systems. The problem was not solved with the tools of the time, and then not solved again with better tools and better scoring. I suspect that throwing the best 2024 tools at 2002 levels of alerts would in fact solve it, but this is just a theoretical exercise…
Have false positive (FP) rates increased? I frankly don’t know and don’t have a gut feel here. In theory, they should have decreased over the last 25 years, if we believe that security technology is improving. Let me know if anybody has data on this, but any such data set would include a lot of apples-to-oranges comparisons (1998 NIDS vs 2014 EDR vs 2024 ADR, anybody?).
Some human (or was it a bot? definitely a bot!) suggested that our fear of missing attacks is driving FPs up. Perhaps this is also a factor behind high FP rates. If you could kill 90% of FPs at the cost of increasing FNs by 10%, would you take that trade? After all, just one new FN (i.e. a real intrusion not detected) may mean that you are owned…
Manual processes persisting at many SOCs meant that even a 2002 volume of alerts would have overrun them, but back then they hired and covered the gap. Then alert volumes increased with IT environment (and threat) growth, and they were no longer able to hire fast enough (or transform to a better, automation-centric model).
More tools that are poorly integrated probably contributed to the problem not being solved. IDS was the sole problem child of the late 1990s. Later, this expanded and evolved to EDR, NDR, CDR and other *DR, as well as lots of diverse data types flowing into the SIEM.
All in all, I am not sure there is one factor that explains why “alert fatigue” has been a thing for 25+ years. We are where we are.
Where are we exactly?
Some [Bad] Data
With the help of a new Gem-based agent, I collected a lot of data on alert fatigue, and let me tell you… based on the data, it is easy to see why we struggle. A lot of the “data” is befuddling, conflicting and useless. Examples (a mix of good and bad; your goal is to separate the two):
“70% of SOC teams are emotionally overwhelmed by the volume of security alerts” (source)
“43% of SOC teams occasionally or frequently turn off alerts, 43% walk away from their computer, 50% hope another team member will step in, and 40% ignore alerts entirely.” (source)
“55% of security teams report missing critical alerts, often on a daily or weekly basis, due to the overwhelming volume.” (source)
“A survey found that SOC teams deal with an average of 3,832 alerts per day, with 62% of those alerts being ignored.” (source)
“56% of large companies (with over 10,000 employees) receive 1,000 or more alerts per day.” (source)
“78% of cybersecurity professionals state that, on average, it takes 10+ minutes to investigate each alert” (source)
“Security analysts are unable to deal with 67% of the daily alerts received, with 83% reporting that alerts are false positives and not worth their time.” (source)
In brief, the teams are flooded with alerts, leading to burnout and pain. While the exact figures vary across studies (like, REALLY, vary!), a pattern emerges: teams are overwhelmed by the volume of alerts, and often a majority of them are false. The data barely teaches us anything else…
What Have We Tried?
The problem persists, but awareness of the problem is as old as the problem itself (see the hypothetical 2002 SOC analyst conversation above). In this section, let’s go quickly through the methods we’ve tried, largely unsuccessfully.
First, we tried aggregation. Five (50? 500? 5,000? 5 gazillion?) alerts of type such-and-such get duct-taped together and shipped off to pester a human. That clearly did not solve the problem. Don’t get me wrong, aggregation helps. But clearly this 1980s trick has not fixed alert fatigue.
Then we tried correlation, where we logically relate and group alerts, assign a priority to the “correlated event” (ah, so 2002!) and then give it to an analyst. Nah, that didn’t do it either.
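To make the first two tricks concrete, here is a minimal Python sketch of what aggregation and correlation boil down to. The Alert shape and field names are made up for illustration, not any product’s schema:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Alert:
    rule: str      # detection rule that fired, e.g. "brute_force_ssh"
    host: str      # affected asset
    severity: int  # 1 (low) .. 5 (critical)

def aggregate(alerts: list[Alert]) -> dict[tuple[str, str], int]:
    """Aggregation: duct-tape N identical (rule, host) alerts into one line with a count."""
    counts: dict[tuple[str, str], int] = defaultdict(int)
    for a in alerts:
        counts[(a.rule, a.host)] += 1
    return dict(counts)

def correlate(alerts: list[Alert]) -> list[dict]:
    """Correlation: group logically related alerts (here, naively, by host) into one
    'correlated event' and give it a priority taken from its worst member."""
    by_host: dict[str, list[Alert]] = defaultdict(list)
    for a in alerts:
        by_host[a.host].append(a)
    return [
        {"host": host,
         "rules": sorted({a.rule for a in group}),
         "priority": max(a.severity for a in group)}
        for host, group in by_host.items()
    ]
```

Both tricks shrink the pile a human sees; neither fixes a detection that should not have fired in the first place.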
We also tried filtering, both of what goes into the system that produces alerts (input filtering: just collect less telemetry) and of the alerts themselves (output filtering: just suppress these alerts).
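A minimal sketch of both flavors of filtering, again with hypothetical event types and suppression keys:

```python
# Input filtering: drop telemetry before it ever reaches the detection logic.
NOISY_EVENT_TYPES = {"dns_query", "netflow_summary"}  # hypothetical noisy sources

def input_filter(events):
    return (e for e in events if e["type"] not in NOISY_EVENT_TYPES)

# Output filtering: suppress alerts after detection, e.g. known-noisy rule/host pairs.
SUPPRESSIONS = {("after_hours_login", "build-server-01")}  # hypothetical suppression list

def output_filter(alerts):
    return (a for a in alerts if (a["rule"], a["host"]) not in SUPPRESSIONS)
```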
We obviously tried tuning, i.e. carefully turning off alerts for cases where such an alert is false or unnecessary. This has evolved into one of the least popular pieces of advice in security ops (“just tune the detections” is up there with “just patch faster” and “just zero trust it”).
We tried, and are still trying, many types of enrichment, for cases where alerts are deemed extra fatigue-inducing because context is missing. Various automation was used to add things to alerts: an IP became a system name, which became an asset role/owner; past history was added, and a lot of other things (hi, 2006 SIEM Vendor A). Naturally, enrichment on its own does not solve anything, but it reduces fatigue by letting machines do more of the work.
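A rough sketch of what such enrichment automation looks like; the lookup tables and field names are invented for illustration (in real life they come from a CMDB, asset inventory, IAM system, ticketing history and so on):

```python
# Hypothetical context sources.
ASSET_DB = {
    "10.0.4.7": {"hostname": "fin-db-01", "role": "finance database", "owner": "dba-team"},
}
ALERT_HISTORY_30D = {"fin-db-01": 3}  # prior alerts on this asset in the last 30 days

def enrich(alert: dict) -> dict:
    """Turn a bare IP into a system name, asset role/owner and past history,
    so neither the analyst nor the automation has to chase context by hand."""
    asset = ASSET_DB.get(alert.get("src_ip", ""), {})
    hostname = asset.get("hostname", "unknown")
    return {
        **alert,
        "hostname": hostname,
        "asset_role": asset.get("role", "unknown"),
        "asset_owner": asset.get("owner", "unknown"),
        "prior_alerts_30d": ALERT_HISTORY_30D.get(hostname, 0),
    }
```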
We tried many, many types of so-called risk prioritization of alerts, using ever-more-creative algorithms, from the naively idiotic threat x vulnerability x asset value to more elaborate ML-based scoring systems. It sort of helped, but also hurt when people focused on the top 5 alerts out of the 500 they needed to handle. Oops! Alert #112 was “you are so owned!” Prioritization alone is not a solution to alert fatigue.
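For the record, here is roughly what that naive scoring, and its top-N trap, looks like; the field names are assumptions:

```python
def risk_score(alert: dict) -> float:
    """The classic threat x vulnerability x asset value score; each factor is
    assumed to be pre-normalized to 0..1 by some upstream enrichment."""
    return alert["threat"] * alert["vulnerability"] * alert["asset_value"]

def top_alerts(alerts: list[dict], top_n: int = 5) -> list[dict]:
    ranked = sorted(alerts, key=risk_score, reverse=True)
    return ranked[:top_n]  # alert #112 ("you are so owned!") quietly falls off the list
```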
Then there was a period of time when beautiful, hand-crafted, artisanal SOAR playbooks were the promised way to solve alert fatigue.
Meanwhile, some organizations decided that the SIEM system itself was the problem and that they needed to focus on narrow detection systems such as EDR, where alerts are supposedly easier to triage. Initially, there was some promise… and now you can see more and more people complaining about EDR alert fatigue. So narrow-focus tools weren’t the answer either. BTW, as EDR evolved to XDR (whatever that is), this solution “unsolved” itself (hi again, SIEM).
Today, as I’m writing this in 2024, many organizations naively assume that AI will fix it any day now. I bet some 2014 UEBA vendor already promised this 10 years ago…
So:
- Aggregation
- Correlation
- Filtering (input [logs] and output [alerts])
- Alert source tuning
- Context enrichment
- “Risk-based” and other prioritization
- SOAR playbooks
- Narrow detection tools (SIEM -> EDR)
- AI…
Good try, buddy! How do we really solve it?
Slicing the Problem
Now, enough with the whining; on to something useful. I want to start by suggesting that alert fatigue is not one problem. Over the years, I’ve seen several distinct causes of alert fatigue.
To drastically oversimplify:
You may have alert fatigue because a high ratio of your alerts are false positives (or other false alerts), or because they indicate activities that you simply don’t care to see. In other words, bad alerts of type A1 (false) and type A2 (benign / informational / compliance).
A1. FALSE ALERTS
A2. BENIGN ALERTS
You also have alert fatigue when your alerts are not false, but a high ratio of them are particularly fatigue-inducing and hard to triage (it’s not the volume, but the poor information quality of the alert that kills; also bad UX, or, as Allie says, AX). In other words, bad alerts, type B (high fatigue).
NEW: this also applies to malicious (i.e. not benign and not false) alerts where the risk is accepted by the organization (“yes, this student machine always gets malware, no action” kinda thing)
B. HARD TO TRIAGE ALERTS
Finally, there’s the scenario where you have perfectly crafted alerts indicating malicious activities, but your team just isn’t sufficient for the environment you have. In other words, good alerts, but just too many.
C. SIMPLY TOO MANY ALERTS
Naturally, in real life we will have all of these problems blended together: a high ratio of bad alerts AND a high overall volume of alerts AND false alerts that are hard to triage, leading to (duh!) even more fatigue.
Frankly, “false alerts x hard to triage alerts x lots of them = endless pain.” If you are taking 2 hours to tell that the alert is a false positive, I have some bad news for you: this whole SOC thing will never work…
Alert fatigue dimensions (Anton, 2024)
Anything I missed in my coarse-grained diagnosis?
What Can We Do?
Now, I don’t promise to solve the alert fatigue problem with one blog, even a long one. But I do propose a framework for diagnosing the problem that we face and for trying to sort the solutions into more promising and less promising for your situation.
For example, if you are specifically flooded with false positive alerts (e.g. a high-severity alert that triggers on unrelated benign activity), unfortunately the answer is the one you won’t like: you do need to tune. Aggregation, correlation, etc. are not the answer; “fix the bug in your detection code” is. If some alerts are false in bulk, they just should not be produced. If you rely on vendor alerts and your vendor’s alerts suck, change your vendor. Perhaps in the future some AI will tune your detection content based on the alerts for you, but today, sorry buddy, you are doing it…
So the answer here is not to use ever more complicated SOAR playbooks. It is about actually making sure that alerts with high false positive ratios are not produced in the first place.
Huh? “You think, Anton?” Yup, in the case of proper false positives, “fix the detection code” really is the answer (or otherwise tune by limiting which systems are covered by the detection; this of course has tradeoffs…). I cringe a bit since I feel that I am dispensing 2001-style advice here (“tune your NIDS!”), but that does not change the fact that it is the right thing to do. BTW, most clients are just not brutal enough with their vendors in this regard…
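To illustrate what “fix the detection code” means in practice, here is a before/after sketch; the rule, threshold, scanner IP and scoping list are all hypothetical:

```python
# Before: fires on any failed-login burst, including the vulnerability scanner
# that hammers everything nightly (a self-inflicted false positive).
def brute_force_alert_v1(event: dict) -> bool:
    return event["failed_logins"] >= 10

# After: the "bug in the detection code" is fixed by excluding known scanners and,
# optionally, limiting scope to systems where the behavior actually matters.
KNOWN_SCANNERS = {"10.0.9.50"}              # hypothetical vulnerability scanner
IN_SCOPE_HOSTS = {"fin-db-01", "vpn-gw-1"}  # hypothetical high-value subset

def brute_force_alert_v2(event: dict) -> bool:
    if event["src_ip"] in KNOWN_SCANNERS:
        return False
    if event["host"] not in IN_SCOPE_HOSTS:
        return False
    return event["failed_logins"] >= 10
```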
What about the alerts that are not useful, but also not false? In this case, the main solution avenue is enrichment. That is, after you take a good look at the alerts that serve no purpose whatsoever, not even informational, and turn those off, you add enrichment dimensions so that the remaining alerts become more useful and easier to triage.
For example, logging in after hours may not be a useful detection across the entire environment (a classic useless alert, 1996–2024), but it may be great for a subset (or perhaps one system, like a canary or a honeypot). Enriched alerts are also dramatically easier to process via automation (so a SIEM/SOAR tool may do both for you).
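A small sketch of that idea, with hypothetical canary host names: the same detection logic, deliberately scoped to a subset where it is actually high signal:

```python
from datetime import datetime

CANARY_HOSTS = {"honeypot-01", "canary-fileshare"}  # hypothetical decoy systems

def after_hours_login_alert(login: dict) -> bool:
    """Across the whole environment this rule is a classic noise generator;
    scoped to canary/honeypot systems, any hit is worth waking someone up for."""
    hour = datetime.fromisoformat(login["timestamp"]).hour
    after_hours = hour < 6 or hour >= 22
    return after_hours and login["host"] in CANARY_HOSTS
```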
Another scenario involves alerts that, while valid, are exceptionally difficult and painful to triage. This is where, again, enrichment combined with SOAR is the right answer. I remember a story where a SOC analyst had to open tickets with 3 different IT teams to get the missing context, only to conclude (after 2 days, yes, DAYS!) that the alert was indeed an FP.
Another situation is when alerts are hard to triage and cause fatigue because they simply go to the wrong people. The modern federated alerting frameworks, where alerts flow down the pipeline to the correct people, seek to fix this, but somehow few SOC teams have discovered the approach (we make heavy use of it in ASO, of course). For (a very basic) example, routing DLP alerts to data owners instead of the SOC can be more efficient, but this requires careful consideration and planning (not diving into this flooded rathole at this time…)
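A (very) simplified sketch of such routing; the alert classes and queue names are made up:

```python
# Hypothetical routing table: which queue owns which class of alert.
ROUTES = {
    "dlp": "data-owner-queue",       # DLP alerts go to data owners, not the SOC
    "endpoint_malware": "soc-tier1",
    "iam_anomaly": "identity-team",
}

def route(alert: dict) -> str:
    """Federated alerting: send each alert down the pipeline to the team best placed
    to triage it, with the SOC as the fallback rather than the default destination."""
    return ROUTES.get(alert["class"], "soc-tier1")
```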
Naturally, some lessons from other fields where the alerting problem is “more solved” help. In this case, I am thinking of SREs. In our ASO/MSO approach, we have spent lots of time on the relentless drive to automation. “Study what SREs did and implement it in SOC/D&R” is the essence of ASO (here is our class on it). As it relates to the alert fatigue problems we covered, automation (including enrichment) and a rapid feedback loop to fix bad detection content are basically the whole of it. No magic! No heroes!
Finally, I want to make the case for giving more alert triage decisions to machines. A “human-less,” fully automated “AI SOC” is of course utter BS (despite these arguments). However, the near future where AI helps by handling much of the cognitive work of alert triage is coming. This may not always reduce alert volume, but it will likely reduce human fatigue.
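As a hedged sketch of what “machines handle the cognitive grunt work, humans keep the decision” might look like; the summarize callable is a stand-in for whatever model you trust, not a reference to any specific product or API:

```python
from typing import Callable

def ai_assisted_triage(alert: dict, summarize: Callable[[str], str]) -> dict:
    """Attach an AI-generated summary and suggested verdict to the alert;
    the analyst still owns the final call."""
    prompt = (
        "Summarize this security alert, list the context an analyst would need, "
        f"and suggest a likely verdict with reasoning:\n{alert}"
    )
    return {
        **alert,
        "ai_summary": summarize(prompt),
        "final_verdict": "left to the human analyst",
    }
```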
Despite all these efforts, alert fatigue may persist. In some cases, the issue might simply be a lack of adequate staffing and that’s that…
Summary
So, a summary of what to do:
Diagnose the fatigue: Begin by identifying the root cause of your specific alert fatigue. Is it due to false positives, benign alerts, hard-to-triage alerts, or simply an overwhelming volume of alerts? Or wrong people getting the alerts perhaps?
Targeted treatment: Once diagnosed, apply the appropriate solutions based on the symptoms identified:
- False positives: Focus on tuning detection rules, improving alert richness/quality, and potentially changing vendors if necessary.
- Benign alerts: Implement enrichment to add context and make alerts more actionable. Then use SOAR playbooks to route them.
- Hard-to-triage alerts: Utilize enrichment and SOAR playbooks to streamline the triage process. This item has a lot more “it depends”, however, to be fair…
- Hard-to-triage alerts for specific analysts: Start adopting federated alerting for some alert types (e.g. DLP alerts that go to data owners)
If in doubt, focus on developing more automation for signal triage.
Expect some fun AI and UX advances for reducing alert fatigue in the near future.
Wish for some luck, because this won’t solve the problem but it will make it easier.
Share your experience with security alert fatigue and — ideally — how you solved it or made it manageable…
Final thought: Let’s collectively aim for Security Alert Fatigue (1992–202x)
v1.1 11–2024 (more updates likely in the future)
v1.0 11–2024 (updates likely in the future)