The SolarWinds Hack: What Went Wrong With Missing Alarms and How To Fix It

A few days ago, on February 23, the US Senate Intelligence Committee held a hearing with executives from SolarWinds, FireEye, CrowdStrike and Microsoft about the SolarWinds hack.

It’s worth listening to in full, but we want to focus on one particular aspect described by the participants – the malware shutting down endpoint monitoring agents. At 00:49:03 the FireEye CEO, Kevin Mandia, says:

Another thing it did, it looked for nearly 50 different products and shut them down when it ran. So people are like “Why didn’t anybody detect the implant”. Because when it executed, it looked to see if the CrowdStrike/Microsoft/… agent was on the endpoint and shut it off. And you don’t make a backdoor as a bad guy as a regular user, you make one as the root user.

This triggered Senator Gillibrand to ask the logical question (at 01:43:47):

So why was there no alarm and how were they shut down? And related, why were there no alarms in the SolarWinds and anti-virus software logs, which should have shown the unusual behavior, access, or other traces of unauthorized access?

The explanation by CrowdStrike’s CEO, George Kurtz, was similar – that the attackers waited patiently and managed to shut down agents because they (the attackers) were running as root.

It is not clear whether CrowdStrike or any other endpoint protection agent was running within the compromised SolarWinds infrastructure – in fact, CrowdStrike seems to have entered SolarWinds only after the hack, according to news reports. But the idea that “when you are root, you can stop everything and remain undetected” is only part of the picture and we want to address that issue.

It’s been a common problem in the industry that log sources and endpoints sometimes stop sending data, and that this goes unnoticed. That’s a big issue that needs to be tackled. Time and time again, across sectors and geographies, someone discovers months later that data from some source is missing.

Every time a log source stops sending logs, or an endpoint protection client goes silent and doesn’t send keep-alive messages to its server(s), it means something is wrong and an alarm should be triggered on the server that is responsible for aggregating the data.

Missing log data means either a malicious actor is stopping the service, or there is a bug, or there’s a network failure, but it’s always worth investigating immediately.
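
The idea above can be sketched in a few lines. This is a minimal, illustrative example, not any vendor’s actual implementation – the source names and the ten-minute threshold are assumptions for the sake of the sketch:

```python
from datetime import datetime, timedelta

def find_silent_sources(last_seen, now, max_silence=timedelta(minutes=10)):
    """Return every source whose last received event is older than max_silence."""
    return [src for src, ts in last_seen.items() if now - ts > max_silence]

now = datetime(2021, 2, 23, 12, 0)
last_seen = {
    "firewall-01": now - timedelta(minutes=2),  # still reporting
    "endpoint-42": now - timedelta(hours=3),    # went silent -> should alarm
}
print(find_silent_sources(now=now, last_seen=last_seen))  # ['endpoint-42']
```

The aggregating server already has the last-seen timestamp for each source; the whole check is a comparison against a threshold, which is why there is little excuse for not running it by default.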

Why, then, does it happen so often that these issues remain ignored? There are several reasons:

  • The tools don’t support such healthcheck / keep-alive alerting – that’s rarely the case, but there are vendors that haven’t figured out that this is important
  • It’s complicated or non-obvious to configure these alerts – if the product doesn’t create them by default, the option is usually a hidden switch among dozens of configuration settings and is easy to miss. With new log sources and endpoints added constantly, it’s easy to leave at least some of them outside the scope of healthcheck rules. Notification settings matter too – if they are not properly configured, an alert might be generated but nobody is explicitly notified
  • Alert fatigue – if, for various reasons, “missing logs” or “service stopped” alerts happen too often, analysts start to ignore them. Intermittent network issues, rules with too short “fail” periods or other issues may lead to so many false positives that the alerts either get turned off or ignored
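
One common way to fight the alert-fatigue problem in the last bullet is debouncing: only alarm after several consecutive missed keep-alives, so a single network blip doesn’t page anyone. A hedged sketch of that idea (the class name and threshold are illustrative, not a real product’s API):

```python
class KeepAliveMonitor:
    """Raise an alarm only after several consecutive missed keep-alives,
    so an intermittent network blip does not generate a false positive."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # consecutive misses before alarming
        self.missed = {}            # source -> current miss streak

    def record(self, source, received):
        """Record one keep-alive interval; return True if an alarm is due."""
        if received:
            self.missed[source] = 0  # streak broken, agent is alive
            return False
        self.missed[source] = self.missed.get(source, 0) + 1
        return self.missed[source] >= self.threshold

monitor = KeepAliveMonitor(threshold=3)
print(monitor.record("agent-1", received=False))  # False (1 miss)
print(monitor.record("agent-1", received=False))  # False (2 misses)
print(monitor.record("agent-1", received=False))  # True  (3rd miss -> alarm)
```

The trade-off is latency: a threshold of three one-minute intervals means an attacker has roughly three minutes before the alarm fires, which is still far better than never noticing.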

Because there are so many sources and so many endpoints, these healthcheck/keep-alive rules should be created and enabled by default. And it doesn’t matter if it’s endpoint protection, antivirus or SIEM, the idea is the same. We as an industry have built complicated security tools that too often fail to provide simplicity for executing best practices. The machine learning improvements in the threat detection toolkit are great, but if they can’t spot an oddly silent sensor, are they properly chosen and trained?

One approach to alleviate those issues that we have implemented is – regardless of the collection type (agentless or agent-based) – to create a default healthcheck rule for each data (log) source. The healthcheck rule has reasonable defaults which can be tweaked (e.g. if a source/endpoint is supposed to be quiet outside working hours, this can be specified, so that false positives are not generated). Agentless collection has the benefit of having nothing to stop or kill on the endpoint, and stopping a whole collector instance would certainly trigger all sorts of alarms. In addition, for agent-based deployments, we rely on an open-source agent (OSSEC/Wazuh) which by default sends keep-alive messages every minute. Several violations of that expectation trigger an alarm as well.
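
A default rule with a tweakable quiet window, as described above, might look roughly like this. This is a sketch under our own assumptions (parameter names, the ten-minute default, the overnight window), not the actual rule engine of any product:

```python
from datetime import datetime, time, timedelta

def should_alarm(last_seen, now, max_silence=timedelta(minutes=10),
                 quiet_start=None, quiet_end=None):
    """Default healthcheck rule: alarm when a source has been silent longer
    than max_silence, unless 'now' falls inside a configured quiet window
    (e.g. a source that is expected to be quiet outside working hours)."""
    if quiet_start is not None and quiet_end is not None:
        t = now.time()
        if quiet_start > quiet_end:  # window wraps around midnight
            in_quiet = t >= quiet_start or t < quiet_end
        else:
            in_quiet = quiet_start <= t < quiet_end
        if in_quiet:
            return False  # silence is expected here, no false positive
    return now - last_seen > max_silence

# A source configured as quiet between 22:00 and 06:00 stays silent at
# 23:30 without raising an alarm:
print(should_alarm(datetime(2021, 2, 23, 20, 0), datetime(2021, 2, 23, 23, 30),
                   quiet_start=time(22, 0), quiet_end=time(6, 0)))  # False
```

The important part is the default: every new source gets this rule automatically, and the quiet window is an opt-in tweak rather than something the analyst has to remember to configure.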

Is this approach bulletproof? No approach is – sophisticated threat actors can go the extra mile and spoof keep-alive messages (they may reverse-engineer the communication protocol, or in some cases even decompile a binary or dump memory to extract a cryptographic key; difficult, but not impossible). They may copy a sample of the logs for the target endpoints and replay them with minor changes to make things look okay. But remember that healthcheck/keep-alive monitoring is a very basic best practice. Detection cannot be based solely on it, but it should definitely be there all the time.

The fact that experts know some best practices doesn’t automatically make these best practices enforced. We should be building tools that apply these best practices by default, or at least make them very easy to set up and nudge users to do so. Security visibility depends on being able to easily get the most signal (and filter out the noise). And visibility is a key step in preventing breaches.
