Beyond Algorithms: Getting to the Heart of Successful ML Solutions

Within the AI hype bubble, information security vendors sometimes make sensational claims about their technology, often touting unique algorithms smarter than those of rivals. These boasts play on the misperception that algorithms are the only thing that sets successful AI/machine learning solutions apart, which is generally not the case.

In Singapore, AI is well established as a concept, and applications are becoming more widespread. In the public sector, the Minister-in-charge of the Smart Nation Initiative, Vivian Balakrishnan, recently said that the country will “double down” on its efforts to build up the AI sector and equip its workforce to use these tools to “participate meaningfully” in a future where the economy is driven by the technology.
AI is also gaining a firm foothold in the private sector. Mastercard, for example, is investing in AI innovations, and law firm Rajah & Tann has set up a digital arm - Rajah & Tann Technologies (RTT) - that harnesses AI-driven technology to deliver legal services digitally.

The risk is that AI becomes a lazy buzzword, too often conflated with machine learning (ML). It is essential to distinguish between the two terms and to understand their relationship, particularly in the cybersecurity field.

Connecting AI with ML

It is, of course, true that an AI system can process large amounts of real-time, historical, structured, and unstructured data much faster, and in more intensive ways, than humans can. This speed and depth replaces manual effort with the potential to make rapid, accurate decisions based on the training the system has had. But it is ML - a specific component of AI - that is "trained," and that uses that training to make sense of the data. AI builds on this by letting the machine either suggest or take action based on its models and observations.

Imagine, for example, that a malicious user logs in to a network-connected PC with admin rights and immediately runs a tool to search for open file shares across the network. The user then copies several files from a shared volume to a new folder. Next, the user begins sending these files to a previously unused FTP server. This could be perfectly reasonable activity, or it could be a signal that credentials have been compromised and a data breach is underway.

In this scenario, each of these steps might only be noticed in hindsight after separate alerts and examination of the associated log files. Also, each step might take place days apart and may not even be correlated as a sequence. In fact, some of these actions are unlikely to be captured within logs or by agents.

And as Jisc noted, under the General Data Protection Regulation, the justification for retaining logs is that the benefits to users of detection outweigh the risks that retention creates for them. But the longer an intruder stays in your systems undetected, the less benefit a detailed investigation will deliver. Once the intruder has taken all the data accessible from a system, backdoored it to gain full and persistent control, and modified applications and logs to conceal these activities, the value of all those log files is pretty small.

A machine learning solution could spot and recognize the risk of each of these events, generate a priority alert, and potentially quarantine that PC from the rest of the network automatically. For this automated process to be effective and permitted in the risk-aware culture of security, the solution needs a high level of confidence that this is, in fact, an attack and not just a real admin going about legitimate duties. Comparing the behavior against peer-group behavior, roles, and rights can provide secondary confirmation. These are tasks a machine can do well at a speed and scale humans cannot match.
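The correlation and confidence logic described above can be sketched in a few lines. This is an illustrative toy, not any vendor's implementation: the event names, weights, and quarantine threshold are all invented assumptions.

```python
# Illustrative sketch only: event names, weights, and the quarantine
# threshold are invented for this example, not a real product's API.

EVENT_WEIGHTS = {
    "file_share_scan": 0.3,   # tool searching for open shares
    "bulk_file_copy": 0.3,    # copying files from a shared volume
    "new_ftp_upload": 0.4,    # sending data to a previously unused FTP server
}

def risk_score(events):
    """Sum the weights of observed suspicious events, capped at 1.0."""
    return min(1.0, sum(EVENT_WEIGHTS.get(e, 0.0) for e in events))

def peer_group_factor(user_events, peer_events):
    """Down-weight behavior that is common in the user's peer group.
    Returns the fraction of the user's events that peers do NOT perform."""
    if not user_events:
        return 0.0
    unusual = [e for e in user_events if e not in peer_events]
    return len(unusual) / len(user_events)

def should_quarantine(user_events, peer_events, threshold=0.5):
    """Quarantine only when confidence is high: raw risk scaled by how
    unusual the activity is relative to the peer group."""
    confidence = risk_score(user_events) * peer_group_factor(user_events, peer_events)
    return confidence >= threshold

# An admin whose peers also scan shares, but never upload to new FTP servers:
events = ["file_share_scan", "bulk_file_copy", "new_ftp_upload"]
peers = {"file_share_scan"}
print(should_quarantine(events, peers))  # -> True
```

Here the peer-group comparison provides the secondary confirmation: scanning shares alone never crosses the threshold, but the full sequence, combined with behavior no peer exhibits, does.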

High-Quality Data is Crucial for Accuracy

ML models evolve over time, based on what they observe or how they are trained. Used on authoritative data sets, ML helps prioritize those indicators that are materially interesting and automate aspects of the investigation that slow and complicate the security operations center (SOC). The critical enabler for success is that the data available for building the ML model must be diverse, contextual, timely, and reliable.

Having more high-quality data for the machine learning to analyze allows the AI to make better judgments. High-quality data starts with accurate information about users, devices, systems on the network, and workflow patterns. So, in this example, if the models had been fed network device discovery information that made the AI aware the ‘PC’ the malicious user logged in from is actually a print server, then tasks other than managing print jobs would be considered highly suspicious.
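As a hypothetical illustration of how such device context could feed a model - the role table and task names below are invented, not any real discovery feed's schema:

```python
# Hypothetical device-role context: the roles and expected-task lists
# are invented sample data standing in for a network discovery feed.

EXPECTED_TASKS = {
    "print_server": {"print_job"},
    "workstation": {"print_job", "file_copy", "web_browsing"},
}

def anomaly_weight(device_role, observed_task):
    """Tasks outside a device's expected role score as highly suspicious;
    unknown roles get no expected tasks and so score high by default."""
    expected = EXPECTED_TASKS.get(device_role, set())
    return 0.1 if observed_task in expected else 0.9

print(anomaly_weight("print_server", "ftp_upload"))  # off-role -> 0.9
print(anomaly_weight("workstation", "file_copy"))    # in-role  -> 0.1
```

The point is not the weights themselves but that the same observed task yields a very different verdict once the model knows what the device is for.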

Historical Patterns Valuable for Training

A historical understanding of user and device behavior, along with real-time access to current network activity, is also beneficial in training the underlying models. For example, if the "admin" account in our scenario had always logged in between 9 a.m. and 10 a.m. and logged out mostly between 6 p.m. and 7 p.m., but this activity was taking place at 10 p.m., the break from the established pattern could also raise a red flag. Or if this admin had never previously used FTP or had any interaction with this file server - again, red flags aplenty.
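The time-of-day check described above can be sketched as follows; the historical hours and the tolerance window are invented sample values, not a trained model.

```python
# Minimal sketch of a time-of-day baseline: the history and tolerance
# below are invented sample values, not output from a real ML model.

def outside_baseline(hour, baseline_hours, tolerance=1):
    """Flag activity more than `tolerance` hours outside the range of
    hours historically observed for this account."""
    lo, hi = min(baseline_hours), max(baseline_hours)
    return hour < lo - tolerance or hour > hi + tolerance

# Admin historically active between 9 a.m. and 7 p.m. (hours 9..19):
history = [9, 10, 12, 15, 18, 19]
print(outside_baseline(22, history))  # 10 p.m. activity -> True
print(outside_baseline(10, history))  # normal morning   -> False
```

A production model would learn a distribution per account rather than a hard range, but the principle is the same: deviation from an established pattern raises the score.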

To gain these insights, you need both a broad array of baseline data and a constant flow of real-time information beyond what is available from historical logs, agents, or scheduled scans. The last point is particularly critical, as the next generation of IoT devices often has neither logs nor agents, and application owners really don’t want agents or scans running on their finely tuned systems.

Avoiding Garbage In

What does this all mean in terms of practical application? Garbage in, garbage out. Quality in, quality out. Security operations now include many data and behavioral analytics applications, running alongside security information and event management (SIEM) software.

All analytics will be more effective when provided with rich, high-fidelity sources of data. ML liberates analytics from laborious rules maintenance, permitting higher-resolution models that can find real-time event correlations to identify anomalies and predict and prevent security issues. The more relevant the data, the better - not to drown the system, as alerts do today, but to facilitate the training and optimization of the system for maximum ongoing accuracy and confidence.

This is, of course, a broad summary of the issue. But the next time somebody tries to convince you that it’s all about the algorithm, remember: it’s almost certainly the data that holds the key.

Jeff Costlow, chief information security officer at ExtraHop, contributed this article.

The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends.