Finding the Elusive Outlier

Finding outliers motivates data scientists and analytics-minded digital leaders. Outliers are crucial when training algorithms and building the right AI models.

But finding outliers depends on many factors. The most important is the quality of the labeled data: a good data set helps companies train their algorithms into good AI models, which they can then use to discover abnormal behavior.

High-quality labeled data is also essential to address bias and erroneous conclusions up front. It also aids explainability: understanding how an AI model arrived at its conclusion.

For many data scientists, large, high-quality labeled data sets are wishful thinking. They are expensive to start with, and many companies do not have ready access to them.

“If you do not have a lot of labeled data, how do you find outliers in multivariate data? This is a general problem,” said Michael O’Connell, chief analytics officer at TIBCO Software, during the recent TIBCO Now conference.

One way to tackle this problem is to look for AI models that are transferable.

O'Connell pointed to fraud detection as a good example. "This applies to different industries. There are similarities in the modeling part, where you are looking for outliers," he said.

The same model used for identifying money laundering can find faults during manufacturing. The data is different, and the actions will vary. But the models are similar, and the outcomes are well-known.

In fact, this principle fueled the rise of TIBCO’s Spotfire. “It was about ‘spotting the fire.’ We used Spotfire to identify outliers. We saw an opportunity as a lot of companies did not do this well,” said O’Connell.

For example, TIBCO Spotfire allows two visualizations to be "brush linked," so a selection made in one chart highlights the same records in the other. "You do not have to do this manually. It is why users see Spotfire as a data discovery platform rather than a reporting platform. We are about insights and not about pretty pictures,” said O’Connell.

What happens when you have no labeled data? This is where unsupervised learning comes in. While supervised learning assumes labeled data, unsupervised learning assumes there is little or none.

It is still early days. But O’Connell saw unsupervised learning benefiting from advances in neural networks and autoencoders.

“We started off in this space doing principal component analysis (PCA),” he explained.

PCA uses an orthogonal transformation to convert a set of possibly correlated observations into a set of uncorrelated variables called principal components. But it is a linear process.
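To make the PCA approach concrete, here is a minimal sketch (the synthetic data and the reconstruction-error scoring are illustrative assumptions, not TIBCO's implementation): project the data onto its top principal component and treat points with large reconstruction error as outlier candidates.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 correlated 2-D points, plus one injected outlier as the last row
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
X = np.vstack([X, [[10.0, -10.0]]])

Xc = X - X.mean(axis=0)                   # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1                                     # keep the top-k principal components
X_hat = Xc @ Vt[:k].T @ Vt[:k]            # project onto components, reconstruct
err = np.linalg.norm(Xc - X_hat, axis=1)  # reconstruction error per point

print(int(err.argmax()))                  # index of the most outlying point
```

Points that lie along the dominant linear direction reconstruct well; the injected point, which sits off that direction, scores highest.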

Autoencoders are not, which is why TIBCO is adding support for them. Often seen as repurposed feedforward neural networks, they are trained with gradient-based optimization to compress the input into fewer dimensions and reconstruct it.
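As a toy, from-scratch sketch of the idea (the data, network size, and training settings are all illustrative assumptions, not TIBCO's implementation), the autoencoder below squeezes 3-D points that lie near a 1-D curve through a one-unit tanh bottleneck and scores each point by its reconstruction error; off-manifold points tend to score high.

```python
import numpy as np

rng = np.random.default_rng(1)
# points near a 1-D curve embedded in 3-D, plus one off-manifold point
t = rng.uniform(-2, 2, size=(300, 1))
X = np.hstack([t, np.sin(t), t**2]) + 0.05 * rng.normal(size=(300, 3))
X = np.vstack([X, [[0.0, 6.0, -6.0]]])

# feedforward autoencoder: 3 inputs -> 1 tanh unit -> 3 outputs
W1 = 0.1 * rng.normal(size=(3, 1)); b1 = np.zeros(1)
W2 = 0.1 * rng.normal(size=(1, 3)); b2 = np.zeros(3)

def reconstruct(X):
    return np.tanh(X @ W1 + b1) @ W2 + b2

err0 = ((X - reconstruct(X)) ** 2).sum(axis=1)   # error before training

lr = 0.01
for _ in range(2000):                   # plain gradient descent on squared error
    H = np.tanh(X @ W1 + b1)            # encode
    X_hat = H @ W2 + b2                 # decode
    G = 2.0 * (X_hat - X) / len(X)      # gradient of mean squared error
    gW2 = H.T @ G; gb2 = G.sum(axis=0)
    GH = (G @ W2.T) * (1.0 - H ** 2)    # backpropagate through tanh
    gW1 = X.T @ GH; gb1 = GH.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

err = ((X - reconstruct(X)) ** 2).sum(axis=1)    # per-point outlier score
```

Because the bottleneck is nonlinear, the network can track a curved manifold that a linear method like PCA would have to approximate with a straight line.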

In reality, most companies find themselves somewhere in between. Their data scientists have access to cheaper, low-quality labeled data and lots of unlabeled data. In some cases, the data may be “noisy or biased.”

Enter weak supervision. “And at TIBCO, we like weak supervision,” said O’Connell.

Weak supervision allows companies to use low-quality labeled data from nonindustry experts. These labels are cheaper and more numerous. It also allows them to use unlabeled data more effectively. Often a human can review the output, decide the final actions, and retrain the model.

“So, you are gradually building out your labeled data set,” said O’Connell.

Humans will still be vital for finding outliers. Whether you use fully supervised, weakly supervised, or unsupervised training, their input counts, said O’Connell.

“In fact, in a lot of applications, like money laundering, you want to have a case manager. The wrong decision can impact livelihoods. Yes, in some instances [like managing manufacturing equipment], you can automate. But it takes a bit of an act of faith and a lot of proving to yourself to automate. Oftentimes, you will still want someone involved,” he added.