Avoiding Common Mistakes in Machine Learning

Mistakes can occur when implementing machine learning. Unlike programming or arithmetic errors, the inherent nature of ML models can make resulting problems or bias difficult to detect.

To address this, Michael Lones, an Associate Professor at Heriot-Watt University, wrote a paper that walks through the five stages of the machine learning process, outlining how to build models reliably, evaluate them, compare them fairly, and report the results correctly.

Though pitched at academic researchers dabbling in ML, it offers fledgling data scientists and business managers dipping their toes into AI-centric solutions an excellent overview of the most important considerations when working with ML, including the challenges and the areas where missteps are likely.

Start with quality data

It is essential to ensure that data comes from a reliable source and has been collected using a reliable methodology, says Lones. Training a model on bad data will most likely produce a bad model (popularly known as “garbage in, garbage out”). Guarding against this might require some exploratory data analysis and weeding out missing or inconsistent records before training a model.
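As a rough illustration of weeding out missing or inconsistent records, here is a minimal sketch in plain Python. The field names, the sample records, and the sanity thresholds are all made up for the example; real checks would come from knowledge of how the data was collected.

```python
# Hypothetical raw records; field names and values are invented for illustration.
records = [
    {"age": 34, "income": 52000, "label": 1},
    {"age": None, "income": 48000, "label": 0},   # missing value
    {"age": 29, "income": -100, "label": 0},      # inconsistent: negative income
    {"age": 41, "income": 61000, "label": 1},
    {"age": 230, "income": 39000, "label": 0},    # inconsistent: implausible age
]


def is_clean(record):
    """Return True if no field is missing and the basic sanity checks pass."""
    if any(value is None for value in record.values()):
        return False
    # Assumed plausibility ranges; adjust to the data set at hand.
    return 0 <= record["age"] <= 120 and record["income"] >= 0


clean = [r for r in records if is_clean(r)]
dropped = len(records) - len(clean)
print(f"kept {len(clean)} of {len(records)} records, dropped {dropped}")
```

Counting how many records were dropped, and why, is itself a useful piece of exploratory analysis: a large number may point to a problem with the source rather than with individual rows.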

It makes sense to ensure you have enough data before training, though how much is enough can be tricky to ascertain. According to Lones, this depends on the “signal to noise ratio” of the data set and might not be apparent until one starts building models. Inadequate data might limit the complexity of the ML models that can be used, for example by ruling out deep neural networks, which have many parameters and so need a lot of data to train.
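To get a feel for how quickly the parameter count grows with depth, the sketch below counts the weights and biases in a fully connected network. The layer sizes are arbitrary choices for illustration, and the count says nothing on its own about how much data is actually enough.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    Each layer contributes (inputs * outputs) weights plus one bias per output.
    """
    pairs = zip(layer_sizes, layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)


# A linear model on 20 features versus a network with three hidden layers.
shallow = dense_parameter_count([20, 1])
deep = dense_parameter_count([20, 64, 64, 64, 1])
print(shallow, deep)  # 21 9729
```

With only a few hundred records, the second model has far more parameters than data points, which is one reason a simpler model may be the safer choice.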

Speak with experts

When solving problems in a particular field using ML, don’t forget to approach the experts working in that field. Domain experts can help one understand the data and offer input on features that are likely to be predictive, notes Lones. They can also help identify useful problems to solve, a consideration that applies equally to shortlisting the most pertinent business problems to work on.

Use the right model

While it can be fun to experiment with multiple approaches to see “what sticks”, this can result in a disorganized mess of experiments that is hard to justify. An organized approach is ideal, with proper optimization of hyperparameters, which are the numbers or settings that affect the configuration of the model.
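One way to keep such experiments organized is to fix the list of candidate settings up front, score each on a validation split, and record every result. The sketch below does this for a toy nearest-neighbour classifier written in plain Python. The synthetic data, the split sizes, and the candidate values of k are all assumptions made for the example, not something taken from the paper.

```python
import random

random.seed(0)  # fixed seed so the experiment can be repeated

# Synthetic 1-D data: class 1 tends to have larger x. Purely illustrative.
data = [(random.gauss(label * 2.0, 1.0), label) for label in [0, 1] * 100]
random.shuffle(data)

# Three disjoint splits: train to fit, validation to tune, test for the final report.
train, validation, test = data[:120], data[120:160], data[160:]


def knn_predict(train_set, x, k):
    """Predict by majority vote among the k training points nearest to x."""
    neighbours = sorted(train_set, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0


def accuracy(train_set, eval_set, k):
    """Fraction of eval_set points that are classified correctly."""
    correct = sum(knn_predict(train_set, x, k) == y for x, y in eval_set)
    return correct / len(eval_set)


# Record every configuration tried, not just the winner.
candidates = [1, 3, 5, 9, 15]  # odd values avoid tied votes
results = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(results, key=results.get)

# The test set is used once, after the hyperparameter has been fixed.
test_accuracy = accuracy(train, test, best_k)
print(results, best_k, test_accuracy)
```

Keeping the full `results` table makes it possible to justify the final choice later, rather than reporting only the setting that happened to score best.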

One common problem is allowing test data to leak into the configuration, training, or selection of models, writes Lones. (This is a common mistake that medical researchers working on the use of AI in medical imaging have made.) In his words: “leakage of information from the test set into the training process is a common reason why ML models fail to generalize.”
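Leakage can be subtle. One form, chosen here purely for illustration, is computing preprocessing statistics over the whole data set before splitting it. The sketch below uses made-up feature values and fits the scaling parameters on the training portion only, then reuses them for the test portion.

```python
import statistics

# Hypothetical feature values; the split is assumed to have been made already.
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [10.0, 12.0]

# Fit the scaling parameters on the training data only.
mean = statistics.mean(train_values)
std = statistics.pstdev(train_values)


def standardize(values, mean, std):
    """Apply scaling parameters that were estimated elsewhere."""
    return [(v - mean) / std for v in values]


train_scaled = standardize(train_values, mean, std)
test_scaled = standardize(test_values, mean, std)  # reuses the training statistics

# The leaky version would compute the mean over train and test together,
# letting test information shape what the model sees during training.
leaky_mean = statistics.mean(train_values + test_values)
print(mean, leaky_mean)  # 5.0 7.0
```

The gap between the two means shows how the test values would otherwise shift the inputs the model is trained on.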

In closing

It is worth noting that a higher accuracy figure does not necessarily imply a better model. Lones notes that the difference might be due simply to the use of different data sets or a different hyperparameter configuration.
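A small sketch makes the point about data sets concrete. The same fixed classifier is scored on two made-up test sets, one with well-separated classes and one with overlapping classes. The model, the threshold, and the data are all hypothetical.

```python
def threshold_classifier(x):
    """A fixed, hypothetical model: predict class 1 when x exceeds 0.5."""
    return 1 if x > 0.5 else 0


def accuracy(model, dataset):
    """Fraction of (x, y) pairs in dataset that the model labels correctly."""
    return sum(model(x) == y for x, y in dataset) / len(dataset)


# Two invented test sets: one easy, one where the classes overlap.
easy_set = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
hard_set = [(0.4, 0), (0.6, 0), (0.45, 1), (0.7, 1)]

easy_acc = accuracy(threshold_classifier, easy_set)
hard_acc = accuracy(threshold_classifier, hard_set)
print(easy_acc, hard_acc)  # 1.0 0.5
```

The model is identical in both cases, so the gap in accuracy reflects the data alone. Comparing two different models is only meaningful when both are scored on the same data.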

Ultimately, the ML field continues to evolve, with new tools and increasingly powerful AI processing capabilities that can be accessed in the cloud or deployed on-premises. Undoubtedly, the growing adoption of AI will result in new types of mistakes that call for new strategies to guard against them.

“You have to approach ML in much the same way you would any other aspect of research: with an open mind, a willingness to keep up with recent developments, and the humility to accept you don’t know everything,” he summed up.

You can access “How to avoid machine learning pitfalls: a guide for academic researchers” here (pdf).

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: iStockphoto/mkitina4