To Win Big at Data, Start Small

Hundreds of AI tools have been built to catch COVID, according to a report in MIT Technology Review. But in what sounds like a cautionary tale against blind reliance on AI, the report noted that none of them made a real difference, and some were potentially harmful.

The assertion that AI tools made little impact in the global fight against COVID was drawn from various studies and reviews, including a report (pdf) from the Alan Turing Institute, the UK's national institute for data science and AI.

What went wrong, exactly?

The problem in the data

The AI-based tools built to fight COVID were broad and diverse. Some sought to predict how sick patients might get, while others attempted to determine patient risk from medical images such as X-rays and CT scans.

Yet the conclusion was as clear as it was startling: none was fit for clinical use, and just two were found promising enough to warrant further testing.

This was not for lack of effort; rather, it boils down to rudimentary errors in how the AI models were trained or tested, says Laure Wynants, an epidemiologist and the lead author of one such review study published in the British Medical Journal.

In a nutshell, trained models did not perform as claimed because of the poor quality of the underlying data and incorrect assumptions made about it. For instance, data were spliced together from multiple sources, and duplicate records crept in. The latter meant some models ended up being tested against the very data they were trained on, making them appear more accurate than they really were.
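The leakage problem described above is mechanical: if duplicate records survive into both halves of a train/test split, evaluation scores are inflated because the model has already "seen" part of its test set. A minimal sketch (toy data with hypothetical patient IDs, not from any real study) of de-duplicating before splitting:

```python
import random

def dedup_then_split(records, test_frac=0.2, seed=0):
    """Remove exact duplicates before splitting, so no record can
    appear in both the training set and the test set."""
    unique = list(dict.fromkeys(records))  # order-preserving dedup
    rng = random.Random(seed)
    rng.shuffle(unique)
    cut = int(len(unique) * (1 - test_frac))
    return unique[:cut], unique[cut:]

# A toy data set spliced from two sources, with overlapping records.
records = ["p1", "p2", "p3", "p4", "p5", "p1", "p2"]
train, test = dedup_then_split(records)

# No record leaks across the split.
assert not set(train) & set(test)
```

Splitting first and de-duplicating later (or not at all) is the failure mode the reviewers flagged: the same patient's scan can then sit on both sides of the split.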

Other problems included scans of children being treated as scans of adults, and bias introduced merely by patient positioning: because the severely ill were far more likely to be scanned lying down, models could learn to associate the position itself with serious illness. Finally, the origins of some data sets were murky, yet they were used nonetheless in the rush to fight the pandemic.
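The positioning bias is a textbook spurious correlation, and it is easy to demonstrate with synthetic data. In the toy simulation below (invented numbers, chosen only for illustration), a "model" that ignores the scan entirely and keys on patient position alone still scores around 90%, without learning anything about the disease:

```python
import random

random.seed(0)

# Toy data: severely ill patients were mostly scanned lying down,
# so "position" is spuriously correlated with the label.
def make_patient(severe):
    p_supine = 0.9 if severe else 0.1
    position = "supine" if random.random() < p_supine else "upright"
    return {"position": position, "severe": severe}

patients = [make_patient(severe=(i % 2 == 0)) for i in range(1000)]

# A trivial rule that predicts "severe" whenever the patient is supine.
correct = sum((p["position"] == "supine") == p["severe"] for p in patients)
accuracy = correct / len(patients)
print(round(accuracy, 2))  # roughly 0.9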

AI for the rest of us

The study was not a condemnation of what AI can do, but it casts a spotlight on how mundane yet vital stages of the machine learning lifecycle, such as data preparation, are not adequately emphasized. This is why good data means so much to data scientists, and why the global data analytics market is projected to grow at a CAGR of 25% from 2021 to 2030.

This brings us to the crux of the problem: the inability to access the right data is also the reason AI isn't more broadly deployed today.

What's worse, success stories told by the top echelon of data companies can "drown out" innovation from smaller-scale data teams. This was observed by Shaun McGirr, AI evangelist at Dataiku, in a contributed blog post titled "Beware the 1% view of data science".

Though much has been written by top practitioners at big tech firms such as Google and Facebook, their approaches depend on effectively "infinite" resources. After all, the top tech firms don't face the legacy IT systems, siloed data backends, and dearth of data experts that the remaining 99% contend with daily.

“Access to the right data, in a reasonable time frame, is still a top barrier to success for most data scientists in traditional companies,” McGirr noted.

Massive data sets not needed for success

So, a huge volume of good data equals AI success, right?

The answer is "no", according to Andrew Ng, co-founder of Google Brain, the tech giant's AI and deep learning team. Speaking in an interview with Fortune, Ng said massive data sets are not essential for AI innovation.

Ng argues that the vast data sets wielded by the top tech firms are of limited use outside consumer Internet companies. The next frontier of AI, he says, is to build algorithms that work with much smaller data sets.

Take a system designed to recognize scratched or defective smartphones on an assembly line: as Ng pointed out, it would be incongruous to expect a million examples of scratched or damaged smartphones to train it with.

The focus on smaller data sets means the quality of data will come to matter more than quantity. The bottom line? Data science can deliver substantial value to organizations without headline-grabbing breakthroughs, and AI's ability to help traditional industries will be far greater than its value in consumer-centric endeavors such as improving the user experience on Netflix or Spotify.

Finally, McGirr offers a glimmer of hope for the rest of us. Organizations don't need to be Facebook to start innovative and advanced data science or AI projects, he says. Instead, citizen data scientists working on scores of small projects can achieve a cumulative victory on a massive front, allowing them to win big at data by starting small.

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].​

Image credit: iStockphoto/PongsakornJun