Drain Your Data Swamp First

Image credit: iStockphoto/Nastco

It is time for companies to “drain their data swamp” before embarking on any AI projects. Otherwise, they are setting themselves up for failure with data that has not been adequately prepared.

This consensus view emerged from a panel of industry participants and solution providers that discussed implementing AI at the recent NextGen Connectivity Forum.


Michael Lang, solutions architecture manager at NVIDIA, said he had seen many companies set to embark on their AI journeys, only to be frustrated by having the data for training in the wrong format.

“As well as data sovereignty and designing the right infrastructure, data preparation is the first step,” said Lang.

“We have found that with many organizations, their data is disaggregated or is just not in the right state, and when you deal with disaggregated data sources, it can be very hard and challenging to get them in the right format.”

Lang said that, with this challenge in mind, many companies were beginning their AI journeys by buying “off the shelf” solutions with the training already done.

“This might not tick all the boxes, but it gets you going faster, and that can be good,” he said.

“So, the issue is, do you want to get going and achieve some of your goals, or take longer and achieve them all? It’s often a cost to value consideration.”

Lang said many companies were taking a “hybrid” approach to AI, beginning with an off-the-shelf solution and then developing their own proprietary IP on top of it.

“People think about AI as one thing, but you can have multiple stacks of AI,” said Lang. “So, you can add layers and have one or two things that are unique.”


Eric Hui, the director of IoT business development for Asia Pacific at Equinix, said he had observed an evolution in the type of data used for AI projects.

He described three data types: event-driven, which is close to real-time data; interactive; and batch.

“The trend for data is that it’s starting to move from reactive to proactive,” said Hui. “And at the same time, you want to derive intelligence from data, and data is moving from pattern to more predictive.”

The business use for the data was also a determining factor for decisions around architecture, said Hui.

The need for data to be captured and analyzed fast, even in real time, demands low latency, which in turn requires dedicated physical infrastructure to handle the data.


A third panelist, Laurence Liew – the director of AI innovation and makerspace at AI Singapore – outlined his organization’s work with Singapore companies.

AI Singapore had so far engaged with around 500 companies and approved nearly 70 AI projects.

“Of those, half of them have data in a state where we need to do additional intervention to get them ready for machine learning,” said Liew.

“The data can be dirty and inoperable and not machine-readable, or it can be in some database where it can’t be easily integrated. So, this preparation of data is the most time-intensive and boring, but also the most important.”

Training of algorithms with data sets is the “easy part,” said Liew, and there are automated tools that can enable this “with a few clicks.”

But still, the data needs to be ready first.

Lachlan Colquhoun is the Australia and New Zealand correspondent for CDOTrends and HR&DigitalTrends, and the editor of NextGen Connectivity. His fascination is with how businesses are reinventing themselves through digital technology and collaborating with others to become completely new organizations. You can reach him at [email protected].
