The Data Conundrum of AI
- By Paul Mah
- March 08, 2023
Trained on vast amounts of data to learn patterns and make complex decisions, large language models such as ChatGPT promise to deliver stunning new capabilities and enable new possibilities for innovation.
But while the potential benefits of AI are indisputable, important ethical, legal, and social questions remain unaddressed. As AI evolves and its use becomes ubiquitous, what are some of the disruptive impacts and risks that we should be aware of?
Good data, bad data
For one, AI experts are already warning of the risk of data-poisoning attacks on the datasets used to train the deep-learning models behind AI services, which are often pulled from Internet websites or crowd-sourced knowledge repositories such as Wikipedia.
The concern revolves around attackers tampering with publicly available data in such a way as to affect the decisions AI models make after being trained on it. And the repercussions can be dire: Imagine a driverless car fooled into driving through stop signs without slowing, an AI-enabled search leading to a malware-infested site, or an AI-powered security system manipulated into letting the wrong person in.
To be clear, there is no evidence of such attacks in the real world yet. However, experts warn that even very small amounts of “adversarial noise” in training sets can introduce targeted mistakes in the behavior of the AI model.
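The mechanics can be illustrated with a toy example. The sketch below is purely hypothetical: it uses a simple 1-nearest-neighbor classifier on synthetic 2-D data (not any real AI system or documented attack) to show how injecting a handful of mislabeled points near a chosen input can flip the model's answer for that one input while leaving the rest of the training set almost untouched.

```python
# Hypothetical sketch of a targeted label-flip poisoning attack on a
# toy 1-nearest-neighbor classifier trained on synthetic 2-D points.
import random

random.seed(0)

def make_data(n=200):
    # Class 0 clusters around (0, 0); class 1 clusters around (4, 4).
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        centre = 0.0 if label == 0 else 4.0
        point = (centre + random.gauss(0, 1), centre + random.gauss(0, 1))
        data.append((point, label))
    return data

def predict(data, point):
    # 1-nearest-neighbor: return the label of the closest training point.
    def dist2(p):
        return (p[0] - point[0]) ** 2 + (p[1] - point[1]) ** 2
    return min(data, key=lambda rec: dist2(rec[0]))[1]

clean = make_data()
target = (1.0, 1.0)  # an input the clean model assigns to class 0

# Poison: inject just five mislabeled points (2.5% of the data) placed
# right next to the target, flipping the model's answer for this input.
offsets = [(0.01, 0.0), (-0.01, 0.0), (0.0, 0.01), (0.0, -0.01), (0.01, 0.01)]
poison = [((target[0] + dx, target[1] + dy), 1) for dx, dy in offsets]

print(predict(clean, target))           # clean model: class 0
print(predict(clean + poison, target))  # poisoned model: class 1
```

Real poisoning attacks on deep-learning pipelines are far subtler than this, but the principle is the same: a tiny, targeted fraction of bad data changes a specific decision without noticeably degrading overall behavior.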
More worryingly, the opaqueness of current AI models means that such poisoning is next to impossible to detect. What’s more, a ZDNet article raised the possibility of contaminating a Wikipedia page just before a web scrape is done, ensuring that the malicious content lives perpetually within the data repository of the organization training the AI model.
The start of the end of sharing?
When it comes to ChatGPT, one elephant in the room we often ignore is that OpenAI never sought permission to use the data it scraped to train its AI model. I am hardly famous, but a query about myself on ChatGPT brought up the following response.
As with all responses generated by ChatGPT, it consists partly of truth (the personal profile I wrote), half-truths (wrong assumptions based on some of the work I did), and outright fabrications (AI “hallucination”). Since ChatGPT was trained on data from 2021 and earlier and has no live Internet access, at least some of the things I wrote must have been used to train it.
Ironically, I recently came across AI-powered writing services that promise to keep the data of their users segregated and private. This is touted as a feature to reassure business users, and it allows the service to charge a premium. Yet these services would not exist without access to hundreds of gigabytes of publicly accessible data, which presumably includes the content of your web pages, company annual reports, and other freely available material.
Still, the fact remains that everyone wants a private copy of ChatGPT without having to share their own data. As AI use becomes more pervasive, are we headed down a slippery slope where less information is shared, and new barriers are built to guard against unauthorized use and web scraping?
The road ahead
It is also worth noting how bad the technology behind ChatGPT is at helping data scientists with data manipulation, despite being built on data. It apparently works fine on a very small scale, such as a single record, but quickly starts making things up or modifying correct data.
Moreover, it is too slow and merely trades “one form of manual labor for another”, says independent data journalist Brandon Roberts, who documented his one-week experience using ChatGPT to extract information from PDF files.
What will the future bring? One report in Time painted an ominous picture of growth at the expense of safety. “As companies hurry to improve the tech and profit from the boom, research about keeping these tools safe is taking a back seat. In a winner-takes-all battle for power, Big Tech and their venture-capitalist backers risk repeating past mistakes, including social media’s cardinal sin: prioritizing growth over safety.”
For others, AI is just another tool that will slowly influence how we work, but one that will take years to make itself felt.
In an article, David Karpf, an associate professor at George Washington University, explained: “Institutions, over time, adapt to new technologies. New technologies are incorporated into large, complex social systems. Every revolutionary new technology changes and is changed by the existing social system; it is not an immutable force of nature. The shape of these revenue models will not be clear for years, and we collectively have the agency to influence how it develops.”
Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].
Image credit: iStockphoto/diegograndi