Cited more than 2,500 times in research literature over the last 17 years, the open-source CoNLL-2003 dataset is one of the most widely used datasets for building NLP (Natural Language Processing) systems.
And according to a new report, inherent bias in CoNLL-2003 might have influenced an entire generation of ML models trained on the dataset, along with the other datasets benchmarked against it.
First, the context. The dataset was created in 2003 from Reuters newswire articles and annotated by hand as part of a supervised machine learning effort. Unsurprisingly, bias wasn’t something the researchers gave much thought to at the time.
According to a recent experiment by data annotation firm Scale AI, shared with OneZero, the roughly 20,000 newswire sentences in the dataset were biased in at least one respect: they contained many more men’s names than women’s names.
“Using its own labeling pipeline – the process and tech used to teach humans to classify data that’ll then be used to train an algorithm – Scale AI found that, by the company’s own categorization, male names were mentioned almost five times more than female names in CoNLL-2003,” noted the report.
When Scale AI tested a model trained on CoNLL-2003 against a set of names, the model was 5 percent more likely to miss a woman’s name than a man’s name. By the same logic, it would also have trouble recognizing the names of minorities and immigrants, groups that were not regularly covered in the news two decades ago.
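The kind of audit Scale AI ran can be approximated with a simple count over the corpus. The sketch below is illustrative only: the gendered name lists, the sample data, and the assumption that the first token of a `PER` span is a first name are all stand-ins, not Scale AI’s actual methodology or lists.

```python
# Hypothetical name lists for illustration -- a real audit would use
# large, curated lists, not these tiny stand-ins.
MALE_NAMES = {"john", "michael", "david"}
FEMALE_NAMES = {"mary", "susan", "linda"}

def count_gendered_person_names(conll_lines):
    """Tally PER entities in CoNLL-style data by gendered first name.

    Each line is 'TOKEN ... NER_TAG'; blank lines separate sentences.
    Only the first token of each PER span is checked, on the (rough)
    assumption that it is usually a first name.
    """
    counts = {"male": 0, "female": 0, "unknown": 0}
    for line in conll_lines:
        parts = line.split()
        if not parts:
            continue  # sentence boundary
        token, tag = parts[0], parts[-1]
        if tag == "B-PER":  # start of a person entity
            name = token.lower()
            if name in MALE_NAMES:
                counts["male"] += 1
            elif name in FEMALE_NAMES:
                counts["female"] += 1
            else:
                counts["unknown"] += 1
    return counts

sample = [
    "John B-PER", "Smith I-PER", "met O", "Mary B-PER", ". O",
    "",
    "David B-PER", "and O", "Michael B-PER", "spoke O", ". O",
]
print(count_gendered_person_names(sample))
# {'male': 3, 'female': 1, 'unknown': 0}
```

Even a crude tally like this makes a skew visible; the hard part, as the Scale AI experiment suggests, is doing it at scale with reliable name-to-gender mappings.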
The impact is far-reaching. Given that the dataset is widely considered one of the most popular around, it is unsurprising that it has found its way into any number of general-purpose systems. Indeed, CoNLL-2003’s sheer popularity means it is also used as an evaluation benchmark for some of the most widely used language systems, so the bias could ripple into other NLP models such as BERT.
Unfortunately, there is no easy way to determine its impact. Businesses tend to be tight-lipped about the training data they use, even when users ask. In many cases, AI models are veritable black boxes that few will question as long as they work.
And the bigger issue isn’t even about CoNLL-2003, but about the implications of bias in ML systems generally. As more ML systems trained on curated datasets carrying various forms of bias are deployed, the unanticipated effects will only multiply – in financial services, human resources departments, and elsewhere.
The ML black box
Take Joy Buolamwini in her first semester as a graduate student at the MIT Media Lab. As reported in New Scientist, she noticed that commercial face-recognition software failed to “see” her face while detecting her light-skinned classmates without issue. And when she dug further, it turned out that racial and gender bias in face-recognition software and other artificial intelligence systems is hardly rare.
In a similar vein, Amazon in 2014 developed an experimental in-house tool for screening job applications for technology roles. Though designed to be neutral, the tool was fed data about the company’s current software engineers. Because that workforce was overwhelmingly male, the tool picked up and inherited the gender imbalance, and went on to discriminate against women applying for technical roles.
To be clear, the tool was never used to evaluate actual job candidates and was eventually scrapped. Though the team worked hard to fix the bias that surfaced, including by manually tweaking the tool, the firm concluded that there was no guarantee it would not devise other, equally discriminatory ways of sorting candidates.
We are still at the nascent stage of the AI journey, of course. If anything, the various examples underscore the dangers of relying too heavily on AI systems without first developing the relevant controls, a deeper understanding, and ethical guidelines.
Photo credit: Screenshot/mahod84