AI Can Identify People Even in Anonymized Datasets

Advancements in AI might soon render phrases such as “hidden in the crowd” or “hiding in plain sight” curious relics of the past, according to new research published last week in Nature Communications.

In a paper titled “Interaction data are identifiable even across long periods of time,” researchers used geometric deep learning and triplet loss optimization to successfully identify a majority of individuals from an anonymized mobile phone dataset of 40,000 people.

Why it matters

The research is notable because fine-grained records of people’s interactions, both offline and online, are collected at scale today.

Tech giants such as Facebook and Google, telecommunication operators, and other businesses are known to collect and either resell data wholesale or leverage it to power data-centric services.

The technique relies on the fact that people tend to stick to established social circles and that these regular interactions form a stable pattern over time. By leveraging mobile phone interaction data and Bluetooth close-proximity data, the researchers successfully connected the dots between user interactions to identify people.

The ability to strip the anonymity from anonymized data using AI has repercussions for how data is collected and used. For one, it means that businesses reselling customer data might be unwittingly breaking laws such as the European Union’s General Data Protection Regulation and the California Consumer Privacy Act.

Currently, both sets of regulations permit organizations to share or sell information about people’s daily interactions without users’ consent only if the data is anonymized. Organizations might assume that replacing names with pseudonyms meets this standard, an assumption the research shows to be unfounded.

“Our results provide evidence that disconnected and even re-pseudonymized interaction data remain identifiable even across long periods of time,” noted the report.

Breaking the code

The researchers built a neural network to recognize patterns in users’ weekly social interactions, relying on the open-source bandicoot Python library to compute a set of behavioral features from each individual’s list of interactions. Using this approach, the team successfully identified 52.4 percent of people.
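The paper mentions triplet loss optimization, a standard way to train a network to place the same person’s interaction patterns from different weeks close together in an embedding space while pushing other people’s patterns away. The authors’ actual code is not public, but the core idea can be sketched in a few lines of plain Python (the vectors and margin below are illustrative, not from the study):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: encourage the anchor embedding to be closer to the
    positive (same person, different time period) than to the negative
    (a different person) by at least `margin`."""
    def dist(u, v):
        # Euclidean distance between two embedding vectors
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

# Illustrative embeddings: the positive is near the anchor, the negative far
loss = triplet_loss([0.0, 0.0], [0.0, 1.0], [5.0, 0.0])  # well separated, loss 0.0
```

Minimizing this loss over many such triplets is what lets the trained network match a pseudonymized user’s behavior in one time period to the same user in another.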

The data came from an unidentified mobile phone service and detailed 43,606 subscribers’ interactions over 14 weeks. The features range from the other party’s unique identifier and the type of communication (call or text message) to more sophisticated statistics such as the percentage of an individual’s contacts that account for 80% of their interactions.
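That last statistic captures how concentrated a person’s communication is among a few close contacts. A minimal sketch of how such a feature could be computed (a hypothetical recreation, not bandicoot’s actual implementation, and with made-up contact names):

```python
from collections import Counter

def top_contacts_share(interactions, coverage=0.8):
    """Fraction of a user's distinct contacts that account for `coverage`
    (default 80%) of all their interactions. `interactions` is a list of
    contact identifiers, one entry per call or text."""
    counts = Counter(interactions)
    total = sum(counts.values())
    covered, needed = 0, 0
    # Walk contacts from most to least frequent until coverage is reached
    for _, n in counts.most_common():
        covered += n
        needed += 1
        if covered >= coverage * total:
            break
    return needed / len(counts)

# Hypothetical user: 8 calls to one contact, one each to two others
share = top_contacts_share(["alice"] * 8 + ["bob", "carol"])  # 1 of 3 contacts
```

A heavy concentration like this is exactly the kind of stable, personal signature that survives pseudonymization.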

Within the same study, the researchers also examined whether contact tracing apps that rely on Bluetooth to collect close-proximity data between users are vulnerable. Using real-world Bluetooth data collected over four weeks from 587 university students, the researchers say they succeeded in identifying an individual 26.4 percent of the time.

While multiple contact tracing designs exist, the authors concluded that mitigation strategies relying on changing pseudonyms of both the person and their contacts could fail to adequately protect people’s privacy.

“While [the attack technique] does not target a specific application, protocol, or type of protocol (centralized, decentralized, or hybrid), it could form an effective basis for an attack against any system where an attacker has access to a user’s social graph over two or more time periods,” wrote the researchers.

Era of lost privacy

The researchers also pointed to prior work demonstrating that algorithms can predict a person’s significant other, wealth, demographics, propensity to overspend, personality traits, and other attributes from interaction data.

More advanced work even relies on homophily in network ties to make predictions, the authors say. Homophily is a concept in sociology describing the tendency of individuals to associate and bond with others who are similar to them.

“Interaction data are deeply personal and sensitive. They record with high precision who we talk to or meet, at what time, and for how long. Sensitive information can furthermore often be inferred from interaction data,” wrote the authors.

Underscoring the sensitivity of their findings, the researchers agreed with their ethics reviewers not to publish the source code from their research, as would normally be the case. Instead, it will only be made available upon request to researchers in the field for scientific purposes.

The full report can be accessed here.

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: iStockphoto/Julien Viry