In 2013, Yaniv Erlich’s genetics lab at MIT (now at Columbia) called the entire possibility of genetic anonymity into question when it discovered the identities of DNA donors by cross-referencing their genetic data with publicly available information from genealogy databases. The lab’s article, “Identifying Personal Genomes by Surname Inference” (1), published in Science, created a stir across the privacy and medical research communities.
Heather Dewey-Hagborg: In your own words, can you give us a brief explanation of the study? What did you do and what did it mean to you?
Yaniv Erlich: We showed that it is possible in some cases to infer the surnames of males from their allegedly de-identified DNA samples. In most societies, a male receives his surname from his father, who received his surname from his own father, and so on. Now, since males receive their Y chromosome from their father and the father of their father, this process creates a correlation between surnames and Y chromosomes.
Our technique exploits this correlation to identify the surname of individuals and uses open genetic genealogy databases to infer the right surname. Surnames are strong identifiers. Correctly inferring them dramatically narrows the search space. We specifically showed that if the age and state of the targeted individual are known (HIPAA does not protect these two identifiers), then a surname inference can virtually resolve the identity of the person.
To show that this technique works, we were able to identify, with extremely high probability, close to 50 people who were part of a large-scale study called the 1000 Genomes Project.
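The two steps Erlich describes, inferring a surname from a Y-chromosome profile and then narrowing by age and state, can be illustrated with a toy Python sketch. This is not the method from the paper. It assumes profiles are stored as repeat counts at a handful of Y-STR markers, uses a simple mismatch count as the similarity measure, and runs on invented data. The names `GENEALOGY_DB`, `PUBLIC_RECORDS`, `infer_surname`, and `narrow` are hypothetical.

```python
from collections import Counter

# Toy genealogy database: (surname, Y-STR profile). All values are invented.
GENEALOGY_DB = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 12}),
]

# Toy public records carrying the two extra identifiers: birth year and state.
PUBLIC_RECORDS = [
    {"name": "A. Smith", "surname": "Smith", "birth_year": 1950, "state": "UT"},
    {"name": "B. Smith", "surname": "Smith", "birth_year": 1982, "state": "UT"},
    {"name": "C. Smith", "surname": "Smith", "birth_year": 1950, "state": "NY"},
    {"name": "D. Jones", "surname": "Jones", "birth_year": 1950, "state": "UT"},
]


def mismatches(a, b):
    """Count markers present in both profiles whose repeat counts differ."""
    shared = a.keys() & b.keys()
    if not shared:
        return float("inf")  # nothing to compare, so treat as no match
    return sum(a[m] != b[m] for m in shared)


def infer_surname(profile, db, max_mismatches=1):
    """Return the surname most common among close matches, or None."""
    votes = Counter(
        surname for surname, ref in db
        if mismatches(profile, ref) <= max_mismatches
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def narrow(surname, birth_year, state, records):
    """Keep only records that match the surname, birth year, and state."""
    return [
        r for r in records
        if r["surname"] == surname
        and r["birth_year"] == birth_year
        and r["state"] == state
    ]


target = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
surname = infer_surname(target, GENEALOGY_DB)
candidates = narrow(surname, 1950, "UT", PUBLIC_RECORDS)
print(surname, [r["name"] for r in candidates])
```

In this toy data the surname alone leaves three candidates, and adding birth year and state leaves one, which mirrors the point that a surname dramatically narrows the search space and the two additional identifiers can nearly resolve it.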