It’s a common misconception that online comments and interactions that appear useless or harmless reveal nothing about their authors. In fact, they can contain a wealth of personal data that advanced language models can capture and analyze. Take, for example, the offhand comment “I remember watching Twin Peaks after school”. To an average reader it reveals little about the writer at first glance, yet a large language model can extract a surprising amount of information from it.
A language model can infer that the writer is likely between 45 and 50 years old, because Twin Peaks aired on television between 1990 and 1991 and was probably not watched by very young students. By combining several such posts from the same person, the model may also be able to infer other personal attributes such as hometown, gender or income. With enough data, it might even be possible to determine the author’s exact identity.
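To make the idea concrete, the following is a minimal sketch of how such attribute inference could be prompted with an off-the-shelf model. It assumes the OpenAI Python client (openai>=1.0) and an API key in the environment; the model name and the prompt wording are illustrative choices, not the setup used in the ETH Zurich study.

```python
# Minimal sketch of attribute inference from public comments.
# Assumptions (not from the article): OpenAI Python client >= 1.0,
# OPENAI_API_KEY set in the environment, illustrative prompt wording.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

comments = [
    "I remember watching Twin Peaks after school.",
    # Further posts by the same author would sharpen the guesses.
]

prompt = (
    "Based only on the following comments, estimate the author's likely "
    "age range, gender, and hometown, and briefly explain the reasoning.\n\n"
    + "\n".join(f"- {c}" for c in comments)
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```

The point of the sketch is only that no special tooling is needed: a short prompt over a handful of public comments is enough to ask a general-purpose model for exactly the kind of inferences described above.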
Historically, extracting information from seemingly innocuous online interactions has been time-consuming and costly. A 2002 study found that about half of the US population could be identified from just a few attributes such as location, gender and age, but collecting this information required considerable time and money.
A recent study at ETH Zurich, however, found that large language models such as GPT-4 are exceptionally well suited to this kind of information extraction. According to the study, these models can determine the three attributes most important for identifying a person with an accuracy of over 95%, and they can do so at a fraction of the cost and 240 times faster than humans.
The personal data collected in this way can be misused in a number of ways. Advertisers could build detailed profiles of users, and fraudsters could use the technique to identify anonymous users. In theory, chatbots could even be trained specifically to elicit seemingly harmless information from users and then extract personal data from it.
Despite the considerable potential for misuse, there are currently no effective countermeasures. The authors of the study therefore hope that their work will spark a broader discussion about the impact of large language models on data protection.