Machine learning algorithms can identify anonymous programmers
MACHINE LEARNING can pick out distinctive quirks in programmers' styles, helping identify the "fingerprints" of anonymous code wranglers.
Source: Roland Moore-Colyer
Rachel Greenstadt, associate professor of computer science at Drexel University, and Aylin Caliskan, an assistant professor at George Washington University, have found that code can be a form of stylistic expression, a bit like writing, Wired reported.
As such, the researchers developed a machine learning algorithm to recognise the coding structure used by individual programmers based on samples of their work and spot their traits in compiled binaries or raw source code.
The boffins will present their research at the DEF CON hacking conference, and noted such tech could be used to help investigate the authors of malware.
"Many hackers like to contribute code, binaries, and exploits under pseudonyms, but how anonymous are these contributions really? In this talk, we will discuss our work on programmer de-anonymization from the standpoint of machine learning," the researchers said.
"We will show how abstract syntax trees contain stylistic fingerprints and how these can be used to potentially identify programmers from code and binaries. We perform programmer de-anonymisation using both obfuscated binaries, and real-world code found in single-author GitHub repositories and the leaked Nulled.IO hacker forum."
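The core idea, that abstract syntax trees carry stylistic habits, can be illustrated with a toy sketch. This is not the researchers' actual pipeline, just a minimal illustration using Python's standard-library ast module: it profiles two functionally identical snippets by counting AST node types, then compares the profiles. A real system would feed features like these into a trained classifier.

```python
import ast
from collections import Counter

def ast_node_frequencies(source: str) -> Counter:
    """Count abstract-syntax-tree node types in a piece of Python source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse node-frequency vectors."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b)

# Two equivalent snippets written in different "styles":
# a list comprehension versus an explicit loop.
snippet_a = "def squares(n):\n    return [i * i for i in range(n)]\n"
snippet_b = (
    "def squares(n):\n"
    "    out = []\n"
    "    for i in range(n):\n"
    "        out.append(i * i)\n"
    "    return out\n"
)

profile_a = ast_node_frequencies(snippet_a)
profile_b = ast_node_frequencies(snippet_b)

# The comprehension style shows up in one profile but not the other.
print("ListComp in A:", profile_a["ListComp"], "| For in B:", profile_b["For"])
print("similarity:", round(cosine_similarity(profile_a, profile_b), 2))
```

Even on these tiny snippets the node profiles differ (one contains a ListComp node, the other a For node), which is the kind of signal a stylometry classifier can exploit across many samples of an author's code.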
The machine learning smarts can make an accurate identification 83 per cent of the time, based on a sample size of 600 programmers. So while it's not the most accurate machine learning algorithm ever, it could still help with cyber forensics, or simply give an idea of who might be contributing to open source code repositories.
The tech could also be used to sniff out plagiarism in code, and Wired points out that, more worryingly, it could be used by an oppressive government to identify people creating code and tools that get around state censorship.
Use of such a smart algorithm could end up putting privacy-conscious programmers off contributing to open source code, so it looks like some balance between privacy and security will need to be found if the researchers' work is put into action.