Google Tech Talk
April 22, 2011
Presented by Nathan Rosenblum, UW-Madison

ABSTRACT

Where did this binary come from? How was it compiled? What language did the programmer choose? Who wrote this code? These questions rarely occur to most computer users, but for analysts working in forensics, reverse engineering, and software theft, they are of paramount importance. The provenance of a program binary (the specific process through which an idea is transformed into executable code) can provide valuable insight, yet it is in the very domains where such information would be most useful that it is least likely to be available.

At the University of Wisconsin, we have investigated techniques to recover these provenance details from program binaries, filling in the gaps in the production process. Provenance recovery occupies the intersection of program analysis, security, and statistical machine learning research. In this talk, I will describe probabilistic models of provenance in the context of compiler toolchain identification, as well as both closed- and open-world solutions to the difficult task of program authorship attribution: picking out stylistic characteristics of executable code that reveal the identity of the programmer. Our work integrates a range of machine learning techniques, from support vector machines to conditional random fields to metric learning and large-margin clustering. I will discuss how we leverage large-scale computing resources to solve …

Full talk: Where Did This Code Come From? Discovering the Provenance of Program Binaries