Combining sampled partial data flows to recognize function similarity


Betreuer: Jannik Pewny

Beginn: ab sofort

Dauer: 6 Monate

Weitere Details:


Recent research has shown that executing a piece of code with randomly generated input and recording it's output (point-wise evaluation) can be used to semantically fingerprint functions and compare them. The rationality behind this is that this sampling generates input/output-pairs, and that similar functions will have more of these pairs in common than dissimilar ones.

However, this approach suffers from different function granularity: If a function uses a piece of another function and extends it a little, the approach will report no similar I/O-pairs and therefore no similarity.

In this thesis, not only the start values (input) and end values (output) of a function shall be used for comparison, but intermediate values as well. The main task is to evaluate to what extent it is possible to explain the complete behavior of a function by puzzling together different building blocks of other functions.

Tasks that need to be solved include:

  • Testharness to feed functions with data
  • Log input/intermediate/output values
  • Separate data flows
  • Find partial data flows
  • Evaluate the implementation


Recommended knowledge:

  • Good programming skills
  • Experience with instrumenting code
  • Knowledge about data flow