Tobias Huppertz defended his Master's thesis on 'Ein Indexansatz für ähnlichkeitserhaltende Hashwerte mittels Bloomfilter'

Ein Indexansatz für ähnlichkeitserhaltende Hashwerte mittels Bloomfilter

A major challenge in digital forensic investigation is the automated presorting of data to be examined, in order to minimize the manual work. Black- and whitelists are used to presort suspicious and non-relevant data according to their hash value. For a reliable file identification, cryptographic hash functions are used. These have the property of changing every bit of the output with a probability of 50% when the input is altered. As a result, similarity of files is not recognized anymore. This has a negative effect on the investigation because files need to be categorized manually.

In order to be able to also presort similar files, approximate matching is used. Thus the similarity of two files can be calculated. When a hard drive containing v files needs to be compared with a database of u entries, for every file the hash value needs to be calculated and compared with every hash from the database. The reason for this is the structure of approximate matching hash values. They can not be saved to a list in a sorted order for efficient access. This results in v · u = 1011 comparisons, which takes too long for investigation. Breitinger et al. describe the proposal to utilize the probabilistic data structure of bloom filters to organize the hash values in a binary tree. This creates a logarithmic search instead of a linear one. The amount of necessary comparisons reduces to log2 u · v = 2 · 10^6.

Based on this concept, a prototype is developed and evaluated in a test scenario that demonstrates its benefits and problems in practice. For this scenario, 500 GB of images were downloaded from flickr. Out of those images 100GB were used synonymously with suspicious images such as malware or child pornography, while the remaining 400 GB were handled as neutral. During evaluation it turned out that it takes the prototype twelve hours to insert and search for the specified data, while the linear comparison of lists using mrsh v2 would approximately take close to four years. Speed was gained through imprecision in the identification. Still suspicious files were reliably identified and neutral files were categorized correctly by nearly 98,9%. Furthermore fragments of files and similar files can be searched for. Fragments of files are interesting to find, because files may have been saved fragmented to a hard disk. When quick formatted, the allocation of these fragments is lost. Using suitable parameters, the success rate of correctly categorizing those fragments is about 99% for both suspicious and neutral data. Thus during digital forensic investigation, the prototype allows for saving considerable amounts of time.