thoughts scribbles images from silicon plateau
RSS icon Home icon
  • Simhashing in Erlang – beauty with binary comprehension

    Posted on August 23rd, 2009 jebu Comments

    Simhashing is popular technique to detect near duplicates in content. Given two files the similarity in their simhashes gives a mathematical way to compute the similarity of the documents. The algorithm works like this

    • split the content into a set of features
    • for each of the feature compute a hash of fixed width
    • for each bit in the hash let 1 be a positive increment and 0 a negative increment
    • sum up the bits in hashes using the above translation
    • for each bit position if the value is positive the sim hash has 1 in the corresponding bit position
    • for a negative value the resultant hash has 0 in that bit position

    Working this out in a simplified form, if the text is “Twitter is littered with spam”, I break this into features which are the individual words “twitter” “is” “littered” “with” “spam”. Suppose the hashes for these are 10101, 11001, 11000, 01100, 01000. This translates then to 1 -1 1 -1 1, 1 1 -1 -1 1, 1 1 -1 -1 -1, -1 1 1 -1 -1, -1 1 -1 -1 -1. Adding these up 1 3 -3 -5 1. The simhash for this is 11001.

    The hamming distance between the hashes of two documents can be used to figure out the similarity of the docs. This presentation has a detailed explanation of the simhashing technique.

    Now tweets are not an ideal place to experiment with the simhashes given we are limited to 140 chars. But its a pretty effective technique for deciding similarity, and can be expressed very easily. The power of Erlang binary comprehensions and the bit syntaxes allows us to calculate simhashes in a very nifty way. Without further blabber here is an Erlang implementation.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    
    -module(simhash).
     
    -define(HASH_RANGE, 1 bsl 128).
    -define(HASH_ACCU, 12).
    -define(HASH_WIDTH, 128).
    -export([hash_file/1]).
     
    hash_file(File) ->
      {ok, Binary} = file:read_file(File),
      Tokens = re:split(Binary, "\\W"),
      calculate_simhash(Tokens).
     
    calculate_hash(A) ->
      %% Hash = erlang:phash2(A, ?HASH_RANGE),
      << Hash:(?HASH_WIDTH) >> = erlang:md5(A),
      Hash.
     
    calculate_simhash(Tokens) ->
      FeatureHashes = [calculate_hash(A) || A <- Tokens, A =/= <<>>],
      {HashAcc, Len} = lists:foldl(fun accumulate_simhash/2, {0,0} , FeatureHashes),
      << <<(is_hash_valid(B, (Len / 2))):1>> || 
         <<B:(?HASH_ACCU)>> <= <<HashAcc:(?HASH_WIDTH * ?HASH_ACCU)>> >>.
     
    accumulate_simhash(Hash, {Accum, L}) ->
      <<A:(?HASH_WIDTH * ?HASH_ACCU)>> = 
              << <<B:(?HASH_ACCU)>> || <<B:1>> <= << Hash:(?HASH_WIDTH) >> >>,
      {Accum + A, L + 1}.
     
    is_hash_valid(E, Len) ->
      case(E > Len) of 
        true -> 1; 
        _ -> 0 
      end.

    The beauty of this is in accumulate_simhash function. Here I expand each bit of the feature hash into an arbitrary width representation and just add it to the accumulator. Depending on the number of features change the HASH_ACCU which is the number of bits for each bit of the resultant accumulator. At line 21 the resultant hash is calculated by reducing HASH_ACCU bits back to a 1 or 0 based on the number of hashes found.

    Is that not beautiful?

    • Pretty cool... would be interesting to know the actually clubbing process in http://nowwhat.in as well.. I assume u maintain a corpus of X fingerprints.. and every new topic from the the streaming API is simhashed and compared (using hamming distance) with the existing X fingerprints.. if near-match, increase size of topic font..
      great idea...
    • Hearing about Simhashing for the first time. Interesting idea.
    blog comments powered by Disqus