thoughts scribbles images from silicon plateau
RSS icon Home icon
  • Simhashing in Erlang – beauty with binary comprehension

    Posted on August 23rd, 2009 jebu View Comments

    Simhashing is popular technique to detect near duplicates in content. Given two files the similarity in their simhashes gives a mathematical way to compute the similarity of the documents. The algorithm works like this

    • split the content into a set of features
    • for each of the feature compute a hash of fixed width
    • for each bit in the hash let 1 be a positive increment and 0 a negative increment
    • sum up the bits in hashes using the above translation
    • for each bit position if the value is positive the sim hash has 1 in the corresponding bit position
    • for a negative value the resultant hash has 0 in that bit position

    Working this out in a simplified form, if the text is “Twitter is littered with spam”, I break this into features which are the individual words “twitter” “is” “littered” “with” “spam”. Suppose the hashes for these are 10101, 11001, 11000, 01100, 01000. This translates then to 1 -1 1 -1 1, 1 1 -1 -1 1, 1 1 -1 -1 -1, -1 1 1 -1 -1, -1 1 -1 -1 -1. Adding these up 1 3 -3 -5 1. The simhash for this is 11001.

    The hamming distance between the hashes of two documents can be used to figure out the similarity of the docs. This presentation has a detailed explanation of the simhashing technique.

    Now tweets are not an ideal place to experiment with the simhashes given we are limited to 140 chars. But its a pretty effective technique for deciding similarity, and can be expressed very easily. The power of Erlang binary comprehensions and the bit syntaxes allows us to calculate simhashes in a very nifty way. Without further blabber here is an Erlang implementation.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    
    -module(simhash).
     
    -define(HASH_RANGE, 1 bsl 128).
    -define(HASH_ACCU, 12).
    -define(HASH_WIDTH, 128).
    -export([hash_file/1]).
     
    hash_file(File) ->
      {ok, Binary} = file:read_file(File),
      Tokens = re:split(Binary, "\\W"),
      calculate_simhash(Tokens).
     
    calculate_hash(A) ->
      %% Hash = erlang:phash2(A, ?HASH_RANGE),
      << Hash:(?HASH_WIDTH) >> = erlang:md5(A),
      Hash.
     
    calculate_simhash(Tokens) ->
      FeatureHashes = [calculate_hash(A) || A <- Tokens, A =/= <<>>],
      {HashAcc, Len} = lists:foldl(fun accumulate_simhash/2, {0,0} , FeatureHashes),
      << <<(is_hash_valid(B, (Len / 2))):1>> || 
         <<B:(?HASH_ACCU)>> <= <<HashAcc:(?HASH_WIDTH * ?HASH_ACCU)>> >>.
     
    accumulate_simhash(Hash, {Accum, L}) ->
      <<A:(?HASH_WIDTH * ?HASH_ACCU)>> = 
              << <<B:(?HASH_ACCU)>> || <<B:1>> <= << Hash:(?HASH_WIDTH) >> >>,
      {Accum + A, L + 1}.
     
    is_hash_valid(E, Len) ->
      case(E > Len) of 
        true -> 1; 
        _ -> 0 
      end.

    The beauty of this is in accumulate_simhash function. Here I expand each bit of the feature hash into an arbitrary width representation and just add it to the accumulator. Depending on the number of features change the HASH_ACCU which is the number of bits for each bit of the resultant accumulator. At line 21 the resultant hash is calculated by reducing HASH_ACCU bits back to a 1 or 0 based on the number of hashes found.

    Is that not beautiful?

    • seven jeans
      we like seven jeans,
      seven 7 jeans is our favourite.
    • Burgessdhdfj
      Hearing about Simhashing for the first time. Interesting idea.
    • Pretty cool... would be interesting to know the actually clubbing process in http://nowwhat.in as well.. I assume u maintain a corpus of X fingerprints.. and every new topic from the the streaming API is simhashed and compared (using hamming distance) with the existing X fingerprints.. if near-match, increase size of topic font..
      great idea...
    • Hearing about Simhashing for the first time. Interesting idea.
    blog comments powered by Disqus