thoughts scribbles images from silicon plateau
RSS icon Home icon
  • Simhashing in Erlang – beauty with binary comprehension

    Posted on August 23rd, 2009 jebu 15 comments

    Simhashing is popular technique to detect near duplicates in content. Given two files the similarity in their simhashes gives a mathematical way to compute the similarity of the documents. The algorithm works like this

    • split the content into a set of features
    • for each of the feature compute a hash of fixed width
    • for each bit in the hash let 1 be a positive increment and 0 a negative increment
    • sum up the bits in hashes using the above translation
    • for each bit position if the value is positive the sim hash has 1 in the corresponding bit position
    • for a negative value the resultant hash has 0 in that bit position

    Working this out in a simplified form, if the text is “Twitter is littered with spam”, I break this into features which are the individual words “twitter” “is” “littered” “with” “spam”. Suppose the hashes for these are 10101, 11001, 11000, 01100, 01000. This translates then to 1 -1 1 -1 1, 1 1 -1 -1 1, 1 1 -1 -1 -1, -1 1 1 -1 -1, -1 1 -1 -1 -1. Adding these up 1 3 -3 -5 1. The simhash for this is 11001.

    The hamming distance between the hashes of two documents can be used to figure out the similarity of the docs. This presentation has a detailed explanation of the simhashing technique.

    Now tweets are not an ideal place to experiment with the simhashes given we are limited to 140 chars. But its a pretty effective technique for deciding similarity, and can be expressed very easily. The power of Erlang binary comprehensions and the bit syntaxes allows us to calculate simhashes in a very nifty way. Without further blabber here is an Erlang implementation.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    
    -module(simhash).
     
    -define(HASH_RANGE, 1 bsl 128).
    -define(HASH_ACCU, 12).
    -define(HASH_WIDTH, 128).
    -export([hash_file/1]).
     
    hash_file(File) ->
      {ok, Binary} = file:read_file(File),
      Tokens = re:split(Binary, "\\W"),
      calculate_simhash(Tokens).
     
    calculate_hash(A) ->
      %% Hash = erlang:phash2(A, ?HASH_RANGE),
      << Hash:(?HASH_WIDTH) >> = erlang:md5(A),
      Hash.
     
    calculate_simhash(Tokens) ->
      FeatureHashes = [calculate_hash(A) || A <- Tokens, A =/= <<>>],
      {HashAcc, Len} = lists:foldl(fun accumulate_simhash/2, {0,0} , FeatureHashes),
      << <<(is_hash_valid(B, (Len / 2))):1>> || 
         <<B:(?HASH_ACCU)>> <= <<HashAcc:(?HASH_WIDTH * ?HASH_ACCU)>> >>.
     
    accumulate_simhash(Hash, {Accum, L}) ->
      <<A:(?HASH_WIDTH * ?HASH_ACCU)>> = 
              << <<B:(?HASH_ACCU)>> || <<B:1>> <= << Hash:(?HASH_WIDTH) >> >>,
      {Accum + A, L + 1}.
     
    is_hash_valid(E, Len) ->
      case (E > Len) of 
        true -> 1; 
        _ -> 0 
      end.

    The beauty of this is in accumulate_simhash function. Here I expand each bit of the feature hash into an arbitrary width representation and just add it to the accumulator. Depending on the number of features change the HASH_ACCU which is the number of bits for each bit of the resultant accumulator. At line 21 the resultant hash is calculated by reducing HASH_ACCU bits back to a 1 or 0 based on the number of hashes found.

    Is that not beautiful?

    • http://blog.poundbang.in Harish Mallipeddi

      Hearing about Simhashing for the first time. Interesting idea.

    • http://twitter.com/arun_suresh Arun Suresh (KuTtZ)

      Pretty cool… would be interesting to know the actually clubbing process in http://nowwhat.in as well.. I assume u maintain a corpus of X fingerprints.. and every new topic from the the streaming API is simhashed and compared (using hamming distance) with the existing X fingerprints.. if near-match, increase size of topic font..
      great idea…

    • http://twitter.com/arun_suresh Arun Suresh (KuTtZ)

      Pretty cool… would be interesting to know the actually clubbing process in http://nowwhat.in as well.. I assume u maintain a corpus of X fingerprints.. and every new topic from the the streaming API is simhashed and compared (using hamming distance) with the existing X fingerprints.. if near-match, increase size of topic font..
      great idea…

    • http://www.filecabinetkey.net/hon-file-cabinet-keys HON File Cabinet Keys

      That's the great article! I just pass 'n read it, two thumbs up! ;)

    • http://www.discount-air-jordan.com air jordan shoes

      Mark S. is definitely on the right track. If you want to get a professional looking email address, Id recommend buying your name domain name, like or
      ajf 6
      If its common it might be difficult to get, however, be creative and you can usually find something.

    • http://www.moncler-down-jackets.com Nike Air Force 1 Low

      Yeah,air force 1 shoes I have to admire the landlord's unique point of view,ugg boots for sale this article is very comprehensive and considerable on the analyse, and greatly inspired me. In addition, I would like to share that some other blog's article, content is also very good, if you scan it,there will be a suprise!ugg boots for sale

    • http://www.gucci-outlet-store.com gucci outlet

      The post of content is very interesting and exciting. I learned a lot from here.The content from simple to complex, so all of you can come in . No matter you want to see what can be found.By the way ,there are some websites is also very wonderful,you can go and see.such as air jordan 16.5

    • superflatiron1

      chi flat iron, chi hair strighteners

      chi hair straighteners, chi flat iron

      chi nano ceramic flat iron, chi nano ceramic flat iron

    • superflatiron

      nice post

    • Burgessdhdfj

      Hearing about Simhashing for the first time. Interesting idea.

    • seven jeans

      we like seven jeans,
      seven 7 jeans is our favourite.

    • http://www.myshoxsneakers.com nike shox R4

      Take my breath away.I’ve loved everything in your blog, thanks for always sharing such wonderful post. Have a good week!

    • http://www.estetikbiz.com Estetik merkezleri

      good idea, great post, thanks for code

    • Guest

      Shouldn’t it be?: Adding these up 1 3 -1 -5 -1. The simhash for this is 11000.

    • Abhay_sena

      I have employed it successfully for medium length texts i.e, 20+ words but for small texts i.e <10 words I didnt find it too effective.