thoughts scribbles images from silicon plateau
RSS icon Home icon
  • Simhashing in Erlang – beauty with binary comprehension

    Posted on August 23rd, 2009 jebu Comments

    Simhashing is popular technique to detect near duplicates in content. Given two files the similarity in their simhashes gives a mathematical way to compute the similarity of the documents. The algorithm works like this

    • split the content into a set of features
    • for each of the feature compute a hash of fixed width
    • for each bit in the hash let 1 be a positive increment and 0 a negative increment
    • sum up the bits in hashes using the above translation
    • for each bit position if the value is positive the sim hash has 1 in the corresponding bit position
    • for a negative value the resultant hash has 0 in that bit position

    Working this out in a simplified form, if the text is “Twitter is littered with spam”, I break this into features which are the individual words “twitter” “is” “littered” “with” “spam”. Suppose the hashes for these are 10101, 11001, 11000, 01100, 01000. This translates then to 1 -1 1 -1 1, 1 1 -1 -1 1, 1 1 -1 -1 -1, -1 1 1 -1 -1, -1 1 -1 -1 -1. Adding these up 1 3 -3 -5 1. The simhash for this is 11001.

    The hamming distance between the hashes of two documents can be used to figure out the similarity of the docs. This presentation has a detailed explanation of the simhashing technique.

    Now tweets are not an ideal place to experiment with the simhashes given we are limited to 140 chars. But its a pretty effective technique for deciding similarity, and can be expressed very easily. The power of Erlang binary comprehensions and the bit syntaxes allows us to calculate simhashes in a very nifty way. Without further blabber here is an Erlang implementation.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    
    -module(simhash).
     
    -define(HASH_RANGE, 1 bsl 128).
    -define(HASH_ACCU, 12).
    -define(HASH_WIDTH, 128).
    -export([hash_file/1]).
     
    hash_file(File) ->
      {ok, Binary} = file:read_file(File),
      Tokens = re:split(Binary, "\\W"),
      calculate_simhash(Tokens).
     
    calculate_hash(A) ->
      %% Hash = erlang:phash2(A, ?HASH_RANGE),
      << Hash:(?HASH_WIDTH) >> = erlang:md5(A),
      Hash.
     
    calculate_simhash(Tokens) ->
      FeatureHashes = [calculate_hash(A) || A <- Tokens, A =/= <<>>],
      {HashAcc, Len} = lists:foldl(fun accumulate_simhash/2, {0,0} , FeatureHashes),
      << <<(is_hash_valid(B, (Len / 2))):1>> || 
         <<B:(?HASH_ACCU)>> <= <<HashAcc:(?HASH_WIDTH * ?HASH_ACCU)>> >>.
     
    accumulate_simhash(Hash, {Accum, L}) ->
      <<A:(?HASH_WIDTH * ?HASH_ACCU)>> = 
              << <<B:(?HASH_ACCU)>> || <<B:1>> <= << Hash:(?HASH_WIDTH) >> >>,
      {Accum + A, L + 1}.
     
    is_hash_valid(E, Len) ->
      case(E > Len) of 
        true -> 1; 
        _ -> 0 
      end.

    The beauty of this is in accumulate_simhash function. Here I expand each bit of the feature hash into an arbitrary width representation and just add it to the accumulator. Depending on the number of features change the HASH_ACCU which is the number of bits for each bit of the resultant accumulator. At line 21 the resultant hash is calculated by reducing HASH_ACCU bits back to a 1 or 0 based on the number of hashes found.

    Is that not beautiful?

  • Erlang talking to apache via AJP mod_jk

    Posted on February 20th, 2009 jebu Comments

    Where have I been all this time since Onam? Its been 5 months since I wrote something, things have been busy at work and I have been exploring a new language, Erlang.

    I was first exposed to Erlang via blog posts popping up on HackerNews and Lambda the Ultimate. Lots of good things were said there and I have had a liking to recursive functions since my Pascal days. So did a deep dive, got the Erlang Bible by Joe Armstrong on my London trip and the book was a revelation. The syntax was weird at first but then I have begun to like it. The pattern matching function syntax is just the one for writing recursive functions. And yeah bit manipulation with pattern matching is like absolute heaven if you want to do some protocol implementation.

    Coupling Erlang to provide a web interface is essential, and I was hacking around with the scgi adapter, which was in disco and lighttpd. But then I needed a plugin to apache and scgi was not an option. Other options cgi, fcgi, ajp. Of these ajp was the best fit since mod_jk is pretty widely used for coupling servlet containers. I could not find an official spec for AJP and Apache’s documentation was the best available. The protocol is pretty straight forward and getting it working from an Erlang implementation was quick. I have a working version up and is available on github. This is the first time I’m putting something out in public domain so forgive any obvious mistakes and feel free to point out or jump in and fix things.

    There is lot of scope for improvement in there, this is just a working skeleton to serve requests from Erlang. I have tried to give it a bit order by having a behaviour defined for any module to plugin. Have your module confirm to the gen_ajp_handler behaviour and call back to gen_ajp_handler methods send_headers, send_data, request_data, end_request for obvious functionality. The dispatch mechanism is not yet filled out. Watch out here and on git hub for updates.