Shevek (shevek) wrote,

sort | uniq -c algorithmic challenge

sort | uniq -c | sort -rn is such a popular combination that it even has a Wikipedia entry. It could be done faster. This wouldn't matter if I weren't working with a very large data set.

I have a HUGE unsorted stream or file of text strings. I want to know how many of each string there are. How do I best do this?

Sorting is slow - it's O(n log n). The alternative linear-time algorithm - build a very large hash table and just count - requires random access, which in practice tends to be slower for a data set of this size. I experimented with gdbm, but while building a string-to-integer map it managed about 5% of my small sample data set before falling over.
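The linear-time counting approach is easy to state as code. A minimal Python sketch (using an in-memory Counter purely for illustration - the experiment above used an on-disk gdbm map, which is where the random-access cost bites):

```python
from collections import Counter

def count_lines(lines):
    """One pass, one hash-table increment per line: O(n) total work,
    but every distinct key must fit in the table at once."""
    counts = Counter()
    for line in lines:
        counts[line] += 1
    return counts

# most_common() gives the descending-frequency order of `| sort -rn`
top = count_lines(["a", "b", "a", "c", "a"]).most_common()
```

The table is fast while it fits in RAM; the trouble starts when it doesn't, and lookups become random disk seeks.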

A middle-ground alternative would be to combine the two: build a series of small in-memory hash tables, spill each one to disk as a sorted run of (key, count) pairs, and merge the runs in a later step, summing counts. Counting before sorting shrinks the data, so while the whole thing remains O(n log n), the constant is MUCH smaller. But I can't find an implementation of this.
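The spill-and-merge idea above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a tuned implementation: count_external and the max_keys threshold are names I've made up, keys are assumed to contain no tabs or newlines, and temporary files stand in for real run management:

```python
import heapq
import itertools
import tempfile

def count_external(lines, max_keys=1_000_000):
    """Count duplicates in a stream too big for one hash table:
    hash-count a chunk in RAM, spill it to disk as a sorted run of
    count<TAB>key records, then k-way merge the runs, summing counts
    for equal keys.  Assumes keys contain no tab or newline."""
    runs = []
    counts = {}

    def spill():
        f = tempfile.TemporaryFile(mode="w+")
        for key in sorted(counts):          # runs are sorted on disk
            f.write(f"{counts[key]}\t{key}\n")
        f.seek(0)
        runs.append(f)
        counts.clear()

    for line in lines:
        counts[line] = counts.get(line, 0) + 1
        if len(counts) >= max_keys:         # memory cap reached: spill
            spill()
    if counts:
        spill()

    def parse(f):
        for rec in f:
            n, key = rec.rstrip("\n").split("\t", 1)
            yield key, int(n)

    # equal keys come out adjacent from the merge, so their partial
    # counts can be summed as they stream past
    merged = heapq.merge(*(parse(f) for f in runs))
    for key, group in itertools.groupby(merged, key=lambda kv: kv[0]):
        yield key, sum(n for _, n in group)
```

The sort inside each spill touches only the distinct keys of that chunk, and the merge is sequential I/O - which is exactly why the constant should beat sorting the raw stream.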

I discovered sortu, which is similar, except that because it works entirely in memory it's not nearly as scalable as sort | uniq -c, and therefore offers no advantage whatsoever. I had hoped that, since the program is aimed at a small number of distinct keys, it would at least pick something better than straight comparison sorting (radix, perhaps?).

Is there any better tool? Does anyone know of a better approach? Is there a really fast hash table which I could use? Or is there a combination of sort and uniq which spills to disk and merges?
