I have a HUGE unsorted stream or file of text strings, and I want to know how many times each string occurs. What's the best way to do this?
Sorting is slow: it's O(n log n). The alternative linear-time approach, building one very large hash table and just counting, requires random access, which in practice tends to be slower for a data set this size. I experimented with gdbm, but while building a string-to-integer map it managed about 5% of my small sample data set before falling over.
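For reference, this is roughly the in-memory version I mean; it's a minimal sketch, and it only works if the set of distinct strings fits in RAM, which is exactly the assumption that fails for my data (the script name and I/O conventions are just placeholders):

```python
#!/usr/bin/env python3
# Sketch of the linear-time approach: one big in-memory hash table.
# Only viable if the distinct keys fit in RAM -- the assumption that
# fails for my data set.
import sys
from collections import Counter

def count_lines(stream):
    """Count occurrences of each line read from the stream."""
    counts = Counter()
    for line in stream:
        counts[line.rstrip("\n")] += 1
    return counts

if __name__ == "__main__":
    for key, n in count_lines(sys.stdin).most_common():
        print(n, key)
```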
A middle-ground alternative would be to combine the two ideas: build a number of smaller in-memory hash tables, spill each one to disk as a sorted run, and then merge the runs in a later step. This reduces the amount of data that has to be sorted, and while it remains O(n log n), the constant is MUCH smaller. But I can't find an implementation of this; a sketch of what I mean is below.
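To be concrete, here is a rough sketch of the scheme, not any existing tool (the names and the spill threshold are mine): count into an in-memory dict until it reaches a size limit, write it to disk as a sorted run of key/count pairs, and finally do a k-way merge of the runs, summing the counts of equal keys:

```python
#!/usr/bin/env python3
# Sketch of the spill-and-merge scheme: count in memory in chunks,
# write sorted runs to disk, then merge the runs summing counts per key.
# MAX_KEYS and all names are placeholders, not from an existing tool.
import heapq
import itertools
import sys
import tempfile

MAX_KEYS = 1_000_000  # spill the in-memory table once it holds this many keys

def spill(counts, runs):
    """Write the current table to a temp file as a sorted run of 'key<TAB>count' lines."""
    run = tempfile.TemporaryFile(mode="w+t")
    for key in sorted(counts):
        run.write(f"{key}\t{counts[key]}\n")
    run.seek(0)
    runs.append(run)
    counts.clear()

def parse(run):
    """Yield (key, count) pairs from one sorted run."""
    for line in run:
        key, count = line.rstrip("\n").rsplit("\t", 1)
        yield key, int(count)

def count_with_spills(stream):
    counts, runs = {}, []
    for line in stream:
        key = line.rstrip("\n")
        counts[key] = counts.get(key, 0) + 1
        if len(counts) >= MAX_KEYS:
            spill(counts, runs)
    if counts:
        spill(counts, runs)
    # k-way merge of the sorted runs, summing counts of equal keys
    merged = heapq.merge(*(parse(r) for r in runs), key=lambda kv: kv[0])
    for key, group in itertools.groupby(merged, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

if __name__ == "__main__":
    for key, total in count_with_spills(sys.stdin):
        print(total, key)
```

Each run only sorts the distinct keys seen in its chunk, and the final merge is a single sequential pass over the runs, which is why the constant should be so much smaller than one monolithic sort of the raw lines.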
I discovered sortu, which is similar, except that it keeps everything in heap memory, so it is not nearly as scalable as sort | uniq -c and offers no advantage here. I can only hope that, since that program is aimed at a small number of distinct keys, it picks something better than straight comparison sorting (radix sort, for example?).
Is there a better tool? Does anyone know of a better approach? Is there a really fast hash table I could use? Or is there a combination of sort and uniq that spills to disk and merges?