Monday, July 21, 2008

Merits of a System.Collections.Hashtable

I was recently working on a very simple data analysis process involving two lists of items. One is a list of files on disk, another a list of files in a database table. The work sets were on the order of 85,000 items each. I needed to simply scan each list to see if each item existed in the other. I loaded the two lists into a generic string collection (List) and iterated through each one once, looking in the other for the existence of the item (otherList.Contains(item)). I ran the process and it took a good 2 to 3 minutes to complete. Reasonable I though, as it was many 10s of thousands.

I then thought about the way the code was probably running internally. I imagined that the List.Contains method is probably just doing it's own internal loop and comparing the item against each one in the list until it finds a match. Otherwise, it runs through to the end. I remembered that the Hashtable organizes the list of items by the hash of the key so it might have a better and more efficient way of finding items based on that logic. I refactored the code to use Hashtable instead of the List. In addition to finding that I had some duplicates in the list coming from the database (fixed after adding a "DISTINCT" to the query) I found that the the process now took about 5 seconds! That's an insane improvement. I'm curious how the Hashtable does it's thing.

No comments: