Monday, October 10, 2011

Sets of Dictionaries in Python

I may come to regret this post in the future, we'll see. So I was hacking on some sysadmin tool that collects data. Each item is a dictionary of various things, and I had all those dictionaries in a list. Wait! What if I parse another piece of data that results in an identical dictionary? I don't want to keep growing the list to infinity with duplicates, do I? So without much thought I replaced the list with a set, but that doesn't work:
>>> set([{}])
Traceback (most recent call last):
  File "", line 1, in 
TypeError: unhashable type: 'dict'
Of course this makes sense: Python implements sets as dictionaries and dictionaries as hash tables and you cannot use something mutable as a key in a hash table. (If you don't see why that's so, think harder.) But sensible or not, what I certainly don't want to do is search my list of dictionaries for duplicates before every single insertion! So I came up with a little hack to make dictionaries hashable:
def hash_string_dict(obj):
    Return a unique-ish string for a dictionary mapping
    string-ish things to string-ish things.
    import hashlib
    k = "".join(sorted(str(obj.keys())))
    v = "".join(sorted(str(obj.values())))
    digest = hashlib.sha1(k+v).hexdigest()
    return digest
Alright, so the dictionaries are not really hashable, instead I produce a hashable digest of a dictionary's contents. You can now give me a lecture on how this is not very efficient, and some part of me would agree. However, given a long enough list of dictionaries, the time I spend on computing these digests will be less than looking through the whole list for a duplicate.

So instead of a list of dictionaries or a set of dictionaries, I end up with a dictionary of dictionaries: The key in the outer dictionary is the digest of the inner dictionary. If a duplicate comes along it'll produce an identical digest and I can forget about it after one (expected!) constant time lookup.

Of course all of this depends crucially on my dictionaries being immutable as far as my application is concerned. Python itself doesn't have the luxury of wondering about this, it has to "worst-case" it and assume all dictionaries are mutable, period. I wonder: Should I package this as a container class? :-D

Update: Note that it's sort of important that the keys and values you have in that dictionary produce "useful" string representations. In other words, don't apply this trick without thinking through the kind of data you're pushing around and whether the digest has a reasonable chance of being accurate enough for duplicate detection.

No comments:

Post a Comment