>>> set([{}])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

Of course this makes sense: Python implements sets as dictionaries and dictionaries as hash tables, and you cannot use something mutable as a key in a hash table. (If you don't see why that's so, think harder.) But sensible or not, what I certainly don't want to do is search my list of dictionaries for duplicates before every single insertion! So I came up with a little hack to make dictionaries hashable:
def hash_string_dict(obj):
    """
    Return a unique-ish string for a dictionary mapping
    string-ish things to string-ish things.
    """
    import hashlib
    k = "".join(sorted(str(obj.keys())))
    v = "".join(sorted(str(obj.values())))
    digest = hashlib.sha1(k+v).hexdigest()
    return digest

Alright, so the dictionaries are not really hashable; instead I produce a hashable digest of a dictionary's contents. You can now give me a lecture on how this is not very efficient, and some part of me would agree. However, given a long enough list of dictionaries, the time I spend computing these digests will be less than the time I'd spend searching the whole list for a duplicate.
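(As written, the function assumes Python 2, where hashlib.sha1() accepts the concatenated str directly; under Python 3 the string has to be encoded to bytes first. A minimal sketch of the same idea for Python 3, where the _py3 name is mine and not part of the original hack:)

import hashlib

def hash_string_dict_py3(obj):
    """Python 3 sketch of the same digest: encode the string to bytes before hashing."""
    k = "".join(sorted(str(obj.keys())))
    v = "".join(sorted(str(obj.values())))
    return hashlib.sha1((k + v).encode("utf-8")).hexdigest()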
So instead of a list of dictionaries or a set of dictionaries, I end up with a dictionary of dictionaries: the key in the outer dictionary is the digest of the inner dictionary. If a duplicate comes along, it'll produce an identical digest, and I can forget about it after one (expected!) constant-time lookup.
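In code, that bookkeeping looks roughly like the following sketch; the records and incoming names are made up for the sake of the example, and the digest comes from the function above:

records = {}  # digest -> dictionary; duplicates share a key and get dropped

incoming = [
    {"name": "alice", "role": "admin"},
    {"name": "bob", "role": "user"},
    {"name": "alice", "role": "admin"},  # duplicate: same contents, same digest
]

for d in incoming:
    digest = hash_string_dict(d)  # on Python 3, use the encoding variant sketched above
    if digest not in records:     # one (expected!) constant-time lookup
        records[digest] = d

print(len(records))  # 2: the duplicate never made it in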
Of course all of this depends crucially on my dictionaries being immutable as far as my application is concerned. Python itself doesn't have the luxury of wondering about this; it has to "worst-case" it and assume all dictionaries are mutable, period. I wonder: Should I package this as a container class? :-D
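If I ever did, a minimal sketch might look like this (Python 3, class name mine): a dict subclass that promises, but does not enforce, immutability, and hashes via the same digest so it can live in sets and serve as a key.

import hashlib

class FrozenishDict(dict):
    """A dict that promises (but does not enforce) not to be mutated after creation."""

    def __hash__(self):
        k = "".join(sorted(str(self.keys())))
        v = "".join(sorted(str(self.values())))
        return hash(hashlib.sha1((k + v).encode("utf-8")).hexdigest())

# The failing example from the top now works:
# >>> set([FrozenishDict()])
# {{}}

Equality still comes from dict itself, so two such objects with the same contents collapse to a single set element.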
Update: Note that it's sort of important that the keys and values you have in that dictionary produce "useful" string representations. In other words, don't apply this trick without thinking through the kind of data you're pushing around and whether the digest has a reasonable chance of being accurate enough for duplicate detection.
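For a concrete example of what can go wrong (using the digest function from above): because the digest sorts the individual characters of the string representation, any two dictionaries whose string forms are anagrams of each other look like duplicates.

a = {"key": "12"}
b = {"key": "21"}

assert a != b                                      # genuinely different data...
assert hash_string_dict(a) == hash_string_dict(b)  # ...but identical digests

The opposite failure exists too: values whose repr falls back to the default object representation (which embeds a memory address) will give records you would consider duplicates two different digests, so the duplicate slips through.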