Consider this code:
>>> x = "google" >>> x is "google" True >>> x = "google.com" >>> x is "google.com" False >>>
Why is it like that?
To make sure the above is correct, I have just tested on Python 2.5.4, 2.6.5, 2.7b2, Python 3.1 on windows and Python 2.7b1 on Linux.
It looks like there is consistency across all of them, so it’s by design. Am I missing something?
I just find it out that from some of my personal domain filtering script failing with that.
is verifies object identity, and any implementation of Python, when it meets literal of immutable types, is perfectly free to either make a new object of that immutable type, or seek through existing objects of that type to see if some of them could be reused (by adding a new reference to the same underlying object). This is a pragmatic choice of optimization and not subject to semantic constraints, so your code should never rely on which path a give implementation may take (or it could break with a bugfix/optimization release of Python!).
Consider for example:
>>> import dis >>> def f(): ... x = 'google.com' ... return x is 'google.com' ... >>> dis.dis(f) 2 0 LOAD_CONST 1 ('google.com') 3 STORE_FAST 0 (x) 3 6 LOAD_FAST 0 (x) 9 LOAD_CONST 1 ('google.com') 12 COMPARE_OP 8 (is) 15 RETURN_VALUE
so in this particular implementation, within a function, your observation does not apply and only one object is made for the literal (any literal), and, indeed:
>>> f() True
Pragmatically that’s because within a function making a pass through the local table of constants (to save some memory by not making multiple constant immutable objects where one suffices) is pretty cheap and fast, and may offer good performance returns since the function may be called repeatedly afterwards.
But, the very same implementation, at the interactive prompt (Edit: I originally thought this would also happen at a module’s top level, but a comment by @Thomas set me right, see later):
>>> x = 'google.com' >>> y = 'google.com' >>> id(x), id(y) (4213000, 4290864)
does NOT bother trying to save memory that way — the
ids are different, i.e., distinct objects. There are potentially higher costs and lower returns and so the heuristics of this implementation’s optimizer tell it to not bother searching and just go ahead.
Edit: at module top level, per @Thomas’ observation, given e.g.:
$ cat aaa.py x = 'google.com' y = 'google.com' print id(x), id(y)
again we see the table-of-constants-based memory-optimization in this implementation:
>>> import aaa 4291104 4291104
(end of Edit per @Thomas’ observation).
Lastly, again on the same implementation:
>>> x = 'google' >>> y = 'google' >>> id(x), id(y) (2484672, 2484672)
the heuristics are different here because the literal string “looks like it might be an identifier” — so it might be used in operation requiring interning… so the optimizer interns it anyway (and once interned, looking for it becomes very fast of course). And indeed, surprise surprise…:
>>> z = intern(x) >>> id(z) 2484672
x has been
interned the very first time (as you see, the return value of
intern is the same object as
y, as it has the same
id()). Of course, you shouldn’t rely on this either — the optimizer doesn’t have to intern anything automatically, it’s just an optimization heuristic; if you need
intern them explicitly, just to be safe. When you do intern strings explicitly…:
>>> x = intern('google.com') >>> y = intern('google.com') >>> id(x), id(y) (4213000, 4213000)
…then you do ensure exactly the same object (i.e., same
id()) results each and every time — so you can apply micro-optimizations such as checking with
is rather than
== (I’ve hardly ever found the miniscule performance gain to be worth the bother;-).
Edit: just to clarify, here are the kind of performance differences I’m talking about, on a slow Macbook Air…:
$ python -mtimeit -s"a='google';b='google'" 'a==b' 10000000 loops, best of 3: 0.132 usec per loop $ python -mtimeit -s"a='google';b='google'" 'a is b' 10000000 loops, best of 3: 0.107 usec per loop $ python -mtimeit -s"a='goo.gle';b='goo.gle'" 'a==b' 10000000 loops, best of 3: 0.132 usec per loop $ python -mtimeit -s"a='google';b='google'" 'a is b' 10000000 loops, best of 3: 0.106 usec per loop $ python -mtimeit -s"a=intern('goo.gle');b=intern('goo.gle')" 'a is b' 10000000 loops, best of 3: 0.0966 usec per loop $ python -mtimeit -s"a=intern('goo.gle');b=intern('goo.gle')" 'a == b' 10000000 loops, best of 3: 0.126 usec per loop
…a few tens of nanoseconds either way, at most. So, worth even thinking about only in the most extreme “optimize the [expletive deleted] out of this [expletive deleted] performance bottleneck” situations!-)
“is” is an identity test. Python has some caching behavior for small integers and (apparently) strings. “is” is best used for singleton testing (ex.
>>> x = "google" >>> x is "google" True >>> id(x) 32553984L >>> id("google") 32553984L >>> x = "google.com" >>> x is "google.com" False >>> id(x) 32649320L >>> id("google.com") 37787888L