node.example.com Is An IP Address
Hello! Welcome to the once-yearly blog post! This year I'd like to examine the most peculiar bug I encountered at work. To set the stage, let's start with a little background.
When we write URLs with a non-standard port we specify the port after a colon (:). With hostnames and IPv4 addresses this is straightforward. Here's some Python code to show how easy it is.
>>> import urllib.parse
>>> url = urllib.parse.urlparse("https://node.example.com:8000")
>>> (url.hostname, url.port)
('node.example.com', 8000)
>>>
>>> url = urllib.parse.urlparse("https://192.168.0.1:8000")
>>> (url.hostname, url.port)
('192.168.0.1', 8000)
Unfortunately, when IPv6 addresses are involved some ambiguity is introduced.
>>> url = urllib.parse.urlparse(
... "https://fdc8:bf8b:e62c:abcd:1111:2222:3333:4444:8000"
... )
...
>>> url.hostname
'fdc8'
>>> try:
...     url.port
... except ValueError as error:
...     print(error)
...
Port could not be cast to integer value as 'bf8b:e62c:abcd:1111:2222:3333:4444:8000'
Since IPv6 addresses use a "colon-hex" format with hexadecimal fields separated by colons, we can't tell a port apart from a normal field. Notice in the example above that the hostname is truncated after the first colon, not the one just before 8000.
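To convince myself the ambiguity is real and not just a quirk of urlparse, here's a small Python 3 sketch. The input string is my own illustrative example (a loopback address plus an intended port), not something from our codebase. It shows the same unbracketed text can be read two perfectly valid ways.

```python
from ipaddress import ip_address

# Illustrative input (an assumption for this sketch): an unbracketed
# IPv6 loopback address followed by what is meant to be a port.
netloc = "::1:8000"

# Reading 1: split on the last colon to get a host and a port.
host, _, port = netloc.rpartition(":")
print(host, port)

# Reading 2: the whole string is itself a valid IPv6 address,
# where 8000 is just the final hexadecimal field.
whole = ip_address(netloc)
print(whole.version)
```

Both readings succeed, so a parser has no principled way to pick one without extra syntax.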
Fortunately, the spec for URLs recognizes this ambiguity and gives us a way to handle it. RFC 2732 (Format for Literal IPv6 Addresses in URL's) says
To use a literal IPv6 address in a URL, the literal address should be enclosed in "[" and "]" characters.
Update our example above to include [ and ] and voilà! It just works.
>>> url = urllib.parse.urlparse(
... "https://[fdc8:bf8b:e62c:abcd:1111:2222:3333:4444]:8000"
... )
...
>>> (url.hostname, url.port)
('fdc8:bf8b:e62c:abcd:1111:2222:3333:4444', 8000)
Armed with that knowledge we can dive into the problem. 🤿
Works On My Machine
A few months ago a co-worker of mine wrote a seemingly innocuous function.
from ipaddress import ip_address


def safe_host(host):
    """Surround `host` with brackets if it is an IPv6 address."""
    try:
        if ip_address(host).version == 6:
            return "[{}]".format(host)
    except ValueError:
        pass
    return host
Elsewhere in the code it was invoked something like this, so that hostnames, IPv4 addresses, and IPv6 addresses could all be safely interpolated.
url = "https://{host}:8000/some/path/".format(host=safe_host(host))
Since my co-worker is awesome they wrote tests to validate their code. ✅
def test_safe_host_with_hostname():
    """Hostnames should be unchanged."""
    assert safe_host("node.example.com") == "node.example.com"


def test_safe_host_with_ipv4_address():
    """IPv4 addresses should be unchanged."""
    assert safe_host("192.168.0.1") == "192.168.0.1"


def test_safe_host_with_ipv6_address():
    """IPv6 addresses should be surrounded by brackets."""
    assert (
        safe_host("fdc8:bf8b:e62c:abcd:1111:2222:3333:4444")
        == "[fdc8:bf8b:e62c:abcd:1111:2222:3333:4444]"
    )
Thank goodness they did. The Python 2 tests failed (don't look at me like that 😒).
✖ FAIL py27 in 1.83 seconds
✔ OK py36 in 2.82 seconds
✔ OK py37 in 2.621 seconds
✔ OK py38 in 2.524 seconds
✔ OK py39 in 2.461 seconds
Both the hostname and IPv6 address tests failed. But why did they fail? And why did the Python 3 tests pass? 🤔
We'll start with the hostname failure and try to isolate the bug.
E AssertionError: assert '[node.example.com]' == 'node.example.com'
E - [node.example.com]
E ? - -
E + node.example.com
The failure says node.example.com was surrounded by brackets, but that's only supposed to happen for IPv6 addresses! Let's crack open a Python 2 interpreter for a quick sanity check.
>>> import ipaddress
>>> ipaddress.ip_address("node.example.com").version
6
What On Htrae?
If, like Jeff Bridges, you were confused by that result, relax. We're probably not in a Bizarro World where node.example.com is a valid IPv6 address. There must be an explanation for this behavior.
Things start to become a little more clear when we see the result of the ip_address() function for ourselves.
>>> ipaddress.ip_address("node.example.com")
IPv6Address(u'6e6f:6465:2e65:7861:6d70:6c65:2e63:6f6d')
At first glance that looks like madness. Python 3 behaves in an entirely different manner.
>>> try:
...     ipaddress.ip_address("node.example.com")
... except ValueError as error:
...     print(error)
...
'node.example.com' does not appear to be an IPv4 or IPv6 address
Python 3 knows that's not an IPv6 address, so why doesn't Python 2? The answer is in how differently the two Python versions handle text.
Text Is Hard
Computers don't operate on text as humans think of it. They operate on numbers. That's part of why we have IP addresses to begin with. In order to represent human-readable text with computers we had to assign meaning to the numbers. Thus, ASCII was born.
ASCII is a character encoding, which means it specifies how to interpret bytes as text we understand (provided you speak English). So, when your computer sees 01101110 in binary (110 in decimal) you see n because that's what ASCII says it is.
You can see the number to text conversion in action right in the Python interpreter.
>>> ord("n")
110
>>> chr(110)
'n'
In fact, it doesn't matter what numbering system you use. If you specify binary, octal, decimal, hexadecimal, whatever... If it can be understood as the right integer it will be displayed correctly.
>>> chr(0b01101110)
'n'
>>> chr(0o156)
'n'
>>> chr(110)
'n'
>>> chr(0x6e)
'n'
Neat, but what does that information do for us?
It's Numbers All The Way Down
Just for giggles, humor me and let's look at the character-number translations for node.example.com. We'll leave out binary and octal, because they make this table uglier than it already is.
Character | n | o | d | e | . | e | x | a | m | p | l | e | . | c | o | m |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Decimal | 110 | 111 | 100 | 101 | 46 | 101 | 120 | 97 | 109 | 112 | 108 | 101 | 46 | 99 | 111 | 109 |
Hexadecimal | 6e | 6f | 64 | 65 | 2e | 65 | 78 | 61 | 6d | 70 | 6c | 65 | 2e | 63 | 6f | 6d |
Hey, hold on a second... If you tilt your head sideways and squint that last row looks kinda like an IPv6 address, doesn't it?
We should verify, just to be absolutely certain. You've still got that Python 2 interpreter open, right?
>>> # Convert the characters in the hostname to hexadecimal.
>>> hostname = "node.example.com"
>>> hostname_as_hexadecimal = "".join(hex(ord(c))[2:] for c in hostname)
>>> hostname_as_hexadecimal
'6e6f64652e6578616d706c652e636f6d'
>>>
>>> # Convert the "IP address" to text.
>>> address = ipaddress.ip_address(hostname)
>>> str(address)
'6e6f:6465:2e65:7861:6d70:6c65:2e63:6f6d'
>>>
>>> # Remove the colons from that text.
>>> address_without_colons = str(address).replace(":", "")
>>> address_without_colons
'6e6f64652e6578616d706c652e636f6d'
>>>
>>> # Compare the results and see they're equal.
>>> hostname_as_hexadecimal == address_without_colons
True
Sure enough, when you boil them both down to numbers they're the same mess of hexadecimal.
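If you'd rather go the other direction on Python 3, here's a rough sketch that builds the colon-hex form straight from the hostname's bytes. One caveat about my snippet above: hex(ord(c))[2:] drops the leading zero for any byte below 0x10, which happens not to matter for this hostname. bytes.hex() (Python 3.5+) always emits two digits per byte, so I use it here. The variable names are just for illustration.

```python
hostname = "node.example.com"

# bytes.hex() always emits two hexadecimal digits per byte.
digits = hostname.encode("ascii").hex()

# An IPv6 address is written as eight groups of four hex digits.
groups = [digits[i:i + 4] for i in range(0, len(digits), 4)]
colon_hex = ":".join(groups)
print(colon_hex)
```

Sixteen ASCII characters make 32 hex digits, which is exactly eight groups of four.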
The Belly Of The Beast
If we dig into the source code for the Python 2 version of the ipaddress module we ultimately come to a curious set of lines.
# Constructing from a packed address
if isinstance(address, bytes):
    self._check_packed_address(address, 16)
    bvs = _compat_bytes_to_byte_vals(address)
    self._ip = _compat_int_from_byte_vals(bvs, 'big')
    return
It turns out that, under certain conditions, the ipaddress module can create IPv6 addresses from raw bytes. My assumption is that it offers this behavior as a convenient way to parse IP addresses from data fresh off the wire.
Does node.example.com meet those certain conditions? You bet it does. Because we're using Python 2 it's just bytes and it happens to be 16 characters long.
>>> isinstance("node.example.com", bytes)
True
>>> # `self._check_packed_address` basically just checks how long it is.
>>> len("node.example.com") == 16
True
The rest of the ipaddress lines say to interpret the sequence of bytes as a big-endian integer. That's magic best left for another blog post, but the gist is that the hexadecimal interpretation of node.example.com is condensed into a single, huge number.
>>> int("6e6f64652e6578616d706c652e636f6d", 16)
146793460745001871434687145741037825901L
That's an absolutely massive number, but not so massive it won't fit within the IPv6 address space.
>>> ipaddress.ip_address(146793460745001871434687145741037825901L)
IPv6Address(u'6e6f:6465:2e65:7861:6d70:6c65:2e63:6f6d')
As it turns out, if you're liberal in your interpretation, node.example.com can be an IPv6 address!
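You don't need a Python 2 interpreter to watch this happen. As far as I can tell, the standard library ipaddress module in Python 3 accepts packed bytes the same way, so handing it an explicit bytes object reproduces the result. This is a sketch for illustration, not the code from our project.

```python
import ipaddress

packed = b"node.example.com"  # 16 bytes, the size of a packed IPv6 address

# Big-endian: the first byte is the most significant one.
as_int = int.from_bytes(packed, "big")

from_bytes = ipaddress.ip_address(packed)
from_int = ipaddress.ip_address(as_int)
print(from_bytes)
print(from_bytes == from_int)

# Four bytes are read as a packed IPv4 address in the same way.
print(ipaddress.ip_address(b"node"))

# A str with the same characters is parsed as text and rejected.
try:
    ipaddress.ip_address("node.example.com")
except ValueError as error:
    print(error)
```

The only difference between the two Python versions is which type a plain string literal happens to be.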
You Will Be Reading Meanings
Obviously that's hogwash. Bizarro might be proud, but that's not what we wanted to happen.
There's a quote about numbers which is apocryphally attributed to W.E.B. Du Bois, but that actually comes from Harold Geneen's book, Managing.
When you have mastered the numbers, you will in fact no longer be reading numbers, any more than you read words when reading a book. You will be reading meanings.
Having not read the book I'm probably taking the quote way out of context, but I think it fits our situation well.
As we've seen above, we can freely convert characters to numbers and back again. The root of our problem is that when we use Python 2 it considers text to be bytes. There's not a deeper, inherent meaning. Maybe the bytes are meant to be ASCII, maybe they're meant to be a long number, maybe they're meant to be an IP address. The interpretation of those bytes is up to us.
Python 2 doesn't differentiate between bytes and text by default. In fact, the bytes type is just an alias for str.
>>> bytes
<type 'str'>
>>> bytes is str
True
To make that even more concrete, see how Python 2 considers n to be the same as this sequence of raw bytes.
>>> "n" == b"\x6e"
True
Our Python 2 code doesn't work the way we want it to because raw bytes can have arbitrary meaning and we haven't told it to use our intended meaning.
So now we know why Python 2 interprets node.example.com as an IPv6 address, but why does Python 3 behave differently? More importantly, how can we reconcile the two?
256 Characters Ought To Be Enough For Anybody
ASCII looked like a good idea in the 1960s. With decades of hindsight we know the 256 characters afforded to us by Extended ASCII are insufficient to handle all of the world's writing systems. Thus, Unicode was born.
There are scads of blog posts, Wikipedia articles, and technical documents that will do a better job than I can of explaining Unicode in detail. You should read them if you care to, but here's my gist.
Unicode is a standard that assigns a number (a code point) to every character, plus a handful of encodings that turn those code points into bytes. UTF-8 is the dominant encoding. UTF-8 is a superset of ASCII, so ASCII characters are still just one byte. To handle the multitude of other characters, however, multiple bytes can express a single character.
>>> "n".encode("utf-8").hex() # 1 character (U+006E), 1 byte.
'6e'
>>> "🤿".encode("utf-8").hex() # 1 character (U+1F93F), 4 bytes.
'f09fa4bf'
>>> "悟り".encode("utf-8").hex() # 2 characters (U+609F, U+308A), 6 bytes.
'e6829fe3828a'
Every programming language I know of that respects the difference between raw bytes and Unicode text maintains a strict separation between the two datatypes.
In Python 3 this strict separation is enabled by default. Notice that it doesn't consider n and this sequence of raw bytes to be the same thing.
>>> "n" == b"\x6e"
False
Even better, it doesn't consider str and bytes to be the same type.
>>> bytes is str
False
>>> bytes
<class 'bytes'>
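Here's a small Python 3 sketch of what that separation means in practice. Mixing the two types is an error, and crossing from one to the other requires naming an encoding. The variable names are mine.

```python
raw = b"\x6e"

# Mixing text and bytes is an error rather than a silent coercion.
try:
    "n" + raw
except TypeError as error:
    print(error)

# Crossing the boundary means naming an encoding explicitly.
text = raw.decode("ascii")
print(text == "n")
print(text.encode("ascii") == raw)
```

That explicit decode step is the moment where we tell Python which meaning the bytes are supposed to have.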
If we can get Python 2 to understand Unicode like Python 3 does, then we can probably fix our bug.
As an aside, if you want to learn more about how to handle Unicode in Python, check out Ned Batchelder's talk on Pragmatic Unicode.
How Did We Fix It?
Python 2 does actually know about Unicode, but it considers Unicode text to be separate from "normal" text. At some point in Python 2 history the unicode type was bolted onto the side of the language and not enabled by default. Hard to get excited about it, but it does the trick.
At least they knew it's a pain to type unicode() all the time, so there's a handy literal syntax using a u prefix.
>>> unicode("node.example.com") == u"node.example.com"
True
This is not the best fix, but it did in a pinch. We added a line converting the hostname to Unicode right off the bat. We also applied the same transformation to the line with brackets. This way we always process the hostname as Unicode and we always return a Unicode value.
 def safe_host(host):
     """Surround `host` with brackets if it is an IPv6 address."""
+    host = u"{}".format(host)
     try:
         if ip_address(host).version == 6:
-            return "[{}]".format(host)
+            return u"[{}]".format(host)
     except ValueError:
         pass
Luckily for us the u prefix also works in Python 3 whereas unicode() does not (because all text is Unicode by default, so the type has no business existing). In Python 3 the u is treated as a no-op.
The Python 2 interpreter graciously understands the unicode type is not just raw bytes.
>>> isinstance(u"node.example.com", bytes)
False
When we use the unicode type the ipaddress module no longer tries to interpret node.example.com as bytes and convert those bytes to an IP address. We get just what we expect
>>> try:
...     ipaddress.ip_address(u"node.example.com")
... except ValueError as error:
...     print(error)
...
u'node.example.com' does not appear to be an IPv4 or IPv6 address
and our tests pass!
✔ OK py27 in 1.728 seconds
✔ OK py36 in 2.775 seconds
✔ OK py37 in 2.717 seconds
✔ OK py38 in 2.674 seconds
✔ OK py39 in 2.506 seconds
Reflection
I mentioned above that our fix wasn't the best. Given more time, how can we do better?
The first (and best) solution here is to drop Python 2 support. It's 2020 now and Python 2 is officially no longer supported. The original code worked on Python 3. The best long-term decision is to migrate the code to run on Python 3 only and avoid the hassle of Python 2 maintenance. Unfortunately many of the people running this code still depend on it working on Python 2, so we'll have to make that transition gracefully.
If a migration away from Python 2 isn't possible in the near-term, the next best thing to do is update our code so that it uses a compatibility layer like future or six. Those libraries are designed to modernize Python 2 and help smooth over issues like this one.
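To give a feel for what such a layer buys us, here's a minimal sketch of a hand-rolled text helper. The name ensure_text and its behavior are modeled on six.ensure_text, but this version is my own stand-in so the example has no dependencies. It is not what we shipped.

```python
import sys
from ipaddress import ip_address

# Pick the "real text" type for whichever interpreter is running.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)


def ensure_text(value, encoding="utf-8"):
    """Hypothetical helper: return `value` as text, decoding raw bytes."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)
    raise TypeError("expected text or bytes, got {!r}".format(type(value)))


def safe_host(host):
    """Surround `host` with brackets if it is an IPv6 address."""
    host = ensure_text(host)
    try:
        if ip_address(host).version == 6:
            return u"[{}]".format(host)
    except ValueError:
        pass
    return host
```

Unlike formatting with u"{}", this decodes bytes explicitly, so a bytes hostname on Python 3 comes out as plain text instead of its repr.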
It also wouldn't hurt for us to take a page from Alexis King's Parse, don't validate school of thought. When the hostname enters our program via user input it should immediately be converted to the unicode type (or maybe even an IP address type) so we don't end up solving this problem in several different places throughout the code.
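As a rough Python 3 sketch of that idea, we could parse the input once at the edge of the program and let the resulting type decide how it gets formatted. The function names parse_host and format_host are hypothetical, and I'm assuming raw input arrives as either str or UTF-8 bytes.

```python
from ipaddress import IPv6Address, ip_address


def parse_host(raw):
    """Parse user input once into an address object or a hostname str."""
    # Assumption for this sketch: bytes input is UTF-8 encoded text.
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    try:
        return ip_address(text)
    except ValueError:
        return text


def format_host(host):
    """Render a parsed host for use in a URL."""
    if isinstance(host, IPv6Address):
        return "[{}]".format(host)
    return str(host)


for raw in ("node.example.com", b"192.168.0.1", "::1"):
    print(format_host(parse_host(raw)))
```

One thing to be aware of: str() on an address object prints its normalized form, so an IPv6 address typed with redundant zeros may come back compressed.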
Finally, though our program doesn't currently handle any hostnames in languages other than English, it's probably best to be thinking in Unicode anyway. Again, it's 2020 and internationalized domain names like https://Яндекс.рф are a thing.
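For the curious, here's a tiny sketch of what that looks like using Python's built-in idna codec, which to my knowledge implements the older IDNA 2003 rules. The third-party idna package is the usual recommendation for anything serious, and I haven't used either in our code.

```python
hostname = u"Яндекс.рф"

# Convert the Unicode hostname to its ASCII-compatible form.
ascii_form = hostname.encode("idna")
print(ascii_form)

# Decode it back to Unicode text.
round_trip = ascii_form.decode("idna")
print(round_trip)
```

The ASCII form is what actually goes over the wire in DNS, which is one more reason to keep text and bytes clearly separated.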
If you made it this far, thanks for reading. It was fun to turn a brief debugging session with my co-worker into a treatise on the perils of Python 2 and the value of Unicode. See you next year! 😂