node.example.com Is An IP Address

Hello! Welcome to the once-yearly blog post! This year I'd like to examine the most peculiar bug I encountered at work. To set the stage, let's start with a little background.

When we write URLs with a non-standard port we specify the port after a :. With hostnames and IPv4 addresses this is straightforward. Here's some Python code to show how easy it is.

>>> import urllib.parse
>>> url = urllib.parse.urlparse("https://node.example.com:8000")
>>> (url.hostname, url.port)
('node.example.com', 8000)
>>>
>>> url = urllib.parse.urlparse("https://192.168.0.1:8000")
>>> (url.hostname, url.port)
('192.168.0.1', 8000)

Unfortunately, when IPv6 addresses are involved some ambiguity is introduced.

>>> url = urllib.parse.urlparse(
...     "https://fdc8:bf8b:e62c:abcd:1111:2222:3333:4444:8000"
... )
>>> url.hostname
'fdc8'
>>> try:
...     url.port
... except ValueError as error:
...     print(error)
...
Port could not be cast to integer value as 'bf8b:e62c:abcd:1111:2222:3333:4444:8000'

Since IPv6 addresses use a "colon-hex" format with hexadecimal fields separated by : we can't tell a port apart from a normal field. Notice in the example above that the hostname is truncated after the first :, not the one just before 8000.

Fortunately, the spec for URLs recognizes this ambiguity and gives us a way to handle it. RFC 2732 (Format for Literal IPv6 Addresses in URL's), which has since been obsoleted by RFC 3986, says

To use a literal IPv6 address in a URL, the literal address should be enclosed in "[" and "]" characters.

Update our example above to include [ and ] and voilà! It just works.

>>> url = urllib.parse.urlparse(
...     "https://[fdc8:bf8b:e62c:abcd:1111:2222:3333:4444]:8000"
... )
>>> (url.hostname, url.port)
('fdc8:bf8b:e62c:abcd:1111:2222:3333:4444', 8000)

Armed with that knowledge we can dive into the problem. 🤿

Works On My Machine

A few months ago a co-worker of mine wrote a seemingly innocuous function.

from ipaddress import ip_address


def safe_host(host):
    """Surround `host` with brackets if it is an IPv6 address."""
    try:
        if ip_address(host).version == 6:
            return "[{}]".format(host)
    except ValueError:
        pass
    return host

Elsewhere in the code it was invoked something like this, so that hostnames, IPv4 addresses, and IPv6 addresses could all be safely interpolated.

url = "https://{host}:8000/some/path/".format(host=safe_host(host))

Since my co-worker is awesome they wrote tests to validate their code. ✅

def test_safe_host_with_hostname():
    """Hostnames should be unchanged."""
    assert safe_host("node.example.com") == "node.example.com"


def test_safe_host_with_ipv4_address():
    """IPv4 addresses should be unchanged."""
    assert safe_host("192.168.0.1") == "192.168.0.1"


def test_safe_host_with_ipv6_address():
    """IPv6 addresses should be surrounded by brackets."""
    assert (
        safe_host("fdc8:bf8b:e62c:abcd:1111:2222:3333:4444")
        == "[fdc8:bf8b:e62c:abcd:1111:2222:3333:4444]"
    )

Thank goodness they did. The Python 2 tests failed (don't look at me like that 😒).

FAIL py27 in 1.83 seconds
OK py36 in 2.82 seconds
OK py37 in 2.621 seconds
OK py38 in 2.524 seconds
OK py39 in 2.461 seconds

Both the hostname and IPv6 address tests failed. But why did they fail? And why did the Python 3 tests pass? 🤔

We'll start with the hostname failure and try to isolate the bug.

E       AssertionError: assert '[node.example.com]' == 'node.example.com'
E         - [node.example.com]
E         ? -                -
E         + node.example.com

The failure says node.example.com was surrounded by brackets, but that's only supposed to happen for IPv6 addresses! Let's crack open a Python 2 interpreter for a quick sanity check.

>>> import ipaddress
>>> ipaddress.ip_address("node.example.com").version
6

[Image: Confused Jeff Bridges]

What On Htrae?

If, like Jeff Bridges, you were confused by that result, relax. We're probably not in a Bizarro World where node.example.com is a valid IPv6 address. There must be an explanation for this behavior.

Things start to become a little more clear when we see the result of the ip_address() function for ourselves.

>>> ipaddress.ip_address("node.example.com")
IPv6Address(u'6e6f:6465:2e65:7861:6d70:6c65:2e63:6f6d')

At first glance that looks like madness. Python 3 behaves in an entirely different manner.

>>> try:
...     ipaddress.ip_address("node.example.com")
... except ValueError as error:
...     print(error)
... 
'node.example.com' does not appear to be an IPv4 or IPv6 address

Python 3 knows that's not an IPv6 address, so why doesn't Python 2? The answer is in how differently the two Python versions handle text.

Text Is Hard

Computers don't operate on text as humans think of it. They operate on numbers. That's part of why we have IP addresses to begin with. In order to represent human-readable text with computers we had to assign meaning to the numbers. Thus, ASCII was born.

ASCII is a character encoding, which means it specifies how to interpret bytes as text we understand (provided you speak English). So, when your computer sees 01101110 in binary (110 in decimal) you see n because that's what ASCII says it is.

You can see the number to text conversion in action right in the Python interpreter.

>>> ord("n")
110
>>> chr(110)
'n'

In fact, it doesn't matter which numeral system you write the number in. Binary, octal, decimal, hexadecimal, whatever... As long as the literal evaluates to the right integer, you get the same character.

>>> chr(0b01101110)
'n'
>>> chr(0o156)
'n'
>>> chr(110)
'n'
>>> chr(0x6e)
'n'

Neat, but what does that information do for us?

It's Numbers All The Way Down

Just for giggles, humor me and let's look at the character-number translations for node.example.com. We'll leave out binary and octal, because they make this table uglier than it already is.

Character    n   o   d   e   .   e   x   a   m   p   l   e   .   c   o   m
Decimal      110 111 100 101 46  101 120 97  109 112 108 101 46  99  111 109
Hexadecimal  6e  6f  64  65  2e  65  78  61  6d  70  6c  65  2e  63  6f  6d

Hey, hold on a second... If you tilt your head sideways and squint that last row looks kinda like an IPv6 address, doesn't it?

We should verify, just to be absolutely certain. You've still got that Python 2 interpreter open, right?

>>> # Convert the characters in the hostname to hexadecimal.
>>> hostname = "node.example.com"
>>> hostname_as_hexadecimal = "".join(hex(ord(c))[2:] for c in hostname)
>>> hostname_as_hexadecimal
'6e6f64652e6578616d706c652e636f6d'
>>>
>>> # Convert the "IP address" to text.
>>> address = ipaddress.ip_address(hostname)
>>> str(address)
'6e6f:6465:2e65:7861:6d70:6c65:2e63:6f6d'
>>>
>>> # Remove the colons from that text.
>>> address_without_colons = str(address).replace(":", "")
>>> address_without_colons
'6e6f64652e6578616d706c652e636f6d'
>>>
>>> # Compare the results and see they're equal.
>>> hostname_as_hexadecimal == address_without_colons
True

Sure enough, when you boil them both down to numbers they're the same mess of hexadecimal.
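
The "squint and it's an IPv6 address" step can also be done mechanically. This is a minimal Python 3 sketch (not part of the original debugging session) that regroups the hostname's bytes into colon-hex form by hand; the variable names are mine.

```python
# Python 3 sketch: regroup a hostname's bytes into IPv6-style text.
hostname = "node.example.com"

# Two hexadecimal digits per ASCII byte, so 16 bytes give 32 digits.
hex_digits = hostname.encode("ascii").hex()

# An IPv6 address is written as eight groups of four hexadecimal digits.
groups = [hex_digits[i:i + 4] for i in range(0, len(hex_digits), 4)]
colon_hex = ":".join(groups)

print(colon_hex)  # 6e6f:6465:2e65:7861:6d70:6c65:2e63:6f6d
```

Sixteen bytes is exactly the 128 bits an IPv6 address needs, which is why the groups come out even.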

The Belly Of The Beast

If we dig into the source code for the Python 2 version of the ipaddress module we ultimately come to a curious set of lines.

# Constructing from a packed address
if isinstance(address, bytes):
    self._check_packed_address(address, 16)
    bvs = _compat_bytes_to_byte_vals(address)
    self._ip = _compat_int_from_byte_vals(bvs, 'big')
    return

It turns out that, under certain conditions, the ipaddress module can create IPv6 addresses from raw bytes. My assumption is that it offers this behavior as a convenient way to parse IP addresses from data fresh off the wire.

Does node.example.com meet those certain conditions? You bet it does. Because we're using Python 2 it's just bytes and it happens to be 16 characters long.

>>> isinstance("node.example.com", bytes)
True
>>> # `self._check_packed_address` basically just checks how long it is.
>>> len("node.example.com") == 16
True
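
Python 3's standard library ipaddress module accepts packed bytes too, so the length condition can be explored there by passing bytes on purpose. The sketch below assumes Python 3; packed_guess is a hypothetical helper written only for this illustration.

```python
import ipaddress


def packed_guess(raw):
    """Hypothetical helper: show how ip_address() reads raw bytes."""
    try:
        return str(ipaddress.ip_address(raw))
    except ValueError:
        return "rejected"


# 16 bytes: accepted as a packed IPv6 address.
print(packed_guess(b"node.example.com"))
# 15 bytes: neither 4 nor 16 bytes long, so it is rejected.
print(packed_guess(b"box.example.com"))
# 4 bytes: accepted as a packed IPv4 address.
print(packed_guess(b"mail"))
```

Only the length matters here. Any 16-byte value is read as an IPv6 address and any 4-byte value as an IPv4 address, whatever the bytes were meant to say.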

The rest of the ipaddress lines say to interpret the sequence of bytes as a big-endian integer. That's magic best left for another blog post, but the gist is that the hexadecimal interpretation of node.example.com is condensed into a single, huge number.

>>> int("6e6f64652e6578616d706c652e636f6d", 16)
146793460745001871434687145741037825901L

That's an absolutely massive number, but not so massive it won't fit within the IPv6 address space.

>>> ipaddress.ip_address(146793460745001871434687145741037825901L)
IPv6Address(u'6e6f:6465:2e65:7861:6d70:6c65:2e63:6f6d')
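
For comparison, here is a rough Python 3 sketch of the same bytes-to-integer-to-address chain. It uses int.from_bytes in place of the module's internal compatibility helpers, so treat it as an illustration of the idea rather than the module's actual code path.

```python
import ipaddress

packed = b"node.example.com"

# Read the 16 bytes as one big-endian integer.
as_integer = int.from_bytes(packed, "big")

# Same value as parsing the hexadecimal digits directly.
assert as_integer == int(packed.hex(), 16)

# Integers too large for IPv4 are treated as IPv6 addresses.
address = ipaddress.ip_address(as_integer)

# Passing the bytes directly produces the same address, and the
# address packs back down to the original bytes.
assert address == ipaddress.ip_address(packed)
assert address.packed == packed

print(address)  # 6e6f:6465:2e65:7861:6d70:6c65:2e63:6f6d
```

Python 3 would make the same "mistake" if we handed it bytes. The difference is that a Python 3 string literal is text, so we never hand it bytes by accident.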

As it turns out, if you're liberal in your interpretation, node.example.com can be an IPv6 address!

You Will Be Reading Meanings

Obviously that's hogwash. Bizarro might be proud, but that's not what we wanted to happen.

There's a quote about numbers that is often misattributed to W.E.B. Du Bois but actually comes from Harold Geneen's book, Managing.

When you have mastered the numbers, you will in fact no longer be reading numbers, any more than you read words when reading a book. You will be reading meanings.

Having not read the book I'm probably taking the quote way out of context, but I think it fits our situation well.

As we've seen above, we can freely convert characters to numbers and back again. The root of our problem is that when we use Python 2 it considers text to be bytes. There's not a deeper, inherent meaning. Maybe the bytes are meant to be ASCII, maybe they're meant to be a long number, maybe they're meant to be an IP address. The interpretation of those bytes is up to us.

Python 2 doesn't differentiate between bytes and text by default. In fact, the bytes type is just an alias for str.

>>> bytes
<type 'str'>
>>> bytes is str
True

To make that even more concrete, see how Python 2 considers n to be the same as this sequence of raw bytes.

>>> "n" == b"\x6e"
True

Our Python 2 code doesn't work the way we want it to because raw bytes can have arbitrary meaning and we haven't told it to use our intended meaning.

So now we know why Python 2 interprets node.example.com as an IPv6 address, but why does Python 3 behave differently? More importantly, how can we reconcile the two?

256 Characters Ought To Be Enough For Anybody

ASCII looked like a good idea in the 1960s. With decades of hindsight we know that neither ASCII's 128 characters nor the 256 afforded to us by Extended ASCII are sufficient to handle all of the world's writing systems. Thus, Unicode was born.

There are scads of blog posts, Wikipedia articles, and technical documents that will do a better job than I can of explaining Unicode in detail. You should read them if you care to, but here's my gist.

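Unicode is a standard that assigns every character a unique number called a code point, and it defines several encodings for turning those code points into bytes. UTF-8 is the dominant encoding. It is backward compatible with ASCII, so ASCII characters are still just one byte. To handle the multitude of other characters, however, multiple bytes can express a single character.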

>>> "n".encode("utf-8").hex()  # 1 character (U+006E), 1 byte.
'6e'
>>> "🤿".encode("utf-8").hex()  # 1 character (U+1F93F), 4 bytes.
'f09fa4bf'
>>> "悟り".encode("utf-8").hex()  # 2 characters (U+609F, U+308A), 6 bytes.
'e6829fe3828a'

Every programming language I know of that respects the difference between raw bytes and Unicode text maintains a strict separation between the two datatypes.

In Python 3 this strict separation is enabled by default. Notice that it doesn't consider n and this sequence of raw bytes to be the same thing.

>>> "n" == b"\x6e"
False

Even better, it doesn't consider str and bytes to be the same type.

>>> bytes is str
False
>>> bytes
<class 'bytes'>

If we can get Python 2 to understand Unicode like Python 3 does, then we can probably fix our bug.
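
In Python 3, crossing between the two types is always an explicit step. A minimal sketch, assuming the bytes are ASCII (the variable names are only for illustration):

```python
raw = b"\x6e"

# Bytes become text only when we name the encoding to interpret them with.
text = raw.decode("ascii")

# Text becomes bytes again only when we name the encoding to write with.
round_trip = text.encode("ascii")

# Mixing the two types without converting is an error, not a silent guess.
concatenation_allowed = True
try:
    "host: " + raw
except TypeError:
    concatenation_allowed = False

print(text, round_trip, concatenation_allowed)
```

That explicit decode step is where we tell Python which meaning the bytes are supposed to have.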

As an aside, if you want to learn more about how to handle Unicode in Python, check out Ned Batchelder's talk on Pragmatic Unicode.

How Did We Fix It?

Python 2 does actually know about Unicode, but it considers Unicode text to be separate from "normal" text. At some point in Python 2 history the unicode type was bolted onto the side of the language and not enabled by default. Hard to get excited about it, but it does the trick. At least they knew it's a pain to type unicode() all the time, so there's a handy literal syntax using a u prefix.

>>> unicode("node.example.com") == u"node.example.com"
True

This is not the best fix, but it did in a pinch. We added a line converting the hostname to Unicode right off the bat. We also applied the same transformation to the line with brackets. This way we always process the hostname as Unicode and we always return a Unicode value.

 def safe_host(host):
     """Surround `host` with brackets if it is an IPv6 address."""
+    host = u"{}".format(host)
     try:
         if ip_address(host).version == 6:
-            return "[{}]".format(host)
+            return u"[{}]".format(host)
     except ValueError:
         pass

Luckily for us the u prefix also works in Python 3 (version 3.3 and later) whereas unicode() does not (because all text is Unicode by default, so the type has no business existing). In Python 3 the u is treated as a no-op.

The Python 2 interpreter graciously understands the unicode type is not just raw bytes.

>>> isinstance(u"node.example.com", bytes)
False

When we use the unicode type the ipaddress module no longer tries to interpret node.example.com as bytes and convert those bytes to an IP address. We get just what we expect

>>> try:
...     ipaddress.ip_address(u"node.example.com")
... except ValueError as error:
...     print(error)
... 
u'node.example.com' does not appear to be an IPv4 or IPv6 address

and our tests pass!

OK py27 in 1.728 seconds
OK py36 in 2.775 seconds
OK py37 in 2.717 seconds
OK py38 in 2.674 seconds
OK py39 in 2.506 seconds

Reflection

I mentioned above that our fix wasn't the best. Given more time, how can we do better?

The first (and best) solution here is to drop Python 2 support. It's 2020 now and Python 2 is officially no longer supported. The original code worked on Python 3. The best long-term decision is to migrate the code to run on Python 3 only and avoid the hassle of Python 2 maintenance. Unfortunately many of the people running this code still depend on it working on Python 2, so we'll have to make that transition gracefully.

If a migration away from Python 2 isn't possible in the near-term, the next best thing to do is update our code so that it uses a compatibility layer like future or six. Those libraries are designed to modernize Python 2 and help smooth over issues like this one.

It also wouldn't hurt for us to take a page from Alexis King's Parse, don't validate school of thought. When the hostname enters our program via user input it should immediately be converted to the unicode type (or maybe even an IP address type) so we don't end up solving this problem in several different places throughout the code.
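
Here is one way that idea might look, sketched for Python 3. Both parse_host and format_host are hypothetical names, not functions from our codebase, and the sketch assumes any bytes we receive are ASCII.

```python
import ipaddress


def parse_host(raw):
    """Hypothetical boundary parser: return an IP address object or text."""
    if isinstance(raw, bytes):
        # Assumption: bytes from the outside world are ASCII hostnames.
        raw = raw.decode("ascii")
    try:
        return ipaddress.ip_address(raw)
    except ValueError:
        return raw


def format_host(host):
    """Bracket IPv6 addresses so they are safe to put in front of a port."""
    if isinstance(host, ipaddress.IPv6Address):
        return "[{}]".format(host)
    return str(host)


for raw in (b"node.example.com", "192.168.0.1", "fdc8:bf8b:e62c:abcd:1111:2222:3333:4444"):
    print("https://{}:8000/some/path/".format(format_host(parse_host(raw))))
```

Once the value has been parsed the type carries the meaning, so nothing downstream has to guess whether it is looking at a hostname or an address. One caveat: converting an IPv6Address back to text yields its compressed form, which may differ from what the user typed.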

Finally, though our program doesn't currently handle any hostnames in languages other than English, it's probably best to be thinking in Unicode anyway. Again, it's 2020 and internationalized domain names like https://Яндекс.рф are a thing.
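
As a small taste of what that involves, Python 3 ships an idna codec that converts such a name to the ASCII form used on the wire. This is only a sketch: the built-in codec implements the older IDNA 2003 rules, so a real project might prefer a dedicated library.

```python
hostname = "Яндекс.рф"

# The ASCII-compatible form, with each non-ASCII label prefixed "xn--".
ascii_form = hostname.encode("idna")

# Decoding returns Unicode text; the codec lowercases labels on encoding.
restored = ascii_form.decode("idna")

print(ascii_form, restored)
```

Note that the ASCII form is bytes and the restored name is text, which is exactly the distinction this whole post has been about.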

If you made it this far, thanks for reading. It was fun to turn a brief debugging session with my co-worker into a treatise on the perils of Python 2 and the value of Unicode. See you next year! 😂