Three Packets Walk Into a Tunnel

A few weeks ago I stumbled onto an article about traceroute entitled “Traceroute Isn’t Real”, which was reasonably entertaining while also managing to be incorrect or misleading in many places. I assume the title is a deliberate allusion to “Birds Aren’t Real”, a well-known satirical conspiracy theory, so perhaps the traceroute article should also be read as satire. You don’t need me to tell you everything that’s wrong with the article because that task has been taken on by the tireless contributors of Hacker News, who have (on this occasion) done a pretty good job of criticism.

One line that jumped out at me was the claim that “It is completely impossible for [MPLS] to satisfy the expectations of traceroute”. Not only do I know this claim to be false, but I have a vivid memory of how my colleagues and I at Cisco came to make MPLS support traceroute when we were designing the Tag Switching header in 1996. (MPLS is the IETF standard that followed fairly directly from the design of Tag Switching, and the headers are nearly identical.) This was a heated debate, which is why I remember it so well today. It was a classic “design by committee” situation and we know how those things generally turn out (48-byte cell payloads, anyone?), although I think this one was better than most in the end. So let’s wind our time machine back to 1996 and I will reconstruct the process that led to the MPLS header being what it is today, complete with its configurable support of traceroute.

Designing Labels at a Router Company

I joined Cisco in 1995 to be part of the team that was tasked with figuring out how the new and exciting (at the time) technology of ATM could be “integrated” into the IP-centric product line of Cisco. There were plenty of ideas already floating around, with IP-over-ATM standards developing at the IETF and the ATM Forum. By early 1996 there were half a dozen engineers at Cisco sharing ideas on what this “integration” might look like when Yakov Rekhter sent around a 2-page document outlining the basic ideas of Tag Switching. When I read it, the idea seemed like a qualitative improvement on everything else I had seen or discussed, and my colleagues agreed. We fairly quickly lined up executive support to flesh out those two pages into an architecture and proceed to implement it on the Cisco product line of both routers and ATM switches. We started working through the details that would need to be nailed down before any sort of implementation could start. One essential detail was the packet header format for tag switched packets.

It’s important at this point to acknowledge some of the related ideas that were around at the time. After Yakov’s 2-pager had won the support of our design team, but before we had said much about it in public, a startup called Ipsilon came out of stealth mode with a flurry of announcements. They had also figured out a way to combine IP routing with ATM switching (cleverly calling their approach IP Switching). Their design was quite different from ours, but they made a splash with it, including the then-novel idea of publishing several informational RFCs to describe the protocols that made their system work. It’s fair to say that the executive support for Tag Switching was much easier to obtain thanks to the amount of buzz around Ipsilon.

We later realized that the central idea of Tag Switching, which was to associate fixed-length labels with variable-length IP prefixes from the routing table, had been invented and published by Girish Chandranmenon and George Varghese at SIGCOMM 1995. They called it “threaded indices”. That paper definitely pre-dated Yakov’s 2-pager, so I think they can be considered the true inventors of this core aspect of Tag Switching and MPLS.

But neither Yakov’s paper nor the 1995 SIGCOMM paper addressed the issue of how you encode a fixed-length label in an IP packet. Ipsilon’s approach relied on the ATM cell header to carry fixed-length labels, which was a fine idea if you were happy to send all your traffic around in cells with 48-byte payloads, but that was not what most of our customers wanted. Of course, there was nothing like a single customer viewpoint, but we had a big base of Internet service provider (ISP) customers who bought the fastest routers they could get their hands on in 1996 and they had opinions. Many of them hated ATM with a passion (this was the height of the nethead vs. bellhead wars), and one reason for that was the “cell tax”. ATM imposed a constant overhead (tax) of 5 header bytes for every 48 bytes of payload (over 10%), and this was the best case. A 20-byte IP header, by contrast, could be amortized over 1500-byte or longer packets (less than 2%). Even with average packet sizes around 300 bytes (as they were at that time) IP came out a fair bit more efficient. And the ATM cell tax was in addition to the IP header overhead. ISPs paid a lot for their high-speed links and most were keen to use them efficiently.
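To put rough numbers on the cell tax, here is a back-of-the-envelope sketch in Python. The function names are mine, chosen for illustration. The last function shows why 5-in-48 is only the best case: a packet rarely fills its final cell. It deliberately ignores any adaptation-layer trailer, so real overheads would be somewhat higher than what it reports.

```python
import math

ATM_HEADER = 5     # bytes of header per cell
ATM_PAYLOAD = 48   # bytes of payload per cell
IP_HEADER = 20     # bytes, assuming an IPv4 header with no options


def tax(header_bytes, amortized_over):
    """Header bytes as a fraction of the bytes they are amortized over."""
    return header_bytes / amortized_over


def cells_needed(packet_bytes):
    """Number of ATM cells required to carry one packet (simplified)."""
    return math.ceil(packet_bytes / ATM_PAYLOAD)


def atm_wire_bytes(packet_bytes):
    """Bytes on the wire once the packet is chopped into whole cells.

    Simplification: no adaptation-layer trailer is counted, only cell
    headers plus the padding in the last, partially filled cell.
    """
    return cells_needed(packet_bytes) * (ATM_HEADER + ATM_PAYLOAD)


if __name__ == "__main__":
    print(f"ATM cell tax:          {tax(ATM_HEADER, ATM_PAYLOAD):.1%}")
    print(f"IP header, 1500 bytes: {tax(IP_HEADER, 1500):.1%}")
    print(f"IP header, 300 bytes:  {tax(IP_HEADER, 300):.1%}")
    extra = atm_wire_bytes(300) - 300
    print(f"300-byte packet over ATM: {cells_needed(300)} cells, "
          f"{extra} extra bytes ({extra / 300:.1%})")
```

The padding term is what makes small and medium packets so expensive over ATM: the tax is paid per cell, not per packet.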

So a problem we faced with Tag Switching/MPLS was that we were about to introduce a “label tax” by putting an additional header on top of the IP header to carry our fixed-length labels. There was an incentive to keep that header as small as possible; for some members of our design committee, that was the most important consideration. But we needed to fit quite a few things aside from a label into the header. Labels were intended to simplify packet forwarding, so you couldn’t (normally) ask a router to look beyond the label header. Hence, any field that influenced forwarding had to be in the label header. One such field was a “class of service” modeled on the “type of service” (ToS) found in the IP header. ToS usage was not standardized at this point, but it was used for things like marking routing protocol packets for priority handling on arrival at an overloaded router. (These bits would get thoroughly redefined in the later work on Differentiated Services.) The obvious choice would have been to include a full byte of ToS in the label header. But the pressure to minimize the header, along with the lack of widespread usage of ToS, led to us compromising on 3 bits, initially called “Class of Service” and later renamed to “Experimental” in RFC 3032. This was in recognition of the fact that any attempt to offer different classes of service to IP traffic was decidedly an experiment in 1996. This decision would prove rather painful when the Diff-Serv standards emerged (using 6 bits of the ToS byte) and we tried to map them onto MPLS. (As an aside, I think my work at the intersection of MPLS and Diff-Serv was probably my most productive contribution to the IETF.)

The other field that we quickly decided was essential for the tag header was time-to-live (TTL). It is the nature of distributed routing algorithms that transient loops can happen, and packets stuck in loops consume forwarding resources–potentially even interfering with the updates that will resolve the loop. Since labelled packets (usually) follow the path established by IP routing, a TTL was non-negotiable. I think we might have briefly considered something less than 8 bits for TTL–who really needs to count up to 255 hops?–but that idea was discarded. 
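For readers who like to see the bits, here is a small Python sketch of how these fields pack into a single 32-bit label stack entry. The layout follows RFC 3032: a 20-bit label, the 3-bit class of service (Experimental) field, one bit marking the bottom of a label stack, and the 8-bit TTL. The function names are mine, and this is an illustration of the encoding rather than anything resembling a forwarding implementation.

```python
import struct


def pack_entry(label, exp, bottom_of_stack, ttl):
    """Encode one label stack entry as 4 bytes in network byte order."""
    if not 0 <= label < 2 ** 20:
        raise ValueError("label must fit in 20 bits")
    if not 0 <= exp < 2 ** 3:
        raise ValueError("exp must fit in 3 bits")
    if not 0 <= ttl < 2 ** 8:
        raise ValueError("ttl must fit in 8 bits")
    word = (label << 12) | (exp << 9) | (int(bool(bottom_of_stack)) << 8) | ttl
    return struct.pack("!I", word)


def unpack_entry(data):
    """Decode 4 bytes into (label, exp, bottom_of_stack, ttl)."""
    (word,) = struct.unpack("!I", data)
    label = word >> 12
    exp = (word >> 9) & 0x7
    bottom_of_stack = bool((word >> 8) & 0x1)
    ttl = word & 0xFF
    return label, exp, bottom_of_stack, ttl


if __name__ == "__main__":
    entry = pack_entry(label=16, exp=5, bottom_of_stack=True, ttl=64)
    print(entry.hex())
    print(unpack_entry(entry))
```

With only 3 bits available, the class of service field can distinguish 8 values, which is exactly why fitting a 6-bit Diff-Serv code point into it later took some work.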

Which brings us to traceroute. Unlike the presumed reader of “Traceroute Isn’t Real”, we knew how traceroute worked, and we considered it an important tool for debugging. There is a very easy way to make traceroute operate over any sort of tunnel, since traceroute depends on packets with short TTLs getting dropped due to TTL expiry. You copy the IP TTL into the label header as the packet enters the tunnel (when the label header is added); decrement the TTL in the outer label header at every hop; and then copy the outer TTL back to the inner header (IP TTL) when exiting the tunnel. This means that the TTL does exactly what it would have done if there were no tunnel, and if it was going to expire mid-tunnel, that is what happens. There is the small matter of what to do with your “ICMP time exceeded” message in the middle of a tunnel, which RFC 3032 explains in detail. In other words, MPLS doesn’t prevent traceroute from working. Interestingly, the earlier tunneling protocol GRE allows the same treatment as MPLS but doesn’t require it (i.e., GRE can break traceroute, or not).
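The copy-in, decrement, copy-out procedure is simple enough to simulate. The Python sketch below is a toy model with made-up router names and roles: it ignores label stacks, penultimate hop popping, and how the ICMP message finds its way back, and only tracks where a probe with a given initial TTL expires.

```python
def first_expiry(ip_ttl, path):
    """Return the name of the router where the probe expires, or None.

    `path` is a list of (name, role) pairs; role is "ip", "ingress",
    "lsr", or "egress". These role names are illustrative only.
    """
    label_ttl = None  # None while the packet carries no label header
    for name, role in path:
        if label_ttl is None:
            # Plain IP forwarding: decrement the IP TTL.
            ip_ttl -= 1
            if ip_ttl <= 0:
                return name
            if role == "ingress":
                # Entering the tunnel: copy the IP TTL into the label header.
                label_ttl = ip_ttl
        else:
            # Label switching: only the outer (label) TTL is touched.
            label_ttl -= 1
            if label_ttl <= 0:
                return name
            if role == "egress":
                # Leaving the tunnel: copy the label TTL back to the IP header.
                ip_ttl = label_ttl
                label_ttl = None
    return None


def traceroute(path):
    """Send probes with TTL 1, 2, 3, ... and record where each one expires."""
    return [first_expiry(ttl, path) for ttl in range(1, len(path) + 1)]


PATH = [
    ("r1", "ip"),
    ("pe1", "ingress"),
    ("p1", "lsr"),
    ("p2", "lsr"),
    ("pe2", "egress"),
    ("r2", "ip"),
]

if __name__ == "__main__":
    print(traceroute(PATH))
```

In this model every router on the path, including the two in the middle of the tunnel, shows up as a hop, just as it would if the tunnel were not there.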

But there is another twist to this story. ISPs didn’t love the fact that random end users could get a picture of their internal topology by running traceroute. And MPLS (or other tunnelling technologies) gave them a perfect tool for obscuring the topology. First of all, you can make sure that interior routers don’t send ICMP time exceeded messages. But you can also fudge the TTL when a packet exits a tunnel. Rather than copying the outer (MPLS) TTL to the inner (IP) TTL on egress, you can just decrement the IP TTL by one. Hey presto, your tunnel looks (to traceroute) like a single hop, since the IP TTL only decrements by one as packets traverse the tunnel, no matter how many router hops actually exist along the tunnel path. We made this a configurable option in our implementation and allowed for it in RFC 3032. We also had an internal joke about giving ISPs the option to increment the TTL on egress, so that a tunnel would appear to have a negative hop count. No one wanted their network looking inefficient by having too many hops. (This is a terrible idea given the real purpose of TTL in discarding looping packets, but we had a good laugh anyway.) In short, the non-support of traceroute over tunnels is a choice by operators, not a baked-in feature/bug of MPLS (or other tunnel technologies).
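The difference between the egress behaviours comes down to one line of arithmetic, as the Python sketch below tries to show. The policy names ("copy", "decrement", "increment") are my own labels, not configuration keywords from any product, and the choice to start the label TTL at 255 when it is not copied from the IP header is an assumption made for the sake of the example.

```python
def ip_ttl_on_exit(ip_ttl_in, tunnel_hops, egress_policy="copy"):
    """IP TTL of a packet leaving a tunnel that spans `tunnel_hops` routers.

    egress_policy (hypothetical names):
      "copy"      - label TTL is copied from IP on entry and back on exit
      "decrement" - IP TTL is simply reduced by one on exit
      "increment" - the joke option: IP TTL is increased by one on exit
    """
    if egress_policy not in ("copy", "decrement", "increment"):
        raise ValueError(f"unknown egress policy: {egress_policy}")

    # Assumption: when the label TTL is not copied from the IP header,
    # it starts at the maximum value of 255.
    label_ttl = ip_ttl_in if egress_policy == "copy" else 255

    # The label TTL is decremented at every hop regardless of policy,
    # so looping packets are still discarded inside the tunnel.
    label_ttl -= tunnel_hops
    if label_ttl <= 0:
        raise ValueError("packet would expire inside the tunnel")

    if egress_policy == "copy":
        return label_ttl
    if egress_policy == "decrement":
        return ip_ttl_in - 1
    return ip_ttl_in + 1


def apparent_hops(ip_ttl_in, tunnel_hops, egress_policy):
    """How many hops the tunnel appears to contain, judged by the IP TTL."""
    return ip_ttl_in - ip_ttl_on_exit(ip_ttl_in, tunnel_hops, egress_policy)


if __name__ == "__main__":
    for policy in ("copy", "decrement", "increment"):
        print(policy, apparent_hops(64, 5, policy))
```

For a five-hop tunnel the three policies make the tunnel look like five hops, one hop, and minus one hop respectively. Note that the label TTL still counts down inside the tunnel in every case, which is what protects against loops.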

There is plenty more to this story, like how we came to think of labels as a stack, but that can wait for another newsletter. Part of me wishes we hadn’t worked so hard to keep the minimal MPLS label header down to 32 bits. But we didn’t break traceroute except for ISPs who wanted it broken, and we managed to deploy MPLS into the networks of almost every ISP without them complaining about the label tax. We didn’t get everything right by any means but we made a set of tradeoffs that worked for most of our stakeholders.


Ed Zitron has a good (long) piece on the bursting of the Generative AI bubble that pretty much matches our thinking. And just to show that we read some things that don’t confirm our opinions, here is a counterargument from Casey Newton.

We were excited to see Cory Doctorow’s contribution to language, “enshittification”, named word of the year by the Macquarie Dictionary, especially since we called it 12 months ago.

In 2003 I gave a well-received talk at the SIGCOMM Outrageous Opinion session called “MPLS Considered Helpful”. Slides are here (no video, sadly). I even talked about the benefits (for ISPs) of using tunnels to hide their topology.

For the computer scientist on your holiday shopping list, how about a Systems Approach book or a gift subscription to this newsletter?

Preview image by Yohei Shimomae on Unsplash
