UPDATE: See the packets and find out if you're affected here.
UPDATE 2: Yes, I've reproduced this issue regardless of OS, ASPM state/settings, or software firewall settings. Obviously if you have a layer 2/3 firewall in front of an affected interface you'll be ok.
Packets of death. I started calling them that because that’s exactly what they are.
Star2Star has a hardware OEM that has built the last two versions of our on-premise customer appliance. I’ll get more into this appliance and the magic it provides in another post. For now let’s focus on these killer packets.
About a year ago we released a refresh of this on-premise equipment. It started off simple enough, pretty much just standard Moore’s Law stuff. Bigger, better, faster, cheaper. The new hardware was 64-bit capable, had 8X as much RAM, could accommodate additional local storage, and had four Intel (my preferred ethernet controller vendor) gigabit ethernet ports. We had (and have) all kinds of ideas for these four ports. All in all it was pretty exciting.
This new hardware flew through performance and functionality testing. The speed was there and the reliability was there. Perfect. After this extensive testing we slowly rolled the hardware out to a few beta sites. Sure enough, problems started to appear.
All it takes is a quick Google search to see that the Intel 82574L ethernet controller has had at least a few problems. Including, but not necessarily limited to, EEPROM issues, ASPM bugs, MSI-X quirks, etc. We spent several months dealing with each and every one of these. We thought we were done.
We weren’t. It was only going to get worse.
I thought I had the perfect software image (and BIOS) developed and deployed. However, that’s not what the field was telling us. Units kept failing. Sometimes a reboot would bring the unit back, usually it wouldn’t. When the unit was shipped back, however, it would work when tested.
Wow. Things just got weird.
The weirdness continued and I finally got to the point where I had to roll my sleeves up. I was lucky enough to find a very patient and helpful reseller in the field to stay on the phone with me for three hours while I collected data. This customer location, for some reason or another, could predictably bring down the ethernet controller with voice traffic on their network.
Let me elaborate on that for a second. When I say “bring down” an ethernet controller I mean BRING DOWN an ethernet controller. The system and ethernet interfaces would appear fine and then after a random amount of traffic the interface would report a hardware error (lost communication with PHY) and lose link. Literally the link lights on the switch and interface would go out. It was dead.
Nothing but a power cycle would bring it back. Attempting to reload the kernel module or reboot the machine would result in a PCI scan error. The interface was dead until the machine was physically powered down and powered back on. In many cases, for our customers, this meant a truck roll.
While debugging with this very patient reseller I started stopping the packet captures as soon as the interface dropped. Eventually I caught on to a pattern: the last packet out of the interface was always a 100 Trying provisional response, and it was always a specific length. Not only that, I ended up tracing this (Asterisk) response to a specific phone manufacturer’s INVITE.
I got off the phone with the reseller, grabbed some guys and presented my evidence. Even though it was late in the afternoon on a Friday, everyone did their part to scramble and put together a test configuration with our new hardware and phones from this manufacturer.
We sat there, in a conference room, and dialed as fast as our fingers could. Eventually we found that we could duplicate the issue! Not on every call, and not on every device, but every once in a while we could crash the ethernet controller. However, every once in a while we couldn’t at all. After a power cycle we’d try again and hit it. Either way, as anyone who’s tried to diagnose a technical issue knows, the first step is duplicating the problem. We were finally there.
Believe me, it took a long time to get here. I know how the OSI stack works. I know how software is segmented. I know that the contents of a SIP packet shouldn’t do anything to an ethernet adapter. It just doesn’t make any sense.
Between packet captures on our device and packet captures from the mirror port on the switch we were finally able to isolate the problem packet. Turns out it was the received INVITE, not the transmitted 100 Trying! The mirror port capture never saw the 100 Trying hit the wire.
Now we needed to look at this INVITE. Maybe the userspace daemon processing the INVITE was the problem? Maybe it was the transmitted 100 Trying? One of my colleagues suggested we shut down the SIP software and see if the issue persisted. No SIP software running, no transmitted 100 Trying.
First we needed a better way to transmit the problem packet. We isolated the INVITE transmitted from the phone and used tcpreplay to play it back on command. Sure enough it worked. Now, for the first time in months, we could shut down these ports on command with a single packet. This was significant progress and it was time to go home, which really meant it was time to set this up in the lab at home!
Before I go any further I need to give another shout out to an excellent open source piece of software I found. Ostinato turns you into a packet ninja. There’s literally no limit to what you can do with it. Without Ostinato I could have never gotten beyond this point.
With my packet Swiss army knife in hand I started poking and prodding. What I found was shocking.
It all starts with a strange SIP/SDP quirk. Take a look at this SDP:
o=- 20047 20047 IN IP4 10.41.22.248
c=IN IP4 10.41.22.248
m=audio 11786 RTP/AVP 18 0 18 9 9 101
Yes, I saw it right away too. The audio offer is duplicated and that’s a problem but again, what difference should that make to an Ethernet controller?!? Well, if nothing else it makes the ethernet frame larger...
But wait, there were plenty of successful ethernet frames in these packet captures. Some of them were smaller, some were larger. No problems with them. It was time to dig into the problem packet. After some more Ostinato-fu and plenty of power cycles I was able to isolate the problem pattern (with a problem frame).
Warning: we’re about to get into some hex.
The interface shutdown is triggered by a specific byte value at a specific offset. In this case the specific value was hex 32 at 0x47f. Hex 32 is an ASCII 2. Guess where the 2 was coming from?
All of our SDPs were identical (including ptime, obviously). All of the source and destination URIs were identical. The only difference was the Call-IDs, tags, and branches. Problem packets had just the right Call-ID, tags, and branches to cause the “2” in the ptime to line up with 0x47f.
BOOM! With the right Call-IDs, tags, and branches (or any random garbage) a “good packet” could turn into a “killer packet” as long as that ptime line ended up at the right address. Things just got weirder.
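To make the alignment idea concrete, here is a minimal Python sketch of how a variable-length header field slides the ptime digit around inside a frame. Everything in it is illustrative: the 42-byte header stand-in, the example.invalid URI, the `build_fake_frame` helper, and the ptime value of 20 are assumptions, and the offset is assumed to be zero-based from the start of the ethernet frame. It only builds byte strings; nothing is sent anywhere.

```python
from typing import Optional

# Offset reported in the post; assumed zero-based from the start of the frame.
KILL_OFFSET = 0x47F


def ptime_digit_offset(frame: bytes) -> Optional[int]:
    """Return the offset of the first digit of the ptime value, or None."""
    marker = b"a=ptime:"
    pos = frame.find(marker)
    if pos == -1:
        return None
    return pos + len(marker)


def build_fake_frame(call_id: str) -> bytes:
    """Hypothetical frame: zeroed headers + a tiny SIP message + a tiny SDP."""
    # Stand-in for Ethernet (14) + IPv4 (20) + UDP (8) headers.
    headers = bytes(42)
    sip = (
        "INVITE sip:100@example.invalid SIP/2.0\r\n"
        f"Call-ID: {call_id}\r\n"
        "\r\n"
    ).encode("ascii")
    # ptime of 20 is an assumption; the post only says the digit was a "2".
    sdp = b"m=audio 11786 RTP/AVP 18 0 18 9 9 101\r\na=ptime:20\r\n"
    return headers + sip + sdp


if __name__ == "__main__":
    base = ptime_digit_offset(build_fake_frame(""))
    needed = KILL_OFFSET - base
    frame = build_fake_frame("x" * needed)
    print(f"Call-ID length {needed} puts the digit at {ptime_digit_offset(frame):#x}")
```

Every extra byte in the Call-ID (or a tag, or a branch) moves the digit one byte further into the frame, which is why otherwise identical calls only occasionally produced a killer packet.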
While generating packets I experimented with various hex values. As if this problem couldn’t get any weirder, it does. I found out that the behavior of the controller depended completely on the value at this specific offset in the first received packet long enough to reach that offset. It broke down to something like this:
Byte 0x47f = 31 HEX (1 ASCII) - No effect
Byte 0x47f = 32 HEX (2 ASCII) - Interface shutdown
Byte 0x47f = 33 HEX (3 ASCII) - Interface shutdown
Byte 0x47f = 34 HEX (4 ASCII) - Interface inoculation
When I say “no effect” I mean it didn’t kill the interface but it didn’t inoculate the interface either (more on that later). When I say the interface shutdown, well, remember my description of this issue - the interface went down. Hard.
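The table above can be written down as a small lookup. This is only a sketch of the observed behavior, not anything from Intel: the function name `classify_frame` is made up, the offset is assumed to be zero-based from the start of the frame, and treating every value other than 0x31, 0x32, and 0x33 as inoculating follows the observation reported later in this post rather than exhaustive testing.

```python
# Offset reported in the post; assumed zero-based from the start of the frame.
KILL_OFFSET = 0x47F


def classify_frame(frame: bytes) -> str:
    """Map the byte at KILL_OFFSET to the behavior observed in testing."""
    if len(frame) <= KILL_OFFSET:
        # Assumption: a frame that never reaches the offset does nothing.
        return "too short"
    value = frame[KILL_OFFSET]
    if value == 0x31:           # ASCII "1"
        return "no effect"
    if value in (0x32, 0x33):   # ASCII "2" or "3"
        return "shutdown"
    return "inoculation"        # 0x34 and, reportedly, everything else


if __name__ == "__main__":
    for value in (0x31, 0x32, 0x33, 0x34):
        frame = bytearray(KILL_OFFSET + 1)
        frame[KILL_OFFSET] = value
        print(f"{value:#04x}: {classify_frame(bytes(frame))}")
```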
With even more testing I discovered this issue with every version of Linux I could find, FreeBSD, and even when the machine was powered up complaining about missing boot media! It’s in the hardware; the OS has nothing to do with it. Wow.
To make matters worse, using Ostinato I was able to craft various versions of this packet - an HTTP POST, ICMP echo-request, etc. Pretty much whatever I wanted. With a modified HTTP server configured to place the right byte value at the right offset (based on headers, host, etc) you could easily configure an HTTP 200 response to contain the packet of death - and kill client machines behind firewalls!
I know I’ve been pointing out how weird this whole issue is. The inoculation part is by far the strangest. It turns out that if the first packet received contains any value (that I can find) other than 1, 2, or 3 the interface becomes immune from any death packets (where the value is 2 or 3). Also, valid ptime attributes are defined in
All of a sudden it’s become clear why this issue was so sporadic. I’m amazed I tracked it down at all. I’ve been working with networks for over 15 years and I’ve never seen anything like this. I doubt I’ll ever see anything like it again. At least I hope I don’t...
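A toy model helps show why the failures looked so random. The sketch below is an assumption-laden simulation of the behavior described in this post, not of the controller's real internals: the state names, the `simulate` function, and the rule that frames shorter than the offset are ignored are all made up for illustration.

```python
# Offset reported in the post; assumed zero-based from the start of the frame.
KILL_OFFSET = 0x47F
NEUTRAL = 0x31            # ASCII "1": neither kills nor inoculates
KILLERS = (0x32, 0x33)    # ASCII "2" and "3": interface shutdown


def frame_with(value: int) -> bytes:
    """Build a zero-filled frame with `value` at KILL_OFFSET."""
    frame = bytearray(KILL_OFFSET + 1)
    frame[KILL_OFFSET] = value
    return bytes(frame)


def simulate(frames) -> str:
    """Return the interface state after receiving `frames` since power-on."""
    state = "vulnerable"
    for frame in frames:
        if state != "vulnerable":
            break  # dead or inoculated: later frames change nothing
        if len(frame) <= KILL_OFFSET:
            continue  # assumption: short frames have no effect
        value = frame[KILL_OFFSET]
        if value in KILLERS:
            state = "dead"
        elif value != NEUTRAL:
            state = "inoculated"
    return state


if __name__ == "__main__":
    print(simulate([frame_with(0x34), frame_with(0x32)]))  # inoculated first
    print(simulate([frame_with(0x31), frame_with(0x32)]))  # "1" does not protect
```

Under this model, whether a unit dies depends entirely on which qualifying frame happens to arrive first after a power cycle, which matches how hit-or-miss the failures were in the conference room and in the field.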
I was able to get in touch with two engineers at Intel and send them a demo unit to reproduce the issue. After working with them for a couple of weeks they determined there was an issue with the EEPROM on our 82574L controllers.
They were able to provide a new EEPROM image and a tool to write it out. Unfortunately we weren’t able to distribute this tool and it required unloading and reloading the e1000e kernel module, so it wouldn’t be preferred in our environment. Fortunately (with a little knowledge of the EEPROM layout) I was able to work up some bash scripting and ethtool magic to save the “fixed” EEPROM values and write them out on affected systems. We now have a way to detect and fix these problematic units in the field. We’ve communicated with our vendor to make sure this fix is applied to units before they are shipped to us. What isn’t clear, however, is just how many other affected Intel ethernet controllers are out there.
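For anyone who wants to compare an interface against a known-good unit, here is a rough Python sketch of the detection half. It is not the script described above. It assumes `ethtool -e <iface>` prints rows of the form `0x0010:` followed by hex byte values, and the helper names (`parse_ethtool_dump`, `diff_eeprom`, `read_eeprom`) are made up. Which offsets actually differ on an affected 82574L is not stated in this post, so the sketch simply reports every differing byte.

```python
import re
import subprocess

ROW = re.compile(r"^0x([0-9a-fA-F]+):\s+(.+)$")
HEX_BYTE = re.compile(r"^[0-9a-fA-F]{2}$")


def parse_ethtool_dump(text: str) -> bytes:
    """Turn the text output of `ethtool -e` into raw bytes (assumed format)."""
    data = bytearray()
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # skip the header and separator lines
        offset = int(match.group(1), 16)
        tokens = [t for t in match.group(2).split() if HEX_BYTE.match(t)]
        values = bytes(int(t, 16) for t in tokens)
        if len(data) < offset:
            data.extend(bytes(offset - len(data)))  # pad any gap with zeros
        data[offset:offset + len(values)] = values
    return bytes(data)


def diff_eeprom(current: bytes, known_good: bytes):
    """List (offset, current_byte, known_good_byte) for every mismatch."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(current, known_good))
        if a != b
    ]


def read_eeprom(iface: str) -> bytes:
    """Dump the EEPROM of `iface`; needs root and a driver that supports it."""
    result = subprocess.run(
        ["ethtool", "-e", iface], check=True, capture_output=True, text=True
    )
    return parse_ethtool_dump(result.stdout)
```

Writing bytes back would go through `ethtool -E`, which this sketch deliberately leaves out. A bad EEPROM write can brick an interface, so that step belongs in a carefully tested script.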
I guess we’ll just have to see...