Friday, February 8, 2013

Packets of Death - UPDATE

UPDATE - Intel has pointed me towards the successor to the 82574, which includes some of the features I suggest here. I think it's safe to say we're all looking forward to this chip hitting the streets!

The last 48 hours have been interesting, to say the least.

My original post garnered much more attention than I expected.  I’ll always remember being on a conference call and having someone tell me “Hey, you’re on Slashdot”.  Considering the subject matter I suppose I should have expected that.

As of today, here’s what I know:

Many of you have shared the results of your testing. The vast majority of tested Intel ethernet controllers do not appear to be affected by this issue.

Intel has responded with an expanded technical explanation of the issue.  I also received a very pleasant and professional phone call from Douglas Boom (at Intel) to update me on their assessment of the situation and discuss any ongoing concerns or problems I may have. Thank you Doug and well done Intel!  Note to other massive corporations that could be presented with issues like this: do what Intel did.

To summarize their response, Intel says that a misconfigured EEPROM caused this issue.  The EEPROM is written by the motherboard manufacturer and not Intel.  Intel says my motherboard manufacturer did not follow their published guidelines for this process.  Based on what I’ve seen, how my issue was fixed, and what I’m learning from the crowdsourced testing process this seems like a perfectly plausible explanation.  Once again, thanks Intel!

However, I still don’t believe this issue is completely isolated to this specific instance and one motherboard manufacturer.  For one, I have received at least two confirmed reports from people who were able to reproduce this issue - my “packet of death” shutting down 82574L hardware from different motherboard manufacturers.  This doesn’t surprise me at all.

One thing we’re reminded of in this situation is just how complex all of these digital systems have become.  We’re a long way from configuring ethernet adapters with IRQ jumpers.  Intel has designed an incredibly complex ethernet controller - the datasheet is 490 pages long!  Of course I’m not faulting them for this - the features available in this controller (or many other controllers) dictate this level of complexity - management/IPMI features, WOL, various offloading mechanisms, interrupt queues, and more.  This complexity doesn’t even scratch the surface of the various other systems involved in getting data across the internet and into your eyeballs!

Like any sufficiently advanced product all of these features are driven by a configuration mechanism.  The Linux kernel module for the 82574L (e1000e) has various options that can be passed to modify the behavior of the adapter.  Makes sense.  If I passed some stupid or unknown parameter to this module I would expect it to return with some kind of error informing me of my mistake.  I’m only human, mistakes are going to happen.

At a lower level Intel has exposed these EEPROM configuration parameters to control various aspects of the controller.  As Intel says these EEPROM values are to be set by the motherboard manufacturer.  Here’s where the problem lies - it’s certainly possible this could be done incorrectly.  Motherboard manufacturers are human and they make mistakes too.

Unfortunately, as we’ve learned in this case, there isn’t quite the same level of feedback when EEPROM misconfigurations happen.  In my previous example if I pass unknown parameters to a kernel module it’s going to come back and say “Hey - I don’t know what that is (dummy)” and exit with an error.

As I’ve shown in some cases (mine) if an EEPROM is misconfigured everything appears normal until some insanely specific packet is received.  Then the controller shuts down, for some reason.

Does that behavior make sense to anyone?

I suggest the following:

1)  Make future controllers have as much in-hardware sane behavior as possible when unknown conditions are encountered.  Error checking, basically.  Users can input data on a web form, that’s why there’s error checking.  Everyone knows users do stupid things.  Clearly some of the people programming Intel EEPROMs for motherboard OEMs do stupid things too.  What is sane default behavior?  EEPROM error encountered = adapter shutdown and error message.  Give the user notification and provide some mechanism for EEPROM validation and management...

2)  Put more EEPROM validation in operating system drivers.  Intel maintains ethernet drivers for various platforms.  Why aren’t these drivers doing more validation of the adapter EEPROM?  If my EEPROM was so badly misconfigured, why couldn’t the e1000e module have discovered that and notified me?

3)  Produce and support an external tool for EEPROM testing, programming, and updating.  In the course of working with Intel last fall I was provided a version of this tool for my testing so I know it exists.  While I can understand why you don’t want random users messing with their EEPROM (and causing potential support nightmares) it seems the benefits would clearly outweigh any potential problems (of which there are already plenty).
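
To make suggestions 2 and 3 a little more concrete, here is a minimal sketch (in Python, for readability) of what extra EEPROM validation might look like. The checksum rule (the first 0x40 16-bit words summing to 0xBABA) is, as far as I know, what the e1000e module already checks. Everything else is an assumption for illustration: the `WordRule` table, its offsets, masks, and expected values are hypothetical placeholders and not Intel's published guidelines, and the little-endian word layout of the raw dump is assumed as well.

```python
import struct
from dataclasses import dataclass
from typing import List, Sequence

NVM_CHECKSUM_WORDS = 0x40  # checksum covers 16-bit words 0x00..0x3F
NVM_SUM = 0xBABA           # expected 16-bit sum of those words


@dataclass(frozen=True)
class WordRule:
    """One hypothetical rule: (word & mask) must equal expected."""
    offset: int
    mask: int
    expected: int
    description: str


def words_from_bytes(raw: bytes) -> List[int]:
    """Split a raw dump into 16-bit words (little-endian is an assumption)."""
    usable = len(raw) - (len(raw) % 2)  # ignore a trailing odd byte
    return [word for (word,) in struct.iter_unpack("<H", raw[:usable])]


def checksum_ok(words: Sequence[int]) -> bool:
    """True if the first 0x40 words sum (mod 2**16) to 0xBABA."""
    if len(words) < NVM_CHECKSUM_WORDS:
        return False
    return (sum(words[:NVM_CHECKSUM_WORDS]) & 0xFFFF) == NVM_SUM


def validate(words: Sequence[int], rules: Sequence[WordRule]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    if not checksum_ok(words):
        problems.append("checksum over words 0x00-0x3F is not 0xBABA")
    for rule in rules:
        if rule.offset >= len(words):
            problems.append(
                f"word 0x{rule.offset:02X} missing ({rule.description})"
            )
            continue
        actual = words[rule.offset] & rule.mask
        if actual != rule.expected:
            problems.append(
                f"word 0x{rule.offset:02X}: got 0x{actual:04X}, "
                f"expected 0x{rule.expected:04X} ({rule.description})"
            )
    return problems
```

On Linux the raw bytes could come from something like `ethtool -e eth0 raw on` (the interface name is a placeholder). The point is that a table of rules gives the user a list of specific complaints instead of a controller that silently waits for the wrong packet. A real tool would need the actual rules from Intel's documentation.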

The reality is Intel has no idea how many systems are affected by this issue or could be affected by issues like it.  How could they?  They’re expecting motherboard OEMs to follow their instructions (and understandably so).  Just look at the combination of variables required to reproduce this issue:

- Intel 82574L
- Various specific misconfigured bytes in the EEPROM
- An insanely specific packet with the right value at just the right byte, received at a specific time

While most people weren’t able to reproduce this issue with their controller and EEPROM combinations, it did kick off various discussions of periodic, random, sporadic failures across a wide range of ethernet adapters and general computing weirdness.  A quick Google search returns a wide assortment of complaints about these adapters (and others like them) from a whole slew of users.  EEPROM corruption.  Random adapter resets.  Packet drops.  Various latency issues.  PCI bus scan errors.  ASPM problems.  The list goes on and on.

Perhaps the “packet of death” for a slightly misconfigured Intel 82579 (for example) is my packet shifted 20 bytes in one direction or the other.  Who knows?  Please, please, please Intel - let’s do everyone a favor and get these EEPROMs under control.  End users update firmware all of the time - routers, set-top boxes, sometimes even their cars!  Why can’t we have some utility to make sure our ethernet adapters aren’t just waiting to freak out when they receive the wrong packet?

I don’t believe in magic, swamp gas, sun spots, or any of the other “explanations” offered for some of the random strange behavior we often see with these complex devices (ethernet adapters or otherwise).  That’s why I spent so long working on this issue to find a root cause (well, that and screaming customers).  I, like anyone else, encounter bugs and general weirdness in devices and software every day of my life.  Most of the time how do I respond to these bugs?  I reboot, shrug my shoulders, say “that was weird”, and move on.  Meanwhile I know, deep down, that there is a valid explanation for what just happened.  Just like there was with my ethernet controllers.

Even with the explanation offered by Intel we could go much deeper.  Why these bytes at that location?  Why this packet?  What’s up with the “inoculation” effect of some of the values?  There are still many unanswered questions.

I’ve enjoyed reading many others report their tales of “extreme debugging” with the digital devices in their lives.  It seems I’m not the only one that isn’t always satisfied with saying “that was weird” and moving on.


I've said it before and I'll say it again - I love the internet!

4 comments:

Anonymous said...

How about posting which mobo manufacturers are affected by this, that might narrow down the mess a bit.

Anonymous said...

"I don’t believe in magic, swamp gas, sun spots, or any of the other “explanations” offered for some of the random strange behavior we often see with these complex devices (ethernet adapters or otherwise)."

What?!? You don't believe that "swamp gas got stuck in a thermal pocket and refracted the light from Venus"?
Can you step over here and have a look at this...? *Pop!*

Anonymous said...

We currently use the Intel DZ77GA-70K mobo, and some of these Intel® 82574L mobos stop responding. We need to reboot the mobo to make it work.

This mobo is from Intel, not an OEM.

Anonymous said...

THANK YOU!

Please add to this set of complaints that the automated Intel driver scanning software does NOT pick up drivers from 2011 and recommend newer drivers.

I've had 2 years of intermittent issues with a 2008 server on an Intel motherboard, POE switch and phone system.
Once I forced the new driver on the server the issues went away!

The LAST thing I thought was the culprit was the network driver.

The problem was just so intermittent :(