Wednesday, February 20, 2013

findpod - Find Your Own "Packets of Death"

After my original post on “packets of death” I’ve spent the last couple of weeks receiving reports from users all over the world ranging from:

- Able to reproduce on 82574L with your “packets of death”
- Not able to reproduce on 82574L or any other controller
- Not able to reproduce with your “packets of death” but experiencing identical, sporadic failures across a wide range of controllers

Of course the last category intrigued me the most.  There seem to be an awful lot of people experiencing sporadic failures of their ethernet controllers but many of them (as I noted in another update) don’t have the time or tools to diagnose the issue further.  In most cases the symptoms are identical to what I described in my original post:

- Ethernet controller loses link (or reports some other hardware error)
- Varying amounts of time since boot (hours, days, weeks)
- Can only be resolved by a reboot or in some cases a complete power cycle

Many of these users have been dealing with these failures in various ways but have been unable to find a root cause.  I’ve created a tool to help them with their diagnosis.  It’s called findpod and it’s been tested on various Debian-based Linux distributions.  Findpod uses the excellent ifplugd daemon, the venerable tcpdump, and screen.

Once installed and started, ifplugd will patiently wait to receive link status notifications from the Linux kernel.  Once link is detected on the target interface it will start a tcpdump session running inside of screen.  This tcpdump session will log all packets sent and received on that interface.  Here’s the thing: many of these failures are reported after days or weeks of processed traffic, so the tcpdump capture file could easily reach several gigabytes or more!  Here’s where one key trick in findpod comes into play: by default findpod will only log the last 100MB sent or received on the target interface.  As long as ifplugd doesn’t report any link failures, tcpdump will keep writing to the same 100MB circular capture file.
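
To make the circular capture idea concrete, here is a minimal Python sketch of how such a tcpdump command line could be put together.  This is not findpod's actual code: the function name, the default path, and the choice of `-C` with `-W 1` are assumptions for illustration.  tcpdump's `-C` option takes its size in millions of bytes, and `-W` limits how many files are kept when rolling over.

```python
def build_capture_command(interface, capture_path, max_megabytes=100):
    """Return a tcpdump argv for a size-capped capture (hypothetical helper).

    Assumption: a single rotating file (-W 1) is used, so that when the size
    cap (-C) is hit tcpdump starts over in the same file instead of growing
    without bound. findpod itself may be configured differently.
    """
    return [
        "tcpdump",
        "-i", interface,            # capture on the target interface
        "-n",                       # do not resolve addresses to names
        "-C", str(max_megabytes),   # size cap, in units of 1,000,000 bytes
        "-W", "1",                  # keep at most one capture file
        "-w", capture_path,         # write raw packets to this file
    ]


if __name__ == "__main__":
    # Print the command rather than running it; capturing needs root.
    print(" ".join(build_capture_command("eth0", "/tmp/findpod-eth0.pcap")))
```

Building the argument list separately from running it makes it easy to inspect the command before launching it under screen or any other supervisor.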

What happens when the interface loses link?  This is the second unique feature of findpod.  When ifplugd reports a loss of link it will wait for 30 seconds before stopping the packet capture and moving the capture file to a meaningful (and known) name.  If you think your ethernet controller failures could be related to the types of traffic you’re sending or receiving (as I discovered with my “packets of death”) findpod will help you narrow it down to (at most) 100MB of network traffic, even if the capture runs for weeks and your interface handles GBs of data!
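
The link-loss handling described above can be sketched roughly as follows.  ifplugd runs its action script with two arguments, the interface name and either `up` or `down`; everything else here (the function names, the `findpod-<interface>-<timestamp>.pcap` naming scheme, and the injectable `stop_capture` callback) is hypothetical and only meant to show the wait, stop, and rename sequence.

```python
import os
import shutil
import time


def archive_name(interface, when):
    """Build a timestamped file name (hypothetical naming scheme)."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(when))
    return f"findpod-{interface}-{stamp}.pcap"


def handle_link_event(interface, action, capture_path, archive_dir,
                      grace_seconds=30, stop_capture=lambda: None,
                      sleep=time.sleep, now=time.time):
    """React to an ifplugd-style event; return the archived path or None.

    On "down": wait grace_seconds, stop the capture, then move the capture
    file to a known, timestamped name. Any other action is ignored here.
    """
    if action != "down":
        return None
    sleep(grace_seconds)   # let any trailing packets reach the capture file
    stop_capture()         # caller supplies how the capture is stopped
    destination = os.path.join(archive_dir, archive_name(interface, now()))
    shutil.move(capture_path, destination)
    return destination
```

Passing `sleep` and `now` as parameters is a design choice for this sketch only: it lets the logic be exercised without actually waiting 30 seconds.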

Of course (more than likely) it’s even easier than that: if your link loss is being caused by a specific received packet, it will be the last packet in the capture file provided by findpod, and you’ll only have a 100MB capture file to work with.  If your issue is anything like mine you should be able to isolate it down to a specific packet that you can feed to tcpreplay, reproducing your controller issue on demand.
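
If you want to pull out that last packet without opening the whole capture in a GUI, a small reader for the classic pcap file format is enough.  The sketch below is an illustration and not part of findpod; the function name is made up, and it assumes a classic microsecond-resolution pcap file (24-byte global header followed by 16-byte record headers), not pcapng or the nanosecond variant.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic pcap, microsecond timestamps


def last_packet(path):
    """Return (ts_sec, ts_usec, data) for the final complete packet, or None."""
    with open(path, "rb") as f:
        header = f.read(24)
        if len(header) < 24:
            raise ValueError("not a pcap file: global header is truncated")
        # The magic number tells us which byte order the writer used.
        if struct.unpack("<I", header[:4])[0] == PCAP_MAGIC:
            endian = "<"
        elif struct.unpack(">I", header[:4])[0] == PCAP_MAGIC:
            endian = ">"
        else:
            raise ValueError("unsupported capture format")
        last = None
        while True:
            record = f.read(16)
            if len(record) < 16:
                break  # end of file, or a truncated record header
            ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
                endian + "IIII", record)
            data = f.read(incl_len)
            if len(data) < incl_len:
                break  # capture was cut off in the middle of a packet
            last = (ts_sec, ts_usec, data)
        return last
```

The returned bytes are the raw frame as captured, which you can compare against earlier failures before building a replay file from it.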

Please tell me about your experiences with findpod.  As always, comments and suggestions are welcome!
