After my original post on “packets of death” I’ve spent the last couple of weeks receiving reports from users all over the world ranging from:
- Able to reproduce on 82574L with your “packets of death”
- Not able to reproduce on 82574L or any other controller
- Not able to reproduce with your “packets of death” but experiencing identical, sporadic failures across a wide range of controllers
Of course the last category intrigued me the most. There seem to be an awful lot of people experiencing sporadic failures of their ethernet controllers, but many of them (as I noted in another update) don’t have the time or tools to diagnose the issue further. In most cases the symptoms are identical to what I described in my original post:
- Ethernet controller loses link (or reports some other hardware error)
- Varying amounts of time since boot (hours, days, weeks)
- Can only be resolved by a reboot or in some cases a complete power cycle
Many of these users have been dealing with these failures in various ways but have been unable to find a root cause. I’ve created a tool to help them with their diagnosis. It’s called findpod and it’s been tested on various Debian-based Linux distributions. Findpod uses the excellent ifplugd daemon, the venerable tcpdump, and screen.
Once installed and started, ifplugd will patiently wait to receive link status notifications from the Linux kernel. Once link is detected on the target interface it will start a tcpdump session running inside of screen. This tcpdump session will log all packets sent and received on that interface. Here’s the thing: many of these failures are reported after days or weeks of processed traffic, so the tcpdump capture file could easily reach several gigabytes or more! Here’s where one key trick in findpod comes into play: by default findpod will only log the last 100MB sent or received on the target interface. As long as ifplugd doesn’t report any link failures, tcpdump will keep writing to the same 100MB circular capture file.
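To make the "only keep the last 100MB" idea concrete, here is a minimal Python sketch of a size-bounded buffer that evicts the oldest packets once a byte cap is exceeded. It is only an illustration of the concept, not findpod's actual implementation (findpod relies on tcpdump for this); the class name `ByteBoundedBuffer` and the tiny 10-byte cap are hypothetical and chosen just for the demo.

```python
from collections import deque


class ByteBoundedBuffer:
    """Keep only the most recent packets, up to max_bytes in total.

    Hypothetical illustration of a circular capture; not findpod code.
    """

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.total = 0
        self.packets = deque()

    def add(self, packet):
        # Append the newest packet, then evict the oldest ones until
        # the buffer fits inside the size cap again. A single packet
        # larger than the cap is kept rather than dropped.
        self.packets.append(packet)
        self.total += len(packet)
        while self.total > self.max_bytes and len(self.packets) > 1:
            self.total -= len(self.packets.popleft())


# Demo with a 10-byte cap: the oldest packet falls out of the buffer.
buf = ByteBoundedBuffer(max_bytes=10)
for pkt in (b"aaaa", b"bbbb", b"cccc", b"dd"):
    buf.add(pkt)
```

However long the loop runs, memory (or disk, in the real tool) stays bounded, and the most recent traffic is always what survives.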
What happens when the interface loses link? This is the second unique feature of findpod. When ifplugd reports a loss of link it will wait for 30 seconds before stopping the packet capture and moving the capture file to a meaningful (and known) name. If you think your ethernet controller failures could be related to the types of traffic you’re sending or receiving (as I discovered with my “packets of death”) findpod will help you narrow it down to (at most) 100MB of network traffic, even if the capture runs for weeks and your interface handles GBs of data!
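The link-loss step could be sketched roughly as follows in Python. This is an assumption-laden illustration rather than findpod's real handler: the function name `archive_capture`, the `<iface>-linkdown-<timestamp>.pcap` naming scheme, and the injectable `sleep` and `now` parameters are all hypothetical, and stopping tcpdump itself is left out.

```python
import os
import time


def archive_capture(capture_path, iface, grace_seconds=30,
                    sleep=time.sleep, now=time.time):
    """Wait out the grace period, then rename the capture file.

    Returns the new path. The naming scheme is an assumption made
    for this sketch; the real tool may name files differently.
    """
    # Give the interface a moment in case more traffic or errors
    # arrive right after the link-down notification.
    sleep(grace_seconds)

    # Build a predictable, timestamped name in the same directory.
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now()))
    directory = os.path.dirname(capture_path)
    new_path = os.path.join(directory, f"{iface}-linkdown-{stamp}.pcap")

    os.rename(capture_path, new_path)
    return new_path
```

Passing `sleep` and `now` as parameters is purely a convenience for trying the function out without waiting 30 seconds.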
Of course (more than likely) it’s even easier than that: if your link loss is being caused by a specific received packet, it will be the last packet in the capture file provided by findpod and you’ll only have a 100MB capture file to work with. If your issue is anything like mine you should be able to isolate it down to a specific packet that you can feed to tcpreplay, reproducing your controller issue on demand.
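As a rough sketch of that last step, the Python below copies the final complete packet of a capture into its own file. It assumes the capture is in the classic pcap format (tcpdump's default, not pcapng) and small enough to read into memory; the function name `extract_last_packet` is hypothetical and not part of findpod.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4       # classic pcap, microsecond timestamps
PCAP_MAGIC_NANO = 0xA1B23C4D  # classic pcap, nanosecond timestamps
MAGICS = (PCAP_MAGIC, PCAP_MAGIC_NANO)


def extract_last_packet(src_path, dst_path):
    """Write the last complete packet of src_path to dst_path.

    Returns the packet's captured length, or None if no packet exists.
    """
    with open(src_path, "rb") as f:
        data = f.read()
    if len(data) < 24:
        raise ValueError("file too short to be a pcap capture")

    # The 24-byte global header starts with a magic number that also
    # tells us the byte order used for the rest of the file.
    global_header = data[:24]
    if struct.unpack("<I", global_header[:4])[0] in MAGICS:
        endian = "<"
    elif struct.unpack(">I", global_header[:4])[0] in MAGICS:
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")

    # Walk the 16-byte record headers, remembering the latest record.
    offset, last = 24, None
    while offset + 16 <= len(data):
        _sec, _frac, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16])
        end = offset + 16 + incl_len
        if end > len(data):
            break  # final record was cut short; ignore it
        last = data[offset:end]
        offset = end

    if last is None:
        return None
    with open(dst_path, "wb") as out:
        out.write(global_header + last)
    return len(last) - 16
```

The resulting single-packet file could then be handed to tcpreplay, for example `tcpreplay -i eth0 last.pcap`, where the interface and file names are placeholders.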
Please tell me about your experiences with findpod. As always, comments and suggestions are welcome!