PoempelFox Blog


Fri, 11. Dec 2009


Fun with in-band remote-management
Created: 11.12.2009 23:42
Last modified: 11.01.2010 00:05
The most recent cluster nodes at work use in-band remote management. That is, the management card that allows resetting the machines and similar things uses the same physical interface the machine uses for its normal Ethernet connection. Packets tagged with a certain VLAN ID reach only the management card, while everything else goes to the normal network interface that, for example, the Linux kernel sees. Though my gut feeling told me from the start that this was a stupid idea, we had to accept it for the simple reason that everything else was impractical: there was just no space for another set of cables and switches for dedicated management interfaces.
At first, everything went surprisingly smoothly. Everything could be configured easily and worked flawlessly from the start.
Until we booted Ubuntu 9.10 instead of the 8.04 LTS that usually runs there. Somewhere in the middle of the messages the kernel writes during boot, the serial console (which is serial-over-LAN through the management card) would just stop responding. In fact, the whole management card would suddenly stop replying to anything until either the management card or the whole machine was reset.
It took a while to work that one out. The last message on the console before it died was not always the same, but it was usually within the same 10 lines. I finally realized that it always died a few seconds after the network driver was loaded.
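To make the tagging scheme concrete, here is a minimal sketch of the decision the shared port has to make for every incoming frame. It only illustrates standard 802.1Q tag parsing and is not how the card's firmware actually works; the VLAN ID 42 and the names `vlan_id`, `destination` and `build_frame` are made up for this example.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q VLAN tag
MGMT_VLAN_ID = 42    # hypothetical management VLAN ID, not taken from our setup


def vlan_id(frame: bytes):
    """Return the 802.1Q VLAN ID of an Ethernet frame, or None if untagged."""
    if len(frame) < 16:
        return None  # too short to hold two MAC addresses plus a VLAN tag
    # Bytes 12-13 hold the TPID, bytes 14-15 the tag control information.
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_8021Q:
        return None
    return tci & 0x0FFF  # the VLAN ID is the low 12 bits of the TCI


def destination(frame: bytes, mgmt_vlan: int = MGMT_VLAN_ID) -> str:
    """Decide who should receive the frame on the shared port."""
    return "management card" if vlan_id(frame) == mgmt_vlan else "host"


def build_frame(vlan=None) -> bytes:
    """Build a dummy frame for testing, optionally carrying a VLAN tag."""
    header = bytes(6) + bytes(6)  # destination and source MAC, all zero
    if vlan is not None:
        header += struct.pack("!HH", TPID_8021Q, vlan)
    return header + struct.pack("!H", 0x0800) + b"payload"
```

With these helpers, `destination(build_frame(42))` yields "management card", while an untagged frame or one from any other VLAN goes to the host.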
Further investigation revealed quite a few amusing details.
  • Ubuntu 8.04 LTS was the only version that would work. All later versions showed the exact same behaviour. The same was true if a current version of the driver for the network card ("igb") was used, or if the SLES10 SP3 kernel was booted.
  • If we compiled the LTS kernel ourselves and used that instead of the official kernel packages, we had no network at all. The reason is that the normal kernel does not contain any igb module at all. Ubuntu has put that into the "linux-ubuntu-modules" package, which is compiled separately and automagically installed when needed. That modules package in fact contains two different versions of the igb driver. One is 1.0.8, which is from November 2007 and doesn't even compile with 2.6.24 without heavy patching. The other is version 1.3.28.4, which is not the current version, but a fairly recent patchlevel of the old "stable" tree. They put this newer driver in as igb-next and patched it so that it only responds to PCI IDs the old igb driver cannot handle. In other words, if the network card is recognized by the old driver at all, that driver will be used, regardless of the fact that the new driver would probably handle more features of the card. It turns out this approach was great for us, because with 1.3.28.4 the management interface would die. The patched Ubuntu version of 1.0.8 was basically the only version that worked.
  • We found out why the management cards wouldn't respond anymore: when one of the newer drivers was loaded, the Linux kernel would suddenly see the packets of the management VLAN, so they apparently didn't reach the management card anymore.
  • The reason newer versions of the driver don't work anymore is probably that the network card has hardware support for handling VLAN tags, and support for that was added to the driver some time ago, somewhere between 1.0.8 and 1.3.28 I guess. Every driver that initializes the VLAN handling seems to kill the filter the management card has set up.
  • This produces an interesting effect if you tell the kernel to use that VLAN hardware support: if you load the 8021q kernel module and then tell it to listen on the management VLAN with the command vconfig add eth0 vlanid, the exact opposite of what you would expect happens. The management VLAN is suddenly routed to the management card again, and the kernel cannot see it anymore.
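Since the whole problem hinges on which igb version ends up loaded, a quick check of the running driver is handy. The following is only a sketch: it assumes the module exports its version under /sys/module/igb/version, and it treats everything newer than 1.0.8 as suspect, because all we know is that the breaking change came somewhere between 1.0.8 and 1.3.28. The function names are made up for this example.

```python
import re
from pathlib import Path

KNOWN_GOOD = (1, 0, 8)  # the only igb version that worked for us


def parse_version(text: str) -> tuple:
    """Turn '1.3.28.4' into (1, 3, 28, 4), ignoring a suffix such as '-k2'."""
    numeric_part = text.strip().split("-", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", numeric_part))


def loaded_igb_version(sysfs_root: str = "/sys"):
    """Return the version string of the loaded igb module, or None.

    Assumption: the module exports a version file in sysfs. If igb is not
    loaded or exports no version, the file is missing and we return None.
    """
    path = Path(sysfs_root) / "module" / "igb" / "version"
    try:
        return path.read_text().strip()
    except OSError:
        return None


def may_break_management(version: str) -> bool:
    """Guess whether this driver version kills the management VLAN filter.

    This is a conservative guess, not a confirmed threshold: anything
    newer than the known-good 1.0.8 is flagged.
    """
    return parse_version(version) > KNOWN_GOOD
```

For the two drivers in the Ubuntu modules package, `may_break_management("1.0.8")` is False and `may_break_management("1.3.28.4")` is True.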
I'm still wondering who the culprit is and who needs to fix the bug here. Is it Intel with the igb driver, because it destroys the VLAN routing during initialization? Or is it IBM, because their management card doesn't realize its interface is gone and doesn't re-initialize it? I haven't opened a bug report yet, because I'm really not in the mood for a round of "but it's THEIR fault" right now.