The most recent cluster nodes at work use in-band remote management.
That is, the management card that lets us reset the machines and
do similar things shares the physical interface the machine uses
for its normal Ethernet connection. The way this works is that
packets tagged with a certain VLAN ID reach only the management
card, while everything else goes to the normal network interface
that, for example, the Linux kernel sees. Though my gut feeling told me
this was a stupid idea from the start, we had to accept it for
the simple reason that everything else was impractical. There
was just no space for another set of cables and switches for
dedicated management interfaces.
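To make the filtering idea concrete, here is a minimal Python sketch of the decision that is effectively made for each incoming frame. It assumes standard 802.1Q tagging: a tag type of 0x8100 right after the two MAC addresses, with the VLAN ID in the low 12 bits of the field that follows. The management VLAN ID 4000 and the names vlan_id, destination and build_frame are hypothetical, chosen only for illustration; the real filter lives in the network hardware, not in code like this.

```python
import struct

TPID_8021Q = 0x8100   # EtherType value that marks an 802.1Q tag
MGMT_VLAN_ID = 4000   # hypothetical; the real management VLAN ID is not given


def vlan_id(frame):
    """Return the 802.1Q VLAN ID of an Ethernet frame, or None if untagged."""
    if len(frame) < 16:
        return None
    # Bytes 0-11 are the destination and source MAC addresses.
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_8021Q:
        return None
    return tci & 0x0FFF  # low 12 bits of the tag control field


def destination(frame):
    """Model the split: management VLAN to the card, the rest to the host."""
    if vlan_id(frame) == MGMT_VLAN_ID:
        return "management card"
    return "host"


def build_frame(vid=None, payload=b"x" * 46):
    """Build a dummy frame, optionally tagged, for trying out the functions."""
    header = bytes(6) + bytes(6)  # zeroed destination and source MAC
    if vid is not None:
        header += struct.pack("!HH", TPID_8021Q, vid)
    return header + struct.pack("!H", 0x0800) + payload


if __name__ == "__main__":
    print(destination(build_frame(MGMT_VLAN_ID)))
    print(destination(build_frame()))
```

Under these assumptions, a frame tagged with the management VLAN ID is classified for the management card, and untagged frames or frames on any other VLAN go to the host.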
At first, everything went surprisingly smoothly. Everything could
be easily configured, and worked flawlessly from the start.
Until we booted Ubuntu 9.10 instead of the 8.04 LTS that
usually runs there. Somewhere in the middle of the messages
the kernel writes during boot, the serial console (which is
serial-over-LAN through the management card) would just stop
responding. In fact, the whole management card would suddenly
stop replying to anything until either the management
card or the whole machine was reset.
It took a while to work that one out. The last message on
the console before it died was not always the same, but it
was usually within the same 10 lines. I finally realized
that it always died a few seconds after loading the network
driver.
Further investigation revealed quite a few amusing details.
- Ubuntu 8.04 LTS was the only version that
would work. All later versions would show the
exact same behaviour. The same was true if a current version
of the driver for the network card ("igb") was used,
or if the SLES10 SP3 kernel was booted.
- If we compiled the LTS kernel ourselves, and used that instead
of the official kernel packages, we would have no network at
all. The reason is that the normal kernel does not
contain any igb module at all. Ubuntu has put that into the
"linux-ubuntu-modules" package, which is compiled
separately and automagically installed when needed. In that
modules package they have in fact two different versions
of the igb driver: One is 1.0.8, which is
from November 2007 and doesn't even compile with 2.6.24 without
heavy patching. The other is version 1.3.28.4, which is not
the current version, but a fairly recent patch level of the old
"stable" tree. They put this newer driver
in as igb-next, and patched it so that it would
only respond to PCI IDs the old igb driver could not handle.
In other words, if the network card is recognized by the old
driver at all, that driver will be used, regardless of the
fact that the new driver would probably handle more features of
the card. It turns out this approach was great for us, because
with 1.3.28.4 the management interface would die. The
patched Ubuntu version of 1.0.8 was basically the only
version that worked.
- We found out why the management cards wouldn't respond
anymore: When one of the newer drivers was loaded, the Linux
kernel would suddenly see the packets of the management VLAN,
so they apparently didn't reach the management card anymore.
- The reason newer versions of the driver don't work anymore
is probably that the network card has hardware support for
handling VLAN tags, and they added support for that in the
driver some time ago, somewhere between 1.0.8 and 1.3.28,
I guess.
Every driver that initializes the VLAN handling seems to
kill the filter the management card has set up.
- This produces an interesting effect if you tell the
kernel to use that VLAN hardware support: If you load the
8021q kernel module, and then tell it to listen on the management
VLAN with the command vconfig add eth0 vlanid,
the exact opposite happens. The management VLAN is suddenly
routed to the management card again, and the kernel cannot
see it anymore.
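Since everything above hinges on which igb version actually ends up loaded, a quick check along the following lines could help when bringing up a new image. This is only a sketch under a few assumptions: that the driver exports its version string under /sys/module/igb/version (modules only do this when they declare a version), and that 1.0.8 is the version to look for. The names parse_version, loaded_driver_version, is_known_good and KNOWN_GOOD are hypothetical. Note that a version string alone cannot tell the patched Ubuntu 1.0.8 from an unpatched one.

```python
import re
from pathlib import Path

# Assumption: 1.0.8 is the only version observed to leave the
# management card reachable.
KNOWN_GOOD = (1, 0, 8)


def parse_version(text):
    """Turn a string like '1.3.28.4' or '5.4.0-k' into a tuple of ints."""
    parts = []
    for piece in text.strip().split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def loaded_driver_version(module="igb", sysfs_root="/sys"):
    """Return the loaded module's version tuple, or None if unavailable."""
    path = Path(sysfs_root) / "module" / module / "version"
    try:
        return parse_version(path.read_text())
    except OSError:
        # Module not loaded, or it does not export a version.
        return None


def is_known_good(version):
    """True only for the one version assumed to be safe."""
    return version == KNOWN_GOOD


if __name__ == "__main__":
    version = loaded_driver_version()
    if version is None:
        print("igb version not available")
    elif is_known_good(version):
        print("igb", version, "matches the known-good version")
    else:
        print("igb", version, "may break the management VLAN filter")
```

The sysfs_root parameter exists only so the lookup can be pointed at a test directory instead of the real /sys.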
I'm still wondering who is the culprit and who needs to
fix the bug here. Is it Intel with the igb driver,
because that destroys the VLAN routing during initialization?
Or is it IBM, because their management card doesn't realize its
interface is gone, and doesn't re-initialize it? I haven't
opened a bug report yet, because I'm really not in the
mood for a round of "but it's THEIR fault" right
now.