PoempelFox Blog


Sat, 26. Nov 2011


Strange coincidences Created: 26.11.2011 13:55
Last modified: 26.11.2011 20:50
At work we have a few external RAID arrays from a major manufacturer, let's call him "HAL", bought as part of a bigger storage system.
From the beginning, they were pretty annoying - aside from their subpar performance, caused by intentional castration of the built-in hardware (so you would have a good reason to buy the even more expensive systems from the same manufacturer), they threw spurious errors all the time. We soon got used to nonsense mails from the system, e.g. telling us parts were 'overheating' - in a cold aisle at a chilly 16 degrees, that is - and back to normal temperature a few seconds later. There were also other alerts of equal uselessness, all disappearing again as fast as they showed up. My personal favorite were messages like:

Event occurred: Thu, 11 Aug 2011 01:19:24 CEST
Event Message: Optimal wide port becomes degraded
Component type: Enclosure Component (ESM, GBIC/SFP, Power Supply, or Fan)
Component location: Enclosure 85, Slot 1

which translates into "something somehow somewhere went wrong, but I won't tell you what or where or how, HAHA!".
So, we were used to getting the occasional nonsense alert and everything going back to normal without any external intervention just seconds later. Until one evening in 2010 (after office hours, of course) all hell broke loose. Over the course of about 40 minutes, we got more than 200 spurious errors from the 6 arrays. Those errors were not spread out evenly; instead, one array would report 20-50 errors in exactly the same second, go back to completely normal a second later, and then a few minutes later the next array would act out in exactly the same way. What was worse: in one case, the errors included the "removal" of 8 out of 10 disks in a RAID6 group - which is of course very plausible, because removing 8 disks in exactly the same second is a piece of cake - naturally leading to the failure of that RAID group. Although all those supposedly removed disks were back seconds later, that naturally did not revive the RAID group.
I'm not going to talk about the nightmare with the "support" hotline that followed, although that was a great example of how not to handle support, but instead cut to the end of it: almost a day later (which is somewhat different from what our service level agreement said!), we were in contact with a seemingly very arrogant support engineer from HAL's storage division, who told us the magic commands we needed to enter to revive the dead RAID group without destroying all data on it.
Of course we also demanded to know what had caused the major outage, but the only thing we got from HAL on that was that absolutely, clearly, no doubt possible, our power grid was the cause of all evil. It was clear to us that this was nonsense: the server room is powered by an online UPS with tight monitoring of the output lines, and neither the monitoring nor the other systems in the same rack (on the same outlets!) or in the rest of the server room noticed anything unusual. And even if there had been a glitch on the power grid (too small for the monitoring to notice), it could not have stretched out over 40 minutes and then disappeared in the middle of the night. Nonetheless, HAL was unwilling to consider any other explanation.
So why am I telling stories from 2010? Because a few weeks ago, the exact same thing happened again: spread over around 40 minutes, all of the RAID arrays acted out by throwing insane amounts of spurious errors again. And again, one RAID6 group failed because 7 of its 10 disks were "removed" in the same second. Luckily, I remembered the command for reviving them, so the resulting complete system outage lasted only the few hours until I noticed the problem in the middle of the night.

And then, just out of curiosity, I started to calculate - how long had it been since the last failure? And my calculations revealed: 497 days, and a few hours. That certainly rang a bell, but for those who have never heard of the Linux uptime bug, I'll explain: almost all operating systems internally count the time since they were booted up and use it for internal functions, like scheduling things to happen at certain intervals. They do that because it's fast, and doesn't depend on real world time with all its complications like time zones and daylight saving time. At least in FreeBSD and Linux, this internal counter was increased by the timer interrupt 100 times per second. As they both used 32 bit counters, the counter would overflow after 2^32 hundredths of a second - which works out to 497 days, 2 hours, 27 minutes and a few seconds. In both old Linux and old FreeBSD systems, this would be visible through the "uptime" command, which shows the time the system has been up - as the counter overflowed, the uptime it showed would wrap around and suddenly start at 0 again after 497 days, 2 hours, 27 minutes...
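Just to show the arithmetic, here is a minimal standalone C sketch, assuming the classic 100 Hz tick rate and a 32 bit counter as described above:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t wrap_ticks = (uint64_t)1 << 32; /* a 32 bit counter holds 2^32 values */
        unsigned int tick_rate = 100;            /* classic 100 ticks per second       */
        uint64_t secs = wrap_ticks / tick_rate;  /* 42949672 whole seconds until wrap  */

        printf("counter wraps after %llu days, %llu hours, %llu minutes, %llu seconds\n",
               (unsigned long long)(secs / 86400),
               (unsigned long long)(secs / 3600 % 24),
               (unsigned long long)(secs / 60 % 60),
               (unsigned long long)(secs % 60));
        /* prints: counter wraps after 497 days, 2 hours, 27 minutes, 52 seconds */
        return 0;
    }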
Of course, these RAID arrays don't run an old Linux version, but VxWorks - however, a quick Google search tells me that VxWorks does the exact same tick counting, with a programmable tick rate. It also offers functions to handle tick overflows, but using those of course requires the programmer to use his brain... So if the tick rate was set to 100 per second, the system in those RAID arrays would exhibit the same behaviour as old Linux/FreeBSD.
Such overflows are also highly likely to cause other complications, because suddenly the values returned are no longer monotonically increasing, and code that doesn't expect that can go terribly wrong. One popular example would be the famous Year 2000 problem, and a similar problem that is still to come is the Year 2038 problem, when the commonly used Unix timestamp wraps wherever it is stored as a signed 32 bit counter.
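For illustration, a small standalone C snippet that shows where the 2038 limit comes from (it assumes the machine running it has a 64 bit time_t and a gmtime() that accepts dates before 1970, as glibc does):

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    int main(void)
    {
        /* The largest value a signed 32 bit Unix timestamp can hold. */
        time_t last = (time_t)INT32_MAX;                  /* 2147483647 seconds since 1970 */
        printf("last representable second: %s", asctime(gmtime(&last)));
        /* -> Tue Jan 19 03:14:07 2038 */

        /* One second later the signed 32 bit value wraps to the most negative
         * number (on the usual two's-complement machines), i.e. back to 1901. */
        time_t wrapped = (time_t)(int32_t)((uint32_t)INT32_MAX + 1u);
        printf("one second later, wrapped:  %s", asctime(gmtime(&wrapped)));
        /* -> Fri Dec 13 20:45:52 1901 */
        return 0;
    }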
In particular, such an overflow is also very very very likely to cause effects like the "disappearing discs" we saw. It is very easy to see how things could go terribly wrong: suppose you poll the responsiveness of all hard discs regularly, and remember the timestamp when you last received a valid reply. To see if a disc is still alive, you could calculate (current_timestamp - last_reply_timestamp), and if that is more than a few seconds, the disc hasn't replied for a long time and is probably dead. That will work, but explode horribly when the timestamp wraps: the current timestamp is suddenly slightly above zero, the last reply from the disc has a timestamp close to 2^32, so the difference between the two is close to 2^32 - which could lead you to wrongly conclude that the disc hasn't replied in ages and is dead. The problem would also instantly disappear again on the next poll, because then both timestamps would be in the low range again, causing the "dead" disc to be declared alive again.
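The firmware is of course a black box to me, but here is a minimal standalone C sketch of how such a check can go wrong across the wrap, and what the usual fix looks like (the function names and the timeout value are made up for illustration):

    #include <stdio.h>
    #include <stdint.h>

    #define DISC_TIMEOUT_TICKS 500u   /* e.g. 5 seconds at 100 ticks per second */

    /* One plausible way to get it wrong: widening to a signed 64 bit "age"
     * before comparing. Right after the 32 bit counter wraps, 'now' is tiny
     * and 'last_reply' is huge, so the age looks like almost 2^32 ticks. */
    static int disc_dead_buggy(uint32_t now, uint32_t last_reply)
    {
        int64_t age = (int64_t)now - (int64_t)last_reply;
        if (age < 0)
            age = -age;               /* ad-hoc "fix" that makes it worse */
        return age > DISC_TIMEOUT_TICKS;
    }

    /* The usual fix: do the subtraction in the counter's own unsigned width,
     * so the difference wraps along with the counter and comes out right
     * (the same trick the Linux kernel uses for its jiffies comparisons),
     * as long as the real elapsed time stays well below 2^32 ticks. */
    static int disc_dead_safe(uint32_t now, uint32_t last_reply)
    {
        return (uint32_t)(now - last_reply) > DISC_TIMEOUT_TICKS;
    }

    int main(void)
    {
        uint32_t last_reply = 4294967290u; /* disc answered 6 ticks before the wrap */
        uint32_t now        = 5u;          /* the counter has just wrapped around   */

        printf("buggy check thinks the disc is dead: %d\n",
               disc_dead_buggy(now, last_reply));  /* prints 1 */
        printf("safe check thinks the disc is dead: %d\n",
               disc_dead_safe(now, last_reply));   /* prints 0 */
        return 0;
    }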
Thus, declaring such problems, occurring after 497 days, 2 hours, ..., a coincidence or the result of a fluke in the power grid is about as plausible as claiming that 6 identical computers crashing on 2000-01-01 00:00:00 is just a coincidence. It is far more likely that this is a major firmware bug.

PS: In case you're not convinced yet: I calculated back 497 days from the time of the first failure. And not surprisingly, I arrived at the day on which the racks housing these disc arrays were cabled. What a coincidence, huh?

PS2: We're getting these boxes exchanged for unrelated reasons soon. And I sure hope that will happen before the next 497 days are over...
