One of our servers, a Dell PowerEdge 1850 running CentOS 4.5 with Xen and hosting a couple of virtual machines, started flashing one of its front panel lights, next to the hard disks, and beeping as well. The machine was under a three-year warranty, so I thought it best to contact Dell UK Server Support.
The e-mail conversation follows below, with some comments of mine between the messages.
My initial message:
Our PE1850 server has an amber flashing light next to the second hard disk, on the right. There's also an amber flashing light on the back of the server. The machine is in service so I can't perform any of the Dell diagnostic tests. If there's no other way to gather information about the problem, then I'll arrange to do so.
The first response from Dell:
With regards to the drive failure on the system in question, in order to determine root cause of the drive failure on the system we will need to obtain a log from the raid controller. Attached on this email is a program that allows us to grab the hardware log from the raid controller. This can be done within the operating system so there will be no need to down the system in order to do this. If you could reply with the output logfile we will examine it and determine the next action.
From that reply, I understood that the technician was assuming a hard disk failure. Since I hadn’t mentioned anything like that, I made it clear in my next e-mail. Also, the attached file was called “Creating and Using the LSI Controller Log (TTY Log).oft”, which is not an executable to my eyes:
I have to mention that there's no indication if it's a hard disk failure or something else. The disks seem to be working fine so far. We have a few VMs on one of them and none of them have reported any problems. How do I run this binary file in Linux? Making it executable and trying to run it doesn't work.
After a phone call from Dell’s support, we cleared up the misunderstanding and they sent me the right tool to get the required logs off the server and send them to Dell:
As per our telephone conversation please reply to this mailbox with the controller log file.
During the phone call, I asked the friendly Dell support guy to send an engineer to have a look and replace the faulty disk. So:
As discussed you will not be able to receive an engineer till next week. As a result please reply to this email closer to the date you want the service call to take place.
An hour and a half after I received that e-mail, I received another one from another Dell support guy in response to the server’s logs:
This is not a hard disk problem, this is referring to another problem. I read through your logs, and the good news is that your array is in good health. The normal cause would be not having both power leads attached. If this isn't the case, we will need to obtain the logs from the onboard management chip. This will enable us to see exactly why the light is flashing. Have you installed server administrator on the machine?
I'm glad to hear that there's no disk array failure and I guess it's good that we didn't close the ticket and send a technician... I'm not aware if server administrator is installed but as far as I can see (by open ports) it's not running. Is there any quick way to find out? If it's not installed, is there a way to gather information without running the diagnostics?
I had to ask those silly questions as I wasn’t really familiar with Dell’s server administrator tool.
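The open-ports check I mentioned in that e-mail can be sketched roughly as follows. This is only a minimal Python sketch and not something from the original exchange: it assumes the Server Administrator web interface listens on its usual default TCP port, 1311 (adjust it if your installation differs), and the helper name is_port_open is purely illustrative.

```python
import socket

# Assumption: 1311 is the usual default port of the Server Administrator
# web interface; change it if your installation is configured differently.
OMSA_DEFAULT_PORT = 1311


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    state = "open" if is_port_open("127.0.0.1", OMSA_DEFAULT_PORT) else "closed"
    print(f"Port {OMSA_DEFAULT_PORT} on localhost is {state}")
```

Keep in mind that a closed port only shows that nothing is listening there; the package could still be installed with its services stopped.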
In response to that e-mail, a third guy replied, shedding some light on the matter:
I went through the controller log file you sent to us previously one more time and although it indeed lists the 2 HDDs in an online state that was true when the controller first started - around the 14th of July. Since then there is a message stating that disk ID0 failed on the 2nd of November. There were no errors logged prior to disk failure so it is not clear if the disk itself is faulty or not. In either case we must first check if the disk needs to be replaced by running some diagnostics on it. If all diags pass then we can rebuild the disk back into the array.
If it turns out that the disk is faulty may I note that according to our records the server has been originally shipped with 2x73GB Seagate HDDs whereas presently there are 2x300GB Fujitsu drives in the system. If the drives were purchased from Dell then we will need a Dell order number for these 2 drives before we can replace them. If you had the drives purchased from a 3rd party then you will have to replace them yourself.
Please follow the procedure below:
1. First, what we need to do is make sure the drive is seated properly in its slot. I suggest you remove the drive for 1-2 minutes and then reinsert it back into the system. Doing that alone may force the drive to start rebuilding. Monitor the LEDs on the drive and see if there is any activity on it (blinking green LED) after inserting. If so - the drive is rebuilding and you should check the status of it in 2-3 hours. If after several hours the LED on the drive turns green and the LED on the back of the server turns blue then the drive has successfully rebuilt back into the array. You can leave it at that or proceed with the diagnostics below in case you want to be certain the drive is OK.
2. In order to run diagnostics on the disk after reseating it you can either use Dell 32-bit diags from a bootable CD: (...) or try running Dell PEDiags from within Linux: (...)
Run the extended diagnostics on disk ID0 (or both) and let us know if any errors occur. If any of the disks fail please make sure you have your Dell order number when you contact us again so that we could book you a replacement drive.
Before following his advice and removing the hard disk in order to force the array to re-build, I migrated all of the virtual machines running to another Xen server. I then installed Dell’s server administrator tool and ran the diagnostics:
I have installed pediags and run the diagnostics on all the devices except the NICs, as they would be disconnected. There were no errors reported at all and I was wondering if I should proceed with the rebuilding of the array? All the hardware we have in our Dell machines is ordered directly from Dell, and the same goes for those hard disk drives.
Then a fourth guy replied:
As long as the diags passed I would suggest to proceed with the rebuild.
And so I did, following their previous instructions. The results were sent with my next e-mail:
I forced the server to rebuild as you instructed me. I took out one disk, kept it out for 1-2 minutes and then re-inserted it. The only lights were, and still are, flashing amber (on the disk itself), and the status LED is still flashing red, the same as the flashing light at the back of the server.
And then guess what? A fifth guy replied to my last e-mail:
It looks as though this drive is going to need to be replaced. In the CTRL-R we would not be able to verify the drives. Reseating while up should have caused the drive to try and rebuild. In order to get a replacement drive out to you we are going to need a few details. Can you please supply us with two contact people onsite and their phone numbers as well as the complete physical address of the server including the post code. Will you also let us know if you are happy to fit the new drive on your own or would you prefer an engineer onsite to replace the drive? Thank you for running through these tests with us.
At this point I started getting pissed off:
How will you replace a hard drive when you don't know if it's faulty or not? Both HDDs had green lights but the system's LED was flashing amber. When I ran the diagnostics I didn't receive any errors from the array or any other part of the hardware. Based on the system's logs, another Dell support person suggested a rebuild and I did so by taking out one of the hard disks.
Then the third guy called me back and followed up by e-mail:
As discussed on the phone, there was a misunderstanding on the first mail sent, and this has propagated itself throughout this mail thread. There was a small error several months ago on one of the disks, but this can be ignored as the reason for your current situation. You have proven this by successfully rebuilding the array. The onboard management led is flashing to indicate an error. We can pull the onboard management logs using server administrator, and the Dell diagnostics should also access this. You are going to check both power leads are attached, and if so, run diagnostics to pull the onboard logs. You will then email the response back to this address.
For me, the misunderstanding was still going on:
Unfortunately the misunderstanding still continues.... Maybe my fault. I have followed the process that was described to me in order to rebuild the array. That was: take out the hard disk for 1-2 minutes, re-insert it and that should force the rebuilding. I just checked the machine and the OS was frozen. I rebooted and there was a message saying that "Logical Drive(s) failed". I had two options from this point: (1) run the configuration utility and (2) continue. I first chose the second option to continue but the OS wasn't loading, then the system was trying to perform a network boot. I rebooted and then chose the first option and got into the configuration utility. Both disks in the configuration menu were marked as "fail". I tried to clean the configuration, erase the existing logical volume and then create a new one. The existing one was erased but I didn't succeed in creating a new one. Could you please send a technician next week in order to rebuild the array?
I’d guess that was because I followed the earlier instructions for “rebuilding the disk array”.
But still, no technician would come. I didn’t mind, as long as I could get the disk array rebuilt easily:
Just tried calling you, I left a message. Getting an array created should take a few seconds on the phone, but what you have done will likely have erased what was on the drives. I'll try to get someone to contact you in the morning, as I am off from tonight until Tuesday morning.
So I got a call the next morning and was given instructions on how to rebuild the disk array. To be honest, it was a straightforward configuration if you know where to look.
Finally, the host started up again with no flashing lights or beeping. The OS and the virtual machines were still there (against all odds). However, destroying the array and rebuilding it took 10 working days and five Dell employees!
I must say that this was the only bad experience I have had with Dell (server) support. Apart from that, most of my other requests were handled in the right manner within two days.