Dead Proliant


Summary: Sick server

This is a summary I (Mark) posted to the mailing list about the problems I was having a while back and how they were resolved, in case anyone sees something similar happen to them. Thanks for all your input :)

Hardware/distro

Compaq Proliant ML570, Smart Array 5300 controller, Red Hat Linux 8.0.

Symptoms

The server "dies" at random intervals. The OS is still running: you can ping it, and if you have an SSH/Telnet/FTP session open, it stays open but you can't run any commands. You can't log in at the console either: you get the Login: prompt, but after you enter your username, nothing happens. It looks as if the SCSI array has somehow become "disconnected".
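Because the box still answers pings in this state, a network check alone will not catch it. Below is a minimal sketch of a disk liveness probe, assuming a modern Python 3 interpreter (not what shipped with this distro) and a monitoring process that is already running when the array drops out. The function name disk_responds and the 10-second timeout are illustrative choices, not something from the original report. It tries a small write plus fsync in a background thread and reports failure if that does not finish in time.

```python
import os
import tempfile
import threading


def disk_responds(directory, timeout_s=10.0):
    """Return True if a small write+fsync in `directory` finishes within timeout_s.

    Hypothetical helper: the name and default timeout are assumptions.
    """
    result = {"ok": False}

    def probe():
        try:
            fd, path = tempfile.mkstemp(dir=directory)
            try:
                os.write(fd, b"probe")
                os.fsync(fd)  # force the data out to the storage device
            finally:
                os.close(fd)
                os.unlink(path)
            result["ok"] = True
        except OSError:
            # Directory missing, read-only filesystem, I/O error, etc.
            result["ok"] = False

    # Daemon thread: if the write hangs forever, it will not block interpreter exit.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return (not worker.is_alive()) and result["ok"]


if __name__ == "__main__":
    print("disk OK" if disk_responds(tempfile.gettempdir()) else "disk NOT responding")
```

The probe runs in a thread rather than a new process because, with the symptoms described, starting a new command may itself hang. Pointing it at a directory on the array in question, rather than the temp directory used in the demo, would be the obvious adjustment.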

Resolution

Called Compaq/HP. Reflashed all firmware to the latest revisions, but the same thing kept happening. Eventually an engineer came out and replaced the motherboard and the SCSI backplanes on the two disk cages at the front. All seems to be fine now :) So it was a hardware problem after all... -Mark

Proliants Don't Like the Cold!

We have a similar server in one of our trucks and it is a real pain. We see a similar issue to the one above, but only at power-up: no drives are detected, which obviously prevents the machine from starting. Through experimentation we have discovered that the problem is thermal. The machine appears to have been designed to operate only in nice air-conditioned environments, and neither the Smart Array controller nor the SCSI drives will run when cold. This may affect you if you keep or run such a machine in a truck or any other non-temperature-controlled environment where the temperature drops below 20 °C (e.g. at night) and you turn the machine off. The temperature at startup is what really matters: once the box has started, the ambient temperature can drop and the machine keeps running. To get the machine to start we:

  1. Take the drives out at night and put them on the heater when driving to the job.
  2. Put the drives in on arrival and turn the machine on (it usually fails to boot).
  3. Turn the truck aircon onto heat to get the inside of the truck and racks warm.
  4. Leave it displaying the "no system disc found" error for 30-60 minutes.
  5. Reboot and off it usually goes.

I know this isn't the best way to treat a machine or drives but it's the only way we have found of getting it going. We're kind of hoping it will die soon so we can justify buying something that's a bit less cranky! - PaulStimpson