We have had a few servers fail to stay up when just one of their PSUs goes down. We brought one of the problematic PSUs back to our test environment and were able to reproduce the issue for a while, but the PSU is now functioning correctly. This leads me to believe there is some sort of firmware issue, as a genuinely faulty PSU should not magically fix itself.
- Even after the OS crashed, the iDRAC reported a problem only on the PSU whose AC had been removed.
- The PSU that failed to stay redundant was still providing enough power to keep the iDRAC running, but not enough to keep the OS up.
- While the server was up, the PSUs were sharing the power load evenly.
I know that cold redundancy was a fairly new feature at the time of the R710's release. Maybe the PSU thinks it's in a cold redundancy state and never increases the power it's providing? Is there any way to check the state of the PSU's cold/warm redundancy setting? I know other servers expose this through raw ipmitool commands, but those are not publicly documented, so I don't know how to access them on Dell R710s.
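
For context, here is a minimal sketch of reading the standard power supply sensor records over IPMI, assuming ipmitool is installed and can reach the BMC locally. It only covers what the documented sensor records report (presence, AC lost, failure and so on); it does not expose any cold/warm redundancy setting, which is the part I am missing. The function names parse_sdr_lines and read_power_supply_sensors are made up for illustration, and the five-column pipe-delimited layout is an assumption about the output format.

```python
import subprocess


def parse_sdr_lines(text):
    """Split pipe-delimited ipmitool SDR rows into dictionaries.

    Assumes five columns per row: name | sensor id | status | entity | reading.
    Rows that do not match this layout are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 5:
            continue
        name, sensor_id, status, entity, reading = fields
        rows.append(
            {
                "name": name,
                "sensor_id": sensor_id,
                "status": status,
                "entity": entity,
                "reading": reading,
            }
        )
    return rows


def read_power_supply_sensors():
    """Run ipmitool locally and return the parsed power supply sensor rows."""
    result = subprocess.run(
        ["ipmitool", "sdr", "type", "Power Supply"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sdr_lines(result.stdout)
```

Calling read_power_supply_sensors() on the host (as root, with the IPMI kernel modules loaded) returns one dictionary per power supply sensor, so the reported state of each PSU can be compared before and after pulling AC from one of them.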