This is more of an informative story about what happened to us recently.

In the past we had an issue with one of the Exadata cell nodes where the RAID HBA card failed. After working on the case, support decided to replace the card and update its firmware at the same time.

That issue is described in detail in this note:

Exadata X5/X6 reports “Disk controller was hung. Cell was power cycled to stop the hang.” and SAS HBA logs report correctable errors on SW images prior to (Doc ID 2176276.1)

This time one of the database nodes went down and, surprise surprise, it was the RAID HBA card again!

The only problem was that the replacement part was 24 hours away and we had some single-node databases running on this node. So we had to find a feasible workaround.

Our first idea was to take the card from one of the cell nodes: we run ASM high redundancy and thought we could survive 24 hours on a weekend with one cell node down.

If you are not familiar with ASM redundancy on Exadata, here is a good short summary of it:
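
The reasoning behind running without one cell node can be sketched as simple arithmetic. This is only a rough illustration, assuming the standard ASM mirroring levels (two copies for normal redundancy, three for high) and one failure group per cell node. The function names are hypothetical, and the sketch ignores quorum and voting-disk requirements as well as rebalance behaviour.

```python
# Copies ASM keeps of each extent per redundancy level (standard ASM mirroring).
MIRROR_COPIES = {"external": 1, "normal": 2, "high": 3}


def remaining_copies(redundancy: str, failed_cells: int) -> int:
    """Worst-case number of copies left after losing `failed_cells` cell nodes.

    Assumption: each cell node is its own failure group, so every lost cell
    can take away at most one copy of any given extent.
    """
    copies = MIRROR_COPIES[redundancy]
    return max(copies - failed_cells, 0)


def data_survives(redundancy: str, failed_cells: int) -> bool:
    """True if at least one copy of every extent is still available."""
    return remaining_copies(redundancy, failed_cells) >= 1


if __name__ == "__main__":
    for level in ("normal", "high"):
        for failed in (1, 2):
            print(level, failed, remaining_copies(level, failed),
                  data_survives(level, failed))
```

With high redundancy, taking one cell node offline still leaves two copies of every extent, so one further cell failure during those 24 hours would not lose data. With normal redundancy the same move would leave a single copy and no margin.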

Once on site, we came up with the idea of using a similar card from our Oracle Platinum Gateway server instead. It is still a component covered by Oracle’s support, and we wouldn’t need to touch the cell node, which would have involved more risk.

Oracle’s Field Engineer replaced the part and we got our database node up! One issue remained: the InfiniBand link did not work, because the InfiniBand port had been automatically disabled during the node crash.

After enabling the port on the InfiniBand switch, everything was running properly again.

If you have a similar critical issue, it’s worth remembering that there might be an alternative if the replacement part is not nearby. Of course the part should have been available, but that’s a different issue.

The RAID card model is Oracle Storage 12 Gb SAS PCIe RAID HBA.