Our reason for gathering this information was two disks going OFFLINE at the same time. As I wasn't sure whether this is normal Exadata behavior, I created a Service Request for it.
This is what we saw one day:
4_1 2017-01-04T18:54:05+01:00 warning "Data hard disk entered confinement offline status. The LUN 0_3 changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped.
Status                 : WARNING - CONFINEDOFFLINE
Manufacturer           : HGST
Model Number           : XXXXN4.0T
Size                   : 4.0T
Serial Number          : XXXXXX
Firmware               : A2D2
Slot Number            : 3
Cell Disk              : CD_03_exa1celadm03
Grid Disk              : RECO_CD_03_exa1celadm03, DATA_CD_03_exa1celadm03, DBFS_DG_CD_03_exa1celadm03
Reason for confinement : service time is high relative to other disks with similar workloads"

4_2 2017-01-04T18:59:03+01:00 clear "Data hard disk status changed to normal.
Status                 : NORMAL
Manufacturer           : HGST
Model Number           : XXXXN4.0T
Size                   : 4.0TB
Serial Number          : XXXXXX
Firmware               : A2D2
Slot Number            : 3
Cell Disk              : CD_03_exa1celadm03
Grid Disk              : DATA_CD_03_exa1celadm03, DBFS_DG_CD_03_exa1celadm03, RECO_CD_03_exa1celadm03"
After analysis, Oracle came back and said this was expected behavior, as disk utilization had gone to 100% for a while. I still don't understand why the disks go OFFLINE in that case, but perhaps there is good logic behind it.
They said I could use cellcli to identify which database consumed the most I/O at that time, using the metrics DB_IO_UTIL_LG and DB_IO_UTIL_SM, and they also referenced the note "Tool for Gathering I/O Resource Manager Metrics: metric_iorm.pl (Doc ID 1337265.1)".
I hadn't seen this script before. The note says you can also view the metrics in a browser via Enterprise Manager, but to me the analysis seems a bit easier when you start with metric_iorm.pl over a given time range, identify the problematic database, and then run an AWR report there to see what caused the I/O load.
There are a few different ways of running the script. Either you run it without parameters, so it shows the current cell metrics, or you give it a time range similar to what you would give cellcli, and it produces output in one-minute intervals from the cell you run it on. It helps to know the exact time of your issue!
./metric_iorm.pl where collectionTime > '2017-01-11T18:40:00+01:00' and collectionTime < '2017-01-11T19:00:00+01:00'
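If you script this kind of query, it is easy to swap the start and end of the range by accident (the filter silently matches nothing). A small helper sketch in Python (the function name is my own, not part of metric_iorm.pl) that builds the filter string and guards against an inverted range:

```python
from datetime import datetime

# Hypothetical helper (not part of metric_iorm.pl): build the
# "where collectionTime > ... and collectionTime < ..." filter
# and refuse a range whose start is not earlier than its end.
def iorm_time_filter(start, end):
    fmt = "%Y-%m-%dT%H:%M:%S%z"  # matches 2017-01-11T18:40:00+01:00
    if datetime.strptime(start, fmt) >= datetime.strptime(end, fmt):
        raise ValueError("start of range must be earlier than end")
    return ("where collectionTime > '{}' and collectionTime < '{}'"
            .format(start, end))

print(iorm_time_filter("2017-01-11T18:40:00+01:00",
                       "2017-01-11T19:00:00+01:00"))
```

The timestamp format is the same one cellcli itself shows in the alerts above, so you can paste times straight from the alert history.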
The output will look similar to this for each of your databases (in this example I picked only one):
Database: OBIEE
  Utilization:     Small=0%    Large=0%
  Flash Cache:     IOPS=1985   Space allocated=846380MB
  Disk Throughput: MBPS=3
  Small I/O's:     IOPS=3.717  Avg qtime=2.2ms
  Large I/O's:     IOPS=11.633 Avg qtime=1.2ms
  Consumer Group: _ORACLE_MEDPRIBG_GROUP_
    Utilization:     Small=0%   Large=0%
    Flash Cache:     IOPS=1.400
    Disk Throughput: MBPS=0
    Small I/O's:     IOPS=0.300 Avg qtime=1.9ms
    Large I/O's:     IOPS=0.050 Avg qtime=8.0ms
  Consumer Group: _ORACLE_LOWPRIFG_GROUP_
    Utilization:     Small=0%   Large=0%
    Flash Cache:     IOPS=0.000
    Disk Throughput: MBPS=0
    Small I/O's:     IOPS=0.000 Avg qtime=0.0ms
    Large I/O's:     IOPS=0.000 Avg qtime=0.0ms
  Consumer Group: OTHER_GROUPS
    Utilization:     Small=0%   Large=0%
    Flash Cache:     IOPS=1983
    Disk Throughput: MBPS=3
    Small I/O's:     IOPS=2.100  Avg qtime=2.9ms
    Large I/O's:     IOPS=11.583 Avg qtime=1.2ms
  Consumer Group: _ORACLE_LOWPRIBG_GROUP_
    Utilization:     Small=0%   Large=0%
    Flash Cache:     IOPS=0.500
    Disk Throughput: MBPS=0
    Small I/O's:     IOPS=0.000 Avg qtime=0.0ms
    Large I/O's:     IOPS=0.000 Avg qtime=0.0ms
  Consumer Group: _ORACLE_BACKGROUND_GROUP_
    Utilization:     Small=0%   Large=0%
    Flash Cache:     IOPS=0.400
    Disk Throughput: MBPS=0
    Small I/O's:     IOPS=1.317 Avg qtime=1.2ms
    Large I/O's:     IOPS=0.000 Avg qtime=0.0ms
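With several databases in the output, picking out the heaviest consumer is mechanical. A rough sketch of that step in Python (the helper name and the shortened sample text are my own, paraphrased from the output format above):

```python
import re

def top_db_by_mbps(text):
    """Return the database whose section reports the highest
    'Disk Throughput: MBPS=' value. Consumer-group lines belong
    to the most recently seen database and never exceed the
    database total, so a simple running max works."""
    current, best_db, best_mbps = None, None, -1.0
    for line in text.splitlines():
        m = re.match(r"Database:\s+(\S+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.search(r"Disk Throughput:\s+MBPS=([\d.]+)", line)
        if m and current is not None and float(m.group(1)) > best_mbps:
            best_db, best_mbps = current, float(m.group(1))
    return best_db

# Hypothetical sample, heavily shortened from real metric_iorm.pl output
sample = """Database: OBIEE
  Disk Throughput: MBPS=3
Database: DWH
  Disk Throughput: MBPS=410
"""
print(top_db_by_mbps(sample))  # DWH
```

Once you know the top database, that is where you would run the AWR report.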
And at the end you will get a summary:
CELL METRICS SUMMARY
Cell Total Utilization:      Small=1%     Large=61%
Cell Total Flash Cache:      IOPS=2488.05 Space allocated=1210491MB
Cell Total Disk Throughput:  MBPS=452.231
Cell Total Small I/O's:      IOPS=108.935
Cell Total Large I/O's:      IOPS=461.868
Cell Avg small read latency:  16.05 ms
Cell Avg small write latency:  0.06 ms
Cell Avg large read latency:   9.00 ms
Cell Avg large write latency:  0.11 ms
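Since the confinement in our case was triggered by utilization hitting 100%, checking the summary line first is the obvious shortcut. A minimal sketch (the function name and the 90% alerting threshold are my own choices, not anything Oracle documents):

```python
import re

def utilization_from_summary(text):
    """Pull the Small/Large percentages out of the
    'Cell Total Utilization' summary line."""
    m = re.search(r"Cell Total Utilization:\s+Small=(\d+)%\s+Large=(\d+)%",
                  text)
    return {"small": int(m.group(1)), "large": int(m.group(2))}

summary = "Cell Total Utilization: Small=1% Large=61%"
util = utilization_from_summary(summary)
# 90% is an arbitrary threshold I picked, not an Oracle limit
if max(util.values()) > 90:
    print("cell disks close to saturation - confinement risk")
```

If the Large utilization sits near 100% in the minutes before the alert, you have your smoking gun before even opening AWR.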
I guess in this situation I should also have analyzed the disk utilization, but the metrics from the time of the issue had already been purged.
I’ll definitely keep this script for future issues. Seems useful!