ASM "corrupted metadata block" check via amdu / kfed (Part 1)

Last week our Exadata ASM instance crashed. We were not amused, but we restarted the instance and went back to work as usual.

About the environment: the Grid Infrastructure software is release 12.1, but the diskgroups are set to compatibility 11.2.0.4.

To be safe, we started a check on the DATA diskgroup.


ALTER DISKGROUP DATA CHECK all NOREPAIR;

The check ran online, but it took nearly 25 hours.
In the meantime we saw lots of errors in the ASM alert.log:

Tue Jun 21 15:47:15 2016
NOTE: disk DATA_CD_10_srv1CD13, used AU total mismatch: DD={514269, 0} AT={514270, 0}
Tue Jun 21 15:47:15 2016
GMON querying group 1 at 567 for pid 52, osid 138892
GMON checking disk 143 for group 1 at 568 for pid 52, osid 138892

A MOS note said this should not be a problem, but is that correct?
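To see how many disks are affected and how far the two counters differ, the mismatch lines can be pulled out of the alert.log with a small script. This is only a sketch: the pattern is modelled on the single line quoted above, the function name find_au_mismatches is made up for illustration, and other releases may word the message differently.

```python
import re

# Pattern modelled on the one alert.log line quoted in this post (assumption:
# the message format is stable across the log).
MISMATCH_RE = re.compile(
    r"NOTE: disk (?P<disk>\S+), used AU total mismatch: "
    r"DD=\{(?P<dd>\d+), \d+\} AT=\{(?P<at>\d+), \d+\}"
)


def find_au_mismatches(lines):
    """Return (disk, dd_count, at_count, at_minus_dd) for each mismatch line."""
    results = []
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            dd = int(match.group("dd"))
            at = int(match.group("at"))
            results.append((match.group("disk"), dd, at, at - dd))
    return results


sample = [
    "Tue Jun 21 15:47:15 2016",
    "NOTE: disk DATA_CD_10_srv1CD13, used AU total mismatch: "
    "DD={514269, 0} AT={514270, 0}",
    "GMON querying group 1 at 567 for pid 52, osid 138892",
]

for disk, dd, at, diff in find_au_mismatches(sample):
    print(f"{disk}: DD={dd} AT={at} difference={diff}")
```

For the line from our alert.log the two counters differ by exactly one AU.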

The analysis is done with an amdu dump of the diskgroup once the CHECK ALL NOREPAIR has finished.


amdu -diskstring 'o/*/*' -dump 'DATA'

Yes, we started the dump in a directory with enough free space, because the amdu tool creates a lot of 2 GB files, depending on the size of the diskgroup. The dump also creates one small file, report.txt.
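Before starting the dump it can help to check the free space in the target directory. The sketch below does that with Python's standard library; the helper name has_enough_space and the required size are assumptions for illustration only, since the real space requirement depends on the diskgroup.

```python
import shutil


def has_enough_space(directory, required_bytes):
    """True if the filesystem holding `directory` has at least required_bytes free."""
    return shutil.disk_usage(directory).free >= required_bytes


# Hypothetical requirement: room for ten dump files of 2 GB each.
REQUIRED_BYTES = 10 * 2 * 1024**3

if has_enough_space(".", REQUIRED_BYTES):
    print("enough free space for the assumed dump size")
else:
    print("not enough free space, choose another directory")
```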

The report.txt contains information about the system, OS, and version, lists all scanned disks, and shows which of them have "corrupted metadata blocks".

Here is an example:

---------------------------- SCANNING DISK N0002 -----------------------------

Disk N0002: '192.168.10.10/DATA_CD_01_srv1cd2'

AMDU-00209: Corrupt block found: Disk N0002 AU [454272] block [0] type [0]

AMDU-00201: Disk N0002: '192.168.10.10/DATA_CD_01_srv1cd2'

AMDU-00217: Message 217 not found;  product=RDBMS; facility=AMDU; arguments: [0] [1024] [blk_kfbl]

           Allocated AU's: 507621

                Free AU's: 57627

       AU's read for dump: 194

       Block images saved: 12457

        Map lines written: 194

          Heartbeats seen: 0

  Corrupt metadata blocks: 1

        Corrupt AT blocks: 0
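With many disks in a diskgroup, reading report.txt by eye gets tedious. The following sketch collects the per-disk "Corrupt metadata blocks" counters and the AU and block numbers from the AMDU-00209 lines. The regular expressions are modelled on the excerpt above, and the function name summarize_report is made up for illustration; a report from another amdu version may look different.

```python
import re

# Patterns modelled on the report.txt excerpt in this post (assumption).
DISK_RE = re.compile(r"-+ SCANNING DISK (?P<disk>N\d+) -+")
CORRUPT_BLOCK_RE = re.compile(
    r"AMDU-00209: Corrupt block found: Disk (?P<disk>N\d+) "
    r"AU \[(?P<au>\d+)\] block \[(?P<block>\d+)\] type \[(?P<type>\d+)\]"
)
COUNT_RE = re.compile(r"Corrupt metadata blocks:\s*(?P<count>\d+)")


def summarize_report(text):
    """Return ({disk: corrupt_block_count}, [(disk, au, block), ...])."""
    corrupt_counts = {}
    corrupt_blocks = []
    current_disk = None
    for line in text.splitlines():
        match = DISK_RE.search(line)
        if match:
            current_disk = match.group("disk")
            continue
        match = CORRUPT_BLOCK_RE.search(line)
        if match:
            corrupt_blocks.append(
                (match.group("disk"), int(match.group("au")), int(match.group("block")))
            )
            continue
        match = COUNT_RE.search(line)
        if match and current_disk is not None:
            corrupt_counts[current_disk] = int(match.group("count"))
    return corrupt_counts, corrupt_blocks


sample_report = """\
---------------------------- SCANNING DISK N0002 -----------------------------
Disk N0002: '192.168.10.10/DATA_CD_01_srv1cd2'
AMDU-00209: Corrupt block found: Disk N0002 AU [454272] block [0] type [0]
           Allocated AU's: 507621
  Corrupt metadata blocks: 1
        Corrupt AT blocks: 0
"""

counts, blocks = summarize_report(sample_report)
for disk, count in counts.items():
    if count > 0:
        print(f"{disk}: {count} corrupt metadata block(s)")
for disk, au, block in blocks:
    print(f"{disk}: AU {au}, block {block}")
```

In a real run the text would come from reading the report.txt file in the dump directory instead of the inline sample.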

The next question was: "How can we check whether this metadata block is corrupted?"

The answer: you need the kfed tool and the "Allocated AU's: 507621" value from the report.txt file.


[oracle0@srv1db1]$ kfed read aun=507621 aus=4194304 blkn=0 dev=o/192.168.10.10/DATA_CD_00_srv1cd2

kfbh.endian:                         58 ; 0x000: 0x3a

kfbh.hard:                          162 ; 0x001: 0xa2

kfbh.type:                            0 ; 0x002: KFBTYP_INVALID

kfbh.datfmt:                          0 ; 0x003: 0x00

kfbh.block.blk:              1477423104 ; 0x004: blk=1477423104

kfbh.block.obj:              3200986444 ; 0x008: disk=732492

kfbh.check:                    67174540 ; 0x00c: 0x0401008c

kfbh.fcn.base:                    51826 ; 0x010: 0x0000ca72

kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000

kfbh.spare1:                          0 ; 0x018: 0x00000000

kfbh.spare2:                          0 ; 0x01c: 0x00000000

1EFB9400000 0000A23A 580FB000 BECB2D4C 0401008C  [:......XL-......]

1EFB9400010 0000CA72 00000000 00000000 00000000  [r...............]

1EFB9400020 00000000 00000000 00000000 00000000  [................]

  Repeat 253 times

As you can see in the example, kfbh.type is KFBTYP_INVALID, which means the metadata block is corrupt.
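As a cross-check, the raw hex dump at the end of the kfed output can be decoded by hand. The sketch below rebuilds the first 16 bytes from the four 32-bit words of the first dump line and unpacks them into the kfbh fields, and it converts the byte offset at the start of that line back into an AU number. The field layout is taken only from the offsets kfed prints above (0x000 to 0x00c); the function names are made up for illustration.

```python
import struct

AU_SIZE = 4194304  # the aus= value passed to kfed in this post


def au_from_offset(offset_hex, au_size=AU_SIZE):
    """Convert the byte offset printed at the start of a dump line to an AU number."""
    return int(offset_hex, 16) // au_size


def decode_header(words_hex):
    """Decode the first 16 header bytes from four 32-bit words of the hex dump."""
    # The dump prints each word as a number; in this output the low byte of
    # the first word (0x3a) is the byte at offset 0, so the words are
    # packed back into bytes as little-endian.
    raw = b"".join(struct.pack("<I", int(word, 16)) for word in words_hex)
    endian, hard, btype, datfmt, blk, obj, check = struct.unpack("<BBBBIII", raw)
    return {
        "endian": endian,
        "hard": hard,
        "type": btype,
        "datfmt": datfmt,
        "block.blk": blk,
        "block.obj": obj,
        "check": check,
    }


header = decode_header(["0000A23A", "580FB000", "BECB2D4C", "0401008C"])
for name, value in header.items():
    print(f"kfbh.{name}: {value}")
print("AU number:", au_from_offset("1EFB9400000"))
```

The decoded values match the kfbh lines printed by kfed (type 0, block.blk 1477423104, check 67174540), and the offset 1EFB9400000 corresponds to AU 507621 at a 4 MB AU size, which is the aun value given on the command line.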

So how can we fix this?

In our situation the diskgroup has compatibility 11.2.0.4, so we have to start a


ALTER DISKGROUP DATA CHECK ALL REPAIR;

Yeah, this can be very dangerous.

If CHECK ALL REPAIR finds a corruption and tries to repair it, the diskgroup will be dismounted.

This means all databases that are up and running on this diskgroup will crash.

Also keep in mind that a CHECK ALL REPAIR will run for about 25 hours as well.

Is there another solution?
Yes, but it also requires dismounting the diskgroup.

Then run the amdu tool again "offline" and check the report.txt file again for corrupted metadata blocks.

More details will follow in Part 2 about ASM kfed and amdu.

So stay tuned.

 


 


About spa

Oracle and Unix professional with a main focus on Oracle HA systems, and also an Exadata enthusiast.