ASM „corrupted metadata block“ check via amdu / kfed (Part 1)


Last week our Exadata ASM instance crashed. We were not amused, but we restarted the instance and went back to work as usual.

About the environment: the GRID software is release 12.1, but the diskgroups are compatible 11.2.0.4.

To be safe, we started a check on the DATA diskgroup.


ALTER DISKGROUP DATA CHECK all NOREPAIR;

The check ran online, but it took nearly 25 hours.
In the meantime we saw lots of messages like these in the ASM alert.log:

Tue Jun 21 15:47:15 2016
NOTE: disk DATA_CD_10_srv1CD13, used AU total mismatch: DD={514269, 0} AT={514270, 0}
Tue Jun 21 15:47:15 2016
GMON querying group 1 at 567 for pid 52, osid 138892
GMON checking disk 143 for group 1 at 568 for pid 52, osid 138892
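
To get an overview of which disks report such a mismatch, the alert.log can be filtered with a few lines of Python. This is only a minimal sketch: the pattern is derived from the single message shown above, the function name is made up for illustration, and reading DD and AT as "disk directory" and "allocation table" is our assumption.

```python
import re

# Pattern derived from the one alert.log line shown above (assumption:
# other ASM versions may format this message differently).
MISMATCH_RE = re.compile(
    r"NOTE: disk ([^,\s]+), used AU total mismatch: "
    r"DD=\{(\d+), \d+\} AT=\{(\d+), \d+\}"
)


def find_au_mismatches(log_text):
    """Return (disk, dd_count, at_count, at_minus_dd) for each mismatch line."""
    rows = []
    for match in MISMATCH_RE.finditer(log_text):
        disk = match.group(1)
        dd, at = int(match.group(2)), int(match.group(3))
        rows.append((disk, dd, at, at - dd))
    return rows


if __name__ == "__main__":
    sample = (
        "NOTE: disk DATA_CD_10_srv1CD13, used AU total mismatch: "
        "DD={514269, 0} AT={514270, 0}"
    )
    for row in find_au_mismatches(sample):
        print(row)
```

For the message above, the two counters differ by exactly one AU.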

A MOS note said this should not be a problem but is this correct …?

The analysis was done with an amdu dump of the diskgroup once the „CHECK ALL NOREPAIR“ had finished.


amdu -diskstring 'o/*/*' -dump 'DATA'

Yes, we started the dump in a directory with enough free space, because the amdu tool creates a lot of 2 GB files, depending on the size of the diskgroup. The dump also creates one small file, the report.txt file.

The report.txt contains information about the system, OS and version, lists all scanned disks, and shows which of the scanned disks have „corrupted metadata blocks“.

Here is an example:

---------------------------- SCANNING DISK N0002 -----------------------------

Disk N0002: '192.168.10.10/DATA_CD_01_srv1cd2'

AMDU-00209: Corrupt block found: Disk N0002 AU [454272] block [0] type [0]

AMDU-00201: Disk N0002: '192.168.10.10/DATA_CD_01_srv1cd2'

AMDU-00217: Message 217 not found;  product=RDBMS; facility=AMDU; arguments: [0] [1024] [blk_kfbl]

           Allocated AU's: 507621

                Free AU's: 57627

       AU's read for dump: 194

       Block images saved: 12457

        Map lines written: 194

          Heartbeats seen: 0

  Corrupt metadata blocks: 1

        Corrupt AT blocks: 0
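
With many disks the report.txt gets long, so it can help to extract the interesting lines with a script. The following is a minimal sketch, not part of amdu: the regular expressions are built only from the excerpt above, the function names are hypothetical, and other amdu versions may format the report differently.

```python
import re

# Patterns follow the report.txt excerpt above (assumption: the layout is
# the same for every scanned disk and for other amdu versions).
SCAN_RE = re.compile(r"SCANNING DISK (N\d+)")
COUNT_RE = re.compile(r"Corrupt metadata blocks:\s*(\d+)")
BLOCK_RE = re.compile(
    r"AMDU-00209: Corrupt block found: Disk (N\d+) "
    r"AU \[(\d+)\] block \[(\d+)\] type \[(\d+)\]"
)


def disks_with_corrupt_metadata(report_text):
    """Return {disk_id: corrupt_block_count} for disks with a count > 0."""
    result = {}
    current = None
    for line in report_text.splitlines():
        scan = SCAN_RE.search(line)
        if scan:
            current = scan.group(1)  # remember which disk section we are in
            continue
        count = COUNT_RE.search(line)
        if count and current is not None and int(count.group(1)) > 0:
            result[current] = int(count.group(1))
    return result


def corrupt_block_locations(report_text):
    """Return (disk_id, au, block, type) for every AMDU-00209 message."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))
        for m in BLOCK_RE.finditer(report_text)
    ]


if __name__ == "__main__":
    sample = """\
---------------------------- SCANNING DISK N0002 -----------------------------
AMDU-00209: Corrupt block found: Disk N0002 AU [454272] block [0] type [0]
  Corrupt metadata blocks: 1
        Corrupt AT blocks: 0
"""
    print(disks_with_corrupt_metadata(sample))
    print(corrupt_block_locations(sample))
```

In practice the text would come from `open("report.txt").read()` in the dump directory.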

The next question was: „How can we check if this metadata block is corrupted?“

The answer: you need the kfed tool and the „Allocated AU's: 507621“ value from the report.txt file.


[oracle0@srv1db1]$ kfed read aun=507621 aus=4194304 blkn=0 dev=o/192.168.10.10/DATA_CD_00_srv1cd2

kfbh.endian:                         58 ; 0x000: 0x3a

kfbh.hard:                          162 ; 0x001: 0xa2

kfbh.type:                            0 ; 0x002: KFBTYP_INVALID

kfbh.datfmt:                          0 ; 0x003: 0x00

kfbh.block.blk:              1477423104 ; 0x004: blk=1477423104

kfbh.block.obj:              3200986444 ; 0x008: disk=732492

kfbh.check:                    67174540 ; 0x00c: 0x0401008c

kfbh.fcn.base:                    51826 ; 0x010: 0x0000ca72

kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000

kfbh.spare1:                          0 ; 0x018: 0x00000000

kfbh.spare2:                          0 ; 0x01c: 0x00000000

1EFB9400000 0000A23A 580FB000 BECB2D4C 0401008C  [:......XL-......]

1EFB9400010 0000CA72 00000000 00000000 00000000  [r...............]

1EFB9400020 00000000 00000000 00000000 00000000  [................]

  Repeat 253 times
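
The address column of the dump can be cross-checked against the kfed arguments. The sketch below assumes that the address is the byte offset from the start of the disk and that a metadata block is 4096 bytes long. The block size is our assumption; it fits the dump, where 3 printed lines plus 253 repeats give 256 lines of 16 bytes each. The function name is hypothetical.

```python
def block_byte_offset(aun, aus, blkn=0, block_size=4096):
    """Byte offset of block `blkn` inside allocation unit `aun`.

    aun        -- allocation unit number (kfed argument aun=)
    aus        -- allocation unit size in bytes (kfed argument aus=)
    blkn       -- block number inside the AU (kfed argument blkn=)
    block_size -- metadata block size in bytes (assumed to be 4096)
    """
    return aun * aus + blkn * block_size


if __name__ == "__main__":
    offset = block_byte_offset(aun=507621, aus=4194304, blkn=0)
    print(f"{offset:X}")  # prints 1EFB9400000
```

The result matches the first address of the dump, so kfed really read block 0 of AU 507621 with an AU size of 4 MB.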

As you can see in the example, „kfbh.type = KFBTYP_INVALID“, which means the metadata block is corrupt.
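
The header fields can also be reproduced from the raw hex words of the dump, which is a useful sanity check when only a dump is available. This is a sketch under assumptions: the field order and offsets (0x000 to 0x01c) are taken from the kfed output above, the words are treated as little-endian 32-bit values, and the format string and function names are ours, not an official description of the block header.

```python
import struct

# Four single bytes followed by seven 32-bit little-endian words = 32 bytes.
# Layout assumed from the offsets kfed prints above (0x000 .. 0x01c).
KFBH_FORMAT = "<4B7I"
KFBH_FIELDS = (
    "endian", "hard", "type", "datfmt",
    "block.blk", "block.obj", "check",
    "fcn.base", "fcn.wrap", "spare1", "spare2",
)


def words_to_bytes(words):
    """Turn the 32-bit hex words of a dump line back into raw bytes."""
    return b"".join(struct.pack("<I", int(word, 16)) for word in words)


def decode_kfbh(raw):
    """Decode the first 32 bytes of a block into a dict of header fields."""
    size = struct.calcsize(KFBH_FORMAT)
    return dict(zip(KFBH_FIELDS, struct.unpack(KFBH_FORMAT, raw[:size])))


if __name__ == "__main__":
    # First two lines of the hex dump above.
    dump_words = [
        "0000A23A", "580FB000", "BECB2D4C", "0401008C",
        "0000CA72", "00000000", "00000000", "00000000",
    ]
    header = decode_kfbh(words_to_bytes(dump_words))
    for name, value in header.items():
        print(f"kfbh.{name}: {value}")
    # kfed prints KFBTYP_INVALID for type 0 in the output above.
    print("type is 0 (KFBTYP_INVALID):", header["type"] == 0)
```

The decoded values agree with the kfed output, for example kfbh.block.blk = 1477423104 and kfbh.check = 67174540.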

So how can we fix this?

In our situation the diskgroup is compatible 11.2.0.4, so we have to run:


ALTER DISKGROUP DATA CHECK ALL REPAIR;

Yes, this can be very dangerous.

If „CHECK ALL REPAIR“ finds a corruption and tries to repair it, the diskgroup will be dismounted.

This means all databases that are up and running will crash.

And keep in mind that a „CHECK ALL REPAIR“ will also run for about 25 hours.

Is there another solution?
Yes, but it also requires dismounting the diskgroup.

Then run the amdu tool „OFFLINE“ again and check the report.txt file for corrupted metadata blocks once more.

More details on ASM kfed and amdu will follow in Part 2.

So stay tuned.

 


 


Exadata Lifecycle / Patching

Operating an Exadata Database Machine means you have to manage its lifecycle. One major task is the regular patching of the whole Exadata stack.

This blog article gives you an overview of the patching.

First, remember which components are part of the lifecycle.

The following list shows each component and its tool.


  • GRID & RDBMS
    • opatch (oplan)
  • DB Node
    • patchmgr (that’s new since Oct 2015)
  • Storage Grid
    • patchmgr
  • Network
    • patchmgr

Before starting the patching you need bulletproof planning, otherwise you will fail.

For a Quarter Rack with, let's say, 10 production databases you need a planning phase of roughly 2-3 weeks.

How do you set up a recommendation?

  • Analyze your ORACLE_HOMES
  • Check existing SR for every database
  • Meet with your Application Manager
  • Use Oracle Tools like exachk
  • Use the conflict analyzer in MOS

exachk will be your best friend

Check My Oracle Support Note 1070954.1 and install the latest version.

First take a look at the table of contents

(Screenshot: exachk report, table of contents)

and one very important table is the recommended version overview

(Screenshot: exachk report, recommended version overview)

What is the best recommendation?

There is no easy answer, because Oracle offers several options for patching:

  • the QFSDP, the Quarterly Full Stack Download Patch
  • or standalone patch sets for every component, such as InfiniBand, Cell Server, DB node and so on

So the decision has to be made by the whole team of application managers, Oracle DBAs and system administrators.
