Wednesday 6 January 2016

RAC STARTUP ISSUES

11g R2


1. ocssd.log shows:


   2012-01-27 13:42:58.796: [ CSSD][19]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 223132864, wrtcnt, 1112, LATS 783238209, lastSeqNo 1111, uniqueness 1327692232, timestamp 1327693378/787089065


2. In a cluster of three or more nodes, two nodes form the cluster fine but the third node fails after joining; ocssd.log shows:


   2012-02-09 11:33:53.048: [ CSSD][1120926016](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 2 nodes with leader 2, racnode2, is smaller than cohort of 2 nodes led by node 1, racnode1, based on map type 2
   2012-02-09 11:33:53.048: [ CSSD][1120926016]###################################
   2012-02-09 11:33:53.048: [ CSSD][1120926016]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread


3. Ocssd.bin startup times out after 10 minutes:


   2012-04-08 12:04:33.153: [    CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1333911873
   …...
   2012-04-08 12:14:31.994: [    CSSD][5]clssgmShutDown: Received abortive shutdown request from client.
   2012-04-08 12:14:31.994: [    CSSD][5]###################################
   2012-04-08 12:14:31.994: [    CSSD][5]clssscExit: CSSD aborting from thread GMClientListener
   2012-04-08 12:14:31.994: [    CSSD][5]###################################
   2012-04-08 12:14:31.994: [    CSSD][5](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
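
These messages come from the CSSD log. To follow them while reproducing the problem, tail the log under the Grid home (the path below is the usual 11.2 location; substitute your node name for <hostname>):

   $ tail -f $GRID_HOME/log/<hostname>/cssd/ocssd.log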


Possible Causes:

1. Voting disk is missing or inaccessible

2. Multicast is not working (for 11.2.0.2+)

3. Private network is not working; ping or traceroute shows the destination is unreachable.

4. Private network is pingable with the normal ping command but not with a jumbo-frame-sized payload (e.g. ping -s 8900) when jumbo frames are enabled (MTU 9000+), or some cluster nodes have jumbo frames set (MTU 9000) while the problem node does not (MTU 1500). See the example checks after this list.

5. Gpnpd does not come up; it is stuck in its dispatch thread (Bug 10105195)

6. Too many disks are discovered via asm_diskstring, or disk scanning is slow, due to Bug 13454354 (Solaris 11.2.0.3 only)
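
A quick way to check causes 3 and 4 from the problem node is sketched below; racnode1-priv and eth1 are example names only, substitute your own private hostname and interface:

   $ /sbin/ifconfig eth1 | grep -i mtu         (MTU must match on all nodes, e.g. 9000)
   $ ping -c 2 racnode1-priv                   (normal small packets)
   $ ping -c 2 -s 8900 racnode1-priv           (near jumbo frame size, as in cause 4)
   $ traceroute racnode1-priv

If the small ping succeeds but the 8900-byte ping fails, jumbo frames are not enabled end to end (NIC, switch ports and any inter-switch links).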


Solutions:


1. Restore voting disk access by checking storage access, disk permissions, etc.
   If the voting disk is missing from the OCR ASM diskgroup, start CRS in exclusive mode and recreate the voting disk:
   # crsctl start crs -excl
   # crsctl replace votedisk <+OCRVOTE diskgroup>
   (See the verification example after this list.)

2. Refer to Document 1212703.1 for the multicast test and fix

3. Consult with the network administrator to restore private network access

4. Engage the network administrator to enable jumbo frames at the switch layer if they are enabled at the network card

5. Kill the gpnpd.bin process on the surviving node; refer to Document 10105195.8. Once the above issue is resolved, restart the Grid Infrastructure stack. If ping/traceroute all work for the private network and a failed 11.2.0.1 to 11.2.0.2 upgrade has occurred, see Bug 13416559 for a workaround.

6. Limit the number of ASM disks scanned by supplying a more specific asm_diskstring; refer to Bug 13583387. For Solaris 11.2.0.3 only, please apply patch 13250497.
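
After solution 1, a quick sanity check before restarting the stack is sketched below (+OCRVOTE above is only an example diskgroup name):

   # crsctl query css votedisk       (all voting disks should be listed as ONLINE)
   # crsctl stop crs -f              (leave exclusive mode)
   # crsctl start crs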



Issue #3: CRS-4535: Cannot communicate with Cluster Ready Services, crsd.bin is not running


Symptoms:


1. Command '$GRID_HOME/bin/crsctl check crs' returns errors:
    CRS-4638: Oracle High Availability Services is online
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4529: Cluster Synchronization Services is online
    CRS-4534: Cannot communicate with Event Manager

2. Command 'ps -ef | grep d.bin' does not show a line similar to:
    root 23017 1 1 22:34 ? 00:00:00 /u01/app/11.2.0/grid/bin/crsd.bin reboot

3. Even if the crsd.bin process exists, command 'crsctl stat res -t -init' shows:
    ora.crsd
        1    ONLINE     INTERMEDIATE
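
In either case, the crsd log usually explains why crsd.bin is missing or stuck; on 11.2 it is typically found under the Grid home (substitute your node name for <hostname>):

   $ tail -100 $GRID_HOME/log/<hostname>/crsd/crsd.log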


Possible Causes:


1. Ocssd.bin is not running or resource ora.cssd is not ONLINE
2. The +ASM instance cannot start up
3. OCR is inaccessible
4. The network configuration has been changed, causing a gpnp profile.xml mismatch (see the check commands after this list)
5. The $GRID_HOME/crs/init/.pid file for crsd has been removed or renamed manually; crsd.log shows: 'Error3 -2 writing PID to the file'
6. The ocr.loc content does not match that of the other cluster nodes; crsd.log shows: 'Shutdown CacheLocal. My hash ids don't match'
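
To check causes 4 and 6, the cluster view of the network and the ocr.loc file can be compared across nodes as sketched below (the ocr.loc path shown is the Linux location; it differs on other platforms):

   $ $GRID_HOME/bin/oifcfg getif          (interfaces registered with the cluster)
   $ /sbin/ifconfig -a                    (interfaces actually configured in the OS)
   $ $GRID_HOME/bin/gpnptool get          (dumps profile.xml; check the network section)
   $ cat /etc/oracle/ocr.loc              (run on every node and compare)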


Solutions:


1. Check the solution for Issue 2; ensure ocssd.bin is running and ora.cssd is ONLINE

2. For 11.2.0.2+, ensure that the resource ora.cluster_interconnect.haip is ONLINE; refer to Document 1383737.1 for ASM startup issues related to HAIP.

3. Ensure the OCR disk is available and accessible. If the OCR is lost for any reason, refer to Document 1062983.1 on how to restore the OCR.

4. Restore the network configuration to match the interface defined in $GRID_HOME//profiles/peer/profile.xml; refer to Document 1073502.1 for private network modifications.

5. Touch the file with .pid under $GRID_HOME/crs/init (a sketch follows after this list).
   For 11.2.0.1, the file is owned by the grid infrastructure owner.
   For 11.2.0.2, the file is owned by the root user.

6. Use the ocrconfig -repair command to fix the ocr.loc content.
   For example, as the root user:
# ocrconfig -repair -add +OCR2 (to add an entry)
# ocrconfig -repair -delete +OCR2 (to remove an entry)
ohasd.bin needs to be up and running in order for the above commands to run.
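
A sketch of solution 5 follows; the file name pattern and ownership below are assumptions for a typical install, so check the same file on a working node for the exact name, owner and permissions:

   # touch $GRID_HOME/crs/init/<hostname>.pid
   # chown root $GRID_HOME/crs/init/<hostname>.pid    (11.2.0.2+; for 11.2.0.1 use the grid infrastructure owner)
   (match the group and permission bits to the file on a working node)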


Once the above issues are resolved, either restart the GI stack or start crsd.bin via:
   # crsctl start res ora.crsd -init



Issue #4: HAIP is not ONLINE (for 11.2.0.2+)


Symptoms:

Command 'crsctl stat res -t -init' shows:
     ora.cluster_interconnect.haip
         1   ONLINE    OFFLINE


Possible Causes:

1. Bug 10370797: START OF 'ORA.CLUSTER_INTERCONNECT.HAIP' FAILED DURING UPGRADE TO 11.2.0.2 (AIX only)
2. The private network information stored in the OCR does not match the actual OS setup, e.g. oifcfg getif and ifconfig output mismatch (wrong interface name, subnet, etc.)


Solutions:

1. Apply Patch 10370797
2. Correct the OCR configuration so that it matches the OS network configuration (a sketch follows below)
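
A sketch of checking and correcting cause 2 with oifcfg is shown below; eth1 and 192.168.1.0 are example values only, so use your actual private interface and subnet:

   $ $GRID_HOME/bin/oifcfg getif                                  (what the cluster has registered)
   $ /sbin/ifconfig -a                                            (what the OS actually has)
   If they differ, as the grid owner:
   $ oifcfg setif -global eth1/192.168.1.0:cluster_interconnect
   $ oifcfg delif -global eth9                                    (remove a stale private entry, if one exists; eth9 is hypothetical)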

More information about HAIP is provided in Document 1210883.1.
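
Once the configuration is corrected and the stack restarted, HAIP can be verified roughly as follows (HAIP normally plumbs 169.254.x.x link-local addresses on virtual interfaces such as eth1:1):

   # crsctl stat res ora.cluster_interconnect.haip -init
   $ /sbin/ifconfig -a | grep -B1 "169.254"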



Issue #5: ASM instance does not start, ora.asm is OFFLINE


Symptoms:

1. Command 'ps -ef | grep asm' shows no ASM processes
2. Command 'crsctl stat res -t -init' shows:
         ora.asm
               1    ONLINE    OFFLINE


Possible Causes:

1. The ASM spfile is corrupted
2. The ASM discovery string is incorrect and therefore the voting disk/OCR cannot be discovered
3. ASMLib configuration problem (see the check commands after this list)
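
For cause 3, an ASMLib setup can be checked quickly as sketched below (only applicable if ASMLib is actually in use rather than udev or raw devices):

   # /usr/sbin/oracleasm status        (is the driver loaded and /dev/oracleasm mounted?)
   # /usr/sbin/oracleasm listdisks     (are the labelled disks visible on this node?)
   # /usr/sbin/oracleasm scandisks     (rescan if disks are missing)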

Solutions:

1. Create a temporary pfile to start the ASM instance, then recreate the spfile; see Document 1095214.1 for more details (a sketch follows this list).
2. Refer to Document 1077094.1 to correct the ASM discovery string.
3. Refer to Document 1050164.1 to fix the ASMLib configuration.
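
A minimal sketch of solution 1, starting ASM from a temporary pfile and recreating the spfile, is shown below; all parameter values, the +ASM1 SID and the +DATA diskgroup are examples only, so adjust the discovery path and diskgroup to your environment:

   $ export ORACLE_SID=+ASM1; export ORACLE_HOME=$GRID_HOME
   $ cat /tmp/asm_pfile.ora
   *.instance_type='asm'
   *.asm_diskstring='/dev/oracleasm/disks/*'
   *.asm_diskgroups='DATA'
   $ sqlplus / as sysasm
   SQL> startup pfile='/tmp/asm_pfile.ora';
   SQL> create spfile='+DATA' from pfile='/tmp/asm_pfile.ora';

After the spfile is recreated, restart the GI stack (or start the resource via 'crsctl start res ora.asm -init') so ASM picks up the new spfile.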


For further debugging of GI startup issues, please refer to Document 1050908.1, Troubleshoot Grid Infrastructure Startup Issues.





