11g R2
1. ocssd.log shows:
2012-01-27 13:42:58.796: [ CSSD][19]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 223132864, wrtcnt, 1112, LATS 783238209, lastSeqNo 1111, uniqueness 1327692232, timestamp 1327693378/787089065
2. In clusters of 3 or more nodes, 2 nodes form the cluster fine, but the 3rd node joins and then fails. ocssd.log shows:
2012-02-09 11:33:53.048: [ CSSD][1120926016](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 2 nodes with leader 2, racnode2, is smaller than cohort of 2 nodes led by node 1, racnode1, based on map type 2
2012-02-09 11:33:53.048: [ CSSD][1120926016]###################################
2012-02-09 11:33:53.048: [ CSSD][1120926016]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
3. ocssd.bin startup times out after 10 minutes:
2012-04-08 12:04:33.153: [ CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1333911873
…...
2012-04-08 12:14:31.994: [ CSSD][5]clssgmShutDown: Received abortive shutdown request from client.
2012-04-08 12:14:31.994: [ CSSD][5]###################################
2012-04-08 12:14:31.994: [ CSSD][5]clssscExit: CSSD aborting from thread GMClientListener
2012-04-08 12:14:31.994: [ CSSD][5]###################################
2012-04-08 12:14:31.994: [ CSSD][5](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
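The three log signatures above can be searched for mechanically. The following is a minimal sketch, not an Oracle-supplied tool: the function name scan_ocssd_log and the labels it prints are made up for illustration, and it only matches the exact message fragments quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read ocssd.log text on stdin and print one label
# for each of the three symptom patterns quoted in the list above.
scan_ocssd_log() {
    awk '
        /has a disk HB, but no network HB/        { net = 1 }
        /Aborting local node to avoid splitbrain/ { split_brain = 1 }
        /Received abortive shutdown request/      { aborted = 1 }
        END {
            if (net)         print "network-heartbeat-missing"
            if (split_brain) print "split-brain-abort"
            if (aborted)     print "abortive-shutdown"
        }
    '
}
```

It could be run as: scan_ocssd_log < /path/to/ocssd.log (the path depends on the installation). An empty result only means none of these three messages were found, not that CSSD is healthy.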
Possible Causes:
1. Voting disk is missing or inaccessible
2. Multicast is not working (for 11.2.0.2+)
3. Private network is not working; ping or traceroute shows destination unreachable.
4. Private network is pingable with a normal ping command but not with a jumbo-frame packet size (eg: ping -s 8900) when jumbo frames are enabled (MTU: 9000+). Or some cluster nodes have jumbo frames set (MTU: 9000) and the problem node does not (MTU: 1500).
5. Gpnpd does not come up, stuck in dispatch thread, Bug 10105195
6. Too many disks discovered via asm_diskstring, or slow scan of disks due to Bug 13454354 (Solaris 11.2.0.3 only)
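For the jumbo frame cause, it helps to know the largest ping payload a given MTU can carry and whether all nodes report the same MTU. The sketch below assumes IPv4 with no IP options (20-byte IP header plus 8-byte ICMP header); the function names and the "node mtu" input format are hypothetical, and gathering the MTU values from each node is left to the reader.

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header, no options) minus 8 bytes (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Read "node mtu" pairs on stdin (hypothetical format, one pair per line).
# Print OK if every node reports the same MTU, MISMATCH otherwise.
check_mtu_consistency() {
    awk '
        NF >= 2 { seen[$2] = 1 }
        END {
            n = 0
            for (m in seen) n++
            if (n <= 1) print "OK"; else print "MISMATCH"
        }
    '
}

# Example use on Linux (the -M do flag forbids fragmentation and is
# specific to the Linux iputils ping; other platforms differ):
#   ping -M do -s "$(icmp_payload_for_mtu 9000)" <private IP of other node>
```

With MTU 9000 this gives a payload of 8972 bytes, so the 8900-byte test size mentioned above fits comfortably within a working jumbo frame path.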
Solutions:
1. Restore voting disk access by checking storage access, disk permissions etc.
If the voting disk is missing from the OCR ASM diskgroup, start CRS in exclusive mode and recreate the voting disk:
# crsctl start crs -excl
# crsctl replace votedisk <+OCRVOTE diskgroup>
2. Refer to Document 1212703.1 for the multicast test and fix.
3. Consult with the network administrator to restore private network access.
4. Engage the network administrator to enable jumbo frames at the switch layer if they are enabled on the network card.
5. Kill the gpnpd.bin process on the surviving node, refer to Document 10105195.8.
Once the above issues are resolved, restart the Grid Infrastructure stack.
If ping/traceroute all work for the private network and a failed 11.2.0.1 to 11.2.0.2 upgrade has happened, check Bug 13416559 for a workaround.
6. Limit the number of ASM disks scanned by supplying a more specific asm_diskstring, refer to Bug 13583387.
For Solaris 11.2.0.3 only, apply patch 13250497.
Issue #3: CRS-4535: Cannot communicate with Cluster Ready Services, crsd.bin is not running
Symptoms:
1. Command ‘$GRID_HOME/bin/crsctl check crs’ returns errors:
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
2. Command ‘ps -ef | grep d.bin’ does not show a line similar to:
root 23017 1 1 22:34 ? 00:00:00 /u01/app/11.2.0/grid/bin/crsd.bin reboot
3. Even if the crsd.bin process exists, command ‘crsctl stat res -t -init’ shows:
ora.crsd
1 ONLINE INTERMEDIATE
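The first symptom can be checked in a script by filtering the ‘crsctl check crs’ output for the "Cannot communicate" messages. This is a small sketch under the assumption that the messages have exactly the wording shown above; the function name crs_unreachable_services is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `crsctl check crs` output on stdin and print
# the name of each service that crsctl reports it cannot communicate with.
crs_unreachable_services() {
    sed -n 's/^CRS-[0-9]*: Cannot communicate with //p'
}
```

For example: $GRID_HOME/bin/crsctl check crs | crs_unreachable_services. For the output listed above, it would print "Cluster Ready Services" and "Event Manager", one per line.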
Possible Causes:
1. ocssd.bin is not running or resource ora.cssd is not ONLINE
2. +ASM instance cannot start up
3. OCR is inaccessible
4. Network configuration has been changed, causing a gpnp profile.xml mismatch
5. $GRID_HOME/crs/init/.pid file for crsd has been removed or renamed manually; crsd.log shows: ‘Error3 -2 writing PID to the file’
6. ocr.loc content does not match that of the other cluster nodes; crsd.log shows: ‘Shutdown CacheLocal. My hash ids don’t match’
Solutions:
1. Check the solution for Issue 2, ensure ocssd.bin is running and ora.cssd is ONLINE
2. For 11.2.0.2+, ensure that the resource ora.cluster_interconnect.haip is ONLINE; refer to Document 1383737.1 for ASM startup issues related to HAIP.
3. Ensure the OCR disk is available and accessible. If the OCR is lost for any reason, refer to Document 1062983.1 on how to restore the OCR.
4. Restore the network configuration to match the interface defined in $GRID_HOME//profiles/peer/profile.xml; refer to Document 1073502.1 for private network modifications.
5. Touch the file with .pid under $GRID_HOME/crs/init.
For 11.2.0.1, the file is owned by user.
For 11.2.0.2, the file is owned by root user.
6. Use the ocrconfig -repair command to fix the ocr.loc content, for example, as root user:
# ocrconfig -repair -add +OCR2 (to add an entry)
# ocrconfig -repair -delete +OCR2 (to remove an entry)
ohasd.bin needs to be up and running for the above commands to run.
Once the above issues are resolved, either restart the GI stack or start crsd.bin via:
# crsctl start res ora.crsd -init
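Before repairing ocr.loc, it can be useful to confirm that the local copy really differs from a copy taken from a healthy node. The sketch below is an assumption-laden illustration: the function name ocr_loc_matches is hypothetical, it ignores blank lines, lines starting with # and line order, and it assumes you have already copied the other node's file locally (for instance from /etc/oracle/ocr.loc on Linux).

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when two ocr.loc files contain the
# same settings, ignoring blank lines, '#' comment lines and line order.
ocr_loc_matches() {
    a=$(grep -v -e '^[[:space:]]*$' -e '^#' "$1" | sort)
    b=$(grep -v -e '^[[:space:]]*$' -e '^#' "$2" | sort)
    [ "$a" = "$b" ]
}

# Example use:
#   ocr_loc_matches /etc/oracle/ocr.loc /tmp/ocr.loc.racnode1 || echo "ocr.loc differs"
```

A difference found this way only indicates where to look; the repair itself is still done with ocrconfig -repair as described above.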
Issue #4: HAIP is not ONLINE (for 11.2.0.2+)
Symptoms:
Command ‘crsctl stat res -t -init’ shows:
ora.cluster_interconnect.haip
1 ONLINE OFFLINE
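The same ‘crsctl stat res -t -init’ check applies to several of these issues, so a small filter that lists every init resource whose state is not ONLINE can save time. This is a sketch that assumes the tabular layout shown above (a resource name line followed by a line of instance number, target and state); the function name offline_init_resources is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `crsctl stat res -t -init` output on stdin and
# print "<resource> <state>" for each resource whose state is not ONLINE.
offline_init_resources() {
    awk '
        # A line starting with "ora." names the resource for the rows below it.
        /^ora\./ { name = $1; next }
        # Data rows: field 1 = instance number, field 2 = target, field 3 = state.
        name != "" && NF >= 3 && $3 != "ONLINE" { print name " " $3 }
    '
}
```

For example: crsctl stat res -t -init | offline_init_resources. For the output above, it would print "ora.cluster_interconnect.haip OFFLINE".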
Possible Causes:
1. Bug 10370797: START OF ‘ORA.CLUSTER_INTERCONNECT.HAIP’ FAILED DURING UPGRADE TO 11.2.0.2 (AIX only)
2. The private network information stored in the OCR does not match the actual OS setup, eg: oifcfg getif and ifconfig output do not match (wrong interface name, subnet etc.)
Solutions:
1. Apply Patch 10370797
2. Correct the OCR configuration so that it matches the OS network configuration
More information about HAIP is provided in Document 1210883.1.
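The oifcfg versus OS comparison in cause 2 can be partly automated. The following sketch checks only interface names, not subnets; it assumes the usual four-column ‘oifcfg getif’ output (interface, subnet, scope, type) saved to a file, plus a second file listing OS interface names one per line. The function name and both file arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper.
#   $1: file holding `oifcfg getif` output
#   $2: file with one OS network interface name per line
# Prints each cluster_interconnect interface that the OS list lacks.
missing_interconnect_ifaces() {
    getif_file=$1
    os_file=$2
    awk '$4 ~ /cluster_interconnect/ { print $1 }' "$getif_file" |
    while read -r ifname; do
        # -x: whole-line match, -F: treat the name as a fixed string
        grep -qxF "$ifname" "$os_file" || echo "$ifname"
    done
}
```

No output means every interconnect interface registered in the OCR exists on the OS by name; the subnet still has to be compared by hand against the ifconfig output.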
Issue #5: ASM instance does not start, ora.asm is OFFLINE
Symptoms:
1. Command ‘ps -ef | grep asm’ shows no ASM processes
2. Command ‘crsctl stat res -t -init’ shows:
ora.asm
1 ONLINE OFFLINE
Possible Causes:
1. ASM spfile is corrupted
2. ASM discovery string is incorrect and therefore voting disk/OCR cannot be discovered
3. ASMlib configuration problem
Solutions:
1. Create a temporary pfile to start ASM instance, then recreate spfile, see Document 1095214.1 for more details.
2. Refer to Document 1077094.1 to correct the ASM discovery string.
3. Refer to Document 1050164.1 to fix ASMlib configuration.
For further debugging of GI startup issues, refer to Document 1050908.1, Troubleshoot Grid Infrastructure Startup Issues.