ORA-29740 in Cluster Databases

ORA-29740

ORA-29740 is an error specific to cluster databases. It indicates that an instance has been evicted from the cluster by another member. Many kinds of problems can trigger it, but the usual culprits are performance problems or hardware faults. Let's look at a real case in which instance #1 was evicted by instance #2 because of a performance issue.

In my case, instance #2 found that it could no longer communicate reliably with instance #1, so it initiated the eviction of instance #1 from the cluster. The alert log of instance #2 shows what happened:

...
Sun Aug 22 18:27:35 2010
Communications reconfiguration: instance 0
Evicting instance 1 from cluster
Sun Aug 22 18:28:01 2010
Waiting for instances to leave:
1
Sun Aug 22 18:28:08 2010
Trace dumping is performing id=[30224035720]
Sun Aug 22 18:28:21 2010
Waiting for instances to leave:
1
Sun Aug 22 18:28:41 2010
Waiting for instances to leave:
1
Sun Aug 22 18:29:01 2010
Waiting for instances to leave:
1
Sun Aug 22 18:29:21 2010
Waiting for instances to leave:
1
...
Sun Aug 22 18:36:20 2010
Reconfiguration started (old inc 9, new inc 10)
List of nodes:
 1
Sun Aug 22 18:36:20 2010
Reconfiguration started (old inc 9, new inc 11)
List of nodes:
 1
 Nested/batched reconfiguration detected.
 Global Resource Directory frozen
one node partition
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Resources and enqueues cleaned out
 Resources remastered 14197
 251008 GCS shadows traversed, 20 cancelled, 16610 closed
 234471 GCS resources traversed, 0 cancelled
 set master node info
 Submitted all remote-enqueue requests
 Update rdomain variables
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 251008 GCS shadows traversed, 0 replayed, 16629 unopened
 Submitted all GCS remote-cache requests
 0 write requests issued in 234379 GCS resources
 5 PIs marked suspect, 0 flush PI msgs
Sun Aug 22 18:36:27 2010
Reconfiguration complete
 Post SMON to start 1st pass IR
Sun Aug 22 18:36:27 2010
Instance recovery: looking for dead threads
Sun Aug 22 18:36:27 2010
ARC9: Completed archiving  log 10 thread 2 sequence 869333
Sun Aug 22 18:36:27 2010
Beginning instance recovery of 1 threads
Sun Aug 22 18:36:27 2010
Started redo scan
Sun Aug 22 18:36:29 2010
Completed redo scan
 125469 redo blocks read, 3706 data blocks need recovery
...
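
The time between the eviction order and the end of reconfiguration can be read straight from the alert log. Here is a minimal Python sketch, assuming the log layout shown above (a timestamp line followed by the messages it applies to). The function names and the `SAMPLE_ALERT` excerpt are illustrative only and are not part of any Oracle tool.

```python
from datetime import datetime

# Timestamp layout used in the alert log excerpt, e.g. "Sun Aug 22 18:27:35 2010".
# Note: %a and %b are locale-dependent; this assumes an English/C locale.
TS_FORMAT = "%a %b %d %H:%M:%S %Y"

# Shortened excerpt of the alert log of instance #2 (hypothetical variable name).
SAMPLE_ALERT = """\
Sun Aug 22 18:27:35 2010
Communications reconfiguration: instance 0
Evicting instance 1 from cluster
...
Sun Aug 22 18:36:27 2010
Reconfiguration complete
"""

def parse_alert_log(text):
    """Yield (timestamp, message) pairs; each message takes the latest timestamp line above it."""
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        try:
            current = datetime.strptime(line, TS_FORMAT)
            continue  # a timestamp line carries no message of its own
        except ValueError:
            pass
        if current is not None:
            yield current, line

def eviction_duration(text):
    """Return the time from the first 'Evicting instance' to the next 'Reconfiguration complete'."""
    start = None
    for ts, msg in parse_alert_log(text):
        if start is None and msg.startswith("Evicting instance"):
            start = ts
        elif start is not None and msg.startswith("Reconfiguration complete"):
            return ts - start
    return None  # one of the two markers was not found

if __name__ == "__main__":
    print(eviction_duration(SAMPLE_ALERT))
```

For the excerpt above, the two markers are 18:27:35 and 18:36:27, so the eviction took close to nine minutes from the survivor's point of view.
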
The alert log of instance #1 shows that it hit ORA-29740 and had to be shut down:
...
Sun Aug 22 18:32:15 2010
Trace dumping is performing id=[30224035720]
Sun Aug 22 18:32:53 2010
SMON: terminating instance due to error 481
Sun Aug 22 18:34:15 2010
Errors in file /oracle/admin/ORCL/bdump/orcl_lmon_7234.trc:
ORA-29740: evicted by member 1, group incarnation 10
Instance terminated by SMON, pid = 3452
...
The LMON trace file of instance #1 recorded the state just before the shutdown:
...
*** 2010-08-22 18:29:34.050
kjxgrdtrt: Evicted by 1, seq (10, 9)
IMR state information
  Member 0, thread 1, state 4, flags 0x0040
  RR seq 9, propstate 5, pending propstate 0
  Member information:
    Member 0, incarn 9, version 678769
      thrd 1, prev thrd 65535, status 0x0007, err 0x0000
    Member 1, incarn 9, version 109808
      thrd 2, prev thrd 65535, status 0x0007, err 0x0000
Group name: ORCL
Member id: 0
Cached SKGXN event: 0
Group State:
  State: 9 6
  Commited Map: 0 1
  New Map: 0 1
  SKGXN Map: 0 1
  Master node: 0
  Memcnt 2  Rcvcnt 0
  Substate Proposal: false
Inc Proposal:
  incarn 0  memcnt 0  master 0
  proposal false  matched false
  map:
Master Inc State:
  incarn 0  memcnt 0  agrees 0  flag 0x1
  wmap:
  nmap:
  ubmap:
Submitting asynchronized dump request [1]
*** 2010-08-22 18:30:44.766
error 29740 detected in background process
ORA-29740: evicted by member 1, group incarnation 10
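
The member numbers in the trace start at 0, so "evicted by member 1" is easy to misread as instance #1 evicting itself. Here is a minimal Python sketch that reads the member-to-thread mapping out of the "Member information" block and resolves the evicting member. It assumes the trace layout shown above and assumes that the redo thread number corresponds to the instance number, as it does in this case. The function names and `SAMPLE_TRACE` are illustrative only.

```python
import re

# Patterns based only on the trace lines shown in this article.
MEMBER = re.compile(r"Member (\d+), incarn \d+, version \d+")
THREAD = re.compile(r"thrd (\d+), prev thrd")
EVICTED = re.compile(r"ORA-29740: evicted by member (\d+), group incarnation (\d+)")

# Excerpt of the LMON trace of instance #1 (hypothetical variable name).
SAMPLE_TRACE = """\
  Member 0, thread 1, state 4, flags 0x0040
  Member information:
    Member 0, incarn 9, version 678769
      thrd 1, prev thrd 65535, status 0x0007, err 0x0000
    Member 1, incarn 9, version 109808
      thrd 2, prev thrd 65535, status 0x0007, err 0x0000
ORA-29740: evicted by member 1, group incarnation 10
"""

def member_threads(trace):
    """Map each member id to the thread number listed on the line below it."""
    mapping = {}
    current = None
    for line in trace.splitlines():
        m = MEMBER.search(line)
        if m:
            current = int(m.group(1))
            continue
        t = THREAD.search(line)
        if t and current is not None:
            mapping[current] = int(t.group(1))
            current = None
    return mapping

def evicting_thread(trace):
    """Return the thread number of the member named in the ORA-29740 line, if any."""
    m = EVICTED.search(trace)
    if not m:
        return None
    return member_threads(trace).get(int(m.group(1)))

if __name__ == "__main__":
    print(member_threads(SAMPLE_TRACE))
    print(evicting_thread(SAMPLE_TRACE))
```

In this trace, member 1 owns thread 2, which matches the account above: instance #2 evicted instance #1.
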

Communication problems like this can have several causes:

  • The system is halted or is in the middle of booting, which stops the heartbeat.
  • The system hangs because of performance problems.
  • Software or hardware faults in the network interface cards.

Luckily, something unusual turned up in the dmesg output:

...
Aug 22 18:12:50 dbhost cl_runtime: [ID 661778 kern.warning] WARNING: clcomm: memory low: freemem 0x1605
Aug 22 18:23:42 dbhost in.mpathd[2052]: [ID 585766 daemon.error] Cannot meet requested failure detection time of 20000 ms on (inet  ge3) new failure detection time for group "nafo0" is 227332 ms
Aug 22 18:35:34 dbhost in.mpathd[2052]: [ID 585766 daemon.error] Cannot meet requested failure detection time of 20000 ms on (inet  ge3) new failure detection time for group "nafo0" is 933156 ms
Aug 22 18:35:46 dbhost eTAudit GenericRec: [ID 778245 user.error] Failed to submit message to router.
Aug 22 18:36:42 dbhost in.mpathd[2052]: [ID 302819 daemon.error] Improved failure detection time 466578 ms on (inet  ge3) for group "nafo0"
Aug 22 18:37:32 dbhost in.mpathd[2052]: [ID 302819 daemon.error] Improved failure detection time 233289 ms on (inet  ge3) for group "nafo0"
...
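
Messages like these are easy to miss in a busy system log. Here is a minimal Python sketch that pulls out the low-memory warning and the in.mpathd failure detection times so they can be lined up against the eviction time. The regular expressions are written against the message text shown above only. The function name, the event labels, and `SAMPLE_LINES` are illustrative, and the unit of `freemem` is not stated in the message, so the value is reported as a plain number.

```python
import re

# Patterns derived from the log lines in this article, not from any official format spec.
LOW_MEM = re.compile(r"memory low: freemem (0x[0-9a-fA-F]+)")
NEW_TIME = re.compile(r'new failure detection time for group "([^"]+)" is (\d+) ms')
IMPROVED = re.compile(r'Improved failure detection time (\d+) ms .* for group "([^"]+)"')

# A few lines from the excerpt above (hypothetical variable name).
SAMPLE_LINES = [
    'Aug 22 18:12:50 dbhost cl_runtime: [ID 661778 kern.warning] WARNING: clcomm: memory low: freemem 0x1605',
    'Aug 22 18:23:42 dbhost in.mpathd[2052]: [ID 585766 daemon.error] Cannot meet requested failure detection time of 20000 ms on (inet  ge3) new failure detection time for group "nafo0" is 227332 ms',
    'Aug 22 18:36:42 dbhost in.mpathd[2052]: [ID 302819 daemon.error] Improved failure detection time 466578 ms on (inet  ge3) for group "nafo0"',
]

def summarize(lines):
    """Return (timestamp, event, value) tuples for the messages of interest."""
    events = []
    for line in lines:
        stamp = line[:15]  # leading "Mon DD HH:MM:SS" part of the line
        m = LOW_MEM.search(line)
        if m:
            events.append((stamp, "low_memory", int(m.group(1), 16)))
            continue
        m = NEW_TIME.search(line)
        if m:
            events.append((stamp, "fdt_degraded", int(m.group(2))))
            continue
        m = IMPROVED.search(line)
        if m:
            events.append((stamp, "fdt_improved", int(m.group(1))))
    return events

if __name__ == "__main__":
    for event in summarize(SAMPLE_LINES):
        print(event)
```

Read in order, the events show the low-memory warning at 18:12:50, about fifteen minutes before instance #2 started the eviction at 18:27:35, with the failure detection time far above the requested 20000 ms in between.
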

It appeared that the server was short of memory, which stalled the heartbeat so that the instance could not keep pace with the other members of the cluster. The DBA reported the bottleneck to the system administrator, who decided to add more physical memory to relieve the pressure. After instance #1 was bounced, the cluster database returned to normal.
