Saturday, February 23, 2008

Database/Instance Recovery in RAC

When a database starts, Oracle performs two consistency checks (among others).

  1. Check whether the start SCN value in each datafile header matches the corresponding stop SCN value in the controlfile.
  2. Check whether the checkpoint counter values in the datafile header and the controlfile match.
If these two checks are successful then no instance recovery is needed.

If the datafile header SCNs are out of sync, then at least instance recovery is needed.

If the checkpoint counter check fails, Oracle knows that the datafile was replaced with a backup copy while the database was down, and media recovery is required.
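The decision described above can be sketched as a small toy model. This is only an illustration of the logic, not how Oracle implements it internally: the `DatafileState` class, its field names and the `recovery_needed` function are all hypothetical, and a stop SCN of `None` is used here to stand for a datafile that was not closed cleanly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatafileState:
    """Hypothetical snapshot of the values compared at startup."""
    name: str
    header_start_scn: int                 # start SCN in the datafile header
    controlfile_stop_scn: Optional[int]   # None models "no stop SCN recorded"
    header_ckpt_count: int                # checkpoint counter in the header
    controlfile_ckpt_count: int           # checkpoint counter in the controlfile


def recovery_needed(files: List[DatafileState]) -> str:
    """Return 'media', 'instance' or 'none' for a set of datafiles."""
    # A checkpoint counter mismatch means a file was swapped for a backup copy.
    if any(f.header_ckpt_count != f.controlfile_ckpt_count for f in files):
        return "media"
    # Counters agree but SCNs do not: the instance did not shut down cleanly.
    if any(f.header_start_scn != f.controlfile_stop_scn for f in files):
        return "instance"
    return "none"


clean = [DatafileState("system01.dbf", 500, 500, 42, 42)]
crashed = [DatafileState("system01.dbf", 500, None, 42, 42)]
restored = [DatafileState("system01.dbf", 300, 500, 37, 42)]

print(recovery_needed(clean), recovery_needed(crashed), recovery_needed(restored))
```

The counter check is evaluated first in this sketch because a restored backup needs media recovery regardless of what the SCN comparison says.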

Instance recovery is complete when Oracle has performed:

  1. Cache recovery: replaying the contents of the online redo logs of the failed instance.
  2. Transaction recovery: rolling back the uncommitted transactions of the failed instance.
  • During the first phase of recovery, GES (Global Enqueue Service) remasters the enqueues and GCS (Global Cache Service) remasters its resources from the failed instance among the surviving instances.
  • The first step in GCS remastering is for Oracle to assign a new incarnation number.
  • Oracle determines how many nodes are left in the cluster.
  • To recreate the resource masters of the failed node, all GCS resource requests and write requests are temporarily suspended and the GRD (Global Resource Directory) is frozen.
  • All dead shadow processes related to GCS are cleaned up from the failed instance.
  • After the enqueues are reconfigured, one of the surviving instances grabs the instance recovery enqueue.
  • At the same time as the GCS resources are remastered, SMON determines the blocks that need recovery. This is known as the recovery set. Because of cache fusion, SMON needs to merge the contents of the online redo logs of every failed instance to determine the recovery set and the order in which the redo is applied.
  • In this stage, buffer space for recovery is allocated, and the GCS resources identified by reading the online redo logs are claimed as recovery resources. This prevents other instances from accessing these resources.
  • SMON performs roll forward (cache recovery) and roll back (transaction recovery).
  • A new master node is assigned to the cluster if the failed node was the previous master. All GCS shadow processes are traversed and the GRD is released from its frozen state. This completes the reconfiguration process.
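The redo merge that SMON performs to build the recovery set can be pictured with a short sketch. This is a simplified model under stated assumptions, not Oracle's actual algorithm: `RedoRecord`, `merge_redo_threads` and the block labels are hypothetical, each failed instance's redo thread is assumed to be already ordered by SCN, and real redo records carry far more than a block identifier.

```python
import heapq
from typing import Iterable, List, NamedTuple, Set, Tuple


class RedoRecord(NamedTuple):
    """Hypothetical, heavily simplified redo record."""
    scn: int      # SCN of the change
    thread: int   # redo thread (one per instance)
    block: str    # block touched by the change


def merge_redo_threads(
    threads: Iterable[List[RedoRecord]],
) -> Tuple[List[RedoRecord], Set[str]]:
    """Merge per-instance redo streams into one SCN-ordered apply list."""
    # heapq.merge assumes every input stream is already sorted by the key.
    apply_order = list(heapq.merge(*threads, key=lambda r: r.scn))
    # The recovery set is every block that appears in the merged redo.
    recovery_set = {r.block for r in apply_order}
    return apply_order, recovery_set


thread1 = [RedoRecord(100, 1, "file4:blk10"), RedoRecord(130, 1, "file4:blk11")]
thread2 = [RedoRecord(110, 2, "file4:blk10"), RedoRecord(120, 2, "file5:blk7")]

apply_order, recovery_set = merge_redo_threads([thread1, thread2])
for rec in apply_order:
    print(rec.scn, rec.thread, rec.block)
```

In this example the block `file4:blk10` was changed by both instances, which is why neither redo thread alone is enough to decide the order of apply.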