Oracle RAC Backup Pieces reported as ‘EXPIRED’ by NetBackup

We have many two-node Oracle RAC clusters (Oracle 12.1.0.2, running on Oracle Linux 6.10). On each of these, as part of our database backup scripts, Oracle rman performs a ‘crosscheck’ with NetBackup, to verify that the backups, which the database’s control file records as having been written, are really present on the NetBackup storage.

Either node of each RAC cluster may have written individual ‘backup pieces’, so the script opens two ‘channels’ to NetBackup, one for each database node.

In one of our data centres, we noticed that this crosscheck was reporting (see below) many backup pieces as ‘EXPIRED’, i.e. not present within NetBackup, despite these backup pieces actually existing. This regularly happened for every RAC cluster in this data centre, while in the other data centre, it never happened.

 crosschecked backup piece: found to be 'EXPIRED'
 backup piece handle=rgqthedf_1_2_105688876 RECID=8209 STAMP=2036458788 

I recreated this problem by running a crosscheck with NetBackup (device type ‘SBT_TAPE’) on the first node of one of our affected RAC clusters, opening channels for both nodes.

 run {
 allocate channel t1 device type 'SBT_TAPE' send 'NB_ORA_POLICY=poplardb.rman,NB_ORA_SCHED=dbfull,NB_ORA_CLIENT=poplar1';
 allocate channel t2 device type 'SBT_TAPE' send 'NB_ORA_POLICY=poplardb.rman,NB_ORA_SCHED=dbfull,NB_ORA_CLIENT=poplar2';
 crosscheck backup device type 'SBT_TAPE';
 } 

I repeated this on the second node of the same cluster.

Backup pieces which the first node could see (reported as ‘AVAILABLE’) could not be seen by the second node (reported as ‘EXPIRED’) and vice-versa. This would have serious implications for any attempted restore operation; both nodes must be able to access all backup pieces, regardless of which node originally created the backup piece.

To diagnose whether this was an Oracle rman or NetBackup problem, I used NetBackup’s bplist command.

On the first node, I asked for a list of all recent backup pieces created by the first node:

 /usr/openv/netbackup/bin/bplist -s 11/19/2020 -C poplar1.forest -t 4 -R / 

This asks for the names of all backups created since 19th November (-s 11/19/2020), for client poplar1.forest (-C poplar1.forest), of type Oracle (-t 4), stored in any part of the NetBackup backup hierarchy (-R /).

This successfully returned a list of Oracle rman backup pieces.

On the first node, I then asked for a list of all recent backup pieces created by the second node:

 /usr/openv/netbackup/bin/bplist -s 11/19/2020 -C poplar2.forest -t 4 -R / 

This returned an error message:

 EXIT STATUS 135: client is not validated to perform the requested operation 

Repeating this test on the second node, the results were reversed.

So, at the NetBackup level, in one data centre each node in a RAC cluster could not see any backups created by the other node in its cluster. Performing the same bplist test in the other data centre, each node could always see the other node’s backups.

By sensible default, NetBackup clients can only access backups which they have created. For Oracle RAC, the Veritas NetBackup documentation describes two solutions to this, the first being the more secure approach:

  • create a file, named after each database server (client), in the /usr/openv/netbackup/db/altnames directory on the NetBackup master server;
  • create the file /usr/openv/netbackup/db/altnames/No.Restrictions on the NetBackup master server, which removes the restriction for all clients.

We found that the /usr/openv/netbackup/db/altnames/No.Restrictions file was present on the NetBackup master server in one of our data centres, but was missing from the data centre where we were experiencing the problem.
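
A quick way to see which of these is in place (a sketch, assuming the default Linux install path) is simply to list the altnames directory on each master server:

 ls -l /usr/openv/netbackup/db/altnames/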

The NetBackup administrator created the file on the NetBackup master server and restarted the NetBackup daemons, which resolved the problem.

Our next steps are to replace this approach, in both data centres, with the first approach described above, i.e. create individual files for each database server, and then remove the No.Restrictions file.
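
For reference, a sketch of what that change might look like on the master server, based on our reading of the Veritas documentation (the install path and client names are taken from the bplist examples above; the actual change will be made by the NetBackup administrator):

 cd /usr/openv/netbackup/db/altnames
 # allow each RAC node to access the backups created by the other node
 echo poplar2.forest > poplar1.forest
 echo poplar1.forest > poplar2.forest
 # then remove the blanket exception
 rm No.Restrictions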

RMAN Catalog Resync Failure (RMAN-03009, RMAN-20095)

Our Oracle database backup scripts suddenly started failing during their last step, a resync with the rman catalog:

 starting full resync of recovery catalog
 RMAN-00571: ===========================================================
 RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
 RMAN-00571: ===========================================================
 RMAN-03009: failure of resync command on default channel at 11/04/2020 02:16:36
 RMAN-20095: invalid backup/copy control file checkpoint SCN

Any attempt to manually resync the catalog failed with the same error:

 rman target / catalog rcat121/xxxx@rmancat
 resync catalog;
  
 RMAN-01005: starting partial resync of recovery catalog
 RMAN-00571: ===========================================================
 RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
 RMAN-00571: ===========================================================
 RMAN-03009: failure of resync command on default channel at 11/09/2020 17:09:19
 RMAN-20095: invalid backup/copy control file checkpoint SCN

We have an Oracle 12.1.0.2 (on Oracle Linux 6.10) two-node RAC database, with Data Guard to a standby two-node RAC database.

In our case, the problem was occurring on the standby database; it can occur on either a primary or a standby database.

We had encountered the problem described in Oracle Support Note 369409.1 “RMAN resync fails Rman-20095: Invalid Backup/Copy Controlfile Checkpoint Scn”:

Two controlfile copies, one at primary and one at standby site, share the same recid and stamp value; this can occur on rare occasions when two backups or copies complete at exactly the same time.
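
For anyone wanting to check their own catalog for such clashing records, here is a hedged sketch using the documented RC_BACKUP_CONTROLFILE recovery catalog view (connected to the catalog schema as its owner, using the same connection details as elsewhere in this post):

 sqlplus rcat121/xxxx@rmancat

 -- control file backup records that share a RECID and STAMP within one database
 select db_key, bs_key, recid, stamp, checkpoint_change#, checkpoint_time
 from   rc_backup_controlfile
 where  (db_key, recid, stamp) in (select db_key, recid, stamp
                                   from   rc_backup_controlfile
                                   group  by db_key, recid, stamp
                                   having count(*) > 1)
 order  by recid, stamp;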

Unfortunately, attempting to follow the steps outlined in the support note did not fix the problem for us. With great assistance from a member of Oracle Support, we resolved it as follows.

On one of the standby database servers, I connected to the standby database and the catalog using rman in debug/trace mode, and attempted a resync:

 rman target / catalog rcat121/xxxx@rmancat debug all trace=rman_debug.trc 
 resync catalog;

After this failed, I searched the debug trace output file (rman_debug.trc) for error 20095.
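
One way to do this is with grep, showing a few lines of context around each match (a sketch):

 grep -n -C 6 '20095' rman_debug.trc

The relevant part of the trace was: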

 DBGRESYNC:     channel default:   file# 0 [17:09:19.348] (resync)
 DBGRPC:        krmxrpc - channel default kpurpc2 err=20095 db=rcvcat proc=RMANCAT.DBMS_RCVCAT.CHECKBACKUPDATAFILE excl: 133
    DBGRCVCAT: checkBackupDataFile - locked bs_key7840606
    DBGRCVCAT: addBackupControlfile - Inside dup_val_on_index exception
    DBGRCVCAT: addBackupControlfile - ckp_scn 12836518023 ckp_time 04-NOV-20
    DBGRCVCAT: addBackupControlfile - lckp_scn 12836516645 lckp_time 04-NOV-20
 DBGRPC:        krmxrpc - channel default kpurpc2 err=0 db=rcvcat proc=RMANCAT.DBMS_RCVCAT.CANCELCKPT excl: 0
    DBGRCVCAT: cancelCkpt - rollback, release locks
 DBGPLSQL:     EXITED resync with status ORA--20095 [17:09:19.731]

The bs_key value in the trace (7840606) is the key, within the rman catalog, of a registered controlfile copy backup set from the primary database, which was clashing with a controlfile copy backup set from the standby database.

On one of the primary database servers, I connected to the primary database and the catalog using rman, and deleted this backup set:

 rman target / catalog rcat121/xxxx@rmancat 
 list backupset 7840606; 
 change backupset 7840606 delete;

It was then possible to resync the catalog from one of the standby database servers, connected to the standby database and the catalog:

 rman target / catalog rcat121/xxxx@rmancat
 resync catalog;
 starting partial resync of recovery catalog
 partial resync complete

To reduce the likelihood of this happening again, the backup team have offset the primary and standby database backups by half an hour.
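
As an illustration only (the times and script path below are placeholders, not our real configuration), the offset amounts to something like this in the two crontabs:

 # primary database servers
 00 02 * * * /path/to/rman_backup.sh
 # standby database servers
 30 02 * * * /path/to/rman_backup.sh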

Migrating and/or Upgrading Oracle Databases

Over the years I have been involved in the migration and/or upgrade of many Oracle databases.

From these experiences I have built up a checklist of questions and considerations, which I feel should be properly addressed and answered before the migration or upgrade of any Oracle database. Some of these items are on the list from the bitter experience of having overlooked them.

I feel that more of the items to be considered arise from moving (i.e. re-hosting) a database than from upgrading it. So, even if we were moving a database to exactly the same Oracle version on a new hardware platform, we would still have to address most of the items on this list.

I hope it’s useful to you and I will update it in future, every time I’m bitten by something new which I have overlooked.

Please point out in the comments anything you feel should be included.

Freeze or Hang of Oracle dbca and “startup nomount” running on Linux on VirtualBox 5

I encountered a problem with database creation freezing/hanging at a very early stage, with the process spinning on 100% CPU while not seeming to be doing anything. This happened both when trying to start up the Oracle database configuration assistant (dbca), where it froze on the splash screen, and when using the “startup nomount” command in sqlplus.

My setup is:

  • Oracle 11.2 (both 11.2.0.3 and 11.2.0.4);
  • Oracle Linux 6.7;
  • VirtualBox 5.0.10, installed on Windows 7.

The process occupying 100% CPU was similar to that below:

oracleDBUA5305324 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

I unsuccessfully tried a few things to get to the bottom of this or eliminate possible causes: tracing with strace, switching between ASMM (sga_target) and AMM (memory_target), switching between normal pages and huge pages.

My colleague Frits Hoogland offered to help and he figured it out.

I logged into sqlplus as sysdba, which created the following database session:

$ ps -ef
oracle 3638 3637  0 17:42  oracleMAGPIE (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

We traced this session with the ‘perf top’ command, of which I was previously unaware. perf is a great profiling tool, included in the Linux 2.6 kernel. ‘perf top’ gives a dynamic view of the performance of a process, similar, as the name suggests, to the top command. We passed it the process id of the database session:

$ perf top -p 3638

I issued the ‘startup nomount’ command in the sqlplus session. The ‘perf top’ output, after a few seconds, settled on showing the following:

[perf top output: the process is spending nearly all of its time in skgvm_cpuid]

This showed that the process was looping in skgvm_cpuid. Frits recognised that this meant it was stuck trying to identify the CPU type.

This led us to check the CPU’s definition in VirtualBox.

The problem was caused by a new performance enhancement feature in VirtualBox 5: ‘paravirtualization’. The ‘Paravirtualization Interface’ was set to ‘Default’:

[VirtualBox settings screenshot: ‘Paravirtualization Interface’ set to ‘Default’]

Shutting down the VM and changing this to ‘Minimal’ resolved the problem.
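
The same change can also be made from the command line with VBoxManage (a sketch; “OEL6” is a placeholder VM name, and the VM must be powered off first):

 VBoxManage modifyvm "OEL6" --paravirtprovider minimal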

I experimented further and found that setting it to ‘None’ and ‘Legacy’ also resolved the problem. Setting it to ‘KVM’ caused the same freeze to occur on trying to identify the CPU type.  On a Linux VM I imagine setting this to ‘Default’ and ‘KVM’ is exactly the same thing. The other setting available, ‘Hyper-V’, is intended for Windows VMs.

It looks like this hopefully-great new feature in VirtualBox 5 is not completely bug-free yet.

An Oracle DBA meets MySQL

I am an Oracle database administrator and over the last few years I have found myself becoming more involved in MySQL database administration.

I knew absolutely nothing about MySQL and had to figure it out using books and blogs.

I realised there must be many other people in my position so I developed a presentation titled “MySQL: the least an Oracle DBA needs to know”. My aim was to get people up to speed with the basics as quickly and simply as possible, mostly through live demonstrations.

I have had the privilege of presenting this for Oracle user groups in the UK, USA, New Zealand and Finland. It’s a great feeling as the audience moves from sceptical to excited, with lots of “Can it do ….?” questions at the end of the presentation.

To accompany the presentation I wrote this article. I hope you find it useful.