Ordinary Oracle Joe

Just an ordinary DBA's thoughts


Lessons learnt … again

Posted by oakesgr on March 29, 2010

I’ve just come off the back of a pretty intense weekend. I had three application releases to support and eight 10.2.0.4 patch upgrades to perform on production dbs and their standbys. Half of those upgrades were on databases that were also involved in one of the releases, which added complexity around timings, additional backups etc. so that we could roll back the upgrade without having to roll back the release.

By the end of Saturday I was pretty pleased with myself. All 8 upgrades performed, 2 of the 3 releases completed and signed off, and the final one running 5 hours late, so I handed it over to a friend in the States to complete.

What could possibly go wrong? As always happens when you start assuming things are great, you get rudely awakened.

I got a call from an application support manager at around 1pm on the Sunday whilst I was being dragged around Bluewater by my other half (for those lucky souls that have never been there, Bluewater is a large shopping mall in the south east of England).

The global settlements app that runs off one of the upgraded dbs had been trying to start up for 3 hours and wasn’t going anywhere. The usual start-up time is around 60 mins.

A quick explanation to my wife and we started the one-hour trip back home so that I could log on. Luckily another dba in my team was at home and logged on immediately to start helping the app team.

Once I’d logged on, the issue became apparent pretty quickly. A piece of SQL that is run multiple times on startup was now consuming CPU like it was going out of fashion. A couple of runs of the awrsqrpt.sql tool and I had two differing plans in front of me for before and after the upgrade.
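
For anyone wondering what that looks like in practice, the comparison was roughly along these lines (the SQL_ID below is a placeholder, not the real statement):

    -- Report the statement across the snapshots either side of the upgrade;
    -- awrsqrpt.sql prompts for the begin/end snapshots and the SQL_ID
    @?/rdbms/admin/awrsqrpt.sql

    -- Or pull every plan AWR has recorded for the statement directly
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_AWR('0abc123xyz9d4'));

Either route shows the old and new plans side by side, which is what made the plan change obvious.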

Basically we had two choices: we either roll back using the backup taken before the upgrade, or we try to fix the issue.

My immediate bright idea stemmed from the fact that this issue hadn’t been seen during the pre-production test of the upgrade; the app had started fine there. Therefore we should be able to create an outline in preprod, export it from there and import it into prod to stabilise the plan we wanted. I’d used this method when we upgraded from 9i to 10g and was confident it would work. Unfortunately, the plan had also changed in preprod.
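
For anyone who hasn’t used that technique, the idea was roughly the following (the outline name, category and statement are illustrative, not the real settlements SQL):

    -- In pre-prod: capture the good plan as a stored outline
    CREATE OR REPLACE OUTLINE settle_startup_fix
      FOR CATEGORY upgrade_fixes
      ON SELECT /* hypothetical startup query */ trade_id
           FROM settlements
          WHERE status = 'PENDING';

    -- Move it across by exporting the OUTLN schema tables from pre-prod
    -- and importing them into prod (the password is a placeholder)
    exp outln/password file=outlines.dmp tables='ol$,ol$hints,ol$nodes'
    imp outln/password file=outlines.dmp full=y ignore=y

    -- In prod: tell the optimizer to use outlines from that category
    ALTER SYSTEM SET USE_STORED_OUTLINES = upgrade_fixes;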

The app guy mentioned that the start of the working week for this app is 11pm BST, as it’s used in Sydney. He also mentioned that the business impact of not having the app up and running by then could potentially be a loss of reputation for the bank and substantial monetary fines. By this time we were into late afternoon. Time to roll back… uh-oh.

On the Saturday after the upgrade I’d released the db back to the app support team to help with their release. I’d assumed that this involved starting the app and verifying that the upgrade was ok. Therefore I’d allowed the Saturday night backup to go ahead as planned, which removed the pre-upgrade backup. So we had no backup on disk that I could restore from.

OUCH.

A tape restore request was raised and the waiting process began. I should mention here that this db was the largest of the upgrades that weekend… typical! If it can go wrong, it will go wrong, and if it does go wrong it will go wrong on the hardest db to restore!

The tape restore finished at about 7:45pm. A quick config change to increase the number of restore streams used (we’re using SQL Backtrack, not RMAN) and I fired off the db restore. This finished at just gone 10pm. I had the database up and running about 5 mins later and handed it over to the app team. They managed to get the app up and running by about 11:20 and we could all breathe a sigh of relief.

Except I still had to rebuild the standby. I scheduled a backup and then took advantage of the fact that I work for a large organisation and asked the dba team in Singapore to monitor the backup and copy it over to the standby once it had completed.

I rebuilt the standby when I got to the office this morning and had it in sync with the primary by about 9:30.
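
For completeness, the rebuild itself was the standard 10g physical standby drill. A rough sketch (the SQL Backtrack restore of the datafiles is omitted, and the controlfile path is illustrative) looks like this:

    -- On the primary: create a controlfile for the standby
    ALTER DATABASE CREATE STANDBY CONTROLFILE AS '/tmp/standby.ctl';

    -- On the standby, once the restored datafiles and standby controlfile
    -- are in place:
    STARTUP NOMOUNT
    ALTER DATABASE MOUNT STANDBY DATABASE;
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;

    -- Confirm redo is being applied, i.e. the standby is catching up
    SELECT MAX(sequence#) AS last_applied
      FROM v$archived_log
     WHERE applied = 'YES';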

At the end of all this activity, what have I learnt?

#1 Never assume that everything is fine. Get written confirmation, especially if it involves removing the rollback backup.

#2 If it can go wrong it will go wrong, and when it goes wrong it will go wrong on the biggest, most awkward to restore database.

#3 Pay more attention to the app team’s testing. We could have caught this issue in pre-prod testing with a little more attention to detail.

#4 Try to have the app team test upgrades immediately upon completion rather than waiting until the next day.

 Most of those lessons are ones I’ve learnt before, and if I’m honest I’m pretty disappointed that I had to relearn them.  I guess it just goes to show that you can’t forget the basic principles that you started with years ago.
