Ordinary Oracle Joe

Just an ordinary DBA's thoughts

Mass events – how do you handle them?

Posted by oakesgr on September 23, 2009

As Led Zeppelin once sang… It’s been a long time

August was a ridiculously busy month at work and at home, and to be honest September hasn’t been much better. Hence the major break between posts. It seems weird that I haven’t posted for the best part of two months now.

During September we had a quarterly Infrastructure Weekend. The organisation I work for puts aside 4 weekends a year that are then given over to the IT infrastructure groups for maintenance work, for example OS or DB patching. This means that upwards of 1000 instances may need to be shut down, restarted, checked etc. across Oracle, Sybase and MS SQL.

Throw into that the fact that we use 3rd-party tools for replication and backups, and this all becomes a very complicated and time-consuming process. I actually think that we’re becoming very good at these mass events now though. The core DBA team has created a tool that handles the mass shutdown, startup and post-checks for both Oracle and Sybase.
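
To give a flavour of what that sort of tooling can look like, here’s a rough sketch of the idea (purely illustrative, not our actual tool – the inventory file format, the db_ctl wrapper script and the ssh approach are all made up for the example). The principle is simple: read an inventory of instances, run one phase at a time across the whole estate, and collect the failures for follow-up.

#!/usr/bin/env python3
# Purely illustrative sketch - not the tool described above. It assumes a
# plain-text inventory file with one "host engine instance" entry per line,
# and a hypothetical db_ctl wrapper script on each host that knows how to
# stop/start/check a single instance for its engine (Oracle, Sybase, etc.).

import subprocess
import sys

PHASES = ("shutdown", "startup", "check")

def load_inventory(path):
    """Parse the inventory file into (host, engine, instance) tuples."""
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            host, engine, instance = line.split()
            entries.append((host, engine, instance))
    return entries

def run_phase(entries, phase):
    """Run one phase against every instance, collecting failures rather
    than stopping at the first broken database."""
    failures = []
    for host, engine, instance in entries:
        cmd = ["ssh", host, "db_ctl", phase, engine, instance]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((host, instance, result.stderr.strip()))
    return failures

if __name__ == "__main__":
    if len(sys.argv) != 3 or sys.argv[2] not in PHASES:
        sys.exit("usage: mass_ctl.py <inventory_file> <%s>" % "|".join(PHASES))
    failed = run_phase(load_inventory(sys.argv[1]), sys.argv[2])
    for host, instance, err in failed:
        print("FAILED %s on %s: %s" % (instance, host, err))
    sys.exit(1 if failed else 0)

You’d run something like this once per phase (shutdown, then startup, then check) and chase up anything reported as FAILED; the point is that one small team can drive hundreds of instances the same way, whatever the engine.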

The major work is now in the weeks leading up to these weekends. For every database that is planned to have an outage, the application team needs to be contacted, advised of the outage and given options on housekeeping jobs.

This is a major overhead and I’m interested in hearing how other large organisations cope with events like this. Do they even have them? Is patching done piecemeal over the year? I’m sure there’s a way to improve the process… I just don’t know what it is.


12 Responses to “Mass events – how do you handle them?”

  1. mwidlake said

    Welcome back to blogging, Graham, it’s good to see another post.

    In my experience of large organisations, there seem to be two main ways of handling large numbers of different systems:
    1) Don’t upgrade or patch anything until you absolutely have to, due to lack of time and, as you candidly highlight, the overhead of arranging and managing it. {Most core IT teams only admit the time element.}
    2) Split the infrastructure into sections, such as by major business system or type of technology, and carry out maintenance on those units. When an element of your core infrastructure needs upgrading, e.g. a major storage migration, make it a Project.

    The problem with (1) is that such organisations end up with poorly maintained systems that stagnate until maintenance or an upgrade is forced by some impending or actual major issue – at which point the work is done in a hurry and with much stress on the staff doing it.

    The problem with (2) is that you end up upgrading/maintaining all the time, a different system each weekend, and have issues tracking which systems have been maintained, as well as interoperability issues between versions where an upgraded system interacts with one that has not. The major infrastructure issues that come along can be a real headache.

    Overall, I prefer (2)!

    Maintenance and upgrade always seem to be less painful in those large (and, actually, small) organisations that:
    1) Don’t fall too far behind in upgrades/patches/regular maintenance, but do wait until there is a considerable amount of work to do.
    2) Put effort into automating the tasks (be it version upgrades or archiving old data).
    3) At least consider the infrastructure as a whole rather than just “let’s maintain the DB” or “it’s time to patch the OS”.
    4) Review how it went and how to improve next time.

    Something I do if I have to project-manage an upgrade or major maintenance task is prepare a plan and then have a meeting with all the core IT teams (sys admin, DBA, network, storage, customer support, application support) and ask the simple question “what could go wrong?” about a dozen times, followed by “how can we back out?” half a dozen times. I don’t tend to be popular doing it though…

    What do you think?

    • oakesgr said

      Hi Martin,

      I think (!) that those questions should be asked by default at every meeting 🙂

      Seriously though, we seem to have a good balance of not getting too far behind but also storing up enough work to make it worthwhile. The quarterly timeframe seems to be about right. The one exception is CPU patches, which we just don’t seem to get to grips with.

      The step in that list that really interests me, though, is number 4: improving on last time. As these big events are coordinated by the core team (I sit with the business), I’m not really party to those debriefing sessions – I assume they occur.

      I’ve been here just over two years now and I’ve definitely seen an improvement, so we’re going the right way. We need fewer DBAs to handle the same amount of infrastructure changes, so we’re doing something right.

      The whole ‘contacting the business groups’ step is now the biggest overhead and therefore the biggest opportunity to improve (at least in my eyes).

      If I come up with any blinding ideas, I’ll update the post.

  2. Doug Burns said

    • mwidlake said

      Is this Doug saying he knows nothing or has the comment got corrupted?

      • Doug Burns said

        The comment that so (cough) mysteriously went missing was something along the lines of …

        “I have no idea – we have DBAs to take care of that kind of thing for us”

        Which is part of a running joke.

        On the other hand, you’re right. I know nothing.

  3. oakesgr said

    Doug,

    maybe I’m just removing the less useful comments 😉

    As for “I have no idea – we have DBAs to take care of that kind of thing for us” – just another example of ‘ex’ (!!) DBAs absolving themselves from any kind of responsibility 🙂

  4. Doug Burns said

    ‘ex’ (!!) DBAs

    Can I resist the urge to start quoting bits of the Parrot sketch?

    Looking in as a Developer outsider 😉 I think Graham’s site has a pretty good approach to this stuff. Quarterly isn’t very often – a recent client had an infrastructure change window every Saturday night, but they were a much smaller, less international company. However, with all the pre-planning that you mentioned, it wraps up the disruption into fewer, bigger chunks and they can be more cast-in-stone, with less business/developer debate.

    I would say that our shared workplace (it’s so weird not being able to mention company names) is about as good as I’ve come across, possibly with one exception, and that’s the versions in use, which often seem quite out of date to me and which, in turn, will impact CPU availability anyway. Of course, it might be because big companies have Engineering departments, who add to the software upgrade bottleneck. Contrasting with the last place – a much smaller DBA team, no Engineering and, I suppose, a much smaller estate to be fair – we were able to stay on top of the patch-set/CPU thing with a fair amount of will and effort. But I’m really glad we did – less time spent chasing solved bugs and talking to Support about having to apply patch-sets.

    • oakesgr said

      I see you brought up the Engineering department. There are a lot of differing views on the benefits vs overhead that having an engineering department entails.

      A popular view is to blame them for everything. It takes too long… We don’t get to choose which options we install… etc.

      I wouldn’t subscribe to that view. The larger the enterprise, the more important standards become; finding the point where the benefits outweigh the overheads incurred is up for debate though.

      Would I like to be on 11g now? Of course I would, but I think they’re essential in our current environment, and I’m willing to wait for the ‘engineered’ 11gR2 to become available.

      Sorry – rant over; you set me off with the e-word! It’s a subject of hot debate.

      • Doug Burns said

        Come on, you know me from old blog posts – I’m standards daft. Particularly if you have thousands of instances and servers. You can’t just roll out any old thing.

        However, there is *nothing* in that requirement that suggests that the people responsible for trying to deliver the best agreed solutions need to be arrogant, condescending or of the opinion that they are doing some mystical task that *just* DBAs couldn’t do. Of course, I’m not saying that I’ve ever come across any Engineers like that and certainly not that there are any at our current site. If I had, though, I would probably think they need to get over themselves a bit 😉

        … and if you’ve never come across that, then an Engineering department can be a fine thing indeed!

      • oakesgr said

        I don’t know how I got to this position, but I’m going to defend our engineering department! I can’t comment on engineering departments that you’ve worked with previously, but I find ours to be pretty helpful.

        As in most things, it does seem to be a little about ‘who you know’, but I’ve found that if you make the effort to contact them, raise concerns etc., they’re pretty responsive.

      • Doug Burns said

        Well, IIRC, Jeremy Schneider is probably in that team and I don’t have a bad word to say about him.

        It was really the concept of Engineering departments and past behaviour. No bad comments about this place!

        (Phew, do you think I’ve got away with that one?)

