Baselines and Adaptive Thresholds on Production Systems #2 – Gotcha #1

Posted by oakesgr on August 10, 2009

Not exactly a catchy title, is it? If I worked for Oracle I’d probably think of some flash acronym like BATPROD. Much more interesting!

Anyway, it turns out that my initial thoughts were incredibly naive. There really is a lot of depth to this feature. It’s easy to turn on, but (and it’s a great, big, whopping, in bold and italics, BUT) to get anything useful out of it I feel you need to spend a lot of time on the following points:

  1. What exactly do you want to measure?
  2. What are the appropriate thresholds?
  3. What are you going to do with those alerts when they do decide to raise their heads?
  4. Be prepared for some gotchas!

Let me start with point 4 – as this has been vexing me the most. Take a look at this screenshot…


Who else out there thinks that the huge spike on the 9th August at around 12pm should have raised an alert? Most intriguing. So either:

  • I’m missing something
  • The alerting isn’t working
  • Someone cleared the alerts

Of those options, it’s pretty unlikely that someone cleared the alerts for the last few days but left the older ones. My understanding has huge gaps in it at the moment, so I could well be missing something. But what I didn’t put on that list of options was:

  • The thresholds have been changed after that spike

Let me change the thresholds to 200% for warning and 300% for critical (ok – not very realistic I know!) and show the same graph.



All of a sudden it doesn’t seem like such an issue. Or does it? Now it looks like we are getting alerts on the 6th Aug when we shouldn’t! 

It would be really nice if those red and yellow shaded areas moved up and down at the point in time when the thresholds change. But they don’t (at least not in 10g), so be aware! As far as I can tell, the shading is drawn using the current thresholds across the whole time range, whatever the thresholds were when each point was collected. Luckily I knew I’d been messing around with the thresholds like some kind of lunatic, so it wasn’t too difficult to guess what was happening. However, if you work in a large organisation with a large number of DBAs, this could cause no end of confusion.
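To make the gotcha concrete, here is a minimal sketch in Python. It is not how Enterprise Manager works internally, and every name and number in it is made up for illustration: threshold_history, samples, classify, thresholds_at and replay are all hypothetical. It simply compares two ways of judging an old data point: against the thresholds that were in effect when it was collected (roughly what the alerting did), and against today's thresholds (roughly what the shaded areas on the graph show).

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical threshold history: (effective_from, warning_pct, critical_pct).
# These values are invented for illustration, not taken from a real system.
threshold_history = [
    (datetime(2009, 8, 1), 200, 300),   # loose thresholds in effect during the spike
    (datetime(2009, 8, 10), 100, 150),  # tightened afterwards
]

# Hypothetical metric samples: (collected_at, value as a percentage).
samples = [
    (datetime(2009, 8, 6, 12), 90),
    (datetime(2009, 8, 9, 12), 180),    # the "huge spike"
]


def classify(value, warning, critical):
    """Return the alert level a value would get for the given thresholds."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def thresholds_at(when, history):
    """Look up the (warning, critical) pair that was in effect at `when`."""
    starts = [start for start, _, _ in history]
    idx = bisect_right(starts, when) - 1
    if idx < 0:
        raise ValueError("no thresholds were defined at that time")
    _, warning, critical = history[idx]
    return warning, critical


def replay(samples, history):
    """For each sample, pair the level at collection time with the level
    it would get if judged by the most recent thresholds."""
    _, current_warning, current_critical = history[-1]
    rows = []
    for when, value in samples:
        warning, critical = thresholds_at(when, history)
        rows.append((
            when,
            classify(value, warning, critical),                  # what alerting saw
            classify(value, current_warning, current_critical),  # what the graph implies
        ))
    return rows


for when, at_the_time, by_current in replay(samples, threshold_history):
    print(when, "at the time:", at_the_time, "| by current thresholds:", by_current)
```

With these invented numbers the spike on the 9th is "ok" by the thresholds of the day but "critical" by the current ones, which is the same kind of mismatch as a graph that says an alert should have fired when none did.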

Which leads me onto another question. Does changing a threshold get logged anywhere?

I think that’s enough for now. #3 will be on its way soon.


