Baselines and Adaptive Thresholds on Production Systems #1
Posted by oakesgr on August 6, 2009
Ever since I attended the 10g performance tuning course I’ve wanted to use Baselines and Adaptive Thresholds in anger. From some brief discussions I’ve had, it seems this feature has been underutilised, certainly within my organisation.
After a quick chat with the application support team I work with, we decided to use a derivative settlements database as our guinea pig.
My early (probably naive) hope for the use of this feature is something along the lines of:
- Set a rolling 7 day window baseline
- Set the adaptive thresholds so that anything above normal usage + X% (X still to be decided!) will generate alerts
- Trap the alerts so that they are not only raised via our generic alerting system (which already takes them from Grid Control), but are also posted to one of our internal chat channels.
The third point is quite important (in my mind) because of the way the DBA structure is set up in my company. We have a team in Singapore that works around the clock monitoring the generic alerts system. We also have regional ‘core’ teams that handle general day-to-day support issues. Finally, we have ‘aligned’ DBA teams that work more closely with the business. I’m one of those.
So basically I want these new alerts to be brought to the aligned DBAs’ notice (as well as the Singapore team’s). The theory is that we will be notified of abnormal activity on the database ahead of the support team, giving us a little more time to examine any potential issues before they turn into high-pressure ‘fix this NOW!’ situations.
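To make the routing idea concrete, here is a minimal Python sketch of sending one alert to several destinations at once. Everything in it is hypothetical: the `fan_out` function, the alert fields and the sink functions are illustrative names, not part of Grid Control or of our alerting system. The only point it makes is that one slow or broken destination (say, the chat channel) should not stop the alert reaching the others.

```python
from typing import Callable, Dict, List

# Hypothetical alert shape: a plain dict with a few string fields.
Alert = Dict[str, str]
Sink = Callable[[Alert], None]


def fan_out(alert: Alert, sinks: List[Sink]) -> List[str]:
    """Deliver one alert to every sink.

    Returns the names of the sinks that raised an exception, so a
    failure in one destination does not block delivery to the rest.
    """
    failed: List[str] = []
    for sink in sinks:
        try:
            sink(alert)
        except Exception:
            failed.append(getattr(sink, "__name__", repr(sink)))
    return failed


def format_chat_line(alert: Alert) -> str:
    """Render an alert as a single line suitable for a chat channel."""
    return f"[{alert['severity']}] {alert['database']}: {alert['message']}"
```

In practice each sink would wrap whatever the real destination needs (the generic alerting feed, a chat bot, and so on); here they are just callables that accept the alert.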
When creating the 7 day rolling baseline there are a number of options to consider. As the usage window of this database extends outside of EMEA (covering both the US and APAC time zones) there seems little point in using the Day and Night option. Therefore I’m just using the Weekday and Weekend grouping.
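For anyone unfamiliar with the idea of grouping a rolling window, the following Python sketch shows the concept only. It is not how Oracle computes its baseline statistics; the function names, the sample format and the choice of "peak value per group" as the statistic are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple


def day_group(ts: datetime) -> str:
    """Classify a timestamp as 'weekday' or 'weekend' (Saturday/Sunday)."""
    return "weekend" if ts.weekday() >= 5 else "weekday"


def rolling_group_max(
    samples: Iterable[Tuple[datetime, float]],
    now: datetime,
    window_days: int = 7,
) -> Dict[str, float]:
    """Return the peak metric value per group within the rolling window.

    Samples older than `window_days` before `now` (or later than `now`)
    are ignored, so the result moves forward as time passes.
    """
    cutoff = now - timedelta(days=window_days)
    peaks: Dict[str, float] = {}
    for ts, value in samples:
        if cutoff <= ts <= now:
            group = day_group(ts)
            peaks[group] = max(value, peaks.get(group, value))
    return peaks
```

A weekend reading is then only ever compared with other weekend readings, which is the whole reason for choosing a grouping that matches the real usage pattern.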
The other usage anomaly to note is a quarterly event known as the CDS roll. For the uninitiated, CDS stands for Credit Default Swap (http://en.wikipedia.org/wiki/Credit_default_swap), and the roll affects the whole banking sector. For a few days each quarter, the usage profile on this database (and many others) can double, triple or just go off the scale, and this has caused plenty of havoc in the past. My plan is to create a static baseline during the next event and then switch to that baseline during all subsequent events. The next roll isn’t due for a while though, so I’ll continue with the rolling window option for the moment.
The rolling baseline is now in place, and I’ve decided to start with thresholds of 120% and 150% for warning and critical alerts respectively. This is simply a starting point and will probably be amended as we get more experience. I also expect these thresholds to differ from one database to the next as we implement baselines on more of them.
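To spell out what those percentages mean, here is a small Python sketch that classifies an observed value against a baseline value. It is a simplified illustration, not a description of how Enterprise Manager evaluates thresholds internally; the `classify` function and its parameter names are hypothetical, and only the 120% and 150% defaults come from the setup described above.

```python
def classify(
    value: float,
    baseline: float,
    warning_pct: float = 120.0,
    critical_pct: float = 150.0,
) -> str:
    """Return 'ok', 'warning' or 'critical' for a value against a baseline.

    A threshold of 120% means the alert fires once the value reaches
    1.2 times the baseline figure.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    if warning_pct > critical_pct:
        raise ValueError("warning threshold must not exceed critical threshold")
    if value >= baseline * critical_pct / 100.0:
        return "critical"
    if value >= baseline * warning_pct / 100.0:
        return "warning"
    return "ok"
```

Keeping the two percentages as parameters reflects the expectation that each database will end up with its own values.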
I intend to post updates to this subject on a weekly or biweekly basis as I think it will take a while to get a mature solution in place.
Now, I’m off to talk to a UNIX SA that I know has already done something along the lines of alerting through chat channels.
Edit: Doug Burns wrote a great post about Baselines and Adaptive Thresholds, so in an attempt not to repeat everything he wrote I’ll try to angle this more as a ‘my experiences’ type of post than a ‘this is how you do it’ post.