Show Me the Way—with the innovator behind Statspack and AWR
As published in the 102nd issue of the NoCOUG Journal (February 2012)
Show Me the Way
with Graham Wood
Graham Wood has been working with Oracle Database for 25 years. He is currently a product manager for the Oracle RDBMS based in Redwood Shores, Calif. He has architected and tuned some of the largest Oracle databases, and has presented around the world on Oracle performance–related topics.
I have it on very good authority (Tom Kyte in the current issue of Oracle Magazine) that you are the genius and innovator behind Statspack and Automatic Workload Repository. I am in awe. Tell me the story behind that.
Wow, starting with a memory test! When Oracle V6 was introduced it contained the first V$ views, such as V$SYSSTAT and V$FILESTAT. These structures were created to allow database development to understand which parts of the code were being executed, and how often, during the running of the OLTP benchmarks that had started to appear at that time. The database shipped with two scripts that were used to produce a report from the V$ views during a benchmark run. These were bstat.sql, which captured the current contents of the V$ views at the start of the benchmark into a set of tables, and estat.sql, which captured the contents at the end of the benchmark into another set of tables, produced a report from the two sets of tables, and then dropped them. I was working in a small specialist performance group in Oracle UK at the time and it occurred to us, being database guys, that it might be useful for production systems to do regular captures of the V$ views and to keep this data around for rather longer as a repository of performance data. We wrote some scripts and started to distribute them inside Oracle, and they also found their way out to several customers. This was the original “Stats Package,” as we called it. As new releases of the database came out, I upgraded the scripts, probably most notably with the inclusion of V$SQL in Oracle V7 in the Stats7 package. In 1996 I moved to Oracle HQ in Redwood Shores to work in the Server Technologies Performance Group, and one of the goals that I set myself was to get the scripts shipped with the product so that all customers could use them. They finally made it into the database distribution in Oracle 8i as Statspack after being updated and enhanced by Connie Green. And the rest, as they say, is history, with almost all big Oracle sites using Statspack to keep a history of performance data.
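The bstat/estat approach described above boils down to a simple idea: V$ views expose cumulative counters, so capturing them at two points in time and subtracting gives the activity during the interval. The sketch below illustrates that snapshot-delta technique in Python; the statistic names and values are made up for illustration and are not the actual scripts or V$SYSSTAT contents.

```python
# Sketch of the bstat/estat idea: cumulative counters are captured at two
# points in time, and the report is the delta between the two snapshots.
# Statistic names and values here are illustrative, not the real scripts.

def capture_snapshot(stats_source):
    """Copy the current cumulative counters (as bstat/estat copied V$ views)."""
    return dict(stats_source)

def delta_report(begin_snap, end_snap):
    """Report activity during the interval as end minus begin."""
    return {name: end_snap[name] - begin_snap.get(name, 0)
            for name in end_snap}

# Simulated V$SYSSTAT-style counters before and after a workload.
begin = capture_snapshot({"user calls": 1000, "physical reads": 5000})
end = capture_snapshot({"user calls": 1450, "physical reads": 7200})
print(delta_report(begin, end))  # activity during the measured interval
```

Statspack and AWR generalize this by keeping many such snapshots in a repository, so any two can be compared after the fact rather than only a begin/end pair.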
When we started development of Oracle 10g, one of the main focus areas for the release was to be manageability, and a key part of that was to simplify performance analysis and to make recommendations for performance improvement. The most important part of this for me was to be able to automate performance analysis for a database instance and to identify the key areas where improvements could be made. Basically, unless the analysis is correct, there is no point in trying to make recommendations. In order to do this we needed to have a common currency for reporting across the components of the database and for quantifying the performance of a system. This led to the introduction of the concept of DB Time, the time spent in the database by user sessions, which allowed us to do quantitative comparisons between different components and also to quantify the impact on the system of an issue—for example that a single SQL statement represents 27% of all of the user time spent in the database. One of the main objectives of this was to make DBAs more effective by directing them to areas where they were likely to be able to make the greatest improvements in performance, rather than spending time and effort on making changes that produced little benefit. Doing all of this needed much more infrastructure than there was in Statspack, and in Oracle 10g a lot of effort went into ensuring that we had adequate data available to do analysis of a performance problem the first time that it occurred. This resulted in an automatically managed repository of data (AWR), which contained not only data from normal V$ views containing cumulative statistics but also metric data and sampled activity data in the Active Session History. The combination of all of these data sources has taken performance analysis to a different level.
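The DB Time accounting described above can be approximated from sampled activity data: each Active Session History sample of an active session represents roughly one second of time spent in the database, so counting samples per SQL statement estimates each statement's share of DB Time. The sketch below shows the arithmetic; the SQL IDs and sample counts are hypothetical, not real ASH data.

```python
# Sketch of DB Time accounting from ASH-style samples: each sample of an
# active session stands for ~1 second of DB Time, so a statement's share of
# the samples approximates its share of DB Time. Data here is made up.
from collections import Counter

def db_time_share(ash_samples):
    """Return each SQL ID's approximate percentage of total DB Time."""
    counts = Counter(s["sql_id"] for s in ash_samples)
    total = sum(counts.values())
    return {sql_id: 100.0 * n / total for sql_id, n in counts.items()}

# 100 simulated samples: statement "a1" was active in 27 of them.
samples = [{"sql_id": "a1"}] * 27 + [{"sql_id": "b2"}] * 73
shares = db_time_share(samples)
print(shares["a1"])  # -> 27.0, i.e. 27% of DB Time, as in the example above
```

This is why a single common unit (time) makes quantitative comparison possible across otherwise unrelated database components.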
Tom Kyte’s favorite performance story is about a database that was always slow on rainy Mondays. What’s your favorite performance story from your own experiences?
One company that I worked with early on in my Oracle career asked me to help them improve the performance of a large batch report which was produced every night and went out to six people around the organization. It was causing problems for all of the rest of their batch operations by consuming a large amount of resources. The first improvement was to run the report once and print six copies rather than run the same report six times! Then I spoke to the folks who received the report and found out that three of them immediately tossed it in the trash (this was before the days of recycling), and the other three never looked beyond the first four summary pages as they now had an online system that allowed them to look at the details. We ended up changing the report to just produce the summary, and the overnight batch load on the system dropped by about 95% from the start point. It was definitely a case of it always being faster to not do something than to do it.
The most common problem that I see is that of flawed analysis: fixating on a particular statistic or event, which means that you never get to the root cause of the problem.
What are the typical issues you see when you are asked to look at a performance problem? Indexes? Statistics?
Well, by the time I get called in to look at a performance problem these days there have probably already been quite a few people looking at it before, so all of the obvious things have already been tried. So, to be honest, the most common problem that I see is that of flawed analysis: fixating on a particular statistic or event, which means that you never get to the root cause of the problem and you end up trying to deal with a long list of symptoms. It is much better to take a top-down approach and make sure you have the real cause before trying to fix things. If you have a really bad headache you could try to find a better aspirin or lie down in a darkened room, but you might do better to just stop banging your head against the wall. Having said that, I do still see a lot of problematic SQL, and drilling down to the root cause has become so much easier with the introduction of SQL Monitor. It is one of my top features of Oracle 11g, both for DBAs and developers, as it makes it so easy to find out exactly where in the plan the high resource usage and bad cardinality estimates are coming from, without even having to look at the details of the SQL itself. And, of course, I still see applications that have poor connection management and perform unnecessary parsing, even though we have been telling folks how to do it right for a couple of decades now.
I’ve heard a rumor that attendees of the Real World Performance events are being told that “tune” is a four-letter word. Is that some sort of insider joke? What does it mean?
I think that you have me confused with Cary Millsap! Cary differentiates between “tuning” and “optimizing.” The four-letter word that we talk about in the Real World Performance Day is “hack.” We define hacking as making changes without having diagnosed the root cause of the problem, without having scoped the problem or solution, and without being able to detail the expectation, in terms of what changes can be expected in the database performance statistics, of applying the “fix.” Most commonly these days the supporting argument for applying a hack is “Well, I found a website that said if I set _go_faster in the init.ora I will run at least three times faster.” While Google can obviously be a good source of information, you have to remember that not everything that you read on the Internet is true. There really is no good alternative to doing proper performance analysis (although the availability of DB Time and ADDM makes it easier) and proper testing, in your environment and with your data.
The title of software professional comes with a requirement to deliver quality product, not just hope that hardware will bail you out
In Oracle on VMware, Dr. Bert Scalzo makes a case for “solving” performance problems with hardware upgrades. What’s your opinion about this approach? [Footnote]
Ah, the “hardware is the new software” approach, as my colleague Andrew Holdsworth calls it. Software was called software because it was the part of the system that was “soft” and could easily be changed. These days we often see customers who will do anything they can to avoid changing the application, no matter how bad it is. Hardware upgrades can only ever “ameliorate” a subset of performance problems. If the system is CPU bound, then adding more CPU cycles may make things better, but the benefits that you get will be, at best, the 2x every 18 months of Moore’s Law. But most systems with performance problems these days are not CPU bound, and even when they are, there is also a real possibility that adding more CPU will actually further reduce the performance of the system by increasing contention on shared structures. The performance benefits of fixing the software can be orders of magnitude greater and, if done well, make it so that the system is better able to scale with hardware upgrades. The cheap hardware theory primarily applies to CPU, although larger, cheaper memory can also help but often requires that the box is changed anyway. Storage system upgrades are rarely cheap. Although $/GB has been falling rapidly, $/GB/s and $/IOP/s have not, and reducing I/O performance problems will always involve increasing either one or the other of these throughput metrics. I would guess that most of the readers of your magazine would think of themselves as software professionals. To me that title comes with a requirement to deliver quality product, not just hope that hardware will bail you out.
Saying No to NoSQL
Just when I thought I’d finished learning SQL, the NoSQL guys come along and tell me that SQL databases cannot deliver the levels of performance, reliability, and scalability that I will need in the future. Say it isn’t so, Graham.
Well, we hear much pontificating about the benefits of NoSQL, but so far I haven’t seen any audited industry-standard benchmark results as proof points. I have seen many claims from NoSQL evangelists that traditional RDBMSs cannot meet their requirements, only to find on further analysis that they tried a single open-source RDBMS, ran into some problems, and generalized from there. It is also interesting, in light of your previous question about using cheap hardware to resolve performance problems, that NoSQL solutions are developer intensive, as much of the functionality that would be provided by a SQL RDBMS has to be hand-crafted for each solution. But I’m sure over time we will see winners emerge from the current plethora of NoSQL products.
What about Big Data? Can’t SQL databases handle big data, then?
To me the case for Big Data comes down to two key areas: unstructured data and high-volume, low-value data such as web logs. This data could be stored in an RDBMS, but more typically what we are seeing customers doing is using Big Data techniques to extract information from these types of data sources and then storing this data in their RDBMS. This is the type of environment that Oracle’s recently announced Big Data Appliance is designed to help with.
The NoSQL salesmen insist that I need “sharding” instead of partitioning. Did they get that right?
Partitioning in the database has the huge benefit of being transparent to your application and your application developer. Using sharding requires that you move the management of the shards into your own application code. Do you want to develop your own code to perform queries across all of your shards and to do two-phase commits when you need to do a transaction that would affect multiple shards? And is such custom code development really cheap?
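The point above about pushing shard management into application code can be made concrete. The sketch below shows the kind of routing and cross-shard fan-out logic a developer ends up owning once the database no longer handles it; the shard layout, keys, and records are all hypothetical.

```python
# Sketch of what application-managed sharding pushes onto the developer:
# routing single-key writes to one shard, and hand-writing the fan-out and
# merge for any query that spans shards. Everything here is illustrative.

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for separate databases

def shard_for(key):
    """The application must own the routing rule (here, a simple hash)."""
    return hash(key) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def query_all(predicate):
    """A cross-shard query becomes a manual fan-out across every shard."""
    results = []
    for shard in shards:
        results.extend(v for v in shard.values() if predicate(v))
    return results

put("cust:1", {"name": "Ann", "region": "EMEA"})
put("cust:2", {"name": "Bo", "region": "APAC"})
print(query_all(lambda v: v["region"] == "EMEA"))
```

And this sketch omits the genuinely hard parts the interview alludes to: resharding when a node is added, and coordinating transactions (e.g. two-phase commit) when a change touches keys on multiple shards.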
Professor Michael Stonebraker claimed in the 100th issue of the NoCOUG Journal that traditional SQL databases should be “sent to the home for tired software.” Has innovation really stopped at 400 Oracle Parkway? Has Larry sailed off into the sunset?
There have been many technologies that have claimed that they will replace SQL RDBMS over the last 30 years, including object databases and XML. SQL databases are still alive and well and contain the mission-critical data that is the lifeblood of businesses. Having a standard language, SQL, and a sound basis in relational theory means that SQL databases have stood the test of time in an industry where hype and fashion are rampant. In terms of 400 Oracle Parkway (where most of database development is housed), there are still many new features being built into the Oracle database that will increase the benefit that customers get from using the product. But you will have to wait for the product announcements to hear about those. And, of course, as the next America’s Cup is in San Francisco, Larry is still very much around and involved.
The Whole Truth About Exadata
Is Exadata a case of solving performance problems with hardware upgrades? Put another way: is the performance improvement from Exadata exactly what one might expect from the bigger sticker price, no more and no less?
Well, the stock answer is that it is an engineered system that is designed to be capable of very high throughput. The software allows us to utilize the hardware much more effectively. There are customers who have upgraded to Exadata and seen the hardware upgrade benefits, typically 5–10x performance improvement, which is enough to get them into ads in The Economist and airports around the world. But the customers who have fully exploited the capabilities of Exadata have seen orders of magnitude more benefit. In our Day of Real World Performance presentations we load, validate, transform, collect optimizer statistics, and run queries on 1TB of raw data in less than 20 minutes. That sort of performance can transform what IT can deliver to the business and has far greater value than the sticker price.
Is Exadata as good for OLTP workloads as it is for OLAP? (You can be frank with me because what’s said in these pages stays on these pages!)
Well, Exadata is certainly a very capable OLTP box. It has fast CPUs and can deliver very large numbers of fast I/Os by utilizing the flash cache in the storage cells. And OLTP performance is all about CPU horsepower and large numbers of IOPS. But I think it is fair to say that there is less “secret sauce” in Exadata as an OLTP platform than there is for data warehousing.
Show Me the Way
Thank you for answering my cheeky questions today. Someday, I hope to know as much about Oracle Database performance as you. Can you show me the way? Your book, perhaps?
Well, I think that the key to being a good performance analyst is making sure that you spend time upfront correctly scoping the problem, and then avoiding jumping to conclusions while doing a top-down analysis of the data. When you are looking for solutions, make sure that the solution you are implementing matches the scope of the problem that you started with. The classic example of scope mismatch is making a database-level change, like changing an init.ora parameter, to solve a problem that is scoped to a single SQL statement. It is much better to use techniques like SQL Profiles or SQL Plan Baselines that will affect only that single statement. Using that approach will get you a long way. As far as my book goes, I guess I still need to write it; it will be a cheeky book!
Footnote: Here’s the full quote from Dr. Scalzo’s book: “Person hours cost so much more now than computer hardware even with inexpensive offshore outsourcing. It is now considered a sound business decision these days to throw cheap hardware at problems. It is at least, if not more, cost effective than having the staff [sic] tuned and optimized for the same net effect. Besides, a failed tuning and optimization effort leaves you exactly where you started. At least the hardware upgrade approach results in a faster/better server experiencing the same problem that may still have future value to the business once the fundamental problem is eventually corrected. And, if nothing else, the hardware can be depreciated, whereas the time spent tuning is always just a cost taken off the bottom line. So, with such cheap hardware, it might be a wiser business bet to throw hardware at some solutions sooner than was done in the past. One might go so far as to make an economic principle claim that the opportunity cost of tuning is foregoing cheap upgrades that might fix the issue and also possess intrinsic value. Stated this way, it is a safe bet that is where the business people would vote to spend.”