Questioning Method R: An Interview with Cary Millsap
Cary Millsap is the co-creator of Method R, an antidote for Compulsive Tuning Disorder—the disease that afflicts so many of us. His motto is “Nullius in verba” which means “On the words of no one.” In other words, check the facts for yourself.
What’s Method R and how does it cure Compulsive Tuning Disorder?
Method R is a commonsense approach to optimizing a system. The “R” stands for “response time.” Many people have used Method R extensively at Oracle sites throughout the 2000s with tremendous success. Jeff Holt and I first documented the method in the book Optimizing Oracle Performance (O’Reilly 2003). You can read Chapter 1 online for free here.
Method R prescribes four simple steps:
- Identify the task that’s the most important to you.
- Measure its response time (R). In detail.
- Optimize that response time in the most economically efficient way.
- Repeat until your system is economically optimal.
Compulsive Tuning Disorder, or CTD, is a term introduced by Gaja Vaidyanatha, Kirti Deshpande, and John Kostelac in their 2001 book, Oracle Performance Tuning 101. CTD is the compulsion to tinker forever with the performance of a system, inspired by the hope that there might be something in there that needs improving. It’s a disorder instead of a virtue because CTD causes people to waste time and money.
Jeff and I developed Method R for Oracle in response to several big problems with traditional “tuning” approaches, the CTD-inspiring absence of a terminating condition among them.
CTD just doesn’t make sense in a Method R context. With Method R, you know exactly whether or not a computer-executed business task has any time in it that’s been wasted. If there is, you either fix it or prove that fixing it isn’t worth the cost. If there isn’t any wasted time in there, you move on. That’s the key: either way you move on. You don’t linger in a non-productive state, because you know exactly where there’s opportunity for improvement and you know exactly where there’s not.
We love stories. Tell us a performance tuning story. Tell us two.
My favorite performance tuning story at the moment is a note I received yesterday from a customer who has recently bought our Method R Profiler product. He and his team had previously been using an open-source profiling tool, so I had expressed interest in how he was feeling so far about the purchase. Here is the response I received:
“Within one minute of installation the Profiler exceeded all expectations. We went live with a new warehouse management system this Friday because all of the tuning was done ahead of time. The Profiler pointed out all of the application inefficiencies which we turned over to the vendor. They were able to fix the ones with the largest impact on the online response times. That gave our management the confidence they needed to make the Go/No-Go decision. Until I got the Profiler we were looking at postponing it until next quarter. I pity the poor souls that have to optimize in mere mortal ways.”
I love this story because it touches on so many of the reasons we created Method R in the first place, like:
- It hooks right in to the business in a way that makes performance easy for everyone in the company to understand.
- It gives you positive evidence of success—not just some uncertain feeling that maybe you’re okay because you haven’t yet found any problems. This is why the CTD problem never enters into a Method R project.
- It shows you unambiguously what’s really taking your time, so you don’t have to guess what effect your tuning will have on the business task you’re trying to help. Nullius in verba!
- It gives you more power to help your application software vendors help you. It makes you faster and more effective, even when you buy your applications from somebody else.
Which is better, Method R or ASH/AWR?
That’s like asking, “Which is better, Google Maps or fuel injection?” Method R is a process to accomplish a specific goal. ASH/AWR is a functional capability within the Oracle Database software that allows someone to partly implement Method R.
A major confusion I’ve encountered in the field is the misunderstanding in which people equate Method R with Oracle trace data. The method called Method R has nothing specifically to do with Oracle or databases—or even computer software. It’s simply a method for optimizing a process—any process, like driving to work or shopping for groceries.
It does so happen that implementing Method R within the Oracle Database context requires Oracle extended SQL trace data in some (I’d say many) cases. There are two principal reasons for this. First, ASH/AWR is wired to the concept of database time. That’s different from end-user response time. In some cases, it’s very different. In those cases, we need to see the amount of “SQL*Net message from client” time that a process consumes, and even the time that’s completely unaccounted for by database measurement probes.
Second, there are issues of data skew that you can detect in detailed trace data that are impossible to detect with ASH/AWR. For example, 100 calls consumed 100 seconds. How much time will we save if we can eliminate 90 calls? The answer is not necessarily 90 seconds; it could be anything between nearly 0 seconds and nearly 100 seconds. It depends completely upon which specific calls you eliminate. (Hint: Imagine that the 100 call durations were as follows: 99.01s, .01s, .01s, .01s, …, .01s.)
ASH and AWR are good things. But an argument for ASH/AWR is not an argument against Method R. The closer Oracle Corporation comes to giving us the true end-user response-time measurements that Method R requires, the better for everyone.
We spend a lot of time tuning databases. Why does it have to be that way? Is Oracle difficult to tune? Are we building the databases incorrectly?
I began learning how to tune Oracle databases in 1989. At the time, it seemed nearly impossible to me that people could be productive by following these instructions that the “experts” at the time were telling me I had to follow. I would have felt better had I known that it wasn’t just my own ineptitude working against me. In recent years, I’ve been able to prove why the instructions we were taught were fundamentally ineffective against a lot of the everyday performance problems we all have to face.
So is Oracle difficult to tune? If you do it the way we’ve all been taught, then absolutely: it’s nearly impossible. In fact, the better you are as a student, the harder it gets. It’s because the method is flawed. Is it difficult to peel an apple? If the only tool you have is a steel ball bearing, then absolutely: it’s nearly impossible.
I think a lot of databases are built incorrectly because people don’t measure them properly as they’re building them. It’s a result of the same method-is-flawed problem I just mentioned. The things that database architects, application developers, operational administrators—everyone in the stack—have been taught to measure regarding performance are just ludicrous. The measurements you’re taught to use—the counters and the ratios—are not connected to the qualities that real people using the real application in real life will really measure later.
It’s difficult to do anything new—especially when it’s complicated—without immediate feedback. Imagine trying to learn how to putt in golf in an environment where you couldn’t learn where your ball went until three months after you hit it.
That’s what people do with their databases. Everyone in the project has lots and lots of tasks to do, but often it’s nowhere in the plan to measure end-user response time during the construction process. Then, when the system goes live three months later, the only performance metric that most people are going to care about hasn’t even been looked at. Worse yet, many systems (most systems) don’t include instrumentation that allows an analyst in a real-life production environment to even measure what the real response time is as it’s happening.
There’s no way you can expect to succeed in an environment like that.
Won’t upgrading the hardware save time and money in the long run?
It depends. Before you upgrade hardware, you really need to consider Amdahl’s Law:
Your upgrade will improve performance of only those tasks that use a lot of the component you improved.
A lot of people upgrade CPUs to 2× faster devices, or maybe, as I recently wrote on my blog, they upgrade their disk drives to 100× faster flash drives and then expect their tasks to be 2× or 100× faster, and they’re just not. It’s Amdahl’s Law: When you make your device n times faster, only the component of response time where that device was actually used will be n times faster (and that’s the optimistic case). So if your total response time is only 5% disk I/O to begin with, upgrading to flash drives will improve performance by no more than 5%.
Hardware upgrades don’t always end well. In Optimizing Oracle Performance, Jeff and I describe a now-classic case in which a hardware upgrade actually slowed down the performance of an important business task. It’s like a UFO sighting. A lot of people have claimed that they’ve experienced it, but having the actual data to prove what really happened…that’s another thing entirely. In our book, we have all the data it takes to show exactly how this kind of catastrophe happens. You can read the whole story here.
Here’s another story for you. A couple of years ago, I was teaching a private course at a company. In the part of the course where I was supposed to have convinced everyone that the Method R way of doing things was worth learning about, I still had a lot of students with crossed arms and body language that was generally saying, “I ain’t buying it.”
So I suggested that we should work on an actual problem that the company had recently experienced. That way, I could show off the method in a context that was already familiar to them. They agreed.
They told me that they had recently executed a half-million-dollar hardware upgrade that had doubled a system’s number of CPUs from 20 to 40. But the system throughput was just as lousy on the 40-CPU box as it had been on the 20-CPU box. They wanted to know what could possibly cause this application not to scale up at all in spite of the much greater capacity.
They had actually printed the 7MB trace file for an important task that had taken about 100 seconds to run. The document was about two inches think. They motioned in frustration at this stack of paper and explained that they had already run tkprof on the file, and that hadn’t helped their understanding at all.
The task behind the trace file was a batch job that comprised nearly the total workload on this upgraded machine. They had a 40-CPU machine with 300 concurrent copies of this batch job that they had traced running on it.
I ran our Profiler product on the trace data. The total run time for the Profiler itself was about 7 seconds, which was good given that all eyes were on me while it was running. The output was most excellent.
The #1 response time contributor: unaccounted-for time, almost 60%. I expected that before I even looked at the profile. That’s what you’ll get when you run 300 concurrent batch jobs on a 40-CPU machine. (I wrote about this phenomenon many years ago in a paper called “Batch queue management and the magic of ‘2’.”) CPU service for database calls was #2, at about 13%. I expected that too. Sleeps for “library cache” latch acquisitions was #3, at about 10%. There was more; this was enough.
So I looked at the task’s top time-consuming SQL, and it looked pretty good to me. Only five LIO calls per row manipulated per joined table, that kind of thing.
Their performance problem wasn’t bad SQL or a lack of horsepower, it was an inevitability of how the application was designed. Every one of these 300 identical processes was parsing like crazy inside of Pro*C loops. It looked like tens of thousands of prepare calls from these 300 processes were occurring every minute, because I saw hundreds of them per few-second interval in each of the 300 concurrent jobs.
The prepare calls were all blocking all over each other because they were all trying to prepare the same exact SQL query, all in different sessions, all at the same time. Of course, the one hash chain in the library cache was getting scanned over and over again, and those scans are protected by the chain’s “library cache” latch. On top of all the latch waiting was all the CPU service time consumed by the spinning for latches. And on top of that was the waiting in a CPU-preempted state for CPU service that the majority of the 300 concurrent jobs were spending the majority of their time doing.
This is an example of how a hardware upgrade cannot improve performance of an application whose bottleneck is software serialization. The remedy: stop making so many OCIStmtPrepare calls from the application. At least keep cursors open for reuse within a single session. Better yet, use a three-tier transaction monitor architecture that will allow several processes to share a single cursor. This application should have parsed each of its SQL statements once per instance lifetime, not thousands of times per session lifetime.
What’s next for Method R?
Well, Method R is not just a method anymore, it’s a whole new company now. I formed Method R Corporation earlier this year, and I have some fantastic people with me. We can help you if you have needs pertaining in any way to performance of Oracle-based applications.
A major opportunity for us right now is that Method R has enormous untapped potential in preventive contexts. Today I think that people view Method R dominantly as a performance problem diagnostic method. That’s fair, because that’s what the book is about, and I think that’s how most people who know about Method R have come to know about it.
But the method works for preventing performance problems from ever occurring in the first place. Application developers drastically reduce their performance problem vulnerability when they regularly profile their code at each stage of promotion, from development through unit testing, performance readiness testing, and into production. Profiles—expressed plainly in seconds of response time—should be the standard language that architects, developers, database and system administrators, end users, and managers use to discuss application performance.
As the founder of a new business, I understand the natural desire that a business leader has to outsource any service that’s not part of his core business. It’s why my company outsources our HR and payroll functions to a PEO. It’s why we buy Gmail services from Google and accounting software services from QuickBooks Online. Oracle application performance is the same way. Why on earth would an airline or an entertainment company or a bank or a telco want to learn everything that Method R Corporation knows about Oracle performance?
What I think people want is the luxury of forgetting about software performance. At Method R Corporation, we can provide that luxury. We can track the performance of your applications, the way your users perceive them, through a service that you don’t have to think about all the time. We can let you know in advance when your performance is trending toward danger. We have written the software that makes it really easy for us to do that. Of course, when we help you find (or predict) a problem, we can help you fix it, or if you prefer, we can teach you how to fix it yourself. The goal is the same in either case: genuinely satisfying software performance for the users of your system. That’s what’s next for Method R.
Cars need regular maintenance. Ships need regular maintenance. Planes need regular maintenance. Don’t databases need regular maintenance?
They do, but the analogy is a little more twisty than you might think. Physical machines have parts that wear out. Software doesn’t wear out, but from time to time it can need a human hand to help it adapt to the changes that periodically intrude upon it.
Your database changes all the time. Every insert, update, and delete changes the world in which your database server must function. With Oracle software, you do need to perform some types of regular maintenance in response to those changes.
One type of regular maintenance that databases need is software patching. No software that is as complex as a database manager is perfect; there are defects inside. As your software manufacturer finds and fixes those defects, you’re going to benefit from some of the patches that result.
Another type of regular maintenance that some databases need is gathering of statistics that the optimizer uses to make its decisions about query plan assignments. That gathering process is not yet completely foolproof, and so it requires some human decision-making and periodic attention.
One topic of regular maintenance that I should mention specifically because there is so much discussion about it out there is index rebuilding. That’s an area where a lot of people think their database needs regular maintenance, but it really doesn’t. Richard Foote, Jonathan Lewis, and Tom Kyte have some excellent articles explaining why.
Over time, as Oracle adds more features to the database software, the requirement for more and more kinds of regular maintenance will diminish, and probably eventually disappear. An easy example from history is Oracle’s approach to undo storage. In version 6, I spent a lot of my time as a DBA rebuilding people’s rollback segments. That kind of thing is no longer necessary, because with rollback segments, Oracle Corporation has evolved its software to eliminate the need for so much human intervention.
Help! My database is slow!
“Help! My database is slow!” There’s a secret behind this statement, and it’s monstrously important. Do you know what it is?
It’s this: When a user says a word like database or system, she’s not talking about a database or your system. She’s talking about a task.
Here’s the problem: To a DBA or analyst, a database is a complicated network of interrelated components. The most common metaphor for organizing those components into something comprehensible is the “performance dashboard,” which technical people try to rely on especially heavily when they hear the words database and slow in the same sentence. Dashboards show all sorts of statistics about your system, like CPU utilization, I/O latencies, cache hit ratios, latch miss ratios, and redo generation rates.
Analysts think about this stuff all the time. But users never do.
When Nancy in AP says, “My database is slow,” what she really means is, “When I click here, it takes too long before I get what I need.”
The first step for diagnosing Nancy’s problem is obvious to almost anybody who hasn’t benefited from years of classical Oracle training: Observe Nancy while she executes the task she’s complaining about. Classically trained Oracle analysts tend to consult their dashboards.
Now, maybe something on your dashboard will light up when Nancy does the slow thing. And maybe that light will actually reveal the cause of Nancy’s problem. But sometimes nothing lights up when she does the slow thing, and sometimes the wrong thing lights up when she does the slow thing. In Optimizing Oracle Performance, there’s a story on page 327 of one time when the wrong thing lit up.
Cases of wrong things (or no things) lighting up on dashboards happen more often than you think. I can’t quantify exactly how often, but I can tell you that nearly every dollar I’ve earned since 1995 is the result of people who went the dashboard route and failed. One message I have for those who will listen is that they should stop looking at dashboards.
So, what should you do instead? As I mentioned, you should look at Nancy. In the case where I met my real-life Nancy, we fixed her problem without ever touching a system. It was one of those, “Ah! Don’t do it that way—do it this way” problems. Of course, more often, there’s something legitimately wrong with how the system responds to Nancy’s click, so you’ll need more data to figure out what it is.
The data that you need is profile data: the kind of stuff C programmers get when they use “gcc -pg,” and what you get when you use Oracle’s “10046 Level 12” trace facility. Raw profile data is a detailed log of how code path has consumed time. You usually don’t look directly at raw profile data; you usually look at it through software called a profiler (like “gprof” and my company’s Profiler product), which filters and aggregates profile data in useful ways to help you make quick sense of what took so long.
The thing that surprises most database professionals about profile data is that you really don’t need anything else to solve the vast majority of performance problems. The trick about profiling is that some applications make it difficult to generate good profile data. Oracle makes it easier with every database release, but an application’s ability to generate good profile data about itself—its performance instrumentation—is a key feature that determines how easy it’s going to be to make your “database” fast for all of your users.
I like Tom Kyte’s idea that there are good Hints and there are bad Hints. Good Hints, like FIRST_ROWS and ALL_ROWS, give the Optimizer additional information about your intentions that can help it to make better decisions. Bad Hints, like USE_NL and USE_HASH, constrain the Optimizer from choosing an execution plan that might actually be ideal for you.
Bad Hints are like morphine for your database. They’re just right for a very specific purpose, but if your application needs them to run, it’s a problem.
Bad Hints are just right for experimenting with your SQL to see how different plans perform. Hints let you tell the Optimizer exactly what to do, so you can measure how it acts when it does it.
But habitual bad-Hint use is a bad thing. Especially because since version 9.2, the Oracle Optimizer generally makes excellent decisions based upon the information you (or your DBA) give it. Having the Optimizer is a lot like having a really smart SQL performance expert in your office.
And here’s where I think a lot of people mess up. Imagine that a really smart SQL performance expert is saying that your query’s most efficient execution plan is something you find utterly ridiculous. What would you do? If it really were a smart person in your office, you might at least listen respectfully. But with Oracle, a lot of people just roll their eyes and slam a Hint onto their SQL to constrain the Optimizer from choosing the apparently dumb plan.
The problem is that the Optimizer probably made a smart decision based on the data you gave it. Maybe it chose a stupid plan because you accidentally told it that your 10,000,000-row table has only 20 rows in it. If you just tape your Optimizer’s mouth shut with an INDEX Hint, you may never find the 26 other queries that also use bad plans because of this bad assumption.
So when your Optimizer does something crazy, don’t reach for the Hints; find out why a good decision-maker has made a bad decision. The less you can rely on bad Hints, the less time and effort you’ll have to spend hand-tuning individual SQL statements, and the better your odds will be of having stable performance after your next database upgrade.
You can cure your addiction to bad Hints. Start by visiting the Asktom URL that I listed earlier. The full prescription is to read Jonathan Lewis’s outstanding book Cost-Based Oracle—Fundamentals. It covers a tremendous range of information that will help make the Oracle Optimizer your friend. ▲
Cary Millsap is the founder and president of Method R Corporation, a company devoted to genuinely satisfying software performance. He is widely known in the Oracle community as a speaker, educator, consultant, and writer. He is the author (with Jeff Holt) of Optimizing Oracle Performance, for which he and Jeff were named Oracle Magazine’s 2004 Authors of the Year. He is also a contributor to Oracle Insights: Tales of the Oak Table. Cary is the former vice president of Oracle’s System Performance Group, and a co-founder of Hotsos. Cary is also an Oracle ACE Director and a founding partner of the Oak Table Network, an informal association of Oracle scientists that are well known throughout the Oracle community.
Interview conducted by Iggy Fernandez for the August 2008 issue of the NoCOUG Journal (Click here for the PDF version)