The Uses and Limitations of Database Benchmarking

(This article appeared in the March 2001 issue of Oracle magazine.)


Database benchmarks abound. What do they mean and how well do they address real-world performance questions?

The human desire to measure things is as old as civilization itself. In 3000 B.C. Egypt, measurement of the cubit was so accurate that the pyramids were built within .005 percent of geometric perfection. Five thousand years ago, the Mayans had developed a calendar that precisely accounted for leap years. Chinese astronomical “Oracle Bones” from 1302 B.C. were used by NASA to determine that the length of a day was 47/1000ths of a second shorter then than it is now. (Oracle Bones is NASA’s name, not ours.)

But measuring anything can be fraught with subjectivity and politics. Take the precise Egyptian cubit: It was based on the distance from Pharaoh Khufu’s elbow to his fingertip. Our obsession with measuring continues to this day, but now we measure distances between stars and the weight of subatomic particles. And, of course, in the database industry we measure performance. We want to know how fast a database is and how much it costs to run so we can determine which one is the best value.

Database benchmarking attempts to measure these and other factors. But as with any sort of measurement, the challenge is to devise a test that’s accurate and fair—and that gives truly useful numbers. Sometimes the process seems as complicated and difficult as building the pyramids.


The first database “speed marks”—they can’t really be called “benchmarks”—were provided by the database and systems manufacturers themselves. Each manufacturer measured performance in its own way, so figures from one weren’t comparable to figures from another.

In response, the Transaction Processing Performance Council (TPC), a group of hardware vendors and database companies, formed in the late 1980s to give database benchmarking third-party objectivity. The TPC benchmarks have evolved and multiplied over time, but they’ve always provided two figures: the transaction rate of a database, and the cost per transaction including hardware, software, and maintenance.

The benchmarks are based on a complicated but mutually agreed-upon specification. (The specs are available on the TPC Web site.) The final numbers are fully disclosed and audited. For a few years, the TPC benchmarks multiplied and thrived; in 1999, a record 55 TPC-C benchmarks were published.

But the number of published TPC benchmarks is on the decline. “There’s about half as many benchmarks being produced now as there were just a couple of years ago,” according to Jim Enright, Oracle’s director of performance product management. There are several reasons for this decline.

Politics has played a role. On the TPC Web site, Kim Shanley writes that “the TPC’s history is about both benchmark law and benchmark order.” That is to say, it’s about creating benchmarks and making sure they’re applied fairly and used appropriately. Even audited, third-party figures can be used for “benchmarketing”—using benchmark numbers to boost marketing claims.

Another problem is that the benchmarks haven’t kept pace with technology changes. Benchmark specifications take years to create, and benchmark tests take more years (and often millions of dollars) to run, verify, and publish. Software and hardware are often upgraded before benchmarks of older versions can be completed.

The first TPC-C benchmark, published in September 1992, was 54 tpmC (transactions per minute on TPC-C) at $3,483 per tpmC. In November 2000, a new record—220,000 tpmC at $43.30 per tpmC—was set by an Oracle database running on an IBM UNIX server. Because they seem so concrete, the TPC numbers can be very appealing. But a company can’t simply multiply transaction speed by cost per transaction and get meaningful numbers to plug in to a budget. “You need to look at those numbers relatively; you can’t look at them in absolute terms, because they’re not realistic,” says Carl Olofson, an analyst with research firm IDC. “You can’t really expect that your cost per transaction is going to be so many cents. It’s just not going to be right.” For example, while TPC benchmarks include maintenance costs for five years, they don’t factor in labor costs for staff DBAs, implementation delays, or reliability issues—the sorts of costs that can vary tremendously from site to site. And the systems they were built to measure are dramatically different from those being manufactured today.
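To see why the two published figures are better read relatively than absolutely, here is a minimal sketch of the arithmetic, using only the numbers quoted above. It assumes the usual TPC definition of price/performance (the total priced configuration divided by throughput), so multiplying the two figures simply recovers the price of the benchmarked system, not a budget line for any particular site. The function and variable names are illustrative, not part of any TPC tool.

```python
# Figures quoted in the article: (throughput in tpmC, price in dollars per tpmC).
FIRST_1992 = (54, 3483.00)
RECORD_2000 = (220_000, 43.30)


def implied_system_cost(tpmc, price_per_tpmc):
    """Price of the benchmarked configuration implied by the two figures.

    Assumes price/tpmC = total priced configuration / tpmC, so the product
    is just the price of the test rig. It excludes DBA labor, implementation
    delays, and reliability costs, which the article notes are left out.
    """
    return tpmc * price_per_tpmc


def relative_change(old, new):
    """Return (throughput gain, price-per-tpmC reduction) as plain ratios."""
    throughput_gain = new[0] / old[0]
    price_reduction = old[1] / new[1]
    return throughput_gain, price_reduction


for label, result in (("1992", FIRST_1992), ("2000", RECORD_2000)):
    cost = implied_system_cost(*result)
    print(f"{label}: implied benchmark configuration price ${cost:,.0f}")

gain, reduction = relative_change(FIRST_1992, RECORD_2000)
print(f"throughput grew about {gain:,.0f}x; price per tpmC fell about {reduction:.0f}x")
```

Read this way, the 1992 result implies a configuration priced at roughly $188,000 and the 2000 record one priced at roughly $9.5 million. The ratios (throughput up about 4,000-fold, price per tpmC down about 80-fold) are the kind of relative comparison Olofson describes; neither absolute figure says what a transaction will cost in production.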

Richard Sarwal, Oracle’s vice president of server performance, agrees that the databases have evolved faster than the benchmarks. “The problem is that the application that the benchmark is portraying is basically not real anymore. But since the benchmark was established, hardware and software have evolved, while the TPC-C hasn’t,” Sarwal says. “The way the TPC benchmark is run these days is completely flawed.”

According to Sarwal, the way the benchmarks treat single systems versus multiple-node systems is one of benchmarking’s shortcomings. “The whole technology that’s being deployed to make these numbers in the multinode cases is completely bogus,” he says. “It couldn’t be deployed in the real world.” Sarwal thinks the TPC benchmark’s primary value now lies in comparing I/O throughput on single systems.


Although the TPC process and benchmarks are imperfect, Oracle and other vendors continue to participate in them. And like other vendors, Oracle has continued to develop its own tools and methods for measuring and performance-testing its products. Oracle’s Enright says this has resulted in three types of benchmarks he calls “yours, mine, and ours.”

“Ours” benchmarks, such as TPC benchmarks, may be imperfect, but their benefit is that they’re mutually agreed upon by the hardware manufacturers, the software vendors, and the users. (An example is the Oracle Applications Standard Benchmark, at /apps_benchmark/index.html. It falls into this category because its methods are publicly disclosed and it must be conducted by a third party.)

Benchmarks in the “Mine” category are developed by the database software companies themselves. They’re useful for comparing the same database on different hardware, but not for comparing two different databases on the same hardware.

“We’re not just using applications, but a variety of other internal workloads that we’ve either based on the customer’s environment or developed ourselves,” says Sarwal about Oracle’s internal benchmarks. The goal goes beyond measuring raw processing power; recovery time, performance/availability trade-offs, and the day-to-day tasks of DBAs are included to create a more complete model of a database’s true cost of operation. These models are created with input from working DBAs and consultants, in addition to automatic data-collection modules built in to live, working customer databases. “There’s a whole series of different metrics that we collect and work into our internal tests,” says Sarwal. “We want to give people guidance on the kinds of things customers can do to save both time and management resources.”

“Yours” benchmarks are sometimes created for a specific customer to test a particularly challenging operation. “For us to develop a special test, the customer must be large and have a technically interesting problem,” says Enright. “We’re not necessarily interested in doing another test if it’s not going to prove anything different or drive us to see the deficiency in a product area in a way that helps us to improve it.”

“It’s inherently hard to model your system accurately,” Sarwal says in explaining why these benchmarks are problematic. “You have to write it before you model it. You’re caught in a Catch-22. When we work with customers, instead of trying to figure out the whole application, we try to find a representative sample of what they’re going to do, which would show certain characteristics that are important in scaling and performance.”

The alternative to this sort of representative sampling is to road-test a system. “Given the rate of deployment of applications, for some people that’s really the only way to go,” says Sarwal. “You put up a system that’s the real system, but it’s not anywhere near the peak load that you expect. You basically measure it in real time, and you add hardware, resources, memory, or whatever, to scale it.”


And that scalability has become critical as databases have moved onto the internet. Unfortunately, it’s also an area that many benchmarks can’t measure. Also, a new set of variables arises when databases are exposed to the outside world. “In an internal system, you could predict how many users were going to be logged on and what kinds of transactions they were going to do, and you knew what was going to happen at month’s end and quarter’s end,” says Sarwal. “That’s almost impossible to predict now. You can say ‘I’m going to get a spike in load,’ but you have no idea how big that spike is going to be. Capacity planning becomes a lot harder, if not impossible.”

TPC benchmarks are unable to test for this, according to IDC’s Olofson. “There’s an important factor that’s left out of all TPC-C benchmarks that I’ve ever seen,” he says, “and that’s the effect of concurrent users and resource contention. [The benchmarks] push transactions through serially. They might be running a series of update transactions and doing reads at the same time, but that’s still not the same as doing random, concurrent update transactions and attempting to do queries simultaneously. That’s really where the database technology proves itself.”

The 24/7 demands of the internet also put hard-to-benchmark strains on databases. “You have to integrate functions previously relegated to overnight batch jobs into the business transactions themselves, which is going to cause much more intensive database activity,” says Olofson. “So the system overall becomes much, much more stressed and much less predictable.”


But because conventional benchmarks don’t do a good job of measuring these stressed, unpredictable systems, companies shopping for a database usually look to their peers—and their competitors—for additional guidance. “They tend to look at whatever case studies they can find,” Olofson says of database buyers. “They look at user experiences and anything else that gives a sense of how these things compare in the real world.”

That sort of information often comes from total cost of ownership (TCO) studies. In addition to looking at database cost and performance, TCO studies include a host of other factors, including costs for implementation, training, upgrading, and downtime.

Unlike TPC benchmarks, TCO studies attempt to give solid numbers that companies can use when building budgets. “I would say you can use our numbers in financial planning,” says Peter Cunningham, president of Input, a Washington, D.C.-based market service and research company that does TCO studies. “We’re trying to address how much it will cost you as an organization if you go down a particular path.”

Rather than building and studying model systems, Input generates TCO numbers that come from surveying statistically significant numbers of actual product users. “These are not theoretical installations or laboratory test installations,” according to Cunningham. “These are real, live environments that people are using. And we’re not measuring; we’re asking the respondents what their measurements are.”

Cunningham thinks that last point is an essential one, especially when attempting to convert “soft” terms like “easy” or “better” into hard numbers. “If you ask about ease-of-use and [respondents] say ‘We went from one support person per twenty to one person per hundred, and our user satisfaction level went from 3.3 to 4.2,’ then you’ve taken something that’s soft and converted it to data,” says Cunningham. “We always try to move from something that’s emotive or soft—like ‘better,’ ‘best,’ ‘beautiful,’ ‘easy,’ ‘hard,’ those kinds of emotive words—toward quantifiable measurement.”

While TCO studies benefit database buyers, they’re also useful to vendors who want to know if product improvements deliver bottom-line value to customers. “TCO studies are useful not just to compare one product with another,” says Sara Wells, a senior analyst at Input. “They’re also useful to track the continued improvement in performance over time for a given product.”

Companies like Input have to earn the cooperation—and trust—of working IT professionals to do TCO work properly. “Everybody is surveying left, right, and center these days,” says Cunningham. “One of the things we do whenever we do research is we share the results with the people who contribute to it. Our attitude has always been not to be just a taker, but to be a sharer.” Cunningham thinks this practice improves not only the level of professional cooperation but also the accuracy of the data. “Once they’re comfortable with the concept, they want to be straightforward and give good data because they want to get good data back,” he says.

By surveying hundreds of professionals at working sites, TCO studies are able to include factors beyond the basics, including costs related to labor and education. “If a company’s DBAs, who are trained in software and implementation, can reuse those skills without an enormous amount of additional training for other projects, that can impact the cost,” Input’s Wells says. “The ability to reuse skills or add onto existing skills can be a factor in total cost of ownership.”

Business recovery—which is so important when a database is tied to e-commerce—is another cost that TCO studies can measure, according to Wells. “What’s the cost if the database is not available?” she asks. “What are the costs in terms of lost revenues and lost work time?” Wells says that TCO studies should answer these critical questions. “Those are strong differentiators in the internet age,” she adds.

This is another area where traditional benchmarks fall short and where modern metrics, such as TCO studies, continue to evolve. Because unlike the pyramids, the database business can’t stand still as time marches on. Instead, the industry—suppliers and users—needs to help the measuring tools keep pace as the products come, go, and change.

Tape measure image by Laney Powell.