Monday, June 1, 2009

Data + Models == Insight

Al Bundy, of the TV show Married with Children, understood it, and performance engineers should too. What am I talking about? The theme music for that show is the tune "Love and Marriage" as sung by Frank Sinatra. Just like the song says about love and marriage, so it is with measurements and models ... You can't have one without the other.

Michelson and Measurement
In my acceptance speech for the A.A. Michelson award at CMG'08, I focused on the work of Albert Michelson and tried to convey why I thought it was a good role model for how we should do computer performance analysis today. You can see my entire rant on video.

The 'M' in CMG stands for "measurement" and that's what Michelson was good at, especially when it came to measuring light. He was an experimental, or applied, physicist and, among his many optical devices, he designed a measurement apparatus to detect the presence of the aether, which was presumed to exist as a direct consequence of Maxwell's theory of electromagnetic radiation, developed in the 1860s. Maxwell's equations are equations for wave motion; very special wave motion, it turns out. In particular, Maxwell realized that a necessary constant (c) that appeared in his differential equations matched the known speed of light. This led to the stunning conclusion that light is an electromagnetic phenomenon that must propagate as waves. All waves propagate in something: sound in air, ocean waves in the sea, but what about light? The hypothesis was that there was a corresponding medium, called the aether, for light. Since it was presumed to be a physical medium, it should be possible to measure it.

Michelson constructed something called an interferometer (basically an arrangement of mirrors) to detect the presence of this special aether medium. It's a very clever design, which I explain in my talk. The point here is that, after many very careful measurements, the interferometer indicated that no aether medium existed: a null result (arguably the most famous in physics). Since Michelson's apparatus was cleverly designed, in part because it was an incredibly simple construction, there could be no sustainable doubt about the accuracy of his measurements. Although it did have one moving part (the swivel for the 2-ton stone table), nothing could have gone wrong in the apparatus, precisely because it was so simple.

Like most of his peers, Michelson found this null result deeply troubling because he had been brought up with Maxwell's (then still relatively new) theory, which appeared extremely robust and all-encompassing. So much so, that everyone expected the aether would just be there. Since Michelson demonstrated that it wasn't there, and it was not possible to blame his apparatus, the world of physics was left in suspenders (pun intended). What was going on!?

In simple terms, there was no model to explain Michelson's results. Moreover, it took about 20 years to find one, because the existing model (Maxwell's theory) was an integrated edifice that explained such an enormous range of electrical and magnetic phenomena that it was extremely difficult to see what could be wrong with it. Ultimately, great scientists like Lorentz, Poincaré and Einstein started to realize there were some very subtle inconsistencies in Maxwell's theory, and the full resolution is known today as the Special Relativity Theory (SRT). Part of the upshot of SRT is that Maxwell's waves are, in a certain sense, self-propagating, without the need for a medium. Even after the introduction of SRT in 1905, Michelson's experiment, and similar ones, were repeated by others right up to the 1930s, including some reports of what turned out to be false positives. Some say that Michelson continued to believe personally in the aether, although he accepted Einstein's SRT intellectually.

BTW, it's called the Special Relativity Theory, not the Special Relativity Model, because SRT does a lot more than just explain Michelson's measurements. In physics, the term model is usually reserved for an explanation that tends to be isolated from the rest of known physics. A theory, on the other hand, makes experimental predictions that can be tested and tends to integrate with other established theories. SRT integrated with Maxwell's theory of electromagnetism and Newton's theory of mechanics; the two biggies. In fact, SRT is actually a way of correcting Newton's theory so that it is more consistent with Maxwell's theory. Neither Maxwell nor Newton got thrown out of the House of Physics (as it were); they just cohabit better under the rules of SRT.


Performance Measurement
So, we need both measurement and models to come to a full understanding and make progress in physics. What about in the world of IT? There, unfortunately, Al Bundy tends to act more like, well ... the usual Al Bundy. In order to remind the Al Bundys of the IT world what it's like to be married with children, I want to demonstrate that performance measurements alone, e.g., load tests, are not sufficient for proper performance analysis. Just like in physics, when it comes to measurements and models ... You can't have one without the other.

Here, of course, I'm now referring to performance models. The measurement process could be wrong or the measurement tools could be broken. If you don't have a performance model, how would you know? If the load-test numbers are obviously garbage (e.g., negative response times), then it's easy. But what if the generated numbers are not obviously garbage? Then you really can't know whether there's an underlying problem without comparing the data with a performance model. Let's look at an actual example.

Working with Guerrilla graduate Stefan Parvu of Sun Microsystems in Finland, we recently stumbled onto a simple example that highlights the essence of this point very nicely. The context was a JVM-based application where the JMeter throughput measurements looked fine, but they weren't. Originally, our goal was to apply the universal scalability law (USL) to the application-load data that Stefan was collecting from a test rig. But long before we ever got to that level of performance modeling sophistication, the steps for merely setting up the data in Excel had already revealed a problem. Specifically, the data showed efficiencies that were greater than 100%, which is simply not possible.
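
For reference, since the USL comes up again below, here is a minimal sketch of the model we were ultimately aiming to fit. The parameter names are one common notation (they vary between write-ups), and this code is purely illustrative; it is not part of Stefan's test rig.

    def usl_capacity(n, sigma, kappa):
        """Universal scalability law: relative capacity C(N) at a load of n users.
        sigma models contention (serialization) costs and kappa models coherency
        (crosstalk) costs; the symbols for these parameters vary between write-ups."""
        return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))

    # Example: with no contention or coherency cost, scaling is perfectly linear.
    assert usl_capacity(100, 0.0, 0.0) == 100.0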

JMeter Measurements
To get a better idea of this effect, here's a plot of the throughput profile for the JVM-based application as measured with JMeter. Since the throughput data are proprietary, I've normalized them so that they can be used here with permission. For this discussion, we're only interested in the general characteristics and therefore we don't need the absolute values.


As you can see, this throughput profile looks kosher because it conforms to the following expected properties (a small code sketch of the first two checks follows the list):

  1. Monotonically increasing. A sequence of numbers is monotonically increasing if each element in the sequence is larger than its predecessor. Notice further that it appears to become monotonically decreasing beyond N = 300. That's fine because that's precisely the kind of thing the USL is designed to model, but we won't get into that aspect here.

  2. Rises linearly up to N = 200 virtual users.

  3. Reaches saturation around N = 300. This is exactly what we expect for a closed system with a finite number of active requests (as is true for any load-testing or benchmarking system). In this case, the onset of saturation looks rather sudden as indicated by the discontinuity in the gradient ("sharp knee") at N = 200. This is usually a sign of significant internal change in the dynamics of the combined hardware-software system. It could also present complications later when we come to apply the USL model. Once again, we defer that step in this discussion.
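
To make those eyeball checks concrete, here is a minimal sketch of how the first two properties might be tested programmatically. The numbers are made-up placeholders that merely mimic the shape described above; they are not the proprietary JMeter measurements, and the 10% linearity tolerance is an arbitrary assumption.

    # Placeholder load points and normalized throughputs (NOT the real data).
    N = [1, 5, 10, 50, 100, 150, 200, 250, 300, 350]        # virtual users
    X = [1.0, 5.0, 10.0, 49.0, 97.0, 144.0, 190.0, 205.0, 210.0, 208.0]

    peak = X.index(max(X))

    # Property 1: throughput rises monotonically up to the peak load point.
    rising = all(b > a for a, b in zip(X[:peak], X[1:peak + 1]))
    print("Monotonically increasing up to N =", N[peak], ":", rising)

    # Property 2: roughly linear rise at low load, i.e. X(N)/N stays within
    # 10% of the single-user throughput X(1).
    linear = all(abs(x / n - X[0]) / X[0] <= 0.10 for n, x in zip(N, X) if n <= 200)
    print("Approximately linear up to N = 200:", linear)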


Excel Spreadsheet
The following table shows the quantities as they would appear in your Excel spreadsheet when you are setting up to apply regression analysis with the USL model. The column containing the actual throughput data has been suppressed for the reasons given earlier.
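
As a rough illustration of how those columns are derived from the raw measurements, here is a small sketch. The throughput values are placeholders concocted to mimic the pathology described next (super-linear capacity at mid load and roughly 60% efficiency at N = 350); they are not the actual data.

    # Placeholder throughputs X(N), chosen to mimic the problem described below.
    N = [1, 5, 10, 50, 100, 150, 200, 250, 300, 350]
    X = [1.0, 5.2, 10.5, 52.0, 103.0, 152.0, 190.0, 205.0, 210.0, 208.0]

    X1 = X[0]                     # single-user throughput, X(1)
    print(f"{'N':>4} {'C = X(N)/X(1)':>14} {'Eff = C/N':>10} {'L = N/C - 1':>12}")
    for n, x in zip(N, X):
        C   = x / X1              # column 2: relative capacity
        eff = C / n               # column 3: efficiency; should never exceed 1.0
        dev = n / C - 1.0         # column 4: deviation from linearity; should be >= 0
        print(f"{n:>4} {C:>14.2f} {eff:>10.2%} {dev:>12.3f}")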


But wait. Look at the 3rd column. We have a problem, Houston! Between N = 5 and N = 150 vusers, the efficiency values are bigger than 100%. This is patently absurd because you simply cannot have more than 100% of anything. It's even easier to see the problem if we plot it out.

Efficiency per User
The N = 1 efficiency in the 3rd column of the Excel spreadsheet corresponds to 100%, and that is exactly what we expect. Similarly, at the final value of N = 350 vusers, the efficiency of the system has fallen to 60% of the N = 1 value. In between, however, we have some efficiencies that are in excess of 100%.



If we plot these values separately, we see that except for the first data point, all the other data points in the red area are illegal. Something is wrong. To see how the efficiency curve should look, check my earlier blog post.

You can see why some efficiencies exceed 100% by looking at the 2nd column (the relative capacity) in the Excel spreadsheet. If you get 1 unit of capacity at N = 1, but get more than 5 capacity units at N = 5, and more than 10 capacity units at N = 10, etc., then you have a lot of explaining to do!
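
This check is easy to automate before any USL fitting is attempted. The helper below is a hypothetical sketch (the function name is mine), reusing the placeholder arrays from the earlier sketch; it flags any load point where the relative capacity exceeds N, which is the same thing as an efficiency greater than 100%.

    # Hypothetical helper: list load points where C(N) = X(N)/X(1) > N,
    # i.e. where the efficiency C(N)/N exceeds 100%.
    def superlinear_points(N, X):
        X1 = X[0]
        return [n for n, x in zip(N, X) if x / X1 > n]

    # With the placeholder arrays above, this reports [5, 10, 50, 100, 150].
    print(superlinear_points(N, X))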


Deviation from Linearity
On the other hand, if we were to ignore this problem and simply press on regardless, we would run into trouble in the 4th column with negative values for the deviation from linearity. Once again, this is easier to see if we plot it out; I use the notation L(N) = (N/C) − 1 on the y-axis. The USL model requires that all these values be non-negative.
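
A negative L(N) is just the efficiency problem viewed from the other side: C(N) > N implies N/C(N) − 1 < 0. So a sketch like the following (again a hypothetical helper, reusing the placeholder arrays) flags exactly the same load points before any regression is run.

    # Hypothetical helper: the USL regression needs y = N/C(N) - 1 >= 0 everywhere.
    def negative_deviations(N, X):
        X1 = X[0]
        return [n for n, x in zip(N, X) if n / (x / X1) - 1.0 < 0]

    # With the placeholder arrays above, this also reports [5, 10, 50, 100, 150].
    print(negative_deviations(N, X))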


Except for the first data point, all the other data points in the red area are negative and therefore illegal. None of this is obvious from simply staring at the original throughput profile, because the deviations are relatively small and cannot be detected visually. Moreover, unlike Michelson's simple interferometer, there are plenty of things that can go wrong with JMeter and the whole test rig.

What's Going On?
So, what's causing the problem? At this juncture, we actually don't know, but that's ok. Eventually it will be understood. What we do know at this point is that something is broken in the measurement process and it has to be resolved or any further load testing is simply a waste of time and resources.

Note that I am saying it's the data that are broken, not the model. The model, at this stage of the game, is nothing more than a set of ratios whose physical meaning is very clear and unambiguous. We haven't even gotten to the more sophisticated USL model yet, so it cannot be the model. That leaves only the data-generation process, the performance measurement tools, possibly some previously unknown pathological behavior in the application, or a combination of all of these. It's the converse of the situation that Michelson faced. There, he had a complicated theory and simple measurements that could not be reconciled. Here, we have a simple model and complicated measurements that cannot be reconciled (so far).

But if we hadn't been forced to calculate the efficiencies for the USL model, we might have remained blissfully unaware that the JMeter data are erroneous until the application exhibited scalability problems after being deployed into production. We can only gain this kind of prescient insight by combining the data with a model. In other words:

Data + Models == Insight

As Al Bundy recognized (and the song says), you can't have one without the other. Well, you can, but you shouldn't when it comes to performance analysis!

Footnote: Niklaus Wirth (inventor of Pascal) wrote a book entitled Algorithms + Data Structures = Programs, and Richard Hamming (of Hamming distance fame) said, "The purpose of computing is insight, not numbers." When it comes to computer performance analysis, I claim: Data + Models == Insight, where I've used the programming symbol == for equality, rather than the more mathematical symbol =, which usually means assignment in programming languages.

2 comments:

metasoft said...

this blog entry needs to add to your future classes!

Neil Gunther said...

Not to mention my forthcoming book. ;-)