## Problem Statement

While benchmarking different games to see if Broadwell still holds up in early 2022, I had a thought: was I wrong to report the average framerate across all benchmarked games? Are the higher-framerate games contributing more than they should to a simple average of the framerates across all the benchmarks? The more I thought about it, the more convinced I became that I had indeed made a mistake in my analysis.

So then, what sort of measure-of-center should we use when trying to determine the average performance difference between two configurations of hardware?

## Hypothetical Data

For this exploration, I’ve created a basic set of data below that represents three different hardware configurations where three different benchmarks were tested. These numbers represent the average framerates across those benchmarks.

These hypothetical data are a little exaggerated to illustrate the differences between the measures of center. However, this could be something we see if a high framerate game is significantly CPU bound, but two others are less so. That’s not uncommon in gaming benchmarks.

In my original article, I used the arithmetic mean of all the framerates across all benchmarks to report an “overall” relative performance compared to my configuration of interest. Let’s explore why that is probably not the ideal mechanism to compare overall performance of different configurations.

### Arithmetic Mean

The formula below is for the arithmetic mean (or average) of our set of benchmarks, where $T_1$, $T_2$, and $T_3$ represent the average framerates for the benchmark runs of tests 1, 2, and 3.

$$\bar{T} = \frac{T_1 + T_2 + T_3}{3}$$

The dominance of benchmark 3 in the simple average is illustrated well by stacking all of the framerates for the benchmarks for each configuration on top of each other. The stacked bars below represent the numerator in the formula for the arithmetic mean.

Looking at this, we can see that the contributions of tests 1 and 2 are significantly less than that of test 3. However, if we look at the performance of these configurations relative to configuration A, we can see that configurations B and C show large performance differences for tests 1 and 2, but a smaller difference for test 3.

In this case, the arithmetic mean is not illustrating the difference between the configurations as we would expect from this normalized per-game graph. This is a known mathematical behavior of the simple mean or average, and it’s part of the reason why statisticians like to examine more than one measure of center.
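To make that dominance concrete, here is a short Python sketch with made-up framerates (the article's actual numbers aren't reproduced here): one high-framerate benchmark swamps the simple average.

```python
# Illustrative framerates (not the article's actual data): three configs,
# three benchmarks, with test 3 far faster than tests 1 and 2.
fps = {
    "A": [60, 45, 300],
    "B": [90, 70, 330],
    "C": [120, 90, 350],
}

for config, rates in fps.items():
    mean = sum(rates) / len(rates)
    share = rates[2] / sum(rates)  # test 3's share of the numerator
    print(f"{config}: arithmetic mean = {mean:.1f} fps, test 3 share = {share:.0%}")
```

Test 3 contributes well over half of each configuration's total, so movement in tests 1 and 2 barely registers in the simple mean.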

## Alternative Measures-of-Center

### Geometric Mean

First, we’ll look at the geometric mean. From Wikipedia:

> The geometric mean is often used for a set of numbers whose values are meant to be multiplied together or are exponential in nature, such as a set of growth figures: values of the human population or interest rates of a financial investment over time. It also applies to benchmarking, where it is particularly useful for computing means of speedup ratios: since the mean of 0.5x (half as fast) and 2x (twice as fast) will be 1 (i.e., no speedup overall).

https://en.wikipedia.org/wiki/Geometric_mean

This is a good candidate for a measure-of-center for gaming benchmarks, as some games can run at over 400 frames per second (e.g. CS:GO), whereas some still struggle to exceed 60 fps on the fastest hardware available.
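As a quick sketch (again with made-up numbers: one roughly 400 fps game alongside two slower ones), Python's `statistics` module can compute both means directly:

```python
import statistics

# Illustrative framerates: one very fast game and two slower ones.
fps = [400, 60, 45]

print(f"arithmetic mean: {statistics.mean(fps):.1f} fps")
print(f"geometric mean:  {statistics.geometric_mean(fps):.1f} fps")
```

Here the arithmetic mean lands around 168 fps, while the geometric mean sits near 103 fps, much closer to the slower games' territory.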

There is plenty of coverage of the geometric mean across reputable sites on the internet, so I won’t cover it in detail here. In addition to the geometric mean, I would like to propose another measure that may be more intuitive for readers. The geometric mean is very useful, but is not intuitive for some who haven’t learned about it previously.

### Average of Normalized Performance

I would propose this: if you wanted to get the average performance difference among different configurations across multiple disparate benchmarks, you could simply take the average framerates, normalize them against your configuration-of-interest, and then average the normalized values together to get your “average performance difference” or “average speedup” measurement.

$$\text{Average speedup} = \frac{1}{3}\sum_{i=1}^{3}\frac{T_i'}{T_i}$$

Where $T_i$ is the test value for the configuration-of-interest that we're using to normalize the results, and $T_i'$ is the corresponding value for the configuration being compared. The formula isn't the prettiest thing in the world, but I think the written explanation of it is more intuitive than the geometric mean. Below is the chart with all three of the discussed statistics together, plus an extra normalized version of the geometric mean.
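A minimal sketch of that procedure in Python, using made-up framerates with configuration A as the configuration-of-interest:

```python
# Illustrative framerates (not the article's data); A is the
# configuration-of-interest used to normalize every test.
fps = {"A": [60, 45, 300], "B": [90, 70, 330], "C": [120, 90, 350]}
baseline = fps["A"]

for config, rates in fps.items():
    # Normalize each test against A, then average the ratios.
    ratios = [r / b for r, b in zip(rates, baseline)]
    avg_speedup = sum(ratios) / len(ratios)
    print(f"{config}: average normalized performance = {avg_speedup:.2f}x")
```

Configuration A comes out at exactly 1.00x by construction, and the two slower tests now carry the same weight as the high-framerate test 3.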

As you’ll note, the normalized geometric mean column is identical to the geometric mean. I’ve checked this multiple times, and I believe it to be a result of how the geometric mean is calculated: normalizing the data before taking the geometric mean gives the same result as normalizing the geometric means afterward. Since it’s the same here, I don’t show it in any subsequent charts.
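That identity is easy to verify numerically. With made-up framerates for a configuration B and a baseline A, the geometric mean of the per-test ratios equals the ratio of the two geometric means:

```python
import math

def gmean(values):
    return math.prod(values) ** (1 / len(values))

# Illustrative framerates for configuration B and baseline A.
b = [90, 70, 330]
a = [60, 45, 300]

ratio_of_tests = gmean([x / y for x, y in zip(b, a)])  # normalize first
ratio_of_means = gmean(b) / gmean(a)                   # normalize last
assert math.isclose(ratio_of_tests, ratio_of_means)
print(ratio_of_tests, ratio_of_means)  # identical up to floating-point error
```

Algebraically, this holds because the geometric mean turns a product of ratios into a ratio of products, so the order of normalization doesn't matter.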

This new measure-of-center comes out close to the geometric mean. The math indicates that this is not the geometric mean, but another measure-of-center for our particular use case. I personally think this is more useful for the kind of benchmarking I’ve been doing, as it’s easier to communicate what is actually being measured.

### Which is best?

As noted in the quote from Wikipedia, the arithmetic mean can be misleading when used with benchmarks. The arithmetic mean of 0.5 and 2 is 1.25, whereas the geometric mean of the same is 1. As a reader, which would you expect? What about the case where you have one configuration that’s 0.8x in one benchmark and 1.2x in another? The arithmetic mean for that is 1, and the geometric mean is 0.98. Which is more useful or intuitive to the reader? Or are they close enough that it doesn’t make a large enough difference to worry about?
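Both pairs can be checked in a couple of lines of Python:

```python
import statistics

# The two pairs of speedup ratios discussed above.
for pair in ([0.5, 2.0], [0.8, 1.2]):
    arith = statistics.mean(pair)
    geo = statistics.geometric_mean(pair)
    print(f"{pair}: arithmetic = {arith:.2f}, geometric = {geo:.2f}")
```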

It’s impossible to say, as each reader will have their own opinion. I will say that the geometric mean deviates more from what a reader might “expect” when we’re talking about small differences. However, when the differences are large, the geometric mean gives the more “expected” result.

## Redoing the Broadwell Results

I went back and re-did the calculations for the fast vs slow RAM comparison in my Broadwell Part 2 investigation. Happily, the different measures of center are very close to one another, so none of my conclusions from the original investigation are invalid. Going forward, I’ll use the geometric mean or average of normalized performance though, as they are more representative of the differences I’m typically interested in.

## Conclusion

If you want to consolidate performance differences into a single number, it’s best to use something like the geometric mean or the proposed average of normalized performance. The arithmetic mean discounts the contributions of games that run at lower framerates too much and over-emphasizes the contributions of the higher-framerate games.

I like the average of normalized performance that I’ve proposed because it’s easy to explain to a reader. However, the geometric mean will get you similar results, while handling the 0.5x/2x case that was outlined earlier as an extreme. It may take more effort to explain, but you can also leave the explanation as an exercise for the reader.

Additionally, the data were **very** slightly misrepresented in the previous Broadwell exploration since the arithmetic mean was used as opposed to the geometric mean. The difference isn’t large, but it is there. In a larger exploration, with more variation in results, the effect may have been more pronounced.

This whole exploration is making me wonder if I should have a method for site visitors to pick which games they want averaged together in a measure-of-center so that the measure is representative of the games they play, and meaningful to them as an individual. Ideally, that’s what everyone would do, but that’s not the reality of running tech websites these days.

## More Complications: Modalities

I didn’t touch on this in the above discussion of measures of center, but imagine this: you’re benchmarking 20 games across different graphics cards. In 10, AMD takes approximately a 10% penalty to performance compared to NVIDIA, and in the rest, AMD is approximately 10% ahead of NVIDIA. The proposed measures-of-center here will not show that; they will show that AMD and NVIDIA are more or less equivalent. In this scenario the user should figure out what games are important to them and look at just those games in order to form a perception of value of one graphics card over another.
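A sketch of that bimodal scenario with made-up ratios (ten games at roughly 0.9x and ten at 1.1x) shows both means collapsing the two clusters into a single uninformative number:

```python
import statistics

# Hypothetical AMD-vs-NVIDIA speedup ratios across 20 games:
# AMD ~10% behind in ten games, ~10% ahead in the other ten.
ratios = [0.9] * 10 + [1.1] * 10

print(f"arithmetic mean: {statistics.mean(ratios):.2f}")
print(f"geometric mean:  {statistics.geometric_mean(ratios):.2f}")
# Both land near 1.0 and hide the two clusters entirely.
```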

No measure of center I’m aware of can effectively capture modalities in a single value. Pairing the arithmetic mean with the standard deviation, or using box-and-whisker plots with percentiles, can give you a better idea, but even that is imperfect and doesn’t actually reveal the modality described above.

If you **must** consolidate your conclusions or findings into a single value to compare configurations, you should use either the geometric mean or the proposed average of normalized performance.
