## Unusual Statistical Phenomena, Part II: Stat Testing of Percentages

Sometimes when looking at the results from survey data, we see something that makes us say ‘huh?’ or ‘that doesn’t look right’. When the odd results persist after verifying the data were processed correctly (always a good practice), there is typically still a logical answer that can be uncovered after doing some digging. Sometimes the answer lies with something that we will call ‘unusual statistical phenomena.’ This is part 2 of a series that will look at some of these interesting – or confounding – effects that do pop up now and then in real survey research data.

This time we will look at an unusual phenomenon that can occur when doing something typically considered fairly mundane: testing for statistical significance between percentages. An example will help to illustrate this phenomenon, which periodically causes us to question stat testing results.

Let’s say we have fielded the same survey for two different brands. One part of the survey collects respondent opinions of the test brand using a battery of attribute statements with a 5-point agreement scale. The base size for each survey was 300.

Stat testing was conducted between results for the two brands for Top Box percentages on each of the attribute statements. However, some of the results look questionable. Specifically, for the attribute “Is Unique and Different” Brand B’s score was higher than Brand A’s by 4 percentage points, which was statistically significant at the 90% confidence level (denoted by the “A” in the chart below), while for the attribute “Is a Brand I Can Trust” Brand B’s score was higher than Brand A’s by __6 percentage points__, which was **NOT** statistically significant at the 90% confidence level. How could this be?

How can a difference of 4 points be statistically significant while a difference of 6 points is not, even with the same base sizes? To understand how this can happen, let’s first look at the basics of how a statistical test for comparing percentages works.

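First, a t-value is computed according to this formula:

$$
t = \frac{p_B - p_A}{SE_{\text{diff}}}
$$

Here $p_A$ and $p_B$ are the two percentages being compared, and $SE_{\text{diff}}$ is the Standard Error of the Difference between them.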

Then this t-value is compared to a critical value. If the t-value exceeds the critical value then we say that the difference between the percentages is statistically significant. The critical value is based on the chosen confidence level and the base sizes of the samples from which the percentages were derived.
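The comparison against a critical value can be sketched in a few lines of Python. This is a minimal illustration, not the exact procedure used by any particular tabulation package: it assumes a two-sided test and uses the normal approximation, which is very close to the t distribution at base sizes in the hundreds. The function names `critical_value` and `is_significant` are made up for this example.

```python
from statistics import NormalDist


def critical_value(confidence: float) -> float:
    """Two-sided critical value under a normal approximation (an assumption)."""
    alpha = 1.0 - confidence
    # Split alpha across both tails and look up the matching quantile.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def is_significant(t_value: float, confidence: float = 0.90) -> bool:
    """True when the absolute t-value exceeds the critical value."""
    return abs(t_value) > critical_value(confidence)


print(round(critical_value(0.90), 3))  # 1.645
```

With large samples, the base sizes have almost no effect on the critical value, which is why the confidence level alone is enough for this sketch.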

In our example, we chose the 90% confidence level for both statistical tests and the base sizes are the same, so the critical value for both tests is the same. The difference between the percentages (the numerator of our equation) is what makes the result look anomalous: the difference of 4 led to a t-value that exceeded the critical value, while the larger difference of 6 did not. Since a larger numerator produced a smaller t-value, the denominator must be larger too. Therefore, the explanation must lie with the Standard Error of the Difference.

Let’s next examine what a Standard Error represents. Our surveys were fielded among a sample of the overall population. If we sample among women 18 to 49 in the United States, we will infer that our results are representative of the entire population of interest, which is all women 18 to 49 in the United States. However, it is unlikely that the measures we compute from the sample (such as the percentage that say Brand A “is a brand I can trust”) will be exactly the same as the percentage would be if we could ask everyone in the entire population of interest. There is some uncertainty in the result because we are asking it of only a subset of the population. The Standard Error is a measure of the size of this uncertainty for a given metric.

In our equation, the denominator is the Standard Error of the Difference between the percentages. While not precisely correct, the Standard Error of the Difference can be thought of as the sum of the individual Standard Errors for the two percentages being subtracted (the actual value will be somewhat less due to taking squares and square roots). As the graph below illustrates, the Standard Error for a percentage is a function not only of the sample size, but also of the size of the percentage itself.
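For readers who want the algebra, the usual textbook formulas are shown here. These are an assumption about the form of the calculation, since the exact variant used by a given stat testing package (pooled or unpooled, for instance) can differ slightly. With $p$ expressed as a proportion between 0 and 1 and $n$ as the base size, the Standard Error in percentage points is:

$$
SE(p) = 100 \sqrt{\frac{p\,(1 - p)}{n}}
$$

and the Standard Error of the Difference between two independent percentages is:

$$
SE_{\text{diff}} = \sqrt{SE(p_A)^2 + SE(p_B)^2} \;\le\; SE(p_A) + SE(p_B)
$$

The inequality is why the simple sum overstates the true value. The $p\,(1 - p)$ term is what ties the Standard Error to the size of the percentage itself.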

Specifically, for any given sample size the Standard Error is largest for values around 50% and decreases as values approach either 0% or 100%. For a base size of 100 (the dark blue line), the Standard Error is close to 5 for percentages near 50%, but decreases to close to 2 for very small or very large percentages. You can think about this as it being harder to estimate the percent incidence of a characteristic of a population when around half the population has that characteristic versus when almost all (or almost none) of the population has that characteristic.
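The shape of that curve is easy to reproduce. The sketch below assumes the standard formula for the Standard Error of a percentage, the square root of p(1 - p)/n, and the helper name `standard_error_points` is invented for this illustration.

```python
import math


def standard_error_points(percent: float, n: int) -> float:
    """Standard Error of a percentage, in percentage points."""
    p = percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Base size of 100: the Standard Error peaks at 50% and shrinks toward the extremes.
for percent in (5, 10, 25, 50, 75, 90, 95):
    se = standard_error_points(percent, 100)
    print(f"{percent:>3}%  SE = {se:.2f}")
```

At a base size of 100 this gives 5.00 at 50% and about 2.18 at 5% or 95%, in line with the values described above.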

In our example, the percentages for Is a Brand I Can Trust are close to 50%, so at a base size of 300 the individual Standard Errors would each be a little under 3. In contrast, the percentages for Is Unique and Different are around 10%, so at a base size of 300 the Standard Errors would each be roughly 1.5 to 1.7. That’s a big difference!

It follows that the Standard Error of the Difference for Is a Brand I Can Trust would be much larger than for Is Unique and Different. In fact, the actual values are 4.08 for Is a Brand I Can Trust and 2.34 for Is Unique and Different. Again, a big difference. If we divide the differences in the percentages by these values for Standard Error of the Difference, we get t-values of 1.47 (6 divided by 4.08) and 1.71 (4 divided by 2.34), respectively. Given that the critical value is approximately 1.65, we see that the t-value for the difference of 6 is below the critical value (hence not statistically significant), while the t-value for the difference of 4 is above the critical value (hence statistically significant).
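The whole calculation can be checked with a short script. Two things here are assumptions rather than facts from the survey: the Top Box percentages (50% vs. 56% and 7% vs. 11%) are hypothetical values chosen to match "close to 50%", "around 10%" and the stated 6- and 4-point gaps, and the pooled form of the Standard Error of the Difference is one common variant that happens to reproduce the figures quoted above. The function names are invented for this sketch.

```python
import math


def se_of_difference(pct_a: float, pct_b: float, n_a: int, n_b: int) -> float:
    """Pooled Standard Error of the Difference, in percentage points (assumed variant)."""
    p_a, p_b = pct_a / 100.0, pct_b / 100.0
    pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    return 100.0 * math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))


def t_value(pct_a: float, pct_b: float, n_a: int, n_b: int) -> float:
    """Difference in percentage points divided by its Standard Error."""
    return (pct_b - pct_a) / se_of_difference(pct_a, pct_b, n_a, n_b)


CRITICAL_VALUE = 1.65  # approximate 90% critical value quoted in the text

# Hypothetical (Brand A, Brand B) Top Box percentages; not taken from the survey.
examples = {
    "Is a Brand I Can Trust": (50.0, 56.0),
    "Is Unique and Different": (7.0, 11.0),
}

for attribute, (brand_a, brand_b) in examples.items():
    se = se_of_difference(brand_a, brand_b, 300, 300)
    t = t_value(brand_a, brand_b, 300, 300)
    verdict = "significant" if abs(t) > CRITICAL_VALUE else "not significant"
    print(f"{attribute}: diff={brand_b - brand_a:.0f}, SE={se:.2f}, t={t:.2f}, {verdict}")
```

Under these assumed inputs the script yields Standard Errors of 4.08 and 2.34 and t-values of 1.47 and 1.71, so the 6-point gap falls short of the critical value while the 4-point gap clears it.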

Hopefully this takes some of the mystery out of stat testing and helps in understanding why what can appear to be anomalous results may actually be correct.