Assessing the Utility of MSW’s Insight Rabbit Copy Testing Scores in Predictive Analytics: A Validation Case Study

February 2nd, 2022

Copy testing has been used by advertisers for decades to assess the quality of advertising copy. MSW’s TouchPoint™ copy testing system has been extensively validated, showing that test results on key metrics are predictive of subsequent sales results from airing the tested advertising. A partnership between MSW and predictive marketing analytics firm Keen set out to assess whether test scores from MSW’s Insight Rabbit DIY copy-testing platform can improve predictions from Keen’s MIDA decision support system.

The MIDA (Marketing Investment Decision Analysis) platform is designed to help marketers decide how to invest in marketing activities. MIDA users can develop optimized investment scenarios to meet specific business objectives such as hitting revenue targets or meeting budget constraints. It does this by applying a Bayesian modeling approach to a wide range of a brand’s historical marketing and performance data.

Could the use of copy quality metrics improve forecasting of business outcomes and hence be used as an input to MIDA to improve the allocation of marketing dollars? To address this question, new advertising for a major packaged food brand was selected. This brand had developed two different campaigns with different communication objectives that tied back to the brand’s strategy. The brand intended to air both campaigns concurrently.

The television ads developed for each of the two campaigns were tested using MSW Research’s Insight Rabbit Pulse Lite copy testing solution. Results are shown in the graph below.

Both ads were adequate in terms of the secondary Break Through metric, which assesses the degree to which an ad leaves viewers with a memorable and branded impression. However, Copy A scored much higher on the CCPersuasion™ metric, which assesses the degree to which the ad positively influences preference for the advertised brand. Prior validation studies have shown CCPersuasion to be the strongest predictor of an ad’s selling power. Copy A scored significantly above the Fair Share benchmark with an index of 161, suggesting it is a very strong piece of copy. On the other hand, Copy B indexed 113 versus the norm and would be considered slightly above average at best.

Historical performance of the brand’s television investment was measured in MIDA to quantify the expected returns on investment for an average (or benchmark) ad for the brand. Then an initial forecast was developed before the start of the campaign using this historical performance enhanced by the MSW copy test results along with planned media delivery levels.

After the campaign had been running for six months, MIDA was updated with actual sales and television campaign delivery data. As seen below, the ROI for Copy A was approximately 90% higher than would have been expected from the historical benchmark ad performance level. On the other hand, Copy B’s ROI was only about 5% higher than the benchmark expectation.

This actual performance was extremely consistent with the copy test results, which suggested that Copy A was a very strong ad and that Copy B was slightly above average. This result illustrates the utility of MSW copy test scores in the a priori forecasting of investment levels through integration with decision support systems. The integration of MSW copy test scores with a decision support system like MIDA would help steer marketing dollars toward more deserving initiatives, improve forecasts and bolster in-market effectiveness of brands’ marketing programs.


Unusual Statistical Phenomena, Part II: Stat Testing of Percentages

January 24th, 2022

Sometimes when looking at the results from survey data, we see something that makes us say ‘huh?’ or ‘that doesn’t look right’. When the odd results persist after verifying the data were processed correctly (always a good practice), there is typically still a logical answer that can be uncovered after doing some digging. Sometimes the answer lies with something that we will call ‘unusual statistical phenomena.’  This is part 2 of a series that will look at some of these interesting – or confounding – effects that do pop up now and then in real survey research data.

This time we will look at an unusual phenomenon that can occur when doing something typically considered fairly mundane – testing for statistical significance between percentages. An example will help to illustrate this phenomenon which periodically causes us to question stat testing results.

Let’s say we have fielded the same survey for two different brands. One part of the survey collects respondent opinions of the test brand using a battery of attribute statements with a 5-point agreement scale. The base size for each survey was 300.

Stat testing was conducted between results for the two brands for Top Box percentages on each of the attribute statements. However, some of the results are questionable. Specifically, for the attribute “Is Unique and Different” Brand B’s score was higher than Brand A’s by 4 percentage points, which was statistically significant at the 90% confidence level (denoted by the “A” in the chart below); while for the attribute “Is a Brand I Can Trust” Brand B’s score was higher than Brand A’s by 6 percentage points, which was NOT statistically significant at the 90% confidence level. How could this be!

How can a difference of 4 points be statistically significant while a difference of 6 points is not, even with the same base sizes? To understand how this can happen, let’s first look at the basics of how a statistical test for comparing percentages works.

First, a t-value is computed according to this formula:

t-value = (Difference between the two percentages) / (Standard Error of the Difference)

Then this t-value is compared to a critical value. If the t-value exceeds the critical value then we say that the difference between the percentages is statistically significant.  The critical value is based on the chosen confidence level and the base sizes of the samples from which the percentages were derived.

In our example, we chose the 90% confidence level for both statistical tests and the base sizes are the same, so the critical value for both tests is the same. We also know the difference between the percentages (the numerator of our equation) is what appears anomalous as the difference of 4 led to a t-value that exceeded the critical value, while the difference of 6 did not exceed the critical value. Therefore, the issue must lie with the Standard Error of the Difference.

Let’s next examine what a Standard Error represents. Our surveys were fielded among a sample of the overall population. If we sample among women 18 to 49 in the United States, we will infer that our results are representative of the entire population of interest, which is all women 18 to 49 in the United States. However, it is unlikely that the measures we compute from the sample (such as the percentage that say Brand A “is a brand I can trust”) will be exactly the same as the percentage would be if we could ask everyone in the entire population of interest.  There is some uncertainty in the result because we are asking it of only a subset of the population. The Standard Error is a measure of the size of this uncertainty for a given metric.

In our equation, the denominator is the Standard Error of the Difference between the percentages. While not precisely correct, the Standard Error of the Difference can be thought of as the sum of the individual Standard Errors for the two percentages being subtracted (for independent samples, the actual value is the square root of the sum of their squares, which is somewhat less than the simple sum). As the graph below illustrates, the Standard Error for a percentage is a function not only of the sample size, but also of the size of the percentage itself.

Specifically, for any given sample size the Standard Error is largest for values around 50% and decreases as values approach either 0% or 100%. For a base size of 100 (the dark blue line), the Standard Error is close to 5 for percentages near 50%, but decreases to around 2 for very small or very large percentages.  You can think about this as it being harder to estimate the percent incidence of a characteristic of a population when around half the population has that characteristic than when almost all (or almost none) of the population has that characteristic.
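The shape described above can be checked with a few lines of code. This is a minimal sketch that assumes the standard formula for the Standard Error of a percentage from a simple random sample, the square root of p(1 − p)/n; the function name and the percentages printed are chosen for illustration only.

```python
import math


def standard_error_pct(pct, n):
    """Standard Error, in percentage points, of a percentage `pct` (0-100)
    observed in a simple random sample of size `n`.

    Assumes the usual binomial formula sqrt(p * (1 - p) / n).
    """
    p = pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Print the Standard Error at a few percentages for two base sizes.
for n in (100, 300):
    row = ", ".join(
        f"{pct}%: {standard_error_pct(pct, n):.2f}" for pct in (5, 10, 50, 90, 95)
    )
    print(f"n={n} -> {row}")
```

Under this assumption the Standard Error is symmetric around 50%, and it shrinks as the base size grows, which is why the curves for larger samples sit lower.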

In our example, the percentages for Is a Brand I Can Trust are close to 50%, so at a base size of 300 the individual Standard Errors would each be a little under 3. In contrast the percentages for Is Unique and Different are around 10%, so at a base size of 300 the Standard Errors would each be around 1.5.  That’s a big difference!

It follows that the Standard Error of the Difference for Is a Brand I Can Trust would be much larger than for Is Unique and Different. In fact, the actual values are 4.08 for Is a Brand I Can Trust and 2.34 for Is Unique and Different. Again, a big difference. If we divide the differences in the percentages by these values for Standard Error of the Difference, we get t-values of 1.47 and 1.71, respectively. Given the critical value is approximately 1.65, we see that the t-value for the difference of 6 is below the critical value (hence not statistically significant); while the t-value for the difference of 4 is above the critical value (hence is statistically significant).
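The arithmetic above can be sketched in code. The Top Box percentages used here (47% vs. 53% and 7% vs. 11%) are hypothetical values chosen to match the 6-point and 4-point gaps in the example; the actual scores are not stated in the text. The sketch also assumes the unpooled form of the Standard Error of the Difference and a large-sample normal critical value, which may differ slightly from the exact procedure used in the original stat testing.

```python
import math
from statistics import NormalDist


def se_of_difference(pct_a, pct_b, n_a, n_b):
    """Standard Error of the Difference, in percentage points, for two
    independent samples (unpooled form: an assumption of this sketch)."""
    p_a, p_b = pct_a / 100.0, pct_b / 100.0
    return 100.0 * math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)


def t_value(pct_a, pct_b, n_a, n_b):
    """Difference between the percentages divided by its Standard Error."""
    return abs(pct_b - pct_a) / se_of_difference(pct_a, pct_b, n_a, n_b)


# Two-sided 90% confidence level, large-sample approximation (about 1.645).
CRITICAL_90 = NormalDist().inv_cdf(0.95)

# Hypothetical Top Box percentages (Brand A, Brand B); base size 300 each.
examples = {
    "Is a Brand I Can Trust": (47.0, 53.0),
    "Is Unique and Different": (7.0, 11.0),
}

results = {}
for attribute, (pct_a, pct_b) in examples.items():
    t = t_value(pct_a, pct_b, 300, 300)
    results[attribute] = (t, t > CRITICAL_90)
    print(f"{attribute}: t = {t:.2f}, significant at 90%: {t > CRITICAL_90}")
```

With these assumed inputs, the 6-point gap near 50% falls short of the critical value while the 4-point gap near 10% exceeds it, the same pattern as in the example.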

Hopefully this takes some of the mystery out of stat testing and helps in understanding why what can appear to be anomalous results may actually be correct.


Do you ever look at your data and say, “huh?” The Unusual Statistical Phenomena of Simpson’s Paradox

November 2nd, 2021

Sometimes when looking at the results from survey data, we see something that makes us say “huh?” or “that doesn’t look right”.  When the odd results persist after verifying the data were processed correctly (always a good practice), there is typically still a logical answer that can be uncovered after doing some digging.  Sometimes the answer lies with something that we will call “unusual statistical phenomena.”  This is part 1 of a series that will look at some of these interesting – or confounding – effects that do pop up now and then in real survey research data.

This time we will look at Simpson’s Paradox.  And we aren’t referring to the fact that Bart Simpson never seems to age while the rest of us do.  It is actually a phenomenon first described by the statistician Edward H. Simpson in 1951.

It’s easiest to understand this phenomenon through an example.  So, let’s say that we have two ads that have been on air, ad A and ad B.  In our tracking survey among adults 18 to 65, we will ask respondents if they recognize having seen each ad on air.  Earlier in the survey we ask Purchase Intent for the product which is featured in each of the two ads.  From these results, we will compare Top Box Purchase Intent among respondents who recognized each of the two ads.  The results in the table below show somewhat higher Top Box Purchase Intent for Ad A:

However, the client is also interested in seeing the results among each of two age groups: age 18 to 39 and age 40 to 65.  When we table those results, we find something that just doesn’t make sense.  Purchase Intent is slightly higher for Ad B among both age groups – a reversal from the overall results.  How can that be!

After verifying with data processing that the data are correct, we have our team dig into the data to figure out what is going on.  Finally, an explanation is found.

Ad B was aired heavily in programming targeted to a younger audience, while Ad A was primarily aired in general interest programming – which skews to a slightly older audience.  Hence Ad B had much higher recognition among the younger age group – and as a result, young people made up a much higher proportion of the respondents among whom Purchase Intent was calculated for Ad B.

The table of base sizes shown below reveals this imbalance. When combined with the younger age group’s more skeptical nature (and lower results) when it comes to Purchase Intent – especially in our category – the apparent anomaly is explained.

This is an example of Simpson’s Paradox, a phenomenon in which individual subgroups all show the same trend in results, but the trend reverses when the subgroups are combined.  This occurs when there is a confounding variable that causes an imbalance in base sizes such as we saw above.  In our example, the confounding variable was age group: recognition of each ad differed sharply between the two age groups, and so did Purchase Intent.
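A small numerical sketch can make the reversal concrete. The counts below are entirely made up for illustration and are not the figures from the survey described here; they are simply arranged so that Ad B's recognizers skew young and the younger group has lower Purchase Intent.

```python
# Hypothetical counts per ad and age group:
# (respondents who recognized the ad, of whom gave Top Box Purchase Intent)
counts = {
    "Ad A": {"18-39": (100, 20), "40-65": (300, 120)},
    "Ad B": {"18-39": (300, 66), "40-65": (100, 42)},
}


def subgroup_pct(ad, age_group):
    """Top Box Purchase Intent (%) among recognizers in one age group."""
    base, top_box = counts[ad][age_group]
    return 100.0 * top_box / base


def overall_pct(ad):
    """Top Box Purchase Intent (%) among all recognizers of the ad."""
    base = sum(b for b, _ in counts[ad].values())
    top_box = sum(t for _, t in counts[ad].values())
    return 100.0 * top_box / base


for ad in counts:
    by_group = ", ".join(
        f"{group}: {subgroup_pct(ad, group):.0f}%" for group in counts[ad]
    )
    print(f"{ad} -> {by_group}, total: {overall_pct(ad):.0f}%")
```

In this made-up data Ad B is ahead within each age group, yet Ad A is ahead in total, because three quarters of Ad B's recognizers come from the lower-scoring younger group.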

Simpson’s Paradox shows us the importance of knowing and understanding our data and watching out for the kind of confounding factors that could end up misleading us if we don’t account for them.
