The Place for Finding SEO Correlation Issues
Updated 8/21/2013!
As the number of articles that try to publish findings grows in conjunction with the increasing complexity of Google’s algorithm, I’ve found the need to put together an ever-expanding article on how you can determine whether any of the correlation articles you read violate basic organic data analysis. You can get a quick peek at my past views from my presentation at SMX East on this topic.
The list you find below is my reference place (and hopefully yours) for determining good versus bad SEO correlation articles. Please feel free to add more in the comments as I expect to continually find more issues around future correlation studies.
Correlation Violations
Currently there is no particular order for the violations, but this may change in the future if the list gets comprehensive enough. Otherwise, run through this checklist: the more violations you find, the less you should trust the article.

The article wrote an SEO correlation as the conclusion and not as the beginning of the article and analysis.
This is a fundamental flaw in just about all correlation articles: the analysis ends with the correlation rather than beginning from it. The point of running a basic correlation is to determine where to look next, not to make sweeping judgments about what the correlation means. Since these basic correlations look at only one variable, every other variable can play a part in influencing it, so you have to be thoroughly careful not to assume what’s impacting rankings (see: the debate between Rand Fishkin and Matt Cutts over Facebook Shares).
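To make the hidden-variable problem concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the variable names (quality, shares, ranking_score), the simulated numbers, and the choice of a partial correlation as the "where to look next" step. Neither shares nor ranking_score influences the other; both are driven by the same unobserved factor.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(42)  # fixed seed so the toy data is repeatable
n = 5000

# Hypothetical hidden driver; the other two series depend ONLY on it plus noise.
quality = [rng.gauss(0, 1) for _ in range(n)]
shares = [q + rng.gauss(0, 1) for q in quality]
ranking_score = [q + rng.gauss(0, 1) for q in quality]

r_raw = pearson_r(shares, ranking_score)
r_controlled = partial_r(
    r_raw,
    pearson_r(shares, quality),
    pearson_r(ranking_score, quality),
)

print(f"shares vs. ranking, raw:                {r_raw:.2f}")
print(f"shares vs. ranking, quality controlled: {r_controlled:.2f}")
```

The raw correlation looks respectable even though the two series never touch each other, and it collapses toward zero once the hidden factor is accounted for. In a real study you rarely know what the hidden factor is, which is exactly why the correlation should be the start of the analysis.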

The article didn’t caution with “correlation does not imply causation.”
Unless you are writing for a statistically savvy audience, the article must highlight the issue and give some examples of why correlation does not imply causation. Use these fun examples to learn why.

The article cautioned with “correlation does not imply causation,” then proceeded to imply causation.
This one particularly bothers me as the writer obviously knew the phrase needed to be included, but continued on with the implication of causation anyway.
This can often take the form of certainty phrases such as does, means, are, and have, all with no qualifiers. Give the article bonus points if the word cause is slipped in there too.

The article uses a misleading title implying a causation or a highly significant correlation.
Once again, like the rest of the article itself, the title must reflect the point of what you’re writing about. Linkbait titles deserve their place in robot hell. Too often, SEOs just read the title, see a famous name, and assume the information provided is accurate and correct. The writer must keep this in mind and adjust the title accordingly.

The article didn’t include a single scatter plot.
Just take a look at the highly correlated graphs to the right to understand why scatter plots are a must (and no, line/column graphs do not count).
Correlations measure a linear relationship, which in and of itself isn’t a bad thing to assume if you’re running an econometrics regression (where you can add in nonlinear functions), but in a basic correlation, scatter plots are a must to show that something isn’t offsetting the values.
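A short sketch of why the coefficient alone can mislead, using two toy data sets I made up for illustration (they are not taken from any SEO study). The first is a perfect curve that a linear correlation cannot see; the second is a shapeless cluster that a single outlier turns into a near-perfect correlation. A scatter plot exposes both at a glance.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Case 1: y is completely determined by x, but the relationship is a curve.
xs_curve = list(range(-5, 6))
ys_curve = [x * x for x in xs_curve]
r_curve = pearson_r(xs_curve, ys_curve)

# Case 2: four points with no linear relationship, then one extreme point added.
xs_cluster = [1, 2, 1, 2]
ys_cluster = [2, 1, 1, 2]
r_cluster = pearson_r(xs_cluster, ys_cluster)
r_with_outlier = pearson_r(xs_cluster + [10], ys_cluster + [10])

print(f"perfect curve:        r = {r_curve:.2f}")         # linear r misses it
print(f"cluster alone:        r = {r_cluster:.2f}")       # no linear relationship
print(f"cluster plus outlier: r = {r_with_outlier:.2f}")  # one point drives it
```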

The article includes a scatter plot but makes it impossible to tell the linear relationship.
Okay, this is more just bad user experience, but it’s an important point to bring up: if I cannot visually comprehend your correlation in ten seconds or less (and that’s being generous), redo the graph or relook at your analysis. A linear correlation should have as many dots above the line as below it in a consistent, well, linear pattern. The graph on the left does not.

The article cites a 0.3 correlation as a “high” correlation.
A high correlation isn’t determined by how solid the correlation may be. A high correlation means a high number (let’s say a ballpark range of 0.75 or greater). When a correlation is as small as 0.3, you have a weak correlation. It may be a solid (even statistically significant) correlation of 0.3 via Pearson’s Coefficient, but it is still a weak correlation and should never be called a high correlation even when there are over 200 factors that play a part.
Bonus points if the article calls one correlation of 0.31 “low” and later in the same article another correlation of 0.20 “high.”
8/21/2013 Edit: One clarification I should add from Dr. Pete Meyers’s comment below is that my point around calling a 0.3 correlation “high” is not about whether that makes it interesting or not. Just because the correlation is weak doesn’t make it uninteresting (nor does a high correlation mean it’s interesting). A weak but solid correlation becomes an interesting area to look at for further analysis.
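The difference between a solid correlation and a high one can be sketched with two standard formulas: the share of variance explained (r squared) and the t-statistic for a Pearson correlation. The sample size of 1,000 below is an assumption for illustration, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math


def r_squared(r):
    """Share of the variance in one variable linearly explained by the other."""
    return r * r


def t_statistic(r, n):
    """t-statistic for testing whether a Pearson r from n pairs differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def approx_two_sided_p(t):
    """Two-sided p-value using the normal approximation (assumes large n)."""
    return math.erfc(abs(t) / math.sqrt(2))


r, n = 0.3, 1000  # illustrative values, not from a real study
t = t_statistic(r, n)

print(f"variance explained: {r_squared(r):.0%}")
print(f"t-statistic:        {t:.1f}")
print(f"significant at 99%: {approx_two_sided_p(t) < 0.01}")
```

With these inputs the result clears the 99% level comfortably while explaining under a tenth of the variance: solid, and still weak.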

The article runs a Pearson’s Coefficient instead of an econometrics linear regression.
My personal bias (I studied econometrics). If the article has scientists running a Pearson’s Coefficient, a more efficient use of their time would be running econometrics regressions. That way you can, you know, determine the causation of your analysis. If you’re unsure what that is, see the main image of this post.

The article uses the phrase “statistically significant” without noting the statistical significance.
Yes, I know the article isn’t intended for an audience that actually critically analyzes the data, but if you’re going to claim it’s statistically significant, I need to know what level of confidence you’re considering it at (e.g., 99%, 95%, or 90%)!

The article uses the phrase “statistically significant” or just “significant” in a basic correlation study.
To the layman, “(statistically) significant” means causation, so the article should avoid the phrase to avoid any appearance of impropriety.

The article uses a low sample of data to imply a correlation.
Your 200 data points? Sorry, it’s not enough to warrant a conclusion or statistical significance. It’s enough to warrant further research, but just like political surveys, you need around 1,000 data points and a stated error range in order to fully judge how solid your results actually are. There’s a reason why a 47%-50% preference with a 3% error range is called a “tied race.”
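The survey analogy can be checked with the textbook margin-of-error formula for a sampled proportion. This is a sketch under stated assumptions: a simple random sample, a 95% confidence level (z = 1.96), and the worst case p = 0.5. The helper is_statistical_tie is a hypothetical name that just compares the gap to the margin, the same rule of thumb used above.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error for a proportion p estimated from n samples (95% by default)."""
    return z * math.sqrt(p * (1 - p) / n)


def is_statistical_tie(share_a, share_b, n):
    """Rule of thumb: a gap no larger than the margin of error is a tie."""
    return abs(share_a - share_b) <= margin_of_error(n)


for n in (200, 1000):
    print(f"n = {n:>5}: margin of error = {margin_of_error(n):.1%}")

print("47% vs. 50% at n = 1,000 is a tie:", is_statistical_tie(0.47, 0.50, 1000))
```

At 200 points the margin is roughly double what it is at 1,000, which is why the smaller sample is enough to justify further research and not much more.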

The article claims that with their larger data set, the correlation is a better correlation than a smaller data set.
Your 10,000 data points = 1,000 data points. Sorry, but as long as you are running a basic correlation, your data levels make next to no difference. Until an article is using econometrics looking at multiple variables in one correlation, the amount of data after about 1,100 data points becomes moot for how significant your data truly is.
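One way to see what extra data does and does not buy is a confidence interval for the correlation itself. The sketch below uses the standard Fisher z-transformation at a 95% level; the value r = 0.3 and the two sample sizes are assumptions for illustration.

```python
import math


def r_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r via the Fisher z-transformation."""
    z = math.atanh(r)             # transform r to an approximately normal scale
    se = 1 / math.sqrt(n - 3)     # standard error on that scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


for n in (1000, 10000):
    low, high = r_confidence_interval(0.3, n)
    print(f"n = {n:>6}: r = 0.30, 95% interval {low:.2f} to {high:.2f}")
```

The interval tightens with the larger sample, but both intervals sit far below the 0.75 ballpark for a high correlation. More rows make the estimate more precise; they do not make a weak correlation a better one.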

The article uses graphs that don’t start at zero.
In and of itself this isn’t necessarily a bad thing, but one should immediately consider why the graph isn’t starting at zero (call it a yellow flag if you will). Is it purely a UX thing, or is there a more cynical attempt to overemphasize a correlation to make a difference between 0.23 and 0.25 seem a lot larger? 8/21/2013 hat tip: Enrico Altavilla

The article leaves out basic things to correlate.
This issue is far harder to detect and requires a bit more thought to analyze. Essentially, a reader will have to determine whether the author is accidentally or purposefully leaving out potential signals that might make the article less valid or the results less enticing. Consider this another yellow flag. 8/21/2013 hat tip: Enrico Altavilla

The article leaves out the methodology or raw data.
A truly helpful SEO study will also provide the raw data (sensitive data removed where possible). The raw data can’t be provided every time, but an SEO correlation article that asks its readers to check its work is going to be far more trustworthy in the end.
Methodology (or how the process went) is important as well, both to help others replicate your work and to help them understand the thought process and find anything missing. More or less, it’s basic courtesy to explain the who, what, where, when, why, and how of the correlation article. The more you tell us how you think, the better we can either relate or add value with other areas to consider.
In the end, it’s establishing trust by providing these two areas, so an article without it suffers. 8/21/2013 hat tip: Enrico Altavilla

The article only looks at the data at a high level or in one subdivision.
A good correlation study should dive deeper by segmenting and dividing the data further to test whether the correlation found holds up among different groups. Sadly, few studies even do this kind of check by breaking the data up (say, by website category or query type). I’d call this mostly wishful, but still a violation of writing a good SEO correlation article. 8/21/2013 hat tip: Enrico Altavilla
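Here is a toy sketch of why the segment check matters. The two "website categories" and their numbers are invented for illustration: within each category the relationship is perfectly negative, yet pooling them produces a strong positive correlation, simply because one category sits higher on both axes.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical segments: (factor values, ranking scores) per website category.
segments = {
    "category A": ([1, 2, 3], [3, 2, 1]),
    "category B": ([6, 7, 8], [8, 7, 6]),
}

# Correlation inside each segment.
segment_r = {name: pearson_r(xs, ys) for name, (xs, ys) in segments.items()}

# Correlation when every segment is lumped together.
all_xs = [x for xs, _ in segments.values() for x in xs]
all_ys = [y for _, ys in segments.values() for y in ys]
pooled_r = pearson_r(all_xs, all_ys)

print(f"pooled:     r = {pooled_r:+.2f}")
for name, r in segment_r.items():
    print(f"{name}: r = {r:+.2f}")
```

A study that reports only the pooled number would point readers in exactly the wrong direction for every individual category.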
The Significant Conclusion
If you’ve gotten this far, I only ask that you either contribute more things that I missed or go forth and help stop the spread of bad analysis/information on SEO correlations and challenge the authors before it spreads too wildly to stop. Feel free to point out which number the article has violated referencing the post here; it is time to push back against bad correlation data analyses.
I completely disagree with #7, and even the statistics world doesn’t have clear guidelines on this. With 200 factors, we can’t reasonably expect one factor to top a certain threshold. I think it’s entirely legitimate to look at the relative r-values, as long as we’re honest about it. For a Google ranking factor, r=0.30 might be the highest we’re going to see, even with perfect data, and that may be extremely meaningful and actionable.
The question in that case is how does it compare to other factors. If everything gives us r=0.30 then, yes, we don’t know anything. If 199 factors are r=0.01 and 1 yields r=0.30, then that would be incredibly compelling. Reality is in the middle, and the devil is in the interpretation, but to just set a threshold and say “nothing below 0.75 is interesting” is at best an opinion.
Thanks for dropping by! I should clarify the remark (I’ll edit it after this) to note that just because it is a weak correlation, that doesn’t mean it isn’t interesting. The flipside being a high correlation doesn’t mean it’s a solid correlation either and thus could possibly not be of interest. I didn’t mean to imply that just because it’s a weak correlation that it’s not interesting, only that one cannot call a 0.3 correlation “high.”
A weak but solid correlation, where the other variables looked at individually are smaller by a measurable amount, could be an interesting metric to look further at, but it could also be a false lead, as many of the other factors you looked at (this being a one-to-one correlation look) may be the hidden reason why it looks correlated. Therefore, I would be very careful about calling it “compelling.” Interesting for more study? Sure.
Taken to the logical extreme, calling one correlation of 0.001 “high” because all the other correlations are 0.00001 would not be taken seriously, given how close it is to zero. I look at correlations in terms of potential impact (thinking in an econometrics mindset of causation), and therefore a 0.01 correlation to me means that if the correlation is true, its impact is weak: maybe more impactful than other true correlations, but individually still a weak impact.