Tableau 10.2 came out a couple weeks ago and a small but notable feature included in this version of our product is the Corr() and Window_Corr() functions. These calculations compute a coefficient of correlation based on two variables.
I’ve done linear algebra in Tableau before, and it’s often quite complicated and looks like this:
The idea with making Corr a simple function in Tableau is to give users a faster and simpler way to find statistical results.
But I was confused. correlation coefficients operate on large groups of numbers, so why would it make sense to aggregate a pair of measures with Corr()? How would the aggregation change based on the layout of a Tableau worksheet? And how could I check the results to make sure I was seeing the correlation coefficient I wanted to see?
When doing complex calculations, you often get a seemingly arbitrary number as a result, and some faith is required to trust that the number is answering the question you intended to ask, and not a different question.
Below are some findings from my Corr studies:
A quick preamble about my math background. I was always pretty good at math growing up, but I have little formal education in Stats. Here is my math resume:
- I took Calculus I and II in high School and I got a 4 on my AP Calculus exam.
- I took a Stats class because it was a part of my graduation requirement for business school.
- I know how to search for things on Google.
- When I get mathematical formulae wrong, Wilson insults me until I get them right.
In short, I know enough to hack my way through most math problems, but I often need some help to understand the specifics.
*Also note – there are a few different ways to compute correlation coefficients. As far as I can tell, Tableau uses the “Pearson correlation coefficient” which I read about on Wikipedia, which always has the best information because anyone can contribute to it (h/t Michael Scott). Anyway, I’m just going to refer to the “Pearson correlation coefficient as “Corr” or “R” for the rest of this article.
Corr() is an aggregation in Tableau, which means it operates on disaggregated (or “row-level”) values. Using Tableau’s out-of-the-box Superstore dataset, if you wanted to find the correlation between Profit and Sales, you could write the calculation “Corr(Profit,Sales)” to do it.
The confusing thing for me is, what does this calculation do? When I drag it onto a viz, I get .4791. When I square it (which is what a lot of linear trend models do – this is the number represented by “R2” when you describe a linear trend model in Tableau or other applications with similar capabilities) I get .2295. I made note of these numbers.
I knew Corr() operated on disaggregated values, and that it was a coefficient related to linear trends, so I decided to try plotting a disaggregated trend line against Profit and Sales in superstore. I did this by adding Sales to columns, Profit to rows, and unchecking the “Aggregate Values” option from the Analysis menu. Then I added a trend line by dragging from the analytics pane and choosing “linear.” Here is my result:
Cool! note that the R-Squared value is the same as my .2295 value I got from doing SQUARE(CORR(Profit,Sales)) in Tableau. So Corr operates at the datasource level, looks at all of the points, and computes a correlation coefficient.
My question was, what happens if I change the dimensionality of my viz? Tableau aggregations “partition” by every dimension in the view.
I broke out my Correlation calculations by Category:
Ok, now I have an R (Corr(Profit, Sales)) and R-Squared (Square(Corr(Profit, Sales))) for each Product Category. I imagine the R-Squared would align to the values I’d get if I created trend-line per color using Category on the above scatter plot.
So here’s what I learned: the Corr calculation gives us the coefficient of correlation based on every row in the data set related to a particular partition. This is great. With the above table, I can tell that Office Supplies has the highest coefficient of correlation among the three categories. Remember, R is a number between -1 and 1, so a positive value indicates a positive correlation (as one number goes up, so does the other). A 1 would be a perfectly correlated data set – every point on the scatter plot would be on the line. Also note that even though the highest R value has the steepest line in this viz, that won’t always be the case.
Yeah, I know this is basic math for a lot of folks. But I like how visualizing the disaggregated data helps illustrate the meaning behind the numerical results you get with the aggregation of Corr.
The problem is, what if I want to analyze performance of different products, or customers? I can’t create a coefficient of correlation using Corr for customer sales and profit totals, because the aggregation will partition, or recompute, for every individual customer. This is where Window_Corr comes in:
Window_Corr, like it’s cousins Window_Sum, Window_Avg, and others, is a table calculation that computes only on aggregated data. That means to write a calculation using Window Corr, each variable has to be aggregated.
The advantage of this? I can look at aggregate values. Some customers in my store have purchased more than one item, I want to correlate the sum totals of profit and sales across the entire population of customers.
Now, if I create an aggregate viz, and compute my table calc using Customer Name, I’ll get the same coefficient of correlation for every customer. Again, squaring it will give me a R-squared value matching what I’d find with a linear trend analysis. I can also use other aggregations, like average, median, min/max, standard deviation, and variance.
Why would this be useful? Well, maybe I’d like to see the coefficient of correlation in a tooltip or label that dynamically changes when I filter to different categories or periods of time. This calculation will do that. I think Window_Corr will end up being more useful in visualizations than Corr, since it allows you to see a high level summary as a label which is part of a more descriptive viz. Corr will be more useful for simple crosstabs, but they won’t necessarily mean a lot to non-data-literate consumers.
We have to go deeper
There are some interesting niche use cases where the above cases don’t work, though. For one, Table Calculations can get slow with huge data sets. What if I want to create a scatter plot that analyzes hundreds of thousands of points? It will probably run very slow, even though it’s only returning one figure.
Or, what if I want to “drill down” to a particular area of the viz based on a certain Region or Market Segment, but I still want to see the correlation coefficient for the entire data set? Because of how Table Calculations, work, I’d have to jump through some pretty funky hoops to get there.
Corr() and LOD
So, as a fun exercise, I decided to look at Corr with Level of Detail expressions. Because it’s an aggregate calculation, and because FIXED LOD statements result in row-level results, I can nest an LOD in my Corr calc.
Here, I’m computing the correlation between average profit per product and average sales per product. This allows me to do the row level summary analysis that I did with Corr, but with aggregate totals. As a way of spot-checking it, I compared the results of looking at this calculation by Category…
…with the R-Squared values on a linear trend analysis of Avg(Profit) vs Avg(Sales) on a scatter plot…
But when I drag the calc to label on this viz, it comes up null. I think that asking the LOD expression to compute Corr for all of the values for Product Name, while partitioning the viz by product name, doesn’t allow Tableau to look at all of the values it needs to get an answer. Because if I use something else on my viz, like Customer Name or State, I get text. That means I can look at a viz that trends Profit and Sales by state, but see labels that show me the correlation between product Profit and Sales in each of those states. Maybe I have a state with a high Profit to Sales correlation, but the products in that state have a negative correlation. That’s a bit of a mind-bender, but it’s cool that I have that type of flexibility.
Now, I didn’t quite figure out the answer to the problem I pose above – what if I want to filter but retain a total Corr? Even when filtering just the crosstab with aggregate core totals based on FIXED LODs, the results changed, which is different from how FIXED usually works with Sums and Averages. Maybe this is because Corr is based on a total population and needs to use the size of the partition to compute itself. But I suppose I could just create a viz with aggregate totals uninfluenced by my filter and use that as a summary view on the same dashboard as my scatter plot, because this is Tableau and we can do stuff like that.
In summary, here’s what I learned about Corr:
- Corr() will compute the coefficient of correlation at a row-level – i.e. for each individual data point in your data set. It can be partitioned by dimension.
- Window_Corr() will compute the coefficient of correlation for aggregate totals. You have to use the dimension by which you’d like to compute corr as a part of your viz.
- You can also compute an aggregate coefficient of correlation without including the dimension in your viz by creating a nested LOD with Corr().
- It doesn’t seem like you can make Corr() or Window_Corr() ignore filters, like you can with other FIXED calculations or through other methods, but you might be able to figure it out with some more finagling.
Generally, it seems like Corr() is best for text totals and summary statistics, while Window_Corr() is the best option for creating dynamic visualizations with statistical details.
It would be cool to see other types of regression-based stats called this way, like the slope of a line, or maybe a quick way to test the null hypothesis of two correlated fields and get a P-value. Tableau has a lot of great statistical functionality in the pipeline, and I would imagine it’s only a matter of time before we see things like that.
For now, may all of your correlations be positive.