Monday, March 1, 2010

Forecast Modeling

Author's Note, Dec 2010:This blog post is now obsolete by virtue of the fact that Tableau 6.0 has "parameters" - the entire subject of data explosion or data fanning is no longer a requirement.

original blog post

Quick start cheat notes:

  • sample data is here 
  • make all joins "not equal to" 
  • set all filters to single-select (and make them global just for convenience)

Read the rest of the article below this dashboard, for the full details... 
Use the sliders to interact



A discussion about "predictive modeling", or "forecast analysis", or, "what if analysis"

This entry discusses steps required to use Tableau to create what if or hypothetical analysis models. While not currently a core feature of the product suite (with the typical resulting marketing and sales splash as such), it turns out that with a little thought, and by following a few simple rules, you can create forecast models in Tableau. For the purposes of this discussion, I have used Xcelsius as a comparison, a tool that is commonly thought of as excellent for this type of analysis.

Products that provide extensive forecasting and predictive modeling algorithms are great tools for creating "what if" or hypothetical analysis - there's a great deal of heavy lifting going on to make this magic work. Some of the technical challenges involved include (not limited to):

All of the heavy lifting or number crunching needs to occur somewhere or at some point in the overall process

The user experience and software design thereof needs to provide compelling and useful tools for managing the input and output of the modeling.

There are some good software packages out there that provide forecast modeling and predictive analysis. A few worthy mentions include SAS, SPSS, the open source "R" package, and SAP’s Xcelsius (formally owned by Business Objects, who in turn purchased it from Infommersion, Inc, see nice history link here). From anecdotal evidence it appears that Xcelsius attempts to operate on one very small amount of data at a time, and then "explodes" this data out into a much larger data set - most likely in memory. General approaches to this type of modeling scenario include pre-processing possible values in a database engine, or, performing other in-memory processing.

Tableau can achieve this "exploded" data concept by creating table joins on all possible values - essentially a Cartesian join showing all possible combinations of data. There are a few simple tricks to make this work:

  1. Create all joins as "table1.fieldX <> table2.fieldY" (i.e. use the "not equal" type). Make sure no values will match, or else this does not work! See the excel file which shows the four sheets to join; as well, you can download the workbook shown above to see the joins.
  2. Any time you create calculated fields, make sure that they are non-aggregation calculations. See the workbook for the calculation called "potential value". Basically, you want row-by-row calculations at all times! 
  3. And most importantly! Make sure you have single-select filters for all forecast or what-if variables which are controlling the "fact" data.

What does that last item 3 really mean? Let’s take a closer look at the sample excel data:

In the example excel file, we had three categories (K-12 School District, S&P 2000, and Federal Agencies) and 6 possible values (50k through 300k) for a total of 18 possible values.

By joining on a percentage between 1 and 100, the data gets exploded out to 1800. And by adding in a second variable called "blending", which ranges from -.21 percent to .21 percent (43 possible choices including zero), this further explodes the 1800 records out to 1800 times 43 = 77,400 records.

However, because our three Tableau filters for "blending", "percentage" and "variable" are all marked as single-value selections, we will only see 3 records at a time (the original three categories).

Exploring the workbook shown above by downloading it will help convey this point.

You can now start to guess why Xcelsius and possibly other products start to perform slowly when dealing with larger data. One of the reasons is the data explode factor involved. If instead of the above numbers, I wanted to apply a two-variable what if or forecast model on 500 types of values (500 customers showing sales, 500 products showing inventory, etc), and one variable ranged from 1 to 100, the second variable ranges from -30 to +30, then my data explode result would be 500 * 100 * 61 = 3,050,000 records! No matter how you slice and dice this problem - in memory vs. pre-calculated, oracle vs. sql server or some other database, java versus dot net- you will still be faced with this concept of data "explosion"...

Tableau can handle 3 million records, by the way...


Richard said...

Nice example. I use this trick too. A couple of comments.

Shouldn't they be "cross joins" (i.e. no predicate) rather than "not equal to"? With the "not equal to" condition I think you're relying on having no overlap in the domains of your various variables.

One other observation. I've used this technique from Tableau with 3 or 4 different back ends. Some optimisers handle the Tableau generated SQL nicely, by applying the single-select filters before evaluating the cross-products. Some don't - which doesn't matter for small tables but doesn't work very well when you have millions of rows to start with. The ones I remember are SQL Server and (surprisingly) Jet which both rearranged the query to apply the filters first and Firebird which (also surprisingly) didn't.

NorthwestCoder said...

these are both valid comments! I mentioned above to "Make sure no values will match, or else this does not work!" and your idea of a cross-join solves this (100% lazy) approach on my part.