Incremental Refresh of Intelligent Cubes

Arguably, the headliner feature of MicroStrategy 9 was the introduction of Intelligent Cubes.  This powerful feature revolutionized project speed and design by allowing lightning-fast data access directly from memory.  Once implemented, it gives the developer the best of both the ROLAP and MOLAP worlds without the drawbacks of either: the speed of MOLAP data requests with the flexibility of ROLAP ad-hoc analysis.

With each release since 9, Cubes have gotten faster to load and smaller to store in memory, but two main drawbacks have remained:

1) Not all features are supported in Cubes, and even fewer are supported for Dynamic Sourcing.  This is getting better with each release, but until support for Custom Groups and per-pass Dynamic Sourcing is added, system-wide adoption will remain a dream.
2) You have to fully repopulate the entire cube each time you want to update it.  This makes some scenarios infeasible, as the time to load the data wouldn't fit into potential ETL windows.

Thankfully, MicroStrategy has solved the 2nd issue with the introduction of Incremental Refresh for Intelligent Cubes in the 9.2 release.  Today, I’ll talk about this exciting new feature and how it works for you.

Intelligent Cubes vs Caches
An Intelligent Cube is basically a report whose result set is stored in memory, with a copy on the file system.  After a reboot, that file copy is reloaded into memory to avoid reprocessing.  Should you run out of memory, the cube that hasn't been accessed for the longest time is unloaded from memory, but its files remain.  The saved result set differs from a traditional cache file in that MicroStrategy can query against it.
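As a toy model of that unload behavior (entirely my simplification; none of these names are MicroStrategy APIs), least-recently-used eviction looks like this in Python:

```python
from collections import OrderedDict

def load_from_disk(cube_id):
    # Placeholder for reloading a cube's file copy into memory.
    return f"rows for {cube_id}"

class CubeMemory:
    """Toy LRU model: disk holds every cube; memory holds the recent ones."""
    def __init__(self, max_loaded):
        self.loaded = OrderedDict()            # cube_id -> rowset, oldest first
        self.max_loaded = max_loaded

    def access(self, cube_id):
        if cube_id in self.loaded:
            self.loaded.move_to_end(cube_id)   # mark as most recently used
        else:
            if len(self.loaded) >= self.max_loaded:
                # Unload the least recently accessed cube; its file copy
                # stays on disk, so no reprocessing is needed later.
                self.loaded.popitem(last=False)
            self.loaded[cube_id] = load_from_disk(cube_id)
        return self.loaded[cube_id]

mem = CubeMemory(max_loaded=2)
mem.access("Sales Cube"); mem.access("Inventory Cube"); mem.access("HR Cube")
# "Sales Cube" was least recently accessed, so it was unloaded from memory.
```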

If you run a report for North East Region, 2010 by Month, that single result is stored in a cache file.  If you run that exact report again, you get the result instantly.  But if you change the query to request North East Region, 2010, March only, the report has to run again, even though the cache file contains a row with your answer.

This is where Intelligent Cubes come in.  If you populate an Intelligent Cube with North East Region, 2010 by Month, then any subset query of that dataset can be answered from it.  This means that instead of pre-caching specific reports, you build a more generic Intelligent Cube to fit a wider array of requests.  In this example, you'd probably build a cube for All Regions, All Years, All Months.  This miniature database sits in memory, and any request using those objects is returned instantly, as if it were cached.
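To make that distinction concrete, here's a minimal Python sketch (the rows, keys, and function names are all mine for illustration, not MicroStrategy internals).  A report cache is an exact-match lookup on the full request, while a cube is a rowset you can filter for any subset:

```python
# Rows: (region, year, month, revenue)
rows = [
    ("North East", 2010, "Jan", 100),
    ("North East", 2010, "Feb", 120),
    ("North East", 2010, "Mar", 130),
    ("South",      2010, "Mar",  90),
]

# A report cache keys on the exact request; only an identical request hits it.
report_cache = {("North East", 2010, None): rows[:3]}  # NE, 2010, all months

def cache_lookup(region, year, month):
    return report_cache.get((region, year, month))     # miss -> SQL runs again

# A cube is a queryable rowset: any subset request is answered from memory.
def cube_query(region=None, year=None, month=None):
    return [r for r in rows
            if (region is None or r[0] == region)
            and (year  is None or r[1] == year)
            and (month is None or r[2] == month)]

print(cache_lookup("North East", 2010, "Mar"))  # None: cache miss
print(cube_query("North East", 2010, "Mar"))    # answered from the cube
```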

Another powerful difference is its use in conjunction with Security Filters.  Say we both run the same NE, 2010 report, but I've got a security filter for NE Region only and you don't.  Even though we're requesting the exact same thing, our different security filters dictate that we can't share a cache, so the report runs twice.  With Intelligent Cubes, the security filter can be resolved against the cube instead.

Full Refreshing Intelligent Cubes
The traditional method of refreshing an Intelligent Cube is to run the cube (it's a variation on a report, and runs in the same editor), and the results are populated to memory instead of displayed on the screen.  You can also right click the Intelligent Cube and choose Schedule Delivery To -> Refresh Cube, which lets you pick a Schedule and have the cube refreshed automatically.  Both methods result in a full reload of the data into memory, so if you've made your cube as generic as possible to cover as many scenarios as possible, it could take a while.

Incremental Refreshing Intelligent Cubes
With 9.2, there is now a second option for refreshing the cube.  Right click the Intelligent Cube and select Define Incremental Refresh Report.

You're then presented with an options dialog to define the method of refresh you'd like to take; a small code sketch of all four behaviors follows the list:

1) Update – Insert new rows from the report and replace any overlapping data.
    This would be the most common method of Incremental Refresh.  If you’ve got a cube with daily data for the last year, you may choose to schedule your Incremental Refresh to load the last 3 days of data each day.  This way you get the new day, and potentially a little overlap in case there were ETL issues or source system changes you didn’t previously catch.

2) Insert – Only insert non-overlapping rows from the report.
    This would be used if you were certain that you didn't need to change existing data (or perhaps purposefully intended not to) and wanted to ignore any potential overlaps.

3) Delete – Remove rows from the cube that overlap the report.
    This would be used to trim your cube and prevent growth past what you're interested in.  If you've got 10yrs of data in your warehouse, you may only want a rolling 3yrs of data in your cube.  You could do that rolling at the month level: once a month, you'd delete the oldest month, the one that just fell outside the 3yr window.  Defining a report that returns that data and running it in Delete mode removes those rows from the Cube, keeping it a manageable size.

4) Update Only – Only overwrite overlapping data from the report.
    This would probably only be used for patching data, say from an ETL error or source system fix.  If data was retroactively changed, you could update just that portion without having to reprocess the entire cube.
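Here's a hedged sketch of those four behaviors, modeling the cube and the refresh report's result set as dicts keyed on the attribute combination (the key layout and values are my assumptions for illustration):

```python
# Hypothetical model: keys are attribute combinations, values are metrics.
def update(cube, report):       # insert new rows, replace overlapping ones
    cube.update(report)

def insert(cube, report):       # only add rows whose key isn't in the cube yet
    for key, val in report.items():
        cube.setdefault(key, val)

def delete(cube, report):       # remove cube rows that overlap the report
    for key in report:
        cube.pop(key, None)

def update_only(cube, report):  # overwrite overlapping rows, add nothing new
    for key, val in report.items():
        if key in cube:
            cube[key] = val

cube   = {("NE", 2010, "Jan"): 100, ("NE", 2010, "Feb"): 120}
report = {("NE", 2010, "Feb"): 125, ("NE", 2010, "Mar"): 130}
update(cube, report)   # Feb corrected to 125, Mar appended
```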

The editor for an Incremental Refresh Report is simply the Report Filter portion of a traditional report.  It defaults to the existing filter that the target cube uses, but you can modify it any way you wish.  I'd imagine you'd want to keep the existing filters in place and then add some new ones.

For example, if my original cube was All Regions, All Years, All Months, I might define an Incremental Refresh Report with just Last Month as the filter.  I can then schedule it to run on the 1st of the month, and it will simply run a database query for one month of data and append it to my existing Cube of historical data.
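As a rough sketch of that monthly pattern (the table name and SQL are placeholders I made up; in reality MicroStrategy's SQL engine generates the query from the Last Month filter):

```python
import datetime as dt

def last_month(today=None):
    """Resolve a 'Last Month' filter: run on Apr 1st -> (year, 3)."""
    today = today or dt.date.today()
    end = today.replace(day=1) - dt.timedelta(days=1)  # last day of prior month
    return end.year, end.month

year, month = last_month()
# Conceptually, the refresh report's filter becomes a WHERE clause like this,
# so only one month's slice is fetched and merged in with Update semantics,
# as in the earlier sketch.  fact_sales is a hypothetical table.
sql = (f"SELECT region, year, month, revenue FROM fact_sales "
       f"WHERE year = {year} AND month = {month}")
```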

Once your refresh report is complete, you simply right click it and choose Schedule Delivery To -> Refresh Cube, just like before.  Only this time the smaller refresh report runs, updating your cube in a fraction of the time.

Project Impact
In my environment, I'm currently processing about 2hrs' worth of Intelligent Cubes every day.  With the deployment of 9.2, the daily update window for these cubes is expected to fall below 5mins.  That gives us incredible flexibility in loading the cubes, for two reasons.

1) Currently we load data into the warehouse in near realtime, but 1am – 5am is the daily maintenance window on the server.  While ETL continues to run during this window, traffic on the server is high and everything slows down.  Even though the data for the previous day's consumption (and all accompanying aggregates and table statistic refreshes) is available for my cubes by 3am, it wouldn't be feasible to run all of those huge database-dumping queries against that traffic, so we wait.  This also gives us a buffer in case of an ETL issue, leaving extra time for long running jobs or re-running of jobs.  We can't use a technique such as having the ETL trigger a Command Manager script to load the cubes, since the trigger would fire during the 3am hour.  There are too many other parallel processes running to find a good combination of signals for triggering the cubes, so we're left with a blind 6:30am starting schedule.

However, with Incremental Refresh, the queries are small enough that they can brave that heavy morning traffic.  This allows us to make the cube data available 3hrs earlier than before.

2) Now that we can run the cube refreshes at any time we'd like, we can now have the related ETL processes that finish at around 3am fire a Command Manager script that triggers the cube refresh event.  This not only makes the data available as soon as possible, but in the event of an ETL issue, we don't have to worry about impending schedules.  This gives the ETL team a great deal more time in the morning, as well as flexibility in their own schedule handling (not to mention the pleasant side effect of not having to wake me up to stop the schedules ;)).
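A sketch of that hand-off might look like the following.  The TRIGGER EVENT statement is standard Command Manager syntax, but the event name, project source, credentials, and cmdmgr flags here are assumptions; verify them against your install's documentation:

```python
import subprocess
import tempfile

# Final step of the ETL job: fire the MicroStrategy event that the
# event-triggered cube refresh schedule is tied to.
script = 'TRIGGER EVENT "Nightly Cube Refresh";\n'   # hypothetical event name

with tempfile.NamedTemporaryFile("w", suffix=".scp", delete=False) as f:
    f.write(script)
    script_path = f.name

# Common cmdmgr CLI pattern: -n project source, -u user, -p password,
# -f input script.  Flags can vary by version, so double-check yours.
subprocess.run(["cmdmgr", "-n", "MyProjectSource", "-u", "Administrator",
                "-p", "secret", "-f", script_path], check=True)
```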
