Improving Query Flexibility with Partition Mapping
Partition Mapping is a traditional database concept that MicroStrategy has adopted in the SQL Engine to give you even more options for performance tuning. Traditional Database Partitioning involves taking a large table and breaking it into smaller pieces at logical break points, such as one table per year or per region. The database is then intelligent enough to know where the break points are and direct queries to the specific tables containing the data, leaving a smaller set of data to process.
MicroStrategy uses a similar approach, carving Fact tables into smaller pieces, and offers two different implementations. While this feature is intended to squeeze a little more efficiency out of databases, most advanced platforms, such as Netezza, Greenplum or Vertica, won’t see much (or any) performance increase (though I’m no expert on those platforms; that’s just my observation). And while the feature may seem antiquated outside of projects trying to squeeze the last bit of life out of a stressed MS SQL Server, you can also use it in a few clever ways to provide query flexibility and even achieve some impressive data availability.
Warehouse Partition Mapping Table
The best method for pure performance is to use the Table approach. Effectively, you create a “table of tables” by physically breaking up your Fact table into individual pieces, and then tying the table name to an Attribute Element in a Control Table so MicroStrategy knows how to load them.
Say you have 10 years’ worth of data in a table, and it’s getting a little large to scan through. Most of your queries are for the last few years anyway, but you don’t want to lose the ability to query history. You could break that single table up into 10 smaller pieces and create a control table such as:
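The shape of such a control table is simple: one row per Attribute Element, pointing at the physical table that holds that element’s data. A hypothetical sketch (the table and column names here are my own, not MicroStrategy requirements):

```sql
-- Hypothetical control table: one row per Year, one physical table per row.
-- FACT_PMAP, YEAR_ID and PMAP_TABLE_NAME are illustrative names.
CREATE TABLE FACT_PMAP (
    YEAR_ID         INTEGER,       -- the partitioning Attribute's ID column
    PMAP_TABLE_NAME VARCHAR(32)    -- physical Fact table holding that Year
);

INSERT INTO FACT_PMAP VALUES (2011, 'FactData2011');
INSERT INTO FACT_PMAP VALUES (2010, 'FactData2010');
-- ... one row (and one physical table) per Year, back through the history
```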
MicroStrategy consumes this as a special kind of table, called a Partition Mapping Table, and treats the collection as a single logical table. When you run a report, MicroStrategy first determines which table(s) contain the Year(s) in your report, and only queries the appropriate tables. Aside from the obvious benefit of smaller tables to query, consider also the benefit to ETL of having a smaller “Current Year” table to work with, as opposed to sifting through nine years of data that isn’t going to change on a day-to-day basis.
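To make the pruning concrete: a report filtered on a single Year scans one small table instead of the whole history. A rough sketch of the difference (column names like year_id and revenue are illustrative, not from any actual schema):

```sql
-- Without partitioning: every query scans all ten years of data.
SELECT year_id, SUM(revenue)
FROM   FactData              -- the full 10-year table
WHERE  year_id = 2011
GROUP BY year_id;

-- With a Partition Mapping Table: the SQL Engine substitutes the one
-- physical table mapped to 2011 in the control table.
SELECT year_id, SUM(revenue)
FROM   FactData2011          -- roughly one-tenth of the rows to scan
WHERE  year_id = 2011
GROUP BY year_id;
```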
The obvious drawback of this approach is that you have to maintain that control table. That may not seem like a big deal, especially if you’re slicing by an Attribute like Region, which probably doesn’t change very often, but there is a requirement that isn’t immediately obvious: you must provide *separate* tables for every element. That means if you wanted something like FactData2011, FactData2010 and FactDataArchive (containing everything prior to 2010), you couldn’t use this feature. Even if you define a row for each element in your Control Table, the rows can’t point to the same physical table in the database. Perhaps you could be clever and circumvent that requirement with Views, but even then, if you’re slicing on an Attribute with lots of values, such as Month or even Day, you’re still in for an unmanageable number of objects. For something that granular, you’d want to turn to the next partitioning method…
Metadata Partition Mapping
I’m not a fan of the name of this method, because whenever I say it, I feel like the initial interpretation is that it has something to do with the Metadata. While I guess “everything” has something to do with the metadata, this is really a Query Based Partition Mapping, compared to the Table Based Partition Mapping mentioned previously.
This approach is a lot more MicroStrategy-esque (to coin a word). Effectively, you build a filter that tells the SQL Engine which table to choose. The earlier example that wasn’t feasible with a Partition Mapping Table (three tables: FactData2011, FactData2010 and FactDataArchive) works fine with this method. You simply define the filter Year=2011 and choose FactData2011 as the target table, repeat the process for Year=2010 to FactData2010, and then define Year<2010 to FactDataArchive.
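Logically, the three rules amount to the mapping below (the actual definitions live in MicroStrategy filter objects; the fact column is illustrative):

```sql
-- Filter            Target table
-- Year = 2011  -->  FactData2011
-- Year = 2010  -->  FactData2010
-- Year < 2010  -->  FactDataArchive   (all history prior to 2010)

-- A report filtered on Year = 2008 would therefore resolve to:
SELECT year_id, SUM(revenue)    -- revenue is an illustrative fact column
FROM   FactDataArchive
WHERE  year_id = 2008
GROUP BY year_id;
```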
The downside of this approach is that before the report can run, a pre-query must first determine which table to choose. This can be pretty inefficient if it has to query the fact tables themselves, though with some careful schema tweaking you can get it to come from dimension tables instead. I don’t have those details readily available right now, but when I’ve implemented this in the past, it was a challenge. Even once you craft efficient pre-SQL, debugging reports becomes considerably harder all around. Since the SQL Engine must first run a pre-query to determine which tables to use, you end up with Schrödinger’s SQL, where every possible outcome is listed in the SQL View. This happens not only when viewing the SQL pre-execution, but also in the Job Monitor, which makes the debugging process very difficult and annoying. Thankfully, once the report is complete, you can view the SQL and see the path that was actually chosen.
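Roughly, the two passes play out like this (a sketch of the mechanics, not MicroStrategy’s literal generated SQL; LU_YEAR is a hypothetical dimension lookup table, which is the cheap place you’d want the pre-query to resolve):

```sql
-- Pass 1: pre-query to find which Year elements the report touches.
-- Resolving this against a small dimension table, rather than the fact
-- tables themselves, is what the "careful schema tweaking" buys you.
SELECT DISTINCT year_id
FROM   LU_YEAR
WHERE  year_id >= 2010;

-- Pass 2: the engine matches those elements against the partition
-- filters and queries only the qualifying tables, e.g.:
SELECT year_id, SUM(revenue) FROM FactData2011 GROUP BY year_id
UNION ALL
SELECT year_id, SUM(revenue) FROM FactData2010 GROUP BY year_id;
```

Until pass 1 has run, the engine can’t know which branch of pass 2 applies, which is why the SQL View shows every possible outcome beforehand.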
Partitioning for Real Time Data
Now on to my favorite way to leverage Metadata Partition Mapping. In my current environment, we load data in near real time out of necessity (saving it all up for a single nightly load would take too long to process), so the Warehouse is within an hour of real time at any point during the day. Since users mostly use the BI project for historical analysis and trending, we don’t query the Today data very often, so we don’t include it in Aggregates. Of course, when users do want real time reporting, usually to debug some system issue, traditional Aggregate Table approaches won’t work. An alternative to last week’s Attribute Anchoring approach for this scenario is to use Metadata Partition Mapping to force any query that includes “Today” to go directly to the FactDetail table, and any query for “< Today” to go to the Aggregates. This spares the ETL from having to reprocess Aggregate tables throughout the day (which, for Day-level Aggregates, would mean lots of deletes/updates) without sacrificing access to the data.
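Sketched the same way as before (“Today” is really a filter on the Date attribute against the current date; FactDayAggregate, day_id and txn_count are illustrative names):

```sql
-- Filter                   Target table
-- Day =  CURRENT_DATE  -->  FactDetail        (loaded near real time)
-- Day <  CURRENT_DATE  -->  FactDayAggregate  (rebuilt by nightly ETL only)

-- An intraday debugging report routes straight to detail, so the
-- Aggregates never need intraday reprocessing:
SELECT day_id, SUM(txn_count)   -- txn_count is an illustrative fact column
FROM   FactDetail
WHERE  day_id = CURRENT_DATE
GROUP BY day_id;
```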
9.2 MultiSource Enhancement
A fantastic enhancement in 9.2 is the ability to do Metadata Partitioning across MultiSource connections. This means you can use separate databases to hold your real time data vs your historical data. In the original example, you could move Archived data off your expensive Warehouse Appliance and into a cheaper database solution. Reports wouldn’t have to change, and users wouldn’t even notice (other than speed differences).
Another Real Time solution would be to hit Source System data directly. In most environments this may not be feasible, but in my current environment our Source System actually keeps aggregate level data in absolute real time (within a minute). While it doesn’t keep history, that is enough that we could use MultiSource Metadata Partitioning to provide absolute real time data availability to our users in MicroStrategy.