Fact and Attribute Anchoring

Carrying on with this week’s theme of Aggregate tables, today I’ll explore a concept I call Anchoring.  As your data volume grows and reporting complexity increases, an inevitable outcome is the growth of complex Aggregate and Sub-Set tables.  While as Architects we are effectively teaching MicroStrategy the “best” way to write a query by laying out rules such as key tables and parent/child relationships, sometimes we have to be a little more explicit and tell it when and where it can use specific tables.  Today I’ll highlight some methods of forcing MicroStrategy to use specific tables in specific instances, and share a few tricks along the way.
As described in my previous posts Aggregate Tables and Handling Count Metrics in Aggregates, the MicroStrategy SQL Engine has a set of rules for determining the best table to use when building a query.  The cases where it has more than one option arise from performance tuning using Aggregate Tables, which are tables at a higher level than the detail that can be used as shortcuts in a query.  Additionally, sometimes it may be necessary to build “Sub-Set” tables, which are at the same level as the detail but contain far fewer records, usually only those pertinent to a particular type.
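The gist of those rules can be sketched roughly as: among the tables that carry the needed fact and cover the attributes the report asks for, pick the one with the smallest logical size.  Here's a toy sketch of that idea (this is NOT MicroStrategy's actual algorithm, and the table names and sizes are invented):

```python
# Toy sketch of aggregate-aware table selection (NOT MicroStrategy's
# real engine): among candidate tables that carry the fact and cover
# the requested attributes, prefer the smallest logical size.

def pick_table(tables, fact, attrs):
    """tables: list of dicts with 'name', 'facts', 'attrs', 'logical_size'."""
    candidates = [
        t for t in tables
        if fact in t["facts"] and set(attrs) <= set(t["attrs"])
    ]
    if not candidates:
        raise ValueError("no table can answer this query")
    return min(candidates, key=lambda t: t["logical_size"])["name"]

# Hypothetical schema: a detail table and one aggregate.
schema = [
    {"name": "FactIncident",       "facts": {"Claim Amt"},
     "attrs": {"Day", "Driver", "State"}, "logical_size": 100},
    {"name": "AggIncidentByState", "facts": {"Claim Amt"},
     "attrs": {"Day", "State"},          "logical_size": 30},
]

print(pick_table(schema, "Claim Amt", ["Day", "State"]))  # aggregate wins
print(pick_table(schema, "Claim Amt", ["Driver"]))        # must fall back to detail
```

The point of Anchoring, as we'll see, is to step in when this smallest-candidate behavior picks a table you didn't intend.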

Example:
Let’s say you work in the Automotive Insurance business, and you have a table FactIncident which contains every incident a driver has experienced, ranging from traffic violations to serious accidents.  A common metric is the distinct number of customers involved in any kind of accident.  Count(Distinct) is one of the heaviest operations a database can perform, and if the FactIncident table is very large, there could be a lot of wasted processing time.  Worse, a Count(Distinct) metric can’t be aggregated at all, so a normal Aggregate table won’t help.  A possible performance improvement is to remove the facts from the table and keep only a single record per unique individual.  If a person was in 5 accidents, you only need 1 record to perform the Count(Distinct) and get the same value.  For this kind of metric, a Sub-Set table is really the only performance gain available.
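To see why the Sub-Set table gives the same answer over far fewer rows, here's a small sketch with made-up data, using SQLite in place of a real warehouse:

```python
import sqlite3

# Sketch with invented data: a one-row-per-driver sub-set table gives
# the same Count(Distinct) as the full incident table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE FactIncident (driver_id INT, incident_type TEXT)")
cur.executemany(
    "INSERT INTO FactIncident VALUES (?, ?)",
    [(1, "accident"), (1, "accident"), (1, "violation"),
     (2, "accident"), (3, "violation"), (3, "accident"), (3, "accident")],
)
# Sub-set table: only drivers who had an accident, one record each.
cur.execute("""
    CREATE TABLE FactAccidentsOnly AS
    SELECT DISTINCT driver_id FROM FactIncident
    WHERE incident_type = 'accident'
""")
full = cur.execute(
    "SELECT COUNT(DISTINCT driver_id) FROM FactIncident "
    "WHERE incident_type = 'accident'").fetchone()[0]
subset = cur.execute(
    "SELECT COUNT(DISTINCT driver_id) FROM FactAccidentsOnly").fetchone()[0]
print(full, subset)  # same value, but the sub-set table holds 3 rows vs 7
```

On a real table with millions of incident rows, the row-count gap (and the saved Count(Distinct) work) is obviously far more dramatic than in this toy.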
But now we have a problem.  If our metric is defined as Count(Distinct Driver), how does MicroStrategy know which table to use?  Since we didn’t remove any attributes, the logical size will be the same as the main FactIncident table, and if the query is running out of that table for other metrics, there’s a good chance it’ll pull this one from there as well.  But we can’t lower the Logical Size of this FactAccidentsOnly table, or normal queries may find their way into it, and since it only contains a Sub-Set of the real data, it would return incorrect results.  So how do we tell MicroStrategy that we always want it to measure this Count(Distinct) metric from our special Sub-Set table?
The answer is that every aggregation function in a Metric has a Parameter called Fact ID.  The determination of which table is used in a query is driven by the Fact, independently of other Facts that may be involved on the report.  So when you have a metric that is Count(Attribute) and no Fact is involved, that query could resolve out of any table the Attribute is in, even when it doesn’t make sense (like possibly the lookup table).  To force MicroStrategy to use specific tables when a factless metric is involved, you can set the Fact ID to whatever you want.  The Fact ID is not involved in the actual SQL or calculation, but its set of available tables is used to determine where the query should be sourced from.
To take advantage of this, I create specific Facts that I call Anchors, which serve to “anchor” a metric to a specific table (or set of tables).  It doesn’t matter which column from the table you pick, but I usually pick a Key column just for consistency.  I call it “(Anchor) FactAccidentsOnly”, which is just my personal naming convention of adding that prefix.  I put them in a separate folder under the Facts folder so they’re out of the way (called Anchors), and now for any metric that I want to always source from that table, I can set its Fact ID to (Anchor) FactAccidentsOnly.
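One way to picture the effect of an Anchor fact is that it shrinks the candidate-table list before the engine makes its choice.  A toy illustration (again, not the real SQL Engine; all names here are invented):

```python
# Toy illustration (not the real SQL Engine): a factless Count(Driver)
# metric could resolve from any table containing Driver, while an
# anchor fact mapped only to FactAccidentsOnly narrows the pool to it.

table_attrs = {
    "LuDriver":          {"Driver"},         # lookup table
    "FactIncident":      {"Driver", "Day"},  # full detail
    "FactAccidentsOnly": {"Driver"},         # sub-set table
}
anchor_fact_tables = {"(Anchor) FactAccidentsOnly": {"FactAccidentsOnly"}}

def candidates(attr, anchor=None):
    names = {t for t, attrs in table_attrs.items() if attr in attrs}
    if anchor is not None:
        # The anchor's set of available tables restricts the pool.
        names &= anchor_fact_tables[anchor]
    return sorted(names)

print(candidates("Driver"))                                # any table, even the lookup
print(candidates("Driver", "(Anchor) FactAccidentsOnly"))  # only the sub-set table
```

With the pool cut down to a single table, there's nothing left for logical-size comparisons to get wrong.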
Another possible approach, though one that I use a lot less frequently, is an Attribute Anchor.  It’s the same concept as the Fact Anchor, except implemented as an Attribute.  It works the same way, by artificially limiting the number of tables available for the query.  One reason you may want to use this is if you want the anchoring to occur because of a filter.
Example:
Say that you’re running ETL throughout the day at near real time.  If you have high volume, you’re probably reliant on Aggregate tables, and updating those Aggregate tables in real time probably isn’t practical; plus, users probably don’t look at date ranges including Today very often (since it would mean comparing full days to a partial day).  Knowing this, you may choose to set up your schema such that a query for Today hits the detail table, and a query for dates < Today hits the Aggregate table.  While this sounds like a call for Metadata Table Partitioning, you could also use Attribute Anchoring to achieve it a little more cleanly.

You can create a separate attribute called “Today” and a Table Alias for your Fact table called “Fact Today”.  Map all of the same attributes/facts to this alias table, but be sure to set its logical table size to a manually high number so that it doesn’t accidentally overtake the Aggregates.  Then, you can make a filter where “Today (Attribute) = Today (Dynamic Date)”, and whenever you use that filter, the SQL Engine will resolve out of the FactToday table, since it is the only place that contains the Today attribute.  Other filters using any other kind of date range will still run out of the Aggregate tables.
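To convince yourself the routing stays correct, here's a small sketch with invented data (SQLite standing in for the warehouse; in MicroStrategy the “FactToday” alias points at the same physical detail table):

```python
import sqlite3
from datetime import date, timedelta

today = date.today().isoformat()
yesterday = (date.today() - timedelta(days=1)).isoformat()

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Detail table holds everything, including today's partial day.
cur.execute("CREATE TABLE FactSales (day TEXT, amt INT)")
cur.executemany("INSERT INTO FactSales VALUES (?, ?)",
                [(yesterday, 10), (yesterday, 5), (today, 7)])
# Aggregate is refreshed nightly, so it lacks today's rows.
cur.execute("""CREATE TABLE AggSalesByDay AS
               SELECT day, SUM(amt) AS amt FROM FactSales
               WHERE day < ? GROUP BY day""", (today,))
# A "Today" filter can only resolve from the detail (alias) table,
# because the Today attribute is mapped only there...
todays_total = cur.execute(
    "SELECT SUM(amt) FROM FactSales WHERE day = ?", (today,)).fetchone()[0]
# ...while any ordinary date filter runs out of the Aggregate.
hist_total = cur.execute(
    "SELECT SUM(amt) FROM AggSalesByDay WHERE day = ?",
    (yesterday,)).fetchone()[0]
print(todays_total, hist_total)  # 7 from detail, 15 from the aggregate
```

The key property is that neither query path can return a wrong number: today's partial day never leaks out of the aggregate, and historical queries never pay detail-table prices.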

Compared to Metadata Table Partitioning (which I’ll cover in more detail in a future post), this approach produces slightly easier-to-read SQL and, depending on what pre-queries need to run to determine the proper tables for a query, possibly better performance (it at least won’t be worse).  The trade-off is that the schema gets a little more complicated, and you may run into “broken” reports when using the Today attribute.  You also couldn’t run a query for “This week including today”, but this is just an example.  As I said, I rarely use this approach, but I wanted to put it out there as another tool at your disposal.
