Triggering Cubes & Extracts

One function that never seems to be included in BI tools is an easy way to kick off an application cache job once your ETL is finished.  MicroStrategy’s Cubes and Tableau’s Extracts both rely on manual or time-based refresh schedules, which leaves you in a position where your data lands in the database and either you have a large gap before the dashboard is updated, or you refresh constantly and waste lots of system resources.  Both tools come with command line utilities for kicking off a refresh, but it’s up to you to figure out how to make your ETL jobs call those commands.  What follows is a solution that works in my environment and will probably work for yours as well.  There are of course a lot of ways for your ETL tool to tell your BI tool that it’s time to refresh a cache, but this is my take on it.  You won’t find a download-and-install software package here, since everyone’s environment is different, but you will find ample blueprints and examples for how to build your own for your platform and for whatever BI tool you use (from what I’ve observed, this setup is fairly common).  Trigger was first demoed at the Tableau Conference 2014.  You can jump to the Trigger demo here.

The Design

The basic idea is that ETL tools have a variety of features and abilities, but one thing they all can absolutely do is update the data in a table.  Add to this the potential limitation of having to install BI command line tools on ETL boxes that could be locked down, behind firewalls or not a supported platform of the BI vendor, and you arrive at using a signal table for controlling the flow.

The signal table contains one row per extract, with a status indicating whether or not the ETL has completed.  All we have to do is add a final step to the ETL job that updates the flag on the appropriate row to let us know that the extract should now be executed.

A trigger service then runs continuously, watching this signal table and waiting for one of the flags to change to ‘Ready’.  When it sees one, the service calls the BI tool’s command line tool to trigger the refresh.  This service can be written in any scripting language and can be hosted anywhere, as long as it can read from the database and execute the appropriate BI command line tool.  A common place to host it is on the BI server itself, since that machine already has access to the database (firewall-wise) and has the necessary software on it.

Those are the two basic pieces of a Trigger service, but there are a couple of additional, optional pieces you could add if you want to fancy it up a bit:

An API Service could generalize the call to the BI tools in the event you have multiple tools and want a single entry point, or if you want to expose BI command line functions to systems other than the trigger service.  In our environment, we built a little web service that receives a command and runs it locally.  It’s an extra piece that gives us the flexibility of hosting the trigger service somewhere else (like a central service center): instead of calling local command line tools, our trigger service just hits the web service.  It’s an extra step, but depending on your plans it can make for a more maintainable system.
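A minimal version of that web service can be sketched with nothing but the standard library.  The command names and whitelist here are hypothetical (and you would want authentication in front of anything real), but the shape is the same: receive a command, run it locally, return the output:

```python
import json
import subprocess

# Hypothetical whitelist: map command names to the local executables
# they are allowed to run, rather than executing arbitrary input.
ALLOWED_COMMANDS = {
    'echo_test': ['echo', 'refresh requested'],
}

def command_app(environ, start_response):
    """Tiny WSGI app: receive a JSON command and run it locally."""
    try:
        size = int(environ.get('CONTENT_LENGTH') or 0)
        payload = json.loads(environ['wsgi.input'].read(size))
        cmd = ALLOWED_COMMANDS[payload['command']]
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [out.encode()]
    except (KeyError, ValueError):
        start_response('400 Bad Request', [('Content-Type', 'text/plain')])
        return [b'unknown command']
```

Served with `wsgiref.simple_server.make_server`, the trigger service then POSTs to this endpoint instead of shelling out on its own box.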

Another optional piece is a Web UI to help people create new triggers.  It’s not really that hard to just insert a new row into the signal table, but everyone loves a pretty UI, especially when it can add some conveniences and safeguards.


The Signal Table

The signal table is simply a list of the extracts we want to maintain and their current status.  We’ll refer to each item as a ‘trigger’ and it can be considered a single ‘job’.

trigger_signal

Columns

  • TRIGGER_ID – The main ID that you’ll reference throughout the system.  Make it unique, but make it whatever you want.
  • TRIGGER_DESC – An optional description so you know what each trigger does, in case your naming convention isn’t descriptive enough or you want additional notes.
  • STATUS – The current status that the trigger service polls to determine when it should refresh an extract.
  • DATA_AS_OF – An optional field that keeps track of how fresh the data is.  This isn’t a refresh timestamp, but rather the as-of date of the data that just landed.
  • APP_TYPE – If you want to support multiple BI tools from one service, or you need to support multiple environments, you’ll want to differentiate them here.
  • SCRIPT – The command parameter that will be executed.  More details about how this works in the Trigger Service section.  Another thing you may want to support here is a single trigger refreshing multiple extracts.  I implement this by separating them with || (not pictured), for example {"workbook": "SalesDash"}||{"datasource": "SalesData"}.  The trigger service will end up parsing this later.
  • EMAIL – Inevitably, stuff breaks.  Who you gonna call?
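Putting those columns together, a minimal version of the table might look like this (the column types are assumptions; use whatever fits your database):

```python
import sqlite3

SIGNAL_TABLE_DDL = """
CREATE TABLE trigger_signal (
    TRIGGER_ID   TEXT PRIMARY KEY,  -- unique name referenced everywhere
    TRIGGER_DESC TEXT,              -- optional human-readable notes
    STATUS       TEXT,              -- 'Ready', 'Triggered', etc.
    DATA_AS_OF   TEXT,              -- as-of date of the data that landed
    APP_TYPE     TEXT,              -- which BI tool/environment to call
    SCRIPT       TEXT,              -- ||-delimited {SCRIPT} parameters
    EMAIL        TEXT               -- who to notify when stuff breaks
)
"""

conn = sqlite3.connect(':memory:')
conn.execute(SIGNAL_TABLE_DDL)
```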


Trigger Service

The next piece is a script of some kind that polls the signal table periodically and checks for new ‘Ready’ statuses.  When it finds one, it immediately executes the SCRIPT and sets the STATUS to ‘Triggered’.  We run this every 5 minutes to process any triggers that are ‘Ready’.
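The polling loop itself can be this simple.  This is a sketch: the database driver is whatever yours is, and `exec_trigger` is the function shown later in this post (here it is passed in so the loop stays generic):

```python
import time

POLL_INTERVAL_SECS = 300  # we run every 5 minutes

def poll_once(conn, configs, exec_trigger):
    """Find 'Ready' triggers and fire each one; returns how many fired."""
    rows = conn.execute(
        "SELECT TRIGGER_ID, APP_TYPE, SCRIPT FROM trigger_signal "
        "WHERE STATUS = 'Ready'").fetchall()
    for trigger_id, app_type, script in rows:
        trigger = {'TRIGGER_ID': trigger_id, 'APP_TYPE': app_type,
                   'SCRIPT': script}
        exec_trigger(trigger, configs[app_type])
    return len(rows)

def poll_forever(conn, configs, exec_trigger):
    while True:
        poll_once(conn, configs, exec_trigger)
        time.sleep(POLL_INTERVAL_SECS)
```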

One more table we’ll need for this to work is a configuration table that tells the service how and where to execute each BI tool’s or environment’s script.

[Screenshot: the trigger_config table, one row per APP_TYPE with its EXEC_CMD]

These are going to be command line executions, and in the case of Tableau I’m using both a web service API (see Bonus Points: API Service) and a native tabcmd.
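For illustration, a couple of hypothetical config rows.  The paths and URLs are made up for this sketch; the one real convention is the {SCRIPT} token, which the trigger service replaces with each trigger’s SCRIPT value at execution time:

```python
# Hypothetical trigger_config rows, keyed by APP_TYPE.
TRIGGER_CONFIG = {
    'tableau_tabcmd': {
        # A local wrapper script that calls tabcmd with the given parameters.
        'EXEC_CMD': "/opt/scripts/refresh_tableau.sh '{SCRIPT}'",
    },
    'tableau_api': {
        # Hitting the bonus-points web service instead of a local binary.
        'EXEC_CMD': "curl -s -X POST http://bi-server/api/refresh -d '{SCRIPT}'",
    },
}
```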

The script itself should query the signal table for the list of triggers in the ‘Ready’ status, then iterate through that list and execute the EXEC_CMD from the config table combined with the SCRIPT parameters from the signal table.

Here’s a snippet from my implementation (written in Python):

import os
import sys

def exec_trigger(trigger, config):
    trigger_id = trigger.get('TRIGGER_ID')
    try:
        script = trigger['SCRIPT']
        exec_cmd = config['EXEC_CMD']

        for t in script.split('||'):  # Support for multiple extracts per trigger
            cmd = exec_cmd.replace('{SCRIPT}', t)
            print(">>>>executing: {0}".format(cmd))
            result = os.popen(cmd)  # Run the BI tool's command line refresh
            update_status(trigger_id, 'Triggered', result.readlines())
    except KeyError:
        log(trigger_id, 'fail',
            'unknown trigger: id({0}) app_type({1})'
            .format(trigger_id, trigger.get('APP_TYPE')))
    except Exception:
        log(trigger_id, 'fail', sys.exc_info()[0])

The exec_trigger() function accepts a trigger record and a config record (dictionaries in my Python implementation).  The trigger record contains all of the settings for the individual trigger (like its trigger_id for logging and its script for executing).  The config record contains properties like the exec_cmd, which is the command line syntax to execute the refresh.

One feature I support is the ability for one trigger to refresh multiple extracts, but since the BI tool probably doesn’t support that directly, I break them out into separate calls.  I chose || as the delimiter, so I just split the script and process each item, for example {"workbook": "SalesDash"}||{"datasource": "SalesData"}.  Processing is just running the command line command: take the app_type’s exec_cmd and replace the {SCRIPT} tag with whatever this specific trigger’s parameters are.
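The split-and-replace described above, as a tiny worked example (the command template here is hypothetical):

```python
exec_cmd = 'refresh_tool {SCRIPT}'  # hypothetical EXEC_CMD template
script = '{"workbook": "SalesDash"}||{"datasource": "SalesData"}'

# One trigger, two extracts: split on || and expand the template per item.
commands = [exec_cmd.replace('{SCRIPT}', part) for part in script.split('||')]
# commands[0] is 'refresh_tool {"workbook": "SalesDash"}'
# commands[1] is 'refresh_tool {"datasource": "SalesData"}'
```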

Other functions (not pictured but called in this code) include update_status(), which simply updates the signal table when a trigger has fired, and log(), which records the success or failure.

If you want to run this script from a Linux box, note that Tableau doesn’t ship a Linux version of tabcmd; you can reference this post by Tamas Foldi on how to get tabcmd working on Linux.


Conclusion

Some people have asked me to post the full code that I use, but I unfortunately can’t do that, and I don’t think it would help you anyway.  The UI I built is in Hack (which I really doubt is a platform you have available to you), and my Python trigger service is simply that one function above plus supporting logging and emailing functions that are specific to my environment.  I think the important thing to share is the idea and the design, because even though it sounds really simple and obvious, a lot of companies skip this step and stick with periodic schedules that end up burning system resources unnecessarily.  I’ve been at other companies that have done very similar things, and since every company’s stack, needs and capabilities are going to be different, you should build this system in a way that works for you.
