Triggering Cubes & Extracts
One function that never seems to be included in BI tools is an easy way to kick off an application cache job once your ETL is finished. MicroStrategy's Cubes and Tableau's Extracts both rely on manual or time-based refresh schedules, which leaves you in a position where your data lands in the database and you either have a large gap before the dashboard is updated or you refresh constantly and waste system resources. Both tools come with command line utilities for kicking off a refresh, but it's up to you to figure out how to get your ETL jobs to call those commands. What follows is a solution that works in my environment and will probably work for yours as well. There are of course a lot of ways for your ETL tool to tell your BI tool that it's time to refresh a cache, but this is my take on it. You won't find a download-and-install software package here since everyone's environment is different, but you will find ample blueprints and examples for how to build your own for your platform and for whatever BI tool you use (from what I've observed, this setup is fairly common). Trigger was first demoed at the Tableau Conference 2014. You can jump to the Trigger demo here.
The Design
The basic idea is that ETL tools have a variety of features and abilities, but one thing they all can absolutely do is update the data in a table. Add to this the potential limitation of having to install BI command line tools on ETL boxes that could be locked down, behind firewalls, or running on a platform the BI vendor doesn't support, and you arrive at using a signal table to control the flow.
The signal table contains one row per extract, with a status indicating whether or not the ETL has completed. All we have to do is add a final step to the ETL job that updates the flag on the appropriate row to let us know that the extract should now be executed.
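For example, if the signal table lives in the same warehouse your ETL already writes to, that final step can be a single UPDATE. Here's a minimal sketch; the table name (trigger_signal), the ODBC DSN, and the use of pyodbc are all assumptions for illustration, and the columns match the signal table described below:

[codesyntax lang="python"]
import pyodbc  # assumed driver; use whatever client your ETL box already has


def mark_ready(trigger_id, data_as_of):
    # Flip the trigger's STATUS to 'Ready' so the trigger service picks it up
    # on its next poll. Table and column names are assumptions.
    conn = pyodbc.connect('DSN=warehouse')  # assumed DSN
    try:
        conn.cursor().execute(
            "UPDATE trigger_signal "
            "SET STATUS = 'Ready', DATA_AS_OF = ? "
            "WHERE TRIGGER_ID = ?",
            data_as_of, trigger_id)
        conn.commit()
    finally:
        conn.close()


mark_ready('sales_dash', '2014-09-30')
[/codesyntax]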
A trigger service then runs continuously, watching this signal table and waiting for one of the flags to change to 'Ready'. When it sees one, the service calls the BI tool's command line tool to trigger the refresh. This service can be written in any scripting language and can be hosted anywhere, as long as it can read from the database and execute the appropriate BI command line tool. A common place to host it is the BI server itself, since that box already has access to the database (firewall-wise) and has the necessary software on it.
Those are the two basic pieces of a Trigger service, but there are a couple of additional, optional pieces you could add if you want to fancy it up a bit:
An API Service could generalize the call to the BI tools in the event you have multiple tools and want a single entry point, or if you want to expose BI command line functions to systems other than extract triggering. In our environment, we built a little web service that receives a command and runs it locally. It's an extra piece that gives us the flexibility of hosting the trigger service somewhere else (like a central service center). Instead of calling local command line tools, our trigger service just hits the web service. It's an extra step, but depending on your plans it can make for a more maintainable system.
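To give a sense of how small that piece can be, here's a rough sketch of a command-runner web service using Flask (an assumption; ours isn't built exactly this way). Because it runs whatever command it receives, you'd want to whitelist commands and restrict who can reach it before using anything like this:

[codesyntax lang="python"]
# Minimal command-runner web service sketch. It executes whatever command it
# is handed, so validate/whitelist commands and lock down network access.
import subprocess
from flask import Flask, request, jsonify

app = Flask(__name__)


@app.route('/run', methods=['POST'])
def run_command():
    cmd = (request.get_json() or {}).get('cmd', '')
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return jsonify({'returncode': result.returncode,
                    'stdout': result.stdout,
                    'stderr': result.stderr})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
[/codesyntax]

The trigger service then POSTs the command it would otherwise have run locally.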
Another optional piece is a Web UI to help people create new triggers. It's not really that hard to just insert a new row into the signal table, but everyone loves a pretty UI, especially when it can add some conveniences and safeguards.
The Signal Table
The signal table is simply a list of the extracts we want to maintain and their current status. We’ll refer to each item as a ‘trigger’ and it can be considered a single ‘job’.
Columns
- TRIGGER_ID – This is the main ID that you’ll reference throughout the system. Make it unique, but make it whatever you want.
- TRIGGER_DESC – This is an optional description so you know what each trigger does, for when your naming convention isn't descriptive enough or you want some additional notes.
- STATUS – This is the current status that the trigger service will be polling to determine when it should refresh an extract.
- DATA_AS_OF – This is an optional field that lets you keep track of how fresh the data is. It isn't a refresh timestamp, but rather the as-of date of the data that just landed.
- APP_TYPE – If you want to support multiple BI tools from one service or you need to support multiple environments, you’ll want to differentiate them here.
- SCRIPT – This is the command parameter that will be executed. More details about how this works are in the Trigger Service section. Another thing you may want to support here is a single trigger refreshing multiple extracts. I implement this by separating them with || (not pictured), for example {"workbook": "SalesDash"}||{"datasource": "SalesData"}. The Trigger Service will end up parsing this later.
- EMAIL – Inevitably, stuff breaks. Who you gonna call?
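To make those columns concrete, here's roughly what creating the table and registering a trigger might look like. The table name, types, and sample values are assumptions, and sqlite3 is just a stand-in for whatever database you actually use:

[codesyntax lang="python"]
import sqlite3  # stand-in for your warehouse database

conn = sqlite3.connect('signal.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS trigger_signal (
        TRIGGER_ID   TEXT PRIMARY KEY,
        TRIGGER_DESC TEXT,
        STATUS       TEXT,   -- e.g. 'Ready', 'Triggered'
        DATA_AS_OF   TEXT,   -- as-of date of the data that just landed
        APP_TYPE     TEXT,   -- e.g. 'tableau', 'mstr'
        SCRIPT       TEXT,   -- parameters substituted into the EXEC_CMD
        EMAIL        TEXT    -- who to notify when something breaks
    )""")
conn.execute(
    "INSERT INTO trigger_signal VALUES (?, ?, ?, ?, ?, ?, ?)",
    ('sales_dash', 'Nightly sales dashboard refresh', 'Ready', '2014-09-30',
     'tableau', '{"workbook": "SalesDash"}', 'bi-team@example.com'))
conn.commit()
[/codesyntax]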
Trigger Service
The next piece is a script of some kind that will poll the signal table periodically and check for new ‘Ready’ statuses. When it finds one, it will immediately execute the SCRIPT and set the STATUS to ‘Triggered’. We run this every 5 minutes to process any triggers that are ‘Ready’.
One more table we'll need for this to work is a configuration table that tells each BI tool or environment how and where to execute its script.
These are going to be command line executions, and in the case of Tableau I'm using both a web service API (see Bonus Points: API Service) and native tabcmd.
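The config table really only needs to map an APP_TYPE (and environment) to a command template containing a {SCRIPT} placeholder. The sketch below shows the idea as a Python dict; the commands, paths, and credentials are assumptions. Note that the JSON-style SCRIPT values shown above are what I send to the web service, whereas a plain tabcmd template would expect SCRIPT to hold tabcmd arguments instead:

[codesyntax lang="python"]
# Illustrative EXEC_CMD templates keyed by APP_TYPE. Everything here is an
# assumption; swap in the tools, flags and credentials for your environment.
CONFIG = {
    # Tableau via tabcmd: SCRIPT would hold the arguments,
    # e.g. --workbook "SalesDash" or --datasource "SalesData"
    'tableau': {'EXEC_CMD': 'tabcmd refreshextracts {SCRIPT}'},

    # MicroStrategy via Command Manager: SCRIPT would name a .scp script file
    # (check the Command Manager docs for the exact flags on your version)
    'mstr': {'EXEC_CMD': 'cmdmgr -n "ProdSource" -u admin -p secret -f {SCRIPT}'},
}
[/codesyntax]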
The script itself should query the signal table for the list of triggers in the ‘Ready’ status and then iterate through that list and execute the EXEC_CMD from the config table + the SCRIPT parameters from the signal table.
Here’s a snippet from my implementation (written in Python):
[codesyntax lang="python"]
import os
import sys


def exec_trigger(trigger, config):
    trigger_id = trigger['TRIGGER_ID']
    try:
        script = trigger['SCRIPT']
        exec_cmd = config.get('EXEC_CMD') if config else None
        if exec_cmd:
            # Support for multiple extracts per trigger, delimited by ||
            for t in script.split('||'):
                cmd = exec_cmd.replace('{SCRIPT}', t)
                print(">>>>executing: {0}".format(cmd))
                result = os.popen(cmd)
                update_status(trigger_id, 'Triggered', result.readlines())
        else:
            log(trigger_id, 'fail',
                'unknown trigger: id({0}) app_type({1}) script({2})'
                .format(trigger_id, trigger.get('APP_TYPE'), script))
    except Exception:
        log(trigger_id, 'fail', sys.exc_info()[0])
[/codesyntax]
The exec_trigger() function accepts a trigger record and a config record. The trigger record contains all of the settings for the individual trigger (like its trigger_id for logging and its script to execute). The config record contains properties like the exec_cmd, which is the command line syntax used to execute the refresh.
One feature I support is the ability for one trigger to refresh multiple extracts, but since the BI tool probably doesn't support that, I break them out into separate calls. I chose || as the delimiter, so I just split the script and process each item, for example {"workbook": "SalesDash"}||{"datasource": "SalesData"}. Processing is just running the command line command, using the app_type's exec_cmd and replacing the {SCRIPT} tag with whatever this specific trigger's parameters are.
Other functions (not pictured but called in this code) are things like update_status(), which simply updates the signal table to show the trigger is finished and calls my log() function to indicate success.
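For completeness, the outer polling pass that finds the 'Ready' rows and hands them to exec_trigger() can be as simple as the sketch below. fetch_ready_triggers() and fetch_config() are assumed helpers that read the signal and config tables, and the run-every-five-minutes part is left to whatever scheduler you already have (cron, Task Scheduler, etc.):

[codesyntax lang="python"]
# One polling pass: look up configs, find 'Ready' triggers, fire each one.
def poll_once():
    configs = fetch_config()                # assumed: {APP_TYPE: {'EXEC_CMD': ...}}
    for trigger in fetch_ready_triggers():  # assumed: rows where STATUS = 'Ready'
        config = configs.get(trigger['APP_TYPE'])
        if config:
            exec_trigger(trigger, config)
        else:
            log(trigger['TRIGGER_ID'], 'fail',
                'no config for app_type({0})'.format(trigger['APP_TYPE']))


if __name__ == '__main__':
    poll_once()
[/codesyntax]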
If you want to run this script from a Linux box, Tableau doesn't offer a Linux version of tabcmd, but you can reference this post by Tamas Foldi on how to get tabcmd working on Linux.
Conclusion
Some people had asked me to post the full code that I use, but I unfortunately can't do that and don't think it would help you anyway. The UI I built is in Hack (which I really doubt is a platform you have available to you), and my Python trigger service is simply the one function above plus supporting logging and emailing functions that are specific to my environment. I think the important thing to share is the idea and design, because even though it sounds really simple and obvious, a lot of companies skip this step and stick with periodic schedules that end up burning system resources unnecessarily. I've been at other companies that have done very similar things, and since every company's stack, needs and capabilities are going to be different, you should build this system in a way that works for you.
Do you use message queues like RabbitMQ? We are using this to monitor lab processes and build a warehouse when a message is consumed by a listener.
Nope, haven’t heard of it.
Lucky I bumped into this blog post of yours. We are just looking for a solution to replace the time-based refresh with event-based triggers. I'm thinking of using System Manager (MSTR) to play the role of the trigger service, as you call it. Have you any experience with it?
I have not used it for more than a POC, but System Manager is just a bulk operation tool and I don’t believe it would be used in this context. You’d still need some way for your ETL to tell your Cube to refresh, so you’d either just call Command Manager directly, or you’d use some kind of central monitoring system like the one I’ve outlined above.
Not sure how much overhead this continuous polling process adds for the server, but we have Control-M set up for this job, where ETL job completion triggers the BI jobs. We have this set up for "Cache clear", "Cube refresh", etc. The emailing facility was easy to establish since 9.2.1, and with System Manager in MSTR that too became available. Your method will definitely save additional software costs, though. Thank you Brian for sharing this. The write-up was wonderful as always.
Sounds quite good,
do you have any of these scripts open sourced? Not only Trigger, but also things related to the format of Tableau workbook files.
It would be quite useful for the community.
I actually rewrote this system recently, and I think the code is clean, accessible and presentable enough now that I could share it. I'll look into updating this post when I can.
I did work with some others on some scripts to chop up the Tableau workbook files, and we had some demos that we showed at the Tableau Conference, but we never followed through with productionizing any of it. We went a different direction from all of the Tableau hacking and mostly lost interest in those kinds of projects. It was pretty straightforward though, just working with XML, and we didn't find anything too complicated in there that we couldn't reverse engineer by just looking at it or doing a diff after saving a change.
Brian.
We use Tivoli to run our ETL and the final job triggers a .bat file on our server to trigger MicroStrategy events to build cubes. The trick is when to trigger subscriptions once the cubes are completed. I created a cube monitor that leverages the iServer SDK to check statuses for one or multiple cubes. Once they are finished loading, the subscription event is triggered.
Hi Rich, in order to trigger the subscriptions after the cubes are refreshed, you could use MicroStrategy System Manager, which, if you aren't already aware, is basically a workflow definition system. You could write Command Manager scripts to trigger the events that you want and arrange them so that the subscriptions are triggered only when the cube refreshes are successful. Am I missing something? Having said that, I would love to hear more about the cube monitor that you have built.
Brian, have you ever built a trigger for when a Tableau extract finishes, kicking off the subscription schedule to send out an email letting users know that the report data was refreshed on the server?