Archive for April, 2011

Why Isn’t Data Enrichment A Standard Practice In Every Data Warehousing Project?

Data enrichment is a terrific way to increase the value proposition of your data warehouse, and yet, it is implemented quite rarely. With data enrichment you can almost overnight provide your sales team with additional mountains of information about the demographics of prospects, provide finance with accurate client addresses and other contact info, provide marketing with competitive information about companies in related industries and help manufacturing by assigning standard attributes to parts used in manufacturing. So, why is Data enrichment not a standard phase for almost any data warehouse project?
Let’s start by understanding what data enrichment is. By data enrichment, I am referring to the practice of adding additional data attributes that do not exist within company internal databases, to data that does. There are many companies that provide such enrichment services in almost any industry, probably one of the most well-known ones is D&B. So, for example, you could send Dun and Bradstreet (D&B) your client list and get in return the complete address, phone and some other stats about the companies in the list.

That sounds terrific, right? The premise is simple and is loaded with ROI. If you have the money to invest in such a process, why wouldn’t you? Well, the answers are typically related to process, trust and being able to step “outside-of-the-box”.
Strange as it may sounds, data warehouses are quite intimate affairs. They involve millions, or billions or details about the way individuals go about conducting their business as part of their day-to-day work.  As such they are considered a core part of the business and opening up this internal, often highly pervasive and massive database to an external, foreign data source can be a difficult proposition for both business line and IT managers.
There are also technical challenges involved in such integration, such as the ability to sync the historical changes in the organic company data with the historical changes (if any are tracked) in the procured data. Sometimes, there is also a need to provide feedback on such procured data, and a feedback loop/mechanism must be developed.
And finally, it does require some degree of “out-of-the-box” thinking to integrate an external data feed used to enrich company data into your most internal and intimate database.
If you are able to overcome the procedural, trust and “out-of-the-box” challenges though, you may find yourself rewarded with a wealth of information your business users would find extremely useful and that you would not have been able to obtain otherwise.
 

Abusing the date dimension: when does it make sense to NOT use it

The date dimension is a dimensional model staple. You will not find a data mart, data warehouse or any kind of a reporting database that does not have one form or another of a date dimension. The benefits of using a date dimension are numerous, and it’s a great tool for report developers, who can simply reference columns in the date dimension rather than look to write complex sql functions to perform date gymnastics. However, when it comes to the semantic layer, I suggest being much more judicious when implementing the date dimension. Overusing aliases in your semantic layer to represent many different dates, or using contextual separation of dates, can add significant complexity to your semantic model and reduce its maintainability dramatically. And the value of using the date dimension with many of the dates tracked in your fact may not be significant.

So, when does it make sense to NOT use your date dimension? The first condition that should exist for you to choose not to use your date dimension is the use of “smart keys” for your dates. Smart keys were very popular 30 years ago, and were used to embed business rules and logic into a numbering system that represented the rows values. This sounded like a great idea, until these business rules began changing. As the rules embedded in the smart keys started getting more dynamic, database designers faced a whole slew of new problems and headaches due to the need to make changes in existing keys. To avoid these problems, data warehouses use surrogate keys (or synthetic keys). These are “dumb” serial numbers that are used to encapsulate the “business keys” (legacy keys) and make handling of changes much easier. This is an almost bullet proof technique and you cannot go wrong with implementing surrogate keys in any dimension, except, the date dimension. This is the only exception that merits a diversion. Dates are not “slowly changing dimensions”, and most implementations of a date dimension are implemented at the grain of a day, and do not require multiple rows per day. This kind of implementation allows the use of “smart” date keys. Keys that follow the convention yyyymmdd (for example 20110419) are numeric, easy to produce and maintain, and very importantly, will allow report developers to deduct the date, without using the date dimension directly.

And this leads me to the second condition under which date dimensions should be considered not applicable in the semantic model: date proliferation. If your semantic model looks like this, before accounting for anything else but dates, you may want to reconsider your use of the date dimension.

Many subject areas require the use of many tens or dates per fact. Especially with so called “factless facts”, or facts associated with domains that are more process oriented then measurement oriented, it’s very common to find dates upon dates that can be used to analyze the process. Think about your sales process and how it is captured in your CRM: your lead generation date, your initial call to the lead date, your second call to the lead date, your first meeting date, your first quote date, your revised quote date, your po date, your booking date, your revenue recognition date, and so on and so forth. Or imagine a ticketing system used to track tickets escalation in a highly regulated environment where each step of the escalation process goes through several approvals, all happening at different dates. So, while it is easy to imagine a fact of this sort with 50-60 dates, would you really want, or need to create these many joins and aliases in your semantic model? Would that be productive and add value to your business users? Do they find it useful to know the date, week number in month, week number in year, month number, month name, year, quarter number, quarter name, weekday name, etc etc for each one of these dates? Or would you simply want to know the actual date? The common answer is that there are probably a handful to a dozen of dates that can benefit from a join to the date dimension, and for the rest, all that is needed is simply the ability to report the date. So, if your date dimension has a smart key, as described above, it would be simple to convert it to a date with a simple sql function. If more is needed, it would also be simple to join it back to a date dimension alias and add any kind of date permutation on it. This approach is also known as eating the pie AND having it…

Connect an Xcelsius dashboard to a public RSS feed using simple jsp proxy – work around cross domain problem

Published by Ron Keler in BI At Large, Xcelsius on April 11th, 2011 | No Comments »

Xcelsius has several data connectivity facilities that make it relatively easy for dashboard developers to connect to live data. Web Services and RSS feeds can be used via XML maps, or directly. However, the practical limitation anyone who ever tried using those features undoubtedly runs into, is the fact that these sources will typically reside on a different server, and hence will cause a cross-domain error on the swf file deployed from Xcelsius.
To address this problem, I used a simple .jsp based proxy. Since Xcelsius developers are usually working close to a BusinessObjects implementation, and that in turn, is typically deployed on a J2EE server (usually Tomcat), a .jsp solution seemed appropriate and applicable to many folks. All this proxy does is connect to the RSS (can be adjusted to connect to a Web Service) , and then re-stream it. That way, rather than connecting the swf to some remote server, it can be connected to a local file, hence eliminating the cross-domain problem. In the XLF connection, rather than using the actual URL, I am using the proxy URL (http://localhost:8080/proxy.jsp?proxyUrl=http://www.guardian.co.uk/news/series/24hoursinpictures/rss – be sure to replace localhost with your server name).
To illustrate this solution, I created an Xcelsius based swf that goes out to the Guardian UK 24 Hours In Pictures RSS feed (http://www.guardian.co.uk/news/series/24hoursinpictures/rss), and displays the images, and other info about them.
You can find the XLF, proxy.jsp file and related html/swf here. After downloading, extract the .zip package, and place the jsp.proxy, .html and .swf file in your web app root (for Tomcat this is in TomcatHomeDirectorywebappsROOT). You could then open a web browser on the server and load http://localhost(:8080typically)/PicOfTheDay.html
I would love to hear about others use of this concept.

Google Public Data Explorer

Published by Ron Keler in BI At Large on April 6th, 2011 | 2 Comments »

These are exciting and busy times for Business Intelligence at large, and data visualization in particular. It’s hard to keep up with all the innovations in the field. From SAP Xcelsius and Explorer, Microsoft Silverlight, QlikView, mobile players like Roambi and many many others. And, well, Google. I reviewed the Google Charts in a prior blog, but recently also ran into the Google Public Data Explorer labs project.

This is a fascinating concept. Google create an open platform designed to load and explore data. I wanted to share some personal observations about this platform, which I found worth highlighting:

  1. It’s open. You can upload any data set (once you understand how to package the data and describe it using the DSPL – more on that in a bit) and presto, you get a free data visualization and mining application.
  2. It makes an attempt at standardization. Google labs engineers faced a difficult problem: how to let developers describe their data in a generic manner. Every BI vendor today solved this problem with proprietary tools and technologies: SAP (BusinessObjects) with Universes, Oracle (OBIEE) with the Admin Tool, IBM (Cognos) with the Framework manager and so on and so forth. Google’s approach is XML based. Google’s Dataset Publishing Language (DSPL) is an attempt at standardizing the way datasets can be described. It borrows from the dimensional modeling (dimensions, metrics) as well as from more abstract notions (topics, concepts, slices).
  3. It uses flash. The data visualization engine itself is flash based, which is, to say the least, strange. Flash is not an open technology, and since Apple won’t allow it to run on its iOS devices, many vendors are looking to stir clear from it to assure their solutions will work on iPads, iPhones, and such.

Regardless of all this, it is still a very worthy development, and I encourage you to try it out for yourself. It took me about an hour to parse out for myself the DSPL XML structure, and produce a sample set of data files that loaded into the tool. That’s not bad, and reflects well on the simplicity of the design and the well documented tutorials.

If you like to see my example, you can go here. You can mine the data, change visualization (Bar, Line, Bubble and Map charts), run the data across time, apply filter, pivot and more.

© BI HAPPY
CyberChimps