Archive for the ‘Data Warehousing’ Category

Medicare Advantage Enrollment Dashboard

About a month ago, my colleague, Josh Tapley (http://data-ink.com/) and I set out to build an Xcelsius based BI application. Our goals were to demonstrate the capabilities of the technology, when combined with some fundamental data warehousing and BI techniques and great design principals.

We started our quest by looking for a large, publically available, dataset that will be interesting to analyze. We wanted to use a data set that can produce valuable insights by applying analytics and data visualization techniques to it. Josh suggested we use the government center for Medicare and Medicaid services which publishes monthly national Medicare enrollment numbers each month. The data is published as text and excel files and is proliferated across the site. It is a classical example of large amounts of data, published in raw format for users to wrestle with and with no information or insight in sight.

We proceeded to create a data repository (data warehouse of sorts), essentially coming up with a process for collecting the multitude of data files across the various sources it is published in, and loading the data into a database, using ETL techniques. The most granular fact table in our database of Medicare records contain 588846 records as of Feb 2012, it grows at a rate of approx 30-50K records a month. Not bad for a publically available data set.

We applied an aggregate strategy to our database to assure the queries we perform on it to support our application are adequately fast and users experience minimal delays as they navigate this rich data set.

With all the data on hand, we set out to design our application. We used simple, clean and intuitive design principals to highlight the analytics, and allow users to easily understand enrollment trends across several key factors. We wanted to tell a story, paint a picture, and not require users to struggle and stumble through complex interfaces and data layouts to get a gleam of information.

With our design at hand, we created a series of xml feeds that query the database on the fly and return result sets that meet our application layout needs.

The result is what we consider to be both a useful and informative analytical tools for anyone in the healthcare industry and beyond, as well as a good showcase of BI Happiness.. Please be sure to checkout our Medicare Advantage Enrollment Dashboard.

Enjoy!

Time lapse or high speed your business processes with history preservation techniques

High speed and time lapse photography are two techniques that allow us to understand the world around us by slowing down or speeding up processes that we could otherwise not be able to observe. High speed photography allows us to record phenomena that occur within a blink of an eye, and then replay them, frame by frame, millisecond by millisecond, to understand in full what has happened.

Time lapse photography allows us to capture processes that take a very long time, and play them much faster, which again allows us to develop new ways of understanding what has happened.

Both time lapse and high speed photography are ways to observe historical changes, at various speeds, using different filters to develop new perspectives regarding the subjects photographed.

That is, in essence, one of the most important premises of the data warehouse. History preservation techniques allow the business to play back events, slow down fast moving procedures and examine individual transactions that contributed to big maneuvers, or play back events that happened in the past to analyze and gain insight into their efficiency.

History preservation in the data warehouse is accomplished through different techniques, primarily known as slowly changing dimensions and snapshots. There are tons of literature materials out there about both, and if you are not familiar with those, and are working anywhere near data, you need to very quickly become familiar with those.

As a matter of practical example, take your sales forecasting information. You must have wondered about: how good are your forecasts? How accurate are they? Can you rely on your sales team forecasts regarding next week, next month, next quarter or beyond?

The only way to answer this question is to observe the forecasts accuracy over time. The forecast changes every day, and as opportunities get closer to closure, the accuracy improves. But since the forecast updates constantly, how can you tell if the predictions made a month ago about the following month were accurate or not? Well, your data warehouse could provide you with exactly this capability and allow you to examine your forecast as it gets updated each day, comparing it to the actual bookings as they occurred. Armed with such powerful information, you can develop a strong sense of your forecasting accuracy over time and across time.

Just like time-lapse photography, it takes time to build up the necessary amount of frames, or snapshots, to allow you to slow down reality and analyze what happened, but the results can provide you with course changing insights into your processes to help you achieve your goals and improve your competitive edge.

Why Isn’t Data Enrichment A Standard Practice In Every Data Warehousing Project?

Data enrichment is a terrific way to increase the value proposition of your data warehouse, and yet, it is implemented quite rarely. With data enrichment you can almost overnight provide your sales team with additional mountains of information about the demographics of prospects, provide finance with accurate client addresses and other contact info, provide marketing with competitive information about companies in related industries and help manufacturing by assigning standard attributes to parts used in manufacturing. So, why is Data enrichment not a standard phase for almost any data warehouse project?
Let’s start by understanding what data enrichment is. By data enrichment, I am referring to the practice of adding additional data attributes that do not exist within company internal databases, to data that does. There are many companies that provide such enrichment services in almost any industry, probably one of the most well-known ones is D&B. So, for example, you could send Dun and Bradstreet (D&B) your client list and get in return the complete address, phone and some other stats about the companies in the list.

That sounds terrific, right? The premise is simple and is loaded with ROI. If you have the money to invest in such a process, why wouldn’t you? Well, the answers are typically related to process, trust and being able to step “outside-of-the-box”.
Strange as it may sounds, data warehouses are quite intimate affairs. They involve millions, or billions or details about the way individuals go about conducting their business as part of their day-to-day work.  As such they are considered a core part of the business and opening up this internal, often highly pervasive and massive database to an external, foreign data source can be a difficult proposition for both business line and IT managers.
There are also technical challenges involved in such integration, such as the ability to sync the historical changes in the organic company data with the historical changes (if any are tracked) in the procured data. Sometimes, there is also a need to provide feedback on such procured data, and a feedback loop/mechanism must be developed.
And finally, it does require some degree of “out-of-the-box” thinking to integrate an external data feed used to enrich company data into your most internal and intimate database.
If you are able to overcome the procedural, trust and “out-of-the-box” challenges though, you may find yourself rewarded with a wealth of information your business users would find extremely useful and that you would not have been able to obtain otherwise.
 

Abusing the date dimension: when does it make sense to NOT use it

The date dimension is a dimensional model staple. You will not find a data mart, data warehouse or any kind of a reporting database that does not have one form or another of a date dimension. The benefits of using a date dimension are numerous, and it’s a great tool for report developers, who can simply reference columns in the date dimension rather than look to write complex sql functions to perform date gymnastics. However, when it comes to the semantic layer, I suggest being much more judicious when implementing the date dimension. Overusing aliases in your semantic layer to represent many different dates, or using contextual separation of dates, can add significant complexity to your semantic model and reduce its maintainability dramatically. And the value of using the date dimension with many of the dates tracked in your fact may not be significant.

So, when does it make sense to NOT use your date dimension? The first condition that should exist for you to choose not to use your date dimension is the use of “smart keys” for your dates. Smart keys were very popular 30 years ago, and were used to embed business rules and logic into a numbering system that represented the rows values. This sounded like a great idea, until these business rules began changing. As the rules embedded in the smart keys started getting more dynamic, database designers faced a whole slew of new problems and headaches due to the need to make changes in existing keys. To avoid these problems, data warehouses use surrogate keys (or synthetic keys). These are “dumb” serial numbers that are used to encapsulate the “business keys” (legacy keys) and make handling of changes much easier. This is an almost bullet proof technique and you cannot go wrong with implementing surrogate keys in any dimension, except, the date dimension. This is the only exception that merits a diversion. Dates are not “slowly changing dimensions”, and most implementations of a date dimension are implemented at the grain of a day, and do not require multiple rows per day. This kind of implementation allows the use of “smart” date keys. Keys that follow the convention yyyymmdd (for example 20110419) are numeric, easy to produce and maintain, and very importantly, will allow report developers to deduct the date, without using the date dimension directly.

And this leads me to the second condition under which date dimensions should be considered not applicable in the semantic model: date proliferation. If your semantic model looks like this, before accounting for anything else but dates, you may want to reconsider your use of the date dimension.

Many subject areas require the use of many tens or dates per fact. Especially with so called “factless facts”, or facts associated with domains that are more process oriented then measurement oriented, it’s very common to find dates upon dates that can be used to analyze the process. Think about your sales process and how it is captured in your CRM: your lead generation date, your initial call to the lead date, your second call to the lead date, your first meeting date, your first quote date, your revised quote date, your po date, your booking date, your revenue recognition date, and so on and so forth. Or imagine a ticketing system used to track tickets escalation in a highly regulated environment where each step of the escalation process goes through several approvals, all happening at different dates. So, while it is easy to imagine a fact of this sort with 50-60 dates, would you really want, or need to create these many joins and aliases in your semantic model? Would that be productive and add value to your business users? Do they find it useful to know the date, week number in month, week number in year, month number, month name, year, quarter number, quarter name, weekday name, etc etc for each one of these dates? Or would you simply want to know the actual date? The common answer is that there are probably a handful to a dozen of dates that can benefit from a join to the date dimension, and for the rest, all that is needed is simply the ability to report the date. So, if your date dimension has a smart key, as described above, it would be simple to convert it to a date with a simple sql function. If more is needed, it would also be simple to join it back to a date dimension alias and add any kind of date permutation on it. This approach is also known as eating the pie AND having it…

Why Data Modeling Fall Sideways?

One of the most surprising aspects I have come to expect during the years of data integration and warehousing consulting is how little data modeling is actually being practiced in this field. While it is hard to find a database technologist who would argue that data modeling is “bad” or “wrong”, it is simply not practiced. When I ask for the model of the existing database, the usual response is in the form of a 4-5 years old ER diagram that is completely outdated (at best) accompanied by an embarrassed shrug. It is one of the most basic and fundamental things to do when approaching any data problem, and yet companies are always impressed by this and are surprised by the huge benefit and value the model brings to the development process. Don’t they know how important data modeling is? Of course they do. So why don’t they practice it?

I think there are many explanations to this puzzling question, and I would like to offer a few. As you would expect the answers are not related to technical competence, or lack of appreciation for the fundamentals of good development practice. It has more to do with the reality of corporations, and the pressures of IT operations.

Lack of time to model

The fact of the matter is that you do not have time NOT to model.

The fire alarm doesn’t stop ringing. The users are constantly demanding increased functionality and services, “threatening” to replace internal IT services with external “Cloud” services or outsourced solutions. You need to deliver results every two weeks. In that kind of an environment, where you can barely plan how your day will end, you don’t have time to model. Or so goes the argument. It is not always spelled out and conscious, but it is a big reason why modeling falls sideways. The “lack of time argument”. You understand how good modeling practices can save you time in the long run, and keep your development streamlined and your environment maintainable, but you simply don’t have the luxury of long time planning right now.

While it’s absolutely clear why modeling saves you time in the long term, it will also save you LOTS of time in the short term. Databases are not built in vacuum. They are used to house information and data that will need to be consumed by users through applications. And unless you have only one person who does all the work, a data model is the ONLY way your various team members and participants in your data driven application will be able to make progress together and IN PARALLEL.

The data model becomes the contract between the application and the data it represents. It shields and separates the various parties that are collaborating to realize the latest initiative, and allows them to make progress in parallel, without needing to wait for each portion of the work to be completed before they can proceed.

Lack of tools

This does not come up often, but I still run into this sometimes. Enterprise modeling tools are typically expensive and are complex to learn and use. Licensing restrictions prevent many people in the organization from using them and lack of functionality, or training, causes database administrators and developers to eventually stop using them.

Keeping up the model can be a lot of work, but it is the only way to keep everyone informed about what is going on in the database, and how it evolves and changes over time. When it’s time to work on a new initiative, everyone can go to one place and understand the state of the information.

And while you will need a tool to store and maintain your model, especially the physical one, the most important modeling tool you can ever have is your whiteboard.

Modeling ALWAYS starts on the whiteboard. There is simply no better tool for collaborative data modeling then standing in front of a whiteboard with a dry erase marker, and a group of other talented database developers all standing or sitting around, shooting down approaches or embracing new ideas, erasing, tweaking and refining until they crystalize to a reality everyone can understand.

Lack of discipline

Modeling takes a lot of practice, and a lot of discipline. It is a bit counter instinctive for a super star developer. Their instinct, when presented with a technical challenge, is to “jump right in” and solve it in a blazing flash of genius. So the discipline to take time to see how this “fits the model” takes an extra effort upfront, which is very easily dismissed.

If you can embrace the ideas that modeling can help speed up your development cycles and can be done using simple tools like a whiteboard, you will be that much closer also accepting the fact you will need to nurture a culture that encourages discipline around modeling. Before you know it you will be moving out of a reactionary fire drill mode, and into a proactive and thoughtful mode, where you can help deliver information that matters to your organization most strategic needs.

Naming Conventions In The Data Mart

Published by Ron Keler in Data Warehousing on January 16th, 2011 | No Comments »

What you call things matter. It matters to you, and it matters to others who interact with you. When you name things, your project, your children, your children’s stuffed animals, you typically consider two perspectives: what YOU think the name should be and what would OTHERS think about it.

Naming your data mart database tables is no different. And while when naming pets or stuffed animals, I would argue that what you think should matter just as much as what you think others will think about it, when it comes to tables and columns in your mart, I would suggest the OTHERS perspective should have a much higher weight.

Coming up with a naming convention and database standards at large is probably one of the first tasks you will face when starting to work on your reporting project implementation phase. In your logical designs and high levels diagrams you were able to use names that business users relate to, but when the time comes to start implementing the physical model, the DBAs, ETL developers and the rest of the technical database development team takes over and the needs of OTHERS, when it comes to what things are called, are usually completely ignored.

The fact of the matter is that most of the folks who are working on creating the data warehouse, operational data store or data mart, will never actually use it for any kind of reporting purpose. They are concerned with building database objects that would satisfy a variety of reporting needs, explicitly stated, and sometimes not. Structuring them in a way that will transcend the mind boggling fast pace of change in any modern organization. Systems come and go, business rules constantly change, personnel and minds shift. Designing a model that will hold throughout all that, and continue to deliver accurate, timely and reliable results is hard enough, and if that is accomplished, nobody except the database development team should have any say about what objects are called inside the database. Right? Well, they might be missing something.

Just like any other corporate project, the success of your reporting project is dependent on perception. The “rep” on the project is what will ultimately determine its faith. If people in the organization perceive the project to be a success, you will be a hero, but if, for whatever reason, the talk over the water fountain deems the project as a failure, you can forget about that promotion or extension you were eying.

And who will create this success or failure perception on your reporting project? Well, we can think of the people who will create this perception in terms of three groups.

First, you have your end users. These are typically the business users who defined the somewhat narrow requirements that launched your reporting initiative. They will never see the inside of your data warehouse or mart, and most of them have no idea what is an RDBMS or the faintest clue regarding how it works. They will see and use reports, dashboards and applications that rely on the data you produce, and their perception will ultimately rule. That is why you invest heavily in reporting technologies, and spend months refining fonts, colors, layouts and graphs. The end result of the monumental data effort has to look absolutely stunning, and the information has to be accurate. That is a given. If you cannot accomplish that, don’t bother getting started.

The third group is the technical implementation team. It consists of developers, business analysts, quality assurance folks and many others. This is the group the builds “the thing”. You better make sure they are all “groupies” and that the aura they project regarding anything related to the data work, is bright and positive.

The second group, in between the end users and the development team, is responsible for gluing the two. It’s the group of folks who design the interfaces between the database, and the reporting. The meta layer of your project. They are report and dashboards developers, universe designers or admin tool administrators or framework managers. They are tasked with understanding the tables, joins and structures of your database and translating those into something that can support the information needed from the end users. Some of them work directly for you as part of the project team, some you know of, but don’t manage, others, you may learn about in hind sight.

Why do these folks matter? Well, this group of individuals typically strudels the development and business worlds. They talk to everybody. They are often opinionated, and let folks on both sides of the project know what they think about the technical implementation as well as the business requirement.

They are the OTHERS you should consider when thinking about the naming convention you will deploy. Even if you are using a meta data layer such as BusinessObjects Universe, or OBIEE Admin Tool, these are the folks who will still go back to the database layer to check things, test, validate, understand, and design end products in the form of meta data layers, reports and dashboards and applications. This could be a fairly large group, and in a global situation, can be spread across the world.

If your report developers and meta data layer gurus start trashing the reporting data in front of your end business users because they cannot understand it – NOT BECAUSE IT IS NOT GOOD, BUT BECAUSE THEY CAN’T FOLLOW YOUR NAMING CONVENTIONS, you will be facing a perception problem.

So, how can you stay ahead of such a problem? Simply consider the needs of those who will be querying the database, as well as those who will be developing it, when naming things. Here are some concrete suggestions:

  1. All report developers and meta data layer designers are familiar with basic dimensional modeling concepts. They will expect to see dimensions and facts. Name your objects accordingly. Use a _FACT or _DIM suffix (some prefer prefix) to indicate the object orientation from a dimensional perspective. If you model an object as both (does that really happen often?), indicate that (_DIM_FACT). Use this same concept to indicate additional concepts such as aggregates.
  2. Avoid using acronyms and abbreviations if possible. If your RDBMS has restrictions on object names, and you have to shorten names, seek report developers and business users advise on appropriate short names to use. If you can’t do that, look to avoid cutting off long words, instead use shortening techniques that use letters from the beginning, middle and end of the word you seek to shorten. If you don’t have enough space to spell out CATEGORY, consider using CATEG, instead of CAT. A couple of months from now, It may save the report developer in Japan a lot of time and wonder as to why you would use domesticated animal  references in the reporting table names.
  3. Consider whether you need to indicate the database object type as part of its name. Your report and meta data layer developer will not care if they are using a view, a materialized view or a table. Your developers are using database tools that clearly indicate the object type, typically by displaying different types in different groupings. Especially if you are short on space for object names, do you really need to devote 3, 4 5, 6 or more character of the object names to indicate it is a _TBL_, _VW_ or _MV_?
  4. Column names matter as well. Once again, avoid acronyms and abbreviations as much as possible and spell out as much as you can to avoid ambiguity by someone who is not sitting right next to you right now, understanding exactly what you mean by some obscure three letter code you choose to use.
  5. Use a convenient and easy to follow naming convention for Primary and Foreign key column names. This is especially critical if you do not plan to enforce data integrity at the database level (which I personally typically don’t, but that’s a topic for another post..). It would make the reports and meta data layer developers life so much easier if they know that a CATEGORY_ID is a CATEGORY_ID is a CATEGORY_ID. Be clear about synthetic keys as opposed to natural keys, your reporting folks would care about that.
  6. Perhaps the most important of all is BE CONSISTENT. If your naming convention is consistent, even if it’s somewhat complex, people will eventually get it, and when they do, they will love the predictability and ease in which they can understand your models. Lack of consistency will throw the reporting team off, causing confusion, doubt and ultimately reduce the level of confidence in the mart.
© BI HAPPY
CyberChimps