Tuesday, June 22, 2010

Starting Small – One Little Example

It sounds counterintuitive: the larger and more complex a project is, the smaller and simpler the pieces I usually start with, once I have a high-level picture of the overall landscape.

For instance, suppose I face many dispersed data stores built on all kinds of technologies, with challenges such as data accessibility, data quality, and data redundancy, and the business demands a single view of its data (among other factors). I might then create a centralized ODS (Operational Data Store) using MDM (Master Data Management) architecture and techniques. This is a daunting task, more so than it looks on the surface.

My starting point for this endeavor is to create just a Person table with 4 attributes (First Name, Last Name, Birth Date, and SSN). The goal is to put all the human beings from multiple systems into this one table. It sounds too easy. As a matter of fact, it is not. If we can accomplish this, we will have tested many of the situations we will meet down the road. Let me explain.
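To make the starting point concrete, here is a minimal sketch of such a Person table using Python's built-in sqlite3 module. The table name, column names, the surrogate `person_id` key, and the sample row are my own assumptions for illustration; a real ODS would live in whatever database the organization already runs.

```python
import sqlite3


def create_person_table(conn):
    # Only the 4 attributes from the post, plus a surrogate key (an assumption).
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS person (
            person_id  INTEGER PRIMARY KEY,
            first_name TEXT,
            last_name  TEXT,
            birth_date TEXT,  -- stored as an ISO 8601 date string
            ssn        TEXT
        )
        """
    )


def add_person(conn, first_name, last_name, birth_date, ssn):
    # Parameterized insert; returns the generated surrogate key.
    cur = conn.execute(
        "INSERT INTO person (first_name, last_name, birth_date, ssn) "
        "VALUES (?, ?, ?, ?)",
        (first_name, last_name, birth_date, ssn),
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")  # throwaway in-memory database
create_person_table(conn)
pid = add_person(conn, "Michael", "Smith", "1970-01-31", "000-00-0000")
```

The table itself is trivial. Everything that follows is about what it takes to fill it correctly.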

First of all, finding these human beings in these systems is not always obvious. Some legacy systems store these 4 attributes in more than one table or file. First name and last name could be in one place, birth date in another, and SSN in yet another. It takes a while to identify where they are.

Secondly, some legacy systems are not that accessible from the outside world. Yes, you could spend some money to put technologies such as adapters in front of the systems. One problem is that the adapters for some systems are very expensive and might not perform well when retrieving large amounts of data. So we need to settle the data accessibility issues.

Thirdly, we need to figure out the refresh frequency of the data movement. For many reasons, real time is often not feasible, so a batch job is what is mostly used in this situation. Then we need to deal with permutations such as what happens when the job fails.
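One of those permutations, a failing batch job, can be sketched as a simple retry wrapper. This is only an illustration under my own assumptions: the function name, the attempt limit, and the fixed delay are hypothetical, and a real job scheduler would likely handle this for you.

```python
import time


def run_batch_with_retry(job, max_attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Run job(); on failure, wait and retry up to max_attempts times in total."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:  # a real job would catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                sleep(delay_seconds)
    # Every attempt failed: surface the last error instead of hiding it.
    raise RuntimeError(f"batch failed after {max_attempts} attempts") from last_error
```

Retrying is only one answer. We still have to decide whether a partially loaded batch is rolled back, and who gets notified when the retries run out.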

Fourthly, we need to settle the data transportation: protocol, security, and so on.

Fifthly, now the data is ready to be put into the Person table. But wait a minute. The data may not be consistent. For example, for the same person, one system may record his first name as Mike and another as Michael. Which one will be the winner? We could say that whichever arrives last is the winner. But then, when the person uses our system, one time he sees his first name as Mike and another time as Michael. Will he like it? Or we could have a few small rules to guard against this. Or we could require business users to pick one. The same applies to the other three attributes. For instance, how does the system know which SSN is the right one when one person has more than one SSN on record? That is more complex than the first name situation.
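The three options above (last one in wins, a small rule, or a business user decides) can be sketched as follows. The class, the function names, the system names, and the idea of ranking source systems by trust are all my own assumptions for illustration, not a prescribed MDM method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceValue:
    system: str     # which source system supplied the value
    value: str      # the attribute value, e.g. a first name
    loaded_at: int  # load sequence number; higher means more recent


def last_in_wins(candidates):
    # Option 1: the most recently loaded value wins.
    return max(candidates, key=lambda c: c.loaded_at).value


def priority_wins(candidates, priority):
    # Option 2: a small rule; earlier systems in `priority` are trusted more.
    rank = {name: i for i, name in enumerate(priority)}
    return min(candidates, key=lambda c: rank.get(c.system, len(rank))).value


def needs_review(candidates):
    # Option 3: flag conflicting values so a business user can pick one.
    return len({c.value for c in candidates}) > 1


candidates = [
    SourceValue("billing", "Mike", loaded_at=1),
    SourceValue("hr", "Michael", loaded_at=2),
]
```

With these two candidates, `last_in_wins` picks Michael, while `priority_wins(candidates, ["billing", "hr"])` picks Mike. The same data gives different winners, which is exactly why the rule has to be agreed on with the business first.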

Sixthly, OK, after all 5 rounds of chaos, we finally have a Person table that stores clean, complete, and accurate data for these 4 attributes. We know the truth about each human being. We also know at this moment that some of the systems don't store the truth. What should we do with these systems? The natural answer for some IT folks is: well, let's write a module that reaches into these systems and automatically fixes the wrong data for these 4 attributes. Sometimes the business people accept and appreciate your help; sometimes they don't. They know there can be unintended consequences if you change the data without letting them examine it first. I have met this resistance more than once.
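A gentler alternative to the automatic fix is to report the differences and let the system's owners decide. Here is a minimal sketch, assuming each record is a plain dictionary keyed by the 4 attributes; the field names and the function name are hypothetical.

```python
FIELDS = ("first_name", "last_name", "birth_date", "ssn")


def find_discrepancies(golden, source_record):
    """Return {field: (source_value, golden_value)} for every field that differs."""
    return {
        field: (source_record.get(field), golden[field])
        for field in FIELDS
        if source_record.get(field) != golden[field]
    }


golden = {
    "first_name": "Michael",
    "last_name": "Smith",
    "birth_date": "1970-01-31",
    "ssn": "000-00-0000",
}
source_record = dict(golden, first_name="Mike")
report = find_discrepancies(golden, source_record)
```

Nothing in the source system is changed here. The report only gives the business people something to examine before any correction is made.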

Seventhly, who will own this pretty Person table and its data? There are already teams that own their own systems, but this Person table is new and does not belong to or originate from any single system, so it could easily become an orphan. Along the same lines, who will govern the table, for one thing, to make sure it always stores clean, complete, and accurate data in the future? Some politics could even get involved here.

In summary, by tackling just one table with 4 attributes, we can learn a lot. For instance, can we deal with all 7 situations? If not, people will wonder: if we cannot handle even one rudimentary table with 4 attributes, can we handle much more complex tables, with their relationships and attributes, at larger volumes? So it tests us, which is very important because IT people often feel more capable than we actually are. Knowing our true capabilities is what is called being "wise." Based on that, we can set realistic expectations with our business partners, instead of disappointing them time after time, or digging around for excuses, or worse, pointing fingers at each other.

So we start small not because we are timid but because doing so brings many benefits. I have used only one little example here to illustrate how and why.
