Love is in the air in the business analytics world; actually, Big Love.
So first we had the old warhorse, the Enterprise Data Warehouse (EDW), the bedrock of enterprise analysis that brought together data from different sources across the enterprise and transformed it into the variety of models required by the company. Then along came the challenger, Hadoop: a new analytics platform designed to meet the data needs of today's business world. And now there is a growing realization that EDW and Hadoop can, and in fact at times should, co-exist and complement each other. Yes, the technologies are vastly different, yet experts feel there could be Big Love between the two platforms if the integration is executed correctly.
Why Do We Need Hadoop?
There is plenty of material on this topic. Listed below are a few key points that highlight some of the inherent challenges EDW faces and how Hadoop can help companies overcome them.
The two main issues with EDW are:
- It is rigid. The EDW platform brings together data from different sources and transforms it into agreed models required by the business owner. However, once the models are agreed upon and rolled out, it is very difficult to change them. Changes typically require a lot of effort, especially in the Transform phase of the ETL (Extract, Transform & Load) paradigm, which is the main driver of EDW.
- Time to complete ETL processing grows disproportionately with data: As the business grows, its data grows with it, and this increases the time required to complete the ETL processing. Unfortunately, the time to complete usually grows super-linearly rather than linearly with data volume.
For instance, during the first year of an EDW implementation, an hour may be enough to complete the daily ETL run. One might then expect that after 10 years of data accumulation the run would take 10 hours. In practice, however, the figure could well be 30 hours.
Ironically, the company does not really need all 10 years of data for routine analytics. Often, just the last couple of years is sufficient. However, if the company stores 10 years of data to comply with data-preservation rules or for archival purposes, the entire 10 years will be loaded even for routine analytics.
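The gap between the naive linear expectation and the actual run time can be sketched with a toy calculation. The numbers and the growth exponent here are purely illustrative, assuming ETL cost grows with data volume raised to a power greater than 1:

```python
# Toy model: if ETL time grows super-linearly with data volume,
# 10x the data takes far more than 10x the time.
base_hours = 1.0                          # year-1 daily ETL run
growth = 10                               # data is 10x larger after 10 years

linear_estimate = base_hours * growth     # naive expectation: 10 hours
superlinear = base_hours * growth ** 1.5  # a plausible growth curve: ~31.6 hours

print(round(linear_estimate), round(superlinear))
```

The exponent 1.5 is an arbitrary stand-in; the point is only that any exponent above 1 pushes the real run time well past the linear estimate, matching the 30-hour scenario above.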
In contrast, Hadoop is far faster and more flexible, unleashing an entirely new dimension of analytics.
- The distributed file system (HDFS) and the NoSQL stores embraced by Hadoop solve the rigidity issue. Hadoop allows us to save data in any format, without having to define a schema first.
- Hadoop also allows us to run logic on data as soon as it streams into the Hadoop repository. This logic could be anything: sending an alert, performing a computation, building a model incrementally or any other functionality we need. Creativity is the only limit here.
- Working with the Hadoop platform allows us to leverage the power of data held in memory. Since the data is in memory, it gets processed faster than when it is read only from physical storage.
- Time to complete is also not an issue with Hadoop. You can scale out simply by adding more nodes built from commodity hardware, increasing storage capacity and/or computation power.
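The "save any format without defining the schema first" point is the schema-on-read idea, which can be sketched in plain Python. This is a toy illustration, not Hadoop code; the records and field names are made up:

```python
import json

# Raw events land in the repository as-is; no table schema was defined up front.
raw_events = [
    '{"user": "alice", "action": "login"}',
    '{"user": "bob", "action": "purchase", "amount": 42.5}',  # extra field: fine
    '{"sensor": "t-101", "reading": 21.7}',                   # different shape: also fine
]

# Schema-on-read: each consumer interprets the records at query time,
# instead of the store rejecting anything that does not fit a fixed model.
events = [json.loads(e) for e in raw_events]
total_spent = sum(e.get("amount", 0) for e in events)
print(total_spent)  # only records that carry "amount" contribute
```

In an EDW, the third record would have forced a schema change before it could even be stored; here the decision about what each record means is deferred to whoever reads it.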
Do We Need EDW As Well?
But then, if Hadoop is so great, why do we still need an EDW? Should companies not migrate to Hadoop completely?
Well, it depends. Rigidity is not always a bad thing. In some cases, rigidity is required to define standards and guidelines. That is why many statutory reports are based on data generated by the EDW models. In such cases, maintaining an EDW along with the Hadoop platform is a good idea. If the company is large, the need to maintain the EDW, at least initially, is even more important. New technology adoption usually poses a problem for large companies, and moving all data models onto the Hadoop platform at once would involve steep learning curves and significant effort in a big organisation.
However, for both platforms to co-exist and complement each other successfully we have to ensure that the EDW loves Hadoop. So how can we make this happen?
Making EDW love Hadoop
Before we go any further, take a look at the figure below. It depicts how an EDW is commonly implemented.
Now, if we want to add Hadoop to this picture, we have to ensure that it is embraced effectively by the EDW ecosystem; only then can we realize real productivity gains. There are two very effective approaches to achieving this:
· Pre-Process ETL Approach
In this scenario, we move the ETL processing from the EDW staging layer to Hadoop and push the output back to the EDW, where it can later feed any dashboard or reporting tool. This approach brings Hadoop's twin strengths of flexible data schemas and fast parallel processing to enterprise analytics. It also lets the company shift high-cost data-warehouse processing onto lower-cost Hadoop clusters.
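The flow above can be sketched end-to-end in miniature. This is a hedged toy example in plain Python, not a real pipeline: the source rows, the cleaning rules, and the CSV hand-off to the EDW bulk loader are all assumptions for illustration:

```python
import csv
import io

# Extract: raw source rows with inconsistent formats, as they often
# arrive in a staging layer.
raw_rows = [
    {"customer": " Alice ", "amount": "1,200.50", "date": "2015-03-01"},
    {"customer": "BOB",     "amount": "99",       "date": "2015-03-02"},
]

def transform(row):
    """The 'T' of ETL, run on the Hadoop side in this approach:
    normalize names, parse amounts, keep dates as ISO strings."""
    return {
        "customer": row["customer"].strip().title(),
        "amount": float(row["amount"].replace(",", "")),
        "date": row["date"],
    }

clean = [transform(r) for r in raw_rows]

# Load: emit a flat file that the EDW's bulk loader can ingest directly.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["customer", "amount", "date"])
writer.writeheader()
writer.writerows(clean)
print(buf.getvalue().strip())
```

On a real cluster the `transform` step would run in parallel across nodes (for example as a MapReduce or Spark job), which is where the speed advantage over a single-server Transform phase comes from.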
· Hot and Cold Storage Approach
Under this approach we split data into two types:
Hot Data: Data frequently used for analysis
Cold Data: Data that is not necessarily accessed on a daily basis, but still needs to be kept for occasional analysis or to meet data-retention requirements.
The large volume of historical data is offloaded into cold storage on Hadoop, and only the hot data is kept in the EDW. Whenever data from cold storage is needed, it can be moved back into the EDW, or queried directly and joined with EDW results.
This approach frees up space in the EDW. With less data to process, it requires less time to complete and costs less as well.
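The hot/cold split itself is just a partition of records by age. A minimal sketch in plain Python, assuming a hypothetical two-year cutoff and a fixed reference date for reproducibility:

```python
from datetime import date, timedelta

# Hypothetical policy: the last two years are "hot", everything older is "cold".
TODAY = date(2015, 6, 1)                    # fixed so the example is deterministic
CUTOFF = TODAY - timedelta(days=2 * 365)

records = [
    {"id": 1, "date": date(2015, 5, 30)},   # recent           -> hot
    {"id": 2, "date": date(2014, 1, 15)},   # within two years -> hot
    {"id": 3, "date": date(2008, 7, 4)},    # historical       -> cold
]

hot = [r for r in records if r["date"] >= CUTOFF]   # stays in the EDW
cold = [r for r in records if r["date"] < CUTOFF]   # offloaded to Hadoop

print(len(hot), len(cold))
```

The cutoff would be set by the business's own retention and usage patterns; the point is that routine EDW queries then touch only the `hot` partition, while the bulk of the history sits on cheaper Hadoop storage.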
So if you have an EDW in place and are debating whether to replace it with Hadoop or maybe hold on a little while longer, remember it does not have to be an either/or decision. You can keep that old EDW and add on some Hadoop as well. Just make sure that you follow the right approach when integrating them.
*All images are taken from Hadoop Analytic with HD Insight material of Microsoft Azure