Mayank Bhasin, Senior Director, Enterprise Consulting at Ness Software Engineering Services (SES), suggests that the challenge of Big Data can best be addressed by embracing the concept of data lakes that scale at the pace of the cloud.
Data is exploding exponentially in the digital economy. The emergence of new types of data in recent years has put tremendous pressure on all of the data systems within the enterprise. These new types of data stem from systems of engagement such as websites, social networks, geospatial systems and financial transactions, and from the growth in connected devices. Challenges of capture, processing and storage aside, the value of blending and correlating existing enterprise data with these new types of data is being proven by enterprises across many industries, from financial services to healthcare, from advertising to energy.
One of the prime approaches to tackling this challenge and realizing the value in big data is the enterprise data lake. Data lakes can scale at the pace of the cloud, enable open source technologies such as Hadoop to run on large numbers of commodity servers, remove integration barriers, and clear a path for more timely and informed business decisions with the lowest possible friction and cost. They can swallow massive streams of data and store them in whatever form they arrive, much as a large body of water drinks from its tributaries.
A data lake is a large object-based storage repository that holds data in its native format until it is needed.
Pentaho CTO James Dixon is credited with coining the term "data lake". As he described it in his blog entry, "If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
From an architectural perspective, a data lake should be based on Hadoop for near real-time and batch data ingestion, historical storage and discovery; an analytics engine for data mining, business analytics and statistical/predictive models; a stream engine for real-time data ingestion and in-memory processing; and BI/DW. Unlike data warehouses, which require schema on write (data is transformed into a specified, slow-changing schema when it is loaded into the warehouse), data lakes let users store data in its raw form; analysts then create a schema suited to their application at the time they choose to analyze the data. This schema-on-read approach adapts quickly to enhancements and changes in business models.
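The schema-on-read idea above can be sketched in a few lines of plain Python. This is an illustrative toy, not any particular engine's API: raw JSON events are landed exactly as they arrive, and each analyst supplies a schema (field names and casts) only at read time.

```python
import json

# Ingest: land raw events in the lake exactly as they arrive.
# Schema on write would force a transformation here; the lake defers it.
raw_events = [
    '{"user": "a1", "amount": "19.99", "ts": "2015-06-08T10:00:00"}',
    '{"user": "b2", "amount": "5.00", "ts": "2015-06-08T10:05:00", "region": "EU"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a caller-supplied schema at read time (schema on read)."""
    for line in raw_lines:
        record = json.loads(line)
        # Project and cast only the fields this application needs.
        yield {field: cast(record[field]) for field, cast in schema.items()}

# One analyst's view: just the user and a numeric amount.
billing_schema = {"user": str, "amount": float}
rows = list(read_with_schema(raw_events, billing_schema))
print(rows[0])  # {'user': 'a1', 'amount': 19.99}
```

A second analyst could read the very same raw lines with a different schema (say, `ts` and `region`) without any reload or migration, which is precisely the flexibility schema on write gives up.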
This overcomes the lack of structure, and avoids investing in data processing when the initial value of high-velocity, high-variety incoming data is questionable. From a lake maturity standpoint, more emphasis should be placed on capturing metadata when designing the lake, as metadata provides richer and more varied context at the time of analysis. Metadata and scores derived from a lake can drive deeper analytics around an event, including the reasons behind its occurrence and how to better predict and avoid such occurrences in the future. Because a data lake supports multiple access methods (batch, near real-time, streaming, in-memory and so on) against a common data set, analysts can transform and view data in multiple ways and across various schemas, obtaining closed-loop analytics and bringing time-to-insight closer to real time than ever before.
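To make the metadata-capture point concrete, here is a minimal, hypothetical ingestion sketch (the function and store names are invented for illustration): each raw object is landed untransformed, while context such as source system, arrival time, size and a content hash is recorded alongside it at write time.

```python
import hashlib
import time

def ingest(payload: bytes, source: str, store: dict) -> str:
    """Land a raw object in an (in-memory stand-in for a) lake, capturing
    metadata at write time so later analysis has context: where the data
    came from, when it arrived, and a content hash for lineage checks."""
    key = hashlib.sha256(payload).hexdigest()
    store[key] = {
        "data": payload,                   # raw bytes, untransformed
        "metadata": {
            "source": source,              # system of origin
            "ingested_at": time.time(),    # arrival timestamp
            "size_bytes": len(payload),
        },
    }
    return key

lake = {}
key = ingest(b'{"sensor": 7, "temp": 21.5}', source="iot-gateway", store=lake)
print(lake[key]["metadata"]["source"])  # iot-gateway
```

In a real lake the store would be HDFS or object storage and the metadata would live in a catalog, but the principle is the same: context captured at ingestion is cheap, while context reconstructed years later is often impossible.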
By creating a data lake, whether internally or in the cloud, organizations can make data science more widely available for business intelligence or data analytics via APIs or data services. In this way, data becomes a building block of the internet of things and changes application development. In practical terms, a data lake design should enable ingestion and collection of massive datasets, including dark data and disparate data sources, in bulk and in real time; it should retain raw sources over extended periods of time as well as processed or high-velocity structured and unstructured data, enabling massively parallel SQL analysis and improved machine learning and predictive analytics. It should also support multiple data access patterns across a shared infrastructure, including batch, interactive, online, search, in-memory and other processing engines. Users across multiple business units should be able to access the same data to refine, explore and enrich it on their own terms.
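One common way to let batch, interactive and streaming engines share the same data, sketched below under the assumption of a conventional zone/dataset/date-partitioned layout (the naming convention here is illustrative, not a standard), is to derive object paths deterministically so every engine can locate the same partitions:

```python
from datetime import datetime, timezone

def lake_path(zone: str, dataset: str, event_time: datetime) -> str:
    """Build a date-partitioned object path so that batch, interactive and
    streaming engines can all address the same shared data independently."""
    return (
        f"{zone}/{dataset}/"
        f"year={event_time.year:04d}/"
        f"month={event_time.month:02d}/"
        f"day={event_time.day:02d}/"
    )

ts = datetime(2015, 6, 8, 10, 0, tzinfo=timezone.utc)
print(lake_path("raw", "clickstream", ts))
# raw/clickstream/year=2015/month=06/day=08/
```

Because the path is a pure function of zone, dataset and event time, a SQL engine, a batch job and a streaming writer need no coordination beyond agreeing on the convention, which is what makes the shared-infrastructure access patterns above workable.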
The author: Mayank Bhasin has over fifteen years of experience leading large-scale domestic and international initiatives and solutions covering the retail, healthcare, utilities and entertainment industries. He is an enterprise data integration architect by trade and has led successful technology campaigns including software development and sales, business process improvements, big data analytics, data integration and governance. Ness SES helps organizations compete and grow in today’s digital economy by providing deep expertise in products and platforms, data and analytics, and experience engineering.
June 8, 2015