Interest in big data has been building steam over the last couple of years, and it has become a hot topic. In the past, only big enterprises could afford the technology to use big data. However, big data technology has evolved at a remarkably fast pace since then, making it accessible to enterprises of all scales, and today all types of industries can benefit from using big data.
However, there are some common challenges businesses face when using big data:
- How to store and analyze this large and rapidly growing (structured and unstructured) data
- How to translate it into meaningful information
- How to decide the best way to manage that data
What we essentially need is a way to store such large amounts of data at a central location without having to convert it into a predefined schema. That way, the data is readily available for use as and when it is needed.
Drum roll!!… A data lake can help you do just that!
If the abundance of data sources makes things seem complicated and you feel overwhelmed by the decisions involved in handling this data, you’ve come to the right place! This post will help you understand the concept of data lakes so you can make an informed decision on how to manage your data.
Understanding Data Lakes
Data streams in from a number of different sources in varying formats. The concept of a data lake was created to ingest this diverse data into a central location for further processing, overcoming the challenges of storing and analyzing big data that traditional data warehouses, with their strict data models, cannot handle.
In 2010, the Pentaho Corporation’s CTO James Dixon defined a data lake as follows:
“If you think of a datamart as a store of bottled water, cleansed and packaged and structured for easy consumption, the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine it, dive in, or take samples.”
The concept of a data lake is easy to understand when compared to real lakes and rivers. Just as a natural lake has multiple tributaries bringing in water from different sources and geographical locations, each with a different composition, a data lake has streams of data coming in from a number of sources in different formats, at varying speeds and with different structures.
In simple words, a data lake can be described as a large container that holds massive amounts of data in its native format, comprising structured (spreadsheets or relational databases), semi-structured (logs and XML data) and unstructured data (social media, video, email, text, etc.). You may still wonder: with so many other methods, such as data warehouses, available to store and manage your data, why would you possibly need a data lake? It was James Dixon himself who pointed out the major drawback of these traditional methods: they store data only to answer questions that have been asked in the past, which makes it impossible to answer new questions that may arise in the future.
In contrast, a data lake can answer not only the questions that have been asked in the past but, because the data in it is always available, new questions as well. In addition, one of the major advantages of a data lake is that you can store your data without defining a structure or schema until the data is needed (an approach known as schema-on-read, which we will look at next). The concept of data lakes can be summarized by the following points:
- All types of data can be loaded irrespective of form: structured, semi-structured or unstructured
- Data is organized into a schema only when it is needed
How does it work?
In traditional methods of data storage such as the data warehouse, data is organized into a well-defined structure or schema before it is loaded, an approach known as schema-on-write. Any application that needs to retrieve and use this data must first know and understand the format in which it is stored. In this approach, data is not loaded into the warehouse unless and until it is needed.
In a data lake, on the other hand, data is loaded and stored in its native format. Applications read this data as and when it is needed and apply structure to it at that point. This is called schema-on-read. In this approach, all data is stored on the premise that it may or may not be needed in the future.
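To make the contrast concrete, here is a minimal Python sketch of schema-on-read. It assumes raw events are stored one JSON object per line; the file name and field names are hypothetical, chosen only for illustration.

```python
import json

# Schema-on-read: raw events are stored as-is (here, one JSON object
# per line) and structure is applied only at query time.
# The file name and field names are illustrative.

def read_login_events(path):
    """Apply a schema while reading: keep only the fields this
    particular question cares about; every other field stays
    untouched in the raw store, ready for future questions."""
    with open(path) as f:
        for line in f:
            raw = json.loads(line)  # native format, no upfront schema
            if raw.get("event_type") == "login":
                yield {
                    "user": raw.get("user"),
                    "source_ip": raw.get("src_ip"),
                    "timestamp": raw.get("ts"),
                }

# A new question tomorrow just means a different reader; the raw
# data never has to be reloaded or restructured.
for event in read_login_events("events.jsonl"):
    print(event)
```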
Data in a data lake is stored as binary large objects (BLOBs), and each object is assigned a unique identifier. Objects are also tagged with a number of metadata tags, with the help of which they can be accessed and retrieved.
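As a rough illustration of that storage model, here is a toy in-memory sketch: objects are kept as opaque blobs under unique identifiers and located via metadata tags. A real data lake persists this to distributed storage; every name here is hypothetical.

```python
import uuid

# Toy model of BLOB storage in a data lake: each object is an opaque
# blob under a unique identifier, findable through metadata tags.

blobs = {}   # object_id -> raw bytes
tag_index = {}   # tag -> set of object_ids

def put(data: bytes, object_tags: list[str]) -> str:
    object_id = str(uuid.uuid4())   # unique identifier per object
    blobs[object_id] = data         # stored as-is, schema-free
    for tag in object_tags:
        tag_index.setdefault(tag, set()).add(object_id)
    return object_id

def find(tag: str) -> list[bytes]:
    # Retrieval is driven by metadata, not by a predefined schema.
    return [blobs[object_id] for object_id in tag_index.get(tag, set())]

oid = put(b'{"src": "firewall", "msg": "deny"}',
          object_tags=["firewall", "2024-06"])
print(oid, find("firewall"))
```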
The main advantage of data lakes is that all the data from different content sources can be stored at one central location. Data lakes are also extremely flexible and scalable. Despite these and numerous other benefits, data lakes come with their own set of risks. Designing and implementing a data lake can be complicated, and the biggest risk in any implementation is security and access control.
How does DNIF implement a data lake?
Now that we’ve understood the idea of data lakes, let us see how DNIF implements this concept. As with any data lake, data from different sources is stored at a centralized location; here, events from various sources are stored in indexes at that central location. Each event is stored in its raw format along with additional parsed fields that make accessing and retrieving the data easier. Multiple users can query this data simultaneously with the help of these parsed fields. They can take the data they need, process it and store the new result without affecting the original raw data in any way.
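The sketch below illustrates the idea of raw events paired with parsed fields. It is a conceptual model only, not DNIF’s actual API; the log format and field names are made up for the example.

```python
import re

# Conceptual sketch (not DNIF's API): each event keeps its untouched
# raw form plus parsed fields that make querying easier.

LOG_PATTERN = re.compile(
    r"(?P<ts>\S+) (?P<host>\S+) (?P<action>ALLOW|DENY) (?P<src_ip>\S+)"
)

def index_event(raw_line: str) -> dict:
    """Store the raw event as-is alongside its parsed fields."""
    match = LOG_PATTERN.match(raw_line)
    parsed = match.groupdict() if match else {}
    return {"raw": raw_line, "parsed": parsed}

index = [index_event(line) for line in [
    "2024-06-01T10:00:00Z fw01 DENY 10.0.0.5",
    "2024-06-01T10:00:02Z fw01 ALLOW 10.0.0.9",
]]

# Queries run against the parsed fields; the raw event is never modified.
denied = [e["raw"] for e in index if e["parsed"].get("action") == "DENY"]
print(denied)
```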
The biggest advantage of this centralized approach over pipelined approaches is that it is flexible and easily accessible. Users can be scattered all around the globe and still have flexible access to the stored data. You can do away with the hassle of managing a number of devices and have direct access to the data.
Conclusion
To summarize, a data lake is a storage architecture for big data collection and processing: it allows you to collect all data suitable for analysis today, and potentially in the future, stores that data regardless of its structure, source or format, and transforms it only when it is needed.
When designed and built well, a data lake removes data silos and opens up flexible enterprise-level exploration and mining of results.