Big data has been trending up for several years and has gained even more momentum in the last couple. There are many key differences between data lakes and data warehouses, and you need to understand them before you can make an informed decision on how to manage your data.
Those of us who are data and analytics practitioners have certainly heard the term “data lake,” and as we begin to discuss big data solutions with customers, the conversation naturally turns to data lakes. However, although many customers have heard the term, many don’t know what it means.
Data Warehouse
A data warehouse is a central repository of integrated data from one or more disparate sources. It stores current and historical data and is used to create trend reports for senior management, such as annual and quarterly comparisons.
This definition describes the purpose of a data warehouse but doesn’t explain how the purpose is achieved.
A data warehouse has the following properties:
- It represents an abstracted picture of the business organized by subject area
- It is highly transformed and structured
- Data is not loaded to the data warehouse until the use for it has been defined
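The properties above amount to what is often called schema-on-write: the structure is fixed first, and data is transformed to fit it before loading. A minimal sketch of that idea follows, using Python's built-in `sqlite3` module as a stand-in for a warehouse. The table name `sales_fact`, the sample rows, and the `load_warehouse` helper are all hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical rows from a transactional source system.
source_rows = [
    {"order_id": 1, "region": "west", "amount": "120.50", "clickstream": "a>b>c"},
    {"order_id": 2, "region": "EAST", "amount": "80.00", "clickstream": "b>c"},
]


def load_warehouse(rows):
    """Transform rows to a predefined schema, then load them."""
    conn = sqlite3.connect(":memory:")
    # The schema is defined up front, organized around one subject area (sales).
    conn.execute(
        "CREATE TABLE sales_fact ("
        "order_id INTEGER PRIMARY KEY, "
        "region TEXT NOT NULL, "
        "amount REAL NOT NULL)"
    )
    for row in rows:
        # Only fields with a defined use are loaded; "clickstream" is dropped.
        conn.execute(
            "INSERT INTO sales_fact VALUES (?, ?, ?)",
            (row["order_id"], row["region"].upper(), float(row["amount"])),
        )
    conn.commit()
    return conn


conn = load_warehouse(source_rows)
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region")
)
```

Because the `clickstream` field had no defined use when the schema was designed, it never reaches the warehouse, and a later question about it cannot be answered from `sales_fact` alone.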
Data Lake
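The term “data lake” is generally credited to Pentaho CTO James Dixon. He describes a data mart (a subset of a data warehouse) as akin to a bottle of water: cleansed, packaged, and structured for easy consumption. A data lake, by contrast, is a large body of water in a more natural state. Data flows from the streams (the source systems) to the lake, and users have access to the lake to examine it, take samples, or dive in.
A data lake has the following properties: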
- All data is loaded from source systems.
- No data is turned away.
- Data is stored at the leaf level in an untransformed or nearly untransformed state.
- Data is transformed, and schema is applied to fulfill the needs of analysis.
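These properties are commonly summarized as schema-on-read: records are stored as they arrive, and structure is imposed only when an analysis needs it. The sketch below illustrates the idea with an in-memory list standing in for the lake. The `ingest` and `read_sales` helpers and the sample records are hypothetical, chosen only to show the pattern.

```python
import json

# The "lake": every source record is kept as-is, one raw JSON string each.
lake = []


def ingest(raw_record: str) -> None:
    """Store the record untransformed; nothing is turned away."""
    lake.append(raw_record)


ingest('{"order_id": 1, "region": "west", "amount": "120.50", "clickstream": "a>b>c"}')
ingest('{"sensor": "t-17", "temp_c": 21.4}')
ingest('{"order_id": 2, "region": "EAST", "amount": "80.00"}')


def read_sales(raw_records):
    """Apply a schema at read time, only for the analysis at hand."""
    for raw in raw_records:
        rec = json.loads(raw)
        # Records that do not fit this analysis are skipped, not deleted.
        if "order_id" in rec:
            yield {
                "order_id": int(rec["order_id"]),
                "region": rec["region"].upper(),
                "amount": float(rec["amount"]),
            }


sales = list(read_sales(lake))
total_amount = sum(row["amount"] for row in sales)
```

The sensor reading and the `clickstream` field stay in the lake untouched, so a different analysis can later apply its own schema to the same raw records.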
Next, consider the key differences between a data lake and the data warehouse approach.
Data Lakes Retain All Data
During the development of a data warehouse, a considerable amount of time is spent analyzing data sources, understanding business processes, and profiling data.
Data Lakes Support All Data Types
Data warehouses generally hold data extracted from transactional systems: quantitative metrics and the attributes that describe them. Non-traditional data sources such as web server logs, sensor data, social network activity, text, and images are largely ignored.
Data Lakes Support All Users
In most organizations, 80% or more of users are “operational.” They want to get their reports, see their key performance metrics or slice the same set of data in a spreadsheet every day.