Opinions and facts about data science and machine learning
This post is part of a Series on Modern Data Warehouse Architectures and was written in collaboration with with Will Fleury.
The first consideration in the proposed dataware house architecture is naturally storage. Which storage solution should we use? If your existing infrastructure is in house, then your choices are restricted and you are probably already using an in-house HDFS cluster such as MapR-FS (see comparison). We would still recommend examining the possibility of using a cloud provider. However, remember that whatever option you choose, the compute and storage must to be on the same network. You do not want to pull data outside of a cloud provider’s network. If you are already a cloud user, then you have the choice of running a HDFS cluster in the cloud vs using that cloud providers storage solution. It is very difficult to recommend to anyone to run their own HDFS clusters in the cloud over choosing that of the cloud provider. There are so many benefits in choosing the cloud provider storage solution, from performance to elasticity, versioning, regional replication, cold storage and the cost of running. Some support 4 9’s availability and 11 9’s durability. The operational costs of running a performant HDFS cluster are extremely high and it is very difficult for most custom HDFS setups to achieve even near similar performance, availability or durability to the cloud providers. Resource isolation and starvation issues become a considerable time sink with custom HDFS clusters.
If you are already using a cloud provider, then you are unlikely to choose another providers storage for similar reasons to in-house solutions not choosing a cloud solution. Compute and storage must be within the same network. Each cloud provider’s storage has its own benefits and shortcomings vs the others and we’re not going to list them off here. Similarly a performance benchmark which is up to date is difficult to find and so we will not link to any. One should always perform their own benchmark for themselves of the given solutions they are investigating. However, from experience we have found that one can concurrently read data from cloud storage to different nodes in the same region, within the provider’s network and max the 1Gbps network card on the nodes. With speeds like this, data access is not going to be the bottleneck for your analytics engine. While the Time To First Byte (TTFB) in most cloud storage solutions will be slower than reading from a local disk or dedicated attached instance storage, that initial latency is negligible when working with data volumes required for analytics. The more important metric is throughput or Time To Last Byte (TTLB). We want solutions that can give us a higher throughput.
Some early cloud storage solutions such as AWS S3 had certain eventual consistency quirks in the early days. However, these are well and truly resolved. All of the main providers guarantee read-after-write consistency for PUT operations on new objects, and some even for PUT operations on existing. However, all providers, including S3 now have a transparent solutions for dealing with any remaining consistency issues. The new S3Guard feature of the S3A filesystem implementation is bundled with the latest versions of Hadoop (see HADOOP-13345) and if working with EMR one can use the EMRFS. This means one can use these providers without any fear of consistency issues.
Almost all big data tools have standardised around the Hadoop file system interface for I/O. This interface abstracts away the fact that one is working against native HDFS, AWS S3, Google Storage, Azure Storage or any other provider. This means that one is not tied into a decision they make now, and if needs or circumstances change, they can easily (in terms of interaction) change the storage solution to any of the supported implementations.
Storage format optimisations were a major factor in making a “queryable” Data Lake feasible. Columnar database solutions such as Vertica started appearing in the early 2000’s to tackle the analytics challenges of Big Data. When columnar data formats such as ORC and Parquet started appearing as standalone projects, it allowed the benefits of columnar storage to be plugged into a variety of existing storage solutions. Writing the data lake directly in these formats started making more sense.
So what does columnar storage provide? At the most fundamental level, columnar storage allows one to reduce the amount of data an engine must read from storage, to only what is necessary to solve a given query. Predicate and projection pushdown are crucial concepts. Unlike typical row based datastores where performing an aggregate query such as that shown below requires reading all of the data (every column) from every single row of the table, a columnar solution can read only the data for the records it requires (predicate pushdown) and only the data for the single column it requires (projection pushdown). If there are only 10 columns of equal datasize in the table, then you immediately have an order of magnitude reduction in the amount of data to read.
SELECT AVG(col_a) FROM table_name WHERE col_a > 100
A regular row based database could potentially add an index to the column so it does not need to read every row, however, if the number of records to be read is over a certain threshold, the index is ignored and a full range scan is performed anyway as it is faster than using the index. The use of an index by its nature means that reading is less sequential and there may be more random IO which is extremely detrimental to analytical queries. Indexes also require additional storage. Columnar storage not only removes the need for the storage used by indexes, it also handles sparse data and allows for optimal encoding and compression of the column values which reduces the storage costs, and the amount of data which must be read for a given query.
Columnar storage provides many other benefits depending on the format. Parquet for instance enables one to work with highly nested structures as if they were automatically denormalized (we discuss this further below). Partition pruning is another concept which can dramatically reduce the amount of data which must be read to satisfy some queries. The storage level should support a partitioning strategy which can then be used by the query engine. Note that this feature is not unique to columnar data formats and most row based databases have supported advanced partitioning and clustering configuration for a long time. A more detailed discussion of the differences between row and column base storage can be found in this paper.
In a classical data warehouse, complex relational structures are either fully denormalized or partially denormalized. The level of denormalization sometimes depends on the level of redundancy and complexity one can manage in a given table. It also depends on the depth of the related entities as in the case of a 1:n relationship, clearly we cannot denormalize past a certain threshold. To query such a denormalized structure can be very complex and lead to the creation of multiple views of the same data, denormalized in different ways to work for particular query patterns.
To take the classic example of representing an order and its order items from a purchase. The logical structure is pretty simple: an order can have one or more order items. In an OLTP system this would be modeled as two different tables: one for order items and one for orders with referential constraints between the two table. In a OLAP system the same object would be modeled using a star schema, by denormalizing the OLTP version into a order items fact table and create dimension for each attribute of the order. This model has various shortcomings: the first one appears when you want to join other objects together. For example, in this case if you are looking at attributing orders to specific events, you will have the classic fan-out problem. Things get even messier if you need to handle updates to orders, because you will need to either do a full delete / insert of the order or update the fields that have changed. The order and order item may also have other child entities such as consumers or segments in the travel industry. The number of tables and the relationships between them can get out of control very quickly. The complexity can become even worse if you are trying to do this with an SQL style ETL.
Managing the denormalization comes at a severe cost. The ETL complexity, the modelling complexity, the maintenance complexity. Anyone who has been on the receiving end of keeping a data warehouse consistent when updates to denormalized entities can occur knows that this is something we should try to avoid at all costs. Very often the understanding of the data model is completely lost and what was received or consumed by the original endpoint is all but indistinguishable when it comes to querying at the analytics end. Attributes assumed unimportant 6 months ago become critical today. This leads to additional complexity and time wasted from both a business analysis and data science perspective. If using a classical OLAP database system such as Redshift, you have no choice but to go down this route if you want performance.
Wouldn’t it be nice to just add the order entity, as-is, with all attributes and structure to the storage, and query it directly as if it had been denormalized? This is exactly what Parquet allows you to achieve and it does so without any performance penalty. In fact, in many if not all use cases, there is an improvement in performance due to the removal of the physical JOINs between the various parent-child entity relationships. With Parquet, the Order and all of its child entities are stored as a logical record in “orders” Parquet files (logical table). Within these files, the data is flattened and stored in columnar format, taking care of nested, repeated elements and allowing optimal storage and query efficiency. We will discuss the Parquet storage format in more detail in a subsequent post in this series, but if you wish to know more know, read or watch one of the presentations available on the parquet website.
Next in the series - Design Considerations II - Interactive & Real-time