Storage I – The Data Storage Saga

Ten years ago no one cared about data. Now data scientist is the “Sexiest Job of the 21st Century”. How did we get here? What came before Hadoop, NoSQL and all the other big data buzzwords of the last decade?

1st Gen: Hierarchical Databases

“The introduction of the term database coincided with the availability of direct-access storage (disks and drums) from the mid-1960s onwards. The term represented a contrast with the tape-based systems of the past, allowing shared interactive use rather than daily batch processing.” [1]

“The distinction of storing data in files and databases is that databases are intended to be used by multiple programs and types of users.” [2]

The hierarchical model is similar to the file system on a modern PC, i.e. everything starts from a parent/root directory such as C:/

To locate a file, you trace a path through the hierarchy of folders from the root directory, e.g. C:/Documents/Programming/SQLHistory.docx

In this way, each (parent) folder can have many (child) subfolders, but each subfolder can only have one parent. The technical term for this is a ‘one-to-many’ relationship.
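
As a deliberately anachronistic sketch in modern SQL terms (hierarchical systems like IMS long predate SQL), a strict hierarchy amounts to a table that references itself, with each record allowed at most one parent; the table and column names here are hypothetical:

```sql
-- Hypothetical sketch: a strict hierarchy as a self-referencing table.
-- parent_id is NULL only for the root; every other folder has exactly
-- one parent, so the relationship is strictly one-to-many.
CREATE TABLE folders (
    folder_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES folders (folder_id)
);
```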

Unfortunately, many complex phenomena can’t be sorted and ranked into an explicit hierarchy, and instead require ‘one-to-many’ relationships running in both directions between some entities, i.e. ‘many-to-many’ relationships.

Despite these drawbacks, hierarchical databases were widely used in the 1960s, most notably in IBM’s Information Management System (IMS) [2].

2nd Gen: Network Databases

Network databases promised to overcome the main drawback of hierarchical databases by replacing the hierarchical structure with a graph or network structure, so that any record could have multiple ‘parents’ and multiple ‘children’, i.e. multiple ‘one-to-many’ relationships pointing to/from each record.
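
Network databases used physical pointer ‘sets’ rather than SQL, but the shape of the idea can be sketched in modern relational terms with the classic bill-of-materials example (hypothetical tables):

```sql
-- Hypothetical sketch: many-to-many via two one-to-many relationships.
-- A part appears in many assemblies; an assembly contains many parts.
CREATE TABLE assemblies (assembly_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE parts      (part_id     INTEGER PRIMARY KEY, name TEXT);

CREATE TABLE assembly_parts (
    assembly_id INTEGER REFERENCES assemblies (assembly_id),
    part_id     INTEGER REFERENCES parts (part_id),
    PRIMARY KEY (assembly_id, part_id)   -- each pairing recorded once
);
```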

Charles Bachman developed this approach in the 1960s and first implemented it in an industry database called Integrated Data Store (IDS). A standardised approach for network databases was established in 1971 by the Conference on Data Systems Languages (CODASYL). Network databases have been synonymous with CODASYL ever since [1][2].

Together the 1st and 2nd generations, hierarchical and network databases, became known as navigational databases.

“A navigational database is a type of database in which records or objects are found primarily by following references from other objects.” [3]

“The essential characteristic … is finding records by virtue of their relationship to other records” [3]

This is implemented using physical links (pointers) to describe the parent/child relationships between records in the database. These links point to the memory address of the related record rather than providing some conceptual framework to connect them. For more information on links, see [4][5].

Advantage: High speed referencing between records.

Disadvantage: Difficult to manually design and implement. Any change requires completely reorganising and reprogramming all the affected links.

Once set up, they are very fast and efficient, like a Jedi, but they require specialist training and an intricate, complicated set-up, akin to a long stint of Jedi training and apprenticeship, with extensive periods of personal growth and meditation.

So for small-scale businesses and data sets, early navigational databases were okay. But as the speed and size of data began to grow exponentially, organisations couldn’t churn out these ‘Jedi’ fast enough, and a new, more scalable system needed to be developed.

3rd Gen: Relational Databases

What we really need is abstraction, consistency and structure: a move away from a hardware-based approach to a more conceptual way of storing data, but one that is highly organised and efficient, so there is no duplication of data, yet all the relationships between data can be handled easily and quickly as needed.

Cue relational databases. They get their name from the ‘relational algebra’ mathematics they are built on, but it’s easier to understand relations by comparing them with the similar lingo used for the various row/column-based storage types.

[Image: the storage lingo side by side: relation = table, tuple = row/record, attribute = column/field]

Relations are often referred to as tables and vice versa, although there’s a slight technical difference. Relations/tables are said to model real-world entities that have internal attributes and external relationships to other entities. Hence the theory behind relational databases is often called the Entity-Relationship (ER) model.

There are two key concepts with a relational database:

  1. Abstraction
  2. Structure

Abstraction

There are no physical links (pointers) joining all the related records together in a relational database. Instead, each row/record in each table/relation gets a unique identifier called a primary key. Because it’s unique, we don’t have any duplication of data. Whenever we want a relationship between records, we share primary key values (stored in the other table as ‘foreign keys’) rather than sharing memory addresses as in the navigational link/pointer method.
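
As a minimal sketch (hypothetical tables), the relationship below lives entirely in shared key values; the database engine is free to store the physical records wherever it likes:

```sql
-- Hypothetical tables: the link between an order and its customer is the
-- shared key value, not a pointer to a memory address.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- unique identifier for each record
    name        TEXT NOT NULL
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
    order_date  DATE,
    total       DECIMAL(10, 2)
);
```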

Advantage: The database structure isn’t tied down to how it’s stored on the computer, so it can easily be changed/updated without modifying the whole system.

Structure

Relational databases are synonymous with structured data. If you have structured data, it’s most likely in a relational database, and if you have a relational database you are definitely dealing with structured data.

The goal of having highly organised, structured data is to improve efficiency, prevent duplication of data and avoid data storage ‘anomalies’ (insert, update, delete anomalies).

A process called normalization is used to convert an unstructured dataset into higher and higher degrees of structured data. The first three degrees of structure are called 1st, 2nd and 3rd normal form [6][7]. Higher forms exist, but database design usually stops at 3rd normal form (3NF).

“We modified their genetic structure to make them less independent than the original host.” – Lama Su, Attack of The Clones

Unstructured Data

No structure imposed yet: the raw data in whatever shape it arrived.

1st Normal Form

Unique primary key. One value per field (row/column intersection). No repeating groups [6].

2nd Normal Form

No partial dependencies, i.e. every non-key attribute must depend on the whole primary key, not just part of it [8].

3rd Normal Form

No transitive dependencies, i.e. non-key attributes must depend directly on the key, not on other non-key attributes [9].
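
To make the three forms concrete, here is a hedged sketch using a hypothetical order_lines table and its 3NF decomposition:

```sql
-- Hypothetical wide table (not normalized):
--   order_lines(order_id, product_id, product_name, supplier_id,
--               supplier_city, quantity)
-- 2NF problem: product_name depends on product_id alone, not on the whole
--   (order_id, product_id) key (a partial dependency).
-- 3NF problem: supplier_city depends on supplier_id, a non-key attribute
--   (a transitive dependency).
-- The 3NF fix gives each dependency its own table:
CREATE TABLE suppliers (
    supplier_id   INTEGER PRIMARY KEY,
    supplier_city TEXT
);

CREATE TABLE products (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    supplier_id  INTEGER REFERENCES suppliers (supplier_id)
);

CREATE TABLE order_lines (
    order_id   INTEGER,
    product_id INTEGER REFERENCES products (product_id),
    quantity   INTEGER,
    PRIMARY KEY (order_id, product_id)
);
```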

Query Languages (SQL)

Relational databases owe their computational speed to the mathematics of ‘relational algebra’, which is used to join subsets of data and process requests. This is similar to lower-level compiled programming languages, like C and Fortran, which are very fast but harder to code in.

To overcome this, computer scientists developed a database analogue of higher-level interpreted programming languages, like JavaScript and Python, to make database programming more intuitive for users. These are called query languages, and they provide a middleman between human programmers and the underlying relational algebra.

For relational databases, the query language used is SQL, ‘Structured Query Language’.
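
For example, here is a hypothetical query against the customers/orders tables sketched earlier: you say what you want, and the engine works out how to get it.

```sql
-- Hypothetical query: written in near-English, translated by the engine
-- into relational-algebra operations (selection, projection, join) and
-- an optimised execution plan.
SELECT c.name,
       SUM(o.total) AS lifetime_value
FROM   customers AS c
JOIN   orders    AS o ON o.customer_id = c.customer_id
WHERE  o.order_date >= '2018-01-01'
GROUP BY c.name;
```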

Continuing our Star Wars analogy, if you think of a database as the central computer of a spaceship, then a query language is like an astromech droid, R2-D2, that talks to the ship directly on your behalf.

You give it orders in a semi-human language and it relays them to the computer in a complex computer language (relational algebra) with high speed and efficiency. Otherwise you’d have to try and do it all manually yourself, which could take a long time.

Schemas

Before you begin a military operation, you have to have a plan, which everyone understands and sticks to, regardless of the outcome. In a database setting, such a plan is called a ‘schema’.

Unlike navigational databases, relational databases can add, update and delete individual records with relative ease; however, their highly structured design requires them to have a ‘fixed schema’ at the point of inception. This means they can’t easily add, update or delete the relationships between the different tables/relations. Instead, all relationships have to be carefully planned and implemented during the schema design phase.

Schemas are usually designed visually using an Entity-Relationship Diagram (ERD) that is then converted into code which generates the database. Below is an ERD drawn in LucidChart [10]:

[Image: example ERD drawn in LucidChart]

There are a few different notations used to construct ERDs, including Crow’s Foot, Chen, Bachman and Martin notations. For more information see [10][11] [12].
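
To make the conversion concrete, here is a hedged sketch of one crow’s-foot ‘one-to-many’ relationship (“a pilot flies many ships”, with hypothetical tables and names) translated from a diagram into schema code:

```sql
-- Hypothetical sketch: a one-to-many ERD relationship as schema code.
CREATE TABLE pilots (
    pilot_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);

CREATE TABLE ships (
    ship_id  INTEGER PRIMARY KEY,
    pilot_id INTEGER REFERENCES pilots (pilot_id) ON DELETE SET NULL
);

-- Changing the relationships later means migrating the fixed schema, e.g.:
-- ALTER TABLE ships ADD COLUMN squadron_id INTEGER
--     REFERENCES squadrons (squadron_id);
```

Because the schema is fixed at inception, reshaping a relationship later is an explicit migration project rather than a quick edit, which is exactly why the design phase matters so much.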

Transactions

As mentioned above, databases are designed to be used by many users at the same time. Records are constantly being viewed, changed and updated, often in real-time. The database manages this similarly to how a computer manages read/write privileges on a file when multiple programs are trying to access it at once.

Requests by users to access certain records are called transactions. Relational database transactions are required to satisfy four key properties, collectively known as ACID (sketched in the example after this list):

  1. Atomicity – all parts of a transaction must be completely successful or completely aborted.
  2. Consistency – a consistent state in the database satisfies all the defined constraints, cascades and triggers. When completed, consistent transactions put the database back in a consistent state.
  3. Isolation – a transaction behaves as if it were running in isolation, i.e. unaffected by any concurrent transactions. Data used by this transaction can’t be used by another transaction until this one is finished.
  4. Durability – changes made are properly saved so that they persist even after a system restart or system failure.
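
Here is what that looks like in practice, as a minimal sketch against a hypothetical accounts table:

```sql
-- Hypothetical transfer between two accounts: all-or-nothing.
BEGIN TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

-- If anything fails before this point, ROLLBACK undoes both updates
-- (atomicity); COMMIT makes the changes permanent (durability); and no
-- concurrent transaction sees the money 'in flight' (isolation).
COMMIT;
```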

Data Warehouses

There are two main types of database used in industry:

  1. OLTP (On-Line Transaction Processing) databases
  2. OLAP (On-Line Analytical Processing) databases

[Image: table comparing OLTP and OLAP databases]

OLTPs are also called operational databases because they focus on real-time business data, i.e. operational data. For any real data mining and analytics, however, OLAPs are needed instead.
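
As a hedged sketch of the contrast, reusing the hypothetical accounts and orders tables from earlier (with an assumed region column):

```sql
-- OLTP: a small, real-time operational request touching one record.
SELECT balance FROM accounts WHERE account_id = 1;

-- OLAP: an analytical request scanning and aggregating years of history.
SELECT region,
       EXTRACT(YEAR FROM order_date) AS order_year,
       SUM(total) AS revenue
FROM   orders
GROUP BY region, EXTRACT(YEAR FROM order_date);
```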

OLAPs are also called data warehouses. “The term ‘data warehouse’ was first coined by Bill Inmon in 1990. According to Inmon, a data warehouse is a subject oriented, integrated, time-variant, and non-volatile collection of data.” [13]

For an excellent overview of data warehouses, see [13] and [14].

Data warehouses are a separate layer on top of a business’ operational (OLTP) databases, ‘integrating’ data from a variety of different sources: the operational database(s), other relational databases, flat files, etc. All the relevant historic data is organised in one place, making it easily accessible for complex analytics.

The key point here is that data warehouses take data from a range of sources then ‘ingest and digest’ it all into highly organised, “Static structures and categories that dictate the kind of analysis that is possible on that data, at the very point of entry” [15].

Hence, they have the same basic approach as relational databases, namely to process and transform incoming data into a highly organised form that is set by a schema. While relational databases usually use 3rd normal form (3NF) schemas, data warehouses often use a more flexible, less structured ‘star schema’, or a more structured ‘snowflake schema’, instead [16].
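
A minimal sketch of a hypothetical star schema: one central fact table ringed by descriptive dimension tables.

```sql
-- Hypothetical star schema. (A snowflake schema would further normalize
-- the dimensions, e.g. splitting region out of dim_store.)
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, full_date DATE,
                          month INTEGER, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT,
                          category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT,
                          region TEXT);

CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date (date_id),
    product_id INTEGER REFERENCES dim_product (product_id),
    store_id   INTEGER REFERENCES dim_store (store_id),
    units_sold INTEGER,
    revenue    DECIMAL(12, 2)
);
```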

In practice it is often hard to integrate all the various databases in an organisation, as they are often incompatible with each other for one reason or another. These disconnected databases are called ‘data silos’.

Data Silos

“Silos occur whenever a data system is incompatible or not integrated with other data systems. This incompatibility may occur in the technical architecture, in the application architecture, or in the data architecture of any data system” [17].

“They can arise for multiple reasons.

  • Structural – Software applications are written at one point in time, for a particular group in the company… [and] are optimized for their main function [rather than for sharing]
  • Political – Knowledge is power, and groups within an organization become suspicious of others wanting to use their data
  • Growth – Any long-lived company has grown through multiple generations of leaders, philosophies, and acquisitions, resulting in multiple incompatible systems.
  • Vendor lock-in – Software vendors are among the first to know that access to data is power, and their strategies can frustrate the desire of users to export the data contained in applications. This is particularly dangerous with software-as-a-service applications, where the vendor wants to keep you within their cloud platform” [18].

Reconciling data from various silos is a major problem for data warehousing and business analytics, especially in recent years, when the volume, velocity and variety of data needing to be stored and analysed on a regular basis has risen exponentially. Data warehouses, by contrast, are slow and expensive to create and update. Inventiv Health project manager Harry Fox says [15]:

  • “It is estimated it takes upwards of 8 weeks to add a column to a table
  • The average cost of an integration project ranges between $250K and $1M”

Fortunately new technologies have recently been developed to help manage the inflow of big data and transform it into a wealth of possibilities instead of a management crisis. Currently the two main business alternatives to data warehouses are data lakes and data hubs.

Data Lakes

“If you move all your data from disparate silos into one system (Hadoop/HDFS) it is now a Data Lake. The data need not be harmonized, indexed, searchable or even easily usable.” [19]

Data Harmonization refers to fixing the naming, structural and/or semantic differences between data sources that previously made them incompatible.

Data indexing means re-indexing all the data from the silos after it has been combined into a central data warehouse/lake/hub; this allows fast lookup and analytics.
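
As a hedged sketch of both steps, assume two hypothetical silo tables, crm_customers(cust_id, cust_nm) and billing_clients(customerid, full_name), describing the same customers differently:

```sql
-- Harmonization: map both silos into one agreed vocabulary.
CREATE TABLE customers_harmonized AS
SELECT cust_id        AS customer_id,
       cust_nm        AS name,
       'crm_silo'     AS source_system
FROM   crm_customers
UNION ALL
SELECT customerid     AS customer_id,
       full_name      AS name,
       'billing_silo' AS source_system
FROM   billing_clients;

-- Re-indexing the combined data restores fast lookup and analytics.
CREATE INDEX idx_harmonized_name ON customers_harmonized (name);
```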

The focus of data lakes is to analyse all the data in its ‘native format’ or ‘natural state’ [20].

A data lake involves very little data harmonization or (re-)indexing, so while it overcomes data silo separation issues by allowing everyone in an organisation to access all the data, it often requires people, “Highly skilled at data manipulation and analysis,” [20] to pull useful insights from the mess.

Edd Wilder-James’ 2016 Harvard Business Review article [18] put it bluntly,

“Don’t be dazzled by the draw of another favourite industry buzzword, the ‘data lake.’ Things aren’t as beautifully simple as the image of clear water and mountain springs might conjure. We can’t just pour all our data into one system, expecting goodness to result.”

Not surprisingly, mismanaged data lakes are often called ‘data swamps’, and can be more trouble than they’re worth [21] [22].

Luckily, there’s also a halfway point between highly structured, expensive data warehouses and flexible but potentially chaotic data lakes. They’re called data hubs.

Data Hubs

Data hubs are similar to data lakes in that they are a central store combining all the incompatible data from data silos; but unlike most data lakes, data hubs focus on data harmonization (reconciling incompatible data) and on (re-)indexing the data once it’s in a more harmonized form.

The key focus here is on processing the data to make it suitable for data mining and analytics, similar to data warehouses but in a less structured, prescribed, rigid format. For a great overview of data lakes, data hubs, harmonization and indexing see [19]:

[Image: MarkLogic’s table comparing data lakes, data hubs and data warehouses]

What Next?

The key distinction between data warehouses, data lakes and data hubs is in how organised and structured the data is after it has been processed and stored in the new system.

This raises a big problem for relational databases in the age of Big Data: highly structured data and SQL are useful and efficient when the data coming into an organisation can be properly processed and structured in a reasonable amount of time, or when each division and department is allowed to have its own unique system that doesn’t have to be compatible with every other system.

But like a clone army, the added structure and efficiency take time to properly design and develop, and its rigid standards become increasingly difficult and complex to maintain as the size and scope of the problem increases exponentially.

A clone army can defend or conquer a planet or two, but it can’t defend or conquer an entire galaxy. To do that we need a different approach:

  1. Lower recruitment standards (NoSQL, Rebel Alliance)
  2. Distributed hive mind (Hadoop, Droid Army)

Until then,

Data science needs deep thought.

References

[1] https://en.wikipedia.org/wiki/Database#1960s,_navigational_DBMS

[2] https://mhaadi.wordpress.com/2010/10/18/the-evolution-of-database/

[3] https://en.wikipedia.org/wiki/Navigational_database

[4] https://msdn.microsoft.com/en-us/library/aa365006(v=VS.85).aspx

[5] https://www.howtogeek.com/howto/16226/complete-guide-to-symbolic-links-symlinks-on-windows-or-linux/

[6] http://www.businessintelligence-solutions.com/differences-1st-2nd-3rd-normal-form/

[7] https://www.studytonight.com/dbms/database-normalization.php

[8] https://www.1keydata.com/database-normalization/second-normal-form-2nf.php

[9] https://www.1keydata.com/database-normalization/third-normal-form-3nf.php

[10] https://www.lucidchart.com/pages/er-diagrams

[11] https://www.smartdraw.com/entity-relationship-diagram/

[12] https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model#Cardinalities

[13] https://www.tutorialspoint.com/dwh/dwh_overview.htm

[14] https://www.tutorialspoint.com/dwh/dwh_olap.htm

[15] https://www.linkedin.com/pulse/big-data-hub-vs-warehouse-harry-fox

[16] https://blog.cloveretl.com/data-warehouses-lakes-hubs-and-vaults-explained

[17] https://en.wikipedia.org/wiki/Information_silo

[18] https://hbr.org/2016/12/breaking-down-data-silos

[19] https://www.marklogic.com/blog/data-lakes-data-hubs-federation-one-best/

[20] https://blogs.dxc.technology/2017/07/24/data-lakes-hubs-and-warehouses-when-to-use-what/

[21] https://www.collibra.com/blog/blogdata-lake-vs-data-swamp-pushing-the-analogy/

[22] https://www.cio.com.au/article/620407/3-keys-keep-your-data-lake-from-becoming-data-swamp/
