Nathan Carney, Data Architect

Thu 21st July, 2022

How Should You Do Big Data on Cloud?

Why is it worth asking this question?

We still work with organisations that are in the process of moving from on-premises data centres to the cloud and that have large Big Data solutions. It can be tempting for them to simply lift and shift their on-prem capabilities to cloud-hosted equivalents, but this misses out on the benefits that cloud platforms can offer.

First, some history

Over 20 years ago Google was formed (at least according to Wikipedia) and within 6 years they had published their whitepapers on MapReduce and the Google File System, which kick-started the Big Data revolution. One of the questions I think it is important to consider is why they did this and what challenges they faced in growing their business.

From the Google File System paper:

First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines.  

Second, files are huge by traditional standards. Multi-GB files are common. 

Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. 

In the MapReduce paper they state:

  • Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine. 
  • Commodity networking hardware is used – typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth. 
  • A cluster consists of hundreds or thousands of machines, and therefore machine failures are common. 
  • Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system [8] developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.

Google also had a model of recruiting expert developers and large numbers of PhDs, and it viewed coding as low cost and hardware as high cost.

So, as a rough summary, Google was developing software to process Terabytes of data:

  • On unreliable hardware
  • With slow networks
  • In an Enterprise world that thought Gigabytes were enormous
  • But with many excellent developers.

MapReduce introduced a change in Data philosophy from “move the data to the process” to “move the processing to the data”.  

Google simplified many of the problems involved, making distributed processing of huge volumes of data on large clusters possible. And the world changed.
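
As a rough illustration of the map/reduce model (not Google's actual implementation), here is a minimal single-machine word-count sketch in Python; the function names and sample documents are purely illustrative.

    from collections import defaultdict

    def map_phase(document):
        """Map: emit a (word, 1) pair for every word in an input split."""
        for word in document.split():
            yield word.lower(), 1

    def reduce_phase(pairs):
        """Reduce: sum the counts emitted for each distinct word."""
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return dict(counts)

    # In the real system each map task runs on a machine that already holds
    # its block of the input file ("move the processing to the data"), and a
    # shuffle step groups the pairs by key before the reduce tasks run.
    documents = ["the quick brown fox", "the lazy dog", "the fox"]
    pairs = (pair for doc in documents for pair in map_phase(doc))
    print(reduce_phase(pairs))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}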

Hadoop

For many organisations, Big Data started out as being synonymous with Hadoop, an open-source implementation of the ideas in the Google whitepapers, and the Hadoop ecosystem grew up around this set of challenges:

  1. Hardware is unreliable
  2. Networks are slow
  3. Memory is limited
  4. Cores per machine are limited 
  5. Disks per machine are limited

To reiterate, Google talked about machines with 2 cores, 4GB memory, 1Gbit/s networking and 2-4 disks.  

All of this was aimed at meeting the challenge of:

  • Processing Internet-scale volumes of data
  • At an affordable price

A constraint that didn’t exist for Google, especially when indexing web pages, was the need for high levels of Security and Auditing. The three initial Hadoop vendors (MapR, now part of HPE; Cloudera; and Hortonworks, now merged into Cloudera) worked hard and added Enterprise-grade Security features to their products.

20 years is a long time in most industries and in computing it’s probably similar to 3+ lifetimes.  

Do the same constraints still exist?

The need for enterprises to process a greater variety and volume of data at a greater velocity is still accelerating, as more advanced uses of the data are found and more value is identified as extractable from it. So are the constraints still the same?

  1. Hardware is unreliable – Yes. If you can program around hardware failures, you can use lower-cost computing platforms that are cheaper to build, run and replace.
  2. Networks are slow – No. Cloud providers offer 100 Gbit/s networks (100 times faster than in 2003).
  3. Memory is limited – No. Cloud providers offer instances with 2-4 TB of memory (1,000 times larger than in 2003).
  4. Cores per machine are limited – No. AWS offers machines with 192 cores, Azure 120 cores and GCP 224 cores (over 100 times more than in 2003). All also offer accelerated compute options that include TPUs, FPGAs or GPUs.
  5. Disks per machine are limited – No. SSDs, NVRAM and in-memory storage have made volumes and performance many times better than in 2003. More importantly, object stores (AWS S3, Azure Blob Storage, GCP Cloud Storage) and analytics-optimised stores (Azure ADLS Gen2) have revolutionised the ability to store both huge numbers of files and huge individual files (see the sketch after this list).
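
As a small, hedged illustration of how object stores remove the old per-machine disk limits, the sketch below uses Python and boto3 to page through an arbitrarily large set of objects under an S3 prefix; the bucket and prefix names are hypothetical.

    import boto3

    # Hypothetical bucket and prefix - substitute your own.
    BUCKET = "example-data-lake"
    PREFIX = "events/2022/07/"

    s3 = boto3.client("s3")

    # list_objects_v2 returns at most 1,000 keys per call; the paginator
    # follows continuation tokens, so listings of any size work the same way.
    paginator = s3.get_paginator("list_objects_v2")

    total_objects = 0
    total_bytes = 0
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            total_objects += 1
            total_bytes += obj["Size"]

    print(f"{total_objects} objects, {total_bytes / 1e9:.2f} GB under s3://{BUCKET}/{PREFIX}")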

So no, the constraints are not the same; unless you really need Internet-scale processing, the paradigm has changed.

Do you really need Internet scale processing?

There are a small number of companies that really do process “Internet-scale” volumes of data – mainly the core cloud and social media companies – and this fans out to heavy data processors and down the pyramid to individuals. The great news is that these companies don’t see their platform code as their secret sauce: they either open source it or spin off companies that do, great examples being Hadoop, Kafka and Kubernetes.

Do Cloud providers do anything to make Big Data easier?

Yes, yes and even more yes. All the Cloud Providers offer a range of integrated, standardised and fully supported managed services, and all publish defined patterns for creating Big Data services that can meet a range of different use cases:

AWS:  Big Data Patterns

Azure: Big Data Architecture

GCP:  Cloud Storage as a Data Lake

These services are all integrated with the providers’ Identity and Access Management and Security services, are regularly updated, and have low administration requirements. Along with the ability to use serverless functions or any of the available compute options, this makes for a very cost-effective and flexible set of solutions. There is also a growing number of SaaS services that let you add capabilities that are either not offered or don’t meet your current needs: for example, Data Warehousing (Snowflake, Firebolt), Data Catalogue (Collibra, Alation) and Transformation (dbt).
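
As one example of the serverless pattern mentioned above, here is a rough sketch of a function that processes each file as it lands in an object store. It assumes AWS Lambda, an S3 ObjectCreated trigger and newline-delimited text files; none of this is prescribed by the providers’ pattern documents.

    import boto3
    from urllib.parse import unquote_plus

    s3 = boto3.client("s3")

    def handler(event, context):
        """Sketch of an AWS Lambda handler wired to an S3 ObjectCreated event.

        Assumes each uploaded object is a newline-delimited text file; a real
        pipeline would parse, validate and forward the records downstream.
        """
        results = []
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            # Object keys in S3 event notifications are URL-encoded.
            key = unquote_plus(record["s3"]["object"]["key"])

            # Stream the object rather than loading it into memory in one go.
            body = s3.get_object(Bucket=bucket, Key=key)["Body"]
            line_count = sum(1 for _ in body.iter_lines())

            results.append({"bucket": bucket, "key": key, "lines": line_count})

        return results

The point of the sketch is that there is no cluster to size, patch or administer: the platform scales the function with the volume of arriving files and you pay only for the compute used.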

So why choose Cloud Native Services over Build it Yourself?

  • Cloud providers have optimised their services based on their knowledge of their own cloud’s capabilities. This means you can implement scalable, secure, managed services at a known cost that is lower than anything you could build yourself on the same platform.
  • Running VMs on a cloud provider is usually the most expensive model, in terms of both cloud costs and administration costs. Even with a really good discount, running VMs will at best only be as cheap as the cloud provider can charge itself internally, and you still have to pay for the administration.
  • Security and Access models are tightly integrated with cloud native services and plug seamlessly into your Enterprise security solutions, significantly reducing management costs and the risk profile.
  • Engineers are an expensive resource: the more time they spend designing, developing and deploying applications, and the less they spend on administration, the better value they are to you.

 

If you’d like to know more about data architectures, or want to talk to us about our experiences, do get in touch: