How Should You Do Big Data on Cloud?
Why is it worth asking this question?
We still work with organisations that have large big data solutions and are in the process of moving from on-premises data centres to cloud. It can be tempting for them to simply lift and shift their on-prem capabilities to cloud-hosted ones, but this misses out on the benefits that cloud platforms can offer.
First some history
Google was formed over 20 years ago (at least according to Wikipedia’s Google entry), and within 6 years it had published its whitepapers on MapReduce and the Google File System, which kick-started the Big Data revolution. One question that I think is important to consider is why Google did this, and what challenges it faced in growing its business.
From the Google File System paper:
First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines.
Second, files are huge by traditional standards. Multi-GB files are common.
Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially.
In the MapReduce paper they state:
- Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.
- Commodity networking hardware is used – typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
- A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.
- Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system [8] developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.
Google also had a model of recruiting expert developers and large numbers of PhDs and viewed coding as low cost and hardware as high cost.
So, as a rough summary, Google was developing software to process Terabytes of data:
- On unreliable hardware
- With slow networks
- In an Enterprise world that thought Gigabytes were enormous
- But with many excellent developers.
MapReduce introduced a change in Data philosophy from “move the data to the process” to “move the processing to the data”.
By simplifying many of the hard problems, they made distributed processing of huge volumes of data on large clusters possible. And the world changed.
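To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using the classic word count example. It is an illustration only, not Google's or Hadoop's implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run each map task on the machine that already stores its input split.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(split: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in split.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(splits: list[str]) -> dict[str, int]:
    # On a cluster, each split would be mapped where it is stored
    # ("move the processing to the data"). Here the splits are simply
    # processed one after another in a single process.
    mapped = (pair for split in splits for pair in map_phase(split))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data on cloud", "big clusters process big data"]))
```

The developer only writes the map and reduce functions; the framework takes care of distribution, grouping and retrying failed tasks, which is what made the model practical on unreliable hardware.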
Hadoop
For many organisations, Big Data started by being synonymous with Hadoop, which was an open-source implementation of the ideas in the Google whitepapers, and the Hadoop ecosystem grew around this set of challenges:
- Hardware is unreliable
- Networks are slow
- Memory is limited
- Cores per machine are limited
- Disks per machine are limited
To reiterate, Google talked about machines with 2 cores, 4GB memory, 1Gbit/s networking and 2-4 disks.
All of this was to meet the challenge of:
- Processing Internet scale volumes of data
- At an affordable price
A constraint that didn’t exist for Google, especially when indexing Web pages, was the need for high levels of Security and Auditing. The 3 initial Hadoop vendors (MapR, now part of HPE; Cloudera; and Hortonworks, now merged into Cloudera) worked hard to add Enterprise grade Security features to their products.
20 years is a long time in most industries and in computing it’s probably similar to 3+ lifetimes.
Do the same constraints still exist?
The need for enterprises to process a greater variety and volume of data at a greater velocity is still growing, as more advanced uses of the data are found and more value is extracted from it. So are the constraints still the same?
- Hardware is unreliable – Yes, and if you can program around hardware unreliability you can utilise lower cost computing platforms that are less expensive to build, run and replace.
- Networks are slow – No, Cloud providers enable 100 Gbits/sec networks (100 times faster than in 2003)
- Memory is limited – No, Cloud providers enable instances with 2-4TB of memory (1,000 times larger than in 2003)
- Cores per machine are limited – No, AWS offers machines with 196 cores, Azure 120 cores, GCP 224 cores (roughly 60 to 110 times more than the 2 cores of 2003). All also offer accelerated compute options which include TPUs, FPGAs or GPUs.
- Disks per machine are limited – No, SSD, NVRAM and in-memory storage have made volumes many times larger and performance many times faster than in 2003. More importantly, Object stores (AWS S3, Azure Blob) and optimised stores (Azure ADLS Gen2, GCP Cloud Storage) have revolutionised the ability to store both huge numbers of files and huge files.
So no, most of the constraints no longer hold. The paradigm has changed, unless you really need Internet scale processing.
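The growth factors quoted in the list above can be checked with a few lines of arithmetic. This is a rough sketch using only figures from this article; it assumes the 2003 baseline is the upper end of the machine described earlier (1 Gbit/s, 4 GB, 2 cores) and the upper end of the cloud ranges, and the names `BASELINE_2003`, `CLOUD_TODAY` and `growth_factors` are made up for illustration.

```python
# 2003 baseline: upper end of the machine described in the MapReduce paper.
BASELINE_2003 = {"network_gbit_s": 1, "memory_gb": 4, "cores": 2}

# Cloud figures as quoted in this article (not independently verified here).
CLOUD_TODAY = {
    "network_gbit_s": {"any provider": 100},
    "memory_gb": {"any provider": 4 * 1000},  # 4 TB expressed in GB (decimal)
    "cores": {"AWS": 196, "Azure": 120, "GCP": 224},
}


def growth_factors(baseline, today):
    """Return today's value divided by the 2003 value, per resource and provider."""
    return {
        resource: {
            provider: value / baseline[resource]
            for provider, value in offers.items()
        }
        for resource, offers in today.items()
    }


if __name__ == "__main__":
    for resource, factors in growth_factors(BASELINE_2003, CLOUD_TODAY).items():
        for provider, factor in factors.items():
            print(f"{resource:15} {provider:13} x{factor:g}")
```

Under these assumptions, network and memory come out at 100 and 1,000 times the 2003 figures, while the core counts land between 60 and 112 times.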
Do you really need Internet scale processing?
Only a small number of companies really do process “Internet” scale volumes of data, mainly the core cloud and social media companies. Below them, the pyramid fans out through heavy data processors and down to individuals. The great news is that these companies don’t see their platform code as their secret sauce, so they either open source it or spin off companies that do. Great examples are Hadoop, Kafka and Kubernetes.
Do Cloud providers do anything to make Big Data easier?
Yes, yes and even more yes. All the Cloud Providers offer a range of integrated and standardised managed services that they support for you. All also offer a set of defined patterns for creating Big Data services that can meet a range of different use cases.
AWS: Big Data Patterns
Azure: Big Data Architecture
GCP: Cloud Storage as a Data Lake
These services are all integrated into the providers’ Identity and Access Management and Security services, are regularly updated and have low administration requirements. Along with the ability to utilise serverless functions or any of the available computing options, this forms a very cost effective and flexible set of solutions. There are also a growing number of SaaS services that fill the gaps where a capability is either not offered or doesn’t meet your current needs. For example: Data Warehousing (Snowflake, Firebolt), Data Catalogue (Collibra, Alation) and Transformation (dbt).
So why choose Cloud Native Services over Build it Yourself?
- Cloud providers have optimised their services based on their knowledge of their Cloud’s capabilities. This means you can implement scalable, secure, managed services at a known cost that is lower than you could build yourself on their platform.
- Running VMs on a Cloud provider is usually the most expensive model, in both Cloud costs and Administrative costs. Even if you get a really good discount on running VMs, at best it will only be as cheap as the Cloud provider can charge internally, and you still have to pay for administration.
- Security and Access models are tightly integrated with Cloud native services and seamlessly integrate into your Enterprise security solutions, significantly reducing the costs for management and reducing the risk profile.
- Engineers are an expensive resource: the more time they spend on designing, developing and deploying applications, and the less on Administration, the better value they are to you.
If you’d like to know more about data architectures or want to talk to us about our experiences, do get in touch.
The inconvenient truth is that most big data projects fail to deliver the expected return on investment. In fact, Gartner predicts that only 15% of data projects utilising AI in 2021-2022 will be successful.
Companies are spending more than ever on data and analytics projects, often using cutting-edge AI and machine learning tech – but many of them don’t generate the ROI that the business expected. In fact, a recent ESI ThoughtLab study of 1,200 organisations found that companies are generating an average ROI of just 1.3% from AI data projects, while 40% don’t generate a profit at all.
There are lots of reasons why this happens. Sometimes, the expectations of data projects are too high. But more often, companies embark on data projects without a clear strategy and without appropriate skills and resource to replicate the benefits of a pilot project at scale. AI projects require time, expertise and scale to deliver a decent ROI.
This might come as a surprise to some early project teams. Building a proof-of-concept AI data project can be relatively easy – if you have a team of skilled data scientists, a small project could be ready to test in a few months. The challenge comes when organisations try to scale up those prototypes to work in an enterprise setting.
If your data scientists don’t have the appropriate software development skills, then you could end up with a machine learning model that works in principle but isn’t fully integrated into workflows and enterprise operations – meaning it’s not collecting, sharing or analysing the intended data.
Enterprises need to ensure that they have the skills needed to make machine learning models work within their business. This might mean creating an app or integrating machine learning models with existing sales platforms.
When a global online home retailer developed a machine learning model to improve the efficiency of logistics, they soon realised that this was only the first step. Data scientists had created a model that was able to predict which warehouse and logistics carrier would be the most efficient for individual products, based on the product size and likelihood of sale in a particular region.
Our development team was able to help take the project to the next step, by creating ways to integrate this model into existing systems and automate the data collection process. The result is a system that can advise the business what proportion of a product’s stock to store in a particular warehouse and which carrier to use, for example to cut 5% from shipping costs.
First, to increase your chances of creating positive ROI from data-enabled AI projects, ensure you have the right skills in your project teams: in addition to data scientists, you will need engineers, process owners and strong DevOps.
Second, ensure that you are measuring ROI over an appropriate timescale. The upfront costs involved in scaling data projects can result in flat ROI in the short-term. Data preparation, technology costs and people development are substantial expenses, and it takes an average of 17 months to show ROI, with firms surveyed by ESI showing a return of 4.3% at this stage.
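The effect of upfront costs on short-term ROI can be illustrated with a small worked example. This is a sketch under made-up assumptions, not data from the ESI study: the cost and benefit figures are hypothetical, benefits are assumed to start only after a ramp-up period, and ROI is taken as (total benefit - total cost) / total cost.

```python
def cumulative_roi(month, upfront_cost, monthly_cost, monthly_benefit, ramp_up_months):
    """ROI after `month` months: (total benefit - total cost) / total cost."""
    total_cost = upfront_cost + monthly_cost * month
    # No benefit is counted until the ramp-up period is over.
    total_benefit = monthly_benefit * max(0, month - ramp_up_months)
    return (total_benefit - total_cost) / total_cost


def first_profitable_month(upfront_cost, monthly_cost, monthly_benefit,
                           ramp_up_months, horizon=60):
    """Return the first month with positive cumulative ROI, or None within the horizon."""
    for month in range(1, horizon + 1):
        roi = cumulative_roi(month, upfront_cost, monthly_cost,
                             monthly_benefit, ramp_up_months)
        if roi > 0:
            return month
    return None


if __name__ == "__main__":
    # Hypothetical project, figures in thousands: 400 upfront, 10 per month
    # to run, 40 per month of benefit once a 3-month ramp-up has finished.
    project = dict(upfront_cost=400, monthly_cost=10,
                   monthly_benefit=40, ramp_up_months=3)
    for month in (6, 12, 18, 24):
        print(month, round(cumulative_roi(month, **project), 3))
    print("first profitable month:", first_profitable_month(**project))
```

With these hypothetical figures the project is still showing a loss after a year and only turns positive in month 18, so a review at 6 or 12 months would judge it a failure.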
Third, are you measuring the right things to capture ROI accurately? Capturing the cost savings from automated processes and data availability only tells half the story. By incorporating machine learning into the transformation of enterprise supply chains, logistics and product development, companies can drive increased revenue, market share, reduced time-to-market and higher shareholder value.
To find out more about how you can realise higher ROI from data investment, download our free Playbook here.