MPP: Massively Parallel Processing
In order to understand popular data warehouses you first need to understand their underlying architecture and the core principles upon which they are built. Massively Parallel Processing (or MPP for short) is this underlying architecture. Here we’ll come to know about what an MPP Database is, how it works, and the strengths and weaknesses of Massively Parallel Processing
Big data defined
What Big Data means?
Before going further, it's helpful to understand what big data means. You cannot simply assign a fixed-size metric to a data set and call it big data. Big data represents datasets that might exhibit properties such as high volume, high velocity, or high variety. These properties mean that you cannot use traditional technologies to process these datasets. As a result, several popular massively parallel processing (MPP) frameworks have emerged, such as Hadoop, to help process big data workloads efficiently.
Benefits of Cloud Platform
GCP provides cloud-native storage and processing services that can help you address all your key big data needs, such as event delivery, storage, parallel processing of streaming and batch data, and analytics. With these services, you can build and seamlessly scale end-to-end big data applications quickly, easily, and more securely. GCP allows you to define your processing logic and can take care of auto-scaling and optimizing resources on your behalf. GCP also provides fast access to popular open source data processing engines, including Apache Spark and Apache Hadoop. You can use this open source software to run your processing directly on the data stored in the GCP storage services.
About MPP Database
Clusters & Nodes
An MPP database is a type of database or data warehouse where the data and processing power are split up among several different nodes (servers), with one leader node and one or many compute nodes. In MPP, the leader (you) would be called the leader node – you’re the telling all the other people what to do and sorting the final tally. The library employees, your helpers, would be called compute nodes – they’re dealing with all the data, running the queries and counting up the words. MPP databases can scale horizontally by adding more compute resources (nodes), rather than having to worry about upgrading to more and more expensive individual servers (scaling vertically). Adding more nodes to a cluster allows the data and processing to be spread across more machines, which means the query will be completed sooner.
Phases of a big data pipeline
The following diagram shows stages that are commonly seen in any big data pipeline.
The first phase of any data lifecycle is to ingest the data from the unprocessed source, such as Internet of Things (IoT) devices, on-premises systems, application logs, or mobile apps. After the data is available in GCP, you choose how to store it appropriately, process and analyze it from raw forms into actionable information, and explore and visualize the data to generate and share insights.
Cloud Platform partner ecosystem
The following diagram maps partner offerings to big data pipeline phases.
The data integration and replication services offered:
- Enable you to perform extract-transform-load (ETL) processing on your data.
- Enable you to connect to a variety of different data sources.
- Help you migrate your data into GCP storage services.
Use Case 1
Assume that you are capturing click-stream events for your ecommerce website. The purpose of the application is to record every click made by the end user and to perform traffic analytics on the data. You want to track web pages that users stay on the most often or the longest, shopping cart abandonment rate, and user navigation flow in real time.
You can use Google Cloud Pub/Sub to collect the large stream of click events coming from the website. You can use Cloud Dataflow to process the data stream coming from Cloud Pub/Sub. You can create a stream and batch subscription to Cloud Pub/Sub to handle real time and batch use cases separately. The batch pipeline ensures that you have raw data stored in Google Cloud Storage as a backup, so you can handle issues related to data recovery, logical corruption, and data reconciliation. The real-time pipeline performs the necessary filtering, enrichment and time- window aggregation on the data, and can store the data in Google BigQuery. You can analyze the data stored in BigQuery by using a comprehensive set of analytics tools provided by the partner ecosystem. These tools allow you to build visualizations of the data about the click-stream events. These visualizations operate on real-time aggregated data, which helps you to derive insights about user behavior soon after it happens.
Use Case 2
Assume that you have an on-premises Hadoop installation that hosts petabytes of advertising data using hundreds of servers. This Hadoop cluster is used for performing churn analysis, understanding factors that contributed to advertising revenue, understanding advertising properties that influenced conversion the most, and so on.
Imagine that this cluster has been growing at over 1 TiB per day, and you constantly have to deal with space issues and have to drop off or archive old data in order to continue taking in new data. You also have to forecast months in advance for your growth needs, because it takes months to get new servers provisioned. You run daily analytical processing at midnight of every day after you receive data for the previous day. The analytics run for less than 8 hours and make reports available for the business the following morning.
The following diagram shows partner solutions that can help you migrate your on-premises Hadoop data to GCP.
In this scenario, the cluster is idle two-thirds of the time. You are still paying for computing resources, even when you are not processing the data. You want to resolve the space issues permanently, avoid having to forecast months in advance by having an auto-scaled system, and optimize costs by not paying for idle computing time.
This solution enables you to transfer large volumes of data to your Hadoop cluster running on GCP continuously as it changes on-premises, with strong consistency, and allows you to eliminate your migration window. Cloud Dataproc is the target for migration in this case.
Cloud Dataproc lets you store and process your data in Google Cloud Storage without having to store it in the local Hadoop Distributed File System (HDFS). Cloud Storage provides highly durable and virtually unlimited storage for your data, so you can immediately solve your space issues by migrating your data. Another advantage is that Cloud Dataproc lets you decouple storage and computing, so you do not have to provision large clusters to ensure you have enough space for the data.
You can shut down the cluster when you are not using it, without losing any data, which can help to optimize costs. This feature, by itself, can reduce costs because you don't pay to run resources full time.
GCP enables you to add more servers to reduce processing time without doing any forecasting. You can address business cases that you couldn't address in your on-premises environment, such as adding more reports and computing more metrics for existing reports, which was otherwise not possible in your current on- premises environment. And you can test and perform software upgrades on the cluster by starting a cluster with the new software version, and use that cluster for processing