Big Data in the cloud

GCP Big Data Platform

Together, these services are called GCP’s integrated serverless platform. Serverless means you don’t have to worry about provisioning compute instances to run your jobs. The services are fully managed, and you pay only for the resources you consume.

The platform is integrated, so the GCP services can be combined to create custom solutions.

Cloud Dataproc

It is a managed Apache Hadoop service. Hadoop is an open-source framework for big data. Dataproc is a great choice when you have a dataset of known size, or when you want to manage your cluster size yourself.

It is based on the MapReduce programming model, invented at Google. One function, the map function, runs in parallel across a massive dataset to produce intermediate results; another function, the reduce function, builds a final result set from all of those intermediate results.
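To make the model concrete, here is a minimal word-count sketch in plain Python (no Hadoop involved): the map step emits (word, 1) pairs and the reduce step sums them per word.

    from collections import defaultdict

    def map_fn(line):
        # Map: emit one intermediate (key, value) pair per word.
        return [(word, 1) for word in line.split()]

    def reduce_fn(word, counts):
        # Reduce: combine all intermediate values for one key.
        return word, sum(counts)

    lines = ["the quick brown fox", "the lazy dog"]

    # Shuffle phase: group intermediate pairs by key.
    # (In Hadoop, the map calls run in parallel across the cluster.)
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_fn(line):
            grouped[word].append(count)

    results = [reduce_fn(word, counts) for word, counts in grouped.items()]
    print(results)  # [('the', 2), ('quick', 1), ...]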

The name “Hadoop” is often used loosely to encompass Apache Hadoop itself and related projects such as Apache Spark, Apache Pig and Apache Hive. Cloud Dataproc is an easy way to run all of them on GCP.

All you need to do to get started is request a Hadoop cluster, which will be built on top of Compute Engine VMs whose number and type you control. You can scale the cluster up and down, use a default or custom configuration, and monitor it using Stackdriver.
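As a sketch of what requesting a cluster can look like programmatically (assuming the google-cloud-dataproc Python client, v2+; the project, region and machine types are placeholders):

    from google.cloud import dataproc_v1

    region = "us-central1"  # placeholder region
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": "my-project",  # placeholder project
        "cluster_name": "demo-cluster",
        "config": {
            # You control the number and type of the underlying Compute Engine VMs.
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    operation = client.create_cluster(
        request={"project_id": "my-project", "region": region, "cluster": cluster}
    )
    print(operation.result().cluster_name)  # blocks until the cluster is ready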

You only pay for resources used during the life of the cluster. It is billed by the second.

Spark & Spark SQL

Once your data is in a cluster, you can use Spark and Spark SQL to do data mining. You can also use MLlib, Apache Spark’s machine learning library, to discover patterns through machine learning.
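A short PySpark sketch of both ideas (the bucket path and the column names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("data-mining").getOrCreate()

    # Load a dataset from Cloud Storage (hypothetical bucket and schema).
    df = spark.read.csv("gs://my-bucket/events.csv", header=True, inferSchema=True)

    # Spark SQL: ad-hoc exploration.
    df.createOrReplaceTempView("events")
    spark.sql("SELECT user_id, COUNT(*) AS n FROM events "
              "GROUP BY user_id ORDER BY n DESC LIMIT 10").show()

    # MLlib: cluster the rows to discover patterns.
    assembler = VectorAssembler(inputCols=["duration", "bytes"], outputCol="features")
    model = KMeans(k=3, seed=1).fit(assembler.transform(df))
    print(model.clusterCenters())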

Cloud Dataflow

It’s a general-purpose ETL (extract/transform/load) tool and a data analysis engine. It’s both a unified programming model and a managed service. It is a great choice when your data arrives in real time, or is of unpredictable size or rate.

It lets you develop and execute a wide range of data processing patterns: extract/transform/load, batch computation and continuous computation. You use it to build data pipelines, and the same pipeline code works for both batch and streaming data. There’s no need to spin up a cluster or to size instances.

It fully automates the management of the resources it needs, freeing you from operational tasks like resource management and performance optimization.

An example pipeline: read data from a BigQuery table, transform it with map and reduce operations, and save the results into Cloud Storage. Each step in the pipeline scales automatically; there is no need to launch and manage a cluster. Dataflow has automated, optimized work partitioning built in, which can dynamically rebalance lagging work. This reduces the worry about “hot keys”: situations where disproportionately large chunks of your input get mapped to the same worker.
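A minimal Apache Beam sketch of that pipeline (the project, dataset and bucket names are placeholders; set runner="DataflowRunner" in the options to execute it on Cloud Dataflow):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        project="my-project", region="us-central1",  # placeholders
        temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "Read"   >> beam.io.ReadFromBigQuery(table="my-project:my_dataset.events")
         | "Map"    >> beam.Map(lambda row: (row["user_id"], 1))
         | "Reduce" >> beam.CombinePerKey(sum)
         | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
         | "Write"  >> beam.io.WriteToText("gs://my-bucket/output/counts"))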

BigQuery

It’s a fully managed analytics data warehouse. It’s for when, instead of processing data in a dynamic pipeline, your work runs more in the direction of exploring a vast sea of data: you want to run ad-hoc SQL queries against a massive dataset.

You can load data from Cloud Storage or Cloud Datastore, or stream it into BigQuery at up to 100,000 rows per second. Once the data is loaded, BigQuery can run SQL queries against multiple terabytes of it in seconds.
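For example, with the google-cloud-bigquery Python client you can run an ad-hoc query against one of Google’s public datasets:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project

    sql = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """

    for row in client.query(sql).result():  # blocks until the query finishes
        print(row.name, row.total)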

In addition to SQL queries, you can read and write data via Cloud Dataflow, Hadoop and Spark.
There’s a free monthly quota.

BigQuery lets you specify the region where your data is stored. It separates storage from computation for billing purposes: if you share datasets with other people, they pay for their own queries. There’s a discount for data stored for more than 90 days.

Cloud Pub/Sub (publishers/subscribers)

It’s a messaging service for working with events in real time. It’s meant to serve as a simple, reliable, scalable foundation for stream analytics. You can use it to let independent applications you build send and receive messages; that way they’re decoupled, so they can scale independently.

Applications publish messages to Pub/Sub, and one or more subscribers receive them. Delivery doesn’t have to be synchronous: you can configure subscribers to receive messages on a push or pull basis.
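A minimal publish-and-pull sketch with the google-cloud-pubsub Python client (the project, topic and subscription names are placeholders, and the topic and subscription are assumed to already exist):

    from concurrent.futures import TimeoutError
    from google.cloud import pubsub_v1

    project_id = "my-project"  # placeholder

    # Publisher side: send a message to a topic.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, "sensor-events")
    future = publisher.publish(topic_path, b"temperature=21.5")
    print("published message", future.result())

    # Subscriber side: pull messages and acknowledge them.
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, "sensor-events-sub")

    def callback(message):
        print("received:", message.data)
        message.ack()  # unacknowledged messages are redelivered

    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    try:
        streaming_pull.result(timeout=30)  # listen for 30 seconds
    except TimeoutError:
        streaming_pull.cancel()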

Note that a message may occasionally be delivered more than once, so subscribers should be prepared to handle duplicates.

It’s an important building block for applications where data arrives at high and unpredictable rates, like the Internet of Things. If you’re analyzing streaming data, Cloud Dataflow goes great with Cloud Pub/Sub.
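As a sketch of that pairing, a streaming Beam pipeline can read from the topic used above and count messages per one-minute window (the topic name is still a placeholder):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms.window import FixedWindows

    options = PipelineOptions(streaming=True)  # streaming rather than batch mode

    with beam.Pipeline(options=options) as p:
        (p
         | "Read"   >> beam.io.ReadFromPubSub(
               topic="projects/my-project/topics/sensor-events")
         | "Window" >> beam.WindowInto(FixedWindows(60))  # 1-minute windows
         | "Pair"   >> beam.Map(lambda msg: ("events", 1))
         | "Count"  >> beam.CombinePerKey(sum)
         | "Print"  >> beam.Map(print))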

Cloud Datalab

It’s an interactive tool for large-scale data exploration, transformation, analysis and visualization, built on Project Jupyter (IPython).

This lets you create and maintain web-based notebooks containing Python code, and you can run that code interactively and view the result.

Cloud Datalab takes the management work out of this technique. It runs on Compute Engine VMs. To get started, you specify the VM type you want and the region it should run in. When it launches, it presents an interactive Python environment that’s ready to use. It’s integrated with BigQuery, Compute Engine and Cloud Storage. Once it’s up and running, you can visualize your data with Google Charts, and there are many published packages available for statistics, machine learning and so on.
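For instance, a notebook cell might pull query results into a pandas DataFrame and chart them. Datalab itself renders charts with Google Charts; the sketch below uses the generic pandas plotting path instead, and reuses the public dataset from the BigQuery example above:

    # In a notebook cell (Cloud Datalab or any Jupyter environment).
    from google.cloud import bigquery

    client = bigquery.Client()
    df = client.query("""
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name ORDER BY total DESC LIMIT 10
    """).to_dataframe()  # requires pandas

    df.plot(kind="bar", x="name", y="total")  # rendered inline (requires matplotlib)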