Google Cloud DataProc Lab to Setup Jobs for Hadoop and Spark

Google Cloud DataProc Lab to

Setup Jobs for Hadoop and Spark

To show you just how easy it is to create a Cloud Dataproc cluster and use it to complete a Spark job.
We'll use one of the example scripts that is bundled with Apache Spark that will estimate the value of Pi.

We take a quadrant of a circle, then generate a number of uniformly distributed random plotted points.
The points can fall anywhere within the square.
So some will fall inside the curve of our circle and some won't.
If we divide the number of points within the circle by the total number of points, we should get a value that is an approximate ratio.
We know the area of the original square, so we multiply it by that ratio to get the value of Pi.

Go to GCP project and the Cloud Console.
Go straight to the menu and locate Dataproc in the big data section.
The first time you load Dataproc, you'll be prompted to enable the Cloud Dataproc API. Just click Enable API.

Click Create Cluster to create our first cluster.
The console will suggest a name for the cluster. You can change this if you like.
Pick a region that's near to you. I'm going to go with europe-west2.
Dataproc will then pick a zone for me.
We'll change the cluster mode to Single Node.
We'll leave the machine type as the default here, which is an n1-standard-4, including the service charge for Dataproc.
This should cost under 20 cents for 1 hour and you should finish this lab easily in that time.
Scroll down and click Create.

Dataproc is now provisioning a compute engine VM for us and installing and configuring Hadoop and Spark. This will take a few minutes.
The cluster is now ready to get to work without us having to do any configuration work at all.
If we select our cluster, we can see monitoring dashboards that will show us amongst other things our YARN and HDFS utilization.

It will take about a minute for the job to complete,
You may also experience an audit GCP Console bug where it looks like the job is still waiting, but if you click the page, it will refresh and you'll see it's actually already finished.
So here's our result, Pi is 3.141, and here's where it goes off track a bit, but we only used 1,000 points for our estimate.
Feel free to rerun this job with more points to see how more accurate the estimation gets.
Now that we've finished, we'll delete our cluster.
Go back to the clusters page, tick the box next to your cluster, click Delete and Confirm.
One useful thing about Dataproc in the Cloud Console here is that even after our clusters have been deleted, the logs of our jobs persist.
We can go back into the job section and revisit the output of the job we just ran, even though the cluster itself has been deleted.

Jobs can also be submitted via the G Cloud command line tool or using the Dataproc API.