Google Dataproc Exam-Interview Questions


Which native output connectors are supported by Dataproc?  Choose 3

  • BigQuery
  • Cloud SQL
  • Cloud Firestore
  • Cloud Storage
  • Cloud Bigtable


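Cloud Dataproc has built-in integration with BigQuery, Cloud Storage, and Cloud Bigtable, which are the three output connectors among the options, as well as with Stackdriver Logging and Stackdriver Monitoring (since renamed Cloud Logging and Cloud Monitoring).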


A customer wants to run Spark jobs on a low-cost ephemeral Dataproc cluster, utilizing preemptible workers wherever possible, but needs to store the results of Dataproc jobs persistently. What would you recommend?

  • Use a secondary group of preemptible worker nodes, but ensure there is enough persistent storage on the primary (non-preemptible) worker nodes to store all of the data.
  • Use the Cloud Storage connector, and specify GCS locations for the input and output of jobs.
  • Do not use preemptible workers at all; they will prevent you from choosing any persistent storage option.
  • Use a secondary group of preemptible worker nodes, but add custom code to a job that copies its results to Cloud Storage.


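The Cloud Storage connector lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage and offers a number of other benefits over HDFS. In particular, data in Cloud Storage persists after the cluster is deleted, so pointing job input and output at GCS locations suits ephemeral clusters with preemptible workers.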
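As a rough illustration of reading and writing through the connector, here is a minimal PySpark sketch. The bucket name, object paths, and the function names `gcs_uri` and `run_word_count` are hypothetical, and the sketch assumes it runs on a Dataproc cluster, where the Cloud Storage connector is preinstalled and `gs://` paths resolve without extra setup.

```python
def gcs_uri(bucket: str, path: str) -> str:
    """Join a bucket name and an object path into a gs:// URI."""
    return f"gs://{bucket.strip('/')}/{path.lstrip('/')}"


def run_word_count(input_uri: str, output_uri: str) -> None:
    """Read text from Cloud Storage, count words, and write the counts back."""
    # Imported lazily so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("gcs-word-count").getOrCreate()

    # spark.read.text yields one row per input line in a column named "value".
    lines = spark.read.text(input_uri)
    words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    counts = words.where(F.col("word") != "").groupBy("word").count()

    # The output goes to Cloud Storage, not HDFS, so it outlives the cluster.
    counts.write.mode("overwrite").parquet(output_uri)
    spark.stop()
```

On a cluster this might be called as `run_word_count(gcs_uri("my-bucket", "input/logs.txt"), gcs_uri("my-bucket", "output/word_counts"))`, with `my-bucket` standing in for a real bucket.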


Which features are not compatible with Dataproc autoscaling? Choose 2

  • MapReduce Tasks
  • High-Availability Clusters
  • Preemptible Workers
  • YARN Node Labels
  • Spark Structured Streaming


  • Autoscaling does not support YARN node labels, nor the property dataproc:am.primary_only. YARN incorrectly reports cluster metrics when node labels are used. (Source: "Autoscaling clusters")
  • Autoscaling is not compatible with Spark Structured Streaming, since Spark Structured Streaming currently does not support dynamic allocation. (Source: "Autoscaling clusters")
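The two restrictions above can be expressed as a small pre-flight check. This is only a sketch: `autoscaling_conflicts` is a hypothetical helper, not part of any Dataproc library, and it assumes node labels are switched on through the cluster property `yarn:yarn.node-labels.enabled` (the `yarn:` prefix targeting yarn-site.xml).

```python
from typing import Dict, List


def autoscaling_conflicts(
    properties: Dict[str, str],
    uses_structured_streaming: bool = False,
) -> List[str]:
    """Return reasons a planned cluster would not be compatible with autoscaling."""
    conflicts = []

    # Assumed property key for enabling YARN node labels on the cluster.
    if properties.get("yarn:yarn.node-labels.enabled", "false").lower() == "true":
        conflicts.append("YARN node labels are enabled")

    # Property named in the explanation above.
    if properties.get("dataproc:am.primary_only", "false").lower() == "true":
        conflicts.append("dataproc:am.primary_only is set")

    # Structured Streaming lacks dynamic allocation support.
    if uses_structured_streaming:
        conflicts.append("workload uses Spark Structured Streaming")

    return conflicts
```

An empty list means none of the incompatibilities named in this question were detected; it does not prove the cluster is otherwise suitable for autoscaling.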


Your customer would like to use Dataproc, but the standard image does not contain some additional Spark components required to run their jobs on the ephemeral clusters. What would you recommend?

  • Use a Dataproc cluster, but specify an initialization action that installs all of the additional components.
  • Create a custom Dataproc image that fulfils the customer requirements and use it to deploy a Dataproc cluster.
  • Split the customer workloads into 2 clusters. Where the extra components are not required, use Dataproc. Where extra components are required, build a custom image and use it to deploy a custom Spark cluster using Compute Engine.
  • Create an image that fulfils the customer requirements and use it to deploy a custom Spark cluster using Compute Engine.


Cloud Dataproc clusters can be provisioned with a custom image that includes a user's pre-installed packages. You could alternatively use initialization actions to install the additional components, but this would be less efficient and incur more running time for ephemeral clusters.
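To make the two options concrete, the sketch below assembles a `gcloud dataproc clusters create` command using either the `--image` flag (custom image) or the `--initialization-actions` flag (startup scripts). The function name `create_cluster_command`, the cluster name, region, image path, and script URI are all placeholders chosen for illustration; check flag behaviour against the current gcloud reference before relying on it.

```python
import shlex
from typing import List, Optional


def create_cluster_command(
    name: str,
    region: str,
    image: Optional[str] = None,
    init_actions: Optional[List[str]] = None,
) -> List[str]:
    """Build the argument list for creating a Dataproc cluster with gcloud."""
    cmd = ["gcloud", "dataproc", "clusters", "create", name, f"--region={region}"]
    if image:
        # Custom image: extra packages are already baked in, so startup is faster.
        cmd.append(f"--image={image}")
    if init_actions:
        # Initialization actions: scripts in Cloud Storage run on every node at startup.
        cmd.append("--initialization-actions=" + ",".join(init_actions))
    return cmd


# Placeholder values; substitute a real project, image, and region.
custom_image_cmd = create_cluster_command(
    "ephemeral-spark",
    "us-central1",
    image="projects/my-project/global/images/my-spark-image",
)
print(shlex.join(custom_image_cmd))
```

For an ephemeral cluster that is created and deleted per job, the initialization-action variant repeats the install work on every creation, which is why the custom image is usually the better fit here.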


Which primary Apache services does Dataproc run? Choose 2

  • Spark
  • Cassandra
  • Dataflow
  • Hadoop
  • Kafka


Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning.


Which GCP product implements the Apache Beam SDK and is sometimes recommended as an alternative to Dataproc particularly for streaming data?

  • Cloud Dataflow
  • Cloud Data Fusion
  • Cloud Composer
  • Cloud Datalab


The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service.


True or False: Preemptible workers in a Dataproc cluster cannot store HDFS data.

  • False
  • True


Since preemptible VMs can be reclaimed at any time, preemptible workers do not store HDFS data; they function as processing nodes only.
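The split between primary workers (which hold HDFS data) and preemptible secondary workers (compute only) shows up directly in the cluster definition. The sketch below builds such a definition as a plain dictionary. The field names (`worker_config`, `secondary_worker_config`, `num_instances`, `preemptibility`) follow the Dataproc API's cluster resource in its snake_case form as I understand it, and should be verified against the current API reference. `ephemeral_cluster_spec` and the example values are hypothetical.

```python
from typing import Any, Dict


def ephemeral_cluster_spec(
    project_id: str,
    cluster_name: str,
    primary_workers: int = 2,
    preemptible_workers: int = 4,
) -> Dict[str, Any]:
    """Describe a cluster with primary workers plus preemptible secondary workers."""
    return {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            # Primary workers are non-preemptible and host HDFS.
            "worker_config": {"num_instances": primary_workers},
            # Secondary workers add compute capacity only; they store no HDFS data.
            "secondary_worker_config": {
                "num_instances": preemptible_workers,
                "preemptibility": "PREEMPTIBLE",
            },
        },
    }


spec = ephemeral_cluster_spec("my-project", "ephemeral-spark")
```

A dictionary of this shape is what a client library call for cluster creation would typically receive; job results should still be written to Cloud Storage, since neither worker group outlives an ephemeral cluster.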
