Install Spark on Windows 10

Steps to Install Spark on Windows 10 (applicable to all Spark versions)

Prerequisites: Java and Python must already be installed.


Let's verify the versions:

  • Java: java -version
  • Python: python --version
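If either command is not recognized, the where command can confirm whether the executables are on your PATH (an extra sanity check, not part of the original steps):

    where java
    where python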

Grant Permissions


Change the hidden-folder settings in Windows so that hidden items (such as “C:\ProgramData”) are visible.



Please grant “Full control” permission on the folder “C:\ProgramData” to the user account used for installing.
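If you prefer the command line, an icacls command along these lines should grant the same access from an elevated prompt (using %USERNAME% assumes the installing user is the one currently logged in):

    icacls "C:\ProgramData" /grant "%USERNAME%":(OI)(CI)F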


Download Spark


Once all prerequisites are done, you can download the Spark software from http://spark.apache.org/downloads.html

  • Choose a Spark release: pick the latest stable release
  • Choose a package type: Pre-built for Apache Hadoop 2.7
  • Choose a download type: (click the highlighted link)

  • Spark is now downloaded (or fetch it from the command line as shown below).
  • Copy the file to the folder where you want to set up the engine.
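As a command-line alternative, the same release can be pulled from the Apache archive with curl; the URL below follows the archive's standard layout and is an assumption, so adjust it to the release you picked:

    curl -k -L -o spark-2.2.0-bin-hadoop2.7.tgz https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz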



Install Spark



  • Now we have the file “spark-2.2.0-bin-hadoop2.7.tgz”.
  • You can use WinZip or the commands below to extract it, first into a .tar file and then into the folder “spark-2.2.0-bin-hadoop2.7” containing the Spark files (a one-step alternative follows this list):
  • gzip -d spark-2.2.0-bin-hadoop2.7.tgz
  • tar xvf spark-2.2.0-bin-hadoop2.7.tar
  • You don’t need to run any installer; you just have to place the extracted folder in a specific location.
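Recent Windows 10 builds also ship a bsdtar that can unpack the .tgz in one step, so the following may work directly (an assumption about your build; fall back to the two commands above if tar is not available):

    tar -xvzf spark-2.2.0-bin-hadoop2.7.tgz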



I have created the folder “C:\Sparkinstall” and copied the files there.




Now download the Windows utility winutils.exe

    • Download it from the steveloughran/winutils repository on GitHub, or execute the following from a command prompt:
    • curl -k -L -o winutils.exe "https://github.com/steveloughran/winutils/blob/master/hadoop-2.6.0/bin/winutils.exe?raw=true"
    • Place winutils.exe in the bin folder of the Spark installation (“C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7\bin”), since HADOOP_HOME will point at that folder.


Install Spark (Set up environment variables)

Now set the environment variables by running the commands below in a command prompt:
  • setx SPARK_HOME C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7
  • setx HADOOP_HOME C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7
  • setx PYSPARK_DRIVER_PYTHON ipython
  • setx PYSPARK_DRIVER_PYTHON_OPTS notebook


Or you can do the same from the GUI (refer to the screenshot); verify afterwards as shown after this list:

  • My Computer --> right-click --> Properties --> Advanced system settings --> Environment Variables
  • Add this to the system Path variable --> C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7\bin
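Either way, setx and the GUI only affect new sessions, so open a fresh command prompt and confirm the variables took effect (a quick sanity check, not one of the original steps):

    echo %SPARK_HOME%
    echo %HADOOP_HOME%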


  • Now reboot your machine (just a recommendation from me); your Spark is installed.
  • Along with Spark, you already have Python available (it was a prerequisite).
  • Let’s check whether Spark is really installed or whether we missed a step.
  • Go to a command prompt and change directory to the Spark location “C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7\bin”:
    • cd C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7\bin


Now run the command “spark-submit --version”




Congratulations! You have installed the Spark engine if you see version output like the screenshot above; otherwise, go through the steps again.


Set up PySpark (install)


  • The Spark shell for Python is known as “PySpark”.
  • PySpark helps data scientists interface with Resilient Distributed Datasets (RDDs) in Apache Spark from Python.
  • Py4J is a popular library integrated within PySpark that lets Python interact dynamically with JVM objects (RDDs).
  • Apache Spark comes with an interactive shell for Python, as it does for Scala (see the sketch after this list).
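To make that concrete, here is a minimal sketch of the interface (it assumes the pyspark package from the next step is installed; the app name and sample data are made up):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "FirstApp")    # local mode, all cores
    rdd = sc.parallelize([1, 2, 3, 4])           # an RDD, backed by a JVM object via Py4J
    print(rdd.map(lambda x: x * 2).collect())    # [2, 4, 6, 8]
    sc.stop()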




Install pyspark


  • Go to a command prompt and run “pip install pyspark”.


  • Go to a command prompt and run “pip install jupyter” (the PYSPARK_DRIVER_PYTHON settings above expect the Jupyter notebook).




Open PySpark



  • To open the console for PySpark:
  • Open a command prompt and change directory to “C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7\bin” using the command cd C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7\bin
  • Run the command “pyspark”. Because of the PYSPARK_DRIVER_PYTHON variables set earlier, this launches a Jupyter notebook server instead of a plain shell.


  • A new tab opens in your browser with the URL http://localhost:8888/tree
  • Click New --> Python 3


  • Now you are ready to write your first program.


  • Type your first command and press Ctrl+Enter to see the output; for example:
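A minimal first cell might look like this (sc is the SparkContext that the pyspark shell predefines for you; the sample data is made up):

    # 'sc' is already created by the pyspark shell
    words = sc.parallelize(["spark", "on", "windows", "spark"])
    print(words.count())               # 4
    print(words.distinct().collect())  # the unique words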


Congratulations! Now you can start working with PySpark.
