Steps to Install Spark on Windows 10 (Applicable to all versions)
Quick Links:
- Spark Installation on Windows
- Spark Installation on AWS EC2 VM
- Spark Architecture
- Spark With Python Lab 1
- Spark Lab 2
Prerequisite:
- At least 4 GB RAM, i5 processor
- GOW: lets you use Linux commands on Windows (Click here to install/update GOW)
- Java: version 8 is good (Click here to update or install Java)
- Jupyter: interface to write code (installed in a later step)
- Python / Scala: coding language (Click here to install Python)
Grant Permissions
Please grant “Full Control” permission on the folder “C:\ProgramData” to the user used for installing
Download Spark
Once all prerequisites are done, you can download the Spark software from http://spark.apache.org/downloads.html
- Choose a Spark release: pick the latest stable release
- Choose a package type: Pre-built for Hadoop 2.6
- Choose a download type: (click on the highlighted link)
- Spark is now downloaded.
- Copy the file to the folder where you want to set up the engine
Install Spark
- Now we have the file “spark-2.2.0-bin-hadoop2.7.tgz”
- You can use WinZip or the commands below to extract it, first into a .tar file and then into the folder “spark-2.2.0-bin-hadoop2.7” containing the Spark files
- gzip -d spark-2.2.0-bin-hadoop2.7.tgz
- tar xvf spark-2.2.0-bin-hadoop2.7.tar
- You don’t need to execute any installer; you just have to place this folder in a specific location
I have created the folder “C:\Sparkinstall” and copied the files there
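If you prefer to stay in Python, the extraction above can also be done in one step with the standard-library tarfile module (a sketch; the archive and folder names follow this tutorial's layout):

```python
import tarfile

# One-step alternative to the gzip + tar commands above, using only the
# Python standard library: extract a .tgz archive into a destination folder.
def extract_spark(archive, dest):
    """Extract a .tgz archive such as spark-2.2.0-bin-hadoop2.7.tgz into dest."""
    with tarfile.open(archive, "r:gz") as tf:
        tf.extractall(dest)

# For this tutorial's layout you would run:
# extract_spark("spark-2.2.0-bin-hadoop2.7.tgz", r"C:\Sparkinstall")
```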
Now download windows utility
- Download winutils.exe from https://github.com/steveloughran/winutils/blob/master/hadoop-2.6.0/bin/winutils.exe
- Copy it to “C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7\bin”
- Or execute “curl -k -L -o winutils.exe https://github.com/steveloughran/winutils/blob/master/hadoop-2.6.0/bin/winutils.exe?raw=true” from the command prompt
Install Spark (Setup environment variables)
Now set the environment variables by running the commands below in a command prompt
- setx SPARK_HOME C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7
- setx HADOOP_HOME C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7
- setx PYSPARK_DRIVER_PYTHON ipython
- setx PYSPARK_DRIVER_PYTHON_OPTS notebook
OR you can do the same from the GUI (refer to the screenshot)
- My Computer --> right click --> Properties --> Advanced system settings --> Environment Variables
- Add this to the system variable Path --> C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7\bin
- Now reboot your machine (just a recommendation from me), and your Spark is installed
- Along with Spark, you now also have Python installed
- Let’s check whether we really have Spark installed or have missed a step
- Go to the command prompt and change directory to the Spark bin location “C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7\bin”
- cd C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7\bin
Now run the command “spark-submit --version”
Congratulations, you have installed the Spark engine if you get the above screenshot; otherwise, go through the steps again
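You can also confirm the environment variables from Python. Note that setx does not affect an already-open command prompt, so run this check in a new one (a small sketch using the variable names set earlier in this tutorial):

```python
import os

# Sanity check for the setx commands above: look up each Spark-related
# environment variable and report its value, or None if it is unset.
def check_spark_env(env=None):
    """Return each Spark-related variable and its value, or None if unset."""
    env = os.environ if env is None else env
    names = ("SPARK_HOME", "HADOOP_HOME",
             "PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS")
    return {name: env.get(name) for name in names}

for name, value in check_spark_env().items():
    print(name, "=", value if value is not None else "<not set>")
```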
Setup PySpark (install)
- The Spark shell for Python is known as “PySpark”
- PySpark lets data scientists interface with Resilient Distributed Datasets (RDDs) in Apache Spark from Python.
- Py4J is a popular library integrated within PySpark that lets Python interface dynamically with JVM objects (RDDs).
- Apache Spark comes with an interactive shell for Python, as it does for Scala.
Install pyspark
- Go to the command prompt and run “pip install pyspark”
- Go to the command prompt and run “pip install jupyter”
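To verify the pip installs without starting anything up, you can check whether Python can find the packages (a sketch using the standard library; find_spec only looks the module up, it does not import it):

```python
import importlib.util

# Check whether a module is available on this Python installation
# without importing it (so nothing heavy gets loaded).
def installed(module_name):
    return importlib.util.find_spec(module_name) is not None

# py4j is pulled in as a dependency of pyspark
for mod in ("pyspark", "py4j", "jupyter_core"):
    print(mod, "installed:", installed(mod))
```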
Open PySpark
- To open the console for PySpark
- Open a command prompt and change directory to “C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7\bin” using the command cd C:\Sparkinstall\spark-2.2.0-bin-hadoop2.7\bin
- Run the command “pyspark”
- A new tab will open in your browser with the URL http://localhost:8888/tree
- Click New --> Python 3
- Now you are ready to write your first program
- Type your first command and press Ctrl+Enter to see the output
Congratulations, you can now start working with PySpark