Set Up Spark on an AWS Ubuntu EC2 Virtual Machine

How to install and set up Spark on Amazon Web Services (AWS) on Ubuntu

  • This guide assumes you can already connect to your EC2 instance via PuTTY (or SSH).

Install Components (Python, Scala, Jupyter, Java) to Set Up Spark on EC2

  • First, update the package index on the EC2 instance; this ensures Python, pip3, and the other packages below install cleanly
    • Command: sudo apt-get update

  • Now we will install pip3, which we'll use to install Python packages
    • Command: sudo apt install python3-pip
  • It might ask for permission to continue; enter "Y" for yes

  • Let's install Jupyter
    • Command: pip3 install jupyter

  • Now we will install Java, before setting up Spark
  • Java is required for Scala, and Scala is required for Spark
    • Command: sudo apt-get install default-jre

  • Now let’s install Scala
    • Command: sudo apt-get install scala

Let's verify the installed versions:
  • Java:   Command:  java -version
  • Scala: Command:  scala -version
  • Python: Command:  python3 --version
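If you want to script these checks, the small helper below is one way to pull the version number out of a tool's banner (an illustrative sketch; `extract_version` is a made-up name, and your exact banners will differ):

```python
import re

def extract_version(output):
    """Pull the first dotted version number (e.g. 1.8.0_131 or 3.5.2)
    out of a tool's version banner. Returns None if nothing matches."""
    match = re.search(r'(\d+(?:\.\d+)+(?:_\d+)?)', output)
    return match.group(1) if match else None

# Sample banners of the kind these tools print:
print(extract_version('openjdk version "1.8.0_131"'))
print(extract_version('Scala code runner version 2.11.6'))
print(extract_version('Python 3.5.2'))
```

Note that `java -version` prints its banner to stderr rather than stdout, so capture both streams if you wrap these commands with `subprocess`.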

  • To connect Python with Java, we need the Py4J library. Let's install it
    • Command:  pip3 install py4j

  • Now let's set up Spark
    • You can use any version, but this guide uses Spark 2.1.1 with Hadoop 2.7 (the version that appears in all the paths below)
    • Command: wget <URL of the spark-2.1.1-bin-hadoop2.7.tgz download>

  • Let's extract the downloaded tar file to install Spark
  • Command: sudo tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz

Let's run a few Linux commands and note the Spark folder path
  • Command (List files): ls
  • Command (go to spark folder): cd spark-2.1.1-bin-hadoop2.7/
  • Command (check present working directory): pwd
  • Output: “/home/ubuntu/spark-2.1.1-bin-hadoop2.7”
  • Command (come back to ubuntu home): cd

  • Install the "findspark" utility; it helps Python locate the Spark installation
    • Command: pip3 install findspark
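Under the hood, `findspark.init()` essentially exports `SPARK_HOME` and puts Spark's Python bindings on `sys.path`. Here is a rough sketch of that behavior (illustrative only; the real findspark also locates the bundled py4j zip), demonstrated against a throwaway directory rather than a real Spark install:

```python
import os
import sys
import tempfile

def init_spark_path(spark_home):
    """Roughly what findspark.init() does: export SPARK_HOME and
    prepend Spark's Python bindings directory to sys.path."""
    os.environ['SPARK_HOME'] = spark_home
    python_dir = os.path.join(spark_home, 'python')
    sys.path.insert(0, python_dir)
    return python_dir

# Demonstrate with a temporary directory standing in for the Spark folder.
fake_home = tempfile.mkdtemp()
os.makedirs(os.path.join(fake_home, 'python'))
added = init_spark_path(fake_home)
print(os.environ['SPARK_HOME'] == fake_home)  # True
print(added in sys.path)                      # True
```

On the EC2 instance you would of course pass the real path, `/home/ubuntu/spark-2.1.1-bin-hadoop2.7`, as shown later in this guide.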

  • Create Jupyter configuration
    • Command: jupyter notebook --generate-config

  • Create a certs folder and generate a self-signed certificate (.pem file) inside it

    • Command:  cd
    • Command: mkdir certs
    • Command: cd certs
    • Command: sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem

  • This will prompt for certificate details; fill them in, or just press Enter to accept the defaults

  • Now edit the config file

    • Command: cd ~/.jupyter/
    • Command (open config file to edit): vi jupyter_notebook_config.py

  • This will open the vi editor

  • Press "i" to enter insert mode

  • Enter the content below in the config file

    • c = get_config()
    • c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
    • c.NotebookApp.ip = '*'
    • c.NotebookApp.open_browser = False
    • c.NotebookApp.port = 8888

  • Press Escape
  • Type ":wq!" to save the file, quit, and return to the console

Open a Jupyter notebook session

  • Command:  cd
  • Command: jupyter notebook --ip='*'

  • Copy the URL printed in the console and replace localhost with your EC2 instance's public DNS name.
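The localhost-to-public-DNS swap can also be scripted. This is an illustrative helper using only the standard library (the function name and the DNS value below are made-up examples):

```python
from urllib.parse import urlsplit, urlunsplit

def point_at_ec2(url, ec2_host):
    """Replace the host in a Jupyter URL with the EC2 public DNS name,
    keeping the scheme, port, path, and the ?token=... query intact."""
    parts = urlsplit(url)
    port = ':{}'.format(parts.port) if parts.port else ''
    return urlunsplit((parts.scheme, ec2_host + port, parts.path,
                       parts.query, parts.fragment))

# Hypothetical example values:
local_url = 'https://localhost:8888/?token=abc123'
print(point_at_ec2(local_url, 'ec2-1-2-3-4.compute.amazonaws.com'))
# -> https://ec2-1-2-3-4.compute.amazonaws.com:8888/?token=abc123
```

Keeping the `?token=...` query intact matters: without the token, the notebook server will reject the request.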

Paste the URL into your browser.

  • The URL will look like: https://<your-ec2-public-dns>:8888/?token=<token> (note https, since we configured a certificate)

Open a new Python notebook

  • Run the commands below to import Spark from the installed Spark folder:

Let's run our first set of commands in PySpark

Write the code below:

  • import findspark
  • findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')
  • import pyspark

Now you are good to run any Spark code. These same steps work for any Spark version you choose to set up.
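As a quick smoke test inside the notebook (a minimal sketch; it assumes the Spark folder path above and a working Java install, so it will only run on the configured instance):

```python
import findspark
findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')  # path from the steps above
import pyspark

# Start a local SparkContext and run a trivial distributed job.
sc = pyspark.SparkContext()
rdd = sc.parallelize(range(10))   # distribute the numbers 0..9
print(rdd.sum())                  # add them up across the partitions
sc.stop()                         # always release the context when done
```

If `rdd.sum()` returns without error, Spark, Java, Py4J, and findspark are all wired up correctly.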

Comments
  1. I got stuck on the last step, accessing the notebook through the browser.

  2. The EC2 instance IP I replaced was the public one, but it didn't work. I have Apache installed, and through the public IP I can see the hosted index page.

