Set Up Spark on an AWS Ubuntu EC2 Virtual Machine

How to install and set up Spark on an Ubuntu instance on Amazon Web Services (AWS)

  • This guide assumes you can already connect to your EC2 instance via PuTTY (SSH).

Install Components (Python, Scala, Jupyter, Java) to Set Up Spark on EC2

  • Update the package lists on the EC2 instance first; this ensures Python, pip3, and the other components below install cleanly
    • Command: sudo apt-get update

  • Now we will install pip3 so we can install Python packages
    • Command: sudo apt install python3-pip
  • It may ask for permission to continue; answer "Y" for yes

  • Let's install Jupyter
    • Command: pip3 install jupyter

  • Now we will install Java before setting up Spark
  • Java is required for Scala, and Scala is required for Spark
    • Command: sudo apt-get install default-jre

  • Now let’s install Scala
    • Command: sudo apt-get install scala

Let's verify the installed versions:
  • Java:   Command:  java -version
  • Scala: Command:  scala -version
  • Python: Command:  python3 --version

  • To connect Python with Java, we need the Py4J library. Let's install it
    • Command:  pip3 install py4j

  • Now let's set up Spark
    • You can use any version, but this guide uses Spark 2.1.1 (prebuilt for Hadoop 2.7), which matches the paths used in the rest of this post
    • Command: wget

  • Let's extract the downloaded tarball to install Spark
  • Command: sudo tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz
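The download and extract steps can be collected into one sketch. The Apache archive URL pattern below is an assumption; verify it on the Spark downloads page before running:

```shell
# Keep the version in one variable so the tarball name, the extracted
# folder, and the findspark path used later all stay consistent.
SPARK_VERSION="2.1.1"
SPARK_PKG="spark-${SPARK_VERSION}-bin-hadoop2.7"

# Download and extract (URL pattern is an assumption; check the mirror):
# wget "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PKG}.tgz"
# sudo tar -zxvf "${SPARK_PKG}.tgz"

echo "Spark home will be: /home/ubuntu/${SPARK_PKG}"
```

Deriving every path from one variable avoids the common mistake of downloading one Spark version and pointing later commands at another.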

Let's run a few Linux commands and note down the Spark folder path
  • Command (List files): ls
  • Command (go to spark folder): cd spark-2.1.1-bin-hadoop2.7/
  • Command (check present working directory): pwd
  • Output: “/home/ubuntu/spark-2.1.1-bin-hadoop2.7”
  • Command (come back to ubuntu home): cd

  • Install the "findspark" utility; it will help us connect Python with Spark
    • Command: pip3 install findspark
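Under the hood, findspark simply points Python at Spark's bundled libraries. A rough shell equivalent looks like the sketch below (the path is assumed from the steps above, and the exact py4j zip bundled with Spark varies by release):

```shell
# Rough equivalent of findspark.init(): expose Spark's Python libraries
# to the interpreter via environment variables.
export SPARK_HOME="/home/ubuntu/spark-2.1.1-bin-hadoop2.7"
export PYTHONPATH="${SPARK_HOME}/python:${PYTHONPATH}"
echo "SPARK_HOME=${SPARK_HOME}"
```

This is why the exact folder path noted with pwd above matters: findspark.init() needs it verbatim.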

  • Create Jupyter configuration
    • Command: jupyter notebook --generate-config

  • Create a folder named certs and create a .pem file inside it.

    • Command:  cd
    • Command: mkdir certs
    • Command: cd certs
    • Command: sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
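The same certificate step can be run without the interactive prompts by passing -subj (the subject values below are placeholders), and without sudo, so that the .pem file is owned by your user rather than root and Jupyter can read it:

```shell
# Self-signed certificate valid for 365 days; -nodes leaves the key
# unencrypted so Jupyter can load it without a passphrase prompt.
mkdir -p "$HOME/certs"
openssl req -x509 -nodes -days 365 -newkey rsa:1024 \
  -keyout "$HOME/certs/mycert.pem" -out "$HOME/certs/mycert.pem" \
  -subj "/C=US/ST=None/L=None/O=None/CN=localhost"   # placeholder subject
chmod 600 "$HOME/certs/mycert.pem"   # key material should not be world-readable
```

Writing key and certificate into one .pem file keeps the Jupyter config down to a single certfile path.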

  • This will ask for certain info; you can fill it in or just press Enter to accept the defaults

  • Now edit config file

    • Command: cd ~/.jupyter/
    • Command (open config file to edit): vi jupyter_notebook_config.py

  • This will open the vi editor

  • Press "i" to enter insert mode

  • Enter below content in config file

    • c = get_config()
    • c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
    • c.NotebookApp.ip = '*'
    • c.NotebookApp.open_browser = False
    • c.NotebookApp.port = 8888

  • Press Escape
  • Type ":wq!" to write, quit, and return to the console
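If you prefer to skip the vi session, the same settings can be appended to the generated config non-interactively; note the port option is spelled c.NotebookApp.port:

```shell
# Append the TLS/remote-access settings to the generated Jupyter config.
mkdir -p "$HOME/.jupyter"
cat >> "$HOME/.jupyter/jupyter_notebook_config.py" <<'EOF'
c = get_config()
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
EOF
```

The heredoc is quoted ('EOF') so the shell does not expand anything inside the config text.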

Open a Jupyter notebook session

  • Command:  cd
  • Command: jupyter notebook --ip='*'

  • Copy the URL printed in the console and replace localhost with your EC2 instance's public DNS name.

Copy the URL into your browser.

  • The URL looks like: https://localhost:8888/?token=<token>
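As a concrete sketch of that replacement (the hostname and token below are placeholders, not real values):

```shell
# Jupyter prints a URL like this on the console (placeholder token):
LOCAL_URL="https://localhost:8888/?token=abc123"
# Your instance's public DNS, from the EC2 console (placeholder value):
EC2_HOST="ec2-54-0-0-1.compute-1.amazonaws.com"
# Swap localhost for the public DNS so the URL works from your machine:
REMOTE_URL="$(printf '%s' "$LOCAL_URL" | sed "s/localhost/${EC2_HOST}/")"
echo "$REMOTE_URL"
```

Remember to keep the https:// scheme and the ?token=... query string intact; only the host changes.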

Open new python notebook

  • Run the commands below to import Spark from the installed Spark folder:

Let's run our first set of commands in PySpark

Write the code below

  • import findspark
  • findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')
  • import pyspark

Now you are good to run any Spark code. You can set up any version of Spark this same way.

Comments
  1. I got stuck on the last step, accessing it through the browser.

  2. The EC2 instance IP I replaced was the public one, but it didn't work. I have Apache installed; through the public IP I can also see the hosted index page.


