Set Up Spark on an AWS Ubuntu EC2 Virtual Machine

How to install and set up Spark on an Amazon Web Services (AWS) EC2 instance running Ubuntu




  • Prerequisite: you can already connect to your EC2 instance over SSH, for example with PuTTY.














Install the components (Python, Scala, Jupyter, Java) needed to set up Spark on EC2


  • Update the package index on the EC2 instance first, so that the installs of pip3 and the other packages below pick up current versions
    • Command: sudo apt-get update









  • Next, install pip3, which is used to install Python packages
    • Command: sudo apt install python3-pip
  • When apt asks for permission to continue, answer “Y” for yes









  • Let’s install Jupyter
    • Command: pip3 install jupyter










  • Next, install Java before setting up Spark
  • Scala runs on the Java Virtual Machine and Spark is written in Scala, so Java has to be in place first
    • Command: sudo apt-get install default-jre








  • Now let’s install Scala
    • Command: sudo apt-get install scala



Let’s verify the installed versions:
  • Java: Command: java -version
  • Scala: Command: scala -version
  • Python: Command: python3 --version
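
The version checks above can also be scripted from Python. The following is only a minimal sketch and not part of the original walkthrough: the helper name `find_tools` is hypothetical, and it reports whether each executable is on the PATH rather than parsing version strings.

```python
import shutil

# Executables installed in the steps above (assumed to be on the PATH).
REQUIRED_TOOLS = ["java", "scala", "python3", "pip3"]


def find_tools(names):
    """Map each tool name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


# Print one status line per tool.
for tool, path in find_tools(REQUIRED_TOOLS).items():
    status = f"found at {path}" if path else "MISSING"
    print(f"{tool}: {status}")
```

Any tool reported as MISSING points back to the install step that needs to be repeated.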




  • To connect Python with Java, we need the Py4J library. Let’s install it
    • Command: pip3 install py4j



  • Now let’s set up Spark
    • You can take any version; this walkthrough uses Spark 2.1.1, and all the commands below assume that version
    • Command: wget http://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz
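
If you pick a different version, both the download URL and the folder name change with it. As a hedged illustration, the sketch below builds them from a version string by following the pattern of the wget command above. The function names are hypothetical, and it is an assumption that the release you want is published under the same naming scheme, so check the archive listing before downloading.

```python
ARCHIVE_ROOT = "http://archive.apache.org/dist/spark"


def spark_folder_name(spark_version, hadoop_version="2.7"):
    """Folder name created when the archive is extracted."""
    return f"spark-{spark_version}-bin-hadoop{hadoop_version}"


def spark_archive_url(spark_version, hadoop_version="2.7"):
    """Download URL, following the pattern of the wget command in this post."""
    folder = spark_folder_name(spark_version, hadoop_version)
    return f"{ARCHIVE_ROOT}/spark-{spark_version}/{folder}.tgz"


print(spark_archive_url("2.1.1"))
# http://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz
```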



  • Let’s extract the tar archive, which unpacks Spark into its own folder
    • Command: sudo tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz


Let’s run a few Linux commands and note the Spark folder path
  • Command (list files): ls
  • Command (go to the Spark folder): cd spark-2.1.1-bin-hadoop2.7/
  • Command (print the present working directory): pwd
  • Output: “/home/ubuntu/spark-2.1.1-bin-hadoop2.7”
  • Command (go back to the ubuntu home directory): cd
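
Before using this path later on, it can help to confirm that the folder really is an extracted Spark distribution. The sketch below is one possible check, under the assumption that a binary Spark download contains `bin/spark-submit` and a bundled `python/pyspark` package. The helper name `looks_like_spark_home` is hypothetical.

```python
import os


def looks_like_spark_home(path):
    """Return True if `path` has the launcher script and the bundled PySpark."""
    expected = [
        os.path.join(path, "bin", "spark-submit"),  # launcher script
        os.path.join(path, "python", "pyspark"),    # bundled PySpark package
    ]
    return all(os.path.exists(entry) for entry in expected)


# Path taken from the pwd output above.
spark_home = "/home/ubuntu/spark-2.1.1-bin-hadoop2.7"
print(spark_home, "looks valid:", looks_like_spark_home(spark_home))
```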



  • Install the “findspark” utility, which helps Python locate the Spark installation
    • Command: pip3 install findspark



  • Create Jupyter configuration
    • Command: jupyter notebook --generate-config




  • Create a folder named certs and, inside it, create a .pem file holding a self-signed certificate and its key.

    • Command: cd
    • Command: mkdir certs
    • Command: cd certs
    • Command: sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout mycert.pem -out mycert.pem




  • openssl will prompt for certificate details (country, organization, and so on). Fill them in, or just press Enter to accept the defaults
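
Because the same file is passed to both -keyout and -out, mycert.pem should end up containing a private key block and a certificate block. The following sketch, with hypothetical helper names, checks for both by looking at the PEM header lines only; it does not validate the certificate itself.

```python
def pem_sections(pem_text):
    """Return the set of PEM block labels in the text, e.g. 'CERTIFICATE'."""
    prefix, suffix = "-----BEGIN ", "-----"
    labels = set()
    for line in pem_text.splitlines():
        line = line.strip()
        if line.startswith(prefix) and line.endswith(suffix):
            labels.add(line[len(prefix):-len(suffix)])
    return labels


def has_key_and_certificate(pem_text):
    """True if the text holds a certificate and some kind of private key."""
    labels = pem_sections(pem_text)
    has_cert = "CERTIFICATE" in labels
    # Covers both 'PRIVATE KEY' and 'RSA PRIVATE KEY' headers.
    has_key = any(label.endswith("PRIVATE KEY") for label in labels)
    return has_cert and has_key
```

To use it, read /home/ubuntu/certs/mycert.pem into a string and pass it to `has_key_and_certificate`. Since the file was created with sudo, reading it as the ubuntu user may require adjusting its ownership or permissions first.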




  • Now edit the config file

    • Command: cd ~/.jupyter/
    • Command (open the config file to edit): vi jupyter_notebook_config.py




  • This opens the vi editor



  • Press the “i” key (to enter insert mode and edit the file)




  • Enter the content below in the config file

    • c = get_config()
    • c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
    • c.NotebookApp.ip = '*'
    • c.NotebookApp.open_browser = False
    • c.NotebookApp.port = 8888





  • Press Escape
  • Type “:wq!” to write the file, quit, and return to the console
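
If you prefer not to edit the file in vi, the same settings can be appended from Python. This is a minimal sketch rather than the method used in this post: the function name `append_notebook_settings` is hypothetical, and it assumes the classic Notebook server, whose options live under `c.NotebookApp`.

```python
import os

# Settings from the step above; {certfile} and {port} are filled in below.
SETTINGS = """\
c = get_config()
c.NotebookApp.certfile = u'{certfile}'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = {port}
"""


def append_notebook_settings(config_path, certfile, port=8888):
    """Append the settings to the config file and return the text added."""
    block = SETTINGS.format(certfile=certfile, port=port)
    with open(config_path, "a", encoding="utf-8") as handle:
        handle.write("\n" + block)
    return block
```

A call such as `append_notebook_settings(os.path.expanduser("~/.jupyter/jupyter_notebook_config.py"), "/home/ubuntu/certs/mycert.pem")` would add the block to the generated config file. Appending keeps the commented defaults that Jupyter generated, so they stay available for reference.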







Open a Jupyter notebook session

  • Command:  cd
  • Command: jupyter notebook --ip=*




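  • Copy the URL that Jupyter prints in the console and replace “localhost” with your EC2 instance’s public DNS name. The instance’s security group must allow inbound traffic on port 8888, otherwise the browser cannot reach the server.




Paste the URL into your browser.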


  • The URL looks like: http://ec2-18-224-213-152.us-east-2.compute.amazonaws.com:8888/?token=2a18626548b9de30668c11255d92ab3948223486cf626c47
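
The host swap can be done by hand, but a small helper avoids losing the port or token along the way. The sketch below uses only the standard library; the function name `with_public_host` and the sample token are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def with_public_host(jupyter_url, public_dns):
    """Swap the host in a Jupyter URL, keeping scheme, port, path, and token."""
    parts = urlsplit(jupyter_url)
    netloc = public_dns if parts.port is None else f"{public_dns}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


local_url = "http://localhost:8888/?token=abc123"  # placeholder token
public_dns = "ec2-18-224-213-152.us-east-2.compute.amazonaws.com"
print(with_public_host(local_url, public_dns))
# http://ec2-18-224-213-152.us-east-2.compute.amazonaws.com:8888/?token=abc123
```

The scheme is preserved, so a URL printed as https:// stays https:// after the swap.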



Open a new Python notebook

Let’s run our first set of commands in PySpark. Enter the code below to import Spark from the installed Spark folder:

  • import findspark
  • findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')
  • import pyspark



You are now ready to run Spark code. The same steps work with other Spark versions; just adjust the file and folder names in the commands.




