How to install and set up Spark on Amazon Web Services (AWS) on Ubuntu
Quick Links:
- Spark Installation on Windows
- Spark Installation on AWS EC2 VM
- Spark Architecture
- Spark With Python Lab 1
- Spark Lab 2
- We have already set up an AWS EC2 instance (virtual machine) and can SSH to it from the local machine.
- To set up an AWS EC2 instance (click here for the installation setup)
- We are able to connect to AWS via PuTTY.
Install Components (Python, Scala, Jupyter, Java) to set up Spark on EC2
- Update the EC2 instance first; this ensures Python, pip3, and the other packages below install cleanly
- command: sudo apt-get update
- Now we will install pip3 so we can install Python packages
- Command: sudo apt install python3-pip
- It might ask for permission to continue; answer "Y" for yes
- Let's install Jupyter
- Command: pip3 install jupyter
- Now we will install Java before setting up Spark
- Java is required for Scala, and Scala is required for Spark
- Command: sudo apt-get install default-jre
- Now let’s install Scala
- Command: sudo apt-get install scala
Let's verify the installed versions
- Java: Command: java -version
- Scala: Command: scala -version
- Python: Command: python3 --version
- To connect Python with Java, we need the Py4J library; let's install it
- Command: pip3 install py4j
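Py4J lets a Python process create and call objects inside a Java virtual machine over a local socket; PySpark uses it under the hood to drive the JVM-based Spark engine. As a minimal sketch (not part of the Spark setup, purely to illustrate the bridge), you can launch a JVM and call Java from Python like this:

    from py4j.java_gateway import JavaGateway

    # Launch a JVM and connect to it (requires Java on the PATH)
    gateway = JavaGateway.launch_gateway()

    # Instantiate a Java object and call its method from Python
    random = gateway.jvm.java.util.Random()
    print(random.nextInt(100))  # prints a random int between 0 and 99

    gateway.shutdown()

You will never need to do this manually for Spark; PySpark manages its own gateway.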
- Now let's set up Spark
- Download the Spark tar file into the EC2 instance. You can browse and download any version from "http://archive.apache.org/dist/spark/"
- You can use any version, but this guide uses Spark 2.1.1 (the commands and paths below assume it)
- Command: wget http://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz
- Let's extract the tar file to unpack Spark
- Command: sudo tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz
Let's run a few Linux commands and note the Spark folder path
- Command (List files): ls
- Command (go to spark folder): cd spark-2.1.1-bin-hadoop2.7/
- Command (check present working directory): pwd
- Output: “/home/ubuntu/spark-2.1.1-bin-hadoop2.7”
- Command (come back to ubuntu home): cd
- Install the "findspark" utility; it will help us connect Python with Spark
- Command: pip3 install findspark
- Generate the Jupyter configuration file (this creates ~/.jupyter/jupyter_notebook_config.py)
- Command: jupyter notebook --generate-config
- Create a folder certs and generate a self-signed .pem certificate inside it.
- Command: cd
- Command: mkdir certs
- Command: cd certs
- Command: sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
- This will prompt for some information; fill it in or just press Enter to accept the defaults
- Now edit the config file
- Command: cd ~/.jupyter/
- Command (open the config file to edit): vi jupyter_notebook_config.py
- This will open the editor
- Press "i" (to enter insert mode)
- Enter the below content in the config file
- c = get_config()
- c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
- c.NotebookApp.ip = '*'
- c.NotebookApp.open_browser = False
- c.NotebookApp.port = 8888
- Press Escape
- Type ":wq!" to write, quit, and return to the console
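If you prefer not to edit the file in vi, here is a minimal sketch that appends the same settings from Python; it assumes the default config path that --generate-config creates:

    from pathlib import Path

    # Append the notebook settings to the generated Jupyter config file
    config = Path.home() / '.jupyter' / 'jupyter_notebook_config.py'
    with config.open('a') as f:
        f.write("\nc = get_config()\n")
        f.write("c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'\n")
        f.write("c.NotebookApp.ip = '*'\n")
        f.write("c.NotebookApp.open_browser = False\n")
        f.write("c.NotebookApp.port = 8888\n")

Either way, the result is the same config file.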
Open a Jupyter notebook session
- Command: cd
- Command: jupyter notebook --ip=*
- Copy the URL printed in the console and replace localhost with your EC2 instance's public DNS name.
Copy the URL into the browser. Since we configured a self-signed certificate, the notebook is served over https, and the browser will warn about the certificate; accept the warning to proceed.
- The URL looks like: https://ec2-18-224-213-152.us-east-2.compute.amazonaws.com:8888/?token=2a18626548b9de30668c11255d92ab3948223486cf626c47
Open a new Python notebook
- Run the below commands to load Spark from the installed Spark folder:
Let's run our first set of commands in PySpark
Write the below code
- import findspark
- findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')
- import pyspark
Now you are good to run any Spark code. You can set up any Spark version with these same steps; just adjust the version in the paths.
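As a quick end-to-end check, here is a minimal smoke test you can paste into a notebook cell; the app name and the sample numbers are just for illustration:

    import findspark
    findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')

    import pyspark

    # Start a SparkContext, run a trivial distributed job, and stop it
    sc = pyspark.SparkContext(appName='smoke_test')
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    print(rdd.map(lambda x: x * 2).collect())  # expected: [2, 4, 6, 8, 10]
    sc.stop()

If this prints the doubled list, Spark is wired up correctly.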