Pig Lab - Defining a Relation and Loading a Dataset into HDFS

 


Use Pig to navigate HDFS and explore a dataset.



Step 1: View the Raw Data on the Linux Machine


  • Change to the /root/Labs/Lab5.1 directory.
    • Command: cd /root/Labs/Lab5.1


  • List the files in the current folder.
    • Command: ls




  • Unzip the archive in the Lab5.1 folder, which contains a file named whitehouse_visits.txt.
    • Command: unzip whitehouse_visits.zip




  • View the last lines of this file.
    • Command: tail whitehouse_visits.txt

You should now see the last 10 lines of the file.
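
The Step 1 commands can be collected into one small shell function. This is only a sketch: the function name preview_raw_data and the variables LAB_DIR, ARCHIVE and DATA_FILE are illustrative, and the archive name whitehouse_visits.zip is an assumption, so adjust it to whatever ls shows in your lab folder.

```shell
# Sketch of Step 1. LAB_DIR, ARCHIVE and DATA_FILE are illustrative names.
LAB_DIR="${LAB_DIR:-/root/Labs/Lab5.1}"
ARCHIVE="${ARCHIVE:-whitehouse_visits.zip}"   # assumed archive name
DATA_FILE="${DATA_FILE:-whitehouse_visits.txt}"

preview_raw_data() {
    # Run in a subshell so the caller's working directory is left unchanged.
    (
        cd "$LAB_DIR" || { echo "missing directory: $LAB_DIR" >&2; exit 1; }
        ls
        # Extract only if the text file is not already present.
        if [ ! -f "$DATA_FILE" ]; then
            [ -f "$ARCHIVE" ] || { echo "missing archive: $ARCHIVE" >&2; exit 1; }
            unzip "$ARCHIVE"
        fi
        # tail prints the last 10 lines by default.
        tail "$DATA_FILE"
    )
}
```

Call preview_raw_data after defining it. Because the work happens in a subshell, your own shell stays in its original directory.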



Step 2: Load the Data into HDFS


  • Start the Grunt shell.
    • Command: pig


  • Make a new directory named whitehouse in HDFS.
    • Command: grunt> mkdir whitehouse



  • Use copyFromLocal to copy the whitehouse_visits.txt file into the whitehouse folder in HDFS, renaming it to visits.txt.
    • Command: grunt> copyFromLocal /root/Labs/Lab5.1/whitehouse_visits.txt whitehouse/visits.txt




  • Verify that the file uploaded successfully using the ls command.
    • Command: grunt> ls whitehouse
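
The same three operations can also be run from the Linux shell with the hdfs dfs client instead of Grunt. The sketch below wraps them in a function; load_visits, HDFS_CLI and LOCAL_FILE are illustrative names, not part of Hadoop. Relative HDFS paths such as whitehouse resolve under the current user's HDFS home directory.

```shell
# Sketch of Step 2 from the Linux shell. HDFS_CLI and LOCAL_FILE are
# illustrative variables; override them to suit your environment.
HDFS_CLI="${HDFS_CLI:-hdfs dfs}"
LOCAL_FILE="${LOCAL_FILE:-/root/Labs/Lab5.1/whitehouse_visits.txt}"

load_visits() {
    # $HDFS_CLI is deliberately unquoted so "hdfs dfs" splits into two words
    # (sh/bash behaviour). Each step runs only if the previous one succeeded.
    $HDFS_CLI -mkdir -p whitehouse &&
    $HDFS_CLI -copyFromLocal "$LOCAL_FILE" whitehouse/visits.txt &&
    $HDFS_CLI -ls whitehouse
}
```

Run load_visits on a machine where the hdfs client is on the PATH. The -p flag makes the mkdir step succeed even if the whitehouse directory already exists.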





Step 3: Define a Relation

  • You will use TextLoader to load the visits.txt file.
  • TextLoader simply creates a tuple for each line of text, with a single chararray field that contains the entire line.
  • It lets you load lines of text without worrying about the format or schema yet.
  • Define the following relation with LOAD:
    • Command: grunt> A = LOAD '/user/root/whitehouse/' USING TextLoader();
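
To picture what "one tuple per line, one field per tuple" means before running anything in Pig, the sketch below imitates the shape of such tuples with plain sed. It is not Pig, the function name as_textloader_tuples is made up for illustration, and the two sample lines are placeholders rather than real rows from the dataset.

```shell
# Imitation only: wrap every input line in parentheses, the way a tuple
# with a single whole-line field is displayed. Commas inside the line are
# left alone because the line is not split into fields.
as_textloader_tuples() {
    sed 's/.*/(&)/'
}

# Placeholder input, not real data from visits.txt.
printf '%s\n' 'a,b,c' 'd,e,f' | as_textloader_tuples
```

Each output line is one parenthesised tuple, and the commas stay inside it as ordinary characters of the single field.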







  • Use DESCRIBE to confirm that A does not have a schema.
    • Command: grunt> DESCRIBE A;







  • To get a sense of what the data looks like, use the LIMIT operator to define a new relation named A_Limit that is limited to 10 records of A.
    • Command: grunt> A_Limit = LIMIT A 10;






Step 4: View the Records


  • Use the DUMP operator to view the A_Limit relation.
    • Command: grunt> DUMP A_Limit;



  • This prints 10 arbitrary rows from visits.txt; LIMIT does not guarantee which 10 rows are returned.
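
Instead of typing the statements from Steps 3 and 4 into Grunt one at a time, they can be saved as a Pig script. The sketch below writes such a script from the shell; the file name whitehouse_preview.pig and the SCRIPT variable are illustrative, and the absolute path /user/root/whitehouse/ assumes you are working as the root user with /user/root as your HDFS home directory.

```shell
# Sketch: save the Grunt statements from Steps 3 and 4 as a script file.
# SCRIPT is an illustrative file name.
SCRIPT="${SCRIPT:-whitehouse_preview.pig}"

# The quoted 'EOF' stops the shell from expanding anything in the script.
cat > "$SCRIPT" <<'EOF'
-- Each line of the input becomes a tuple with one chararray field.
A = LOAD '/user/root/whitehouse/' USING TextLoader();
DESCRIBE A;
-- Keep 10 records; Pig does not promise which ones.
A_Limit = LIMIT A 10;
DUMP A_Limit;
EOF

echo "wrote $SCRIPT"
```

On a machine with Pig installed, the script can then be run with pig -f whitehouse_preview.pig. Every statement in the file ends with a semicolon, just as it does when entered at the grunt> prompt.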



