Implement Bucketing in Hive
What is Bucketing
- Bucketing in Hive is a data organization technique.
 
- It is also an I/O performance tuning technique.
 
- It is similar to partitioning in Hive, but it divides a large dataset into a fixed number of more manageable parts known as buckets.
 
- We can use bucketing in Hive when partitioning becomes difficult to implement, for example when a column has so many distinct values that partitioning on it would create too many small partitions.
 
- We can also divide partitions further into buckets.
 
Bucketing Logic:
- The concept of bucketing is based on the hashing technique.
 
- Here, the hash of the bucketing column's value is taken modulo the number of required buckets (say, F(x) % 3, where F is the hash function).
 
- Based on the resulting value, each row is stored in the corresponding bucket.
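
The bucketing logic above can be sketched in Python. This is a minimal illustration, not Hive's actual code: it assumes an integer bucketing column whose hash value is the integer itself, and the helper name `bucket_for` is hypothetical.

```python
def bucket_for(value: int, num_buckets: int) -> int:
    """Return the bucket index for an integer column value.

    Assumption: the hash of an int is the int itself, and the sign bit is
    masked off so that negative values still map to a valid bucket.
    """
    hash_value = value & 0x7FFFFFFF  # keep the result non-negative
    return hash_value % num_buckets


# With 3 buckets, ids 1..6 cycle through buckets 1, 2, 0.
for row_id in range(1, 7):
    print(row_id, "->", bucket_for(row_id, 3))
```

For string or other column types the hash function differs, so the resulting bucket cannot be read off the value as directly as it can for integers.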
 
Implement Bucketing
In the Hadoop sandbox terminal:
- Copy the files to HDFS
 
- Create a database and switch to it
 - create database bucket;
 - use bucket;
 
- Enable hive.enforce.bucketing (needed on Hive 1.x; from Hive 2.0 onward bucketing is always enforced)
 - set hive.enforce.bucketing = true;
 
- Create table emp and load the file 'databucket.csv' into it
 - create table emp(id int, name string, age int, sal int) row format delimited fields terminated by ',' ;
 - load data inpath 'databucket.csv' into table emp;
 
- Create table emp_bucket with bucketing on the id column
 - create table emp_bucket(id int, name string, age int, sal int) clustered by (id) into 4 buckets row format delimited fields terminated by ',' ;
 
- Load emp_bucket table from emp table
 - insert into table emp_bucket select id, name, age, sal from emp ;
 
Conclusion: Four files are created in the HDFS directory of the bucketed table emp_bucket, one file per bucket.
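
To see why four files appear, the sketch below groups a few rows by `id % 4`, the way the rows of emp would be spread over the four buckets of emp_bucket. It is only a simulation under stated assumptions: the sample rows are hypothetical (they are not the contents of databucket.csv), ids are assumed to be non-negative integers, and `group_by_bucket` is an illustrative helper, not a Hive function.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # matches "into 4 buckets" in the emp_bucket definition

# Hypothetical sample rows (id, name, age, sal); not the real input file.
rows = [
    (1, "emp1", 25, 1000), (2, "emp2", 30, 2000),
    (3, "emp3", 35, 3000), (4, "emp4", 40, 4000),
    (5, "emp5", 45, 5000), (6, "emp6", 50, 6000),
    (7, "emp7", 55, 7000), (8, "emp8", 60, 8000),
]


def group_by_bucket(rows, num_buckets):
    """Group rows by id % num_buckets (ids assumed to be non-negative ints)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[0] % num_buckets].append(row)
    return dict(buckets)


buckets = group_by_bucket(rows, NUM_BUCKETS)
for index in sorted(buckets):
    # Each group corresponds to one bucket file of the table.
    print(f"bucket {index}: ids {[r[0] for r in buckets[index]]}")
```

Each group stands for one bucket file, so rows with the same `id % 4` always end up together.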







