Hive organizes data using partitions. With partitioning, the data of a table is divided into related parts based on the values of the partition columns, such as Country or Department. This makes it easier to query a specific portion of the data.
Partitions are defined using the PARTITIONED BY clause at the time of table creation.
We can partition a table on more than one column. For example, we can create partitions on Country and State.
Syntax:
CREATE [EXTERNAL] TABLE table_name (col_name_1 data_type_1, ….)
PARTITIONED BY (col_name_n data_type_n , …);
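As a concrete illustration of the syntax above, here is a minimal HiveQL sketch. The table name `customers` and its columns are hypothetical and chosen only for this example. The partition columns appear only in the PARTITIONED BY clause and are not repeated in the regular column list. The INSERT ... VALUES form assumes Hive 0.14 or later.

```sql
-- Hypothetical table for illustration; all names are assumptions.
CREATE TABLE customers (
  customer_id INT,
  name        STRING
)
PARTITIONED BY (country STRING, state STRING);

-- Static partition insert: the partition values are given explicitly,
-- so the row is stored under the country=US/state=CA directory.
INSERT INTO TABLE customers PARTITION (country = 'US', state = 'CA')
VALUES (1, 'Alice');
```

Because Country is listed before State, the State directories are nested inside each Country directory.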
The following are features of partitioning:
- It is used for distributing the execution load horizontally.
- Query response is faster because the query is processed on a small part of the data instead of the entire dataset.
- If we select records for the US, they are fetched only from the directory ‘Country=US’ rather than from all directories.
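The last point can be sketched in HiveQL as follows. The `employees` table and its columns are hypothetical. The sketch assumes the filter is written directly on the partition column, which is what lets Hive skip the other partition directories.

```sql
-- Hypothetical table partitioned by country; names are assumptions.
CREATE TABLE IF NOT EXISTS employees (
  emp_id     INT,
  name       STRING,
  department STRING
)
PARTITIONED BY (country STRING);

-- The filter is on the partition column, so only the files under
-- the country=US directory of the table need to be read.
SELECT emp_id, name
FROM employees
WHERE country = 'US';
```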
Limitations:
- Having a large number of partitions creates a large number of files and directories in HDFS, which adds overhead for the NameNode because it maintains their metadata.
- It may optimize certain queries based on the WHERE clause, but may cause slow responses for queries based on a grouping clause.
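To make these limitations concrete, the sketch below assumes a hypothetical `employees` table that is partitioned by `country` and has a regular `department` column. Neither name comes from the original text.

```sql
-- Lists every partition of the table; a very long list is a hint
-- that the table may be putting pressure on the NameNode.
SHOW PARTITIONS employees;

-- No filter on the partition column: the grouping is on a regular
-- column, so every partition has to be scanned to answer the query.
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department;
```

Adding a condition such as `WHERE country = 'US'` to the second query would restrict the scan to a single partition.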
Partitioning can be used for log analysis: we can segregate the records based on a timestamp or date value to see the results day-wise or month-wise.
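A minimal sketch of the log analysis case is shown below. The `access_logs` table, its columns, and the dates are made up for illustration. The date is stored as a `yyyy-MM-dd` string in the partition column, which is one common choice but not the only one.

```sql
-- Hypothetical log table with one partition per day.
CREATE TABLE IF NOT EXISTS access_logs (
  ip         STRING,
  url        STRING,
  status     INT,
  event_time TIMESTAMP
)
PARTITIONED BY (log_date STRING);

-- Day-wise view: reads only the partition for a single day.
SELECT status, COUNT(*) AS hits
FROM access_logs
WHERE log_date = '2024-01-15'
GROUP BY status;

-- Month-wise view: reads only the partitions for January 2024.
SELECT substr(log_date, 1, 7) AS log_month, COUNT(*) AS hits
FROM access_logs
WHERE log_date >= '2024-01-01' AND log_date < '2024-02-01'
GROUP BY substr(log_date, 1, 7);
```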
Another use case is sales records partitioned by product type, country and month.
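The sales case could look like the sketch below. The `sales` table, its columns, and the filter values are hypothetical. The order of the partition columns is an assumption and would normally follow how the data is queried most often.

```sql
-- Hypothetical sales table partitioned on three columns.
CREATE TABLE IF NOT EXISTS sales (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (product_type STRING, country STRING, sale_month STRING);

-- Filtering on all three partition columns narrows the read to the
-- single directory product_type=Electronics/country=US/sale_month=2024-01.
SELECT SUM(amount) AS total_amount
FROM sales
WHERE product_type = 'Electronics'
  AND country = 'US'
  AND sale_month = '2024-01';
```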