Hive存储和压缩结合

官网:+ORC
ORC存储方式的压缩:

KeyDefaultNotes
orc.compressZLIBhigh level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size262,144number of bytes in each compression chunk
orc.stripe.size67,108,864number of bytes in each stripe
orc.row.index.stride10,000number of rows between index entries (must be >= 1000)
orc.create.indextruewhether to create row indexes
orc.bloom.filter.columns“”comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp0.05false positive probability for bloom filter (must >0.0 and <1.0)
  • 1)创建一个非压缩的的ORC存储方式
    (1)建表语句
create table log_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");
(2)插入数据
insert into table log_orc_none select * from log_text1 ;
(3)查看插入后数据
dfs -du -h /user/hive/warehouse/myhive.db/log_orc_none;


7.7 M /user/hive/warehouse/log_orc_none/123456_0

  • 2)创建一个SNAPPY压缩的ORC存储方式
    (1)建表语句
create table log_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
(2)插入数据
insert into table log_orc_snappy select * from log_text1 ;
(3)查看插入后数据
dfs -du -h /user/hive/warehouse/myhive.db/log_orc_snappy ;


3.8 M /user/hive/warehouse/log_orc_snappy/123456_0

  • 3)上一节中默认创建的ORC存储方式,导入数据后的大小为

    2.8 M /user/hive/warehouse/log_orc/123456_0
    比Snappy压缩的还小。原因是orc存储文件默认采用ZLIB压缩。比snappy压缩的小。

  • 4)存储方式和压缩总结:
    在实际的项目开发当中,hive表的数据存储格式一般选择:orc或parquet。压缩方式一般选择snappy。

Hive存储和压缩结合

官网:+ORC
ORC存储方式的压缩:

KeyDefaultNotes
orc.compressZLIBhigh level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size262,144number of bytes in each compression chunk
orc.stripe.size67,108,864number of bytes in each stripe
orc.row.index.stride10,000number of rows between index entries (must be >= 1000)
orc.create.indextruewhether to create row indexes
orc.bloom.filter.columns“”comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp0.05false positive probability for bloom filter (must >0.0 and <1.0)
  • 1)创建一个非压缩的的ORC存储方式
    (1)建表语句
create table log_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");
(2)插入数据
insert into table log_orc_none select * from log_text1 ;
(3)查看插入后数据
dfs -du -h /user/hive/warehouse/myhive.db/log_orc_none;


7.7 M /user/hive/warehouse/log_orc_none/123456_0

  • 2)创建一个SNAPPY压缩的ORC存储方式
    (1)建表语句
create table log_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
(2)插入数据
insert into table log_orc_snappy select * from log_text1 ;
(3)查看插入后数据
dfs -du -h /user/hive/warehouse/myhive.db/log_orc_snappy ;


3.8 M /user/hive/warehouse/log_orc_snappy/123456_0

  • 3)上一节中默认创建的ORC存储方式,导入数据后的大小为

    2.8 M /user/hive/warehouse/log_orc/123456_0
    比Snappy压缩的还小。原因是orc存储文件默认采用ZLIB压缩。比snappy压缩的小。

  • 4)存储方式和压缩总结:
    在实际的项目开发当中,hive表的数据存储格式一般选择:orc或parquet。压缩方式一般选择snappy。