Hundred-Billion-Scale Data Warehouse Project: Environment Initialization

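Contents

  • 3 Project Environment Initialization
    • 3.1 Hive Layering
    • 3.2 Creating the ods Layer Tables
    • 3.3 Full Data Extraction into the ods Layer
    • 3.4 Incremental Data Extraction into the ods Layer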

3 Project Environment Initialization

3.1 Hive Layering

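  • One database per layer
    ods layer
    dw layer
    ads layer

  • Naming rules

  • Tables in the ods layer keep the same names as the tables in the source database

  • dw layer tables
    the fact_ prefix marks a fact table
    the dim_ prefix marks a dimension table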

Create one database per layer:

-- hive>
create database itcast_ods;
create database itcast_dw;
create database itcast_ads;
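
If the initialization script may be run more than once, a slightly more defensive variant could look like the sketch below. It assumes nothing beyond the three database names above; `if not exists` and `show databases like` are standard HiveQL.

```sql
-- Idempotent variant: safe to run again if the databases already exist
create database if not exists itcast_ods;
create database if not exists itcast_dw;
create database if not exists itcast_ads;

-- List the databases whose names start with "itcast_" to confirm all three exist
show databases like 'itcast_*';
```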

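3.2 Creating the ods Layer Tables

  • Hive has external tables and internal (managed) tables. For ease of management, this section uses internal tables throughout. The difference between the two is whether the underlying data is actually deleted when the table is dropped. In practice the ods layer usually uses external tables, because these tables are shared by every department and their data must not be deleted casually.
    Run the script "ods层建表语句业务数据.sql" (the ods-layer DDL for the business data).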
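
The real DDL lives in the script named above and is not reproduced here. To illustrate the internal/external distinction only, here is a minimal sketch. The table names `demo_orders` and `demo_orders_ext`, the column `orderid`, and the HDFS path are all hypothetical; the `dt` partition column, Parquet storage, and Snappy compression follow what the later sections describe.

```sql
-- Managed (internal) table: DROP TABLE removes both the metadata and the data files.
-- orderid is a hypothetical column; createtime is kept as a string.
create table if not exists itcast_ods.demo_orders (
    orderid    bigint,
    createtime string
)
partitioned by (dt string)
stored as parquet
tblproperties ('parquet.compression'='SNAPPY');

-- External table: DROP TABLE removes only the metadata; the files stay in HDFS.
-- The location below is a hypothetical path.
create external table if not exists itcast_ods.demo_orders_ext (
    orderid    bigint,
    createtime string
)
partitioned by (dt string)
stored as parquet
location '/tmp/demo/itcast_ods/demo_orders_ext'
tblproperties ('parquet.compression'='SNAPPY');
```

Dropping `demo_orders` deletes its Parquet files, while dropping `demo_orders_ext` leaves them in place, which is why shared ods tables are usually external.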

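3.3 Full Data Extraction into the ods Layer

Steps:
1. Drag components onto the canvas to build the Kettle job.

2. Build the transformation and configure its named parameters.

3. Configure the Hive SQL script.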

-- When re-inserting data, uncomment the following statement
-- set hive.msck.path.validation=ignore;
msck repair table itcast_ods.itcast_orders;
msck repair table itcast_ods.itcast_goods;
msck repair table itcast_ods.itcast_order_goods;
msck repair table itcast_ods.itcast_shops;
msck repair table itcast_ods.itcast_goods_cats;
msck repair table itcast_ods.itcast_org;
msck repair table itcast_ods.itcast_order_refunds;
msck repair table itcast_ods.itcast_users;
msck repair table itcast_ods.itcast_user_address;
msck repair table itcast_ods.itcast_payments;
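
Kettle writes the Parquet files straight into HDFS, so the Hive metastore does not know about the new partition directories until they are registered. `msck repair table` scans the table's directory and adds any partitions it finds there. The sketch below shows one way to check the result, plus an explicit alternative for a single partition. The `dt` value is only an example, and the `alter table` form assumes the files sit in the table's default `dt=...` subdirectory.

```sql
-- List the partitions the metastore knows about after the repair
show partitions itcast_ods.itcast_orders;

-- Alternative for a single known partition: register it explicitly
-- (the dt value is only an example)
alter table itcast_ods.itcast_orders add if not exists partition (dt='20190910');
```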



4. Configure the Table Input step.

SELECT
*
FROM itcast_orders
WHERE DATE_FORMAT(createtime, '%Y%m%d') <= '${dt}';
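
Wrapping `createtime` in `DATE_FORMAT` makes MySQL evaluate the function for every row, which generally prevents it from using an index on that column. If `createtime` is a DATETIME or TIMESTAMP column (an assumption, since the source schema is not shown here), an equivalent range predicate can be sketched as follows.

```sql
-- Same rows as DATE_FORMAT(createtime, '%Y%m%d') <= '${dt}',
-- but the column is compared directly, so an index on createtime can be used
SELECT
*
FROM itcast_orders
WHERE createtime < DATE_ADD(STR_TO_DATE('${dt}', '%Y%m%d'), INTERVAL 1 DAY);
```

On a small training dataset the difference is negligible; it matters mainly when the source table is large.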



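5. Configure the Select Values step to set the date format, then configure the Parquet output with Snappy compression.

Configure the output file location.


Configure the output file content format.

Verify that all the data was loaded correctly: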

select * from itcast_ods.itcast_orders limit 2;
select * from itcast_ods.itcast_goods limit 2;
select * from itcast_ods.itcast_order_goods limit 2;
select * from itcast_ods.itcast_shops limit 2;
select * from itcast_ods.itcast_goods_cats limit 2;
select * from itcast_ods.itcast_org limit 2;
select * from itcast_ods.itcast_order_refunds limit 2;
select * from itcast_ods.itcast_users limit 2;
select * from itcast_ods.itcast_user_address limit 2;
select * from itcast_ods.itcast_payments limit 2;
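
A `limit 2` query only shows that a table is readable. Comparing row counts between Hive and the source database is a stronger check. The sketch below does this for the orders table; the literal date is only an example and should be replaced with the `dt` value passed to the Kettle job. The first statement runs in Hive, the second in MySQL.

```sql
-- In Hive: total rows loaded into the ods table
select count(*) from itcast_ods.itcast_orders;

-- In MySQL (source database): rows the Table Input query should have selected.
-- '20190910' is only an example; use the dt value given to the job.
SELECT count(*) FROM itcast_orders WHERE DATE_FORMAT(createtime, '%Y%m%d') <= '20190910';
```

If the two counts differ, the usual suspects are an unregistered partition or a filter that does not match the one in the transformation.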

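Notes:

  • 1: The itcast_orders, itcast_order_goods, and itcast_order_refunds tables are extracted by time; all other tables are extracted in full.
  • 2: When using the Hadoop File Output step, change the date format to UTF8; in the Parquet output's fields, change every Date type to UTF8 as well.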
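
Because date fields are written as UTF8 strings, they arrive in Hive as string columns and need a cast whenever a real date or timestamp is required. A minimal sketch, assuming `createtime` is stored in the `yyyy-MM-dd HH:mm:ss` form (the actual format depends on the Select Values configuration):

```sql
-- createtime is stored as a string; cast it when a real timestamp or date is needed
select
    createtime,
    cast(createtime as timestamp) as createtime_ts,
    to_date(createtime)           as createtime_date
from itcast_ods.itcast_orders
limit 2;
```

If the cast returns NULL, the stored string is not in the format Hive expects.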

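3.4 Incremental Data Extraction into the ods Layer

Incremental extraction works like full extraction, except that each run extracts only the previous day's data.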
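
The incremental transformation itself is not shown here. Assuming it reuses the Table Input query from section 3.3 and that `${dt}` is set to the previous day, the only change would be the comparison operator, as in this sketch:

```sql
-- Incremental variant of the Table Input query (MySQL):
-- select only the rows created on the day given by ${dt}
SELECT
*
FROM itcast_orders
WHERE DATE_FORMAT(createtime, '%Y%m%d') = '${dt}';
```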

Test SQL statements:

-- Query the 20190910 partition of each ods table
select * from itcast_ods.itcast_orders where dt='20190910' limit 2;
select * from itcast_ods.itcast_goods where dt='20190910' limit 2;
select * from itcast_ods.itcast_order_goods where dt='20190910' limit 2;
select * from itcast_ods.itcast_shops where dt='20190910' limit 2;
select * from itcast_ods.itcast_goods_cats where dt='20190910' limit 2;
select * from itcast_ods.itcast_org where dt='20190910' limit 2;
select * from itcast_ods.itcast_order_refunds where dt='20190910' limit 2;
select * from itcast_ods.itcast_users where dt='20190910' limit 2;
select * from itcast_ods.itcast_user_address where dt='20190910' limit 2;
select * from itcast_ods.itcast_payments where dt='20190910' limit 2;
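
To see at a glance which partitions each run produced, a per-partition row count can complement the spot checks above. This sketch uses only the `dt` partition column and a table name already used in this section:

```sql
-- Row count per partition; the newest dt should appear with a non-zero count
select dt, count(*) as row_cnt
from itcast_ods.itcast_orders
group by dt
order by dt;
```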