大家好,又见面了,我是你们的朋友全栈君。
目录
- 一、数仓分层
- 二、数仓理论
- 三、数仓环境搭建
- 四、数仓搭建-ODS层
-
- 4.1 ODS层(用户行为数据)
- 4.2 ODS层(业务数据)
-
- 4.2.1 活动信息表
- 4.2.2 活动规则表
- 4.2.3 一级品类表
- 4.2.4 二级品类表
- 4.2.5 三级品类表
- 4.2.6 编码字典表
- 4.2.7 省份表
- 4.2.8 地区表
- 4.2.9 品牌表
- 4.2.10 购物车表
- 4.2.11 评论表
- 4.2.12 优惠券信息表
- 4.2.13 优惠券领用表
- 4.2.14 收藏表
- 4.2.15 订单明细表
- 4.2.16 订单明细活动关联表
- 4.2.17 订单明细优惠券关联表
- 4.2.18 订单表
- 4.2.19 退单表
- 4.2.20 订单状态日志表
- 4.2.21 支付表
- 4.2.22 退款表
- 4.2.23 商品平台属性表
- 4.2.24 商品(SKU)表
- 4.2.25 商品销售属性表
- 4.2.26 商品(SPU)表
- 4.2.27 用户表
- 4.2.28 ODS层业务表首日数据装载脚本
- 4.2.29 ODS层业务表每日数据装载脚本
- 五、数仓搭建-DIM层
- 六、数仓搭建-DWD层
—————————————————–分隔符—————————————————–
数据仓库之电商数仓– 1、用户行为数据采集==>
数据仓库之电商数仓– 2、业务数据采集平台==>
数据仓库之电商数仓– 3.1、电商数据仓库系统(DIM层、ODS层、DWD层)==>
数据仓库之电商数仓– 3.2、电商数据仓库系统(DWS层)==>
数据仓库之电商数仓– 3.3、电商数据仓库系统(DWT层)==>
数据仓库之电商数仓– 3.4、电商数据仓库系统(ADS层)==>
数据仓库之电商数仓– 4、可视化报表Superset==>
数据仓库之电商数仓– 5、即席查询Kylin==>
一、数仓分层
1.1 为什么要分层
数仓分层
总结⭐️:
⭐️数据仓库分层:
ODS(Operation Data Store)层: 原始数据层,存放原始数据
。直接加载原始日志、数据,数据保持原貌不做处理;
DWD(Data Warehouse Detail)层: 对ODS层数据进行清洗
(去除空值、脏数据、超过极限范围的数据)、脱敏等,保存业务事实明细
,一行信息代表一次业务行为,例如一次下单行为;
DIM(Dimension)层: 维度层,保存维度数据,主要是对业务事实的描述信息
,如何人何处何地等;
DWS(Data Warehouse Service)层: 以DWD层为基础,按天进行轻度汇总
。一行信息代表一个主题对象一天的汇总行为,如一个用户一天下单次数;
DWT(Data Warehouse Topic)层: 以DWS层为基础,对数据进行累积汇总
。一行信息代表一个主题对象的累积行为,例如一个用户从注册开始至今下了多少单;
ADS(Application Data Store)层: 为各种统计报表提供数据
。
⭐️数据仓库为什么要分层?
- 把复杂问题简单化:
将复杂的任务分解
成多层来完成✅,每一层只处理简单的任务,方便定位问题
; - 减少重复开发:
规范数据分层,通过中间层数据
,能够减少极大地重复计算,增加一次计算结果的复用性
; - 隔离原始数据:
不论是数据的异常还是数据的敏感性,使真实数据与统计数据解耦
开。
1.2 数据集市与数据仓库概念
⭐️数据集市与数据仓库区别:
1.3 数仓命名规范
1.3.1 表命名
ODS层命名为ods_表名
DIM层命名为dim_表名
DWD层命名为dwd_表名
DWS层命名为dws_表名
DWT层命名为dwt_表名
ADS层命名为ads_表名
临时表命名为tmp_表名
1.3.2 脚本命名
数据源_to_目标_db/log.sh;
用户行为脚本以log为后缀;业务数据脚本以db为后缀。
1.3.3 表字段类型
数量类型为bigint;
金额类型为decimal(16, 2); 表示:16位有效数字,其中小数部分2位;
字符串(名字,描述信息等)类型为string;
主键外键类型为string;
时间戳类型为bigint;
二、数仓理论
2.1 范式理论
2.1.1 范式概念
-
定义
数据建模必须遵循一定的规则,在关系建模中,这种规则就是范式。 -
目的
采用范式,可以降低数据的冗余性。
为什么要降低数据冗余性?
1). 十几年前,磁盘很贵,为了减少磁盘存储;
2). 以前没有分布式系统,都是单机,只能增加磁盘,磁盘个数也是有限的;
3). 一次修改,需要修改多个表,很难保证数据一致性
。
-
缺点
范式的缺点是获取数据时,需要通过Join拼接出最后的数据。 -
分类
目前业界范式有:第一范式(1NF)、第二范式(2NF)、第三范式(3NF)
、巴斯-科德范式(BCNF)、第四范式(4NF)、第五范式(5NF)。
2.1.2 函数依赖
函数依赖:完全函数依赖、部分函数依赖、传递函数依赖。
2.1.3 三范式区分
-
第一范式1NF核心原则:属性不可分割;
-
第二范式2NF核心原则: 不能存在“
部分函数依赖
”,即不能存在非主键字段部分函数依赖于主键函数的现象;
-
第三范式3NF核心原则:
不能存在传递函数依赖
,即不能存在非主键字段传递函数依赖于主键字段的现象。
2.2 关系建模与维度建模
关系建模和维度建模是两种数据仓库的建模技术。关系建模由Bill Inmon所倡导,维度建模由Ralph Kimball所倡导。
2.2.1 关系建模
关系建模将复杂的数据抽象为两个概念——实体和关系,并使用规范化的方式表示出来。关系模型如图所示,从图中可以看出,较为松散、零碎,物理表数量多。
图为关系模型示意图:
关系模型严格遵循第三范式(3NF),数据冗余程度低,数据的一致性容易得到保证
。由于数据分布于众多的表中,查询会相对复杂,在大数据的场景下,查询效率相对较低
。
2.2.2 维度建模⭐️
维度模型如图所示,从图中可以看出,模型相对清晰、简洁。
图为维度模型示意图:
维度模型以数据分析作为出发点,不遵循三范式,故数据存在一定的冗余
。维度模型面向业务
,将业务用事实表
和维度表
呈现出来。表结构简单,故查询简单,查询效率较高
。
join少、shuffle少。
2.3 维度表和事实表⭐️
2.3.1 维度表
维度表:一般是对事实的描述信息
。每一张维表对应现实世界中的一个对象或者概念。
?用户、商品、日期、地区等。
维表的特征:
- 维表的范围很宽(具有多个属性、列比较多);
- 跟事实表相比,行数相对较小:通常< 10万条;
- 内容相对固定:编码表;
?:
时间维度表:
2.3.2 事实表
事实表中的每行数据代表一个业务事件
(下单、支付、退款、评价等
);
“事实”这个术语表示的是业务事件的度量值
(可统计次数、个数、金额等
)。
?2020年5月21日,小明花50块买了颗?;
维度表:时间、用户、商品、商家。事实表:50块钱、?。
每一个事实表的行包括:具有可加性的数值型的度量值、与维表相连接的外键,通常具有两个和两个以上的外键。
事实表的特征:
- 非常大;
- 内容相对的窄:列数较少(主要是外键id和度量值);
- 经常发生变化,每天会新增加很多。
1)事务型事实表
以每个事务或事件为单位
,例如一个销售订单记录,一笔支付记录等,作为事实表里的一行数据。一旦事务被提交,事实表数据被插入,数据就不再进行更改
,其更新方式为增量更新
。
2)周期型快照事实表
周期型快照事实表中不会保留所有数据,只保留固定时间间隔的数据,使用全量同步策略
,例如每天或者每月的销售额,或每月的账户余额等。
?购物车,有加减商品,随时都有可能变化,但是我们更关心每天结束时这里面有多少商品,方便我们后期统计分析。
3)累积型快照事实表
累积快照事实表用于跟踪业务事实的变化,使用新增和变化同步策略
。
?数据仓库中可能需要累积或者存储订单从下订单开始,到订单商品被打包、运输、和签收的各个业务阶段的时间点数据来跟踪订单声明周期的进展情况。当这个业务过程进行时,事实表的记录也要不断更新。
总结:
事务型事实表: 适用于不会发生变化的业务,通常使用增量同步;
周期型事实表:适用于不关心明细操作、只关心结果的业务,通常使用全量同步策略;
累积型事实表:适用于会发生周期性变化的业务,通常使用新增和变化同步策略。
2.4 维度模型分类
在维度建模的基础上又分为三种模型:星型模型、雪花模型、星座模型。
2.5 数据仓库建模⭐️?
2.5.1 ODS层
-
HDFS用户行为数据: 通过kafka-flume-kafka数据采集通道传输到hdfs的log日志;
-
HDFS业务数据:通sqoop从mysql同步到hdfs的数据文件;
-
针对HDFS上的用户行为数据和业务数据,我们如何规划处理?
1). 保持数据原貌不做任何修改,起到备份数据的作用;
2). 数据采用压缩,减少磁盘存储空间(例如:原始数据100G,可以压缩到10G左右);
3). 创建分区表,防止后续的全表扫描。
2.5.2 DIM层和DWD层
DIM层DWD层需构建维度模型,一般采用星型模型
,呈现的状态一般为星座模型。
维度建模一般按照以下四个步骤:
选择业务过程→声明粒度→确认维度→确认事实
-
选择业务过程
在业务系统中,挑选我们感兴趣的业务线,比如下单业务,支付业务,退款业务,物流业务,一条业务线对应一张事实表。 -
声明粒度
数据粒度指数据仓库的数据中保存数据的细化程度或综合程度的级别。
声明粒度意味着精确定义事实表中的一行数据表示什么,应该尽可能选择最小粒度
,以此来应各种各样的需求。典型的粒度声明如下:
订单事实表中一行数据表示的是一个订单中的一个商品项;
支付事实表中一行数据表示的是一个支付记录。 -
确定维度
维度的主要作用是描述业务是事实,主要表示的是“谁,何处,何时
”等信息。
确定维度的原则是:后续需求中是否要分析相关维度的指标。
?需要统计,什么时间下的订单多,哪个地区下的订单多,哪个用户下的订单多。需要确定的维度就包括:时间维度、地区维度、用户维度。
- 确定事实
此处的“事实”一词,指的是业务中的度量值(次数、个数、件数、金额,可以进行累加)
,例如订单金额、下单次数等。
在DWD层,以业务过程
为建模驱动,基于每个具体业务过程的特点,构建最细粒度
的明细层事实表。事实表可做适当的宽表化处理。
事实表和维度表的关联比较灵活,但是为了应对更复杂的业务需求,可以将能关联上的表尽量关联上。
2.5.3 DWS层与DWT层
DWS层和DWT层统称宽表层
,这两层的设计思想大致相同,通过以下案例进行阐述。
- 问题引出:两个需求,统计每个省份订单的个数、统计每个省份订单的总金额
- 处理办法:都是将省份表和订单表进行join,group by省份,然后计算。同样数据被计算了两次,实际上类似的场景还会更多。
那怎么设计能避免重复计算呢?
针对上述场景,可以设计一张地区宽表,其主键为地区ID,字段包含为:下单次数、下单金额、支付次数、支付金额等。上述所有指标都统一进行计算,并将结果保存在该宽表中,这样就能有效避免数据的重复计算。
- 总结:
1). 需要建哪些宽表:以维度为基准
。
2). 宽表里面的字段:是站在不同维度的角度去看事实表,重点关注事实表聚合后的度量值。
3). DWS和DWT层的区别:DWS层存放的所有主题对象当天的汇总行为
;
?每个地区当天的下单次数,下单金额等;
DWT层存放的是所有主题对象的累积行为
;
?每个地区最近7天(15天、30天、60天)的下单次数、下单金额等。
2.5.4 ADS层
对电商系统各大主题指标分别进行分析。
三、数仓环境搭建
3.1 Hive环境搭建
3.1.1 Hive引擎简介
Hive引擎包括:默认MR、tez、spark;
Hive on Spark: Hive既作为存储元数据又负责SQL的解析优化,语法是HQL语法
,执行引擎变成了Spark,Spark负责采用RDD执行。
Spark on Hive : Hive只作为存储元数据,Spark负责SQL解析优化,语法是Spark SQL语法
,Spark负责采用RDD执行。
对比:
Hive on Spark:周边生态更完整;
Spark on Hive:计算性能高。
3.1.2 Hive on Spark配置
1. 兼容性说明:
注⚠️: 官网下载的Hive3.1.2和Spark3.0.0默认是不兼容的。因为Hive3.1.2支持的Spark版本是2.4.5,所以需要我们重新编译Hive3.1.2版本。
编译步骤: 官网下载Hive3.1.2源码,修改pom文件中引用的Spark版本为3.0.0,如果编译通过,直接打包获取jar包。如果报错,就根据提示,修改相关方法,直到不报错,打包获取jar包。
2. 在Hive所在节点部署Spark:
- Spark官网下载jar包地址:
http://spark.apache.org/downloads.html - 上传
spark-3.0.0-bin-hadoop3.2.tgz
至/opt/software/spark
目录下:
[xiaobai@hadoop102 spark]$ ll
total 372316
-rw-r--r--. 1 root root 224453229 Oct 4 17:02 spark-3.0.0-bin-hadoop3.2.tgz
-rw-r--r--. 1 root root 156791324 Oct 4 17:01 spark-3.0.0-bin-without-hadoop.tgz
- 将重新编译后带有依赖的spark-3.0.0-bin-hadoop3.2.tgz解压至/opt/module/目录下:
[xiaobai@hadoop102 spark]$ tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module/
- 将解压后的spark-3.0.0-bin-hadoop3.2改成
spark
:
[xiaobai@hadoop102 module]$ mv spark-3.0.0-bin-hadoop3.2/ spark
- 配置SPARK_HOME环境变量
/etc/profile.d/my_env.sh
:
[xiaobai@hadoop102 module]$ sudo vim /etc/profile.d/my_env.sh
增加以下内容:
# SPARK_HOME
export SPARK_HOME=/opt/module/spark
export PATH=$PATH:$SPARK_HOME/bin
- source环境变量,使其生效:
[xiaobai@hadoop102 module]$ source /etc/profile.d/my_env.sh
3. 在hive中创建spark配置文件:
- 在hive中创建spark配置文件
spark-defaults.conf
:
[xiaobai@hadoop102 software]$ vim /opt/module/hive/conf/spark-defaults.conf
- 添加如下内容(在执行任务时,会根据如下参数执行):
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadoop102:8020/spark-history
spark.executor.memory 1g
spark.driver.memory 1g
- 在HDFS创建
spark-history
路径,用于存储历史日志:
[xiaobai@hadoop102 software]$ hadoop fs -mkdir /spark-history
4. 向HDFS上传Spark纯净版jar包:
- 在hdfs创建
spark-jars
路径:
[xiaobai@hadoop102 spark]$ hadoop fs -mkdir /spark-jars
- 解压
spark-3.0.0-bin-without-hadoop.tgz
至/opt/software/spark
路径:
[xiaobai@hadoop102 spark]$ tar -zxvf spark-3.0.0-bin-without-hadoop.tgz
- 将解压后的Spark纯净版jar包
spark-3.0.0-bin-without-hadoop
上传到hdfs/spark-jars路径下:
[xiaobai@hadoop102 spark]$ hadoop fs -put spark-3.0.0-bin-without-hadoop/jars/* /spark-jars
注⚠️:
- 由于Spark3.0.0非纯净版默认支持的是hive2.3.7版本,直接使用会和安装的Hive3.1.2出现兼容性问题。所以采用Spark纯净版jar包,不包含hadoop和hive相关依赖,避免冲突。
- Hive任务最终由Spark来执行,Spark任务资源分配由Yarn来调度,该任务有可能被分配到集群的任何一个节点。所以需要将Spark的依赖上传到HDFS集群路径,这样集群中任何一个节点都能获取到。
5. 修改hive-site.xml文件:
- 在/opt/module/hive/conf路径下修改
hive-site.xml
文件:
[xiaobai@hadoop102 jars]$ vim /opt/module/hive/conf/hive-site.xml
- 增加以下内容:
<!--Spark依赖位置(注意:端口号8020必须和namenode的端口号一致)-->
<property>
<name>spark.yarn.jars</name>
<value>hdfs://hadoop102:8020/spark-jars/*</value>
</property>
<!--Hive执行引擎-->
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
3.1.3 Hive on Spark测试
- 启动hive客户端:
[xiaobai@hadoop102 jars]$ hive
- 创建一张测试表
student
:
hive (default)> create table student(id int,name string);
OK
Time taken: 2.96 seconds
- 通过insert测试效果:
hive (default)> insert into table student values(1001,'Tom');
若出现以下结果,则说明配置成功!
3.2 Yarn配置
3.2.1 增加ApplicationMaster资源比例
容量调度器对每个资源队列中同时运行的Application Master占用的资源进行了限制,该限制通过yarn.scheduler.capacity.maximum-am-resource-percent
参数实现,其默认值是0.1
,表示每个资源队列上Application Master最多可使用的资源为该队列总资源的10%
,目的是防止大部分资源都被Application Master占用,而导致Map/Reduce Task无法执行。
生产环境该参数可使用默认值。
因本项目使用的是Linux虚拟机,集群资源很少,为防止同一时刻只能运行一个Job的情况出现,将默认值调大为0.8;
- 在hadoop102的/opt/module/hadoop-3.2.2/etc/hadoop/
capacity-scheduler.xml
文件中修改如下参数值:
[xiaobai@hadoop102 hadoop]$ vim capacity-scheduler.xml
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.8</value>
<description>
2.分发capacity-scheduler.xml配置文件:
[xiaobai@hadoop102 hadoop]$ xsync capacity-scheduler.xml
- 正在运行的任务,在
hadoop103
关闭重新启动yarn集群:
[xiaobai@hadoop103 ~]$ stop-yarn.sh
Stopping nodemanagers
Stopping resourcemanager
[xiaobai@hadoop103 ~]$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers
可在http://hadoop103:8088/cluster/scheduler–>Application Queues–>>Queue:default下查看默认值和修改后的值:
修改前:
修改后:
3.3 数仓开发环境
数仓开发工具可选用DBeaver或DataGrip,
官方链接:
https://www.jetbrains.com/datagrip/
https://dbeaver.io/download/
以下为Mac版本下载⬇️
两者都需要用到JDBC协议连接到Hive,故需要启动HiveServer2!
- 启动
HiveServer2
[xiaobai@hadoop102 hive]$ hiveserver2
- 配置DataGrip连接
1). 创建连接
2). 配置连接属性:可点击Test Connection进行连接测试,随机点击OK⬇️
- 测试使用
创建数据库gmall,并观察是否创建成功。
1). 创建数据库
2). 查看数据库
3). 修改连接,指明连接数据库
如图,需在右上角选择gmall数据库,为防止遗忘,可修改properties为gmall
点击properties按钮,修改连接:
将Schema设置为我们需要的数据库gmall:
4). 选择当前数据库为gmall;
3.4 数据准备
假定数仓上线的日期为2020-06-14.
1. 用户行为日志
用户行为日志,一般是没有历史数据的,故日志只需要准备2020-06-14一天的数据。
1). 启动日志采集通道,包括Flume、Kafak等;
2). 修改两个日志服务器(hadoop102、hadoop103)中的/opt/module/applog/application.yml
配置文件,将mock.date
参数改为2020-06-14;
[xiaobai@hadoop102 applog]$ vim application.yml
3). 执行日志生成脚本lg.sh
;
[xiaobai@hadoop102 applog]$ lg.sh
4). 查看HDFS是否出现相应文件⬇️
2. 业务数据
业务数据一般存在历史数据,此处需准备2020-06-10至2020-06-14的数据。具体操作如下。
1). 修改hadoop102节点上的/opt/module/db_log/application.properties
文件,将mock.date
、mock.clear
,mock.clear.user
三个参数:
[xiaobai@hadoop102 db_log]$ vim application.properties
tips:2020-06-10为第一天数据,所以重置需设为1!
2). 执行模拟生成业务数据的命令,生成第一天2020-06-10的历史数据:
[xiaobai@hadoop102 db_log]$ java -jar gmall2020-mock-db-2021-01-22.jar
3). 设置第二天2020-06-11的数据,修改参数为:
注⚠️:重置参数只有第一天需要设置为1!
4). 执行模拟生成业务数据的命令,生成第二天2020-06-11的历史数据:
[xiaobai@hadoop102 db_log]$ java -jar gmall2020-mock-db-2021-01-22.jar
以此类推,设置2020-06-10到2020-06-14多天数据并生成!
5). 在/home/xiaobai/bin目录下执行mysql_to_hdfs_init.sh脚本,将模拟生成的业务数据同步到HDFS。
[xiaobai@hadoop102 bin]$ ./mysql_to_hdfs_init.sh all 2020-06-14
四、数仓搭建-ODS层
- 保持数据原貌不做任何修改,起到备份数据的作用;
- 数据采用LZO压缩,减少磁盘存储空间。100G数据可以压缩到10G以内;
- 创建分区表,防止后续的全表扫描,在企业开发中大量使用分区表;
- 创建外部表,在企业开发中,除了自己用的临时表,创建内部表外,绝大多数场景都是创建外部表。
4.1 ODS层(用户行为数据)
4.1.1 创建日志表ods_log
- 创建支持lzo压缩的分区表
1). 建表语句
create database gmall;
--ODS层
--ods日志表
drop table if exists ods_log;
CREATE EXTERNAL TABLE ods_log (`line` string)
PARTITIONED BY (`dt` string) -- 按照时间创建分区
STORED AS -- 指定存储方式,读数据采用LzoTextInputFormat;
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_log' -- 指定数据在hdfs上的存储位置
;
2). 分区规划
2. 加载数据
4.1.2 ODS层日志表加载数据脚本
- 在/home/xiaobai/bin创建一个
hdfs_to_ods_log.sh
文件:
[xiaobai@hadoop102 bin]$ vim hdfs_to_ods_log.sh
在文件中添加如下内容:
#!/bin/bash
# 定义变量方便修改
APP=gmall
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
echo ================== 日志日期为 $do_date ==================
sql=" load data inpath '/origin_data/$APP/log/topic_log/$do_date' into table ${APP}.ods_log partition(dt='$do_date'); "
hive -e "$sql"
hadoop jar /opt/module/hadoop-3.2.2/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /warehouse/$APP/ods/ods_log/dt=$do_date
tips:
[ -n 变量值 ] 判断变量的值,是否为空;
– 变量的值,非空,返回true;
– 变量的值,为空,返回false;
注意:[ -n 变量值 ]不会解析数据,使用[ -n 变量值 ]时,需要对变量加上双引号(” “);
查看date命令的使用,date –help.
- 增加脚本执行权限:
[xiaobai@hadoop102 bin]$ chmod +x hdfs_to_ods_log.sh
- 脚本使用:
执行脚本:
[xiaobai@hadoop102 bin]$ ./hdfs_to_ods_log.sh 2020-06-14
在dataGrip中查看导入数据
4.2 ODS层(业务数据)
ODS层业务表分区规划如下:
ODS层业务表数据装载思路如下:
4.2.1 活动信息表
DROP TABLE IF EXISTS ods_activity_info;
CREATE EXTERNAL TABLE ods_activity_info(
`id` STRING COMMENT '编号',
`activity_name` STRING COMMENT '活动名称',
`activity_type` STRING COMMENT '活动类型',
`start_time` STRING COMMENT '开始时间',
`end_time` STRING COMMENT '结束时间',
`create_time` STRING COMMENT '创建时间'
) COMMENT '活动信息表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_activity_info/';
4.2.2 活动规则表
DROP TABLE IF EXISTS ods_activity_rule;
CREATE EXTERNAL TABLE ods_activity_rule(
`id` STRING COMMENT '编号',
`activity_id` STRING COMMENT '活动ID',
`activity_type` STRING COMMENT '活动类型',
`condition_amount` DECIMAL(16,2) COMMENT '满减金额',
`condition_num` BIGINT COMMENT '满减件数',
`benefit_amount` DECIMAL(16,2) COMMENT '优惠金额',
`benefit_discount` DECIMAL(16,2) COMMENT '优惠折扣',
`benefit_level` STRING COMMENT '优惠级别'
) COMMENT '活动规则表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_activity_rule/';
4.2.3 一级品类表
DROP TABLE IF EXISTS ods_base_category1;
CREATE EXTERNAL TABLE ods_base_category1(
`id` STRING COMMENT 'id',
`name` STRING COMMENT '名称'
) COMMENT '商品一级分类表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_base_category1/';
4.2.4 二级品类表
DROP TABLE IF EXISTS ods_base_category2;
CREATE EXTERNAL TABLE ods_base_category2(
`id` STRING COMMENT ' id',
`name` STRING COMMENT '名称',
`category1_id` STRING COMMENT '一级品类id'
) COMMENT '商品二级分类表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_base_category2/';
4.2.5 三级品类表
DROP TABLE IF EXISTS ods_base_category3;
CREATE EXTERNAL TABLE ods_base_category3(
`id` STRING COMMENT ' id',
`name` STRING COMMENT '名称',
`category2_id` STRING COMMENT '二级品类id'
) COMMENT '商品三级分类表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_base_category3/';
4.2.6 编码字典表
DROP TABLE IF EXISTS ods_base_dic;
CREATE EXTERNAL TABLE ods_base_dic(
`dic_code` STRING COMMENT '编号',
`dic_name` STRING COMMENT '编码名称',
`parent_code` STRING COMMENT '父编码',
`create_time` STRING COMMENT '创建日期',
`operate_time` STRING COMMENT '操作日期'
) COMMENT '编码字典表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_base_dic/';
4.2.7 省份表
DROP TABLE IF EXISTS ods_base_province;
CREATE EXTERNAL TABLE ods_base_province (
`id` STRING COMMENT '编号',
`name` STRING COMMENT '省份名称',
`region_id` STRING COMMENT '地区ID',
`area_code` STRING COMMENT '地区编码',
`iso_code` STRING COMMENT 'ISO-3166编码,供可视化使用',
`iso_3166_2` STRING COMMENT 'IOS-3166-2编码,供可视化使用'
) COMMENT '省份表'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_base_province/';
4.2.8 地区表
DROP TABLE IF EXISTS ods_base_region;
CREATE EXTERNAL TABLE ods_base_region (
`id` STRING COMMENT '编号',
`region_name` STRING COMMENT '地区名称'
) COMMENT '地区表'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_base_region/';
4.2.9 品牌表
DROP TABLE IF EXISTS ods_base_trademark;
CREATE EXTERNAL TABLE ods_base_trademark (
`id` STRING COMMENT '编号',
`tm_name` STRING COMMENT '品牌名称'
) COMMENT '品牌表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_base_trademark/';
4.2.10 购物车表
DROP TABLE IF EXISTS ods_cart_info;
CREATE EXTERNAL TABLE ods_cart_info(
`id` STRING COMMENT '编号',
`user_id` STRING COMMENT '用户id',
`sku_id` STRING COMMENT 'skuid',
`cart_price` DECIMAL(16,2) COMMENT '放入购物车时价格',
`sku_num` BIGINT COMMENT '数量',
`sku_name` STRING COMMENT 'sku名称 (冗余)',
`create_time` STRING COMMENT '创建时间',
`operate_time` STRING COMMENT '修改时间',
`is_ordered` STRING COMMENT '是否已经下单',
`order_time` STRING COMMENT '下单时间',
`source_type` STRING COMMENT '来源类型',
`source_id` STRING COMMENT '来源编号'
) COMMENT '加购表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_cart_info/';
4.2.11 评论表
DROP TABLE IF EXISTS ods_comment_info;
CREATE EXTERNAL TABLE ods_comment_info(
`id` STRING COMMENT '编号',
`user_id` STRING COMMENT '用户ID',
`sku_id` STRING COMMENT '商品sku',
`spu_id` STRING COMMENT '商品spu',
`order_id` STRING COMMENT '订单ID',
`appraise` STRING COMMENT '评价',
`create_time` STRING COMMENT '评价时间'
) COMMENT '商品评论表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_comment_info/';
4.2.12 优惠券信息表
DROP TABLE IF EXISTS ods_coupon_info;
CREATE EXTERNAL TABLE ods_coupon_info(
`id` STRING COMMENT '购物券编号',
`coupon_name` STRING COMMENT '购物券名称',
`coupon_type` STRING COMMENT '购物券类型 1 现金券 2 折扣券 3 满减券 4 满件打折券',
`condition_amount` DECIMAL(16,2) COMMENT '满额数',
`condition_num` BIGINT COMMENT '满件数',
`activity_id` STRING COMMENT '活动编号',
`benefit_amount` DECIMAL(16,2) COMMENT '减金额',
`benefit_discount` DECIMAL(16,2) COMMENT '折扣',
`create_time` STRING COMMENT '创建时间',
`range_type` STRING COMMENT '范围类型 1、商品 2、品类 3、品牌',
`limit_num` BIGINT COMMENT '最多领用次数',
`taken_count` BIGINT COMMENT '已领用次数',
`start_time` STRING COMMENT '开始领取时间',
`end_time` STRING COMMENT '结束领取时间',
`operate_time` STRING COMMENT '修改时间',
`expire_time` STRING COMMENT '过期时间'
) COMMENT '优惠券表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_coupon_info/';
4.2.13 优惠券领用表
DROP TABLE IF EXISTS ods_coupon_use;
CREATE EXTERNAL TABLE ods_coupon_use(
`id` STRING COMMENT '编号',
`coupon_id` STRING COMMENT '优惠券ID',
`user_id` STRING COMMENT 'skuid',
`order_id` STRING COMMENT 'spuid',
`coupon_status` STRING COMMENT '优惠券状态',
`get_time` STRING COMMENT '领取时间',
`using_time` STRING COMMENT '使用时间(下单)',
`used_time` STRING COMMENT '使用时间(支付)',
`expire_time` STRING COMMENT '过期时间'
) COMMENT '优惠券领用表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_coupon_use/';
4.2.14 收藏表
DROP TABLE IF EXISTS ods_favor_info;
CREATE EXTERNAL TABLE ods_favor_info(
`id` STRING COMMENT '编号',
`user_id` STRING COMMENT '用户id',
`sku_id` STRING COMMENT 'skuid',
`spu_id` STRING COMMENT 'spuid',
`is_cancel` STRING COMMENT '是否取消',
`create_time` STRING COMMENT '收藏时间',
`cancel_time` STRING COMMENT '取消时间'
) COMMENT '商品收藏表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_favor_info/';
4.2.15 订单明细表
DROP TABLE IF EXISTS ods_order_detail;
CREATE EXTERNAL TABLE ods_order_detail(
`id` STRING COMMENT '编号',
`order_id` STRING COMMENT '订单号',
`sku_id` STRING COMMENT '商品id',
`sku_name` STRING COMMENT '商品名称',
`order_price` DECIMAL(16,2) COMMENT '商品价格',
`sku_num` BIGINT COMMENT '商品数量',
`create_time` STRING COMMENT '创建时间',
`source_type` STRING COMMENT '来源类型',
`source_id` STRING COMMENT '来源编号',
`split_final_amount` DECIMAL(16,2) COMMENT '分摊最终金额',
`split_activity_amount` DECIMAL(16,2) COMMENT '分摊活动优惠',
`split_coupon_amount` DECIMAL(16,2) COMMENT '分摊优惠券优惠'
) COMMENT '订单详情表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_order_detail/';
4.2.16 订单明细活动关联表
DROP TABLE IF EXISTS ods_order_detail_activity;
CREATE EXTERNAL TABLE ods_order_detail_activity(
`id` STRING COMMENT '编号',
`order_id` STRING COMMENT '订单号',
`order_detail_id` STRING COMMENT '订单明细id',
`activity_id` STRING COMMENT '活动id',
`activity_rule_id` STRING COMMENT '活动规则id',
`sku_id` BIGINT COMMENT '商品id',
`create_time` STRING COMMENT '创建时间'
) COMMENT '订单详情活动关联表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_order_detail_activity/';
4.2.17 订单明细优惠券关联表
DROP TABLE IF EXISTS ods_order_detail_coupon;
CREATE EXTERNAL TABLE ods_order_detail_coupon(
`id` STRING COMMENT '编号',
`order_id` STRING COMMENT '订单号',
`order_detail_id` STRING COMMENT '订单明细id',
`coupon_id` STRING COMMENT '优惠券id',
`coupon_use_id` STRING COMMENT '优惠券领用记录id',
`sku_id` STRING COMMENT '商品id',
`create_time` STRING COMMENT '创建时间'
) COMMENT '订单详情活动关联表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_order_detail_coupon/';
4.2.18 订单表
DROP TABLE IF EXISTS ods_order_info;
CREATE EXTERNAL TABLE ods_order_info (
`id` STRING COMMENT '订单号',
`final_amount` DECIMAL(16,2) COMMENT '订单最终金额',
`order_status` STRING COMMENT '订单状态',
`user_id` STRING COMMENT '用户id',
`payment_way` STRING COMMENT '支付方式',
`delivery_address` STRING COMMENT '送货地址',
`out_trade_no` STRING COMMENT '支付流水号',
`create_time` STRING COMMENT '创建时间',
`operate_time` STRING COMMENT '操作时间',
`expire_time` STRING COMMENT '过期时间',
`tracking_no` STRING COMMENT '物流单编号',
`province_id` STRING COMMENT '省份ID',
`activity_reduce_amount` DECIMAL(16,2) COMMENT '活动减免金额',
`coupon_reduce_amount` DECIMAL(16,2) COMMENT '优惠券减免金额',
`original_amount` DECIMAL(16,2) COMMENT '订单原价金额',
`feight_fee` DECIMAL(16,2) COMMENT '运费',
`feight_fee_reduce` DECIMAL(16,2) COMMENT '运费减免'
) COMMENT '订单表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_order_info/';
4.2.19 退单表
DROP TABLE IF EXISTS ods_order_refund_info;
CREATE EXTERNAL TABLE ods_order_refund_info(
`id` STRING COMMENT '编号',
`user_id` STRING COMMENT '用户ID',
`order_id` STRING COMMENT '订单ID',
`sku_id` STRING COMMENT '商品ID',
`refund_type` STRING COMMENT '退单类型',
`refund_num` BIGINT COMMENT '退单件数',
`refund_amount` DECIMAL(16,2) COMMENT '退单金额',
`refund_reason_type` STRING COMMENT '退单原因类型',
`refund_status` STRING COMMENT '退单状态',--退单状态应包含买家申请、卖家审核、卖家收货、退款完成等状态。此处未涉及到,故该表按增量处理
`create_time` STRING COMMENT '退单时间'
) COMMENT '退单表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_order_refund_info/';
4.2.20 订单状态日志表
DROP TABLE IF EXISTS ods_order_status_log;
CREATE EXTERNAL TABLE ods_order_status_log (
`id` STRING COMMENT '编号',
`order_id` STRING COMMENT '订单ID',
`order_status` STRING COMMENT '订单状态',
`operate_time` STRING COMMENT '修改时间'
) COMMENT '订单状态表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_order_status_log/';
4.2.21 支付表
DROP TABLE IF EXISTS ods_payment_info;
CREATE EXTERNAL TABLE ods_payment_info(
`id` STRING COMMENT '编号',
`out_trade_no` STRING COMMENT '对外业务编号',
`order_id` STRING COMMENT '订单编号',
`user_id` STRING COMMENT '用户编号',
`payment_type` STRING COMMENT '支付类型',
`trade_no` STRING COMMENT '交易编号',
`payment_amount` DECIMAL(16,2) COMMENT '支付金额',
`subject` STRING COMMENT '交易内容',
`payment_status` STRING COMMENT '支付状态',
`create_time` STRING COMMENT '创建时间',
`callback_time` STRING COMMENT '回调时间'
) COMMENT '支付流水表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_payment_info/';
4.2.22 退款表
DROP TABLE IF EXISTS ods_refund_payment;
CREATE EXTERNAL TABLE ods_refund_payment(
`id` STRING COMMENT '编号',
`out_trade_no` STRING COMMENT '对外业务编号',
`order_id` STRING COMMENT '订单编号',
`sku_id` STRING COMMENT 'SKU编号',
`payment_type` STRING COMMENT '支付类型',
`trade_no` STRING COMMENT '交易编号',
`refund_amount` DECIMAL(16,2) COMMENT '支付金额',
`subject` STRING COMMENT '交易内容',
`refund_status` STRING COMMENT '支付状态',
`create_time` STRING COMMENT '创建时间',
`callback_time` STRING COMMENT '回调时间'
) COMMENT '支付流水表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_refund_payment/';
4.2.23 商品平台属性表
DROP TABLE IF EXISTS ods_sku_attr_value;
CREATE EXTERNAL TABLE ods_sku_attr_value(
`id` STRING COMMENT '编号',
`attr_id` STRING COMMENT '平台属性ID',
`value_id` STRING COMMENT '平台属性值ID',
`sku_id` STRING COMMENT '商品ID',
`attr_name` STRING COMMENT '平台属性名称',
`value_name` STRING COMMENT '平台属性值名称'
) COMMENT 'sku平台属性表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_sku_attr_value/';
4.2.24 商品(SKU)表
DROP TABLE IF EXISTS ods_sku_info;
CREATE EXTERNAL TABLE ods_sku_info(
`id` STRING COMMENT 'skuId',
`spu_id` STRING COMMENT 'spuid',
`price` DECIMAL(16,2) COMMENT '价格',
`sku_name` STRING COMMENT '商品名称',
`sku_desc` STRING COMMENT '商品描述',
`weight` DECIMAL(16,2) COMMENT '重量',
`tm_id` STRING COMMENT '品牌id',
`category3_id` STRING COMMENT '品类id',
`is_sale` STRING COMMENT '是否在售',
`create_time` STRING COMMENT '创建时间'
) COMMENT 'SKU商品表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_sku_info/';
4.2.25 商品销售属性表
DROP TABLE IF EXISTS ods_sku_sale_attr_value;
CREATE EXTERNAL TABLE ods_sku_sale_attr_value(
`id` STRING COMMENT '编号',
`sku_id` STRING COMMENT 'sku_id',
`spu_id` STRING COMMENT 'spu_id',
`sale_attr_value_id` STRING COMMENT '销售属性值id',
`sale_attr_id` STRING COMMENT '销售属性id',
`sale_attr_name` STRING COMMENT '销售属性名称',
`sale_attr_value_name` STRING COMMENT '销售属性值名称'
) COMMENT 'sku销售属性名称'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_sku_sale_attr_value/';
4.2.26 商品(SPU)表
DROP TABLE IF EXISTS ods_spu_info;
CREATE EXTERNAL TABLE ods_spu_info(
`id` STRING COMMENT 'spuid',
`spu_name` STRING COMMENT 'spu名称',
`category3_id` STRING COMMENT '品类id',
`tm_id` STRING COMMENT '品牌id'
) COMMENT 'SPU商品表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_spu_info/';
4.2.27 用户表
DROP TABLE IF EXISTS ods_user_info;
CREATE EXTERNAL TABLE ods_user_info(
`id` STRING COMMENT '用户id',
`login_name` STRING COMMENT '用户名称',
`nick_name` STRING COMMENT '用户昵称',
`name` STRING COMMENT '用户姓名',
`phone_num` STRING COMMENT '手机号码',
`email` STRING COMMENT '邮箱',
`user_level` STRING COMMENT '用户等级',
`birthday` STRING COMMENT '生日',
`gender` STRING COMMENT '性别',
`create_time` STRING COMMENT '创建时间',
`operate_time` STRING COMMENT '操作时间'
) COMMENT '用户表'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_user_info/';
4.2.28 ODS层业务表首日数据装载脚本
- 编写脚本: 在/home/xiaobai/bin目录下创建脚本
hdfs_to_ods_db_init.sh
:
[xiaobai@hadoop102 bin]$ vim hdfs_to_ods_db_init.sh
在脚本中填写如下内容:
#!/bin/bash
APP=gmall
if [ -n "$2" ] ;then
do_date=$2
else
echo "请传入日期参数"
exit
fi
ods_order_info=" load data inpath '/origin_data/$APP/db/order_info/$do_date' OVERWRITE into table ${APP}.ods_order_info partition(dt='$do_date');"
ods_order_detail=" load data inpath '/origin_data/$APP/db/order_detail/$do_date' OVERWRITE into table ${APP}.ods_order_detail partition(dt='$do_date');"
ods_sku_info=" load data inpath '/origin_data/$APP/db/sku_info/$do_date' OVERWRITE into table ${APP}.ods_sku_info partition(dt='$do_date');"
ods_user_info=" load data inpath '/origin_data/$APP/db/user_info/$do_date' OVERWRITE into table ${APP}.ods_user_info partition(dt='$do_date');"
ods_payment_info=" load data inpath '/origin_data/$APP/db/payment_info/$do_date' OVERWRITE into table ${APP}.ods_payment_info partition(dt='$do_date');"
ods_base_category1=" load data inpath '/origin_data/$APP/db/base_category1/$do_date' OVERWRITE into table ${APP}.ods_base_category1 partition(dt='$do_date');"
ods_base_category2=" load data inpath '/origin_data/$APP/db/base_category2/$do_date' OVERWRITE into table ${APP}.ods_base_category2 partition(dt='$do_date');"
ods_base_category3=" load data inpath '/origin_data/$APP/db/base_category3/$do_date' OVERWRITE into table ${APP}.ods_base_category3 partition(dt='$do_date'); "
ods_base_trademark=" load data inpath '/origin_data/$APP/db/base_trademark/$do_date' OVERWRITE into table ${APP}.ods_base_trademark partition(dt='$do_date'); "
ods_activity_info=" load data inpath '/origin_data/$APP/db/activity_info/$do_date' OVERWRITE into table ${APP}.ods_activity_info partition(dt='$do_date'); "
ods_cart_info=" load data inpath '/origin_data/$APP/db/cart_info/$do_date' OVERWRITE into table ${APP}.ods_cart_info partition(dt='$do_date'); "
ods_comment_info=" load data inpath '/origin_data/$APP/db/comment_info/$do_date' OVERWRITE into table ${APP}.ods_comment_info partition(dt='$do_date'); "
ods_coupon_info=" load data inpath '/origin_data/$APP/db/coupon_info/$do_date' OVERWRITE into table ${APP}.ods_coupon_info partition(dt='$do_date'); "
ods_coupon_use=" load data inpath '/origin_data/$APP/db/coupon_use/$do_date' OVERWRITE into table ${APP}.ods_coupon_use partition(dt='$do_date'); "
ods_favor_info=" load data inpath '/origin_data/$APP/db/favor_info/$do_date' OVERWRITE into table ${APP}.ods_favor_info partition(dt='$do_date'); "
ods_order_refund_info=" load data inpath '/origin_data/$APP/db/order_refund_info/$do_date' OVERWRITE into table ${APP}.ods_order_refund_info partition(dt='$do_date'); "
ods_order_status_log=" load data inpath '/origin_data/$APP/db/order_status_log/$do_date' OVERWRITE into table ${APP}.ods_order_status_log partition(dt='$do_date'); "
ods_spu_info=" load data inpath '/origin_data/$APP/db/spu_info/$do_date' OVERWRITE into table ${APP}.ods_spu_info partition(dt='$do_date'); "
ods_activity_rule=" load data inpath '/origin_data/$APP/db/activity_rule/$do_date' OVERWRITE into table ${APP}.ods_activity_rule partition(dt='$do_date');"
ods_base_dic=" load data inpath '/origin_data/$APP/db/base_dic/$do_date' OVERWRITE into table ${APP}.ods_base_dic partition(dt='$do_date'); "
ods_order_detail_activity=" load data inpath '/origin_data/$APP/db/order_detail_activity/$do_date' OVERWRITE into table ${APP}.ods_order_detail_activity partition(dt='$do_date'); "
ods_order_detail_coupon=" load data inpath '/origin_data/$APP/db/order_detail_coupon/$do_date' OVERWRITE into table ${APP}.ods_order_detail_coupon partition(dt='$do_date'); "
ods_refund_payment=" load data inpath '/origin_data/$APP/db/refund_payment/$do_date' OVERWRITE into table ${APP}.ods_refund_payment partition(dt='$do_date'); "
ods_sku_attr_value=" load data inpath '/origin_data/$APP/db/sku_attr_value/$do_date' OVERWRITE into table ${APP}.ods_sku_attr_value partition(dt='$do_date'); "
ods_sku_sale_attr_value=" load data inpath '/origin_data/$APP/db/sku_sale_attr_value/$do_date' OVERWRITE into table ${APP}.ods_sku_sale_attr_value partition(dt='$do_date'); "
ods_base_province=" load data inpath '/origin_data/$APP/db/base_province/$do_date' OVERWRITE into table ${APP}.ods_base_province;"
ods_base_region=" load data inpath '/origin_data/$APP/db/base_region/$do_date' OVERWRITE into table ${APP}.ods_base_region;"
case $1 in
"ods_order_info"){
hive -e "$ods_order_info"
};;
"ods_order_detail"){
hive -e "$ods_order_detail"
};;
"ods_sku_info"){
hive -e "$ods_sku_info"
};;
"ods_user_info"){
hive -e "$ods_user_info"
};;
"ods_payment_info"){
hive -e "$ods_payment_info"
};;
"ods_base_category1"){
hive -e "$ods_base_category1"
};;
"ods_base_category2"){
hive -e "$ods_base_category2"
};;
"ods_base_category3"){
hive -e "$ods_base_category3"
};;
"ods_base_trademark"){
hive -e "$ods_base_trademark"
};;
"ods_activity_info"){
hive -e "$ods_activity_info"
};;
"ods_cart_info"){
hive -e "$ods_cart_info"
};;
"ods_comment_info"){
hive -e "$ods_comment_info"
};;
"ods_coupon_info"){
hive -e "$ods_coupon_info"
};;
"ods_coupon_use"){
hive -e "$ods_coupon_use"
};;
"ods_favor_info"){
hive -e "$ods_favor_info"
};;
"ods_order_refund_info"){
hive -e "$ods_order_refund_info"
};;
"ods_order_status_log"){
hive -e "$ods_order_status_log"
};;
"ods_spu_info"){
hive -e "$ods_spu_info"
};;
"ods_activity_rule"){
hive -e "$ods_activity_rule"
};;
"ods_base_dic"){
hive -e "$ods_base_dic"
};;
"ods_order_detail_activity"){
hive -e "$ods_order_detail_activity"
};;
"ods_order_detail_coupon"){
hive -e "$ods_order_detail_coupon"
};;
"ods_refund_payment"){
hive -e "$ods_refund_payment"
};;
"ods_sku_attr_value"){
hive -e "$ods_sku_attr_value"
};;
"ods_sku_sale_attr_value"){
hive -e "$ods_sku_sale_attr_value"
};;
"ods_base_province"){
hive -e "$ods_base_province"
};;
"ods_base_region"){
hive -e "$ods_base_region"
};;
"all"){
hive -e "$ods_order_info$ods_order_detail$ods_sku_info$ods_user_info$ods_payment_info$ods_base_category1$ods_base_category2$ods_base_category3$ods_base_trademark$ods_activity_info$ods_cart_info$ods_comment_info$ods_coupon_info$ods_coupon_use$ods_favor_info$ods_order_refund_info$ods_order_status_log$ods_spu_info$ods_activity_rule$ods_base_dic$ods_order_detail_activity$ods_order_detail_coupon$ods_refund_payment$ods_sku_attr_value$ods_sku_sale_attr_value$ods_base_province$ods_base_region"
};;
esac
- 增加执行权限:
[xiaobai@hadoop102 bin]$ chmod +x hdfs_to_ods_db_init.sh
- 执行脚本
[xiaobai@hadoop102 bin]$ ./hdfs_to_ods_db_init.sh all 2020-06-14
在dataGrip中查看数据是否导入成功,以下成功!
4.2.29 ODS层业务表每日数据装载脚本
- 创建脚本: 在/home/xiaobai/bin目录下创建脚本
hdfs_to_ods_db.sh
:
[xiaobai@hadoop102 bin]$ vim hdfs_to_ods_db.sh
在脚本中填写如下内容:
#!/bin/bash
APP=gmall
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$2" ] ;then
do_date=$2
else
do_date=`date -d "-1 day" +%F`
fi
ods_order_info=" load data inpath '/origin_data/$APP/db/order_info/$do_date' OVERWRITE into table ${APP}.ods_order_info partition(dt='$do_date');"
ods_order_detail=" load data inpath '/origin_data/$APP/db/order_detail/$do_date' OVERWRITE into table ${APP}.ods_order_detail partition(dt='$do_date');"
ods_sku_info=" load data inpath '/origin_data/$APP/db/sku_info/$do_date' OVERWRITE into table ${APP}.ods_sku_info partition(dt='$do_date');"
ods_user_info=" load data inpath '/origin_data/$APP/db/user_info/$do_date' OVERWRITE into table ${APP}.ods_user_info partition(dt='$do_date');"
ods_payment_info=" load data inpath '/origin_data/$APP/db/payment_info/$do_date' OVERWRITE into table ${APP}.ods_payment_info partition(dt='$do_date');"
ods_base_category1=" load data inpath '/origin_data/$APP/db/base_category1/$do_date' OVERWRITE into table ${APP}.ods_base_category1 partition(dt='$do_date');"
ods_base_category2=" load data inpath '/origin_data/$APP/db/base_category2/$do_date' OVERWRITE into table ${APP}.ods_base_category2 partition(dt='$do_date');"
ods_base_category3=" load data inpath '/origin_data/$APP/db/base_category3/$do_date' OVERWRITE into table ${APP}.ods_base_category3 partition(dt='$do_date'); "
ods_base_trademark=" load data inpath '/origin_data/$APP/db/base_trademark/$do_date' OVERWRITE into table ${APP}.ods_base_trademark partition(dt='$do_date'); "
ods_activity_info=" load data inpath '/origin_data/$APP/db/activity_info/$do_date' OVERWRITE into table ${APP}.ods_activity_info partition(dt='$do_date'); "
ods_cart_info=" load data inpath '/origin_data/$APP/db/cart_info/$do_date' OVERWRITE into table ${APP}.ods_cart_info partition(dt='$do_date'); "
ods_comment_info=" load data inpath '/origin_data/$APP/db/comment_info/$do_date' OVERWRITE into table ${APP}.ods_comment_info partition(dt='$do_date'); "
ods_coupon_info=" load data inpath '/origin_data/$APP/db/coupon_info/$do_date' OVERWRITE into table ${APP}.ods_coupon_info partition(dt='$do_date'); "
ods_coupon_use=" load data inpath '/origin_data/$APP/db/coupon_use/$do_date' OVERWRITE into table ${APP}.ods_coupon_use partition(dt='$do_date'); "
ods_favor_info=" load data inpath '/origin_data/$APP/db/favor_info/$do_date' OVERWRITE into table ${APP}.ods_favor_info partition(dt='$do_date'); "
ods_order_refund_info=" load data inpath '/origin_data/$APP/db/order_refund_info/$do_date' OVERWRITE into table ${APP}.ods_order_refund_info partition(dt='$do_date'); "
ods_order_status_log=" load data inpath '/origin_data/$APP/db/order_status_log/$do_date' OVERWRITE into table ${APP}.ods_order_status_log partition(dt='$do_date'); "
ods_spu_info=" load data inpath '/origin_data/$APP/db/spu_info/$do_date' OVERWRITE into table ${APP}.ods_spu_info partition(dt='$do_date'); "
ods_activity_rule=" load data inpath '/origin_data/$APP/db/activity_rule/$do_date' OVERWRITE into table ${APP}.ods_activity_rule partition(dt='$do_date');"
ods_base_dic=" load data inpath '/origin_data/$APP/db/base_dic/$do_date' OVERWRITE into table ${APP}.ods_base_dic partition(dt='$do_date'); "
ods_order_detail_activity=" load data inpath '/origin_data/$APP/db/order_detail_activity/$do_date' OVERWRITE into table ${APP}.ods_order_detail_activity partition(dt='$do_date'); "
ods_order_detail_coupon=" load data inpath '/origin_data/$APP/db/order_detail_coupon/$do_date' OVERWRITE into table ${APP}.ods_order_detail_coupon partition(dt='$do_date'); "
ods_refund_payment=" load data inpath '/origin_data/$APP/db/refund_payment/$do_date' OVERWRITE into table ${APP}.ods_refund_payment partition(dt='$do_date'); "
ods_sku_attr_value=" load data inpath '/origin_data/$APP/db/sku_attr_value/$do_date' OVERWRITE into table ${APP}.ods_sku_attr_value partition(dt='$do_date'); "
ods_sku_sale_attr_value=" load data inpath '/origin_data/$APP/db/sku_sale_attr_value/$do_date' OVERWRITE into table ${APP}.ods_sku_sale_attr_value partition(dt='$do_date'); "
ods_base_province=" load data inpath '/origin_data/$APP/db/base_province/$do_date' OVERWRITE into table ${APP}.ods_base_province;"
ods_base_region=" load data inpath '/origin_data/$APP/db/base_region/$do_date' OVERWRITE into table ${APP}.ods_base_region;"
case $1 in
"ods_order_info"){
hive -e "$ods_order_info"
};;
"ods_order_detail"){
hive -e "$ods_order_detail"
};;
"ods_sku_info"){
hive -e "$ods_sku_info"
};;
"ods_user_info"){
hive -e "$ods_user_info"
};;
"ods_payment_info"){
hive -e "$ods_payment_info"
};;
"ods_base_category1"){
hive -e "$ods_base_category1"
};;
"ods_base_category2"){
hive -e "$ods_base_category2"
};;
"ods_base_category3"){
hive -e "$ods_base_category3"
};;
"ods_base_trademark"){
hive -e "$ods_base_trademark"
};;
"ods_activity_info"){
hive -e "$ods_activity_info"
};;
"ods_cart_info"){
hive -e "$ods_cart_info"
};;
"ods_comment_info"){
hive -e "$ods_comment_info"
};;
"ods_coupon_info"){
hive -e "$ods_coupon_info"
};;
"ods_coupon_use"){
hive -e "$ods_coupon_use"
};;
"ods_favor_info"){
hive -e "$ods_favor_info"
};;
"ods_order_refund_info"){
hive -e "$ods_order_refund_info"
};;
"ods_order_status_log"){
hive -e "$ods_order_status_log"
};;
"ods_spu_info"){
hive -e "$ods_spu_info"
};;
"ods_activity_rule"){
hive -e "$ods_activity_rule"
};;
"ods_base_dic"){
hive -e "$ods_base_dic"
};;
"ods_order_detail_activity"){
hive -e "$ods_order_detail_activity"
};;
"ods_order_detail_coupon"){
hive -e "$ods_order_detail_coupon"
};;
"ods_refund_payment"){
hive -e "$ods_refund_payment"
};;
"ods_sku_attr_value"){
hive -e "$ods_sku_attr_value"
};;
"ods_sku_sale_attr_value"){
hive -e "$ods_sku_sale_attr_value"
};;
"all"){
hive -e "$ods_order_info$ods_order_detail$ods_sku_info$ods_user_info$ods_payment_info$ods_base_category1$ods_base_category2$ods_base_category3$ods_base_trademark$ods_activity_info$ods_cart_info$ods_comment_info$ods_coupon_info$ods_coupon_use$ods_favor_info$ods_order_refund_info$ods_order_status_log$ods_spu_info$ods_activity_rule$ods_base_dic$ods_order_detail_activity$ods_order_detail_coupon$ods_refund_payment$ods_sku_attr_value$ods_sku_sale_attr_value"
};;
esac
- 修改权限:
[xiaobai@hadoop102 bin]$ chmod +x hdfs_to_ods_db.sh
- 执行脚本:
在2020-06-15数据更新执行!
hdfs_to_ods_db.sh all 2020-06-14
五、数仓搭建-DIM层
5.1 商品维度表(全量)
tips:
商品维度表
每日采用全量同步,故首日装载语句与每日装载语句除了日期之外,都相同!
- 建表语句
DROP TABLE IF EXISTS dim_sku_info;
CREATE EXTERNAL TABLE dim_sku_info (
`id` STRING COMMENT '商品id',
`price` DECIMAL(16,2) COMMENT '商品价格',
`sku_name` STRING COMMENT '商品名称',
`sku_desc` STRING COMMENT '商品描述',
`weight` DECIMAL(16,2) COMMENT '重量',
`is_sale` BOOLEAN COMMENT '是否在售',
`spu_id` STRING COMMENT 'spu编号',
`spu_name` STRING COMMENT 'spu名称',
`category3_id` STRING COMMENT '三级分类id',
`category3_name` STRING COMMENT '三级分类名称',
`category2_id` STRING COMMENT '二级分类id',
`category2_name` STRING COMMENT '二级分类名称',
`category1_id` STRING COMMENT '一级分类id',
`category1_name` STRING COMMENT '一级分类名称',
`tm_id` STRING COMMENT '品牌id',
`tm_name` STRING COMMENT '品牌名称',
`sku_attr_values` ARRAY<STRUCT<attr_id:STRING,value_id:STRING,attr_name:STRING,value_name:STRING>> COMMENT '平台属性',
`sku_sale_attr_values` ARRAY<STRUCT<sale_attr_id:STRING,sale_attr_value_id:STRING,sale_attr_name:STRING,sale_attr_value_name:STRING>> COMMENT '销售属性',
`create_time` STRING COMMENT '创建时间'
) COMMENT '商品维度表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dim/dim_sku_info/'
TBLPROPERTIES ("parquet.compression"="lzo");
- 分区规划
商品维度表分区:
- 数据装载
1). 首日装载
with
sku as
(
select
id,
price,
sku_name,
sku_desc,
weight,
is_sale,
spu_id,
category3_id,
tm_id,
create_time
from ods_sku_info
where dt='2020-06-14'
),
spu as
(
select
id,
spu_name
from ods_spu_info
where dt='2020-06-14'
),
c3 as
(
select
id,
name,
category2_id
from ods_base_category3
where dt='2020-06-14'
),
c2 as
(
select
id,
name,
category1_id
from ods_base_category2
where dt='2020-06-14'
),
c1 as
(
select
id,
name
from ods_base_category1
where dt='2020-06-14'
),
tm as
(
select
id,
tm_name
from ods_base_trademark
where dt='2020-06-14'
),
attr as
(
select
sku_id,
collect_set(named_struct('attr_id',attr_id,'value_id',value_id,'attr_name',attr_name,'value_name',value_name)) attrs
from ods_sku_attr_value
where dt='2020-06-14'
group by sku_id
),
sale_attr as
(
select
sku_id,
collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sale_attrs
from ods_sku_sale_attr_value
where dt='2020-06-14'
group by sku_id
)
insert overwrite table dim_sku_info partition(dt='2020-06-14')
select
sku.id,
sku.price,
sku.sku_name,
sku.sku_desc,
sku.weight,
sku.is_sale,
sku.spu_id,
spu.spu_name,
sku.category3_id,
c3.name,
c3.category2_id,
c2.name,
c2.category1_id,
c1.name,
sku.tm_id,
tm.tm_name,
attr.attrs,
sale_attr.sale_attrs,
sku.create_time
from sku
left join spu on sku.spu_id=spu.id
left join c3 on sku.category3_id=c3.id
left join c2 on c3.category2_id=c2.id
left join c1 on c2.category1_id=c1.id
left join tm on sku.tm_id=tm.id
left join attr on sku.id=attr.sku_id
left join sale_attr on sku.id=sale_attr.sku_id;
如图,执行首日装载语句,数据正常导入!
but dim_sku_info表的最后一行标红部分为异常数据,是因为hive默认情况下对map端的小文件进行合并导致,而insert语句会被解析成一个计算任务来读取与ods层相关的业务数据,这些数据被压缩存储为.lzo
与其对应的索引文件.index
,此2个文件都较小,所以hive从表中读取数据时会误将此2个文件当作普通小文件进行合并,这会导致lzo文件无法切片。
如图,hive.input.format默认值为hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;故我们需关闭此小文件合并功能。
✅解决方法:
修改CombineHiveInputFormat为HiveInputFormat
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
重新执行首日装载语句,异常数据消失!
注⚠️: 如果当我们读取的表是lzo文件类型且为它创建了索引,此时我们就要关闭小文件合并功能!
2). 每日装载
with
sku as
(
select
id,
price,
sku_name,
sku_desc,
weight,
is_sale,
spu_id,
category3_id,
tm_id,
create_time
from ods_sku_info
where dt='2020-06-15'
),
spu as
(
select
id,
spu_name
from ods_spu_info
where dt='2020-06-15'
),
c3 as
(
select
id,
name,
category2_id
from ods_base_category3
where dt='2020-06-15'
),
c2 as
(
select
id,
name,
category1_id
from ods_base_category2
where dt='2020-06-15'
),
c1 as
(
select
id,
name
from ods_base_category1
where dt='2020-06-15'
),
tm as
(
select
id,
tm_name
from ods_base_trademark
where dt='2020-06-15'
),
attr as
(
select
sku_id,
collect_set(named_struct('attr_id',attr_id,'value_id',value_id,'attr_name',attr_name,'value_name',value_name)) attrs
from ods_sku_attr_value
where dt='2020-06-15'
group by sku_id
),
sale_attr as
(
select
sku_id,
collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sale_attrs
from ods_sku_sale_attr_value
where dt='2020-06-15'
group by sku_id
)
insert overwrite table dim_sku_info partition(dt='2020-06-15')
select
sku.id,
sku.price,
sku.sku_name,
sku.sku_desc,
sku.weight,
sku.is_sale,
sku.spu_id,
spu.spu_name,
sku.category3_id,
c3.name,
c3.category2_id,
c2.name,
c2.category1_id,
c1.name,
sku.tm_id,
tm.tm_name,
attr.attrs,
sale_attr.sale_attrs,
sku.create_time
from sku
left join spu on sku.spu_id=spu.id
left join c3 on sku.category3_id=c3.id
left join c2 on c3.category2_id=c2.id
left join c1 on c2.category1_id=c1.id
left join tm on sku.tm_id=tm.id
优惠券维度表(全量)
tips:
优惠券维度表
每日采用全量同步,故首日装载语句与每日装载语句除了日期之外,都相同!
- 建表语句
DROP TABLE IF EXISTS dim_coupon_info;
CREATE EXTERNAL TABLE dim_coupon_info(
`id` STRING COMMENT '购物券编号',
`coupon_name` STRING COMMENT '购物券名称',
`coupon_type` STRING COMMENT '购物券类型 1 现金券 2 折扣券 3 满减券 4 满件打折券',
`condition_amount` DECIMAL(16,2) COMMENT '满额数',
`condition_num` BIGINT COMMENT '满件数',
`activity_id` STRING COMMENT '活动编号',
`benefit_amount` DECIMAL(16,2) COMMENT '减金额',
`benefit_discount` DECIMAL(16,2) COMMENT '折扣',
`create_time` STRING COMMENT '创建时间',
`range_type` STRING COMMENT '范围类型 1、商品 2、品类 3、品牌',
`limit_num` BIGINT COMMENT '最多领取次数',
`taken_count` BIGINT COMMENT '已领取次数',
`start_time` STRING COMMENT '可以领取的开始日期',
`end_time` STRING COMMENT '可以领取的结束日期',
`operate_time` STRING COMMENT '修改时间',
`expire_time` STRING COMMENT '过期时间'
) COMMENT '优惠券维度表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dim/dim_coupon_info/'
TBLPROPERTIES ("parquet.compression"="lzo");
- 分区规划
- 数据装载
1). 首日装载
insert overwrite table dim_coupon_info partition(dt='2020-06-14')
select
id,
coupon_name,
coupon_type,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
create_time,
range_type,
limit_num,
taken_count,
start_time,
end_time,
operate_time,
expire_time
from ods_coupon_info
where dt='2020-06-14';
2). 每日装载
insert overwrite table dim_coupon_info partition(dt='2020-06-15')
select
id,
coupon_name,
coupon_type,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
create_time,
range_type,
limit_num,
taken_count,
start_time,
end_time,
operate_time,
expire_time
from ods_coupon_info
where dt='2020-06-15';
5.3 活动维度表(全量)
tip:每行数据为一个活动规则,而非一个活动!
- 建表语句
DROP TABLE IF EXISTS dim_activity_rule_info;
CREATE EXTERNAL TABLE dim_activity_rule_info(
`activity_rule_id` STRING COMMENT '活动规则ID',
`activity_id` STRING COMMENT '活动ID',
`activity_name` STRING COMMENT '活动名称',
`activity_type` STRING COMMENT '活动类型',
`start_time` STRING COMMENT '开始时间',
`end_time` STRING COMMENT '结束时间',
`create_time` STRING COMMENT '创建时间',
`condition_amount` DECIMAL(16,2) COMMENT '满减金额',
`condition_num` BIGINT COMMENT '满减件数',
`benefit_amount` DECIMAL(16,2) COMMENT '优惠金额',
`benefit_discount` DECIMAL(16,2) COMMENT '优惠折扣',
`benefit_level` STRING COMMENT '优惠级别'
) COMMENT '活动信息表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dim/dim_activity_rule_info/'
TBLPROPERTIES ("parquet.compression"="lzo");
-
分区规划
-
数据装载
1). 首日装载
insert overwrite table dim_activity_rule_info partition(dt='2020-06-14')
select
ar.id,
ar.activity_id,
ai.activity_name,
ar.activity_type,
ai.start_time,
ai.end_time,
ai.create_time,
ar.condition_amount,
ar.condition_num,
ar.benefit_amount,
ar.benefit_discount,
ar.benefit_level
from
(
select
id,
activity_id,
activity_type,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
benefit_level
from ods_activity_rule
where dt='2020-06-14'
)ar
left join
(
select
id,
activity_name,
start_time,
end_time,
create_time
from ods_activity_info
where dt='2020-06-14'
)ai
on ar.activity_id=ai.id;
2). 每日转载
insert overwrite table dim_activity_rule_info partition(dt='2020-06-15')
select
ar.id,
ar.activity_id,
ai.activity_name,
ar.activity_type,
ai.start_time,
ai.end_time,
ai.create_time,
ar.condition_amount,
ar.condition_num,
ar.benefit_amount,
ar.benefit_discount,
ar.benefit_level
from
(
select
id,
activity_id,
activity_type,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
benefit_level
from ods_activity_rule
where dt='2020-06-15'
)ar
left join
(
select
id,
activity_name,
start_time,
end_time,
create_time
from ods_activity_info
where dt='2020-06-15'
)ai
on ar.activity_id=ai.id;
5.4 地区维度表(特殊)
tips: 地区维度表数据相对稳定,变化概率较低,故无需每日装载。
- 建表语句
DROP TABLE IF EXISTS dim_base_province;
CREATE EXTERNAL TABLE dim_base_province (
`id` STRING COMMENT 'id',
`province_name` STRING COMMENT '省市名称',
`area_code` STRING COMMENT '地区编码',
`iso_code` STRING COMMENT 'ISO-3166编码,供可视化使用',
`iso_3166_2` STRING COMMENT 'IOS-3166-2编码,供可视化使用',
`region_id` STRING COMMENT '地区id',
`region_name` STRING COMMENT '地区名称'
) COMMENT '地区维度表'
STORED AS PARQUET
LOCATION '/warehouse/gmall/dim/dim_base_province/'
TBLPROPERTIES ("parquet.compression"="lzo");
- 数据装载
装载语句
insert overwrite table dim_base_province
select
bp.id,
bp.name,
bp.area_code,
bp.iso_code,
bp.iso_3166_2,
bp.region_id,
br.region_name
from ods_base_province bp
join ods_base_region br on bp.region_id = br.id;
5.5 时间维度表(特殊)
通常情况下,时间维度表的数据并不是来自于业务系统,而是手动写入,并且由于时间维度表数据的可预见性,无须每日导入,一般可一次性导入一年的数据。
- 建表语句
DROP TABLE IF EXISTS dim_date_info;
CREATE EXTERNAL TABLE dim_date_info(
`date_id` STRING COMMENT '日',
`week_id` STRING COMMENT '周ID',
`week_day` STRING COMMENT '周几',
`day` STRING COMMENT '每月的第几天',
`month` STRING COMMENT '第几月',
`quarter` STRING COMMENT '第几季度',
`year` STRING COMMENT '年',
`is_workday` STRING COMMENT '是否是工作日',
`holiday_id` STRING COMMENT '节假日'
) COMMENT '时间维度表'
STORED AS PARQUET
LOCATION '/warehouse/gmall/dim/dim_date_info/'
TBLPROPERTIES ("parquet.compression"="lzo");
- 数据装载
因节假日较特殊,可以请求网络上的日历接口,也可将其专门处理,之后将日期属性写入文件date_info.txt
,将文件load到时间维度表dim_date_info
中;
因dim_date_info是PARQUET列式存储+lzo压缩格式,不能识别date_info.txt文本文件,故不可直接导入!可先创建一个不采用PARQUET列式存储+lzo压缩格式的临时表tmp_dim_date_info
load date_info.txt,再将数据insert到date_info.txt中。
1). 创建临时表
DROP TABLE IF EXISTS tmp_dim_date_info;
CREATE EXTERNAL TABLE tmp_dim_date_info (
`date_id` STRING COMMENT '日',
`week_id` STRING COMMENT '周ID',
`week_day` STRING COMMENT '周几',
`day` STRING COMMENT '每月的第几天',
`month` STRING COMMENT '第几月',
`quarter` STRING COMMENT '第几季度',
`year` STRING COMMENT '年',
`is_workday` STRING COMMENT '是否是工作日',
`holiday_id` STRING COMMENT '节假日'
) COMMENT '时间维度表'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/warehouse/gmall/tmp/tmp_dim_date_info/';
2). 将数据文件上传到HFDS上临时表指定路径/warehouse/gmall/tmp/tmp_dim_date_info/
可以看到临时表tmp_dim_date_info 表已经生成数据!
3). 执行以下语句将其导入时间维度表
insert overwrite table dim_date_info select * from tmp_dim_date_info;
可以看到时间维度表dim_date_info已经成功导入数据!
4). 检查数据是否导入成功
select * from dim_date_info;
5.6 用户维度表(拉链表)
拉链表首日装载,需要进行初始化操作,具体工作为将截止到初始化当日的全部历史用户导入一次性导入到拉链表中。目前的ods_user_info表的第一个分区,即2020-06-14分区中就是全部的历史用户,故将该分区数据进行一定处理后导入拉链表的9999-99-99分区即可。
5.6.1 拉链表概述
- 什么是拉链表?
-
为什么要做拉链表?
-
如何使用拉链表?
-
拉链表形成过程
5.6.2 制作拉链表
- 建表语句
DROP TABLE IF EXISTS dim_user_info;
CREATE EXTERNAL TABLE dim_user_info(
`id` STRING COMMENT '用户id',
`login_name` STRING COMMENT '用户名称',
`nick_name` STRING COMMENT '用户昵称',
`name` STRING COMMENT '用户姓名',
`phone_num` STRING COMMENT '手机号码',
`email` STRING COMMENT '邮箱',
`user_level` STRING COMMENT '用户等级',
`birthday` STRING COMMENT '生日',
`gender` STRING COMMENT '性别',
`create_time` STRING COMMENT '创建时间',
`operate_time` STRING COMMENT '操作时间',
`start_date` STRING COMMENT '开始日期',
`end_date` STRING COMMENT '结束日期'
) COMMENT '用户表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dim/dim_user_info/'
TBLPROPERTIES ("parquet.compression"="lzo");
- 分区规划
- 数据装载
1). 首日装载
insert overwrite table dim_user_info partition(dt='9999-99-99')
select
id,
login_name,
nick_name,
md5(name),
md5(phone_num),
md5(email),
user_level,
birthday,
gender,
create_time,
operate_time,
'2020-06-14',
'9999-99-99'
from ods_user_info
where dt='2020-06-14';
2). 每日装载
实现思路
sql编写
with
tmp as
(
select
old.id old_id,
old.login_name old_login_name,
old.nick_name old_nick_name,
old.name old_name,
old.phone_num old_phone_num,
old.email old_email,
old.user_level old_user_level,
old.birthday old_birthday,
old.gender old_gender,
old.create_time old_create_time,
old.operate_time old_operate_time,
old.start_date old_start_date,
old.end_date old_end_date,
new.id new_id,
new.login_name new_login_name,
new.nick_name new_nick_name,
new.name new_name,
new.phone_num new_phone_num,
new.email new_email,
new.user_level new_user_level,
new.birthday new_birthday,
new.gender new_gender,
new.create_time new_create_time,
new.operate_time new_operate_time,
new.start_date new_start_date,
new.end_date new_end_date
from
(
select
id,
login_name,
nick_name,
name,
phone_num,
email,
user_level,
birthday,
gender,
create_time,
operate_time,
start_date,
end_date
from dim_user_info
where dt='9999-99-99'
)old
full outer join
(
select
id,
login_name,
nick_name,
md5(name) name,
md5(phone_num) phone_num,
md5(email) email,
user_level,
birthday,
gender,
create_time,
operate_time,
'2020-06-15' start_date,
'9999-99-99' end_date
from ods_user_info
where dt='2020-06-15'
)new
on old.id=new.id
)
insert overwrite table dim_user_info partition(dt)
select
nvl(new_id,old_id),
nvl(new_login_name,old_login_name),
nvl(new_nick_name,old_nick_name),
nvl(new_name,old_name),
nvl(new_phone_num,old_phone_num),
nvl(new_email,old_email),
nvl(new_user_level,old_user_level),
nvl(new_birthday,old_birthday),
nvl(new_gender,old_gender),
nvl(new_create_time,old_create_time),
nvl(new_operate_time,old_operate_time),
nvl(new_start_date,old_start_date),
nvl(new_end_date,old_end_date),
nvl(new_end_date,old_end_date) dt
from tmp
union all
select
old_id,
old_login_name,
old_nick_name,
old_name,
old_phone_num,
old_email,
old_user_level,
old_birthday,
old_gender,
old_create_time,
old_operate_time,
old_start_date,
cast(date_add('2020-06-15',-1) as string),
cast(date_add('2020-06-15',-1) as string) dt
from tmp
where new_id is not null and old_id is not null;
5.7 DIM层首日数据装载脚本
- 编写脚本
1). 在/home/xiaobai/bin目录下创建脚本ods_to_dim_db_init.sh
[xiaobai@hadoop102 bin]$ vim ods_to_dim_db_init.sh
在脚本中填写如下内容:
#!/bin/bash
APP=gmall
if [ -n "$2" ] ;then
do_date=$2
else
echo "请传入日期参数"
exit
fi
dim_user_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dim_user_info partition(dt='9999-99-99') select id, login_name, nick_name, md5(name), md5(phone_num), md5(email), user_level, birthday, gender, create_time, operate_time, '$do_date', '9999-99-99' from ${APP}.ods_user_info where dt='$do_date'; "
dim_sku_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; with sku as ( select id, price, sku_name, sku_desc, weight, is_sale, spu_id, category3_id, tm_id, create_time from ${APP}.ods_sku_info where dt='$do_date' ), spu as ( select id, spu_name from ${APP}.ods_spu_info where dt='$do_date' ), c3 as ( select id, name, category2_id from ${APP}.ods_base_category3 where dt='$do_date' ), c2 as ( select id, name, category1_id from ${APP}.ods_base_category2 where dt='$do_date' ), c1 as ( select id, name from ${APP}.ods_base_category1 where dt='$do_date' ), tm as ( select id, tm_name from ${APP}.ods_base_trademark where dt='$do_date' ), attr as ( select sku_id, collect_set(named_struct('attr_id',attr_id,'value_id',value_id,'attr_name',attr_name,'value_name',value_name)) attrs from ${APP}.ods_sku_attr_value where dt='$do_date' group by sku_id ), sale_attr as ( select sku_id, collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sale_attrs from ${APP}.ods_sku_sale_attr_value where dt='$do_date' group by sku_id ) insert overwrite table ${APP}.dim_sku_info partition(dt='$do_date') select sku.id, sku.price, sku.sku_name, sku.sku_desc, sku.weight, sku.is_sale, sku.spu_id, spu.spu_name, sku.category3_id, c3.name, c3.category2_id, c2.name, c2.category1_id, c1.name, sku.tm_id, tm.tm_name, attr.attrs, sale_attr.sale_attrs, sku.create_time from sku left join spu on sku.spu_id=spu.id left join c3 on sku.category3_id=c3.id left join c2 on c3.category2_id=c2.id left join c1 on c2.category1_id=c1.id left join tm on sku.tm_id=tm.id left join attr on sku.id=attr.sku_id left join sale_attr on sku.id=sale_attr.sku_id; "
dim_base_province=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dim_base_province select bp.id, bp.name, bp.area_code, bp.iso_code, bp.iso_3166_2, bp.region_id, br.region_name from ${APP}.ods_base_province bp join ${APP}.ods_base_region br on bp.region_id = br.id; "
dim_coupon_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dim_coupon_info partition(dt='$do_date') select id, coupon_name, coupon_type, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, create_time, range_type, limit_num, taken_count, start_time, end_time, operate_time, expire_time from ${APP}.ods_coupon_info where dt='$do_date'; "
dim_activity_rule_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dim_activity_rule_info partition(dt='$do_date') select ar.id, ar.activity_id, ai.activity_name, ar.activity_type, ai.start_time, ai.end_time, ai.create_time, ar.condition_amount, ar.condition_num, ar.benefit_amount, ar.benefit_discount, ar.benefit_level from ( select id, activity_id, activity_type, condition_amount, condition_num, benefit_amount, benefit_discount, benefit_level from ${APP}.ods_activity_rule where dt='$do_date' )ar left join ( select id, activity_name, start_time, end_time, create_time from ${APP}.ods_activity_info where dt='$do_date' )ai on ar.activity_id=ai.id; "
case $1 in
"dim_user_info"){
hive -e "$dim_user_info"
};;
"dim_sku_info"){
hive -e "$dim_sku_info"
};;
"dim_base_province"){
hive -e "$dim_base_province"
};;
"dim_coupon_info"){
hive -e "$dim_coupon_info"
};;
"dim_activity_rule_info"){
hive -e "$dim_activity_rule_info"
};;
"all"){
hive -e "$dim_user_info$dim_sku_info$dim_coupon_info$dim_activity_rule_info$dim_base_province"
};;
esac
2). 增加执行权限:见每日装载脚本!
3). 执行脚本
[xiaobai@hadoop102 bin]$ ./ods_to_dim_db_init.sh all 2020-06-14
4). 查看数据是否导入成功
5.8 DIM层每日数据装载脚本
- 编写脚本
1). 在/home/xiaobai/bin目录下创建脚本ods_to_dim_db.sh
[xiaobai@hadoop102 bin]$ vim ods_to_dim_db.sh
2). 在脚本中填写如下内容:
#!/bin/bash
APP=gmall
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$2" ] ;then
do_date=$2
else
do_date=`date -d "-1 day" +%F`
fi
dim_user_info=" set hive.exec.dynamic.partition.mode=nonstrict; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; with tmp as ( select old.id old_id, old.login_name old_login_name, old.nick_name old_nick_name, old.name old_name, old.phone_num old_phone_num, old.email old_email, old.user_level old_user_level, old.birthday old_birthday, old.gender old_gender, old.create_time old_create_time, old.operate_time old_operate_time, old.start_date old_start_date, old.end_date old_end_date, new.id new_id, new.login_name new_login_name, new.nick_name new_nick_name, new.name new_name, new.phone_num new_phone_num, new.email new_email, new.user_level new_user_level, new.birthday new_birthday, new.gender new_gender, new.create_time new_create_time, new.operate_time new_operate_time, new.start_date new_start_date, new.end_date new_end_date from ( select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date from ${APP}.dim_user_info where dt='9999-99-99' and start_date<'$do_date' )old full outer join ( select id, login_name, nick_name, md5(name) name, md5(phone_num) phone_num, md5(email) email, user_level, birthday, gender, create_time, operate_time, '$do_date' start_date, '9999-99-99' end_date from ${APP}.ods_user_info where dt='$do_date' )new on old.id=new.id ) insert overwrite table ${APP}.dim_user_info partition(dt) select nvl(new_id,old_id), nvl(new_login_name,old_login_name), nvl(new_nick_name,old_nick_name), nvl(new_name,old_name), nvl(new_phone_num,old_phone_num), nvl(new_email,old_email), nvl(new_user_level,old_user_level), nvl(new_birthday,old_birthday), nvl(new_gender,old_gender), nvl(new_create_time,old_create_time), nvl(new_operate_time,old_operate_time), nvl(new_start_date,old_start_date), nvl(new_end_date,old_end_date), nvl(new_end_date,old_end_date) dt from tmp union all select old_id, old_login_name, old_nick_name, old_name, old_phone_num, old_email, old_user_level, old_birthday, old_gender, old_create_time, old_operate_time, old_start_date, cast(date_add('$do_date',-1) as string), cast(date_add('$do_date',-1) as string) dt from tmp where new_id is not null and old_id is not null; "
dim_sku_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; with sku as ( select id, price, sku_name, sku_desc, weight, is_sale, spu_id, category3_id, tm_id, create_time from ${APP}.ods_sku_info where dt='$do_date' ), spu as ( select id, spu_name from ${APP}.ods_spu_info where dt='$do_date' ), c3 as ( select id, name, category2_id from ${APP}.ods_base_category3 where dt='$do_date' ), c2 as ( select id, name, category1_id from ${APP}.ods_base_category2 where dt='$do_date' ), c1 as ( select id, name from ${APP}.ods_base_category1 where dt='$do_date' ), tm as ( select id, tm_name from ${APP}.ods_base_trademark where dt='$do_date' ), attr as ( select sku_id, collect_set(named_struct('attr_id',attr_id,'value_id',value_id,'attr_name',attr_name,'value_name',value_name)) attrs from ${APP}.ods_sku_attr_value where dt='$do_date' group by sku_id ), sale_attr as ( select sku_id, collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sale_attrs from ${APP}.ods_sku_sale_attr_value where dt='$do_date' group by sku_id ) insert overwrite table ${APP}.dim_sku_info partition(dt='$do_date') select sku.id, sku.price, sku.sku_name, sku.sku_desc, sku.weight, sku.is_sale, sku.spu_id, spu.spu_name, sku.category3_id, c3.name, c3.category2_id, c2.name, c2.category1_id, c1.name, sku.tm_id, tm.tm_name, attr.attrs, sale_attr.sale_attrs, sku.create_time from sku left join spu on sku.spu_id=spu.id left join c3 on sku.category3_id=c3.id left join c2 on c3.category2_id=c2.id left join c1 on c2.category1_id=c1.id left join tm on sku.tm_id=tm.id left join attr on sku.id=attr.sku_id left join sale_attr on sku.id=sale_attr.sku_id; "
dim_base_province=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dim_base_province select bp.id, bp.name, bp.area_code, bp.iso_code, bp.iso_3166_2, bp.region_id, bp.name from ${APP}.ods_base_province bp join ${APP}.ods_base_region br on bp.region_id = br.id; "
dim_coupon_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dim_coupon_info partition(dt='$do_date') select id, coupon_name, coupon_type, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, create_time, range_type, limit_num, taken_count, start_time, end_time, operate_time, expire_time from ${APP}.ods_coupon_info where dt='$do_date'; "
dim_activity_rule_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dim_activity_rule_info partition(dt='$do_date') select ar.id, ar.activity_id, ai.activity_name, ar.activity_type, ai.start_time, ai.end_time, ai.create_time, ar.condition_amount, ar.condition_num, ar.benefit_amount, ar.benefit_discount, ar.benefit_level from ( select id, activity_id, activity_type, condition_amount, condition_num, benefit_amount, benefit_discount, benefit_level from ${APP}.ods_activity_rule where dt='$do_date' )ar left join ( select id, activity_name, start_time, end_time, create_time from ${APP}.ods_activity_info where dt='$do_date' )ai on ar.activity_id=ai.id; "
case $1 in
"dim_user_info"){
hive -e "$dim_user_info"
};;
"dim_sku_info"){
hive -e "$dim_sku_info"
};;
"dim_base_province"){
hive -e "$dim_base_province"
};;
"dim_coupon_info"){
hive -e "$dim_coupon_info"
};;
"dim_activity_rule_info"){
hive -e "$dim_activity_rule_info"
};;
"all"){
hive -e "$dim_user_info$dim_sku_info$dim_coupon_info$dim_activity_rule_info"
};;
esac
2). 增加执行权限:
[xiaobai@hadoop102 bin]$ chmod +x ods_to_dim_db*
3). 执行脚本
ods_to_dim_db.sh all 2020-06-14
六、数仓搭建-DWD层
- 对用户行为数据解析;
- 对业务数据采用维度模型重新建模。
6.1 DWD层 (用户行为日志)
6.1.1 日志解析思路
- 日志结构
1). 页面埋点日志
2). 启动日志
- 日志解析思路
6.1.2 get_json_object函数使用
- 数据
[{
"name":"小明","sex":"男","age":"25"},{
"name":"小红","sex":"女","age":"23"}]
- 取出第一个json对象:
select get_json_object('[{"name":"小明","sex":"男","age":"25"},{"name":"小红","sex":"女","age":"23"}]','$[0]')
结果:
- 取出第一个json的age字段的值:
select get_json_object('[{"name":"小明","sex":"男","age":"25"},{"name":"小红","sex":"女","age":"23"}]','$[0].age')
结果:
6.1.3 启动日志表
启动日志解析思路:启动日志表中每行数据对应一个启动记录,一个启动记录应该包含日志中的公共信息和启动信息。先将所有包含start字段的日志过滤出来,然后使用get_json_object
函数解析每个字段。
- 建表语句
DROP TABLE IF EXISTS dwd_start_log;
CREATE EXTERNAL TABLE dwd_start_log(
`area_code` STRING COMMENT '地区编码',
`brand` STRING COMMENT '手机品牌',
`channel` STRING COMMENT '渠道',
`is_new` STRING COMMENT '是否首次启动',
`model` STRING COMMENT '手机型号',
`mid_id` STRING COMMENT '设备id',
`os` STRING COMMENT '操作系统',
`user_id` STRING COMMENT '会员id',
`version_code` STRING COMMENT 'app版本号',
`entry` STRING COMMENT 'icon手机图标 notice 通知 install 安装后启动',
`loading_time` BIGINT COMMENT '启动加载时间',
`open_ad_id` STRING COMMENT '广告页ID ',
`open_ad_ms` BIGINT COMMENT '广告总共播放时间',
`open_ad_skip_ms` BIGINT COMMENT '用户跳过广告时点',
`ts` BIGINT COMMENT '时间'
) COMMENT '启动日志表'
PARTITIONED BY (`dt` STRING) -- 按照时间创建分区
STORED AS PARQUET -- 采用parquet列式存储
LOCATION '/warehouse/gmall/dwd/dwd_start_log' -- 指定在HDFS上存储位置
TBLPROPERTIES('parquet.compression'='lzo') -- 采用LZO压缩;
- 数据装载
首日与每日相同!
hive (gmall)>
insert overwrite table dwd_start_log partition(dt='2020-06-14')
select
get_json_object(line,'$.common.ar'),
get_json_object(line,'$.common.ba'),
get_json_object(line,'$.common.ch'),
get_json_object(line,'$.common.is_new'),
get_json_object(line,'$.common.md'),
get_json_object(line,'$.common.mid'),
get_json_object(line,'$.common.os'),
get_json_object(line,'$.common.uid'),
get_json_object(line,'$.common.vc'),
get_json_object(line,'$.start.entry'),
get_json_object(line,'$.start.loading_time'),
get_json_object(line,'$.start.open_ad_id'),
get_json_object(line,'$.start.open_ad_ms'),
get_json_object(line,'$.start.open_ad_skip_ms'),
get_json_object(line,'$.ts')
from ods_log
where dt='2020-06-14'
and get_json_object(line,'$.start') is not null;
- 查看数据
select * from dwd_start_log where dt='2020-06-14' limit 2;
6.1.4 页面日志表
**页面日志解析思路:**页面日志表中每行数据对应一个页面访问记录
,一个页面访问记录应该包含日志中的公共信息和页面信息。先将所有包含page字段的日志过滤出来
,然后使用get_json_object函数解析每个字段。
- 建表语句
DROP TABLE IF EXISTS dwd_page_log;
CREATE EXTERNAL TABLE dwd_page_log(
`area_code` STRING COMMENT '地区编码',
`brand` STRING COMMENT '手机品牌',
`channel` STRING COMMENT '渠道',
`is_new` STRING COMMENT '是否首次启动',
`model` STRING COMMENT '手机型号',
`mid_id` STRING COMMENT '设备id',
`os` STRING COMMENT '操作系统',
`user_id` STRING COMMENT '会员id',
`version_code` STRING COMMENT 'app版本号',
`during_time` BIGINT COMMENT '持续时间毫秒',
`page_item` STRING COMMENT '目标id ',
`page_item_type` STRING COMMENT '目标类型',
`last_page_id` STRING COMMENT '上页类型',
`page_id` STRING COMMENT '页面ID ',
`source_type` STRING COMMENT '来源类型',
`ts` bigint
) COMMENT '页面日志表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_page_log'
TBLPROPERTIES('parquet.compression'='lzo');
- 数据装载
insert overwrite table dwd_page_log partition(dt='2020-06-14')
select
get_json_object(line,'$.common.ar'),
get_json_object(line,'$.common.ba'),
get_json_object(line,'$.common.ch'),
get_json_object(line,'$.common.is_new'),
get_json_object(line,'$.common.md'),
get_json_object(line,'$.common.mid'),
get_json_object(line,'$.common.os'),
get_json_object(line,'$.common.uid'),
get_json_object(line,'$.common.vc'),
get_json_object(line,'$.page.during_time'),
get_json_object(line,'$.page.item'),
get_json_object(line,'$.page.item_type'),
get_json_object(line,'$.page.last_page_id'),
get_json_object(line,'$.page.page_id'),
get_json_object(line,'$.page.source_type'),
get_json_object(line,'$.ts')
from ods_log
where dt='2020-06-14'
and get_json_object(line,'$.page') is not null;
- 查看数据:
select * from dwd_page_log where dt='2020-06-14' limit 2;
6.1.5 动作日志表
动作日志解析思路:动作日志表中每行数据对应用户的一个动作记录
,一个动作记录应当包含公共信息、页面信息以及动作信息。先将包含action字段的日志过滤
出来,然后通过UDTF函数,将action数组“炸开”
(类似于explode函数的效果),然后使用get_json_object函数解析每个字段。
- 建表语句
DROP TABLE IF EXISTS dwd_action_log;
CREATE EXTERNAL TABLE dwd_action_log(
`area_code` STRING COMMENT '地区编码',
`brand` STRING COMMENT '手机品牌',
`channel` STRING COMMENT '渠道',
`is_new` STRING COMMENT '是否首次启动',
`model` STRING COMMENT '手机型号',
`mid_id` STRING COMMENT '设备id',
`os` STRING COMMENT '操作系统',
`user_id` STRING COMMENT '会员id',
`version_code` STRING COMMENT 'app版本号',
`during_time` BIGINT COMMENT '持续时间毫秒',
`page_item` STRING COMMENT '目标id ',
`page_item_type` STRING COMMENT '目标类型',
`last_page_id` STRING COMMENT '上页类型',
`page_id` STRING COMMENT '页面id ',
`source_type` STRING COMMENT '来源类型',
`action_id` STRING COMMENT '动作id',
`item` STRING COMMENT '目标id ',
`item_type` STRING COMMENT '目标类型',
`ts` BIGINT COMMENT '时间'
) COMMENT '动作日志表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_action_log'
TBLPROPERTIES('parquet.compression'='lzo');
- 创建UDTF函数——设计思路
- 创建UDTF函数——编写代码
1). 创建一个maven工程:hivefunction
2). 创建包名:com.atguigu.hive.udtf
3). 引入如下依赖
<dependencies>
<!--添加hive依赖-->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>3.1.2</version>
</dependency>
</dependencies>
4). 编码
package com.xiaobai.gmall.hive.udtf;
import java.util.ArrayList;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.json.JSONArray;
public class ExplodeJSONArray extends GenericUDTF {
private PrimitiveObjectInspector inputOI;
@Override
public void process(Object[] args) throws HiveException {
Object arg = args[0];
String jsonArrayStr = PrimitiveObjectInspectorUtils.getString(arg, inputOI);
JSONArray jsonArray = new JSONArray(jsonArrayStr);
for (int i = 0; i < jsonArray.length(); i++) {
String json = jsonArray.getString(i);
String[] result = {
json};
forward(result);
}
}
@Override
public void close() throws HiveException {
}
@Override
public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
if(argOIs.length!= 1){
throw new UDFArgumentException("explode_json_array函数只能接收1个函数");
}
ObjectInspector argOI = argOIs[0];
if(argOI.getCategory()!= ObjectInspector.Category.PRIMITIVE){
throw new UDFArgumentException("explode_json_array函数只能接收基本数据类型的函数");
}
PrimitiveObjectInspector primitiveOI = (PrimitiveObjectInspector) argOI;
inputOI=primitiveOI;
if(primitiveOI.getPrimitiveCategory()!= PrimitiveObjectInspector.PrimitiveCategory.STRING){
throw new UDFArgumentException("explode_json_array函数只能接收STRING类型的函数");
}
ArrayList<String> fieldNames = new ArrayList<String>();
ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
fieldNames.add("item");
fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,fieldOIs);
}
}
- 创建函数
1). 打包
2). 将gmall-udtf-1.0-SNAPSHOT.jar
上传到hadoop102的/opt/module/路径下,然后再将该jar包上传到HDFS的/user/hive/jars路径下:
可手动在hdfs上创建jars目录并上传jarbao,也可通过命令上传:
[xiaobai@hadoop102 module]$ hadoop fs -mkdir -p /user/hive/jars
[xiaobai@hadoop102 module]$ hadoop fs -put hivefunction-1.0-SNAPSHOT.jar /user/hive/jars
3). 创建永久函数与开发好的java class关联:
create function explode_json_array as 'com.xiaobai.gmall.hive.udtf.ExplodeJSONArray' using jar 'hdfs://hadoop102:8020/user/hive/jars/gmall-udtf-1.0-SNAPSHOT.jar';
注⚠️:
如果修改了自定义函数重新生成jar包怎么处理?只需要替换HDFS路径上的旧jar包,然后重启Hive客户端即可。
- 数据导入
insert overwrite table dwd_action_log partition(dt='2020-06-14')
select
get_json_object(line,'$.common.ar'),
get_json_object(line,'$.common.ba'),
get_json_object(line,'$.common.ch'),
get_json_object(line,'$.common.is_new'),
get_json_object(line,'$.common.md'),
get_json_object(line,'$.common.mid'),
get_json_object(line,'$.common.os'),
get_json_object(line,'$.common.uid'),
get_json_object(line,'$.common.vc'),
get_json_object(line,'$.page.during_time'),
get_json_object(line,'$.page.item'),
get_json_object(line,'$.page.item_type'),
get_json_object(line,'$.page.last_page_id'),
get_json_object(line,'$.page.page_id'),
get_json_object(line,'$.page.source_type'),
get_json_object(action,'$.action_id'),
get_json_object(action,'$.item'),
get_json_object(action,'$.item_type'),
get_json_object(action,'$.ts')
from ods_log lateral view explode_json_array(get_json_object(line,'$.actions')) tmp as action
where dt='2020-06-14'
and get_json_object(line,'$.actions') is not null;
- 查看数据
select * from dwd_action_log where dt='2020-06-14' limit 2;
6.1.6 曝光日志表
曝光日志解析思路:曝光日志表中每行数据对应一个曝光记录
,一个曝光记录应当包含公共信息、页面信息以及曝光信息。先将包含display字段的日志过滤
出来,然后通过UDTF函数,将display数组“炸开”(
类似于explode函数的效果),然后使用get_json_object函数解析每个字段。
- 建表语句
DROP TABLE IF EXISTS dwd_display_log;
CREATE EXTERNAL TABLE dwd_display_log(
`area_code` STRING COMMENT '地区编码',
`brand` STRING COMMENT '手机品牌',
`channel` STRING COMMENT '渠道',
`is_new` STRING COMMENT '是否首次启动',
`model` STRING COMMENT '手机型号',
`mid_id` STRING COMMENT '设备id',
`os` STRING COMMENT '操作系统',
`user_id` STRING COMMENT '会员id',
`version_code` STRING COMMENT 'app版本号',
`during_time` BIGINT COMMENT 'app版本号',
`page_item` STRING COMMENT '目标id ',
`page_item_type` STRING COMMENT '目标类型',
`last_page_id` STRING COMMENT '上页类型',
`page_id` STRING COMMENT '页面ID ',
`source_type` STRING COMMENT '来源类型',
`ts` BIGINT COMMENT 'app版本号',
`display_type` STRING COMMENT '曝光类型',
`item` STRING COMMENT '曝光对象id ',
`item_type` STRING COMMENT 'app版本号',
`order` BIGINT COMMENT '曝光顺序',
`pos_id` BIGINT COMMENT '曝光位置'
) COMMENT '曝光日志表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_display_log'
TBLPROPERTIES('parquet.compression'='lzo');
- 数据导入
insert overwrite table dwd_display_log partition(dt='2020-06-14')
select
get_json_object(line,'$.common.ar'),
get_json_object(line,'$.common.ba'),
get_json_object(line,'$.common.ch'),
get_json_object(line,'$.common.is_new'),
get_json_object(line,'$.common.md'),
get_json_object(line,'$.common.mid'),
get_json_object(line,'$.common.os'),
get_json_object(line,'$.common.uid'),
get_json_object(line,'$.common.vc'),
get_json_object(line,'$.page.during_time'),
get_json_object(line,'$.page.item'),
get_json_object(line,'$.page.item_type'),
get_json_object(line,'$.page.last_page_id'),
get_json_object(line,'$.page.page_id'),
get_json_object(line,'$.page.source_type'),
get_json_object(line,'$.ts'),
get_json_object(display,'$.display_type'),
get_json_object(display,'$.item'),
get_json_object(display,'$.item_type'),
get_json_object(display,'$.order'),
get_json_object(display,'$.pos_id')
from ods_log lateral view explode_json_array(get_json_object(line,'$.displays')) tmp as display
where dt='2020-06-14'
and get_json_object(line,'$.displays') is not null;
- 查看数据
select * from dwd_display_log where dt='2020-06-14' limit 2;
6.1.7 错误日志表
错误日志解析思路:错误日志表中每行数据对应一个错误记录
,为方便定位错误,一个错误记录应当包含与之对应的公共信息、页面信息、曝光信息、动作信息、启动信息以及错误信息。先将包含err字段的日志过滤
出来,然后使用get_json_object函数解析所有字段。
- 建表语句
DROP TABLE IF EXISTS dwd_error_log;
CREATE EXTERNAL TABLE dwd_error_log(
`area_code` STRING COMMENT '地区编码',
`brand` STRING COMMENT '手机品牌',
`channel` STRING COMMENT '渠道',
`is_new` STRING COMMENT '是否首次启动',
`model` STRING COMMENT '手机型号',
`mid_id` STRING COMMENT '设备id',
`os` STRING COMMENT '操作系统',
`user_id` STRING COMMENT '会员id',
`version_code` STRING COMMENT 'app版本号',
`page_item` STRING COMMENT '目标id ',
`page_item_type` STRING COMMENT '目标类型',
`last_page_id` STRING COMMENT '上页类型',
`page_id` STRING COMMENT '页面ID ',
`source_type` STRING COMMENT '来源类型',
`entry` STRING COMMENT ' icon手机图标 notice 通知 install 安装后启动',
`loading_time` STRING COMMENT '启动加载时间',
`open_ad_id` STRING COMMENT '广告页ID ',
`open_ad_ms` STRING COMMENT '广告总共播放时间',
`open_ad_skip_ms` STRING COMMENT '用户跳过广告时点',
`actions` STRING COMMENT '动作',
`displays` STRING COMMENT '曝光',
`ts` STRING COMMENT '时间',
`error_code` STRING COMMENT '错误码',
`msg` STRING COMMENT '错误信息'
) COMMENT '错误日志表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_error_log'
TBLPROPERTIES('parquet.compression'='lzo');
注⚠️:
此处为对动作数组和曝光数组做处理,如需分析错误与单个动作或曝光的关联,可先使用explode_json_array函数将数组“炸开”,再使用get_json_object函数获取具体字段。
- 数据装载
insert overwrite table dwd_error_log partition(dt='2020-06-14')
select
get_json_object(line,'$.common.ar'),
get_json_object(line,'$.common.ba'),
get_json_object(line,'$.common.ch'),
get_json_object(line,'$.common.is_new'),
get_json_object(line,'$.common.md'),
get_json_object(line,'$.common.mid'),
get_json_object(line,'$.common.os'),
get_json_object(line,'$.common.uid'),
get_json_object(line,'$.common.vc'),
get_json_object(line,'$.page.item'),
get_json_object(line,'$.page.item_type'),
get_json_object(line,'$.page.last_page_id'),
get_json_object(line,'$.page.page_id'),
get_json_object(line,'$.page.source_type'),
get_json_object(line,'$.start.entry'),
get_json_object(line,'$.start.loading_time'),
get_json_object(line,'$.start.open_ad_id'),
get_json_object(line,'$.start.open_ad_ms'),
get_json_object(line,'$.start.open_ad_skip_ms'),
get_json_object(line,'$.actions'),
get_json_object(line,'$.displays'),
get_json_object(line,'$.ts'),
get_json_object(line,'$.err.error_code'),
get_json_object(line,'$.err.msg')
from ods_log
where dt='2020-06-14'
and get_json_object(line,'$.err') is not null;
6.1.8 DWD层用户行为数据加载脚本
- 编写脚本
在hadoop102的/home/xiaobai/bin目录下创建脚本
[xiaobai@hadoop102 bin]$ vim ods_to_dwd_log.sh
填入以下内容:
#!/bin/bash
APP=gmall
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$2" ] ;then
do_date=$2
else
do_date=`date -d "-1 day" +%F`
fi
dwd_start_log=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_start_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.is_new'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.start.entry'), get_json_object(line,'$.start.loading_time'), get_json_object(line,'$.start.open_ad_id'), get_json_object(line,'$.start.open_ad_ms'), get_json_object(line,'$.start.open_ad_skip_ms'), get_json_object(line,'$.ts') from ${APP}.ods_log where dt='$do_date' and get_json_object(line,'$.start') is not null;"
dwd_page_log=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_page_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.is_new'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.during_time'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.source_type'), get_json_object(line,'$.ts') from ${APP}.ods_log where dt='$do_date' and get_json_object(line,'$.page') is not null;"
dwd_action_log=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_action_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.is_new'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.during_time'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.source_type'), get_json_object(action,'$.action_id'), get_json_object(action,'$.item'), get_json_object(action,'$.item_type'), get_json_object(action,'$.ts') from ${APP}.ods_log lateral view ${APP}.explode_json_array(get_json_object(line,'$.actions')) tmp as action where dt='$do_date' and get_json_object(line,'$.actions') is not null;"
dwd_display_log=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_display_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.is_new'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.during_time'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.source_type'), get_json_object(line,'$.ts'), get_json_object(display,'$.display_type'), get_json_object(display,'$.item'), get_json_object(display,'$.item_type'), get_json_object(display,'$.order'), get_json_object(display,'$.pos_id') from ${APP}.ods_log lateral view ${APP}.explode_json_array(get_json_object(line,'$.displays')) tmp as display where dt='$do_date' and get_json_object(line,'$.displays') is not null;"
dwd_error_log=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_error_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.is_new'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.source_type'), get_json_object(line,'$.start.entry'), get_json_object(line,'$.start.loading_time'), get_json_object(line,'$.start.open_ad_id'), get_json_object(line,'$.start.open_ad_ms'), get_json_object(line,'$.start.open_ad_skip_ms'), get_json_object(line,'$.actions'), get_json_object(line,'$.displays'), get_json_object(line,'$.ts'), get_json_object(line,'$.err.error_code'), get_json_object(line,'$.err.msg') from ${APP}.ods_log where dt='$do_date' and get_json_object(line,'$.err') is not null;"
case $1 in
dwd_start_log )
hive -e "$dwd_start_log"
;;
dwd_page_log )
hive -e "$dwd_page_log"
;;
dwd_action_log )
hive -e "$dwd_action_log"
;;
dwd_display_log )
hive -e "$dwd_display_log"
;;
dwd_error_log )
hive -e "$dwd_error_log"
;;
all )
hive -e "$dwd_start_log$dwd_page_log$dwd_action_log$dwd_display_log$dwd_error_log"
;;
esac
- 增加脚本执行权限
[xiaobai@hadoop102 bin]$ chmod +x ods_to_dwd_log.sh
- 执行脚本
[xiaobai@hadoop102 bin]$ ./ods_to_dwd_log.sh all 2020-06-14
查看导入结果:
6.2 DWD层(业务数据)
业务数据方面DWD层的搭建主要注意点在于维度建模。
6.2.1 评价事实表(事务型事实表)
- 建表语句
DROP TABLE IF EXISTS dwd_comment_info;
CREATE EXTERNAL TABLE dwd_comment_info(
`id` STRING COMMENT '编号',
`user_id` STRING COMMENT '用户ID',
`sku_id` STRING COMMENT '商品sku',
`spu_id` STRING COMMENT '商品spu',
`order_id` STRING COMMENT '订单ID',
`appraise` STRING COMMENT '评价(好评、中评、差评、默认评价)',
`create_time` STRING COMMENT '评价时间'
) COMMENT '评价事实表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_comment_info/'
TBLPROPERTIES ("parquet.compression"="lzo");
-
分区规划
-
数据装载
1). 首日装载
insert overwrite table dwd_comment_info partition (dt)
select
id,
user_id,
sku_id,
spu_id,
order_id,
appraise,
create_time,
date_format(create_time,'yyyy-MM-dd')
from ods_comment_info
where dt='2020-06-14';
2). 每日装载
insert overwrite table dwd_comment_info partition(dt='2020-06-14')
select
id,
user_id,
sku_id,
spu_id,
order_id,
appraise,
create_time
from ods_comment_info
where dt='2020-06-15'
6.2.2 订单明细事实表(事务型事实表)
- 建表语句
DROP TABLE IF EXISTS dwd_order_detail;
CREATE EXTERNAL TABLE dwd_order_detail (
`id` STRING COMMENT '订单编号',
`order_id` STRING COMMENT '订单号',
`user_id` STRING COMMENT '用户id',
`sku_id` STRING COMMENT 'sku商品id',
`province_id` STRING COMMENT '省份ID',
`activity_id` STRING COMMENT '活动ID',
`activity_rule_id` STRING COMMENT '活动规则ID',
`coupon_id` STRING COMMENT '优惠券ID',
`create_time` STRING COMMENT '创建时间',
`source_type` STRING COMMENT '来源类型',
`source_id` STRING COMMENT '来源编号',
`sku_num` BIGINT COMMENT '商品数量',
`original_amount` DECIMAL(16,2) COMMENT '原始价格',
`split_activity_amount` DECIMAL(16,2) COMMENT '活动优惠分摊',
`split_coupon_amount` DECIMAL(16,2) COMMENT '优惠券优惠分摊',
`split_final_amount` DECIMAL(16,2) COMMENT '最终价格分摊'
) COMMENT '订单明细事实表表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_order_detail/'
TBLPROPERTIES ("parquet.compression"="lzo");
- 分区规划
- 数据装载
1). 首日装载
insert overwrite table dwd_order_detail partition(dt)
select
od.id,
od.order_id,
oi.user_id,
od.sku_id,
oi.province_id,
oda.activity_id,
oda.activity_rule_id,
odc.coupon_id,
od.create_time,
od.source_type,
od.source_id,
od.sku_num,
od.order_price*od.sku_num,
od.split_activity_amount,
od.split_coupon_amount,
od.split_final_amount,
date_format(create_time,'yyyy-MM-dd')
from
(
select
*
from ods_order_detail
where dt='2020-06-14'
)od
left join
(
select
id,
user_id,
province_id
from ods_order_info
where dt='2020-06-14'
)oi
on od.order_id=oi.id
left join
(
select
order_detail_id,
activity_id,
activity_rule_id
from ods_order_detail_activity
where dt='2020-06-14'
)oda
on od.id=oda.order_detail_id
left join
(
select
order_detail_id,
coupon_id
from ods_order_detail_coupon
where dt='2020-06-14'
)odc
on od.id=odc.order_detail_id;
2). 每日装载
insert overwrite table dwd_order_detail partition(dt='2020-06-15')
select
od.id,
od.order_id,
oi.user_id,
od.sku_id,
oi.province_id,
oda.activity_id,
oda.activity_rule_id,
odc.coupon_id,
od.create_time,
od.source_type,
od.source_id,
od.sku_num,
od.order_price*od.sku_num,
od.split_activity_amount,
od.split_coupon_amount,
od.split_final_amount
from
(
select
*
from ods_order_detail
where dt='2020-06-15'
)od
left join
(
select
id,
user_id,
province_id
from ods_order_info
where dt='2020-06-15'
)oi
on od.order_id=oi.id
left join
(
select
order_detail_id,
activity_id,
activity_rule_id
from ods_order_detail_activity
where dt='2020-06-15'
)oda
on od.id=oda.order_detail_id
left join
(
select
order_detail_id,
coupon_id
from ods_order_detail_coupon
where dt='2020-06-15'
)odc
on od.id=odc.order_detail_id;
6.2.3 退单事实表(事务型事实表)
- 建表语句
DROP TABLE IF EXISTS dwd_order_refund_info;
CREATE EXTERNAL TABLE dwd_order_refund_info(
`id` STRING COMMENT '编号',
`user_id` STRING COMMENT '用户ID',
`order_id` STRING COMMENT '订单ID',
`sku_id` STRING COMMENT '商品ID',
`province_id` STRING COMMENT '地区ID',
`refund_type` STRING COMMENT '退单类型',
`refund_num` BIGINT COMMENT '退单件数',
`refund_amount` DECIMAL(16,2) COMMENT '退单金额',
`refund_reason_type` STRING COMMENT '退单原因类型',
`create_time` STRING COMMENT '退单时间'
) COMMENT '退单事实表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_order_refund_info/'
TBLPROPERTIES ("parquet.compression"="lzo");
- 分区规划
- 数据装载
1). 首日装载
insert overwrite table dwd_order_refund_info partition(dt)
select
ri.id,
ri.user_id,
ri.order_id,
ri.sku_id,
oi.province_id,
ri.refund_type,
ri.refund_num,
ri.refund_amount,
ri.refund_reason_type,
ri.create_time,
date_format(ri.create_time,'yyyy-MM-dd')
from
(
select * from ods_order_refund_info where dt='2020-06-14'
)ri
left join
(
select id,province_id from ods_order_info where dt='2020-06-14'
)oi
on ri.order_id=oi.id;
2). 每日装载
insert overwrite table dwd_order_refund_info partition(dt='2020-06-15')
select
ri.id,
ri.user_id,
ri.order_id,
ri.sku_id,
oi.province_id,
ri.refund_type,
ri.refund_num,
ri.refund_amount,
ri.refund_reason_type,
ri.create_time
from
(
select * from ods_order_refund_info where dt='2020-06-15'
)ri
left join
(
select id,province_id from ods_order_info where dt='2020-06-15'
)oi
on ri.order_id=oi.id;
6.2.4 加购事实表(周期型快照事实表,每日快照)
- 建表语句
DROP TABLE IF EXISTS dwd_cart_info;
CREATE EXTERNAL TABLE dwd_cart_info(
`id` STRING COMMENT '编号',
`user_id` STRING COMMENT '用户ID',
`sku_id` STRING COMMENT '商品ID',
`source_type` STRING COMMENT '来源类型',
`source_id` STRING COMMENT '来源编号',
`cart_price` DECIMAL(16,2) COMMENT '加入购物车时的价格',
`is_ordered` STRING COMMENT '是否已下单',
`create_time` STRING COMMENT '创建时间',
`operate_time` STRING COMMENT '修改时间',
`order_time` STRING COMMENT '下单时间',
`sku_num` BIGINT COMMENT '加购数量'
) COMMENT '加购事实表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_cart_info/'
TBLPROPERTIES ("parquet.compression"="lzo");
- 分区规划
- 数据装载
1). 首日装载
insert overwrite table dwd_cart_info partition(dt='2020-06-14')
select
id,
user_id,
sku_id,
source_type,
source_id,
cart_price,
is_ordered,
create_time,
operate_time,
order_time,
sku_num
from ods_cart_info
where dt='2020-06-14';
2). 每日装载
insert overwrite table dwd_cart_info partition(dt='2020-06-15')
select
id,
user_id,
sku_id,
source_type,
source_id,
cart_price,
is_ordered,
create_time,
operate_time,
order_time,
sku_num
from ods_cart_info
where dt='2020-06-15';
6.2.5 收藏事实表(周期型快照事实表,每日快照)
- 建表语句
DROP TABLE IF EXISTS dwd_favor_info;
CREATE EXTERNAL TABLE dwd_favor_info(
`id` STRING COMMENT '编号',
`user_id` STRING COMMENT '用户id',
`sku_id` STRING COMMENT 'skuid',
`spu_id` STRING COMMENT 'spuid',
`is_cancel` STRING COMMENT '是否取消',
`create_time` STRING COMMENT '收藏时间',
`cancel_time` STRING COMMENT '取消时间'
) COMMENT '收藏事实表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_favor_info/'
TBLPROPERTIES ("parquet.compression"="lzo");
-
分区规划
-
数据装载
1).首日装载
insert overwrite table dwd_favor_info partition(dt='2020-06-14')
select
id,
user_id,
sku_id,
spu_id,
is_cancel,
create_time,
cancel_time
from ods_favor_info
where dt='2020-06-14';
2). 每日装载
insert overwrite table dwd_favor_info partition(dt='2020-06-15')
select
id,
user_id,
sku_id,
spu_id,
is_cancel,
create_time,
cancel_time
from ods_favor_info
where dt='2020-06-15';
6.2.6 优惠券领用事实表(累积型快照事实表)
- 建表语句
DROP TABLE IF EXISTS dwd_coupon_use;
CREATE EXTERNAL TABLE dwd_coupon_use(
`id` STRING COMMENT '编号',
`coupon_id` STRING COMMENT '优惠券ID',
`user_id` STRING COMMENT 'userid',
`order_id` STRING COMMENT '订单id',
`coupon_status` STRING COMMENT '优惠券状态',
`get_time` STRING COMMENT '领取时间',
`using_time` STRING COMMENT '使用时间(下单)',
`used_time` STRING COMMENT '使用时间(支付)',
`expire_time` STRING COMMENT '过期时间'
) COMMENT '优惠券领用事实表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_coupon_use/'
TBLPROPERTIES ("parquet.compression"="lzo");
-
分区规划
-
数据装载
1). 首日装载
insert overwrite table dwd_coupon_use partition(dt)
select
id,
coupon_id,
user_id,
order_id,
coupon_status,
get_time,
using_time,
used_time,
expire_time,
coalesce(date_format(used_time,'yyyy-MM-dd'),date_format(expire_time,'yyyy-MM-dd'),'9999-99-99')
from ods_coupon_use
where dt='2020-06-14';
2). 每日装载
a.装载逻辑
b.转载语句
insert overwrite table dwd_coupon_use partition(dt)
select
nvl(new.id,old.id),
nvl(new.coupon_id,old.coupon_id),
nvl(new.user_id,old.user_id),
nvl(new.order_id,old.order_id),
nvl(new.coupon_status,old.coupon_status),
nvl(new.get_time,old.get_time),
nvl(new.using_time,old.using_time),
nvl(new.used_time,old.used_time),
nvl(new.expire_time,old.expire_time),
coalesce(date_format(nvl(new.used_time,old.used_time),'yyyy-MM-dd'),date_format(nvl(new.expire_time,old.expire_time),'yyyy-MM-dd'),'9999-99-99')
from
(
select
id,
coupon_id,
user_id,
order_id,
coupon_status,
get_time,
using_time,
used_time,
expire_time
from dwd_coupon_use
where dt='9999-99-99'
)old
full outer join
(
select
id,
coupon_id,
user_id,
order_id,
coupon_status,
get_time,
using_time,
used_time,
expire_time
from ods_coupon_use
where dt='2020-06-15'
)new
on old.id=new.id;
6.2.7 支付事实表(累积型快照事实表)
- 建表语句
DROP TABLE IF EXISTS dwd_payment_info;
CREATE EXTERNAL TABLE dwd_payment_info (
`id` STRING COMMENT '编号',
`order_id` STRING COMMENT '订单编号',
`user_id` STRING COMMENT '用户编号',
`province_id` STRING COMMENT '地区ID',
`trade_no` STRING COMMENT '交易编号',
`out_trade_no` STRING COMMENT '对外交易编号',
`payment_type` STRING COMMENT '支付类型',
`payment_amount` DECIMAL(16,2) COMMENT '支付金额',
`payment_status` STRING COMMENT '支付状态',
`create_time` STRING COMMENT '创建时间',--调用第三方支付接口的时间
`callback_time` STRING COMMENT '完成时间'--支付完成时间,即支付成功回调时间
) COMMENT '支付事实表表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_payment_info/'
TBLPROPERTIES ("parquet.compression"="lzo");
-
分区规划
-
数据装载
1). 首日装载
insert overwrite table dwd_payment_info partition(dt)
select
pi.id,
pi.order_id,
pi.user_id,
oi.province_id,
pi.trade_no,
pi.out_trade_no,
pi.payment_type,
pi.payment_amount,
pi.payment_status,
pi.create_time,
pi.callback_time,
nvl(date_format(pi.callback_time,'yyyy-MM-dd'),'9999-99-99')
from
(
select * from ods_payment_info where dt='2020-06-14'
)pi
left join
(
select id,province_id from ods_order_info where dt='2020-06-14'
)oi
on pi.order_id=oi.id;
2). 每日装载
insert overwrite table dwd_payment_info partition(dt)
select
nvl(new.id,old.id),
nvl(new.order_id,old.order_id),
nvl(new.user_id,old.user_id),
nvl(new.province_id,old.province_id),
nvl(new.trade_no,old.trade_no),
nvl(new.out_trade_no,old.out_trade_no),
nvl(new.payment_type,old.payment_type),
nvl(new.payment_amount,old.payment_amount),
nvl(new.payment_status,old.payment_status),
nvl(new.create_time,old.create_time),
nvl(new.callback_time,old.callback_time),
nvl(date_format(nvl(new.callback_time,old.callback_time),'yyyy-MM-dd'),'9999-99-99')
from
(
select id,
order_id,
user_id,
province_id,
trade_no,
out_trade_no,
payment_type,
payment_amount,
payment_status,
create_time,
callback_time
from dwd_payment_info
where dt = '9999-99-99'
)old
full outer join
(
select
pi.id,
pi.out_trade_no,
pi.order_id,
pi.user_id,
oi.province_id,
pi.payment_type,
pi.trade_no,
pi.payment_amount,
pi.payment_status,
pi.create_time,
pi.callback_time
from
(
select * from ods_payment_info where dt='2020-06-15'
)pi
left join
(
select id,province_id from ods_order_info where dt='2020-06-15'
)oi
on pi.order_id=oi.id
)new
on old.id=new.id;
6.2.8 退款事实表(累积型快照事实表)
- 建表语句
DROP TABLE IF EXISTS dwd_refund_payment;
CREATE EXTERNAL TABLE dwd_refund_payment (
`id` STRING COMMENT '编号',
`user_id` STRING COMMENT '用户ID',
`order_id` STRING COMMENT '订单编号',
`sku_id` STRING COMMENT 'SKU编号',
`province_id` STRING COMMENT '地区ID',
`trade_no` STRING COMMENT '交易编号',
`out_trade_no` STRING COMMENT '对外交易编号',
`payment_type` STRING COMMENT '支付类型',
`refund_amount` DECIMAL(16,2) COMMENT '退款金额',
`refund_status` STRING COMMENT '退款状态',
`create_time` STRING COMMENT '创建时间',--调用第三方支付接口的时间
`callback_time` STRING COMMENT '回调时间'--支付接口回调时间,即支付成功时间
) COMMENT '退款事实表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_refund_payment/'
TBLPROPERTIES ("parquet.compression"="lzo");
- 分区规划
- 数据装载
1). 首日装载
insert overwrite table dwd_refund_payment partition(dt)
select
rp.id,
user_id,
order_id,
sku_id,
province_id,
trade_no,
out_trade_no,
payment_type,
refund_amount,
refund_status,
create_time,
callback_time,
nvl(date_format(callback_time,'yyyy-MM-dd'),'9999-99-99')
from
(
select
id,
out_trade_no,
order_id,
sku_id,
payment_type,
trade_no,
refund_amount,
refund_status,
create_time,
callback_time
from ods_refund_payment
where dt='2020-06-14'
)rp
left join
(
select
id,
user_id,
province_id
from ods_order_info
where dt='2020-06-14'
)oi
on rp.order_id=oi.id;
2). 每日装载
insert overwrite table dwd_refund_payment partition(dt)
select
nvl(new.id,old.id),
nvl(new.user_id,old.user_id),
nvl(new.order_id,old.order_id),
nvl(new.sku_id,old.sku_id),
nvl(new.province_id,old.province_id),
nvl(new.trade_no,old.trade_no),
nvl(new.out_trade_no,old.out_trade_no),
nvl(new.payment_type,old.payment_type),
nvl(new.refund_amount,old.refund_amount),
nvl(new.refund_status,old.refund_status),
nvl(new.create_time,old.create_time),
nvl(new.callback_time,old.callback_time),
nvl(date_format(nvl(new.callback_time,old.callback_time),'yyyy-MM-dd'),'9999-99-99')
from
(
select
id,
user_id,
order_id,
sku_id,
province_id,
trade_no,
out_trade_no,
payment_type,
refund_amount,
refund_status,
create_time,
callback_time
from dwd_refund_payment
where dt='9999-99-99'
)old
full outer join
(
select
rp.id,
user_id,
order_id,
sku_id,
province_id,
trade_no,
out_trade_no,
payment_type,
refund_amount,
refund_status,
create_time,
callback_time
from
(
select
id,
out_trade_no,
order_id,
sku_id,
payment_type,
trade_no,
refund_amount,
refund_status,
create_time,
callback_time
from ods_refund_payment
where dt='2020-06-15'
)rp
left join
(
select
id,
user_id,
province_id
from ods_order_info
where dt='2020-06-15'
)oi
on rp.order_id=oi.id
)new
on old.id=new.id;
6.2.9 订单事实表(累积型快照事实表)
- 建表语句
DROP TABLE IF EXISTS dwd_order_info;
CREATE EXTERNAL TABLE dwd_order_info(
`id` STRING COMMENT '编号',
`order_status` STRING COMMENT '订单状态',
`user_id` STRING COMMENT '用户ID',
`province_id` STRING COMMENT '地区ID',
`payment_way` STRING COMMENT '支付方式',
`delivery_address` STRING COMMENT '邮寄地址',
`out_trade_no` STRING COMMENT '对外交易编号',
`tracking_no` STRING COMMENT '物流单号',
`create_time` STRING COMMENT '创建时间(未支付状态)',
`payment_time` STRING COMMENT '支付时间(已支付状态)',
`cancel_time` STRING COMMENT '取消时间(已取消状态)',
`finish_time` STRING COMMENT '完成时间(已完成状态)',
`refund_time` STRING COMMENT '退款时间(退款中状态)',
`refund_finish_time` STRING COMMENT '退款完成时间(退款完成状态)',
`expire_time` STRING COMMENT '过期时间',
`feight_fee` DECIMAL(16,2) COMMENT '运费',
`feight_fee_reduce` DECIMAL(16,2) COMMENT '运费减免',
`activity_reduce_amount` DECIMAL(16,2) COMMENT '活动减免',
`coupon_reduce_amount` DECIMAL(16,2) COMMENT '优惠券减免',
`original_amount` DECIMAL(16,2) COMMENT '订单原始价格',
`final_amount` DECIMAL(16,2) COMMENT '订单最终价格'
) COMMENT '订单事实表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_order_info/'
TBLPROPERTIES ("parquet.compression"="lzo");
- 分区规划
3. 数据装载
1). 首日装载
insert overwrite table dwd_order_info partition(dt)
select
oi.id,
oi.order_status,
oi.user_id,
oi.province_id,
oi.payment_way,
oi.delivery_address,
oi.out_trade_no,
oi.tracking_no,
oi.create_time,
times.ts['1002'] payment_time,
times.ts['1003'] cancel_time,
times.ts['1004'] finish_time,
times.ts['1005'] refund_time,
times.ts['1006'] refund_finish_time,
oi.expire_time,
feight_fee,
feight_fee_reduce,
activity_reduce_amount,
coupon_reduce_amount,
original_amount,
final_amount,
case
when times.ts['1003'] is not null then date_format(times.ts['1003'],'yyyy-MM-dd')
when times.ts['1004'] is not null and date_add(date_format(times.ts['1004'],'yyyy-MM-dd'),7)<='2020-06-14' and times.ts['1005'] is null then date_add(date_format(times.ts['1004'],'yyyy-MM-dd'),7)
when times.ts['1006'] is not null then date_format(times.ts['1006'],'yyyy-MM-dd')
when oi.expire_time is not null then date_format(oi.expire_time,'yyyy-MM-dd')
else '9999-99-99'
end
from
(
select
*
from ods_order_info
where dt='2020-06-14'
)oi
left join
(
select
order_id,
str_to_map(concat_ws(',',collect_set(concat(order_status,'=',operate_time))),',','=') ts
from ods_order_status_log
where dt='2020-06-14'
group by order_id
)times
on oi.id=times.order_id;
2). 每日装载
insert overwrite table dwd_order_info partition(dt)
select
nvl(new.id,old.id),
nvl(new.order_status,old.order_status),
nvl(new.user_id,old.user_id),
nvl(new.province_id,old.province_id),
nvl(new.payment_way,old.payment_way),
nvl(new.delivery_address,old.delivery_address),
nvl(new.out_trade_no,old.out_trade_no),
nvl(new.tracking_no,old.tracking_no),
nvl(new.create_time,old.create_time),
nvl(new.payment_time,old.payment_time),
nvl(new.cancel_time,old.cancel_time),
nvl(new.finish_time,old.finish_time),
nvl(new.refund_time,old.refund_time),
nvl(new.refund_finish_time,old.refund_finish_time),
nvl(new.expire_time,old.expire_time),
nvl(new.feight_fee,old.feight_fee),
nvl(new.feight_fee_reduce,old.feight_fee_reduce),
nvl(new.activity_reduce_amount,old.activity_reduce_amount),
nvl(new.coupon_reduce_amount,old.coupon_reduce_amount),
nvl(new.original_amount,old.original_amount),
nvl(new.final_amount,old.final_amount),
case
when new.cancel_time is not null then date_format(new.cancel_time,'yyyy-MM-dd')
when new.finish_time is not null and date_add(date_format(new.finish_time,'yyyy-MM-dd'),7)='2020-06-15' and new.refund_time is null then '2020-06-15'
when new.refund_finish_time is not null then date_format(new.refund_finish_time,'yyyy-MM-dd')
when new.expire_time is not null then date_format(new.expire_time,'yyyy-MM-dd')
else '9999-99-99'
end
from
(
select
id,
order_status,
user_id,
province_id,
payment_way,
delivery_address,
out_trade_no,
tracking_no,
create_time,
payment_time,
cancel_time,
finish_time,
refund_time,
refund_finish_time,
expire_time,
feight_fee,
feight_fee_reduce,
activity_reduce_amount,
coupon_reduce_amount,
original_amount,
final_amount
from dwd_order_info
where dt='9999-99-99'
)old
full outer join
(
select
oi.id,
oi.order_status,
oi.user_id,
oi.province_id,
oi.payment_way,
oi.delivery_address,
oi.out_trade_no,
oi.tracking_no,
oi.create_time,
times.ts['1002'] payment_time,
times.ts['1003'] cancel_time,
times.ts['1004'] finish_time,
times.ts['1005'] refund_time,
times.ts['1006'] refund_finish_time,
oi.expire_time,
feight_fee,
feight_fee_reduce,
activity_reduce_amount,
coupon_reduce_amount,
original_amount,
final_amount
from
(
select
*
from ods_order_info
where dt='2020-06-15'
)oi
left join
(
select
order_id,
str_to_map(concat_ws(',',collect_set(concat(order_status,'=',operate_time))),',','=') ts
from ods_order_status_log
where dt='2020-06-15'
group by order_id
)times
on oi.id=times.order_id
)new
on old.id=new.id;
6.2.10 DWD层业务数据首日装载脚本
- 在/home/xiaobai/bin目录下创建脚本
ods_to_dwd_db_init.sh
[xiaobai@hadoop102 bin]$ vim ods_to_dwd_db_init.sh
#!/bin/bash
APP=gmall
if [ -n "$2" ] ;then
do_date=$2
else
echo "请传入日期参数"
exit
fi
dwd_order_info=" set hive.exec.dynamic.partition.mode=nonstrict; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_order_info partition(dt) select oi.id, oi.order_status, oi.user_id, oi.province_id, oi.payment_way, oi.delivery_address, oi.out_trade_no, oi.tracking_no, oi.create_time, times.ts['1002'] payment_time, times.ts['1003'] cancel_time, times.ts['1004'] finish_time, times.ts['1005'] refund_time, times.ts['1006'] refund_finish_time, oi.expire_time, feight_fee, feight_fee_reduce, activity_reduce_amount, coupon_reduce_amount, original_amount, final_amount, case when times.ts['1003'] is not null then date_format(times.ts['1003'],'yyyy-MM-dd') when times.ts['1004'] is not null and date_add(date_format(times.ts['1004'],'yyyy-MM-dd'),7)<='$do_date' and times.ts['1005'] is null then date_add(date_format(times.ts['1004'],'yyyy-MM-dd'),7) when times.ts['1006'] is not null then date_format(times.ts['1006'],'yyyy-MM-dd') when oi.expire_time is not null then date_format(oi.expire_time,'yyyy-MM-dd') else '9999-99-99' end from ( select * from ${APP}.ods_order_info where dt='$do_date' )oi left join ( select order_id, str_to_map(concat_ws(',',collect_set(concat(order_status,'=',operate_time))),',','=') ts from ${APP}.ods_order_status_log where dt='$do_date' group by order_id )times on oi.id=times.order_id;"
dwd_order_detail=" set hive.exec.dynamic.partition.mode=nonstrict; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_order_detail partition(dt) select od.id, od.order_id, oi.user_id, od.sku_id, oi.province_id, oda.activity_id, oda.activity_rule_id, odc.coupon_id, od.create_time, od.source_type, od.source_id, od.sku_num, od.order_price*od.sku_num, od.split_activity_amount, od.split_coupon_amount, od.split_final_amount, date_format(create_time,'yyyy-MM-dd') from ( select * from ${APP}.ods_order_detail where dt='$do_date' )od left join ( select id, user_id, province_id from ${APP}.ods_order_info where dt='$do_date' )oi on od.order_id=oi.id left join ( select order_detail_id, activity_id, activity_rule_id from ${APP}.ods_order_detail_activity where dt='$do_date' )oda on od.id=oda.order_detail_id left join ( select order_detail_id, coupon_id from ${APP}.ods_order_detail_coupon where dt='$do_date' )odc on od.id=odc.order_detail_id;"
dwd_payment_info=" set hive.exec.dynamic.partition.mode=nonstrict; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_payment_info partition(dt) select pi.id, pi.order_id, pi.user_id, oi.province_id, pi.trade_no, pi.out_trade_no, pi.payment_type, pi.payment_amount, pi.payment_status, pi.create_time, pi.callback_time, nvl(date_format(pi.callback_time,'yyyy-MM-dd'),'9999-99-99') from ( select * from ${APP}.ods_payment_info where dt='$do_date' )pi left join ( select id,province_id from ${APP}.ods_order_info where dt='$do_date' )oi on pi.order_id=oi.id;"
dwd_cart_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_cart_info partition(dt='$do_date') select id, user_id, sku_id, source_type, source_id, cart_price, is_ordered, create_time, operate_time, order_time, sku_num from ${APP}.ods_cart_info where dt='$do_date';"
dwd_comment_info=" set hive.exec.dynamic.partition.mode=nonstrict; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_comment_info partition(dt) select id, user_id, sku_id, spu_id, order_id, appraise, create_time, date_format(create_time,'yyyy-MM-dd') from ${APP}.ods_comment_info where dt='$do_date'; "
dwd_favor_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_favor_info partition(dt='$do_date') select id, user_id, sku_id, spu_id, is_cancel, create_time, cancel_time from ${APP}.ods_favor_info where dt='$do_date';"
dwd_coupon_use=" set hive.exec.dynamic.partition.mode=nonstrict; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_coupon_use partition(dt) select id, coupon_id, user_id, order_id, coupon_status, get_time, using_time, used_time, expire_time, coalesce(date_format(used_time,'yyyy-MM-dd'),date_format(expire_time,'yyyy-MM-dd'),'9999-99-99') from ${APP}.ods_coupon_use where dt='$do_date';"
dwd_order_refund_info=" set hive.exec.dynamic.partition.mode=nonstrict; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_order_refund_info partition(dt) select ri.id, ri.user_id, ri.order_id, ri.sku_id, oi.province_id, ri.refund_type, ri.refund_num, ri.refund_amount, ri.refund_reason_type, ri.create_time, date_format(ri.create_time,'yyyy-MM-dd') from ( select * from ${APP}.ods_order_refund_info where dt='$do_date' )ri left join ( select id,province_id from ${APP}.ods_order_info where dt='$do_date' )oi on ri.order_id=oi.id;"
dwd_refund_payment=" set hive.exec.dynamic.partition.mode=nonstrict; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_refund_payment partition(dt) select rp.id, user_id, order_id, sku_id, province_id, trade_no, out_trade_no, payment_type, refund_amount, refund_status, create_time, callback_time, nvl(date_format(callback_time,'yyyy-MM-dd'),'9999-99-99') from ( select id, out_trade_no, order_id, sku_id, payment_type, trade_no, refund_amount, refund_status, create_time, callback_time from ${APP}.ods_refund_payment where dt='$do_date' )rp left join ( select id, user_id, province_id from ${APP}.ods_order_info where dt='$do_date' )oi on rp.order_id=oi.id;"
case $1 in
dwd_order_info )
hive -e "$dwd_order_info"
;;
dwd_order_detail )
hive -e "$dwd_order_detail"
;;
dwd_payment_info )
hive -e "$dwd_payment_info"
;;
dwd_cart_info )
hive -e "$dwd_cart_info"
;;
dwd_comment_info )
hive -e "$dwd_comment_info"
;;
dwd_favor_info )
hive -e "$dwd_favor_info"
;;
dwd_coupon_use )
hive -e "$dwd_coupon_use"
;;
dwd_order_refund_info )
hive -e "$dwd_order_refund_info"
;;
dwd_refund_payment )
hive -e "$dwd_refund_payment"
;;
all )
hive -e "$dwd_order_info$dwd_order_detail$dwd_payment_info$dwd_cart_info$dwd_comment_info$dwd_favor_info$dwd_coupon_use$dwd_order_refund_info$dwd_refund_payment"
;;
esac
- 权限
[xiaobai@hadoop102 bin]$ chmod +x ods_to_dwd_db_init.sh
- 执行脚本
ods_to_dwd_db_init.sh all 2020-06-14
6.2.11 DWD层业务数据每日装载脚本
- 编写脚本
在/home/xiaobai/bin目录下创建脚本ods_to_dwd_db.sh
[xiaobai@hadoop102 bin]$ vim ods_to_dwd_db.sh
#!/bin/bash
APP=gmall
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$2" ] ;then
do_date=$2
else
do_date=`date -d "-1 day" +%F`
fi
# 假设某累积型快照事实表,某天所有的业务记录全部完成,则会导致9999-99-99分区的数据未被覆盖,从而导致数据重复,该函数根据9999-99-99分区的数据的末次修改时间判断其是否被覆盖了,如果未被覆盖,就手动清理
clear_data(){
current_date=`date +%F`
current_date_timestamp=`date -d "$current_date" +%s`
last_modified_date=`hadoop fs -ls /warehouse/gmall/dwd/$1 | grep '9999-99-99' | awk '{print $6}'`
last_modified_date_timestamp=`date -d "$last_modified_date" +%s`
if [[ $last_modified_date_timestamp -lt $current_date_timestamp ]]; then
echo "clear table $1 partition(dt=9999-99-99)"
hadoop fs -rm -r -f /warehouse/gmall/dwd/$1/dt=9999-99-99/*
fi
}
dwd_order_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; set hive.exec.dynamic.partition.mode=nonstrict; insert overwrite table ${APP}.dwd_order_info partition(dt) select nvl(new.id,old.id), nvl(new.order_status,old.order_status), nvl(new.user_id,old.user_id), nvl(new.province_id,old.province_id), nvl(new.payment_way,old.payment_way), nvl(new.delivery_address,old.delivery_address), nvl(new.out_trade_no,old.out_trade_no), nvl(new.tracking_no,old.tracking_no), nvl(new.create_time,old.create_time), nvl(new.payment_time,old.payment_time), nvl(new.cancel_time,old.cancel_time), nvl(new.finish_time,old.finish_time), nvl(new.refund_time,old.refund_time), nvl(new.refund_finish_time,old.refund_finish_time), nvl(new.expire_time,old.expire_time), nvl(new.feight_fee,old.feight_fee), nvl(new.feight_fee_reduce,old.feight_fee_reduce), nvl(new.activity_reduce_amount,old.activity_reduce_amount), nvl(new.coupon_reduce_amount,old.coupon_reduce_amount), nvl(new.original_amount,old.original_amount), nvl(new.final_amount,old.final_amount), case when new.cancel_time is not null then date_format(new.cancel_time,'yyyy-MM-dd') when new.finish_time is not null and date_add(date_format(new.finish_time,'yyyy-MM-dd'),7)='$do_date' and new.refund_time is null then '$do_date' when new.refund_finish_time is not null then date_format(new.refund_finish_time,'yyyy-MM-dd') when new.expire_time is not null then date_format(new.expire_time,'yyyy-MM-dd') else '9999-99-99' end from ( select id, order_status, user_id, province_id, payment_way, delivery_address, out_trade_no, tracking_no, create_time, payment_time, cancel_time, finish_time, refund_time, refund_finish_time, expire_time, feight_fee, feight_fee_reduce, activity_reduce_amount, coupon_reduce_amount, original_amount, final_amount from ${APP}.dwd_order_info where dt='9999-99-99' )old full outer join ( select oi.id, oi.order_status, oi.user_id, oi.province_id, oi.payment_way, oi.delivery_address, oi.out_trade_no, oi.tracking_no, oi.create_time, times.ts['1002'] payment_time, times.ts['1003'] cancel_time, times.ts['1004'] finish_time, times.ts['1005'] refund_time, times.ts['1006'] refund_finish_time, oi.expire_time, feight_fee, feight_fee_reduce, activity_reduce_amount, coupon_reduce_amount, original_amount, final_amount from ( select * from ${APP}.ods_order_info where dt='$do_date' )oi left join ( select order_id, str_to_map(concat_ws(',',collect_set(concat(order_status,'=',operate_time))),',','=') ts from ${APP}.ods_order_status_log where dt='$do_date' group by order_id )times on oi.id=times.order_id )new on old.id=new.id;"
dwd_order_detail=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_order_detail partition(dt='$do_date') select od.id, od.order_id, oi.user_id, od.sku_id, oi.province_id, oda.activity_id, oda.activity_rule_id, odc.coupon_id, od.create_time, od.source_type, od.source_id, od.sku_num, od.order_price*od.sku_num, od.split_activity_amount, od.split_coupon_amount, od.split_final_amount from ( select * from ${APP}.ods_order_detail where dt='$do_date' )od left join ( select id, user_id, province_id from ${APP}.ods_order_info where dt='$do_date' )oi on od.order_id=oi.id left join ( select order_detail_id, activity_id, activity_rule_id from ${APP}.ods_order_detail_activity where dt='$do_date' )oda on od.id=oda.order_detail_id left join ( select order_detail_id, coupon_id from ${APP}.ods_order_detail_coupon where dt='$do_date' )odc on od.id=odc.order_detail_id;"
dwd_payment_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; set hive.exec.dynamic.partition.mode=nonstrict; insert overwrite table ${APP}.dwd_payment_info partition(dt) select nvl(new.id,old.id), nvl(new.order_id,old.order_id), nvl(new.user_id,old.user_id), nvl(new.province_id,old.province_id), nvl(new.trade_no,old.trade_no), nvl(new.out_trade_no,old.out_trade_no), nvl(new.payment_type,old.payment_type), nvl(new.payment_amount,old.payment_amount), nvl(new.payment_status,old.payment_status), nvl(new.create_time,old.create_time), nvl(new.callback_time,old.callback_time), nvl(date_format(nvl(new.callback_time,old.callback_time),'yyyy-MM-dd'),'9999-99-99') from ( select id, order_id, user_id, province_id, trade_no, out_trade_no, payment_type, payment_amount, payment_status, create_time, callback_time from ${APP}.dwd_payment_info where dt = '9999-99-99' )old full outer join ( select pi.id, pi.out_trade_no, pi.order_id, pi.user_id, oi.province_id, pi.payment_type, pi.trade_no, pi.payment_amount, pi.payment_status, pi.create_time, pi.callback_time from ( select * from ${APP}.ods_payment_info where dt='$do_date' )pi left join ( select id,province_id from ${APP}.ods_order_info where dt='$do_date' )oi on pi.order_id=oi.id )new on old.id=new.id;"
dwd_cart_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_cart_info partition(dt='$do_date') select id, user_id, sku_id, source_type, source_id, cart_price, is_ordered, create_time, operate_time, order_time, sku_num from ${APP}.ods_cart_info where dt='$do_date';"
dwd_comment_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_comment_info partition(dt='$do_date') select id, user_id, sku_id, spu_id, order_id, appraise, create_time from ${APP}.ods_comment_info where dt='$do_date';"
dwd_favor_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_favor_info partition(dt='$do_date') select id, user_id, sku_id, spu_id, is_cancel, create_time, cancel_time from ${APP}.ods_favor_info where dt='$do_date';"
dwd_coupon_use=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; set hive.exec.dynamic.partition.mode=nonstrict; insert overwrite table ${APP}.dwd_coupon_use partition(dt) select nvl(new.id,old.id), nvl(new.coupon_id,old.coupon_id), nvl(new.user_id,old.user_id), nvl(new.order_id,old.order_id), nvl(new.coupon_status,old.coupon_status), nvl(new.get_time,old.get_time), nvl(new.using_time,old.using_time), nvl(new.used_time,old.used_time), nvl(new.expire_time,old.expire_time), coalesce(date_format(nvl(new.used_time,old.used_time),'yyyy-MM-dd'),date_format(nvl(new.expire_time,old.expire_time),'yyyy-MM-dd'),'9999-99-99') from ( select id, coupon_id, user_id, order_id, coupon_status, get_time, using_time, used_time, expire_time from ${APP}.dwd_coupon_use where dt='9999-99-99' )old full outer join ( select id, coupon_id, user_id, order_id, coupon_status, get_time, using_time, used_time, expire_time from ${APP}.ods_coupon_use where dt='$do_date' )new on old.id=new.id;"
dwd_order_refund_info=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_order_refund_info partition(dt='$do_date') select ri.id, ri.user_id, ri.order_id, ri.sku_id, oi.province_id, ri.refund_type, ri.refund_num, ri.refund_amount, ri.refund_reason_type, ri.create_time from ( select * from ${APP}.ods_order_refund_info where dt='$do_date' )ri left join ( select id,province_id from ${APP}.ods_order_info where dt='$do_date' )oi on ri.order_id=oi.id;"
dwd_refund_payment=" set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; set hive.exec.dynamic.partition.mode=nonstrict; insert overwrite table ${APP}.dwd_refund_payment partition(dt) select nvl(new.id,old.id), nvl(new.user_id,old.user_id), nvl(new.order_id,old.order_id), nvl(new.sku_id,old.sku_id), nvl(new.province_id,old.province_id), nvl(new.trade_no,old.trade_no), nvl(new.out_trade_no,old.out_trade_no), nvl(new.payment_type,old.payment_type), nvl(new.refund_amount,old.refund_amount), nvl(new.refund_status,old.refund_status), nvl(new.create_time,old.create_time), nvl(new.callback_time,old.callback_time), nvl(date_format(nvl(new.callback_time,old.callback_time),'yyyy-MM-dd'),'9999-99-99') from ( select id, user_id, order_id, sku_id, province_id, trade_no, out_trade_no, payment_type, refund_amount, refund_status, create_time, callback_time from ${APP}.dwd_refund_payment where dt='9999-99-99' )old full outer join ( select rp.id, user_id, order_id, sku_id, province_id, trade_no, out_trade_no, payment_type, refund_amount, refund_status, create_time, callback_time from ( select id, out_trade_no, order_id, sku_id, payment_type, trade_no, refund_amount, refund_status, create_time, callback_time from ${APP}.ods_refund_payment where dt='$do_date' )rp left join ( select id, user_id, province_id from ${APP}.ods_order_info where dt='$do_date' )oi on rp.order_id=oi.id )new on old.id=new.id;"
case $1 in
dwd_order_info )
hive -e "$dwd_order_info"
clear_data dwd_order_info
;;
dwd_order_detail )
hive -e "$dwd_order_detail"
;;
dwd_payment_info )
hive -e "$dwd_payment_info"
clear_data dwd_payment_info
;;
dwd_cart_info )
hive -e "$dwd_cart_info"
;;
dwd_comment_info )
hive -e "$dwd_comment_info"
;;
dwd_favor_info )
hive -e "$dwd_favor_info"
;;
dwd_coupon_use )
hive -e "$dwd_coupon_use"
clear_data dwd_coupon_use
;;
dwd_order_refund_info )
hive -e "$dwd_order_refund_info"
;;
dwd_refund_payment )
hive -e "$dwd_refund_payment"
clear_data dwd_refund_payment
;;
all )
hive -e "$dwd_order_info$dwd_order_detail$dwd_payment_info$dwd_cart_info$dwd_comment_info$dwd_favor_info$dwd_coupon_use$dwd_order_refund_info$dwd_refund_payment"
clear_data dwd_order_info
clear_data dwd_payment_info
clear_data dwd_coupon_use
clear_data dwd_refund_payment
;;
esac
- 增加脚本执行权限
[xiaobai@hadoop102 bin]$ chmod 777 ods_to_dwd_db.sh
- 执行脚本
ods_to_dwd_db.sh all 2020-06-14
发布者:全栈程序员-用户IM,转载请注明出处:https://javaforall.cn/153211.html原文链接:https://javaforall.cn
【正版授权,激活自己账号】: Jetbrains全家桶Ide使用,1年售后保障,每天仅需1毛
【官方授权 正版激活】: 官方授权 正版激活 支持Jetbrains家族下所有IDE 使用个人JB账号...