Hive: counting users who logged in on 3 consecutive days within the last 7 days
Requirement: count the number of users who logged in on 3 consecutive days within the last 7 days.
Welcome to follow my Toutiao, WeChat public account, and Zhihu account: 仰望夜空一万次
Base data:
Note that the user with uid 1 did not log in on '2020-09-18'.
val df = Seq(
  ("2020-09-21", 1), ("2020-09-20", 1), ("2020-09-19", 1),
  ("2020-09-17", 1), ("2020-09-16", 1), ("2020-09-15", 1),
  ("2020-09-20", 2), ("2020-09-19", 2),
  ("2020-09-20", 3), ("2020-09-19", 3), ("2020-09-18", 3)
).toDF("dt", "uid")
df.createTempView("tmp")
Step 1: use rank() over to partition by user and order by date.
val sql2 = s"""| select uid, dt, rank() over(partition by uid order by dt) rank
               | from tmp
               | where dt >= date_add('2020-09-21', -6) and dt <= '2020-09-21'
               |""".stripMargin
println(sql2)
ss.sql(sql2).show(false)
Result:
+---+----------+----+
|uid|dt |rank|
+---+----------+----+
|1 |2020-09-15|1 |
|1 |2020-09-16|2 |
|1 |2020-09-17|3 |
|1 |2020-09-19|4 |
|1 |2020-09-20|5 |
|1 |2020-09-21|6 |
|3 |2020-09-18|1 |
|3 |2020-09-19|2 |
|3 |2020-09-20|3 |
|2 |2020-09-19|1 |
|2 |2020-09-20|2 |
+---+----------+----+
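For readers without a Spark shell at hand, this partition-and-rank step can be sketched in plain Python. This is an illustration of the logic only, not the article's code; the data literal mirrors the article's `Seq`:

```python
from itertools import groupby

# (dt, uid) login records mirroring the article's sample: uid 1 skips 2020-09-18.
logins = [
    ("2020-09-21", 1), ("2020-09-20", 1), ("2020-09-19", 1),
    ("2020-09-17", 1), ("2020-09-16", 1), ("2020-09-15", 1),
    ("2020-09-20", 2), ("2020-09-19", 2),
    ("2020-09-20", 3), ("2020-09-19", 3), ("2020-09-18", 3),
]

# rank() over (partition by uid order by dt):
# sort by (uid, dt), then number each user's dates from 1.
# ISO date strings sort lexicographically in chronological order.
ranked = []
for uid, grp in groupby(sorted(logins, key=lambda r: (r[1], r[0])), key=lambda r: r[1]):
    for rank, (dt, _) in enumerate(grp, start=1):
        ranked.append((uid, dt, rank))

for row in ranked:
    print(row)  # e.g. (1, '2020-09-15', 1) ... (3, '2020-09-20', 3)
```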
Step 2: use date_sub to compute date_dif. Rows that share the same date_dif value belong to one run of consecutive logins.
s"""| select uid, date_sub(dt, rank) date_dif
    | from
    | (select uid, dt, rank() over(partition by uid order by dt) rank
    |  from tmp
    |  where dt >= date_add('2020-09-21', -6) and dt <= '2020-09-21'
    | ) t1""".stripMargin
Result:
+---+----------+
|uid|date_dif |
+---+----------+
|1 |2020-09-14|
|1 |2020-09-14|
|1 |2020-09-14|
|1 |2020-09-15|
|1 |2020-09-15|
|1 |2020-09-15|
|3 |2020-09-17|
|3 |2020-09-17|
|3 |2020-09-17|
|2 |2020-09-18|
|2 |2020-09-18|
+---+----------+
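The arithmetic behind date_dif can be checked directly: subtracting the rank (in days) from each login date collapses any run of consecutive dates onto a single anchor date. A minimal Python check (the helper name `date_minus_rank` is mine, not the article's):

```python
from datetime import date, timedelta

def date_minus_rank(dt_str, rank):
    """Mimics Hive's date_sub(dt, rank) on an ISO date string."""
    return (date.fromisoformat(dt_str) - timedelta(days=rank)).isoformat()

# uid 1's first three logins are consecutive, so all map to the same anchor:
print(date_minus_rank("2020-09-15", 1))  # 2020-09-14
print(date_minus_rank("2020-09-16", 2))  # 2020-09-14
print(date_minus_rank("2020-09-17", 3))  # 2020-09-14
# After the 2020-09-18 gap the anchor shifts, starting a new group:
print(date_minus_rank("2020-09-19", 4))  # 2020-09-15
```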
Notice that the user with uid 1 has two separate runs of 3 consecutive login days within the last seven days.
Step 3: get the uids that logged in on 3 consecutive days.
val sql2 = s"""| select uid
    | from
    | (
    | select uid, date_sub(dt, rank) date_dif
    | from
    | (select uid, dt, rank() over(partition by uid order by dt) rank
    |  from tmp
    |  where dt >= date_add('2020-09-21', -6) and dt <= '2020-09-21'
    | ) t1
    | ) t2
    | group by uid, date_dif
    | having count(*) >= 3""".stripMargin
Result:
+---+
|uid|
+---+
|1 |
|1 |
|3 |
+---+
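Steps 2 and 3 combined amount to counting rows per (uid, date_dif) key and keeping the keys seen at least three times. A Python sketch over the ranked rows from step 1 (again an illustration of the logic, not the article's code):

```python
from collections import Counter
from datetime import date, timedelta

# (uid, dt, rank) rows as produced in step 1.
ranked = [
    (1, "2020-09-15", 1), (1, "2020-09-16", 2), (1, "2020-09-17", 3),
    (1, "2020-09-19", 4), (1, "2020-09-20", 5), (1, "2020-09-21", 6),
    (2, "2020-09-19", 1), (2, "2020-09-20", 2),
    (3, "2020-09-18", 1), (3, "2020-09-19", 2), (3, "2020-09-20", 3),
]

# Group key: (uid, dt - rank); a streak of n consecutive days gives n equal keys.
keys = Counter(
    (uid, (date.fromisoformat(dt) - timedelta(days=rank)).isoformat())
    for uid, dt, rank in ranked
)

# having count(*) >= 3: keep uids owning at least one streak of length >= 3.
uids = [uid for (uid, _), n in keys.items() if n >= 3]
print(uids)  # [1, 1, 3] -- uid 1 has two separate 3-day streaks
```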
Step 4: deduplicate the uids. Here group by is used instead of count(distinct uid), so that grouping on uid can be parallelized.
In Hive 1.1, the EXPLAIN output of these two queries is identical: both run in only one stage.
Check the execution plan to verify whether your version supports the same optimization.
val sql2 = s"""| select uid
               | from
               | (
               | select uid
               | from
               | (
               | select uid, date_sub(dt, rank) date_dif
               | from
               | (select uid, dt, rank() over(partition by uid order by dt) rank
               |  from tmp
               |  where dt >= date_add('2020-09-21', -6) and dt <= '2020-09-21'
               | ) t1
               | ) t2
               | group by uid, date_dif
               | having count(*) >= 3
               | ) t3
               | group by uid""".stripMargin
Result:
+---+
|uid|
+---+
|1 |
|3 |
+---+
Step 5: produce the final result.
compute_date is the statistics date, compute_period is the statistics window, and continue_3_days_number is the total number of users with 3 consecutive login days within that window.
val sql2 = s"""|select
               |'2020-09-21' as compute_date,
               |concat(date_add('2020-09-21', -6), '_', '2020-09-21') as compute_period,
               |count(*) as continue_3_days_number
               |from
               |(
               | select uid
               | from
               | (
               | select uid
               | from
               | (
               | select uid, date_sub(dt, rank) date_dif
               | from
               | (select uid, dt, rank() over(partition by uid order by dt) rank
               |  from tmp
               |  where dt >= date_add('2020-09-21', -6) and dt <= '2020-09-21'
               | ) t1
               | ) t2
               | group by uid, date_dif
               | having count(*) >= 3
               | ) t3
               | group by uid
               |) t4""".stripMargin
Result:
+------------+---------------------+----------------------+
|compute_date|compute_period |continue_3_days_number|
+------------+---------------------+----------------------+
|2020-09-21 |2020-09-15_2020-09-21|2 |
+------------+---------------------+----------------------+
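The whole chain can be cross-checked end to end in plain Python (a sketch of the same logic under the article's sample data; names are mine):

```python
from collections import Counter
from datetime import date, timedelta

# (dt, uid) login records mirroring the article's sample.
logins = [
    ("2020-09-21", 1), ("2020-09-20", 1), ("2020-09-19", 1),
    ("2020-09-17", 1), ("2020-09-16", 1), ("2020-09-15", 1),
    ("2020-09-20", 2), ("2020-09-19", 2),
    ("2020-09-20", 3), ("2020-09-19", 3), ("2020-09-18", 3),
]

# Step 1: rank per user -- sort by (uid, dt), number each user's dates from 1.
ranked = []
last_uid, rank = None, 0
for dt, uid in sorted(logins, key=lambda r: (r[1], r[0])):
    rank = rank + 1 if uid == last_uid else 1
    last_uid = uid
    ranked.append((uid, dt, rank))

# Steps 2-3: rows sharing (uid, dt - rank) form one consecutive streak;
# keep streaks of length >= 3.
streaks = Counter(
    (uid, (date.fromisoformat(dt) - timedelta(days=rank)).isoformat())
    for uid, dt, rank in ranked
)
# Steps 4-5: deduplicate uids, then count them.
qualified = {uid for (uid, _), n in streaks.items() if n >= 3}
print(len(qualified))  # 2
```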
An alternative for step 5: the final query written with count(distinct).
val sql2 = s"""|select
               |'2020-09-21' as compute_date,
               |concat(date_add('2020-09-21', -6), '_', '2020-09-21') as compute_period,
               |count(distinct uid) as continue_3_days_number
               |from
               |(
               | select uid
               | from
               | (
               | select uid, date_sub(dt, rank) date_dif
               | from
               | (select uid, dt, rank() over(partition by uid order by dt) rank
               |  from tmp
               |  where dt >= date_add('2020-09-21', -6) and dt <= '2020-09-21'
               | ) t1
               | ) t2
               | group by uid, date_dif
               | having count(*) >= 3
               | ) t3""".stripMargin
println(sql2)
ss.sql(sql2).show(false)