Hive统计最近七天内连续登陆3天的用户数量-阿南达文事网

Hive统计最近七天内连续登陆3天的用户数量

需求：统计最近七天内连续登陆3天的用户数量

欢迎关注今日头条号、微信公众号、知乎号：仰望夜空一万次

基础数据：

uid为1的用户在‘2020-09-18’号没有登陆。

val df = Seq(("2020-09-21",1),("2020-09-20",1),("2020-09-19",1),("2020-09-17",1),("2020-09-16",1),("2020-09-15",1),("2020-09-20",2),("2020-09-19",2),("2020-09-20",3),("2020-09-19",3),("2020-09-18",3)
).toDF("dt", "uid")
df.createTempView("tmp")

第一步，使用rank over函数，根据用户分组，并且按时间排序

val sql2=s"""| select  uid, dt,rank() over(partition by uid order by dt) rank| from tmp| where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'|""".stripMargin
println(sql2)
ss.sql(sql2).show(false)结果：
+---+----------+----+
|uid|dt        |rank|
+---+----------+----+
|1  |2020-09-15|1   |
|1  |2020-09-16|2   |
|1  |2020-09-17|3   |
|1  |2020-09-19|4   |
|1  |2020-09-20|5   |
|1  |2020-09-21|6   |
|3  |2020-09-18|1   |
|3  |2020-09-19|2   |
|3  |2020-09-20|3   |
|2  |2020-09-19|1   |
|2  |2020-09-20|2   |
+---+----------+----+

第二步，使用date_sub函数，获得date_dif的值，此值相同表示发生连续登陆行为。

s"""| select uid,date_sub(dt,rank) date_dif|  from|      (select  uid, dt,rank() over(partition by uid order by dt) rank|        from tmp|        where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'|      ) t1""".stripMargin结果： +---+----------+
|uid|date_dif  |
+---+----------+
|1  |2020-09-14|
|1  |2020-09-14|
|1  |2020-09-14|
|1  |2020-09-15|
|1  |2020-09-15|
|1  |2020-09-15|
|3  |2020-09-17|
|3  |2020-09-17|
|3  |2020-09-17|
|2  |2020-09-18|
|2  |2020-09-18|
+---+----------+

发现uid为1的用户，在最近七天内有两段连续3天登陆的行为。

第三步，获取连续3天登陆的uid。

val sql2=s"""| select uid|      from|          (|            select uid,date_sub(dt,rank) date_dif|            from|                 (select  uid, dt,rank() over(partition by uid order by dt) rank|                   from tmp|                   where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'|                 ) t1|          ) t2|     group by uid,date_dif|     having count(*)>=3""".stripMargin结果： 
|uid|
+---+
|1  |
|1  |
|3  |
+---

第四步，对uid进行去重，此处使用group by，而非count（distinct uid），为并行分组uid提供保障。

in Hive1.1, these two queries' explains have the same result. Both of them have ONLY ONE STAGE.

需要查看执行计划，看看版本是否支持。

|   select uid
|   from
|   (
|      select uid
|      from
|          (
|            select uid,date_sub(dt,rank) date_dif
|            from
|                 (select  uid, dt,rank() over(partition by uid order by dt) rank
|                   from tmp
|                   where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'
|                 ) t1
|          ) t2
|     group by uid,date_dif
|     having count(*)>=3
|   ) t3
|   group by uid+---+
|uid|
+---+
|1  |
|3  |
+---+

第五步，获取最终数据结果。

compute_date表示统计日期，compute_period表示统计周期，continue_3_days_number表示统计周期内连续登陆3天的总用户数量。

val sql2=s"""|select|'2020-09-21' as compute_date,|concat(date_add('2020-09-21',-6),'_','2020-09-21') as compute_period,|count(*) as continue_3_days_number|from|(|   select uid|   from|   (|      select uid|      from|          (|            select uid,date_sub(dt,rank) date_dif|            from|                 (select  uid, dt,rank() over(partition by uid order by dt) rank|                   from tmp|                   where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'|                 ) t1|          ) t2|     group by uid,date_dif|     having count(*)>=3|   ) t3|   group by uid|) t4""".stripMargin+------------+---------------------+----------------------+
|compute_date|compute_period       |continue_3_days_number|
+------------+---------------------+----------------------+
|2020-09-21  |2020-09-15_2020-09-21|2                     |
+------------+---------------------+----------------------+

第五步的另外一种写法，使用count(distinct)的最终写法

val sql2=s"""|select|'2020-09-21' as compute_date,|concat(date_add('2020-09-21',-6),'_','2020-09-21') as compute_period,|count(distinct uid) as continue_3_days_number|   from|   (|      select uid|      from|          (|            select uid,date_sub(dt,rank) date_dif|            from|                 (select  uid, dt,rank() over(partition by uid order by dt) rank|                   from tmp|                   where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'|                 ) t1|          ) t2|     group by uid,date_dif|     having count(*)>=3|   ) t3""".stripMargin
println(sql2)
ss.sql(sql2).show(false)

Hive统计最近七天内连续登陆3天的用户数量

需求：统计最近七天内连续登陆3天的用户数量

欢迎关注今日头条号、微信公众号、知乎号：仰望夜空一万次

基础数据：

uid为1的用户在‘2020-09-18’号没有登陆。

val df = Seq(("2020-09-21",1),("2020-09-20",1),("2020-09-19",1),("2020-09-17",1),("2020-09-16",1),("2020-09-15",1),("2020-09-20",2),("2020-09-19",2),("2020-09-20",3),("2020-09-19",3),("2020-09-18",3)
).toDF("dt", "uid")
df.createTempView("tmp")

第一步，使用rank over函数，根据用户分组，并且按时间排序

val sql2=s"""| select  uid, dt,rank() over(partition by uid order by dt) rank| from tmp| where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'|""".stripMargin
println(sql2)
ss.sql(sql2).show(false)结果：
+---+----------+----+
|uid|dt        |rank|
+---+----------+----+
|1  |2020-09-15|1   |
|1  |2020-09-16|2   |
|1  |2020-09-17|3   |
|1  |2020-09-19|4   |
|1  |2020-09-20|5   |
|1  |2020-09-21|6   |
|3  |2020-09-18|1   |
|3  |2020-09-19|2   |
|3  |2020-09-20|3   |
|2  |2020-09-19|1   |
|2  |2020-09-20|2   |
+---+----------+----+

第二步，使用date_sub函数，获得date_dif的值，此值相同表示发生连续登陆行为。

s"""| select uid,date_sub(dt,rank) date_dif|  from|      (select  uid, dt,rank() over(partition by uid order by dt) rank|        from tmp|        where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'|      ) t1""".stripMargin结果： +---+----------+
|uid|date_dif  |
+---+----------+
|1  |2020-09-14|
|1  |2020-09-14|
|1  |2020-09-14|
|1  |2020-09-15|
|1  |2020-09-15|
|1  |2020-09-15|
|3  |2020-09-17|
|3  |2020-09-17|
|3  |2020-09-17|
|2  |2020-09-18|
|2  |2020-09-18|
+---+----------+

发现uid为1的用户，在最近七天内有两段连续3天登陆的行为。

第三步，获取连续3天登陆的uid。

val sql2=s"""| select uid|      from|          (|            select uid,date_sub(dt,rank) date_dif|            from|                 (select  uid, dt,rank() over(partition by uid order by dt) rank|                   from tmp|                   where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'|                 ) t1|          ) t2|     group by uid,date_dif|     having count(*)>=3""".stripMargin结果： 
|uid|
+---+
|1  |
|1  |
|3  |
+---

第四步，对uid进行去重，此处使用group by，而非count（distinct uid），为并行分组uid提供保障。

in Hive1.1, these two queries' explains have the same result. Both of them have ONLY ONE STAGE.

需要查看执行计划，看看版本是否支持。

|   select uid
|   from
|   (
|      select uid
|      from
|          (
|            select uid,date_sub(dt,rank) date_dif
|            from
|                 (select  uid, dt,rank() over(partition by uid order by dt) rank
|                   from tmp
|                   where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'
|                 ) t1
|          ) t2
|     group by uid,date_dif
|     having count(*)>=3
|   ) t3
|   group by uid+---+
|uid|
+---+
|1  |
|3  |
+---+

第五步，获取最终数据结果。

compute_date表示统计日期，compute_period表示统计周期，continue_3_days_number表示统计周期内连续登陆3天的总用户数量。

val sql2=s"""|select|'2020-09-21' as compute_date,|concat(date_add('2020-09-21',-6),'_','2020-09-21') as compute_period,|count(*) as continue_3_days_number|from|(|   select uid|   from|   (|      select uid|      from|          (|            select uid,date_sub(dt,rank) date_dif|            from|                 (select  uid, dt,rank() over(partition by uid order by dt) rank|                   from tmp|                   where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'|                 ) t1|          ) t2|     group by uid,date_dif|     having count(*)>=3|   ) t3|   group by uid|) t4""".stripMargin+------------+---------------------+----------------------+
|compute_date|compute_period       |continue_3_days_number|
+------------+---------------------+----------------------+
|2020-09-21  |2020-09-15_2020-09-21|2                     |
+------------+---------------------+----------------------+

第五步的另外一种写法，使用count(distinct)的最终写法

val sql2=s"""|select|'2020-09-21' as compute_date,|concat(date_add('2020-09-21',-6),'_','2020-09-21') as compute_period,|count(distinct uid) as continue_3_days_number|   from|   (|      select uid|      from|          (|            select uid,date_sub(dt,rank) date_dif|            from|                 (select  uid, dt,rank() over(partition by uid order by dt) rank|                   from tmp|                   where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21'|                 ) t1|          ) t2|     group by uid,date_dif|     having count(*)>=3|   ) t3""".stripMargin
println(sql2)
ss.sql(sql2).show(false)

Hive统计最近七天内连续登陆3天的用户数量

Hive统计最近七天内连续登陆3天的用户数量

Hive统计最近七天内连续登陆3天的用户数量

发布评论取消回复

最近发表

相关推荐

标签列表

Hive统计最近七天内连续登陆3天的用户数量

Hive统计最近七天内连续登陆3天的用户数量

Hive统计最近七天内连续登陆3天的用户数量

发布评论 取消回复

最近发表

相关推荐

标签列表

发布评论取消回复