Consecutive-Run Problems in Data Analysis

wang-possible
2020-09-17 13:09:00
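Summary

For an app, user activity is a very important metric when planning campaigns. For a courier, the number of consecutive days worked within a week or a month is one way to gauge how hard they work. These scenarios reduce to the same problem: counting how many times in a row an event occurs within a given time window.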

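Scenarios

For an app, we naturally want users on it all the time. The more often a user opens the app, the more active that user is. For the operations team, highly active users need little extra effort to retain; it is the moderately active users, and even those who only log in occasionally, who take real effort to keep. That is why activity is such an important metric when planning campaigns.

For a courier, one way to gauge how hard they work is the number of consecutive days they showed up within a week or a month.

For a supermarket, the worry is that a customer comes to buy something and the shelf is empty. A large supermarket has a great many items to restock, so the maximum number of consecutive out-of-stock days can be used to rank restocking urgency. Strictly speaking, an out-of-stock item is not necessarily in demand, but here we assume every item is bought every day.

All three scenarios reduce to one problem: counting how many times in a row an event occurs within a given time window.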
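As a minimal sketch of that abstract problem, the Python snippet below counts the longest streak in a sequence of daily flags. The function name `longest_run` and the attendance data are hypothetical and only serve as an illustration.

```python
def longest_run(flags):
    """Return the length of the longest run of truthy values in flags."""
    best = current = 0
    for flag in flags:
        # Extend the current streak on a hit, reset it on a miss.
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


# Hypothetical attendance record for one week (1 = present, 0 = absent).
attendance = [1, 1, 0, 1, 1, 1, 0]
print(longest_run(attendance))  # 3
```

The same loop works for active days of an app user or out-of-stock days of a product, as long as there is one flag per day in date order.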

To simplify, find the start and end dates of each holiday period in the data below:

[Image: input table of dates with a holiday flag]

The result looks like this:

[Image: result table with the start and end date of each holiday period]

Implementation

Implementation 1

    with a as (
        select "2014-01-01" as date_, "1" as is_holaday
        union all select "2014-01-02" as date_, "0" as is_holaday
        union all select "2014-01-03" as date_, "0" as is_holaday
        union all select "2014-01-04" as date_, "1" as is_holaday
        union all select "2014-01-05" as date_, "1" as is_holaday
        union all select "2014-01-06" as date_, "0" as is_holaday
        union all select "2014-01-07" as date_, "0" as is_holaday
        union all select "2014-01-08" as date_, "1" as is_holaday
        union all select "2014-01-09" as date_, "0" as is_holaday
        union all select "2014-01-10" as date_, "0" as is_holaday
        union all select "2014-01-11" as date_, "1" as is_holaday
        union all select "2014-01-12" as date_, "1" as is_holaday
        union all select "2014-01-13" as date_, "0" as is_holaday
        union all select "2014-01-14" as date_, "0" as is_holaday
        union all select "2014-01-15" as date_, "0" as is_holaday
        union all select "2014-01-16" as date_, "0" as is_holaday
        union all select "2014-01-17" as date_, "0" as is_holaday
        union all select "2014-01-18" as date_, "1" as is_holaday
        union all select "2014-01-19" as date_, "1" as is_holaday
        union all select "2014-01-20" as date_, "1" as is_holaday
    )
    select date_
         , is_holaday
         , group_id
           -- partition by is_holaday as well: a holiday run whose group_id happens
           -- to be 0 (such as 2014-01-01) must not be merged with the non-holiday rows
         , if(is_holaday = "0", null, min(date_) over (partition by is_holaday, group_id)) as min_date
         , if(is_holaday = "0", null, max(date_) over (partition by is_holaday, group_id)) as max_date
    from (
        select date_
             , is_holaday
               -- consecutive holidays share the same row_number - rank difference
             , if(is_holaday = "1",
                  row_number() over (order by date_ asc)
                    - rank() over (partition by is_holaday order by date_),
                  0) as group_id
        from a
    ) as x
    order by date_

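The key to this problem is grouping consecutive holidays together; once they are grouped, max and min give the start and end dates of each holiday period.

row_number numbers the rows in date order, producing an increasing sequence; rank then ranks the rows by date separately within each is_holaday value. This gives the result below, where D2 - D1 = 1 and both D1 and D2 are holidays.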

[Image: row_number and rank values for D1 and D2]

It is easy to see that (n - 1) - (k - 1) = n - k and (n - 2) - (k - 2) = n - k, so D1 and D2 fall into the same group.
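The grouping argument can be checked outside SQL. The following Python sketch recomputes row_number and rank by hand for the sample dates and uses their difference as the group key. The helper name `holiday_groups` and the compact `flags` string are illustrative assumptions, not part of the original query.

```python
# is_holaday flags for 2014-01-01 .. 2014-01-20, copied from the sample data.
flags = "10011001001100000111"
rows = [(f"2014-01-{d:02d}", f) for d, f in enumerate(flags, start=1)]


def holiday_groups(rows):
    """Group consecutive holidays by the key row_number - rank.

    rows must be (date_, is_holaday) pairs sorted by date.
    """
    groups = {}
    rank = 0  # rank of the row among holiday rows only
    for row_number, (date_, flag) in enumerate(rows, start=1):
        if flag != "1":
            continue
        rank += 1
        # Inside a run both counters grow by 1, so the difference is constant.
        groups.setdefault(row_number - rank, []).append(date_)
    return [(dates[0], dates[-1]) for dates in groups.values()]


for start, end in holiday_groups(rows):
    print(start, end)
```

Each non-holiday row increases row_number without increasing rank, so the key changes exactly when a run is interrupted.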

Implementation 2

There is also a more cumbersome approach:

    with a as (
        select "2014-01-01" as date_, "1" as is_holaday
        union all select "2014-01-02" as date_, "0" as is_holaday
        union all select "2014-01-03" as date_, "0" as is_holaday
        union all select "2014-01-04" as date_, "1" as is_holaday
        union all select "2014-01-05" as date_, "1" as is_holaday
        union all select "2014-01-06" as date_, "0" as is_holaday
        union all select "2014-01-07" as date_, "0" as is_holaday
        union all select "2014-01-08" as date_, "1" as is_holaday
        union all select "2014-01-09" as date_, "0" as is_holaday
        union all select "2014-01-10" as date_, "0" as is_holaday
        union all select "2014-01-11" as date_, "1" as is_holaday
        union all select "2014-01-12" as date_, "1" as is_holaday
        union all select "2014-01-13" as date_, "0" as is_holaday
        union all select "2014-01-14" as date_, "0" as is_holaday
        union all select "2014-01-15" as date_, "0" as is_holaday
        union all select "2014-01-16" as date_, "0" as is_holaday
        union all select "2014-01-17" as date_, "0" as is_holaday
        union all select "2014-01-18" as date_, "1" as is_holaday
        union all select "2014-01-19" as date_, "1" as is_holaday
        union all select "2014-01-20" as date_, "1" as is_holaday
    ),
    bb as (
        -- flag the first and last day of every holiday run
        select date_
             , is_holiday
             , if(is_holiday = "1" and (last_holiday is null or last_holiday = "0"), 1, 0) as start_holiday
             , if(is_holiday = "1" and (next_holiday is null or next_holiday = "0"), 1, 0) as end_holiday
        from (
            select date_
                 , is_holaday as is_holiday
                 , lag(is_holaday) over (order by date_) as last_holiday
                 , lead(is_holaday) over (order by date_) as next_holiday
            from a
        ) as aa
    )
    select date_
         , is_holiday
         , start_date
         , if(is_holiday = "0", "", end_date) as end_date
    from (
        select ee.date_
             , ee.is_holiday
             , ee.start_date
             , dd.date_ as end_date
               -- order by dd.date_ so that index_ = 1 is the nearest end date;
               -- without it the chosen row is arbitrary
             , row_number() over (partition by ee.date_ order by dd.date_) as index_
        from (
            select date_
                 , is_holiday
                 , if(is_holiday = "0", "", start_date) as start_date
            from (
                -- for each day, the latest run start on or before it
                select bb.date_
                     , cc.date_ as start_date
                     , bb.is_holiday
                     , row_number() over (partition by bb.date_ order by cc.date_ desc) as index
                from bb
                cross join (select * from bb where start_holiday = 1) as cc
                where bb.date_ >= cc.date_
            ) as s
            where index = 1
        ) as ee
        cross join (select * from bb where end_holiday = 1) as dd
        where ee.date_ <= dd.date_
    ) as t
    where index_ = 1
    order by date_
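The lag/lead idea behind this query can be restated in a few lines of Python: a holiday is a run start when the previous row is not a holiday, and a run end when the next row is not a holiday. This is only a sketch on the first seven sample days; the function name `holiday_ranges` is an assumption made for illustration.

```python
# First seven days of the sample data: (date_, is_holiday), sorted by date.
rows = [
    ("2014-01-01", "1"), ("2014-01-02", "0"), ("2014-01-03", "0"),
    ("2014-01-04", "1"), ("2014-01-05", "1"), ("2014-01-06", "0"),
    ("2014-01-07", "0"),
]


def holiday_ranges(rows):
    """Return (start, end) for each holiday run by comparing neighbouring rows."""
    ranges = []
    start = None
    for i, (date_, flag) in enumerate(rows):
        prev_flag = rows[i - 1][1] if i > 0 else None              # lag
        next_flag = rows[i + 1][1] if i + 1 < len(rows) else None  # lead
        if flag == "1" and prev_flag != "1":
            start = date_                  # start_holiday = 1
        if flag == "1" and next_flag != "1":
            ranges.append((start, date_))  # end_holiday = 1
    return ranges


print(holiday_ranges(rows))
# [('2014-01-01', '2014-01-01'), ('2014-01-04', '2014-01-05')]
```

A single ordered pass pairs each start with its end directly, which is what the two cross joins in the SQL version reconstruct after the fact.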

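Implementation 3

Implementation 3 raises the difficulty. Given the data below, we need the start and end dates of each run of consecutive days, and we only want runs that last three days or more.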

    select *
    from (
        select "A" as shop, "2017-10-11" as day, 300 as amt
        union all select "A" as shop, "2017-10-12" as day, 200 as amt
        union all select "B" as shop, "2017-10-11" as day, 400 as amt
        union all select "B" as shop, "2017-10-12" as day, 200 as amt
        union all select "A" as shop, "2017-10-13" as day, 100 as amt
        union all select "A" as shop, "2017-10-15" as day, 100 as amt
        union all select "C" as shop, "2017-10-11" as day, 350 as amt
        union all select "C" as shop, "2017-10-15" as day, 400 as amt
        union all select "C" as shop, "2017-10-16" as day, 200 as amt
        union all select "D" as shop, "2017-10-13" as day, 500 as amt
        union all select "E" as shop, "2017-10-14" as day, 600 as amt
        union all select "E" as shop, "2017-10-15" as day, 500 as amt
        union all select "D" as shop, "2017-10-14" as day, 600 as amt
        union all select "B" as shop, "2017-10-13" as day, 300 as amt
        union all select "C" as shop, "2017-10-17" as day, 100 as amt
        union all select "G" as shop, "2017-10-31" as day, 100 as amt
        union all select "G" as shop, "2017-11-01" as day, 100 as amt
        union all select "G" as shop, "2017-11-02" as day, 100 as amt
    ) as t
    order by shop, day desc

The solution:

    select *
           -- partition by the run key (plus) as well, so a shop with several
           -- qualifying runs gets the right dates for each of them
         , first_value(day) over (partition by shop, plus order by day) as first_day
         , first_value(day) over (partition by shop, plus order by day desc) as last_day
    from (
        select *
             , count(1) over (partition by shop, plus) as coutinues_plus
        from (
            select *
                   -- This is the key trick, and it works much better than
                   -- row_number() - rank(): it still works when the dates
                   -- themselves have gaps.
                 , date_diff("day", date("2017-01-01"), date(day))
                   + row_number() over (partition by shop order by day desc) as plus
            from (
                select "A" as shop, "2017-10-11" as day, 300 as amt
                union all select "A" as shop, "2017-10-12" as day, 200 as amt
                union all select "B" as shop, "2017-10-11" as day, 400 as amt
                union all select "B" as shop, "2017-10-12" as day, 200 as amt
                union all select "A" as shop, "2017-10-13" as day, 100 as amt
                union all select "A" as shop, "2017-10-15" as day, 100 as amt
                union all select "C" as shop, "2017-10-11" as day, 350 as amt
                union all select "C" as shop, "2017-10-15" as day, 400 as amt
                union all select "C" as shop, "2017-10-16" as day, 200 as amt
                union all select "D" as shop, "2017-10-13" as day, 500 as amt
                union all select "E" as shop, "2017-10-14" as day, 600 as amt
                union all select "E" as shop, "2017-10-15" as day, 500 as amt
                union all select "D" as shop, "2017-10-14" as day, 600 as amt
                union all select "B" as shop, "2017-10-13" as day, 300 as amt
                union all select "C" as shop, "2017-10-17" as day, 100 as amt
                union all select "G" as shop, "2017-10-31" as day, 100 as amt
                union all select "G" as shop, "2017-11-01" as day, 100 as amt
                union all select "G" as shop, "2017-11-02" as day, 100 as amt
            ) as src
        ) as keyed
    ) as counted
    where coutinues_plus >= 3
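To see why the day offset plus a descending row number works as a run key, here is a rough Python equivalent on the same shop data. It assumes each shop has at most one record per day; the function name `long_runs` is hypothetical, and `date.toordinal()` stands in for the date_diff against a fixed reference date.

```python
from collections import defaultdict
from datetime import date

# (shop, day) pairs copied from the sample data; amt is not needed here.
records = [
    ("A", "2017-10-11"), ("A", "2017-10-12"), ("B", "2017-10-11"),
    ("B", "2017-10-12"), ("A", "2017-10-13"), ("A", "2017-10-15"),
    ("C", "2017-10-11"), ("C", "2017-10-15"), ("C", "2017-10-16"),
    ("D", "2017-10-13"), ("E", "2017-10-14"), ("E", "2017-10-15"),
    ("D", "2017-10-14"), ("B", "2017-10-13"), ("C", "2017-10-17"),
    ("G", "2017-10-31"), ("G", "2017-11-01"), ("G", "2017-11-02"),
]


def long_runs(records, min_days=3):
    """Return (shop, first_day, last_day) for runs of at least min_days days."""
    days_by_shop = defaultdict(list)
    for shop, day in records:
        days_by_shop[shop].append(date.fromisoformat(day))

    result = []
    for shop, days in days_by_shop.items():
        days.sort(reverse=True)  # same as "order by day desc"
        runs = defaultdict(list)
        for row_number, day in enumerate(days, start=1):
            # Going back one day lowers the day number by 1 and raises the
            # row number by 1, so the sum is constant inside a run.
            runs[day.toordinal() + row_number].append(day)
        for run in runs.values():
            if len(run) >= min_days:
                result.append((shop, min(run).isoformat(), max(run).isoformat()))
    return sorted(result)


for row in long_runs(records):
    print(row)
```

A gap of more than one day lowers the sum, so two separate runs of the same shop never share a key.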

When only start and end times are available

    select user_name
         , time_type
         , lag(time_type, 1, 0) over (partition by user_name order by ts) as pre_time_type
         , lead(time_type, 1, 1) over (partition by user_name order by ts) as next_time_type
    from (
        select 1 as user_name, 1 as time_type, 123 as ts
        union all select 1 as user_name, 1 as time_type, 126 as ts
        union all select 1 as user_name, 0 as time_type, 166 as ts
        union all select 1 as user_name, 0 as time_type, 167 as ts
    ) as a

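Here user_name is the employee ID, time_type marks a record as a start or an end, and ts is the timestamp.

lag(field, interval, default_expression) returns the value of field from the record interval rows above the current record, falling back to default_expression when no such record exists; lead(field, interval, default_value) does exactly the opposite. How do you tell the direction of lag and lead apart? As the figure below shows, reading from top to bottom, downward is lead (ahead) and upward is lag (behind).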

[Image: direction of lag and lead relative to the current row]

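To pick out the first record in a run of consecutive start times, use lag to check that the previous record is an end marker. Conversely, to get the first record of each run of end times, require that the current time_type is an end marker and the previous one is a start marker.

To find the last record of a run, use lead in the same way.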
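The filter described here can be sketched in Python on the four sample records. The sketch assumes that time_type 1 means start and 0 means end, which is inferred from the lag/lead defaults in the query and is not stated in the article; the function name `first_start_last_end` is likewise hypothetical.

```python
# (time_type, ts) for one user, sorted by ts; assumed: 1 = start, 0 = end.
events = [(1, 123), (1, 126), (0, 166), (0, 167)]


def first_start_last_end(events):
    """Return the ts of the first start and the last end of every run."""
    first_starts, last_ends = [], []
    for i, (time_type, ts) in enumerate(events):
        # lag(time_type, 1, 0): previous row, defaulting to "end"
        prev_type = events[i - 1][0] if i > 0 else 0
        # lead(time_type, 1, 1): next row, defaulting to "start"
        next_type = events[i + 1][0] if i + 1 < len(events) else 1
        if time_type == 1 and prev_type == 0:
            first_starts.append(ts)  # a start preceded by an end
        if time_type == 0 and next_type == 1:
            last_ends.append(ts)     # an end followed by a start
    return first_starts, last_ends


print(first_start_last_end(events))  # ([123], [167])
```

The defaults matter at the edges: treating the missing previous row as an end and the missing next row as a start lets the very first start and the very last end pass the filter.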

End.
爱数据网 column author: wang-possible
About the author: six years of experience in retail big data, continually sharpening skills.
CSDN profile: bluedraam_pp
