presto- top N、前 N 天销售额计算

wang-possible
wang-possible
wang-possible
9
文章
0
评论
2020-09-2213:09:00 评论 1,453 12118字
摘要

场景描述:(1)统计前 N 天的销售额的平均值;(2)统计 TOP N 商品;(3)随机分组。

场景描述

  • 统计前 N 天的销售额的平均值。
  • 统计 TOP N 商品
  • 随机分组

统计前 N 天的销售额的平均值

基础知识

我们本次使用到的 sql 都是在 presto 上跑的,如果想在 hive 或者 其他平台上跑的话,请自行将 sql 转成对应的 sql 。

首先看一下,array_agg() over() 函数。

select leader , employee , array_agg(employee) over(partition by leader order by employee)  from (          select "A" as leader , "E1" as employeeunion all select "A" as leader , "E2" as employeeunion all select "A" as leader , "E3" as employeeunion all select "B" as leader , "E4" as employeeunion all select "B" as leader , "E5" as employeeunion all select "B" as leader , "E6" as employee  ) as a 

计算结果如下所示:

presto- top N、前 N 天销售额计算

接下来是几个简单的函数:

  • reverse,按照字典顺序或者数字的顺序对数字排序
  • slice:取出数组的子集,例如, slice(array,start_index, end_index)

为了更加形象的说明问题,我们使用的下面的例子进行说明。

select sale_date , shop_id , array_agg(ARRAY[sale_date , shop_id , cast(sale_amt as varchar)]) over(partition by shop_id order by sale_date)  from (            select "2019-06-06" as sale_date , "shop1" as shop_id , 10 as sale_amt  union all select "2019-06-05" as sale_date , "shop1" as shop_id , 14 as sale_amt  union all select "2019-06-03" as sale_date , "shop1" as shop_id , 17 as sale_amt  union all select "2019-06-02" as sale_date , "shop1" as shop_id , 18 as sale_amt   union all select "2019-06-01" as sale_date , "shop1" as shop_id , 13 as sale_amt   union all            select "2019-06-06" as sale_date , "shop2" as shop_id , 11 as sale_amt  union all select "2019-06-05" as sale_date , "shop2" as shop_id , 15 as sale_amt  union all select "2019-06-03" as sale_date , "shop2" as shop_id , 18 as sale_amt  union all select "2019-06-02" as sale_date , "shop2" as shop_id , 19 as sale_amt   union all select "2019-06-01" as sale_date , "shop2" as shop_id , 16 as sale_amt  ) as a

上面这段 sql 的结果如下所示:

presto- top N、前 N 天销售额计算

请注意观察数组里面的数字。我们可以观察到已经是排好序了。但是我们需要的是倒序。这是个问题。

另外,我只想取出前 3 天的数据汇总,所以我们需要取 array 的子集。这是个问题。

然后我们还会遇到一个问题,如果日期不连续的话,例如,shop2 在 2019-06-04 没有数据,所以我们需要对数组里面的元素进行过滤。

过滤之后我们需要单独把数组里的销售额拿出来相加。

排序我们可以使用 reverse 函数,slice 函数取数组的子集,filter(array, lamabda) 过滤出我们需要的元素,然后使用 reduce 把销售额加起来。

这里使用到的 reduce 在官网上没有讲到,reduce(array,cast(ROW(0,0.00) as ROW(cnt int,amt doubel)), (x,y) ->cast( as ROW(cnt int , amt double)) , s -> s )

这些都是在数组上的操作详细的可以查看:https://prestodb.github.io/docs/current/functions/array.html

这里先弄清楚 filter 和 reduce 的功能特性。

select cast(reduce(filter(ARRAY[1,2,3,4,5],x -> x >= 4),cast(ROW(0) as ROW(cnt int)), (x,y) ->cast( ROW(y + x.cnt) as ROW(cnt int ) ), x -> x ) as json) as rs

其中, filter(array,filterFunction(x)): 第一个参数的是要操作的数组,第二个是 filter 的条件,这里是一个正则表达式。

reduce(array,initialState , inputFunction(S, T, S), outputFunction(S, R))

最终的结果

在了解完 array 的几个函数以后,我可以得到如下所示的结果

最终的结果为:

select sale_date      , shop_id      , sale_pre3     , json_array_get(cast(reduce(filter(array_amt,x -> x[1] >= sale_pre3),cast(ROW(0) as ROW(cnt int)), (x,y) ->cast( ROW(cast(y[3] as integer) + x.cnt) as ROW(cnt int ) ), x -> x ) as json),0) as rs  from (select sale_date      , shop_id      , cast(date_add("day",-2,date(sale_date)) as varchar) as sale_pre3     , array_agg(ARRAY[sale_date , shop_id , cast(sale_amt as varchar)]) over(partition by shop_id order by sale_date) as array_amt  from (            select "2019-06-06" as sale_date , "shop1" as shop_id , 10 as sale_amt  union all select "2019-06-05" as sale_date , "shop1" as shop_id , 14 as sale_amt  union all select "2019-06-03" as sale_date , "shop1" as shop_id , 17 as sale_amt  union all select "2019-06-02" as sale_date , "shop1" as shop_id , 18 as sale_amt   union all select "2019-06-01" as sale_date , "shop1" as shop_id , 13 as sale_amt   union all            select "2019-06-06" as sale_date , "shop2" as shop_id , 11 as sale_amt  union all select "2019-06-05" as sale_date , "shop2" as shop_id , 15 as sale_amt  union all select "2019-06-03" as sale_date , "shop2" as shop_id , 18 as sale_amt  union all select "2019-06-02" as sale_date , "shop2" as shop_id , 19 as sale_amt   union all select "2019-06-01" as sale_date , "shop2" as shop_id , 16 as sale_amt  ) as a ) as b  

其中,json_array_get(json_str,index) 函数是将 json_str 转成 json_object,然后取出第一个数组。

其实我们也可以使用 lag 或者 leader 函数来实现这个功能。最终的解决结果如下所示:

select   sale_date        , shop_id       , (          sale_amt          +if(_last_3_day < last_1_day,last_1_amt,0)          +if(_last_3_day < last_2_day,last_1_amt,0)          +if(_last_3_day < last_3_day,last_1_amt,0)                    ) as last_3_amt_sum  from (    select sale_date          , shop_id          , sale_amt         , cast(date(date_add("day" , cast(-3 as bigint) , cast (sale_date as timestamp))) as varchar) as _last_3_day         , lag(sale_amt,1) over(partition by shop_id order by sale_date) as last_1_amt         , lag(sale_date,1) over(partition by shop_id order by sale_date) as last_1_day         , lag(sale_amt,2) over(partition by shop_id order by sale_date) as last_2_amt         , lag(sale_date,2) over(partition by shop_id order by sale_date) as last_2_day         , lag(sale_amt,3) over(partition by shop_id order by sale_date) as last_3_amt         , lag(sale_date,3) over(partition by shop_id order by sale_date) as last_3_day      from (                select "2019-06-06" as sale_date , "shop1" as shop_id , 10 as sale_amt          union all select "2019-06-05" as sale_date , "shop1" as shop_id , 14 as sale_amt          union all select "2019-06-03" as sale_date , "shop1" as shop_id , 17 as sale_amt          union all select "2019-06-02" as sale_date , "shop1" as shop_id , 18 as sale_amt           union all select "2019-06-01" as sale_date , "shop1" as shop_id , 13 as sale_amt           union all                    select "2019-06-06" as sale_date , "shop2" as shop_id , 11 as sale_amt          union all select "2019-06-05" as sale_date , "shop2" as shop_id , 15 as sale_amt          union all select "2019-06-03" as sale_date , "shop2" as shop_id , 18 as sale_amt          union all select "2019-06-02" as sale_date , "shop2" as shop_id , 19 as sale_amt           union all select "2019-06-01" as sale_date , "shop2" as shop_id , 16 as sale_amt      ) as a) as aa 

动态 TOP N

先说动态 TOP N 问题之前我们先了认识一下 TOP N 。

TOP N 是什么呢?让我们先闪回到学生时代,那时在期末考试结束后校长说学生们辛苦了,要有奖励,于是就有了奖励。教务处的老师们经过一番思考之后,制定了一系列的奖励政策,其中之一是每个班级总成绩前 10 名的学生发一本《三体》回去研读。如果你是教务处的老师,应该取出这批学生呢? 其实有点 SQL 基础的人都应该有一些思路了。如下所示:

select stu_name      , class      , score  from (    select  stu_name          , class          , score          , row_number() over(partition by class order by score desc) as student_ranking      from (                      select "a" as stu_name, "C1" as class 213 as score            union all select "b" as stu_name, "C1" as class 223 as score            union all select "c" as stu_name, "C1" as class 223 as score            union all select "d" as stu_name, "C1" as class 223 as score            union all select "e" as stu_name, "C1" as class 223 as score                union all select "g" as stu_name, "C2" as class 213 as score            union all select "h" as stu_name, "C2" as class 223 as score            union all select "m" as stu_name, "C2" as class 223 as score            union all select "n" as stu_name, "C2" as class 223 as score            union all select "i" as stu_name, "C2" as class 223 as score    ) as school_report_card) as school_report_card1where student_ranking <= 10

但是呢,有的班主任提出了不同的观点,有的我们人多,只奖励 10 个人太少了,要按照每个班级人数的 10% 奖励。好像还挺有道理的。

那么这有改怎么写呢?

借用周星驰的经典台词"你让我打狗,总得给我一个棒子啊",我们首先要知道每个班的总人数。

          select "C1" as class , 40 as populationunion all select "C2" as class , 50 as population

那么我们的 SQL 重新写成:

select stu_name      , class      , score      , student_ranking  from (    select  stu_name          , class          , score          , row_number() over(partition by class order by score desc) as student_ranking      from (                      select "a" as stu_name, "C1" as class ,213 as score            union all select "b" as stu_name, "C1" as class ,223 as score            union all select "c" as stu_name, "C1" as class ,223 as score            union all select "d" as stu_name, "C1" as class ,223 as score            union all select "e" as stu_name, "C1" as class ,223 as score                union all select "g" as stu_name, "C2" as class ,213 as score            union all select "h" as stu_name, "C2" as class ,223 as score            union all select "m" as stu_name, "C2" as class ,223 as score            union all select "n" as stu_name, "C2" as class ,223 as score            union all select "i" as stu_name, "C2" as class ,223 as score          ) as school_report_card   ) as school_report_card1left join (          select "C1" as class , 40 as populationunion all select "C2" as class , 50 as population) as class_populationon school_report_card1.class = class_population.classwhere student_ranking <= population*0.1

上面使用到的数组,那么我们能不能也使用数组呢?答案是肯定的.

不知道你使用过没有 clickhouse ,下面来结束一下如何使用 clickhouse 做动态 TOP N 。

clickhouse 没有像 presto 那样开窗函数,所以只能使用数组进行处理了。

首先,来看看关于数组的几个基本的函数

select  class      , groupArray(4)([stu_name , cast(score as String)])  from (        select "a" as stu_name, "C1" as class ,213 as scoreunion all select "b" as stu_name, "C1" as class ,223 as scoreunion all select "c" as stu_name, "C1" as class ,212 as scoreunion all select "d" as stu_name, "C1" as class ,877 as scoreunion all select "e" as stu_name, "C1" as class ,665 as scoreunion all select "g" as stu_name, "C2" as class ,6564 as scoreunion all select "h" as stu_name, "C2" as class ,444 as scoreunion all select "m" as stu_name, "C2" as class ,111 as scoreunion all select "n" as stu_name, "C2" as class ,222 as scoreunion all select "i" as stu_name, "C2" as class ,333 as score  )group by class 

结果如下所示:

presto- top N、前 N 天销售额计算

由上面的结果可以看到 groupArray(record element) 的功能是和 presto 的 array_agg() 的

功能差不多。就是把分组内的每个元素汇总放到一个数组里面。

接下来看一下,看下一个重要的函数:arrayResize 。

select class , arrayResize(a.array_group , b.population) as resized_array_group  from (    select  class          , groupArray(4)([stu_name , cast(score as String)]) as array_group      from (            select "a" as stu_name, "C1" as class ,213 as score    union all select "b" as stu_name, "C1" as class ,223 as score    union all select "c" as stu_name, "C1" as class ,212 as score    union all select "d" as stu_name, "C1" as class ,877 as score    union all select "e" as stu_name, "C1" as class ,665 as score        union all select "g" as stu_name, "C2" as class ,6564 as score    union all select "h" as stu_name, "C2" as class ,444 as score    union all select "m" as stu_name, "C2" as class ,111 as score    union all select "n" as stu_name, "C2" as class ,222 as score    union all select "i" as stu_name, "C2" as class ,333 as score      )    group by class ) as a allleft join (          select "C1" as class , 4 as populationunion all select "C2" as class , 3 as population) as b   on a.class = b.class

结果为:

presto- top N、前 N 天销售额计算

由例子我们知道 resize(array,new_size) 的功能就是重新设置数组的大小,如果 new_size > old_size 那么可以 ‘’ 或者 0 填充数组。

如果 new_size < old_size 则会将尾部的几个元素删除掉。

然后是 arrayJoin 此函数和 arrayConcat 看起来一样,但是其实差别特别的大。

arrayJoin 是将数组中的元素 split 放到列上,功能有点类似 hive 的 LATERAL VIEW 函数。

还有一个函数 indexOf(element,array),这个函数是将 element 在 array 中的位置返回,这样我们就可以得到成绩的排名。

select aa.class      ,arrayConcat(joined_array ,[toString(element_index)] )  from (        select   class                , arrayJoin(arrayResize(a.array_group , b.population)) as joined_array               , indexOf(arrayResize(a.array_group , b.population) ,arrayJoin(arrayResize(a.array_group , b.population))) as element_index          from (            select  class                  , groupArray(4)([stu_name , cast(score as String)]) as array_group              from (                select *                   from (                        select "a" as stu_name, "C1" as class ,213 as score                union all select "b" as stu_name, "C1" as class ,223 as score                union all select "c" as stu_name, "C1" as class ,212 as score                union all select "d" as stu_name, "C1" as class ,877 as score                union all select "e" as stu_name, "C1" as class ,665 as score                                union all select "g" as stu_name, "C2" as class ,6564 as score                union all select "h" as stu_name, "C2" as class ,444 as score                union all select "m" as stu_name, "C2" as class ,111 as score                union all select "n" as stu_name, "C2" as class ,222 as score                union all select "i" as stu_name, "C2" as class ,333 as score                  ) as aa                order by  class , score desc             )            group by class         ) as a all        left join (                  select "C1" as class , 4 as population        union all select "C2" as class , 3 as population        ) as b           on a.class = b.class) as aa  

presto- top N、前 N 天销售额计算

后面就好说吧,使用 arrayJoin 展开使用数组的下标展示出不同的字段就可以了。

所以最后的结果是:

select  class       ,arry[1] as stu_name       ,arry[2] as score       ,arry[3] as student_ranking        from (    select aa.class          ,arrayConcat(joined_array ,[toString(element_index)] ) as arry      from (    select   class            , arrayJoin(arrayResize(a.array_group , b.population)) as joined_array           , indexOf(arrayResize(a.array_group , b.population) ,arrayJoin(arrayResize(a.array_group , b.population))) as element_index      from (        select  class              , groupArray(4)([stu_name , cast(score as String)]) as array_group          from (            select *               from (                    select "a" as stu_name, "C1" as class ,213 as score            union all select "b" as stu_name, "C1" as class ,223 as score            union all select "c" as stu_name, "C1" as class ,212 as score            union all select "d" as stu_name, "C1" as class ,877 as score            union all select "e" as stu_name, "C1" as class ,665 as score                        union all select "g" as stu_name, "C2" as class ,6564 as score            union all select "h" as stu_name, "C2" as class ,444 as score            union all select "m" as stu_name, "C2" as class ,111 as score            union all select "n" as stu_name, "C2" as class ,222 as score            union all select "i" as stu_name, "C2" as class ,333 as score              ) as aa            order by  class , score desc         )        group by class     ) as a all    left join (              select "C1" as class , 4 as population    union all select "C2" as class , 3 as population    ) as b       on a.class = b.class    ) as aa  ) as aaa     

presto- top N、前 N 天销售额计算

presto- top N、前 N 天销售额计算

随机分组

随机分组的问题是这样的,如果我们 100 名学生成绩都差不多,那么我们可以随机的分到两个班级里面去。

select stu_name       ,case when group_index = 1              then "C1"             when group_index = 2             then "C2"             else "none" end as class  from (  select stu_name      ,(index + max_index/2 -1)/(max_index/2) as group_index  from (select stu_name      ,index      ,(max(index) over()) as max_index  from (    select *         from (    select  stu_name           ,row_number() over() as index      from (        select "a" as stu_name        union all select "b" as stu_name        union all select "c" as stu_name        union all select "d" as stu_name        union all select "e" as stu_name        union all select "f" as stu_name              )     ) as a     ) as aa   ) as aaa  ) as aaaa

需要注意的是我们这个例子对于总人数是偶数的时候是有效的,但是奇数的时候就会有问题,那些人特殊处理一下就可以了

例如,

a、b、c、d、e做取商操作的时候,得到 (1 + 2 -1)/2 = 1 、 (2 + 2 -1)/2 = 1、 (2 + 3 -1)/2 = 2、 (2 + 4 -1)/2 = 2、 (2 + 5 -1)/2 = 3我们发现,5 会分到 3 里面,所以如果总人数是奇数的时候,需要对余出来的几行记录做一下特殊处理。

还可以使用取模的办法来分组。请看下面的公式。

group=index%goup_size

% 是取于操作,举个例子:

1,2,3,4,5,6,7,8,9,102 取模后:1,0,1,0,1,0,1,0,1,03 取模后:1,2,0,1,2,0,1,2,0,1

所以 sql 可以改造成:

select stu_name       ,index      ,index%3      ,substring("ABC",1+index%3,1) as group_index  from(    select  stu_name           ,row_number() over() as index      from (        select "a" as stu_name        union all select "b" as stu_name        union all select "c" as stu_name        union all select "d" as stu_name        union all select "e" as stu_name        union all select "f" as stu_name              ) ) order by index      

结果:

presto- top N、前 N 天销售额计算

如何形成周期性的序列

presto- top N、前 N 天销售额计算

其中,interval = 7 ,每隔 7 个数字的分成了一个组。

select (index + interval -1 )/interval      , index  from (    select dim_date_id           ,max(index) over() as max_index           ,index      from (        select dim_date_id               ,row_number() over() as index           from dim.dim_date         where dim_date_id between "20190701" and "20190721"    ))  

End.爱数据网专栏作者:wang-possible作者介绍:6年零售大数据工作经验,技能持续精进CSDN个人主页:bluedraam_pp

  • 我的微信公众号
  • 微信扫一扫
  • weinxin
  • 我的微信公众号
  • 微信扫一扫
  • weinxin
匿名

发表评论

匿名网友 填写信息

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: