Pinduoduo Interview Notes: 24 Data Analyst Interview Questions with Answers and Explanations

挖数网 Picks


2020-05-11 02:05:00

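Summary

Interviewing for a data analyst position can be daunting when you don't know what questions to expect. This post shares the interview questions for Pinduoduo's data analyst position, together with answers, for your reference.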

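I. State Bayes' formula and explain a use case

P(A|B) = P(B|A) * P(A) / P(B)

Take search query correction as an example. Let A be the correct word and B the word that was typed. Then:

P(A|B) is the probability that the typed word B was actually meant to be A
P(B|A) is the probability that word A is mistyped as B, which can be estimated from the similarity between A and B (for example, edit distance)
P(A) is the frequency of word A, obtained from statistics
P(B) is the same for every candidate A, so it can be dropped when ranking candidates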
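
The ranking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word counts are made up, and P(B|A) is modeled as `error_rate ** edit_distance`, which is one simple choice rather than anything the original answer prescribes.

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def correct(typed: str, word_counts: Counter, error_rate: float = 0.1) -> str:
    """Return the candidate A that maximizes P(B|A) * P(A); P(B) is dropped."""
    total = sum(word_counts.values())

    def score(candidate: str) -> float:
        prior = word_counts[candidate] / total                       # P(A)
        likelihood = error_rate ** edit_distance(candidate, typed)   # assumed P(B|A)
        return likelihood * prior

    return max(word_counts, key=score)

# Hypothetical corpus counts, made up for illustration.
counts = Counter({"apple": 50, "apply": 30, "ample": 5, "maple": 15})
print(correct("appel", counts))
```

Here "apple" and "apply" are equally close to the typo, so the prior P(A) decides between them.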

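II. How to write SQL for the median, mean, and mode (using methods other than count)

1. Median

Option 1 (does not handle an even number of rows). Here tbl and col stand in for the table and column names, since table and column are reserved words in MySQL. LIMIT does not accept a user variable directly, so a prepared statement is used:

SET @m = (SELECT COUNT(*) DIV 2 FROM tbl);
PREPARE stmt FROM 'SELECT col FROM tbl ORDER BY col LIMIT ?, 1';
EXECUTE stmt USING @m;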

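Option 2 (handles an even number of rows, where the median is the average of the two middle values; tbl and col are placeholder names, and idx avoids the reserved word index):

SET @idx = -1;
SELECT AVG(t.col)
FROM (SELECT @idx := @idx + 1 AS idx, col
      FROM tbl ORDER BY col) AS t
WHERE t.idx IN (FLOOR(@idx / 2), CEILING(@idx / 2));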

2. Mean

SELECT AVG(col) FROM tbl;

3. Mode

SELECT col, COUNT(*) AS cnt FROM tbl GROUP BY col ORDER BY cnt DESC LIMIT 1;
(admittedly, this one does use count)
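
To check the index arithmetic behind the median query, here is a small Python sketch of the same three statistics. It is not part of the original answer; the sample list is made up, and `last` plays the role that the row counter has after the scan in Option 2.

```python
import math
from collections import Counter

def median(values):
    """Average the elements at floor((n-1)/2) and ceil((n-1)/2) of the sorted list."""
    ordered = sorted(values)
    last = len(ordered) - 1  # index of the final row, like the counter after the scan
    lo, hi = math.floor(last / 2), math.ceil(last / 2)
    return (ordered[lo] + ordered[hi]) / 2

def mean(values):
    return sum(values) / len(values)

def mode(values):
    """Most frequent value; ties are resolved by first occurrence."""
    return Counter(values).most_common(1)[0][0]

data = [3, 1, 4, 1, 5, 9]  # made-up sample
print(median(data), mean(data), mode(data))
```

For an odd number of rows the two indices coincide, so the same expression covers both cases.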

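III. How to avoid overfitting in decision trees

Limit the tree depth
Prune the tree
Limit the number of leaf nodes
Add a regularization term
Add more data
Bagging (subsampling rows, subsampling features, projection into a lower-dimensional space)
Data augmentation (adding noisy data)
Early stopping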

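IV. Your understanding of naive Bayes

Understanding: naive Bayes is a method for inferring the cause from the effect, given some known prior probabilities
Also: "naive" refers to the assumption that the features are independent of one another given the class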

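V. Advantages of SVM

Advantages:

Can be applied to data that is not linearly separable
The final classifier is determined by the support vectors, so its complexity depends on the number of support vectors rather than on the dimensionality of the sample space, which helps avoid the curse of dimensionality
Robust: only a small number of support vectors are used, which captures the key samples and discards redundant ones
Performs well with high-dimensional data and few samples, as in text classification

Disadvantages:

High training complexity
Not naturally suited to multi-class problems
No well-established methodology for choosing the kernel function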

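VI. How K-means works

1. Initialize k points as the cluster centers
2. Assign each point to one of the k clusters according to its distance to the centers
3. Update the centers of the k clusters
4. Repeat steps 2 and 3 until convergence or until the maximum number of iterations is reached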
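
The loop above can be written out as a minimal pure-Python sketch. The function name, the seed parameter, and the choice of squared Euclidean distance with random initial centers are assumptions made for illustration.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on tuples of numbers, following the steps listed above."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                       # initialize k centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # assign to nearest center
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        new_centers = [                                   # recompute the centers
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:                        # converged
            break
        centers = new_centers
    return centers, clusters

# Made-up sample with two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```

An empty cluster keeps its previous center here, which is one of several common conventions.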

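VII. Answer a SQL question orally (it requires a row number)

Setting up a row number in MySQL (tbl is a placeholder table name):

SET @row_number = 0; SELECT (@row_number := @row_number + 1) AS num FROM tbl;

In MySQL 8.0 and later, the window function ROW_NUMBER() OVER (ORDER BY ...) gives the same result directly.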

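VIII. Business scenario: how to analyze a drop in next-day retention

With business questions, the key is to ask the right question first, and only then to break the problem down and solve it.

1. Two-layer model

Segment by user profile, channel, product, behavior stage, and so on, to pin down exactly where the next-day retention rate dropped

2. Metric decomposition

Next-day retention rate = Σ (users retained on the next day, per segment) / number of users acquired on the day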
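
As a rough illustration of the decomposition, the sketch below computes the overall rate together with a per-segment breakdown. The function name, the input shapes, and the channel labels are assumptions for the example, not something given in the original answer.

```python
from collections import defaultdict

def next_day_retention(new_users, next_day_active):
    """new_users: {user_id: segment} for one acquisition day;
    next_day_active: set of user_ids seen on the following day.
    Returns (overall_rate, {segment: rate})."""
    acquired = defaultdict(int)
    retained = defaultdict(int)
    for uid, segment in new_users.items():
        acquired[segment] += 1
        if uid in next_day_active:
            retained[segment] += 1
    overall = sum(retained.values()) / sum(acquired.values())
    by_segment = {s: retained[s] / acquired[s] for s in acquired}
    return overall, by_segment

# Made-up data: five new users from two hypothetical channels.
new_users = {1: "ads", 2: "ads", 3: "organic", 4: "organic", 5: "organic"}
overall, by_segment = next_day_retention(new_users, next_day_active={1, 3, 4})
print(overall, by_segment)
```

Comparing the per-segment rates across days shows which segment is responsible for a drop in the overall number.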

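3. Cause analysis

Internal:

Operations campaigns
Product changes
Technical failures
Design loopholes (for example, a design that can be exploited for freebies)

External:

Competitors
User preferences
Holidays
Social events (for example, ones that stir up public opinion)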

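IX. What is your general approach to handling a request? Give an example

Clarify the request: what is the requester trying to achieve
Break it down into tasks
Draw up an executable plan
Drive it forward
Review and sign off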

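X. How Hadoop and MapReduce work

1. How Hadoop works

Files are stored in a distributed fashion in HDFS, and computation is split up with MapReduce; the other components are omitted here

2. How MapReduce works

Map phase: read the files from HDFS, parse them into <k,v> pairs, partition the pairs (one partition by default), and collect the values that share the same k into one group
Reduce phase: copy the map output to the different reduce nodes, where it is merged and sorted, and the reduce function is then applied to the values of each key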
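
A single-process word count can mimic the two phases. This is only a toy simulation of the data flow; the function names are made up and nothing here uses the actual Hadoop API.

```python
from collections import defaultdict

def map_phase(lines):
    """Parse each input line into <word, 1> pairs."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs, num_partitions=1):
    """Partition by key and group the values that share a key."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions

def reduce_phase(partition):
    """Sort the keys of one partition and sum the values of each key."""
    return {key: sum(values) for key, values in sorted(partition.items())}

lines = ["a b a", "b c"]  # made-up input standing in for an HDFS file
result = {}
for part in shuffle(map_phase(lines), num_partitions=2):
    result.update(reduce_phase(part))
print(result)
```

Because every occurrence of a key lands in the same partition, each reducer can finish its keys without talking to the others.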

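XI. A database table Tourists records the number of visitors to a scenic spot on each day of July, as follows:

id date visits
1 2017-07-01 100
……

Conveniently, the id field equals the day of the month in the date. Select the dates on which the number of visitors exceeded 100 for three consecutive days. The output for the example above is:

date
2017-07-01
……

The query below returns the first day of every such three-day run:

SELECT t1.date
FROM Tourists AS t1
JOIN Tourists AS t2 ON t2.id = t1.id + 1
JOIN Tourists AS t3 ON t3.id = t2.id + 1
WHERE t1.visits > 100 AND t2.visits > 100 AND t3.visits > 100;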
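
One way to try out a self-join on consecutive ids is with Python's built-in sqlite3 module. SQLite stands in for MySQL here, and the rows are made up, since the article elides the real data; the query reports the first day of each three-day run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tourists (id INTEGER, date TEXT, visits INTEGER)")
# Made-up sample rows; the article's real data is not shown.
rows = [(1, "2017-07-01", 120), (2, "2017-07-02", 150), (3, "2017-07-03", 130),
        (4, "2017-07-04", 90), (5, "2017-07-05", 200), (6, "2017-07-06", 210)]
conn.executemany("INSERT INTO Tourists VALUES (?, ?, ?)", rows)

query = """
SELECT t1.date
FROM Tourists AS t1
JOIN Tourists AS t2 ON t2.id = t1.id + 1
JOIN Tourists AS t3 ON t3.id = t2.id + 1
WHERE t1.visits > 100 AND t2.visits > 100 AND t3.visits > 100
ORDER BY t1.id
"""
run_starts = [r[0] for r in conn.execute(query)]
print(run_starts)
```

Days 5 and 6 do not qualify in this sample because there is no third consecutive day after them.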

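XII. In a salary table named salary, the gender values male (m) and female (f) were found to be swapped for the month 2017-07. Fix the data with a single UPDATE statement. Example table data:

id name gender salary month
1 A m 1000 2017-06
2 B f 1010 2017-06

REPLACE('mf', gender, '') removes the current letter from 'mf' and leaves the other one:

UPDATE salary
SET gender = REPLACE('mf', gender, '')
WHERE month = '2017-07';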
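
The swap can be checked quickly with sqlite3, whose `replace()` function behaves like MySQL's for this purpose. The two 2017-06 rows come from the example; the two 2017-07 rows are made up for the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE salary (id INTEGER, name TEXT, gender TEXT, salary INTEGER, month TEXT)"
)
conn.executemany("INSERT INTO salary VALUES (?, ?, ?, ?, ?)", [
    (1, "A", "m", 1000, "2017-06"),
    (2, "B", "f", 1010, "2017-06"),
    (3, "C", "m", 1020, "2017-07"),  # made-up rows for the affected month
    (4, "D", "f", 1030, "2017-07"),
])
# Removing the current letter from 'mf' leaves the opposite letter.
conn.execute(
    "UPDATE salary SET gender = replace('mf', gender, '') WHERE month = '2017-07'"
)
genders = [r[0] for r in conn.execute("SELECT gender FROM salary ORDER BY id")]
print(genders)
```

Only the 2017-07 rows change; the filter on month keeps the correct 2017-06 data untouched.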

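XIII. Table A has 21 columns: the first column is id, and the remaining columns are feature fields named d1 to d20, with 100,000 rows in total. Another table B, called the pattern table, has the same structure as A and 50,000 rows. Find the rows in A whose features match a pattern in B, and record the corresponding ids.

Two cases satisfy the requirement:

Every feature column matches exactly
At most one feature column does not match while the other 19 match exactly, and it is not known which column is the mismatch

Since the mismatching column is unknown, the join cannot be restricted to any single column. Every pair of rows is compared and the matching columns are counted (20 is the first case, 19 the second). This compares 100,000 × 50,000 row pairs, so it is only a baseline:

SELECT A.id AS a_id, B.id AS b_id,
((CASE WHEN A.d1 = B.d1 THEN 1 ELSE 0 END) +
(CASE WHEN A.d2 = B.d2 THEN 1 ELSE 0 END) +
...) AS count_match
FROM A CROSS JOIN B
HAVING count_match >= 19;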
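
Comparing every row of A with every row of B is expensive at this size. A different technique, not the one used in the article's SQL, is to hash each B row with one position left out: two rows differ in at most one column exactly when they agree on some "leave one out" key. The sketch below assumes the rows have been loaded into dictionaries; all names are hypothetical.

```python
from collections import defaultdict

def match_with_one_mismatch(a_rows, b_rows):
    """a_rows, b_rows: {id: tuple_of_features}.
    Returns {a_id: set of b_ids} where at most one feature position differs."""
    index = defaultdict(set)
    for b_id, feats in b_rows.items():
        for i in range(len(feats)):
            # Key = position left out + remaining features.
            index[(i, feats[:i] + feats[i + 1:])].add(b_id)
    matches = defaultdict(set)
    for a_id, feats in a_rows.items():
        for i in range(len(feats)):
            hits = index.get((i, feats[:i] + feats[i + 1:]))
            if hits:
                matches[a_id] |= hits
    return dict(matches)

# Made-up rows with 3 features instead of 20, to keep the example readable.
a_rows = {1: (1, 2, 3), 2: (1, 2, 9), 3: (7, 8, 9)}
b_rows = {10: (1, 2, 3), 11: (0, 8, 9)}
print(match_with_one_mismatch(a_rows, b_rows))
```

The work grows with the number of rows times the number of columns, rather than with the product of the two table sizes.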

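XIV. Users' ratings of goods are represented as sparse vectors and stored in a database table t with the fields uid, goods_id, star. uid is the user id, goods_id is the goods id, and star is the user's rating of that item, from 1 to 5. We want to compute the inner product of every pair of vectors. The inner product here means: for two different users who have both rated the same set of goods, multiply their two ratings for each of those goods and sum the products.

Example. The table contains the following data:

U0 g0 2
U0 g1 4
U1 g0 3
U1 g1 1

The computed result is:

U0 U1 2*3+4*1=10
……

SELECT uid1, uid2, SUM(result) AS dot
FROM
(SELECT t1.uid AS uid1, t2.uid AS uid2, t1.goods_id, t1.star * t2.star AS result
FROM t AS t1
JOIN t AS t2
ON t1.goods_id = t2.goods_id AND t1.uid < t2.uid) AS p
GROUP BY uid1, uid2;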
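
The worked example can be reproduced with sqlite3, used here as a stand-in for MySQL. The rows are exactly the four from the example; the condition `t1.uid < t2.uid` is an assumption that each unordered pair of users should appear once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (uid TEXT, goods_id TEXT, star INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("U0", "g0", 2), ("U0", "g1", 4), ("U1", "g0", 3), ("U1", "g1", 1)],
)
query = """
SELECT t1.uid AS uid1, t2.uid AS uid2, SUM(t1.star * t2.star) AS dot
FROM t AS t1
JOIN t AS t2 ON t1.goods_id = t2.goods_id AND t1.uid < t2.uid
GROUP BY t1.uid, t2.uid
"""
dots = conn.execute(query).fetchall()
print(dots)
```

The join on goods_id keeps only goods that both users rated, which is what makes the sparse representation work.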

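XV. Count the teachers who teach more than one course, and output a table of the number of courses taught by each teacher

Assume the table class has the fields id, teacher, course

1. Count the teachers who teach more than one course

SELECT COUNT(*) AS teacher_count
FROM (SELECT teacher FROM class
GROUP BY teacher HAVING COUNT(*) > 1) AS m;

2. Output the number of courses taught by each teacher

SELECT teacher, COUNT(course) AS count_course
FROM class
GROUP BY teacher;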

XVI. Four people elect a knight. Count the votes and output the name of the real knight

Assume the table tbl has the fields id, knight, vote_knight

SELECT vote_knight FROM tbl
GROUP BY vote_knight
ORDER BY COUNT(*) DESC LIMIT 1;

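XVII. Given an employee table, a dormitory table, and a department table, produce a table of the headcount of each department in each dormitory building

Assume:

The employee table is employee, with fields id, employee_name, belong_dormitory_id, belong_department_id;
The dormitory table is dormitory, with fields id, dormitory_number;
The department table is department, with fields id, department_name

SELECT dor.dormitory_number, dep.department_name, COUNT(e.employee_name) AS count_employee
FROM employee AS e
LEFT JOIN dormitory AS dor ON e.belong_dormitory_id = dor.id
LEFT JOIN department AS dep ON e.belong_department_id = dep.id
GROUP BY dor.dormitory_number, dep.department_name;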

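XVIII. Given a table of numbers and their frequencies, find the median of the numbers

Assume the table tbl has the fields id, number, frequency

Each row covers the positions from last_idx + 1 to idx in the sorted data; the median is the average of the numbers whose rows cover the two middle positions:

SET @sum = (SELECT SUM(frequency) + 1 FROM tbl);
SET @idx = 0;
SET @last_idx = 0;
SELECT AVG(t.number)
FROM
(SELECT @last_idx := @idx AS last_idx, @idx := @idx + frequency AS idx, number
FROM tbl ORDER BY number) AS t
WHERE (t.last_idx < FLOOR(@sum / 2) AND FLOOR(@sum / 2) <= t.idx)
OR (t.last_idx < CEILING(@sum / 2) AND CEILING(@sum / 2) <= t.idx);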
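
The running-total idea is easier to verify in Python. The sketch below finds the numbers that cover the two middle positions of the expanded data without building the expanded list; the function name and the sample pairs are made up.

```python
def median_from_frequencies(pairs):
    """pairs: iterable of (number, frequency).
    Returns the median of the expanded data without materializing it."""
    ordered = sorted(pairs)
    total = sum(freq for _, freq in ordered)
    lo_pos, hi_pos = (total + 1) // 2, (total + 2) // 2  # 1-based middle positions
    picked = []
    running = 0
    for number, freq in ordered:
        prev, running = running, running + freq
        for pos in (lo_pos, hi_pos):
            if prev < pos <= running:  # this row covers positions prev+1 .. running
                picked.append(number)
    return sum(picked) / len(picked)

# Made-up table: expands to [1, 1, 2, 5].
print(median_from_frequencies([(1, 2), (2, 1), (5, 1)]))
```

When the total count is odd, both middle positions coincide and the same number is picked twice, so the average is that number.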

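XIX. Median: given one score sheet that combines three classes, compute the median score of each class

Assume the table tbl has the fields id, class, score

A score is a median candidate when at least half of its class scores no higher and at least half scores no lower; the outer query averages the candidates of each class:

SELECT class, AVG(score) AS median
FROM
(SELECT t1.class, t1.score
FROM tbl t1 JOIN tbl t2 ON t1.class = t2.class
GROUP BY t1.class, t1.score
HAVING SUM(CASE WHEN t1.score >= t2.score THEN 1 ELSE 0 END) >= COUNT(*) / 2
AND SUM(CASE WHEN t1.score <= t2.score THEN 1 ELSE 0 END) >= COUNT(*) / 2) AS m
GROUP BY class;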

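XX. The transaction table has the structure user_id, order_id, pay_time, order_amount

Write SQL to find the 3 days with the highest number of paying users in the past month (hint: the user count must be deduplicated)

Write SQL to find, for each user, the order ID and amount of their last paid order yesterday

SELECT DATE(pay_time) AS pay_date, COUNT(DISTINCT user_id) AS c FROM tbl
WHERE pay_time >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH)
GROUP BY DATE(pay_time) ORDER BY c DESC LIMIT 3;

SELECT t2.user_id, t2.order_id, t2.order_amount
FROM (SELECT user_id, MAX(pay_time) AS mt FROM tbl
WHERE DATEDIFF(pay_time, NOW()) = -1 GROUP BY user_id) AS t1
JOIN tbl AS t2 ON t1.user_id = t2.user_id AND t1.mt = t2.pay_time;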

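XXI. Given a PV table a (structure user_id, goods_id) and a click table b (user_id, goods_id), each with 500,000 rows, write one SQL statement that finds the user_id values common to both tables and the corresponding goods_id, while guarding against data skew

SELECT a.user_id, a.goods_id FROM a
WHERE EXISTS (SELECT 1 FROM b WHERE b.user_id = a.user_id);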

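XXII. The table structure is user_id, reg_time, age. Write one SQL statement that randomly samples 2,000 users by user_id. Write one SQL statement that samples 1% of the users in each age band (one band per 10 years, such as (0,10))

1. Randomly sample 2,000 users

SELECT * FROM tbl ORDER BY RAND() LIMIT 2000;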

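2. Sample 1% of the users in each age band

SELECT user_id, reg_time, age
FROM
(SELECT user_id, reg_time, age,
ROW_NUMBER() OVER (PARTITION BY FLOOR(age / 10) ORDER BY RAND()) AS rn,
COUNT(*) OVER (PARTITION BY FLOOR(age / 10)) AS cnt
FROM tbl) AS s
WHERE rn <= CEILING(cnt * 0.01);

Note: sampling by percentage is awkward in MySQL because LIMIT cannot be followed by a variable. The approach is to first compute the total for each age band, then work out what 1% of it is, then give every row an incrementing row number within its band, and stop once the row number reaches the 1% mark. The query above does this with window functions, which assumes MySQL 8.0 or later.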
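
The same per-band sampling is straightforward outside the database. The sketch below assumes the users have been loaded as (user_id, age) pairs and rounds the 1% up, so that every non-empty band contributes at least one user; both choices are assumptions, as is the function name.

```python
import math
import random
from collections import defaultdict

def stratified_sample(users, fraction=0.01, seed=0):
    """users: list of (user_id, age).
    Returns {band: sampled user_ids}, where band = age // 10."""
    rng = random.Random(seed)
    bands = defaultdict(list)
    for user_id, age in users:
        bands[age // 10].append(user_id)
    sample = {}
    for band, ids in bands.items():
        k = math.ceil(len(ids) * fraction)  # 1% of the band, rounded up
        sample[band] = rng.sample(ids, k)   # sampling without replacement
    return sample

# Made-up population: 5,000 users with ages 0..49, i.e. 1,000 per band.
users = [(i, i % 50) for i in range(5000)]
sample = stratified_sample(users)
print({band: len(ids) for band, ids in sorted(sample.items())})
```

Sampling inside each band keeps the age distribution of the sample close to that of the full table.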

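XXIII. The user login log table has the fields user_id, log_id, session_id, plat, visit_date. Write SQL for the average number of logged-in users per day over the last 30 days. Write SQL for the number of users who visited on 7 or more consecutive days within the last 30 days

1. Average number of logged-in users per day over the last 30 days

SELECT AVG(c) AS avg_daily_users FROM
(SELECT visit_date, COUNT(DISTINCT user_id) AS c FROM tbl
WHERE visit_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY) GROUP BY visit_date) AS d;

2. Number of users who visited on 7 or more consecutive days within the last 30 days

SELECT COUNT(DISTINCT t1.user_id)
FROM tbl t1
JOIN tbl t2 ON t2.user_id = t1.user_id AND t2.visit_date = DATE_ADD(t1.visit_date, INTERVAL 1 DAY)
...
JOIN tbl t7 ON t7.user_id = t1.user_id AND t7.visit_date = DATE_ADD(t6.visit_date, INTERVAL 1 DAY)
WHERE t1.visit_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY);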
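
A seven-way self-join gets unwieldy, so it can help to see the streak logic spelled out. The sketch below scans each user's sorted visit dates and tracks the current run length. It assumes the rows have already been restricted to the last 30 days, and the function name and sample data are made up.

```python
from datetime import date, timedelta

def users_with_streak(visits, min_days=7):
    """visits: iterable of (user_id, date).
    Returns the set of users with at least `min_days` consecutive visit dates."""
    days_by_user = {}
    for user_id, day in visits:
        days_by_user.setdefault(user_id, set()).add(day)  # dedupe same-day logins
    result = set()
    for user_id, days in days_by_user.items():
        streak, prev = 0, None
        for day in sorted(days):
            consecutive = prev is not None and day - prev == timedelta(days=1)
            streak = streak + 1 if consecutive else 1
            prev = day
            if streak >= min_days:
                result.add(user_id)
                break
    return result

start = date(2020, 5, 1)
visits = [("a", start + timedelta(days=i)) for i in range(7)]             # 7 in a row
visits += [("b", start + timedelta(days=i)) for i in range(6)]            # only 6
visits += [("b", start + timedelta(days=i)) for i in (8, 9, 10)]          # then a gap
print(users_with_streak(visits))
```

Several logins on the same day count once, which is why the dates are collected into a set first.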

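XXIV. The table has the fields user_id, visit_date, page_name, plat. Count the new users who visited on each of the last 7 days. For each access channel plat, compute the 3-day and 7-day retention rates of the users who were new 7 days ago

1. New users per day over the last 7 days

SELECT first_date, COUNT(*) AS new_users
FROM (SELECT user_id, MIN(visit_date) AS first_date
FROM tbl GROUP BY user_id) AS f
WHERE first_date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
GROUP BY first_date;

2. 3-day and 7-day retention of the users from 7 days ago, per channel

# 3-day retention
# First find the new users of each platform from 7 days ago,
# then join back the visits they made within the following 3 days
SELECT n.plat, COUNT(DISTINCT r.user_id) / COUNT(DISTINCT n.user_id) AS retention_3
FROM
(SELECT plat, user_id, MIN(visit_date) AS first_date
FROM tbl
GROUP BY plat, user_id
HAVING MIN(visit_date) = DATE_SUB(CURDATE(), INTERVAL 7 DAY)) AS n
LEFT JOIN tbl AS r
ON r.user_id = n.user_id AND r.plat = n.plat
AND r.visit_date > n.first_date
AND r.visit_date <= DATE_ADD(n.first_date, INTERVAL 3 DAY)
GROUP BY n.plat;

The 7-day retention follows by widening the window to INTERVAL 7 DAY. This assumes visit_date is a DATE column and counts a user as retained if they return at any point within the window.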
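
The retention calculation can be sketched in Python to make the definition explicit. Here a user counts as retained if they return to the same platform within the window after their first visit. That definition, the function name, and the sample rows are all assumptions, since retention is defined differently from team to team.

```python
from collections import defaultdict
from datetime import date, timedelta

def retention_by_platform(visits, cohort_day, window_days):
    """visits: list of (user_id, plat, date).
    Cohort = users whose first visit on a platform was on cohort_day.
    Retained = cohort users who came back within `window_days` days after it."""
    first_seen = {}
    for user_id, plat, day in visits:
        key = (plat, user_id)
        if key not in first_seen or day < first_seen[key]:
            first_seen[key] = day
    cohort = {key for key, day in first_seen.items() if day == cohort_day}
    retained = set()
    for user_id, plat, day in visits:
        gap = day - cohort_day
        if (plat, user_id) in cohort and timedelta(0) < gap <= timedelta(days=window_days):
            retained.add((plat, user_id))
    sizes, kept = defaultdict(int), defaultdict(int)
    for plat, _ in cohort:
        sizes[plat] += 1
    for plat, _ in retained:
        kept[plat] += 1
    return {plat: kept[plat] / sizes[plat] for plat in sizes}

d0 = date(2020, 5, 1)  # made-up cohort day
visits = [
    (1, "ios", d0), (1, "ios", d0 + timedelta(days=2)),
    (2, "ios", d0), (2, "ios", d0 + timedelta(days=5)),
    (3, "android", d0),
    (4, "ios", d0 - timedelta(days=1)), (4, "ios", d0 + timedelta(days=1)),
]
print(retention_by_platform(visits, d0, 3), retention_by_platform(visits, d0, 7))
```

User 4 first appeared a day earlier, so they are not part of the cohort even though they returned.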

End.

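Author: 稻娃

Source: CSDN

This article is a repost. If it infringes on your rights, please contact the site administrators to have it removed.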
