Douyu Danmaku Analysis

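I crawled these Douyu danmaku a while back out of boredom, and this post is an archive of the results. Most of them come from LOL (League of Legends) streams. The crawler is a modified version of brucezzDouyuCrawler, because the original program seemed to have some problems with storage.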

Overall composition

The streamer rooms that were crawled are listed below.

# 草莓
room.url.caomei = http://www.douyutv.com/caomei
# 微笑
room.url.weixiao = http://www.douyutv.com/weixiao
# 卷毛
room.url.juanmao = http://www.douyutv.com/fzz1
# 笑笑
room.url.xiaoxiao = http://www.douyutv.com/sunyalong
# 小仓
room.url.xiaocang = http://www.douyutv.com/xiaocang
# 孙悟空
room.url.sunwukong = http://www.douyutv.com/swk
# 五五开
room.url.wuwukai = http://www.douyutv.com/wt55kai
# 杀神风
room.url.shashenfeng = http://www.douyutv.com/217785
# 油条
room.url.youtiao = http://www.douyutv.com/56040
# 饼干
room.url.binggan = http://www.douyutv.com/4809
# Solofeng
room.url.solofeng = http://www.douyutv.com/lolff
# 叶音符
room.url.yeyinfu = http://www.douyutv.com/12313
# 霸哥
room.url.bage = http://www.douyutv.com/bage
# 誓约
room.url.shiyue = http://www.douyutv.com/dushuren
# 黑白
room.url.heibai = http://www.douyutv.com/93912
# 单人影
room.url.danrenyin = http://www.douyutv.com/251785
# 龙十七
room.url.longshiqi = http://www.douyutv.com/315279
# 吾单
room.url.wudan = http://www.douyutv.com/316336
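
The list above follows a simple `key = value` layout with `#` comment lines. As a minimal sketch, assuming the file is plain text in exactly this layout, it could be loaded into a dictionary as follows. The function name `parse_room_config` and the two-entry sample are made up for illustration and are not part of the crawler.

```python
def parse_room_config(text):
    """Parse 'room.url.<name> = <url>' lines into a {name: url} dict."""
    prefix = 'room.url.'
    rooms = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith('#'):
            continue
        key, sep, value = line.partition('=')
        if not sep:
            continue  # not a key = value line
        key = key.strip()
        if key.startswith(prefix):
            rooms[key[len(prefix):]] = value.strip()
    return rooms


# Two entries copied from the list above, used as sample input.
sample = """
# 油条
room.url.youtiao = http://www.douyutv.com/56040
# 饼干
room.url.binggan = http://www.douyutv.com/4809
"""
rooms = parse_room_config(sample)
```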

Digit-only danmaku: 690135; total danmaku: 5097333; share: 13.54%.

2016-07-24_Digit only.png

Ten most frequently sent messages

Content Count
66666666666 61119
护眼 59923
666 47633
66666666666666 46398
6666 35088
66666 28649
666666 23150
66666666666666666666 20114
1 18280
6666666 18213
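
A ranking like the one above comes down to counting identical values and keeping the largest counts. The sketch below shows the idea with the standard library only. The helper name `top_n` and the tiny sample list are illustrative assumptions, not the code that produced the table.

```python
from collections import Counter


def top_n(values, n=10):
    """Return the n most common values as (value, count) pairs."""
    return Counter(values).most_common(n)


# Tiny made-up sample standing in for the message content column.
sample = ['666', '护眼', '666', '1', '666', '护眼']
top = top_n(sample, 2)
```

On a pandas Series, `value_counts()` followed by `head(10)` gives the same kind of ranking.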

Ten streamers with the most danmaku

Room ID Danmaku count
56040 718936
217785 588062
138286 565031
321358 514643
475252 446493
335166 306877
4809 276416
60062 229880
12313 202524
319721 197282

Ten users who sent the most danmaku

Username Danmaku count
rickee7 2437
露易丝030 1688
dengxpeng 1580
yaobo754 1395
Louise030 1337
lordgw 1140
望月叹云薄 1056
1026692535 1047
1312312412123额2 1035
生哥思念 972

Digit-only danmaku analysis

1. Select all message content

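import pandas as pd

# The CSV has no header row, so columns are numbered; column 3 is taken as the message content.
danmaku_pd_dict = pd.read_csv('./Danmaku.csv', low_memory=False, header=None, sep=',')
danmaku_content = danmaku_pd_dict[3]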

2. Check whether each message is purely numeric

def is_number(s):
    # float() also accepts strings such as '1.5' or '1e5', not only digit strings.
    try:
        float(s)
        return True
    except ValueError:
        return False

3. Count

  • Total
total_rows = len(danmaku_content.index)
  • Digit-only
count = 0
try:
    for element in danmaku_content:
        if is_number(element):
            count += 1
except Exception as e:
    print(e)
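
The counting loop can also be written as a sum over a generator expression. This is a sketch, not the original code: the helper name `count_matching` and the sample list are assumptions, and `is_number` is repeated only to keep the block self-contained.

```python
def is_number(s):
    """Same check as in step 2: True if float() can parse s."""
    try:
        float(s)
        return True
    except ValueError:
        return False


def count_matching(values, predicate):
    """Count how many items in values satisfy predicate."""
    return sum(1 for v in values if predicate(v))


# Made-up sample standing in for the content column.
sample = ['666', '护眼', '1', 'hello', '66666']
count = count_matching(sample, is_number)
```

Any iterable works as `values`, so a pandas Series such as `danmaku_content` could be passed in directly.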

4. Compute the percentage and plot

import matplotlib.pyplot as plt

# The slices will be ordered and plotted counter-clockwise.
labels = 'Digit only danmaku', 'Other'
sizes = [count, total_rows - count]
colors = ['yellowgreen', 'lightskyblue']
explode = (0.1, 0)  # only "explode" the 1st slice (i.e. 'Digit only danmaku')
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.2f%%', shadow=True, startangle=90)
# Set aspect ratio to be equal so that pie is drawn as a circle.
plt.axis('equal')
plt.title('Digit only danmaku')
# savefig does not create directories, so ./image must already exist.
plt.savefig('./image/Digit only.png', format='png')
plt.show()
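
In the plotting code the percentage itself is computed by `autopct` inside `plt.pie`. To get the number without drawing anything, the same arithmetic can be done directly, as in the sketch below. The helper name `percentage` is an assumption, and the two inputs are the counts reported in the overview.

```python
def percentage(part, total):
    """Return part / total as a percentage; 0.0 if total is zero."""
    if total == 0:
        return 0.0
    return 100.0 * part / total


# Counts reported in the overview section of this post.
digit_only = 690135
total = 5097333
share = '%.2f%%' % percentage(digit_only, total)
```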

Problems encountered

1. Replacing the font

However, it still does not seem to work when saving the image:

UserWarning: findfont: Font family [u'Kaiti SC'] not found. Falling back to Bitstream Vera Sans
(prop.get_family(), self.defaultFamily[fontext]))

2. KeyError when index-slicing a count result of type int64

print(value_counts[:10])

Slicing the per-streamer count result (type: pandas.core.series.Series) for output raises a KeyError. The following works instead:

print(value_counts.head(10))
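
The exact cause of the KeyError is not recorded here. One way to sidestep any ambiguity is to select rows by position, which never depends on how the integer room IDs in the index are interpreted. The sketch below uses a made-up Series of room IDs (an assumption for illustration) and compares `head` with the positional indexer `iloc`.

```python
import pandas as pd

# Made-up room IDs standing in for the real room column.
rooms = pd.Series([56040, 4809, 56040, 217785, 56040, 4809])
value_counts = rooms.value_counts()

# Both select the first rows by position, regardless of index labels.
top_two_head = value_counts.head(2)
top_two_iloc = value_counts.iloc[:2]
```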