2017-01-10 09:59:11

Crawling and Analyzing User Information on ArcGIS知乎 (Code Included)

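Last week Teacher Chen asked me to collect our department members' activity on ArcGIS知乎: reputation, points, upvotes, thanks, questions, replies, and articles. There were not many people, so I simply opened the profile pages one by one and typed the numbers into Excel.
That approach will not scale if we later need to analyze user behavior in bulk. Since I happen to be interested in web crawlers, I wrote one as practice.
The crawl collected the following: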
Valid registered users: 4,633

Total questions: 5,247

Total replies: 6,146

Total articles: 2,539

Date: January 10, 2017, 12:00

Crawl in progress
Crawl finished
Crawl results
A brief summary and analysis of the results

Top 20 by reputation

Top 20 by points

Top 20 by upvotes

Top 20 by thanks

Top 20 by questions asked

Top 20 by replies

Top 20 by articles

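1. Of the 5,185 user IDs crawled, 4,633 were valid users, a valid-registration rate of 89.4%.

2. Across the rankings, verified users are relatively active.

3. A few ordinary users are quite helpful, turtle for example; granting them verified status would be a nice way to encourage them (^_^)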
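
The rankings and the valid-user rate in point 1 can be reproduced from the crawler's comma-separated output. What follows is a minimal Python 3 sketch, not the script used to produce the tables above. The helper names `load_rows`, `top_n` and `valid_rate` and the sample rows are hypothetical; only the column order (ID, user name, reputation, points) is taken from the crawler.

```python
import csv
import io

# Hypothetical sample rows in the crawler's column order:
# ID, user name, reputation, points (remaining columns omitted here).
SAMPLE = """1,alice,120,3400
2,bob,15,900
4,carol,300,5100
"""

def load_rows(text):
    """Parse comma-separated crawler output into a list of rows."""
    return list(csv.reader(io.StringIO(text)))

def top_n(rows, column, n=20):
    """Return the n rows with the largest integer value in the given column."""
    return sorted(rows, key=lambda row: int(row[column]), reverse=True)[:n]

def valid_rate(valid, crawled):
    """Percentage of crawled IDs that were valid users, to one decimal place."""
    return round(100.0 * valid / crawled, 1)

rows = load_rows(SAMPLE)
for row in top_n(rows, column=2, n=2):
    print(row[1], row[2])            # user name and reputation
print(valid_rate(4633, 5185))        # the figures quoted in point 1
```

Sorting by column index rather than by header name keeps the sketch independent of how the header row is worded.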

This is the company's own site and I was crawling during the day, so I did not use multiple processes.
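
Beyond keeping the crawl sequential, a fixed pause between requests is one more way to limit the load on the server. The following is a minimal Python 3 sketch under that assumption. The function `polite_fetch_all`, its `fetch` callback and the delay value are hypothetical and are not part of the original script.

```python
import time

def polite_fetch_all(ids, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(user_id) for each ID in order, pausing between calls.

    fetch is any callable that downloads one page. sleep is injectable so
    the pause can be replaced when testing.
    """
    results = {}
    for index, user_id in enumerate(ids):
        if index > 0:
            sleep(delay_seconds)     # wait before every request except the first
        try:
            results[user_id] = fetch(user_id)
        except Exception as exc:     # one failed page should not stop the crawl
            results[user_id] = None
            print("user %s skipped: %s" % (user_id, exc))
    return results

# Usage with a stand-in fetch function instead of a real HTTP request.
pages = polite_fetch_all([1, 2, 3], lambda i: "page %d" % i, delay_seconds=0.01)
print(sorted(pages))
```

Passing the download step in as a callback keeps the pacing logic separate from whichever HTTP library is used.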

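ArcGIS知乎 could add some anti-crawler measures.

PS: In practice, extracting user information directly from the database would be more complete and reliable.

The code follows.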

#!/usr/bin/env python
#-*-coding:utf-8 -*-
__author__ = 'jiang'
# Import modules (Python 2: urllib2 and BeautifulSoup 3)
import urllib2
import BeautifulSoup
import re
import datetime
starttime = datetime.datetime.now()
# Write the header row once; the with block closes the file afterwards
with open('C://Users//Esri//Desktop//arcgiazhihu.txt', 'a') as f:
    line = ','.join(["ID", "Username", "Reputation", "Points", "Upvotes", "Thanks", "Questions", "Replies", "Articles", "Following", "Followers", "Topics followed", "Profile URL", "Custom URL"]) + '\n'
    f.write(line)
num=0
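# raw_input returns a string; convert it explicitly instead of letting input() evaluate it
shuliang = int(raw_input("Enter the number of user IDs to crawl: "))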
for namecode in range(1,shuliang+1):
    url = 'http://zhihu.esrichina.com.cn/people/' + str(namecode)
    #print(url)
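    # Set the request headers. Some pages can be fetched without custom headers,
    # but that is not the case here (details omitted for now). To find out which
    # headers to send, use a tool that shows the traffic between your browser and
    # the site (many browsers ship with one); I used the Firebug add-on for Firefox.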
    header = {'Host': 'zhihu.esrichina.com.cn','User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:26.0) Gecko/20100101 Firefox/26.0','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8','Connection': 'keep-alive'}
    try:
        req = urllib2.Request(url, headers = header)
        con = urllib2.urlopen( req )
        # read() returns the page's HTML source as a plain string
        doc = con.read()
        #print doc
        # Build a soup object from the HTML read above
        soup = BeautifulSoup.BeautifulSoup(doc)
        user_name = soup.html.body.find('div', {'class': 'mod-head'}).h1
        #print user_name
        str123=str(user_name)
        user_name1 = re.sub(r'\s|<h1>|<.*>', '', str123)
        weiwang1 = soup.findAll('em', {'class': 'aw-text-color-green'})
        chengjiu1=soup.findAll('em', {'class': 'aw-text-color-orange'})
        q_a = soup.findAll('span', {'class': 'badge'})
        fensi = soup.findAll('em', {'class': 'aw-text-color-blue'})
        weiwang2=weiwang1[0:len(weiwang1) / 2]
        chengjiu=chengjiu1[0:len(chengjiu1) / 2]
        weiwang = re.sub(r'<em class="aw-text-color-green">|<.*>', '', str(weiwang2[0]))
        jifen=re.sub(r'<em class="aw-text-color-orange">|<.*>', '', str(chengjiu[0]))
        zantong = re.sub(r'<em class="aw-text-color-orange">|<.*>', '', str(chengjiu[1]))
        ganxie = re.sub(r'<em class="aw-text-color-orange">|<.*>', '', str(chengjiu[2]))
        fawen=re.sub(r'<span class="badge">|<.*>', '', str(q_a[0]))
        huifu=re.sub(r'<span class="badge">|<.*>', '', str(q_a[1]))
        wenzhang=re.sub(r'<span class="badge">|<.*>', '', str(q_a[2]))
        guanzhu=re.sub(r'<em class="aw-text-color-blue">|<.*>', '', str(fensi[0]))
        beiguanzhu = re.sub(r'<em class="aw-text-color-blue">|<.*>', '', str(fensi[1]))
        guanzhuhuati = re.sub(r'\n|<em class="aw-text-color-blue">|<.*>', '', str(fensi[2]))
        gerenzhuye='http://zhihu.esrichina.com.cn/people/'+str(namecode)
        gexingzhuye='http://zhihu.esrichina.com.cn/people/'+user_name1
        print "Crawling user [" + user_name1 + "]"
        # print "Reputation: " + weiwang
        # print "Points: " + jifen
        # print "Upvotes: " + zantong
        # print "Thanks: " + ganxie
        # print "Questions: " + fawen
        # print "Replies: " + huifu
        # print "Articles: " + wenzhang
        # print "Profile URL: " + gerenzhuye
        # print "Custom URL: " + gexingzhuye
        # print "--------------------------------------------------------------------------------------------------------------------------------------"
        # Append this user's row; the with block closes the file after writing
        with open('C://Users//Esri//Desktop//arcgiazhihu.txt', 'a') as f:
            line = ','.join([str(namecode), str(user_name1), weiwang, jifen, zantong, ganxie, fawen, huifu, wenzhang, guanzhu, beiguanzhu, guanzhuhuati, gerenzhuye, gexingzhuye]) + '\n'
            f.write(line)
        num=num+1
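    except Exception as e:
        # Any failure (missing page, unexpected layout) is reported as a nonexistent user
        print "User " + str(namecode) + " does not exist"
        continue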
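endtime = datetime.datetime.now()
# total_seconds() includes whole days; .seconds alone would drop them
elapsed = int((endtime - starttime).total_seconds())
print "Crawled " + str(num) + " valid users in total"
print "Elapsed time: " + str(elapsed) + " seconds"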
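
The script above targets Python 2 and strips tags from the matched elements with regular expressions. As an alternative, here is a minimal Python 3 sketch of the same extraction step using only the standard library's `html.parser`. The class `ClassTextCollector` and the sample HTML are hypothetical. The CSS class names are the ones the script searches for, but the real page layout is an assumption.

```python
from html.parser import HTMLParser

class ClassTextCollector(HTMLParser):
    """Collect the text of elements matching given (tag, class) pairs."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted        # set of (tag, class) pairs to capture
        self.found = {}             # (tag, class) -> list of text values
        self._current = None        # pair currently being captured, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if (tag, cls) in self.wanted:
                self._current = (tag, cls)
                self.found.setdefault(self._current, []).append("")
                break

    def handle_data(self, data):
        if self._current is not None:
            self.found[self._current][-1] += data.strip()

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current[0]:
            self._current = None

# Hypothetical page fragment using the class names from the script.
SAMPLE_HTML = """
<em class="aw-text-color-green">12</em>
<em class="aw-text-color-orange">340</em>
<em class="aw-text-color-orange">5</em>
<span class="badge">7</span>
"""

collector = ClassTextCollector({
    ("em", "aw-text-color-green"),
    ("em", "aw-text-color-orange"),
    ("span", "badge"),
})
collector.feed(SAMPLE_HTML)
print(collector.found)
```

Reading the element text directly avoids building one regular expression per class name. The sketch does not handle nested elements with the same tag, which a full HTML library would.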