2017-02-09 18:35:28

Reading, Analyzing, and Charting Web Server (Apache) Logs, with Code

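Let me start with a line that is bound to draw some flak: Python is the best programming language!
Is there no law anymore?
Bar none.
Any challengers?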

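I have been learning Python lately.
The first thing I learned was that Python can be used for web scraping: a few simple lines of code are enough to crawl a site's content.
It is also quick to pick up; you can put it to work as soon as you learn it.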

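Python really can do almost anything, and it keeps gaining attention. Now is the time to quote this classic picture:
Life is short, I use Python.
For example, I now write batch-processing scripts with Arcpy, which takes care of a lot of tedious work.
Then there is web scraping: analyzing the data you have crawled is very satisfying.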
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
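Before the Chinese New Year holiday I was handed a challenge: analyze the Apache logs on the server of the company forum site (ArcGIS知乎) to see how the forum's search is being used.
One log file is generated per day; there are a little over two hundred days of them, about 1.3 GB in total, which is not much.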

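Reading the data and counting searches per day

1. First, of course, look at the file names. They are nicely regular, which makes the later reading and storing easier.

2. Then open any one file and study how the records are laid out,
find the part you need, and start thinking about how to extract it.

3. Next, try writing code that reads a single file. No trouble there.
4. What we want is the request lines for searches of the form "GET //search/ajax/search_result/search_type-questions__q-%E5%85%A5%E9%97%A8". From each one, take the q-%E5%85%A5%E9%97%A8 part
and decode it with "urllib.parse.unquote" into a readable keyword (here 入门, "getting started"). That is the result we are after.
5. After reading the keywords from every log file, save them to text files. The code is as follows: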

### Read the source logs, pick out the search requests, decode them, and keep the results in memory
### (file names and CSV headers stay in Chinese because later steps match on them)
import collections
import datetime
import os
import re
import urllib.parse

starttime = datetime.datetime.now()  # start time, used at the very end to report the run time
rootdir = "C://Users//Esri//Desktop//log"
loglist = []
tongjilist = []
day = 0  # number of log files read, one per day
i = 0  # total number of searches
for parent, dirnames, filenames in os.walk(rootdir):
    for filename in filenames:
        filedate = filename[7:-4]
        filepath = os.path.join(parent, filename)
        num = 0  # number of searches in this file
        with open(filepath, 'r') as logfile:
            for line in logfile:
                # Test each marker separately: ('a' and 'b') in line only checks the last string
                if ('GET //search/ajax/search_result/search_type-questions__q-' in line
                        and 'template-__page-1' in line):
                    fields = line.split()
                    kw1 = fields[6][53:-19]  # strip the fixed prefix and the page suffix
                    kw = urllib.parse.unquote(kw1).replace('_', '')
                    if '?' not in kw:
                        logline = fields[0] + ',' + fields[3][1:] + ',' + kw + '\n'
                        loglist.append(logline)
                        i += 1
                        num += 1
        day += 1
        print(filedate + ": " + str(num) + " searches")
        tongjilist.append(filedate + ',' + str(num) + '\n')
### Write the header row and then the in-memory lists to text files
with open('C://Users//Esri//Desktop//statistics//原始数据.txt', 'w', encoding='utf-8') as log:
    log.write("IP" + ',' + "日期" + ',' + "搜索关键词" + '\n')  # header: IP, date, search keyword
    log.writelines(loglist)
with open('C://Users//Esri//Desktop//statistics//每天搜索次数统计.txt', 'w', encoding='utf-8') as tongji:
    tongji.write("日期" + ',' + "次数" + '\n')  # header: date, count
    tongji.writelines(tongjilist)
del loglist
del tongjilist
sumday = i
print(str(day) + " days, " + str(sumday) + " searches in total")
# filedate is the date of the last file read; count back to get the first day
startday = (datetime.datetime.strptime(filedate, "%Y-%m-%d") - datetime.timedelta(days=day - 1)).strftime("%Y-%m-%d")

PS: reading all the files took 10 s
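
The keyword extraction in step 4 can be tried on its own before running it over every log file. The snippet below is only a minimal sketch: the sample log line, its IP address, and the helper name `extract_keyword` are made up for illustration, and the `__template-__page-1` suffix is assumed from the slicing used in the script above.

```python
import urllib.parse

# Fixed text before and after the encoded keyword (the suffix is an assumption)
PREFIX = "//search/ajax/search_result/search_type-questions__q-"
SUFFIX = "__template-__page-1"


def extract_keyword(request_path):
    """Return the decoded search keyword, or None if the path is not a search request."""
    if not (request_path.startswith(PREFIX) and request_path.endswith(SUFFIX)):
        return None
    encoded = request_path[len(PREFIX):len(request_path) - len(SUFFIX)]
    return urllib.parse.unquote(encoded)


# A made-up log line in the common Apache access-log layout
sample_line = ('203.0.113.7 - - [09/Feb/2017:18:35:28 +0800] '
               '"GET //search/ajax/search_result/search_type-questions__q-'
               '%E5%85%A5%E9%97%A8__template-__page-1 HTTP/1.1" 200 1234')
fields = sample_line.split()
# fields[0] is the IP, fields[3] the timestamp with a leading "[", fields[6] the request path
print(fields[0], fields[3][1:], extract_keyword(fields[6]))
```

Matching on the prefix and suffix instead of fixed slice positions means a request that does not fit the pattern yields None rather than a mangled keyword.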

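Deduplicating by IP, date, and search keyword

Good, we have a first result.
From the code you can see that 原始数据.txt holds three things: IP, time, and search keyword.
Rows with the same IP, the same date, and the same search keyword are deduplicated to get a cleaned data source; only then is further analysis meaningful.
The code is as follows: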

'''Deduplicate rows that share the same IP, date, and search keyword'''
lines_seen = set()
with open('C://Users//Esri//Desktop//statistics//原始数据.txt', "r", encoding='utf-8') as infile, \
        open('C://Users//Esri//Desktop//statistics//nodate.txt', "w", encoding='utf-8') as outfile:
    for line in infile:
        rline1, rline3, rline2 = line.split(',', 2)  # IP, timestamp, keyword (the keyword keeps its newline)
        rline3 = rline3[:-9]  # drop ":HH:MM:SS" so that only the date is left
        reline = rline1 + ',' + rline3 + ',' + rline2
        if reline not in lines_seen:
            outfile.write(reline)
            lines_seen.add(reline)
# The slice above also empties the date column of the header row, so restore it here
with open('C://Users//Esri//Desktop//statistics//nodate.txt', "r", encoding='utf-8') as opendupbeta, \
        open('C://Users//Esri//Desktop//statistics//爬取的数据源.txt', "w", encoding='utf-8') as opendup:
    opendup.write(re.sub(r'IP,,搜索关键词', 'IP,日期,搜索关键词', opendupbeta.read()))
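
The deduplication idea is easier to see on a few in-memory rows. This is a small sketch with invented sample data; the function name `dedupe_rows` and the IP addresses are hypothetical, and the timestamps are assumed to look like `09/Feb/2017:18:35:28`.

```python
def dedupe_rows(rows):
    """Keep the first row for each (IP, date, keyword) combination."""
    seen = set()
    unique = []
    for ip, timestamp, keyword in rows:
        # Keep only the date part of the timestamp, e.g. 09/Feb/2017
        key = (ip, timestamp.split(':', 1)[0], keyword)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


rows = [
    ("203.0.113.7", "09/Feb/2017:18:35:28", "arcgis"),
    ("203.0.113.7", "09/Feb/2017:18:36:02", "arcgis"),   # same IP, day, and keyword: dropped
    ("203.0.113.7", "10/Feb/2017:09:00:00", "arcgis"),   # another day: kept
    ("198.51.100.4", "09/Feb/2017:18:35:28", "arcgis"),  # another IP: kept
]
for row in dedupe_rows(rows):
    print(row)
```

A set gives constant-time membership checks, so the pass stays fast even over a few hundred days of searches.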

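Top 100 search keywords

What can we do with the data?
Since these are search keywords from a forum site, let's count the most frequently searched keywords: the TOP 100!
Python is easy to pick up, but that brings a problem: your knowledge of Python may not go very deep.
So when you start implementing your own idea, a "naive programming mindset" takes over: you struggle through dozens of lines, and the result is still not quite right.
In the end you discover the feature already exists as a function, and a single line calling it is all you need.
I can't help marveling at how powerful Python is: the limit is usually what you can think of, not what it can do.
Well, that is my takeaway!
Here is the code for the Top 100 count: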

'''Top 100 search keywords'''
# Collect the keyword column, skipping the header row
with open('C://Users//Esri//Desktop//statistics//爬取的数据源.txt', "r", encoding='utf-8') as outfile2, \
        open('C://Users//Esri//Desktop//statistics//frequency.txt', 'w', encoding='utf-8') as f2:
    for line in outfile2:
        reline1 = line.split(',')[2]
        if reline1.strip() != '搜索关键词':
            f2.write(reline1)
with open('C://Users//Esri//Desktop//statistics//frequency.txt', encoding='utf-8') as openfre:
    # split() cuts on whitespace, so a multi-word search is counted word by word
    str1 = openfre.read().lower().split()
# Counter works like a dict: each keyword maps to the number of times it occurs,
# for example collections.Counter(str1)['arcgis']
mc100 = collections.Counter(str1).most_common(100)
x = []  # keywords, most frequent first
y = []  # matching counts
with open('C://Users//Esri//Desktop//statistics//关键词top100.txt', 'w', encoding='utf-8') as file3:
    file3.write("关键词" + ',' + "次数" + '\n')  # header: keyword, count
    for keyword, count in mc100:
        file3.write(keyword + ',' + str(count) + '\n')
        x.append(keyword)
        y.append(count)
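
The "one line is enough" point refers to `collections.Counter` and its `most_common` method. Here is a tiny illustration with a made-up keyword list:

```python
import collections

# Invented sample searches; lower-casing makes "ArcGIS" and "arcgis" count as one keyword
keywords = ["arcgis", "ArcGIS", "python", "arcpy", "python", "arcgis"]
counts = collections.Counter(k.lower() for k in keywords)

# most_common(n) returns the n most frequent items as (keyword, count) pairs
top2 = counts.most_common(2)
for keyword, count in top2:
    print(keyword + "," + str(count))
```

Because `most_common` already returns (keyword, count) pairs in descending order, there is no need to turn the result into a string and parse it back.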

Counting searches per month

With the daily search counts in hand, the monthly counts follow.

'''Monthly totals'''
# Read the daily counts directly and skip the header row; keying the totals by month
# keeps two months apart even when their totals happen to be equal
monthly = collections.OrderedDict()
with open('C://Users//Esri//Desktop//statistics//每天搜索次数统计.txt', "r", encoding='utf-8') as openutj:
    for line5 in openutj:
        line5 = line5.strip()
        if line5 == '' or line5 == '日期,次数':
            continue
        mon = line5[:7]  # "YYYY-MM"
        monthly[mon] = monthly.get(mon, 0) + int(line5[11:])  # the count follows "YYYY-MM-DD,"
sousuo = [mon + ',' + str(summ) for mon, summ in monthly.items()]
with open('C://Users//Esri//Desktop//statistics//每月搜索次数统计.txt', "w", encoding='utf-8') as outfile7:
    outfile7.write('月份' + ',' + '次数' + '\n')  # header: month, count
    outfile7.write('\n'.join(sousuo))
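
Rolling daily counts up into months comes down to grouping by the first seven characters of the date. The sketch below uses invented numbers and a hypothetical helper name, `monthly_totals`, and assumes dates in `YYYY-MM-DD` form.

```python
import collections


def monthly_totals(daily_rows):
    """Sum (date, count) pairs per month, keeping months in order of first appearance."""
    totals = collections.OrderedDict()
    for date, count in daily_rows:
        month = date[:7]  # "YYYY-MM"
        totals[month] = totals.get(month, 0) + count
    return totals


# Invented daily counts; both months deliberately add up to the same total
daily_rows = [
    ("2016-07-30", 12),
    ("2016-07-31", 8),
    ("2016-08-01", 20),
    ("2016-08-02", 0),
]
for month, total in monthly_totals(daily_rows).items():
    print(month + "," + str(total))
```

Keying the totals by month matters when two months have the same total: removing duplicates from a plain list of sums would silently merge them.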

Charting the results

Once the statistics are ready, they can be visualized with Python libraries such as matplotlib.

'''Daily chart'''
import matplotlib.pyplot as plt

# Read the daily counts, dropping the header row and blank lines
with open('C://Users//Esri//Desktop//statistics//每天搜索次数统计.txt', 'r', encoding='utf-8') as opentj1:
    rows = [line for line in opentj1.read().split('\n') if line != '日期,次数' and line != '']
# Keep a header-free copy, one "date,count" row per line, with no trailing newline
with open('C://Users//Esri//Desktop//statistics//uheadtongji.txt', 'w', encoding='utf-8') as openu1:
    openu1.write('\n'.join(rows))
# Separate names are used here so that the keyword lists x and y from the Top 100 step
# are still intact when the top-20 chart is drawn
dayy = [int(row.split(',')[1]) for row in rows]
dayx = list(range(1, len(dayy) + 1))  # day index 1, 2, 3, ...

fig3 = plt.figure()
fig3.set_size_inches(30, 10.5)
plt.title(
    'ArcGIS知乎 site searches per day (' + startday + ' to ' + filedate + ')' + '\n' + '(' + str(sumday) + ' searches in ' + str(day) + ' days)',
    fontsize=26)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel('Day (' + startday + '+)', fontsize=23)
plt.ylabel('Searches', fontsize=23)
plt.plot(dayx, dayy, 'b*')  # one marker per day
plt.plot(dayx, dayy, 'r')  # red line through the points
for _x, _y in zip(dayx, dayy):
    plt.text(_x, _y, _y, color='red', fontsize=25)  # label each point with its count
savets = "C:/Users/Esri/Desktop/" + re.sub(':', '', str(datetime.datetime.now())) + "ts.png"
fig3.savefig(savets, dpi=100)
# plt.show()  # show the plot on the screen

'''Monthly line chart'''
realmonth = []  # month numbers, used as tick labels
monsta = []  # monthly totals
with open('C://Users//Esri//Desktop//statistics//每月搜索次数统计.txt', 'r', encoding='utf-8') as openmonth:
    for monybeta in openmonth:
        mony = monybeta.replace('\n', '').split(',')[1]
        if mony != '次数':  # skip the header row
            monsta.append(int(mony))  # plot numbers, not strings
            realmonth.append(monybeta[5:7])
mons = list(range(1, len(monsta) + 1))
fig4 = plt.figure()
fig4.set_size_inches(30, 10.5)
plt.title(
    'ArcGIS知乎 site searches per month (' + startday + ' to ' + filedate + ')' + '\n' + '(' + str(sumday) + ' searches in ' + str(day) + ' days)',
    fontsize=26)
plt.yticks(fontsize=20)
plt.xlabel('Month (starting from month ' + str(startday)[5:7] + ')', fontsize=23)
plt.ylabel('Searches', fontsize=23)
plt.plot(mons, monsta, 'bo', mons, monsta, 'r')  # blue dots plus a red line
plt.xticks(mons, realmonth, fontsize=20)
for _x, _y in zip(mons, monsta):
    plt.text(_x, _y, _y, color='red', fontsize=25)  # label each point with its total
savemonth = "C:/Users/Esri/Desktop/" + re.sub(':', '', str(datetime.datetime.now())) + "month.png"
fig4.savefig(savemonth, dpi=100)

'''Top 20 chart'''
x = x[:20]  # keywords from the Top 100 step
y = y[:20]  # their counts
x1 = [i + 5 for i in range(1, len(x) + 1)]  # bar positions 6, 7, 8, ...
DataX = tuple(x1)
DataY = tuple(y)
fig1 = plt.figure()
fig1.set_size_inches(30, 15)
# Pass the bar positions positionally: newer matplotlib versions name the first parameter x, not left
rects = plt.bar(DataX, DataY, width=0.5, align="center", yerr=0.000001, facecolor='lightskyblue',
                edgecolor='white')
plt.plot(DataX, DataY)


def autolabel(rects):
    # Write each bar's count just above it
    for rect in rects:
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width() / 3., 1.01 * height, '%s' % int(height), color='red', fontsize=25)


autolabel(rects)
plt.xlabel('Keyword', fontsize=23)
plt.ylabel('Searches', fontsize=23)
plt.title('ArcGIS知乎 site search keywords, top 20 (' + startday + ' to ' + filedate + ')' + '\n' + '(' + str(
    sumday) + ' searches in ' + str(day) + ' days)', fontsize=26)
plt.yticks(fontsize=20)
plt.xticks(DataX, tuple(x), fontsize=20)
savename = "C:/Users/Esri/Desktop/" + re.sub(':', '', str(datetime.datetime.now())) + ".png"
fig1.savefig(savename, dpi=100)
# plt.show()

Since all the data above is saved as txt files, it can be converted to Excel format to make charting easier. The code is as follows:

import xlsxwriter


def txt_to_xlsx(txtpath):
    # The workbook takes the text file's base name; txtpath must use '//' separators like the paths above
    exc = xlsxwriter.Workbook(
        'C://Users//Esri//Desktop//statistics//' + txtpath.split('//')[-1].split('.')[0] + '.xlsx')
    worksheet = exc.add_worksheet()
    with open(txtpath, 'r', encoding='utf-8') as openf:
        opentxt = openf.read().split('\n')
    # Drop blank lines first and split every remaining row into its columns
    newtxt = [line.split(',') for line in opentxt if line != '']
    # Write the in-memory rows to the xlsx file, column by column
    colnums = len(newtxt[0])
    for colnum in range(0, colnums):
        # Compute the width per column so that a wide column does not widen the ones after it
        maxwidth = max(len(row[colnum]) for row in newtxt)
        worksheet.set_column(colnum, colnum, width=maxwidth)
        for linenum, row in enumerate(newtxt):
            worksheet.write(linenum, colnum, row[colnum])
    exc.close()


# For example: txt_to_xlsx('C://Users//Esri//Desktop//statistics//关键词top100.txt')

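At this point, if the code ran without errors, the analysis results are ready.
If I then want to send the results to someone by email, the following code does it: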

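import smtplib
from email.header import Header
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.utils import formataddr, parseaddr

_user = "sender's email address"
smtpserver = 'SMTP server address'
_from = "sender's name <sender's email address>"
_pwd = "email password"
# ---- separate multiple recipients with ASCII commas ----
_to = "recipient's name <recipient's email address>"

'''_format_addr() formats an email address.
Simply passing in name <addr@example.com> is not enough:
a name that contains Chinese has to be encoded through a Header object. (Quoted from Liao Xuefeng)'''
def _format_addr(s):
    name, addr = parseaddr(s)
    return formataddr((Header(name, 'utf-8').encode(), addr))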

# As the name suggests, a Multipart message is made of several parts
msg = MIMEMultipart()
msg["Subject"] = "email subject"
msg["From"] = _format_addr(_from)
msg["To"] = _to

# --- Text and inline image ---
def addimg(src, imgid):  # src: image path, imgid: image id
    with open(src, 'rb') as fp:
        msgImage = MIMEImage(fp.read())  # build a MIMEImage from the image bytes
    # src="cid:..." in the <img> tag refers to this Content-ID; the header value goes in angle brackets
    msgImage.add_header('Content-ID', '<' + imgid + '>')
    return msgImage

part = MIMEText('<font size="18px" color=red><b><center>ArcGIS知乎 site search keyword statistics (details in the attachments)</center></b></font>'+'<br/><b><center>Charts and data generated at: '+str(datetime.datetime.now())+'</center></b><img width="1300px" src="cid:pic">',"html","utf-8" )
msg.attach(part)
msg.attach(addimg(savename,"pic"))

# --- Attachments ---
def attach_file(path, name):  # path: file on disk, name: file name shown in the email
    with open(path, 'rb') as fp:
        part = MIMEApplication(fp.read())
    part.add_header('Content-Disposition', 'attachment', filename=('gbk', '', name))
    msg.attach(part)

statdir = 'C://Users//Esri//Desktop//statistics//'
# Every .xlsx file here has to be produced by txt_to_xlsx() first, including the daily counts
for name in ("爬取的数据源.xlsx", "每天搜索次数统计.xlsx", "每月搜索次数统计.xlsx", "关键词top100.xlsx"):
    attach_file(statdir + name, name)
attach_file(savename, "Top20统计结果直方图.png")  # top-20 bar chart
attach_file(savets, "每天统计结果折点图.png")  # daily line chart
attach_file(savemonth, "每月统计结果折点图.png")  # monthly line chart

s = smtplib.SMTP(smtpserver, 25)  # connect to the SMTP server; 25 is the default port
s.login(_user, _pwd)  # log in to the server
s.sendmail(_user, _to.split(','), msg.as_string())  # send the email
s.quit()  # end the session and close the connection
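
For comparison, the standard library's newer `email.message.EmailMessage` class can build the same kind of message with less ceremony. This is a different API from the MIME classes used above, shown only as a sketch: the addresses, the file name, and the helper `build_report` are made up, and nothing is actually sent.

```python
from email.message import EmailMessage


def build_report(sender, recipients, subject, body, attachments):
    """Build (but do not send) a message with a text body and file attachments."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(body)
    for filename, data in attachments:
        # Raw bytes need an explicit MIME type; octet-stream is the generic choice
        msg.add_attachment(data, maintype="application",
                           subtype="octet-stream", filename=filename)
    return msg


msg = build_report(
    "reporter@example.com",
    ["a@example.com", "b@example.com"],
    "Search statistics",
    "See the attached files.",
    [("top100.csv", b"keyword,count\narcgis,3\n")],
)
print(msg["To"])
print([part.get_filename() for part in msg.iter_attachments()])
```

A message built this way can be handed to `smtplib.SMTP.send_message()` instead of calling `sendmail()` with `msg.as_string()`.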

If you want to know how long all of this takes to run, add the following code at the end:

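'''Elapsed time'''
# starttime has to be recorded with starttime = datetime.datetime.now() at the very top of the script
endtime = datetime.datetime.now()
elapsed = int((endtime - starttime).total_seconds())  # .seconds alone would drop whole days
print('------------------------------------------------------' + '\n' + "Total time: " + str(elapsed) + " seconds")
print('Email sent to: %s' % _to.split(','))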
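
One detail worth knowing when timing with `datetime`: a `timedelta` stores days and seconds separately, so its `seconds` attribute is not the whole interval. A short sketch with made-up timestamps:

```python
import datetime

# Two invented moments, one day and five seconds apart
start = datetime.datetime(2017, 2, 8, 18, 0, 0)
end = datetime.datetime(2017, 2, 9, 18, 0, 5)
delta = end - start

print(delta.days)             # the whole days in the interval
print(delta.seconds)          # only the leftover seconds within the last day
print(delta.total_seconds())  # the entire interval expressed in seconds
```

For a script that finishes in seconds or minutes the two values agree, but `total_seconds()` stays correct for long-running jobs too.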

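Last of all, if you are not happy with how the daily chart looks, you can build an HTML page to present it using the C3 JavaScript library, which makes it easy to explore interactively.

Presenting the results in an HTML page

Reading the data: searches per day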