python爬取贴吧文字

1

博主用的是python 3.4,如果有用python 2.x的朋友,可以参考python 3和python 2的不同

其实爬取文字和爬取图片是差不多的,都是在获取到的html中找寻相应的所需文本.

但是爬取文字时候,可能会在html中遇见很多的标签.需要做的就是去除这些标签.

首先,导入:

import urllib.request
import re

这里导入re,是导入正则表达式.

很多语言都有正则表达式,这是一个非常有用的东西.

#url = input('请输入您需要爬取的百度贴吧的地址:')
url = 'http://tieba.baidu.com/p/3975584183'

输入需要爬取的贴吧的地址.其他网站的文字爬取类似.

在这里,我们添加一个模仿浏览器的头,代理IP。当然也可以不添加。

response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
return html

获取到所需的网页。

贴吧截图3

打开审查元素,可以看到。在每一楼的回复前都有一个标志

<div id="post_content_

所以利用正则表达式:

p = re.compile('<div id="post_content_.*?>(.*?)</div>')

这里讲所匹配的正则表达式进行编译,方便之后的匹配。

python的正则表达式元字符,以及其他规则这里暂不做详述。可以参照这篇文章python 正则表达式 概述及常用字符

需要说明的是:这里.*?不在是原有的含义。相当于一个组合用法。搜索 python的贪婪模式与懒惰模式

懒惰模式简而言之,就是在获取到第一个能够匹配的内容就返回,不会检索全部。

这里的括号,表示在re.compile中的内容,会匹配到的内容,返回时只返回括号中的内容.即是我们需要的贴吧楼中的内容.

其实在之前,博主还在想,如何在匹配到内容之后,将<div id…>等去掉,但是之后明白只需要加一个括号就可以了.确实是很方便.

items = p.findall(html)

p.findall方法,返回的是一个列表,所以将得到的所有楼层中的文本,得出,放在一个列表items中.

最后依次打印出来就可以.

但是打印的时候,存在一个问题,那就是尽管处理了文本前后的标签,但是在文本中也可能存在,段落标签<b></b><img>等等.标签.我们需要将其消除.

removeImg = re.compile('<img.*?>')
removeAddr = re.compile('<a.*?>|</a>')
removeLine = re.compile('<tr>|<div>|</div>|<p>|<td>|')
removeClass = re.compile('<div class=".*?>')

这里将,需要消除的标签用正则表达式编译.

然后将标签替换为空.

items = re.sub(removeImg, '', items)
items = re.sub(removeAddr, '', items)
items = re.sub(removeLine, '', items)
items = re.sub(removeClass, '', items)

这里用到了sub函数.

re.sub函数举例:

s = "hello 123 world 456"
replace_s = re.sub('\d+', '222', s)
print(replace_s)

会打印出:’hello 222 world 222′

参考文章:详解Python中re.sub

最后在循环中,打印出items中的被替换过后的内容就可以了.

floor = 1

for item in items:
    item = replace(item)
    with open('tieba.txt', 'a') as f:
        floorline = '这个是第' + str(floor) + '楼---------------------\n'
        enterline = '\n\n'
        print(floorline)
        print(item)
        f.write(floorline)
        f.write(item)
        f.write(enterline)
    floor = floor + 1

这里with open(‘tieba.txt’, ‘a’)用到的是a,而不是w,是因为这里要间断地进行打印,所以是不断向后面追加内容.

之后,在进行代码美化就可以了.

完整代码:

#2015年8月16日 20:46:30
#百度贴吧爬取 正则表达式 网页html爬取
#by imekaku.com

import urllib.request
import re

#url = input('请输入您需要爬取的百度贴吧的地址:')
url = 'http://tieba.baidu.com/p/3975584183'

def openUrl(url):
    #模拟浏览器浏览行为,添加headers

    #获取网页
    response = urllib.request.urlopen(url)
    html = response.read().decode('utf-8')
    return html

#替换不必要的字符
def replace(items):
    removeImg = re.compile('<img.*?>')
    removeAddr = re.compile('<a.*?>|</a>')
    removeLine = re.compile('<tr>|<div>|</div>|<p>|<td>|')
    removeClass = re.compile('<div class=".*?>')
    items = re.sub(removeImg, '', items)
    items = re.sub(removeAddr, '', items)
    items = re.sub(removeLine, '', items)
    items = re.sub(removeClass, '', items)
    items = items.strip()
    return items

html = openUrl(url)
p = re.compile('
<div id="post_content_.*?>(.*?)</div>')
items = p.findall(html)
floor = 1

for item in items:
    item = replace(item)
    with open('tieba.txt', 'a') as f:
        floorline = '这个是第' + str(floor) + '楼---------------------\n'
        enterline = '\n\n'
        print(floorline)
        print(item)
        f.write(floorline)
        f.write(item)
        f.write(enterline)
    floor = floor + 1

 

 

 

 

3 个评论 在 “python爬取贴吧文字

  1. hOur company provides a wide variety of pharmacy. Look at our health contributing website in case you want to feel healthier. Our site offers a wide variety of non prescription drugs. Visit our health website in case you want to feel better with a help health products. Our company offers a wide variety of non prescription products. Look at our health portal in case you want to to improve your health with a help of general health products. Our company provides a wide variety of non prescription products. Take a look at our health site in case you want to to improve your health with a help health products. Our company offers supreme quality general health products. Take a look at our health contributing website in case you want to strengthen your health. Our company provides a wide variety of non prescription products. Look at our health site in case you want to to improve your health with a help of health products.
    Our company provides a wide variety of non prescription drugs. Visit our health website in case you want to look better with a help of generic supplements. Our company provides a wide variety of non prescription drugs. Visit our health website in case you want to feel better with a help of generic supplements. Our company offers a wide variety of non prescription drugs. Take a look at our health portal in case you want to feel better with a help of general health products. Our company provides a wide variety of non prescription drugs. Take a look at our health site in case you want to to feel healthier with a help of general health products. Our company offers a wide variety of non prescription drugs. Look at our health website in case you want to look healthier with a help of generic supplements.

发表回复

您的电子邮箱地址不会被公开。 必填项已用*标注

开始在上面输入您的搜索词,然后按回车进行搜索。按ESC取消。

返回顶部