博主用的是python 3.4,如果有用python 2.x的朋友,可以参考python 3和python 2的不同
其实爬取文字和爬取图片是差不多的,都是在获取到的html中找寻相应的所需文本.
但是爬取文字时候,可能会在html中遇见很多的标签.需要做的就是去除这些标签.
首先,导入:
import urllib.request import re
这里导入re,是导入正则表达式.
很多语言都有正则表达式,这是一个非常有用的东西.
#url = input('请输入您需要爬取的百度贴吧的地址:') url = 'http://tieba.baidu.com/p/3975584183'
输入需要爬取的贴吧的地址.其他网站的文字爬取类似.
在这里,我们添加一个模仿浏览器的头,代理IP。当然也可以不添加。
response = urllib.request.urlopen(url) html = response.read().decode('utf-8') return html
获取到所需的网页。
打开审查元素,可以看到。在每一楼的回复前都有一个标志
<div id="post_content_
所以利用正则表达式:
p = re.compile('<div id="post_content_.*?>(.*?)</div>')
这里讲所匹配的正则表达式进行编译,方便之后的匹配。
python的正则表达式元字符,以及其他规则这里暂不做详述。可以参照这篇文章python 正则表达式 概述及常用字符
需要说明的是:这里.*?不在是原有的含义。相当于一个组合用法。搜索 python的贪婪模式与懒惰模式
懒惰模式简而言之,就是在获取到第一个能够匹配的内容就返回,不会检索全部。
这里的括号,表示在re.compile中的内容,会匹配到的内容,返回时只返回括号中的内容.即是我们需要的贴吧楼中的内容.
其实在之前,博主还在想,如何在匹配到内容之后,将<div id…>等去掉,但是之后明白只需要加一个括号就可以了.确实是很方便.
items = p.findall(html)
p.findall方法,返回的是一个列表,所以将得到的所有楼层中的文本,得出,放在一个列表items中.
最后依次打印出来就可以.
但是打印的时候,存在一个问题,那就是尽管处理了文本前后的标签,但是在文本中也可能存在,段落标签<b></b><img>等等.标签.我们需要将其消除.
removeImg = re.compile('<img.*?>') removeAddr = re.compile('<a.*?>|</a>') removeLine = re.compile('<tr>|<div>|</div>|<p>|<td>|') removeClass = re.compile('<div class=".*?>')
这里将,需要消除的标签用正则表达式编译.
然后将标签替换为空.
items = re.sub(removeImg, '', items) items = re.sub(removeAddr, '', items) items = re.sub(removeLine, '', items) items = re.sub(removeClass, '', items)
这里用到了sub函数.
re.sub函数举例:
s = "hello 123 world 456" replace_s = re.sub('\d+', '222', s) print(replace_s)
会打印出:’hello 222 world 222′
参考文章:详解Python中re.sub
最后在循环中,打印出items中的被替换过后的内容就可以了.
floor = 1 for item in items: item = replace(item) with open('tieba.txt', 'a') as f: floorline = '这个是第' + str(floor) + '楼---------------------\n' enterline = '\n\n' print(floorline) print(item) f.write(floorline) f.write(item) f.write(enterline) floor = floor + 1
这里with open(‘tieba.txt’, ‘a’)用到的是a,而不是w,是因为这里要间断地进行打印,所以是不断向后面追加内容.
之后,在进行代码美化就可以了.
完整代码:
#2015年8月16日 20:46:30 #百度贴吧爬取 正则表达式 网页html爬取 #by imekaku.com import urllib.request import re #url = input('请输入您需要爬取的百度贴吧的地址:') url = 'http://tieba.baidu.com/p/3975584183' def openUrl(url): #模拟浏览器浏览行为,添加headers #获取网页 response = urllib.request.urlopen(url) html = response.read().decode('utf-8') return html #替换不必要的字符 def replace(items): removeImg = re.compile('<img.*?>') removeAddr = re.compile('<a.*?>|</a>') removeLine = re.compile('<tr>|<div>|</div>|<p>|<td>|') removeClass = re.compile('<div class=".*?>') items = re.sub(removeImg, '', items) items = re.sub(removeAddr, '', items) items = re.sub(removeLine, '', items) items = re.sub(removeClass, '', items) items = items.strip() return items html = openUrl(url) p = re.compile(' <div id="post_content_.*?>(.*?)</div>') items = p.findall(html) floor = 1 for item in items: item = replace(item) with open('tieba.txt', 'a') as f: floorline = '这个是第' + str(floor) + '楼---------------------\n' enterline = '\n\n' print(floorline) print(item) f.write(floorline) f.write(item) f.write(enterline) floor = floor + 1
感谢楼主! :oops:
NameError: name ‘item’ is not defined
hOur company provides a wide variety of pharmacy. Look at our health contributing website in case you want to feel healthier. Our site offers a wide variety of non prescription drugs. Visit our health website in case you want to feel better with a help health products. Our company offers a wide variety of non prescription products. Look at our health portal in case you want to to improve your health with a help of general health products. Our company provides a wide variety of non prescription products. Take a look at our health site in case you want to to improve your health with a help health products. Our company offers supreme quality general health products. Take a look at our health contributing website in case you want to strengthen your health. Our company provides a wide variety of non prescription products. Look at our health site in case you want to to improve your health with a help of health products.
Our company provides a wide variety of non prescription drugs. Visit our health website in case you want to look better with a help of generic supplements. Our company provides a wide variety of non prescription drugs. Visit our health website in case you want to feel better with a help of generic supplements. Our company offers a wide variety of non prescription drugs. Take a look at our health portal in case you want to feel better with a help of general health products. Our company provides a wide variety of non prescription drugs. Take a look at our health site in case you want to to feel healthier with a help of general health products. Our company offers a wide variety of non prescription drugs. Look at our health website in case you want to look healthier with a help of generic supplements.