Practical Web Scraping Tips

Published: 2024-12-19 11:33

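I have spent a full week writing crawlers, all of them fairly simple. Here is a short summary; when I have time later I will turn it into a tool.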

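1. Fetching pages. Sites differ, and some use anti-scraping measures, so you have to analyze the target site before you can retrieve the pages you want.

2. Resuming. A crawler will not necessarily download a whole site in one run, so after some analysis of the site it is worth adding the ability to resume an interrupted crawl.

3. Problems encountered while crawling.

1) I parse the pages with bs4. Nodes do not all carry the same attributes, so when one shared method is used to read node attributes, it must be wrapped in try to keep the program from aborting unexpectedly.

2) When writing Python, pay attention to function return values for safety, and check their types in particular.

3) Wrap page fetches in try, and wrap the handling of dynamically typed data in try wherever possible.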
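Points 1) and 2) can be sketched together as one small helper. This is a minimal sketch, not the code used in the crawl: `safe_attr` is a hypothetical name. It relies on two bs4 behaviors: `tag['name']` raises `KeyError` when the attribute is missing, and a failed `find()` returns `None`. Plain dicts stand in for tags so the example runs without bs4 installed.

```python
def safe_attr(node, name, default=''):
    """Return node[name] as a string, or default if the lookup fails.

    node may be a bs4 Tag, a dict, or None (e.g. when find() matched nothing).
    """
    try:
        value = node[name]
    except (KeyError, TypeError):
        # KeyError: the attribute is missing; TypeError: node is None.
        return default
    # bs4 returns a list for multi-valued attributes such as class.
    if isinstance(value, list):
        return ' '.join(value)
    # Guard the return type so callers can always treat the result as str.
    return value if isinstance(value, str) else default


# Hypothetical nodes standing in for parsed <a> tags.
nodes = [{'href': '/a.pdf', 'class': ['btn', 'dl']}, {'title': 'no link'}, None]
hrefs = [safe_attr(n, 'href') for n in nodes]
```

Catching only `KeyError` and `TypeError` keeps the try narrow, so unrelated bugs still surface instead of being silently swallowed.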

4. Key statements used while crawling

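import os
import urllib.request

dir_name = '.'

def download_file(url, file_name):
    # Create the target directory on first use.
    if not os.access(dir_name, os.F_OK):
        os.mkdir(dir_name, 0o777)
    # A slash in the name would be read as a path separator.
    file_name = file_name.replace('/', '-')
    tmp = os.path.join(dir_name, file_name)
    # Skip files that are already on disk, so a rerun does not download them again.
    if os.access(tmp, os.F_OK):
        return
    f = urllib.request.urlopen(url)
    data = f.read()
    with open(tmp, "wb+") as code:
        code.write(data)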
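download_file skips a file that is already on disk, which is one form of resuming. The same idea can be applied to page URLs. The following is a minimal sketch under assumptions, not something taken from the crawl itself: `load_done`, `mark_done`, `crawl`, and the `done_urls.txt` file name are all hypothetical, and `fetch` stands for whatever function does the real work.

```python
import os
import tempfile


def load_done(path):
    """Read the set of URLs finished in earlier runs (empty on the first run)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}


def mark_done(path, url):
    """Append one finished URL so that progress survives a crash."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(url + '\n')


def crawl(urls, fetch, path):
    """Call fetch(url) for every URL not yet recorded; return those fetched."""
    done = load_done(path)
    fetched = []
    for url in urls:
        if url in done:
            continue
        try:
            fetch(url)
        except Exception:
            # Leave the URL unrecorded so the next run retries it.
            continue
        mark_done(path, url)
        done.add(url)
        fetched.append(url)
    return fetched


# Demo with a stand-in fetch function and a temporary progress file.
progress = os.path.join(tempfile.mkdtemp(), 'done_urls.txt')
first = crawl(['u1', 'u2'], lambda u: None, progress)
second = crawl(['u1', 'u2', 'u3'], lambda u: None, progress)
```

The second run only fetches `u3`, because `u1` and `u2` were recorded by the first run.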

import logging
import random
import urllib.request

logger = logging.getLogger(__name__)

def get_html(url):
    # Proxy address prefixes; the last octet and the port are picked at random below.
    iplist = ['121.193.143.', '112.126.65.', '122.96.59.', '115.29.98.', '117.131.216.', '116.226.243.', '101.81.22.']
    html = ''
    try:
        http_request_ip = (random.choice(iplist) + str(random.randint(3, 215)) + ':' + str(random.randint(1024, 65535)))
        print(http_request_ip)
        proxy_support = urllib.request.ProxyHandler({'http': http_request_ip})
        opener = urllib.request.build_opener(proxy_support)
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36')]
        # Use the opener directly instead of install_opener(), so the proxy
        # does not leak into every later urllib.request.urlopen() call.
        response = opener.open(url, timeout=3)
        html = response.read().decode('utf-8')
    except Exception as e:
        logger.debug('[' + str(e) + ']' + url)
        print('Request failed: ' + url)
    # An empty string tells the caller that the fetch failed.
    return html

def get_html_method2(url):
    # Try up to 5 times without a proxy; return '' if every attempt fails.
    for _ in range(5):
        try:
            response = urllib.request.urlopen(url, timeout=3)
            return response.read().decode('utf-8')
        except Exception as e:
            logger.debug('[' + str(e) + ']' + url)
    print('All 5 attempts failed: ' + url)
    return ''
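get_html_method2 retries immediately after a failure. A variant that waits between attempts can be sketched as follows. The names `retry`, `attempts`, and `base_delay` are hypothetical, and the doubling delay is an assumption, not something the crawl used. The function to call is passed in, so the demo runs without a network.

```python
import time


def retry(func, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out.

    Waits base_delay, then 2x, 4x, ... between failures.
    Returns func()'s result, or None if every attempt raised.
    """
    for n in range(attempts):
        try:
            return func()
        except Exception:
            # No point in waiting after the final attempt.
            if n < attempts - 1:
                sleep(base_delay * 2 ** n)
    return None


# Demo: a stand-in function that fails twice, then succeeds.
calls = {'count': 0}
waits = []


def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('timed out')
    return '<html></html>'


# Record the delays instead of actually sleeping.
result = retry(flaky, sleep=waits.append)
```

With a real fetch this could be called as `retry(lambda: urllib.request.urlopen(url, timeout=3).read().decode('utf-8'))`; the caller then has to check for `None`.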

# The statements below are fragments: page_url, table_item and a_href
# are defined in parts of the script that are not shown here.
html_page = get_html_method2(page_url)
html_soup = bs4.BeautifulSoup(html_page, 'lxml')
item_name = html_soup.select('ul[class="List_list font14"]')
table_thead_item = table_item[0].select('thead')
matchObj = re.search(r'\/uploadfiles\/\d{6}\/\d{2}\/\d{22}\.(\w{1,5})$', a_href, re.M|re.I)
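The regular expression above expects a link ending in `/uploadfiles/`, a six-digit directory, a two-digit directory, a 22-digit file name, and an extension of one to five word characters, which it captures. `re.search` returns `None` when nothing matches, so the result has to be checked before `group(1)` is called. A minimal sketch follows; `file_extension` is a hypothetical helper, and the sample links are made up.

```python
import re

# Same pattern as above, without the unneeded backslashes before '/'.
UPLOAD_RE = re.compile(r'/uploadfiles/\d{6}/\d{2}/\d{22}\.(\w{1,5})$', re.M | re.I)


def file_extension(a_href):
    """Return the extension of an upload link, or None if it does not match."""
    match = UPLOAD_RE.search(a_href)
    return match.group(1) if match else None


# Hypothetical links; the 22-digit file name is made up.
name = '0123456789' * 2 + '01'
good = file_extension('http://example.com/uploadfiles/202412/19/' + name + '.pdf')
bad = file_extension('/uploadfiles/2024/19/' + name + '.pdf')
```

The second link fails because its first directory has four digits instead of six.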

import csv

# fieldnames and list_data are built earlier in the script (not shown here).
cvs_filename = dir_name + '_table.csv'
# Write the header row only when the file is being created.
need_header = not os.access(cvs_filename, os.F_OK)
csvfile = open(cvs_filename, 'a+', newline='')
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
if need_header:
    writer.writeheader()
writer.writerow({fieldnames[0]: list_data[0]})
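The CSV statements above can be wrapped into one function that also closes the file. This is a sketch under assumptions: `append_rows` is a hypothetical name, and the column names and rows in the demo are made up.

```python
import csv
import os
import tempfile


def append_rows(path, fieldnames, rows):
    """Append dict rows to a CSV file, writing the header only once."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' lets the csv module control line endings itself.
    with open(path, 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# Demo in a temporary directory with made-up column names and values.
path = os.path.join(tempfile.mkdtemp(), 'demo_table.csv')
fields = ['name', 'url']
append_rows(path, fields, [{'name': 'a', 'url': '/a'}])
append_rows(path, fields, [{'name': 'b', 'url': '/b'}])
with open(path, newline='', encoding='utf-8') as f:
    lines = f.read().splitlines()
```

Calling it twice produces one header followed by both rows, which is what a resumed crawl needs when it appends to the output of an earlier run.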

Source: Practical Web Scraping Tips https://www.yuejiaxmz.com/news/view/517659
