
MCM/ICM Retrospective (1) · The Web Crawler

SPIDER!

Not long ago I had only just picked up the basics of web scraping, following tutorials to crawl a hospital site and a job-posting site. I never expected to pull off a crawl of this scale (1.7 million Chinese characters) during the MCM/ICM!! Spider haunts at night!!

The Problem

Problem F: Reducing Illegal Wildlife Trade

Protect wildlife!

Illegal wildlife trade negatively impacts our environment and threatens global biodiversity. It is estimated to involve up to 26.5 billion US dollars per year and is considered to be the fourth largest of all global illegal trades. [1]

You are to develop a data-driven 5-year project designed to make a notable reduction in illegal wildlife trade. Your goal is to convince a client to carry out your project. To do this, you must select both a client and an appropriate project for that client.
Your work should explore the following sub-questions:
● Who is your client? What can that client realistically do? (In other words, your client should have the powers, resources, and interest needed to enact the project you propose.)
● Explain why the project you developed is suitable for this client. What research, from published literature and from your own analyses, supports the selection of your proposed project? Using a data-driven analysis, how will you convince your client that this is a project they should undertake?
● What additional powers and resources will your client need to carry out the project? (Remember to use assumptions, but also ground your work in reality as much as you are able.)
● If the project is carried out what will happen? In other words, what will the measurable impact on illegal wildlife trade be? What analysis did you do to determine this?
● How likely is the project to reach the expected goal? Also, based on a contextualized sensitivity analysis, are there conditions or events that may disproportionately aid or harm the project’s ability to reach its goal?
While you could limit your approach to illegal wildlife trade, you may also consider illegal wildlife trade as part of a larger complex system. Specifically, you could consider how other global efforts in other domains, e.g., efforts to curtail other forms of trafficking or efforts to reduce climate change coupled with efforts to curtail illegal wildlife trade, may be part of a complex system. This may create synergistic opportunities for unexpected actors in this domain.
If you choose to leverage a complexity framework in your solution, be sure to justify your choice by discussing the benefits and drawbacks of this modeling decision.

Additionally, your team must submit a 1-page memo with key points for your client, highlighting your 5-year project proposal and why the project is right for them as a client (e.g., access to resources, part of their mandate, aligns with their mission statement, etc.).

Goal

After researching candidates at the international level, we settled on China as our final client. The goal then became finding and analyzing data on China's illegal wildlife trade, which can be obtained from sources such as published court judgments on illegal-trade cases.

Spider!

Import the required libraries

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import re
from urllib.parse import urljoin
import time
import random

Locating the pages

driver = webdriver.Edge()
base_url = 'https://www.cxxxxxxxt.org'
# the target site
search_url = '/article/essearch/keyword/%E6%BF%92%E5%8D%B1%E9%87%8E%E7%94%9F%E5%8A%A8%E7%89%A9/ticket/tr035hZedznczbY7Ag34pg1wQntELscZW2dBZKhpuNgKwxD1WX7e4Kig_5yZvIrabqf3ozOiJGrYsZT8-ZeJtjykVi42K8IMCfTc5eJKE0UgmzoUqX-zb4oO4A%2A%2A/appid/2056206409/randstr/%40WT4/content_time_publish_begin/2016-01-01/content_time_publish_end/2024-02-03/page/'
# search URL keyed by page number -- this becomes the template for paging through the results later
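webdriver.Edge() opens a visible browser window. Purely as an optional sketch of my own (it assumes Selenium 4.6+, where Selenium Manager fetches the matching msedgedriver automatically), the crawl can also run headless:

from selenium.webdriver.edge.options import Options

# Hypothetical alternative setup: run Edge without a visible window
options = Options()
options.add_argument('--headless=new')  # Chromium-style headless flag; Edge is Chromium-based
driver = webdriver.Edge(options=options)

Everything below works the same either way; only the driver construction changes.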
page_url = f'{base_url}{search_url}{1}.shtml'
# point at the first results page

driver.get(page_url)
html_content = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html_content, 'lxml')
pagination_div = soup.find('div', class_='paginationControl')
# fetch the rendered page source and locate the pagination block
last_page_link = soup.find('a', text='尾页')
# locate the "last page" link (the anchor text 尾页 means "last page")
if last_page_link:
    last_page_url = last_page_link['href']
    # extract the page number from the last-page URL
    max_page = int(last_page_url.split('/')[-1].split('.')[0])

    print(f"{max_page} pages in total")
else:
    print("No page-count information found")

At this point we can walk through every results page returned by the search, and every article on each page.

Crawling

from docx import Document
doc = Document()
# the Word document we will write everything into

std = 0
# counter for the number of articles visited

for page_number in range(1, max_page + 1):  # walk from the first page to the last
    page_url = f'{base_url}{search_url}{page_number}.shtml'
    driver.get(page_url)
    # load this results page

    html_content = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(html_content, 'lxml')

    # find all the links sitting under <dt> tags
    dt_tags = soup.find_all('dt')
    # study the page structure carefully in the browser's F12 developer tools

    # visit each link in turn
    for dt_tag in dt_tags:
        std = std + 1
        a_tag = dt_tag.find('a')
        if a_tag:
            relative_link = a_tag.get('href')

            # convert the relative link into an absolute one -- the article URL
            absolute_link = urljoin(base_url, relative_link)

            # open the article and grab its page source
            driver.get(absolute_link)
            html_content1 = driver.page_source.encode('utf-8')
            soup1 = BeautifulSoup(html_content1, 'lxml')

            # extract the title
            title = soup1.find('div', class_='detail_bigtitle')
            title_text = title.get_text(strip=True) if title else "No title found"

            # extract the publication time
            try:
                time_span = soup1.find('div', class_='detail_thr').find('span', class_='time')
                timeo = time_span.get_text(strip=True)
                # named timeo so it does not shadow the time module
            except AttributeError:
                # some articles have expired and carry no time element; use try to skip past them
                print("no time now")
                timeo = "No time found"

            # extract the body text
            detail_txt_div = soup1.find('div', class_='detail_txt')
            if detail_txt_div:  # an if check is another way to guard against missing elements
                paragraphs = detail_txt_div.find_all('p')  # each paragraph sits in its own <p> tag
                link_text = '\n'.join([p.get_text(strip=True) for p in paragraphs])

                # write into the open Word document
                doc.add_paragraph(f"Title: {title_text}")
                doc.add_paragraph(f"Time: {timeo}")
                doc.add_paragraph(f"Link: {absolute_link}")
                doc.add_paragraph(f"Text:\n{link_text}")
                doc.add_paragraph('-' * 30)

            random_sleep_time = random.uniform(1, 2)
            # pause for a random interval between requests
            time.sleep(random_sleep_time)
            print(f'Crawled {std} articles so far; on page {page_number}, {max_page - page_number} pages left')

# save the Word document
doc.save('output.docx')

# close the browser
driver.quit()

Done!

1.7 million characters across 1,400+ articles!!!
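A quick sanity check on those numbers (a small sketch of my own, assuming the "Title: " prefix convention used when writing output.docx above):

from docx import Document

# Hypothetical post-hoc check: count characters and articles in the finished file
doc = Document('output.docx')
total_chars = sum(len(p.text) for p in doc.paragraphs)
article_count = sum(1 for p in doc.paragraphs if p.text.startswith('Title: '))
print(f'{total_chars} characters across {article_count} articles')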

------------- THE END -------------