梦想橡皮擦
Python 爬虫120例
Commit 1548bd3d
Authored on Sep 03, 2021
Author: 梦想橡皮擦
Commit message: agency (中介) website data
Parent: 56a73dac
Showing 1 changed file with 96 additions and 0 deletions
NO28/中介网站排名数据.py (new file, mode 0 → 100644, +96 -0)
from queue import Queue
import time
import threading
import requests
from lxml import etree
import random
import re
def get_headers():
    uas = [
        "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        "Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)",
        "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; Googlebot-Image/1.0; +http://www.google.com/bot.html)",
        "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
        "Sogou News Spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
        "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
        "Sosospider+(+http://help.soso.com/webspider.htm)",
        "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"
    ]
    ua = random.choice(uas)
    headers = {
        "user-agent": ua
    }
    return headers
def get_total_page():
    res = requests.get('https://www.zhongjie.com/top/rank_all_1.html', headers=get_headers(), timeout=5)
    element = etree.HTML(res.text)
    # href of the "last page" link; the page count is the number embedded in it
    last_page = element.xpath("//a[@class='weiye']/@href")[0]
    # Raw string so that \d is not treated as an invalid string escape
    pattern = re.compile(r'(\d+)')
    page = pattern.search(last_page)
    return int(page.group(1))
# Producer
def producer():
    while True:
        # Take one listing-page URL from the queue; None ends the thread
        url = urls.get()
        urls.task_done()
        if url is None:
            break
        res = requests.get(url=url, headers=get_headers(), timeout=5)
        text = res.text
        element = etree.HTML(text)
        # Collect the detail-page links and queue them for the consumers
        links = element.xpath('//a[@class="copyright_title"]/@href')
        for i in links:
            wait_list_urls.put("https://www.zhongjie.com" + i)
# Consumer
def consumer():
    while True:
        # Take one detail-page URL from the queue; None ends the thread
        url = wait_list_urls.get()
        wait_list_urls.task_done()
        if url is None:
            break
        res = requests.get(url=url, headers=get_headers(), timeout=5)
        text = res.text
        element = etree.HTML(text)
        # Each xpath() call returns a list of matching text nodes
        title = element.xpath('//div[@class="info-head-l"]/h1/text()')
        link = element.xpath('//div[@class="info-head-l"]/p[1]/a/text()')
        description = element.xpath('//div[@class="info-head-l"]/p[2]/text()')
        print(title, link, description)
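if __name__ == "__main__":
    # Queue of listing-page URLs (maxsize=0 means unbounded)
    urls = Queue(maxsize=0)
    last_page = get_total_page()
    for p in range(1, last_page + 1):
        urls.put(f"https://www.zhongjie.com/top/rank_all_{p}.html")
    # Queue of detail-page URLs waiting to be scraped
    wait_list_urls = Queue(maxsize=0)
    # Start 2 producer threads
    producers = [threading.Thread(target=producer) for _ in range(2)]
    for t in producers:
        urls.put(None)  # one None sentinel per producer so its loop can exit
        t.start()
    # Start 2 consumer threads
    consumers = [threading.Thread(target=consumer) for _ in range(2)]
    for t in consumers:
        t.start()
    # Once every producer has finished, send one None sentinel per consumer
    for t in producers:
        t.join()
    for _ in consumers:
        wait_list_urls.put(None)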
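The script in this commit connects two `Queue` objects with producer and consumer threads, and both loops exit when they read `None`. The following is a minimal, network-free sketch of that two-stage pattern with an explicit shutdown, not the author's implementation. The names `run_pipeline` and `fake_fetch_links`, the explicit queue arguments, and the `results` list are assumptions made for illustration; a real crawler would replace `fake_fetch_links` with the `requests` + `lxml` calls.

```python
import threading
from queue import Queue


def producer(pages, details, fetch_links):
    """Turn each listing page into detail links until a None sentinel arrives."""
    while True:
        page = pages.get()
        if page is None:
            break
        for link in fetch_links(page):
            details.put(link)


def consumer(details, results, lock):
    """Record each detail link until a None sentinel arrives."""
    while True:
        link = details.get()
        if link is None:
            break
        with lock:  # the results list is shared between consumer threads
            results.append(link)


def run_pipeline(page_ids, fetch_links, n_producers=2, n_consumers=2):
    """Hypothetical driver: run both stages and return the sorted results."""
    pages, details = Queue(), Queue()
    results, lock = [], threading.Lock()

    for page in page_ids:
        pages.put(page)
    for _ in range(n_producers):
        pages.put(None)  # one sentinel per producer thread

    producers = [
        threading.Thread(target=producer, args=(pages, details, fetch_links))
        for _ in range(n_producers)
    ]
    consumers = [
        threading.Thread(target=consumer, args=(details, results, lock))
        for _ in range(n_consumers)
    ]
    for t in producers + consumers:
        t.start()

    # No more detail links can appear once every producer has exited,
    # so only then is it safe to send the consumer sentinels.
    for t in producers:
        t.join()
    for _ in range(n_consumers):
        details.put(None)
    for t in consumers:
        t.join()
    return sorted(results)


def fake_fetch_links(page):
    """Stand-in for the HTTP request and XPath query (assumption)."""
    return [f"/detail/{page}-{i}" for i in range(2)]


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3], fake_fetch_links))
```

Sending the consumer sentinels only after the producers have been joined keeps a consumer from exiting while links are still being queued.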