茵可露露 / Python 爬虫120例 (Python Crawler: 120 Examples)
Forked from 梦想橡皮擦 / Python 爬虫120例 (in sync with the fork source)
Commit 8690fb95
Authored Oct 21, 2021 by 梦想橡皮擦
Case 41, the last coroutine article
Parent: 1ad7750a
Showing 5 changed files with 153 additions and 4 deletions
NO41/Semaphore 在协程中的应用.py  +38 -0
NO41/Semaphore 控制信号量.py  +42 -0
NO41/TCPConnector 限制连接数.py  +31 -0
NO41/普通多线程.py  +38 -0
README.md  +4 -4
NO41/Semaphore 在协程中的应用.py
0 → 100644

import time
import asyncio
import aiohttp
from bs4 import BeautifulSoup


async def get_title(semaphore, url):
    # Only as many coroutines as the semaphore allows can be inside this block at once.
    async with semaphore:
        print("Fetching:", url)
        async with aiohttp.request('GET', url) as res:
            html = await res.text()
            soup = BeautifulSoup(html, 'html.parser')
            title_tags = soup.find_all(attrs={'class': 'item-title'})
            event_names = [item.a.text for item in title_tags]
            print(event_names)


async def main():
    semaphore = asyncio.Semaphore(10)  # Allow at most 10 coroutines to run at the same time
    tasks = [
        asyncio.ensure_future(get_title(semaphore, "http://www.lishiju.net/hotevents/p{}".format(i)))
        for i in range(111)
    ]
    dones, pendings = await asyncio.wait(tasks)
    # for task in dones:
    #     print(len(task.result()))


if __name__ == '__main__':
    start_time = time.perf_counter()
    asyncio.run(main())
    print("Elapsed time:", time.perf_counter() - start_time)
    # # Create the event loop.
    # event_loop = asyncio.get_event_loop()
    # # Start the event loop and wait for the coroutine main() to finish.
    # event_loop.run_until_complete(main())
    # # Elapsed time: 2.227831242
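The script above needs network access and the third-party aiohttp and bs4 packages, so the effect of the semaphore is hard to observe directly. The following is a minimal offline sketch of the same pattern, assuming `asyncio.sleep` as a stand-in for the HTTP request. The names `worker`, `run_limited`, and the `state` dictionary are hypothetical and are not part of the commit. It records the highest number of coroutines that were inside the `async with semaphore` block at the same moment.

```python
import asyncio


async def worker(semaphore, state, results, i):
    # Same guard as get_title(): wait here until the semaphore has a free slot.
    async with semaphore:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        # Assumption: a short sleep stands in for the real aiohttp request.
        await asyncio.sleep(0.01)
        state["running"] -= 1
        results.append(i)


async def run_limited(total, limit):
    # Create the semaphore inside the running event loop, as main() does above.
    semaphore = asyncio.Semaphore(limit)
    state = {"running": 0, "peak": 0}
    results = []
    await asyncio.gather(*(worker(semaphore, state, results, i) for i in range(total)))
    return state["peak"], len(results)


if __name__ == "__main__":
    peak, finished = asyncio.run(run_limited(total=30, limit=10))
    print("peak concurrency:", peak, "finished:", finished)
```

Because all coroutines run on one event-loop thread, the counters need no lock. The peak should not exceed the `limit` passed to `asyncio.Semaphore`.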
NO41/Semaphore 控制信号量.py
0 → 100644

import threading
import time
import requests
from bs4 import BeautifulSoup


class MyThread(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.__url = url

    def run(self):
        # Entering the block decrements the counter; leaving it increments
        # the counter again, even if the request raises an exception.
        with semaphore:
            print("Fetching:", self.__url)
            res = requests.get(url=self.__url)
            soup = BeautifulSoup(res.text, 'html.parser')
            title_tags = soup.find_all(attrs={'class': 'item-title'})
            event_names = [item.a.text for item in title_tags]
            print(event_names)
            print("")


if __name__ == "__main__":
    semaphore = threading.Semaphore(5)  # Allow at most 5 threads to run at the same time
    start_time = time.perf_counter()
    threads = []
    for i in range(111):  # range(111) creates 111 threads (p0 through p110).
        threads.append(MyThread(url="http://www.lishiju.net/hotevents/p{}".format(i)))
    for t in threads:
        t.start()  # Start every thread.
    for t in threads:
        t.join()  # Wait for the threads to finish.

    print("Total time:", time.perf_counter() - start_time)
    # Total time: 2.8005530640000003
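To check that `threading.Semaphore` really caps the number of threads working at once, the pattern can be reproduced without any network access. This is a minimal sketch, assuming `time.sleep` as a stand-in for `requests.get`. The function name `run_limited_threads` and the `state` dictionary are hypothetical and do not appear in the commit.

```python
import threading
import time


def run_limited_threads(total, limit):
    semaphore = threading.Semaphore(limit)
    lock = threading.Lock()  # Protects the shared counters below.
    state = {"running": 0, "peak": 0}

    def worker():
        # Blocks until fewer than `limit` threads are inside this block.
        with semaphore:
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            # Assumption: a short sleep stands in for the real HTTP request.
            time.sleep(0.01)
            with lock:
                state["running"] -= 1

    threads = [threading.Thread(target=worker) for _ in range(total)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


if __name__ == "__main__":
    print("peak concurrency:", run_limited_threads(total=20, limit=5))
```

All 20 threads are still created and started, as in the script above. The semaphore only limits how many of them are past the `with semaphore` line at any moment.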
NO41/TCPConnector 限制连接数.py
0 → 100644

import time
import asyncio
import aiohttp
from bs4 import BeautifulSoup


async def get_title(session, url):
    async with session.get(url) as res:
        print("Fetching:", url)
        html = await res.text()
        soup = BeautifulSoup(html, 'html.parser')
        title_tags = soup.find_all(attrs={'class': 'item-title'})
        event_names = [item.a.text for item in title_tags]
        print(event_names)


async def main():
    connector = aiohttp.TCPConnector(limit=1)  # Limit the number of simultaneous connections
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [
            asyncio.ensure_future(get_title(session, "http://www.lishiju.net/hotevents/p{}".format(i)))
            for i in range(111)
        ]
        await asyncio.wait(tasks)


if __name__ == '__main__':
    start_time = time.perf_counter()
    asyncio.run(main())
    print("Elapsed time:", time.perf_counter() - start_time)
NO41/普通多线程.py
0 → 100644

import threading
import time
import requests
from bs4 import BeautifulSoup


class MyThread(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.__url = url

    def run(self):
        print("Fetching:", self.__url)
        res = requests.get(url=self.__url)
        soup = BeautifulSoup(res.text, 'html.parser')
        title_tags = soup.find_all(attrs={'class': 'item-title'})
        event_names = [item.a.text for item in title_tags]
        print(event_names)
        print("")


if __name__ == "__main__":
    start_time = time.perf_counter()
    threads = []
    for i in range(111):  # range(111) creates 111 threads (p0 through p110).
        threads.append(MyThread(url="http://www.lishiju.net/hotevents/p{}".format(i)))
    for t in threads:
        t.start()  # Start every thread.
    for t in threads:
        t.join()  # Wait for the threads to finish.

    print("Total time:", time.perf_counter() - start_time)
    # Total time: 1.537718624
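The script above starts one thread per URL with no upper bound. One standard-library alternative, which this commit does not use, is `concurrent.futures.ThreadPoolExecutor`, where `max_workers` caps the number of threads that exist at all. The sketch below is offline and illustrative only: `fake_fetch` and `fetch_all` are hypothetical names, and `time.sleep` stands in for `requests.get`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    # Assumption: a short sleep stands in for requests.get(url).
    time.sleep(0.01)
    # Return something derived from the input so results can be checked.
    return len(url)


def fetch_all(urls, max_workers=5):
    # The pool never runs more than max_workers threads at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return list(pool.map(fake_fetch, urls))


if __name__ == "__main__":
    urls = ["http://www.lishiju.net/hotevents/p{}".format(i) for i in range(20)]
    print(fetch_all(urls))
```

Unlike the semaphore version, which creates every thread up front and makes most of them wait, the pool reuses a fixed set of worker threads.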
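README.md
@@ -79,11 +79,11 @@
 37. [A must-know topic for Python crawler enthusiasts, "coroutine crawlers": using gevent to collect girls' avatar images](https://dream.blog.csdn.net/article/details/120421824)
 38. [Can't get the hang of Python coroutines? Not a chance: learn coroutines while collecting Coser images!](https://dream.blog.csdn.net/article/details/120445004)
-39. 中少绘本 MP4 video collection, asyncio coroutines part 3
-40. Bensound site MP3 collection, asyncio + aiohttp coroutines part 4
-41. 历史剧网 collection, coroutine concurrency control
+39. [Have you become a "dad programmer" yet? Use Python to download 200+ picture-book animations for your kid, coroutines study round 3](https://dream.blog.csdn.net/article/details/120463479)
+40. [Python coroutines lesson 4: the target data is MP3, the target site is bensound.com](https://dream.blog.csdn.net/article/details/120507981)
+41. [One more Python coroutine topic: controlling concurrency, a must-have skill for Python data collection](https://dream.blog.csdn.net/article/details/120879805)
 ### 📘 Learning the scrapy library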