梦想橡皮擦 / Python 爬虫120例
Commit fb2f7913
Authored June 29, 2021 by hjCodeCloud
Tencent Comics crawler code (腾讯动漫爬虫代码)
Parent: 458eb93d
Changes: 1
Showing 1 changed file with 75 additions and 0 deletions

NO10/腾讯动漫爬虫代码.py (new file, mode 0 → 100644)
import re
import threading

import requests
from fake_useragent import UserAgent


def replace_mark(my_str):
    # Swap ASCII commas and double quotes for full-width equivalents so that
    # field values cannot break the hand-built CSV line.
    return my_str.replace(",", "，").replace('"', "“")


def format_html(html):
    # Each comic is one <li class="ret-search-item clearfix"> block.
    li_pattern = re.compile(r'<li\sclass="ret-search-item clearfix">[\s\S]+?</li>')
    title_url_pattern = re.compile(r'<a href="(.*?)" target="_blank" title=".*?">(.*?)</a>')
    sign_pattern = re.compile(r'<i class="ui-icon-sign">签约</i>')
    exclusive_pattern = re.compile(r'<i class="ui-icon-exclusive">独家</i>')
    author_pattern = re.compile(r'<p class="ret-works-author" title=".*?">(.*?)</p>')
    tags_pattern = re.compile(r'<span href=".*?" target="_blank">(.*?)</span>')
    score_pattern = re.compile(r'<span>人气:<em>(.*?)</em></span>')

    items = li_pattern.findall(html)
    for item in items:
        title_url = title_url_pattern.search(item)
        title = title_url.group(2)
        url = title_url.group(1)
        # 1 if the "signed" (签约) / "exclusive" (独家) badge is present, else 0.
        sign = 1 if sign_pattern.search(item) is not None else 0
        exclusive = 1 if exclusive_pattern.search(item) is not None else 0
        author = author_pattern.search(item).group(1)
        tags = tags_pattern.findall(item)
        score = score_pattern.search(item).group(1)

        # Serialize writes: several threads append to the same file.
        with lock:
            with open("./qq.csv", "a+", encoding="utf-8") as f:
                f.write(f'{replace_mark(title)},{url},{sign},{exclusive},{replace_mark(author)},{"#".join(tags)},"{replace_mark(score)}"\n')


def run(index):
    try:
        ua = UserAgent(use_cache_server=False)
        response = requests.get(f"https://ac.qq.com/Comic/index/page/{index}", headers={'User-Agent': ua.random})
        html = response.text
        format_html(html)
    finally:
        # Always free the slot, even if the request or parsing fails;
        # otherwise the main loop would block forever on semaphore.acquire().
        semaphore.release()


lock = threading.Lock()

if __name__ == "__main__":
    # At most 5 pages are fetched concurrently.
    semaphore = threading.BoundedSemaphore(5)
    lst_record_threads = []
    for index in range(1, 462):
        print(f"Fetching page {index}")
        semaphore.acquire()
        t = threading.Thread(target=run, args=(index,))
        t.start()
        lst_record_threads.append(t)

    for rt in lst_record_threads:
        rt.join()

    print("Crawl finished")
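The committed file builds each CSV line by hand and relies on replace_mark to keep commas and quotes out of the fields. As a possible alternative, the standard-library csv module can quote such fields instead of rewriting them. The sketch below is only an illustration and is not part of the commit: append_row, write_lock, demo, and the sample row (including the placeholder path) are hypothetical names and values chosen for this example.

```python
import csv
import io
import threading

# Hypothetical lock; it plays the same role as `lock` in the committed file.
write_lock = threading.Lock()


def append_row(stream, row):
    """Write one record to an open text stream as a quoted CSV line."""
    with write_lock:
        # csv.writer quotes fields containing commas or double quotes,
        # so no character substitution is needed.
        csv.writer(stream, lineterminator="\n").writerow(row)


def demo():
    # An in-memory buffer stands in for qq.csv so the sketch has no side effects.
    buffer = io.StringIO()
    sample = ['He said "hi", twice', "/example/path", 1, 0, "author", "tag1#tag2", "1,234"]
    append_row(buffer, sample)
    return buffer.getvalue()
```

Reading the result back with csv.reader should recover the original field values, with the integers returned as strings. With a real file, open it with newline="" as the csv documentation recommends.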