Skip to content
体验新版
项目
组织
正在加载...
登录
切换导航
打开侧边栏
茵可露露
Python 爬虫120例
提交
bf1520ba
Python 爬虫120例
项目概览
茵可露露
/
Python 爬虫120例
与 Fork 源项目一致
Fork自
梦想橡皮擦 / Python 爬虫120例
通知
1
Star
0
Fork
0
代码
文件
提交
分支
Tags
贡献者
分支图
Diff
Issue
0
列表
看板
标记
里程碑
合并请求
0
DevOps
流水线
流水线任务
计划
Wiki
0
Wiki
分析
仓库
DevOps
项目成员
Pages
Python 爬虫120例
项目概览
项目概览
详情
发布
仓库
仓库
文件
提交
分支
标签
贡献者
分支图
比较
Issue
0
Issue
0
列表
看板
标记
里程碑
合并请求
0
合并请求
0
Pages
DevOps
DevOps
流水线
流水线任务
计划
分析
分析
仓库分析
DevOps
Wiki
0
Wiki
成员
成员
收起侧边栏
关闭侧边栏
动态
分支图
创建新Issue
流水线任务
提交
Issue看板
提交
bf1520ba
编写于
8月 31, 2021
作者:
梦想橡皮擦
💬
浏览文件
操作
浏览文件
下载
电子邮件补丁
差异文件
懒人畅听案例补充
上级
737404a1
变更
4
展开全部
显示空白变更内容
内联
并排
Showing
4 changed file
with
64 addition
and
72 deletion
+64
-72
NO23/懒人畅听网数据.py
NO23/懒人畅听网数据.py
+52
-0
NO23/虎牙/0.json
NO23/虎牙/0.json
+0
-0
NO23/虎牙直播数据.py
NO23/虎牙直播数据.py
+0
-72
README.md
README.md
+12
-0
未找到文件。
NO23/懒人畅听网数据.py
0 → 100644
浏览文件 @
bf1520ba
import
threading
from
threading
import
Lock
,
Thread
import
random
,
requests
from
lxml
import
etree
def
get_headers
():
uas
=
[
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
,
]
ua
=
random
.
choice
(
uas
)
headers
=
{
"user-agent"
:
ua
,
"referer"
:
"https://www.baidu.com/"
}
return
headers
def
run
(
url
,
semaphore
):
headers
=
get_headers
()
semaphore
.
acquire
()
#加锁
res
=
requests
.
get
(
url
,
headers
=
headers
,
timeout
=
5
)
if
res
:
text
=
res
.
text
element
=
etree
.
HTML
(
text
)
titles
=
element
.
xpath
(
'//a[@class="book-item-name"]/text()'
)
authors
=
element
.
xpath
(
'//a[@class="author"]/text()'
)
weakens
=
element
.
xpath
(
'//a[@class="g-user-shutdown"]/text()'
)
save
(
url
,
titles
,
authors
,
weakens
)
semaphore
.
release
()
#释放
def
save
(
url
,
titles
,
authors
,
weakens
):
data_list
=
zip
(
titles
,
authors
,
weakens
)
for
item
in
data_list
:
with
open
(
"./data.csv"
,
"a+"
,
encoding
=
"utf-8"
)
as
f
:
f
.
write
(
f
"
{
item
[
0
]
}
,
{
item
[
1
]
}
,
{
item
[
2
]
}
\n
"
)
print
(
url
,
"该URL地址数据写入完毕"
)
if
__name__
==
'__main__'
:
lock
=
Lock
()
url_format
=
'https://www.lrts.me/book/category/1/recommend/{}/20'
# 拼接URL,全局共享变量
urls
=
[
url_format
.
format
(
i
)
for
i
in
range
(
1
,
1372
)]
l
=
[]
semaphore
=
threading
.
BoundedSemaphore
(
5
)
# 最多允许5个线程同时运行
for
url
in
urls
:
t
=
threading
.
Thread
(
target
=
run
,
args
=
(
url
,
semaphore
))
t
.
start
()
while
threading
.
active_count
()
!=
1
:
pass
else
:
print
(
'所有线程运行完毕'
)
NO23/虎牙/0.json
已删除
100644 → 0
浏览文件 @
737404a1
此差异已折叠。
点击以展开。
NO23/虎牙直播数据.py
已删除
100644 → 0
浏览文件 @
737404a1
import
threading
import
requests
import
random
class
Common
:
def
__init__
(
self
):
pass
def
get_headers
(
self
):
uas
=
[
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
,
"Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)"
,
"Baiduspider-image+(+http://www.baidu.com/search/spider.htm)"
,
"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36"
,
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
,
"Mozilla/5.0 (compatible; Googlebot-Image/1.0; +http://www.google.com/bot.html)"
,
"Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
,
"Sogou News Spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
,
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);"
,
"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
,
"Sosospider+(+http://help.soso.com/webspider.htm)"
,
"Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"
]
ua
=
random
.
choice
(
uas
)
headers
=
{
"user-agent"
:
ua
,
"referer"
:
"https://www.baidu.com"
}
return
headers
def
run
(
index
,
url
,
semaphore
,
headers
):
semaphore
.
acquire
()
# 加锁
res
=
requests
.
get
(
url
,
headers
=
headers
,
timeout
=
5
)
res
.
encoding
=
'utf-8'
text
=
res
.
text
text
=
text
.
replace
(
'getLiveListJsonpCallback('
,
''
)
text
=
text
[:
-
1
]
# print(text)
# json_data = json.loads(text)
# print(json_data)
save
(
index
,
text
)
semaphore
.
release
()
# 释放
def
save
(
index
,
text
):
with
open
(
f
"./虎牙/
{
index
}
.json"
,
"w"
,
encoding
=
"utf-8"
)
as
f
:
f
.
write
(
f
"
{
text
}
"
)
print
(
"该URL地址数据写入完毕"
)
if
__name__
==
'__main__'
:
# 获取总页码
first_url
=
'https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&callback=&page=1'
c
=
Common
()
res
=
requests
.
get
(
url
=
first_url
,
headers
=
c
.
get_headers
())
data
=
res
.
json
()
if
data
[
'status'
]
==
200
:
total_page
=
data
[
'data'
][
'totalPage'
]
url_format
=
'https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&callback=getLiveListJsonpCallback&page={}'
# 拼接URL,全局共享变量
urls
=
[
url_format
.
format
(
i
)
for
i
in
range
(
1
,
total_page
)]
# 最多允许5个线程同时运行
semaphore
=
threading
.
BoundedSemaphore
(
5
)
for
i
,
url
in
enumerate
(
urls
):
t
=
threading
.
Thread
(
target
=
run
,
args
=
(
i
,
url
,
semaphore
,
c
.
get_headers
()))
t
.
start
()
while
threading
.
active_count
()
!=
1
:
pass
else
:
print
(
'所有线程运行完毕'
)
README.md
浏览文件 @
bf1520ba
...
...
@@ -8,16 +8,20 @@ Python爬虫120例正式开始
## Python 爬虫 120 例,已完成文章清单
### requests 库 + re 模块
1.
[
10 行代码集 2000 张美女图,Python 爬虫 120 例,再上征途
](
https://dream.blog.csdn.net/article/details/117024328
)
2.
[
通过 Python 爬虫,发现 60%女装大佬游走在 cosplay 领域
](
https://dream.blog.csdn.net/article/details/117221667
)
3.
[
Python 千猫图,简单技术满足你的收集控
](
https://dream.blog.csdn.net/article/details/117458947
)
4.
[
熊孩子说“你没看过奥特曼”,赶紧用 Python 学习一下,没想到
](
https://dream.blog.csdn.net/article/details/117458985
)
5.
[
技术圈的【多肉小达人】,一篇文章你就能做到
](
https://blog.csdn.net/hihell/article/details/117661488
)
6.
[
我用 Python 连夜离线了 100G 图片,只为了防止网站被消失
](
https://dream.blog.csdn.net/article/details/117918309
)
### requests 库 + re 模块 + threading 模块
7.
[
对 Python 爬虫编写者充满诱惑的网站,《可爱图片网》,瞧人这网站名字起的
](
https://dream.blog.csdn.net/article/details/118035208
)
8.
[
5000张高清壁纸大图(手机用),用Python在法律的边缘又试探了一把
](
https://dream.blog.csdn.net/article/details/118145504
)
9.
[
10994部漫画信息,用Python实施大采集,因为反爬差一点就翻车了
](
https://blog.csdn.net/hihell/article/details/118222271
)
10.
[
爬动漫“上瘾”之后,放弃午休,迫不及待的用Python薅了腾讯动漫的数据,啧啧啧
](
https://blog.csdn.net/hihell/article/details/118340372
)
### requests 库 + lxml 库
11.
[
他说:“只是单纯的想用Python收集一些素颜照,做机器学习使用”,“我信你个鬼!”
](
https://blog.csdn.net/hihell/article/details/118385640
)
12.
[
1小时赚100元,某群X友,周末采集了20000+漫展历史数据,毫无技术难度
](
https://blog.csdn.net/hihell/article/details/118485941
)
13.
[
程序员(媛)不懂汉服?岂能让别人小看,咱先靠肉眼大数据识别万张穿搭照
](
https://dream.blog.csdn.net/article/details/118541741
)
...
...
@@ -25,8 +29,16 @@ Python爬虫120例正式开始
15.
[
整个大活,采集8个代理IP站点,为Python代理池铺路,爬虫120例之第15例
](
https://dream.blog.csdn.net/article/details/119137580
)
16.
[
极复杂编码,下载《原神》角色高清图、中日无损配音,爬虫 16 / 120 例
](
https://dream.blog.csdn.net/article/details/111028288
)
17.
[
爬虫120例之第17例,用Python面向对象的思路,采集各种精彩句子
](
https://dream.blog.csdn.net/article/details/119632820
)
### 技术阶段整理
18.
[
requests库与 lxml 库常用操作整理+总结,爬虫120例阶段整理篇
](
https://dream.blog.csdn.net/article/details/119633672
)
19.
[
正则表达式 与 XPath 语法领域细解,初学阶段的你,该怎么学?
](
https://dream.blog.csdn.net/article/details/119633700
)
### requests 库 + lxml 库 + cssselect 库
20.
[
Python爬虫120例之第20例,1637、一路商机网全站加盟数据采集
](
https://dream.blog.csdn.net/article/details/119850647
)
21.
[
孔夫子旧书网数据采集,举一反三学爬虫,Python爬虫120例第21例
](
https://dream.blog.csdn.net/article/details/119878744
)
### 多线程爬虫之 threading 模块
22.
[
谁有粉?就爬谁!他粉多,就爬他!Python 多线程采集 260000+ 粉丝数据
](
https://dream.blog.csdn.net/article/details/119931364
)
23.
[
懒人畅听网,有声小说类目数据采集,多线程速采案例,Python爬虫120例之23例
](
https://dream.blog.csdn.net/article/details/119914203
)
编辑
预览
Markdown
is supported
0%
请重试
或
添加新附件
.
添加附件
取消
You are about to add
0
people
to the discussion. Proceed with caution.
先完成此消息的编辑!
取消
想要评论请
注册
或
登录