茵可露露 / Python 爬虫120例
Forked from 梦想橡皮擦 / Python 爬虫120例 (in sync with the fork source)
Commit 348de4b0: 博客粉丝采集 (blog fan data collection)
Authored on Aug 27, 2021 by 梦想橡皮擦
Parent: b0f61b78
1 changed file with 91 additions and 0 deletions

NO22/CSDN博客粉丝数据采集.py (new file, mode 0 → 100644, +91 −0)
import threading
from threading import Lock
import time
import random

import requests


class MyThread(threading.Thread):
    def __init__(self, name):
        super(MyThread, self).__init__()
        self.name = name

    def run(self):
        global urls
        with lock:
            # Guard: the last batch may start with fewer URLs left than threads,
            # so an unguarded urls.pop() would raise IndexError.
            if not urls:
                return
            one_url = urls.pop()
            print("Crawling:", one_url)
        # Each thread waits a random 1-3 s so requests are not sent in lockstep
        time.sleep(random.randint(1, 3))
        try:
            res = requests.get(one_url, headers=self.get_headers(), timeout=5)
            payload = res.json()
        except (requests.RequestException, ValueError) as e:
            # Network error, timeout, or non-JSON body: record and move on
            print("Request failed:", one_url, e)
            with open('./error.txt', 'a+', encoding='utf-8') as f:
                f.write(one_url + "\n")
            return
        if payload["code"] != 400:
            data = payload["data"]["list"]
            for user in data:
                name = user['username']
                nickname = self.remove_character(user['nickname'])
                userAvatar = user['userAvatar']
                blogUrl = user['blogUrl']
                blogExpert = user['blogExpert']
                briefIntroduction = self.remove_character(user['briefIntroduction'])
                line = f"{name},{nickname},{userAvatar},{blogUrl},{blogExpert},{briefIntroduction}"
                print(line)
                with open('./qing_gee_data.csv', 'a+', encoding='utf-8') as f:
                    f.write(line + "\n")
        else:
            print(payload)
            print("Bad data:", one_url)
            with open('./error.txt', 'a+', encoding='utf-8') as f:
                f.write(one_url + "\n")

    # Strip characters that would break the comma-separated output
    def remove_character(self, origin_str):
        if origin_str is None:
            return None
        origin_str = origin_str.replace('\n', '')
        origin_str = origin_str.replace(',', ',')  # ASCII comma -> fullwidth comma
        return origin_str

    def get_headers(self):
        # Rotate through a pool of search-engine spider User-Agents
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
            "Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)",
            "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36",
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "Mozilla/5.0 (compatible; Googlebot-Image/1.0; +http://www.google.com/bot.html)",
            "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
            "Sogou News Spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
            "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
            "Sosospider+(+http://help.soso.com/webspider.htm)",
            "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            # Replace the placeholders below with your own CSDN login cookie
            'cookie': 'UserName=你的ID; UserInfo=你的UserInfo; UserToken=你的UserToken;',
            "referer": "https://blog.csdn.net/qing_gee?type=sub&subType=fans"
        }
        return headers


if __name__ == '__main__':
    lock = Lock()
    url_format = 'https://blog.csdn.net/community/home-api/v1/get-fans-list?page={}&size=20&noMore=false&blogUsername=qing_gee'
    urls = [url_format.format(i) for i in range(1, 13300)]
    while len(urls) > 0:
        print(len(urls))
        batch = []  # join only this round's threads, not every thread ever started
        for i in range(5):  # crawl in batches of 5 threads
            p = MyThread("t" + str(i))
            batch.append(p)
            p.start()
        for p in batch:
            p.join()
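A note on the output format: remove_character keeps the hand-rolled CSV intact by deleting newlines and swapping ASCII commas for fullwidth ones, which alters the stored nicknames. Python's csv module sidesteps the workaround by quoting fields instead. A minimal sketch of that alternative; write_users and the field order are illustrative, not part of this commit:

```python
import csv
import io


def write_users(rows):
    """Serialize fan records to CSV text; csv quotes commas and newlines."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for user in rows:
        writer.writerow([
            user["username"],
            user["nickname"],           # may contain commas or newlines
            user["userAvatar"],
            user["blogUrl"],
            user["blogExpert"],
            user["briefIntroduction"],  # free text, often multi-line
        ])
    return buf.getvalue()
```

The same text round-trips through csv.reader with the original commas and newlines preserved, so no characters need to be stripped or substituted.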
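The lock-guarded urls.pop() in MyThread.run can also be expressed with queue.Queue, which is thread-safe on its own and drains naturally when empty. A sketch of that pattern under stated assumptions: crawl_all and the fake "fetched:" payload stand in for the real requests.get call and are not part of the original file:

```python
import queue
import threading


def crawl_all(urls, worker_count=5):
    """Drain a thread-safe queue of URLs with a fixed pool of worker threads."""
    q = queue.Queue()
    for u in urls:
        q.put(u)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = q.get_nowait()  # raises queue.Empty once drained
            except queue.Empty:
                return
            data = f"fetched:{url}"   # stand-in for requests.get(url).json()
            with results_lock:
                results.append(data)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Compared to the batch-of-5 loop in the commit, the workers here run until the queue is empty, so no per-batch join bookkeeping is needed and an uneven final batch cannot underflow.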