Commit fc23aaab authored by 幻灰龙

Merge branch 'dev' into 'master'

Merge Dev

See merge request !3
```json
{
  "author": "zxm2015",
  "source": "dynamic_page.md",
  "depends": [],
  "type": "code_options"
}
```
# Scraping a dynamic page
Suppose you want to scrape a page at `url` whose content is loaded dynamically as you scroll down. Which of the following snippets can capture the content of such a page?
## Answer
```python
import time

from selenium import webdriver
from bs4 import BeautifulSoup

# assumes `url` points at the infinite-scroll page to scrape
driver = webdriver.Chrome()
driver.get(url)
page_size = 10
for i in range(page_size):
    time.sleep(2)  # give each batch of content time to load
    # scroll to the bottom to trigger loading of the next batch
    js = "var q=document.documentElement.scrollTop=10000"
    driver.execute_script(js)
page = BeautifulSoup(driver.page_source, 'lxml')
print(page.text)
```
## Options
### A
```
None of the above is correct
```
### B
```python
import requests
from bs4 import BeautifulSoup

response = requests.get(url=url)
page = BeautifulSoup(response.text, 'lxml')
print(page.text)
```
### C
```python
import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen(url)
buff = response.read()
html = buff.decode("utf8")
page = BeautifulSoup(html, 'lxml')
print(page.text)
```
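Every option above finishes with the same BeautifulSoup parsing step, which works on any HTML string regardless of how it was fetched. A minimal self-contained sketch (using the stdlib `html.parser` backend instead of `lxml` to avoid the extra dependency; the HTML string is a made-up stand-in for `driver.page_source` or `response.text`):

```python
from bs4 import BeautifulSoup

# a tiny stand-in for driver.page_source / response.text
html = "<html><body><div id='feed'><p>post 1</p><p>post 2</p></div></body></html>"
page = BeautifulSoup(html, "html.parser")
# get_text with a separator and strip=True gives clean visible text
print(page.get_text(" ", strip=True))
```

The difference between the options is only in how that HTML string is obtained: `requests` and `urllib` see the initial server response, while Selenium sees the DOM after scripts have run.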
```json
{
  "author": "zxm2015",
  "source": "simulate_login.md",
  "depends": [],
  "type": "code_options"
}
```
# Simulated login
Some sites require you to log in before you can browse the rest of their content, so a crawler needs to be able to log in and obtain a cookie/session before it can keep collecting data. Which of the following statements is <span style="color:red">wrong</span>?
## Answer
```
Cookies obtained after a successful login are generally valid forever
```
## Options
### A
```
Simulated login requires registering an account on the site first, or registering several accounts to maintain a pool of cookies
```
### B
```
Fetch the login page; the login URL can be found at the login button
```
### C
```
After a successful login you obtain a cookie; attaching it to subsequent requests gives access to the requested page resources
```
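The cookie flow described in option C can be sketched with `requests.Session`, which stores cookies received from a login response and automatically attaches them to later requests. In real use, the server's `Set-Cookie` header after `session.post(login_url, data=credentials)` populates the jar; the cookie name, value, and domain below are made up for illustration:

```python
import requests

session = requests.Session()
# in a real login, the server's Set-Cookie response header would populate
# this jar automatically; here we set a hypothetical session cookie by hand
session.cookies.set("sessionid", "abc123", domain="example.com")
# every subsequent session.get()/session.post() to a matching domain
# now carries this cookie without any extra code
print(dict(session.cookies))
```

This persistence is exactly why sessions eventually stop working: when the server expires the cookie, the stored value is still sent but no longer accepted, and the crawler must log in again.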
```json
{
  "author": "zxm2015",
  "source": "selenium.md",
  "depends": [],
  "type": "code_options"
}
```
# selenium
Selenium is a suite of web automation testing tools; a crawler can use it to collect dynamically rendered page resources. Which of the following statements about it is <span style="color:red">wrong</span>?
## Answer
```
selenium can collect data just like requests, at the same speed
```
## Options
### A
```
Content that is only rendered after the page executes JavaScript can be collected with the help of selenium
```
### B
```
selenium essentially drives a browser to send requests, simulating real browser behavior
```
### C
```
After a request you usually need to wait a while for resources to finish loading and rendering
```
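Selenium's own answer to the waiting problem in option C is `WebDriverWait`, which polls a condition rather than sleeping a fixed time. The underlying idea can be sketched in plain Python without a browser (`wait_until` below is a hypothetical helper for illustration, not a Selenium API):

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout:.1f}s")

# usage: a condition that becomes true after a short delay,
# standing in for "the element has appeared in the DOM"
ready_at = time.monotonic() + 0.2
wait_until(lambda: time.monotonic() >= ready_at, timeout=2.0, interval=0.05)
```

Compared with a fixed `time.sleep(2)`, this returns as soon as the condition holds and fails loudly when it never does, which is why condition-based waits are the usual recommendation.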
```json
{
  "author": "zxm2015",
  "source": "pyspider.md",
  "depends": [],
  "type": "code_options"
}
```
# pyspider
Both Pyspider and Scrapy can be used to crawl data. Which of the following statements about them is <span style="color:red">wrong</span>?
## Answer
```
Scrapy provides a web UI that can be used for debugging and deployment
```
## Options
### A
```
Pyspider provides a web UI that supports visual debugging
```
### B
```
For a beginner who wants to get started quickly crawling a news site, Pyspider is recommended
```
### C
```
Scrapy is more extensible and is mainly used for complex crawling scenarios
```
```json
{
  "author": "zxm2015",
  "source": "verification_code.md",
  "depends": [],
  "type": "code_options"
}
```
# Captchas in web scraping
Captchas (verification codes) are a way to tell humans and machines apart. Which of the following statements about captchas is <span style="color:red">wrong</span>?
## Answer
```
Captcha recognition is an old topic, and a 100% recognition rate has already been achieved
```
## Options
### A
```
Captchas come in many varieties, including mixed Chinese and English characters, click-to-select, sliders, and so on
```
### B
```
Captcha recognition uses OCR (Optical Character Recognition) technology
```
### C
```
For difficult captchas, you can integrate a captcha-solving platform or a third-party recognition service
```