Unverified commit 904a5013
Authored Feb 03, 2021 by Mikhail Korobov; committed via GitHub on Feb 03, 2021
Parents: 28262d4b, f30f53b3

Merge pull request #4973 from Gallaecio/zyte

Scrapinghub → Zyte
Showing 9 changed files with 38 additions and 35 deletions (+38 −35)
* AUTHORS (+2 −2)
* CODE_OF_CONDUCT.md (+1 −1)
* README.rst (+4 −3)
* docs/intro/install.rst (+0 −1)
* docs/topics/deploy.rst (+16 −16)
* docs/topics/logging.rst (+2 −2)
* docs/topics/practices.rst (+3 −3)
* docs/topics/selectors.rst (+2 −2)
* scrapy/core/downloader/handlers/http11.py (+8 −5)
AUTHORS @ 904a5013

 Scrapy was brought to life by Shane Evans while hacking a scraping framework
 prototype for Mydeco (mydeco.com). It soon became maintained, extended and
 improved by Insophia (insophia.com), with the initial sponsorship of Mydeco to
-bootstrap the project. In mid-2011, Scrapinghub became the new official
-maintainer.
+bootstrap the project. In mid-2011, Scrapinghub (now Zyte) became the new
+official maintainer.

 Here is the list of the primary authors & contributors:
...
CODE_OF_CONDUCT.md @ 904a5013

@@ -55,7 +55,7 @@ further defined and clarified by project maintainers.

 ## Enforcement

 Instances of abusive, harassing, or otherwise unacceptable behavior may be
-reported by contacting the project team at opensource@scrapinghub.com. All
+reported by contacting the project team at opensource@zyte.com. All
 complaints will be reviewed and investigated and will result in a response that
 is deemed necessary and appropriate to the circumstances. The project team is
 obligated to maintain confidentiality with regard to the reporter of an incident.
...
README.rst @ 904a5013

@@ -42,10 +42,11 @@ Scrapy is a fast high-level web crawling and web scraping framework, used to
 crawl websites and extract structured data from their pages. It can be used for
 a wide range of purposes, from data mining to monitoring and automated testing.

-Scrapy is maintained by `Scrapinghub`_ and `many other contributors`_.
+Scrapy is maintained by Zyte_ (formerly Scrapinghub) and `many other
+contributors`_.

 .. _many other contributors: https://github.com/scrapy/scrapy/graphs/contributors
-.. _Scrapinghub: https://www.scrapinghub.com/
+.. _Zyte: https://www.zyte.com/

 Check the Scrapy homepage at https://scrapy.org for more information,
 including a list of features.

@@ -95,7 +96,7 @@ Please note that this project is released with a Contributor Code of Conduct
 (see https://github.com/scrapy/scrapy/blob/master/CODE_OF_CONDUCT.md).
 By participating in this project you agree to abide by its terms.
-Please report unacceptable behavior to opensource@scrapinghub.com.
+Please report unacceptable behavior to opensource@zyte.com.

 Companies using Scrapy
 ======================
...
docs/intro/install.rst @ 904a5013

@@ -266,7 +266,6 @@ For details, see `Issue #2473 <https://github.com/scrapy/scrapy/issues/2473>`_.

 .. _setuptools: https://pypi.python.org/pypi/setuptools
 .. _homebrew: https://brew.sh/
 .. _zsh: https://www.zsh.org/
-.. _Scrapinghub: https://scrapinghub.com
 .. _Anaconda: https://docs.anaconda.com/anaconda/
 .. _Miniconda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
 .. _conda-forge: https://conda-forge.org/
docs/topics/deploy.rst @ 904a5013

@@ -14,7 +14,7 @@ spiders come in.

 Popular choices for deploying Scrapy spiders are:

 * :ref:`Scrapyd <deploy-scrapyd>` (open source)
-* :ref:`Scrapy Cloud <deploy-scrapy-cloud>` (cloud-based)
+* :ref:`Zyte Scrapy Cloud <deploy-scrapy-cloud>` (cloud-based)

 .. _deploy-scrapyd:

...

@@ -32,28 +32,28 @@ Scrapyd is maintained by some of the Scrapy developers.

 .. _deploy-scrapy-cloud:

-Deploying to Scrapy Cloud
-=========================
+Deploying to Zyte Scrapy Cloud
+==============================

-`Scrapy Cloud`_ is a hosted, cloud-based service by `Scrapinghub`_, the company
-behind Scrapy.
+`Zyte Scrapy Cloud`_ is a hosted, cloud-based service by Zyte_, the company
+behind Scrapy.

-Scrapy Cloud removes the need to setup and monitor servers
-and provides a nice UI to manage spiders and review scraped items,
-logs and stats.
+Zyte Scrapy Cloud removes the need to setup and monitor servers and provides a
+nice UI to manage spiders and review scraped items, logs and stats.

-To deploy spiders to Scrapy Cloud you can use the `shub`_ command line tool.
-Please refer to the `Scrapy Cloud documentation`_ for more information.
+To deploy spiders to Zyte Scrapy Cloud you can use the `shub`_ command line
+tool.
+Please refer to the `Zyte Scrapy Cloud documentation`_ for more information.

-Scrapy Cloud is compatible with Scrapyd and one can switch between
+Zyte Scrapy Cloud is compatible with Scrapyd and one can switch between
 them as needed - the configuration is read from the ``scrapy.cfg`` file
 just like ``scrapyd-deploy``.

 .. _Scrapyd: https://github.com/scrapy/scrapyd
 .. _Deploying your project: https://scrapyd.readthedocs.io/en/latest/deploy.html
-.. _Scrapy Cloud: https://scrapinghub.com/scrapy-cloud
 .. _scrapyd-client: https://github.com/scrapy/scrapyd-client
-.. _shub: https://doc.scrapinghub.com/shub.html
+.. _shub: https://shub.readthedocs.io/en/latest/
 .. _scrapyd-deploy documentation: https://scrapyd.readthedocs.io/en/latest/deploy.html
-.. _Scrapy Cloud documentation: https://doc.scrapinghub.com/scrapy-cloud.html
-.. _Scrapinghub: https://scrapinghub.com/
+.. _Zyte: https://zyte.com/
+.. _Zyte Scrapy Cloud: https://www.zyte.com/scrapy-cloud/
+.. _Zyte Scrapy Cloud documentation: https://docs.zyte.com/scrapy-cloud.html
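The compatibility paragraph in this hunk says deploy configuration is read from ``scrapy.cfg`` just like ``scrapyd-deploy``; a minimal sketch of such a file follows, where the target name ``local`` and project name ``myproject`` are hypothetical and the URL assumes a Scrapyd instance on its default port:

```ini
[settings]
default = myproject.settings

# A deploy target as read by scrapyd-deploy; 'local' and 'myproject'
# are placeholder names, and the URL assumes a default local Scrapyd.
[deploy:local]
url = http://localhost:6800/
project = myproject
```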
docs/topics/logging.rst @ 904a5013

@@ -101,7 +101,7 @@ instance, which can be accessed and used like this::

     class MySpider(scrapy.Spider):
         name = 'myspider'
-        start_urls = ['https://scrapinghub.com']
+        start_urls = ['https://scrapy.org']

         def parse(self, response):
            self.logger.info('Parse function called on %s', response.url)

...

@@ -117,7 +117,7 @@ Python logger you want. For example::

     class MySpider(scrapy.Spider):
         name = 'myspider'
-        start_urls = ['https://scrapinghub.com']
+        start_urls = ['https://scrapy.org']

         def parse(self, response):
            logger.info('Parse function called on %s', response.url)
...
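The snippets changed above rely on the fact that a spider's logger is a standard-library ``logging.Logger`` named after the spider; a minimal sketch of the same call pattern without Scrapy (handler class and logger name are illustrative only):

```python
import logging

# Collected messages, so the logging output can be inspected.
records = []

class ListHandler(logging.Handler):
    """Illustrative handler that stores formatted messages in a list."""
    def emit(self, record):
        records.append(record.getMessage())

# Scrapy names the spider's logger after the spider; mirror that here.
logger = logging.getLogger('myspider')
logger.addHandler(ListHandler())
logger.setLevel(logging.INFO)

# Same call shape as in parse(): lazy %-style formatting of the URL.
logger.info('Parse function called on %s', 'https://scrapy.org')
```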
docs/topics/practices.rst @ 904a5013

@@ -63,7 +63,7 @@ project as example.

     process = CrawlerProcess(get_project_settings())

     # 'followall' is the name of one of the spiders of the project.
-    process.crawl('followall', domain='scrapinghub.com')
+    process.crawl('followall', domain='scrapy.org')
     process.start()  # the script will block here until the crawling is finished

 There's another Scrapy utility that provides more control over the crawling

...

@@ -244,7 +244,7 @@ Here are some tips to keep in mind when dealing with these kinds of sites:

   super proxy that you can attach your own proxies to.
 * use a highly distributed downloader that circumvents bans internally, so you
   can just focus on parsing clean pages. One example of such downloaders is
-  `Crawlera`_
+  `Zyte Smart Proxy Manager`_

 If you are still unable to prevent your bot getting banned, consider contacting
 `commercial support`_.

...

@@ -254,5 +254,5 @@ If you are still unable to prevent your bot getting banned, consider contacting

 .. _ProxyMesh: https://proxymesh.com/
 .. _Google cache: http://www.googleguide.com/cached_pages.html
 .. _testspiders: https://github.com/scrapinghub/testspiders
-.. _Crawlera: https://scrapinghub.com/crawlera
 .. _scrapoxy: https://scrapoxy.io/
+.. _Zyte Smart Proxy Manager: https://www.zyte.com/smart-proxy-manager/
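The "rotating pool of IPs" tip mentioned in the practices hunk above can be sketched as a simple round-robin selector; the class name and proxy addresses are hypothetical placeholders, not part of Scrapy's API:

```python
from itertools import cycle

class ProxyRotator:
    """Round-robin over a pool of proxy URLs (hypothetical sketch)."""
    def __init__(self, proxies):
        self._pool = cycle(proxies)

    def next_proxy(self):
        # Each call hands out the next proxy, wrapping around the pool.
        return next(self._pool)

# Placeholder proxy endpoints for illustration only.
rotator = ProxyRotator([
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
])
picks = [rotator.next_proxy() for _ in range(3)]
```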
docs/topics/selectors.rst @ 904a5013

@@ -464,10 +464,10 @@ effectively. If you are not much familiar with XPath yet,
 you may want to take a look first at this `XPath tutorial`_.

 .. note::

-    Some of the tips are based on `this post from ScrapingHub's blog`_.
+    Some of the tips are based on `this post from Zyte's blog`_.

 .. _`XPath tutorial`: http://www.zvon.org/comp/r/tut-XPath_1.html
-.. _`this post from ScrapingHub's blog`: https://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/
+.. _this post from Zyte's blog: https://www.zyte.com/blog/xpath-tips-from-the-web-scraping-trenches/

 .. _topics-selectors-relative-xpaths:
...
...
scrapy/core/downloader/handlers/http11.py @ 904a5013

@@ -303,11 +303,14 @@ class ScrapyAgent:

         proxyHost = to_unicode(proxyHost)
         omitConnectTunnel = b'noconnect' in proxyParams
         if omitConnectTunnel:
-            warnings.warn("Using HTTPS proxies in the noconnect mode is deprecated. "
-                          "If you use Crawlera, it doesn't require this mode anymore, "
-                          "so you should update scrapy-crawlera to 1.3.0+ "
-                          "and remove '?noconnect' from the Crawlera URL.",
-                          ScrapyDeprecationWarning)
+            warnings.warn(
+                "Using HTTPS proxies in the noconnect mode is deprecated. "
+                "If you use Zyte Smart Proxy Manager (formerly Crawlera), "
+                "it doesn't require this mode anymore, so you should "
+                "update scrapy-crawlera to 1.3.0+ and remove '?noconnect' "
+                "from the Zyte Smart Proxy Manager URL.",
+                ScrapyDeprecationWarning,
+            )

         if scheme == b'https' and not omitConnectTunnel:
             proxyAuth = request.headers.get(b'Proxy-Authorization', None)
             proxyConf = (proxyHost, proxyPort, proxyAuth)
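The deprecation path changed in this hunk can be exercised with the standard-library ``warnings`` module; a minimal sketch with a stand-in warning class (the real one lives in ``scrapy.exceptions``) and a simplified helper that mirrors the ``noconnect`` check:

```python
import warnings

class ScrapyDeprecationWarning(Warning):
    """Stand-in for scrapy.exceptions.ScrapyDeprecationWarning."""

def noconnect_requested(proxy_params: bytes) -> bool:
    # Mirrors the check in ScrapyAgent: a legacy '?noconnect' flag in the
    # proxy URL's query string triggers a deprecation warning.
    omit_connect_tunnel = b'noconnect' in proxy_params
    if omit_connect_tunnel:
        warnings.warn(
            "Using HTTPS proxies in the noconnect mode is deprecated. "
            "Remove '?noconnect' from the proxy URL.",
            ScrapyDeprecationWarning,
        )
    return omit_connect_tunnel

# Capture warnings instead of printing them, so behavior can be checked.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = noconnect_requested(b'noconnect')
    modern = noconnect_requested(b'')
```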