Commit 108f8c4f (unverified)
Authored on Oct 28, 2017 by Mikhail Korobov; committed via GitHub on Oct 28, 2017

Merge pull request #2982 from codeaditya/https-links

Fix broken links and use https links wherever possible

Parents: 79df51aa, dae7b1cd
Showing 39 changed files with 80 additions and 80 deletions (+80 −80)
CONTRIBUTING.md  (+2 −2)
INSTALL  (+1 −1)
README.rst  (+7 −7)
debian/control  (+3 −3)
debian/copyright  (+4 −4)
docs/contributing.rst  (+1 −1)
docs/intro/overview.rst  (+1 −1)
docs/topics/practices.rst  (+1 −1)
docs/topics/selectors.rst  (+2 −2)
docs/topics/shell.rst  (+4 −4)
scrapy/_monkeypatches.py  (+2 −2)
scrapy/core/downloader/contextfactory.py  (+2 −2)
scrapy/crawler.py  (+1 −1)
scrapy/downloadermiddlewares/chunked.py  (+1 −1)
scrapy/downloadermiddlewares/httpcache.py  (+1 −1)
scrapy/exporters.py  (+1 −1)
scrapy/extensions/httpcache.py  (+5 −5)
scrapy/extensions/telnet.py  (+1 −1)
scrapy/pipelines/files.py  (+2 −2)
scrapy/signalmanager.py  (+1 −1)
scrapy/templates/project/module/items.py.tmpl  (+1 −1)
scrapy/templates/project/module/middlewares.py.tmpl  (+1 −1)
scrapy/templates/project/module/pipelines.py.tmpl  (+1 −1)
scrapy/templates/project/module/settings.py.tmpl  (+10 −10)
scrapy/templates/project/scrapy.cfg  (+1 −1)
scrapy/utils/defer.py  (+1 −1)
scrapy/utils/deprecate.py  (+6 −6)
scrapy/utils/http.py  (+1 −1)
scrapy/utils/log.py  (+1 −1)
scrapy/utils/url.py  (+1 −1)
sep/sep-001.rst  (+1 −1)
sep/sep-006.rst  (+4 −4)
sep/sep-013.rst  (+1 −1)
sep/sep-017.rst  (+1 −1)
sep/sep-020.rst  (+1 −1)
setup.py  (+1 −1)
tests/__init__.py  (+1 −1)
tests/keys/example-com.conf  (+2 −2)
tox.ini  (+1 −1)
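The diff below is a mechanical sweep of the repository for plain http:// links. For context only (this script is not part of the commit), here is a minimal sketch of how such links could be located; the skip list and the regex are illustrative assumptions, and every hit still needs manual review, since some hosts referenced here (for example lxml.de at the time) did not serve HTTPS:

    import os
    import re

    # Directories we would not want to scan; illustrative only.
    SKIP_DIRS = {'.git', '.tox', 'build', 'dist'}
    HTTP_LINK = re.compile(r'http://[^\s\'")>\]]+')

    def find_plain_http_links(root='.'):
        """Yield (path, line_number, url) for every http:// link under ``root``."""
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding='utf-8', errors='ignore') as fh:
                        for lineno, line in enumerate(fh, start=1):
                            for match in HTTP_LINK.finditer(line):
                                yield path, lineno, match.group(0)
                except OSError:
                    continue

    if __name__ == '__main__':
        for path, lineno, url in find_plain_http_links():
            print('%s:%d: %s' % (path, lineno, url))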
CONTRIBUTING.md

 The guidelines for contributing are available here:
-http://doc.scrapy.org/en/master/contributing.html
+https://doc.scrapy.org/en/master/contributing.html

 Please do not abuse the issue tracker for support questions.
 If your issue topic can be rephrased to "How to ...?", please use the
-support channels to get it answered: http://scrapy.org/community/
+support channels to get it answered: https://scrapy.org/community/
INSTALL

 For information about installing Scrapy see:

 * docs/intro/install.rst (local file)
-* http://doc.scrapy.org/en/latest/intro/install.html (online version)
+* https://doc.scrapy.org/en/latest/intro/install.html (online version)
README.rst

@@ -31,7 +31,7 @@ crawl websites and extract structured data from their pages. It can be used for
 a wide range of purposes, from data mining to monitoring and automated testing.
 For more information including a list of features check the Scrapy homepage at:
-http://scrapy.org
+https://scrapy.org
 Requirements
 ============

@@ -47,12 +47,12 @@ The quick way::
     pip install scrapy
 For more details see the install section in the documentation:
-http://doc.scrapy.org/en/latest/intro/install.html
+https://doc.scrapy.org/en/latest/intro/install.html
 Documentation
 =============
-Documentation is available online at http://doc.scrapy.org/ and in the ``docs``
+Documentation is available online at https://doc.scrapy.org/ and in the ``docs``
 directory.
 Releases

@@ -63,12 +63,12 @@ You can find release notes at https://doc.scrapy.org/en/latest/news.html
 Community (blog, twitter, mail list, IRC)
 =========================================
-See http://scrapy.org/community/
+See https://scrapy.org/community/
 Contributing
 ============
-See http://doc.scrapy.org/en/master/contributing.html
+See https://doc.scrapy.org/en/master/contributing.html
 Code of Conduct
 ---------------

@@ -82,9 +82,9 @@ Please report unacceptable behavior to opensource@scrapinghub.com.
 Companies using Scrapy
 ======================
-See http://scrapy.org/companies/
+See https://scrapy.org/companies/
 Commercial Support
 ==================
-See http://scrapy.org/support/
+See https://scrapy.org/support/
debian/control

@@ -4,7 +4,7 @@ Priority: optional
 Maintainer: Scrapinghub Team <info@scrapinghub.com>
 Build-Depends: debhelper (>= 7.0.50), python (>=2.7), python-twisted, python-w3lib, python-lxml, python-six (>=1.5.2)
 Standards-Version: 3.8.4
-Homepage: http://scrapy.org/
+Homepage: https://scrapy.org/
 Package: scrapy
 Architecture: all
debian/copyright

 This package was debianized by the Scrapinghub team <info@scrapinghub.com>.
-It was downloaded from http://scrapy.org
+It was downloaded from https://scrapy.org
 Upstream Author: Scrapy Developers
docs/contributing.rst

@@ -7,7 +7,7 @@ Contributing to Scrapy
 .. important::
     Double check you are reading the most recent version of this document at
-    http://doc.scrapy.org/en/master/contributing.html
+    https://doc.scrapy.org/en/master/contributing.html
 There are many ways to contribute to Scrapy. Here are some of them:
docs/intro/overview.rst

@@ -160,7 +160,7 @@ The next steps for you are to :ref:`install Scrapy <intro-install>`,
 a full-blown Scrapy project and `join the community`_. Thanks for your
 interest!
-.. _join the community: http://scrapy.org/community/
+.. _join the community: https://scrapy.org/community/
 .. _web scraping: https://en.wikipedia.org/wiki/Web_scraping
 .. _Amazon Associates Web Services: https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
 .. _Amazon S3: https://aws.amazon.com/s3/
docs/topics/practices.rst

@@ -248,7 +248,7 @@ If you are still unable to prevent your bot getting banned, consider contacting
 `commercial support`_.
 .. _Tor project: https://www.torproject.org/
-.. _commercial support: http://scrapy.org/support/
+.. _commercial support: https://scrapy.org/support/
 .. _ProxyMesh: https://proxymesh.com/
 .. _Google cache: http://www.googleguide.com/cached_pages.html
 .. _testspiders: https://github.com/scrapinghub/testspiders
docs/topics/selectors.rst

@@ -86,7 +86,7 @@ To explain how to use the selectors we'll use the `Scrapy shell` (which
 provides interactive testing) and an example page located in the Scrapy
 documentation server:
-    http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
+    https://doc.scrapy.org/en/latest/_static/selectors-sample1.html
 .. _topics-selectors-htmlcode:

@@ -99,7 +99,7 @@ Here's its HTML code:
 First, let's open the shell::
-    scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
+    scrapy shell https://doc.scrapy.org/en/latest/_static/selectors-sample1.html
 Then, after the shell loads, you'll have the response available as ``response``
 shell variable, and its attached selector in ``response.selector`` attribute.
docs/topics/shell.rst

@@ -142,7 +142,7 @@ Example of shell session
 ========================
 Here's an example of a typical shell session where we start by scraping the
-http://scrapy.org page, and then proceed to scrape the https://reddit.com
+https://scrapy.org page, and then proceed to scrape the https://reddit.com
 page. Finally, we modify the (Reddit) request method to POST and re-fetch it
 getting an error. We end the session by typing Ctrl-D (in Unix systems) or
 Ctrl-Z in Windows.

@@ -154,7 +154,7 @@ shell works.
 First, we launch the shell::
-    scrapy shell 'http://scrapy.org' --nolog
+    scrapy shell 'https://scrapy.org' --nolog
 Then, the shell fetches the URL (using the Scrapy downloader) and prints the
 list of available objects and useful shortcuts (you'll notice that these lines

@@ -164,7 +164,7 @@ all start with the ``[s]`` prefix)::
     [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
     [s]   crawler    <scrapy.crawler.Crawler object at 0x7f07395dd690>
     [s]   item       {}
-    [s]   request    <GET http://scrapy.org>
+    [s]   request    <GET https://scrapy.org>
     [s]   response   <200 https://scrapy.org/>
     [s]   settings   <scrapy.settings.Settings object at 0x7f07395dd710>
     [s]   spider     <DefaultSpider 'default' at 0x7f0735891690>

@@ -182,7 +182,7 @@ After that, we can start playing with the objects::
     >>> response.xpath('//title/text()').extract_first()
     'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
-    >>> fetch("http://reddit.com")
+    >>> fetch("https://reddit.com")
     >>> response.xpath('//title/text()').extract()
     ['reddit: the front page of the internet']
scrapy/_monkeypatches.py

@@ -4,12 +4,12 @@ from six.moves import copyreg
 if sys.version_info[0] == 2:
     from urlparse import urlparse
-    # workaround for http://bugs.python.org/issue7904 - Python < 2.7
+    # workaround for https://bugs.python.org/issue7904 - Python < 2.7
     if urlparse('s3://bucket/key').netloc != 'bucket':
         from urlparse import uses_netloc
         uses_netloc.append('s3')
-    # workaround for http://bugs.python.org/issue9374 - Python < 2.7.4
+    # workaround for https://bugs.python.org/issue9374 - Python < 2.7.4
     if urlparse('s3://bucket/key?key=value').query != 'key=value':
         from urlparse import uses_query
         uses_query.append('s3')
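The guarded imports above patch the standard library's URL parsing for the s3:// scheme on old Python 2 releases. For illustration only (not part of this commit), the sketch below shows the behaviour those workarounds establish, which Python 3's urllib.parse already provides out of the box:

    from urllib.parse import urlparse

    # On Python 3, netloc and query are parsed for any "scheme://" URL,
    # so the scheme-registration workaround is only needed on old Python 2.
    parsed = urlparse('s3://bucket/key?key=value')
    assert parsed.netloc == 'bucket'
    assert parsed.query == 'key=value'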
scrapy/core/downloader/contextfactory.py

@@ -64,7 +64,7 @@ if twisted_version >= (14, 0, 0):
         """
         Twisted-recommended context factory for web clients.
-        Quoting http://twistedmatrix.com/documents/current/api/twisted.web.client.Agent.html:
+        Quoting https://twistedmatrix.com/documents/current/api/twisted.web.client.Agent.html:
         "The default is to use a BrowserLikePolicyForHTTPS,
         so unless you have special requirements you can leave this as-is."

@@ -100,6 +100,6 @@ else:
         def getContext(self, hostname=None, port=None):
             ctx = ClientContextFactory.getContext(self)
             # Enable all workarounds to SSL bugs as documented by
-            # http://www.openssl.org/docs/ssl/SSL_CTX_set_options.html
+            # https://www.openssl.org/docs/manmaster/man3/SSL_CTX_set_options.html
             ctx.set_options(SSL.OP_ALL)
             return ctx
scrapy/crawler.py

@@ -83,7 +83,7 @@ class Crawler(object):
             yield defer.maybeDeferred(self.engine.start)
         except Exception:
             # In Python 2 reraising an exception after yield discards
-            # the original traceback (see http://bugs.python.org/issue7563),
+            # the original traceback (see https://bugs.python.org/issue7563),
             # so sys.exc_info() workaround is used.
             # This workaround also works in Python 3, but it is not needed,
             # and it is slower, so in Python 3 we use native `raise`.
scrapy/downloadermiddlewares/chunked.py

@@ -11,7 +11,7 @@ warnings.warn("Module `scrapy.downloadermiddlewares.chunked` is deprecated, "
 class ChunkedTransferMiddleware(object):
     """This middleware adds support for chunked transfer encoding, as
-    documented in: http://en.wikipedia.org/wiki/Chunked_transfer_encoding
+    documented in: https://en.wikipedia.org/wiki/Chunked_transfer_encoding
     """
     def process_response(self, request, response, spider):
scrapy/downloadermiddlewares/httpcache.py

@@ -75,7 +75,7 @@ class HttpCacheMiddleware(object):
             return response
         # RFC2616 requires origin server to set Date header,
-        # http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.18
+        # https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.18
         if 'Date' not in response.headers:
             response.headers['Date'] = formatdate(usegmt=1)
scrapy/exporters.py

@@ -188,7 +188,7 @@ class XmlItemExporter(BaseItemExporter):
         self.xg.endElement(name)
         self._beautify_newline()
-    # Workaround for http://bugs.python.org/issue17606
+    # Workaround for https://bugs.python.org/issue17606
     # Before Python 2.7.4 xml.sax.saxutils required bytes;
     # since 2.7.4 it requires unicode. The bug is likely to be
     # fixed in 2.7.6, but 2.7.6 will still support unicode,
scrapy/extensions/httpcache.py

@@ -70,8 +70,8 @@ class RFC2616Policy(object):
         return True

     def should_cache_response(self, response, request):
-        # What is cacheable - http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec14.9.1
-        # Response cacheability - http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.4
+        # What is cacheable - https://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec14.9.1
+        # Response cacheability - https://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.4
         # Status code 206 is not included because cache can not deal with partial contents
         cc = self._parse_cachecontrol(response)
         # obey directive "Cache-Control: no-store"

@@ -163,7 +163,7 @@ class RFC2616Policy(object):
     def _compute_freshness_lifetime(self, response, request, now):
         # Reference nsHttpResponseHead::ComputeFreshnessLifetime
-        # http://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpResponseHead.cpp#410
+        # https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpResponseHead.cpp#706
         cc = self._parse_cachecontrol(response)
         maxage = self._get_max_age(cc)
         if maxage is not None:

@@ -194,7 +194,7 @@ class RFC2616Policy(object):
     def _compute_current_age(self, response, request, now):
         # Reference nsHttpResponseHead::ComputeCurrentAge
-        # http://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpResponseHead.cpp#366
+        # https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpResponseHead.cpp#658
         currentage = 0
         # If Date header is not set we assume it is a fast connection, and
         # clock is in sync with the server

@@ -414,7 +414,7 @@ class LeveldbCacheStorage(object):
 def parse_cachecontrol(header):
     """Parse Cache-Control header
-    http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9
+    https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9

     >>> parse_cachecontrol(b'public, max-age=3600') == {b'public': None,
     ...                                                 b'max-age': b'3600'}
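The two _compute_* helpers touched above implement RFC 2616 freshness: a cached response may be reused while its current age is below its freshness lifetime. A deliberately simplified sketch of that rule, for illustration only (this is not Scrapy's implementation, which follows the Mozilla code linked in the comments and also handles Expires, the Age header and heuristic lifetimes); the header dict and helper name are assumptions:

    from email.utils import mktime_tz, parsedate_tz

    def is_fresh(headers, now):
        """Crude RFC 2616 freshness check: current age < max-age."""
        date = mktime_tz(parsedate_tz(headers['Date']))
        max_age = 0
        for directive in headers.get('Cache-Control', '').split(','):
            directive = directive.strip()
            if directive.startswith('max-age='):
                max_age = int(directive.split('=', 1)[1])
        current_age = max(0, now - date)   # simplified: ignores the Age header
        return current_age < max_age

With Cache-Control: max-age=3600 and a Date header one minute old this returns True; responses without max-age are treated as stale in this simplified version.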
scrapy/extensions/telnet.py

@@ -82,7 +82,7 @@ class TelnetConsole(protocol.ServerFactory):
             'prefs': print_live_refs,
             'hpy': hpy,
             'help': "This is Scrapy telnet console. For more info see: " \
-                    "http://doc.scrapy.org/en/latest/topics/telnetconsole.html",
+                    "https://doc.scrapy.org/en/latest/topics/telnetconsole.html",
         }
         self.crawler.signals.send_catch_log(update_telnet_vars, telnet_vars=telnet_vars)
         return telnet_vars
scrapy/pipelines/files.py

@@ -120,7 +120,7 @@ class S3FilesStore(object):
     def _get_boto_bucket(self):
         # disable ssl (is_secure=False) because of this python bug:
-        # http://bugs.python.org/issue5103
+        # https://bugs.python.org/issue5103
         c = self.S3Connection(self.AWS_ACCESS_KEY_ID, self.AWS_SECRET_ACCESS_KEY, is_secure=False)
         return c.get_bucket(self.bucket, validate=False)
scrapy/signalmanager.py

@@ -55,7 +55,7 @@ class SignalManager(object):
         The keyword arguments are passed to the signal handlers (connected
         through the :meth:`connect` method).

-        .. _deferreds: http://twistedmatrix.com/documents/current/core/howto/defer.html
+        .. _deferreds: https://twistedmatrix.com/documents/current/core/howto/defer.html
         """
         kwargs.setdefault('sender', self.sender)
         return _signal.send_catch_log_deferred(signal, **kwargs)
scrapy/templates/project/module/items.py.tmpl

@@ -3,7 +3,7 @@
 # Define here the models for your scraped items
 #
 # See documentation in:
-# http://doc.scrapy.org/en/latest/topics/items.html
+# https://doc.scrapy.org/en/latest/topics/items.html
 import scrapy
scrapy/templates/project/module/middlewares.py.tmpl

@@ -3,7 +3,7 @@
 # Define here the models for your spider middleware
 #
 # See documentation in:
-# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
+# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
 from scrapy import signals
scrapy/templates/project/module/pipelines.py.tmpl

@@ -3,7 +3,7 @@
 # Define your item pipelines here
 #
 # Don't forget to add your pipeline to the ITEM_PIPELINES setting
-# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
+# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
 class ${ProjectName}Pipeline(object):
scrapy/templates/project/module/settings.py.tmpl

@@ -5,9 +5,9 @@
 # For simplicity, this file contains only settings considered important or
 # commonly used. You can find more settings consulting the documentation:
 #
-#     http://doc.scrapy.org/en/latest/topics/settings.html
-#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
-#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
+#     https://doc.scrapy.org/en/latest/topics/settings.html
+#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
+#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
 BOT_NAME = '$project_name'

@@ -25,7 +25,7 @@ ROBOTSTXT_OBEY = True
 #CONCURRENT_REQUESTS = 32
 # Configure a delay for requests for the same website (default: 0)
-# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
+# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
 # See also autothrottle settings and docs
 #DOWNLOAD_DELAY = 3
 # The download delay setting will honor only one of:

@@ -45,31 +45,31 @@ ROBOTSTXT_OBEY = True
 #}
 # Enable or disable spider middlewares
-# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
+# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
 #SPIDER_MIDDLEWARES = {
 #    '$project_name.middlewares.${ProjectName}SpiderMiddleware': 543,
 #}
 # Enable or disable downloader middlewares
-# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
+# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
 #DOWNLOADER_MIDDLEWARES = {
 #    '$project_name.middlewares.${ProjectName}DownloaderMiddleware': 543,
 #}
 # Enable or disable extensions
-# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
+# See https://doc.scrapy.org/en/latest/topics/extensions.html
 #EXTENSIONS = {
 #    'scrapy.extensions.telnet.TelnetConsole': None,
 #}
 # Configure item pipelines
-# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
+# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
 #ITEM_PIPELINES = {
 #    '$project_name.pipelines.${ProjectName}Pipeline': 300,
 #}
 # Enable and configure the AutoThrottle extension (disabled by default)
-# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
+# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
 #AUTOTHROTTLE_ENABLED = True
 # The initial download delay
 #AUTOTHROTTLE_START_DELAY = 5

@@ -82,7 +82,7 @@ ROBOTSTXT_OBEY = True
 #AUTOTHROTTLE_DEBUG = False
 # Enable and configure HTTP caching (disabled by default)
-# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
+# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
 #HTTPCACHE_ENABLED = True
 #HTTPCACHE_EXPIRATION_SECS = 0
 #HTTPCACHE_DIR = 'httpcache'
scrapy/templates/project/scrapy.cfg

 # Automatically created by: scrapy startproject
 #
 # For more information about the [deploy] section see:
-# https://scrapyd.readthedocs.org/en/latest/deploy.html
+# https://scrapyd.readthedocs.io/en/latest/deploy.html
 [settings]
 default = ${project_name}.settings
scrapy/utils/defer.py

@@ -57,7 +57,7 @@ def parallel(iterable, count, callable, *args, **named):
     """Execute a callable over the objects in the given iterable, in parallel,
     using no more than ``count`` concurrent calls.

-    Taken from: http://jcalderone.livejournal.com/24285.html
+    Taken from: https://jcalderone.livejournal.com/24285.html
     """
     coop = task.Cooperator()
     work = (callable(elem, *args, **named) for elem in iterable)
scrapy/utils/deprecate.py

@@ -71,8 +71,8 @@ def create_deprecated_class(name, new_class, clsdict=None,
                 warnings.warn(msg, warn_category, stacklevel=2)
             super(DeprecatedClass, cls).__init__(name, bases, clsdict_)

-        # see http://www.python.org/dev/peps/pep-3119/#overloading-isinstance-and-issubclass
-        # and http://docs.python.org/2/reference/datamodel.html#customizing-instance-and-subclass-checks
+        # see https://www.python.org/dev/peps/pep-3119/#overloading-isinstance-and-issubclass
+        # and https://docs.python.org/reference/datamodel.html#customizing-instance-and-subclass-checks
         # for implementation details
         def __instancecheck__(cls, inst):
             return any(cls.__subclasscheck__(c)
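The comment above points at the PEP 3119 hooks Scrapy uses to keep isinstance() working for deprecated class aliases. For illustration only, a self-contained toy example of that mechanism; the class names here are made up and this is not Scrapy's actual DeprecatedClass metaclass:

    class DuckTypedMeta(type):
        # PEP 3119 hook: the metaclass, not the object, decides what
        # counts as an instance of the class.
        def __instancecheck__(cls, inst):
            return hasattr(inst, 'crawl')

    class CrawlerLike(metaclass=DuckTypedMeta):
        pass

    class MySpiderRunner:
        def crawl(self):
            return 'ok'

    # isinstance() consults DuckTypedMeta.__instancecheck__, so this passes
    # even though MySpiderRunner does not inherit from CrawlerLike.
    assert isinstance(MySpiderRunner(), CrawlerLike)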
scrapy/utils/http.py

@@ -11,7 +11,7 @@ def decode_chunked_transfer(chunked_body):
     decoded body.

     For more info see:
-    http://en.wikipedia.org/wiki/Chunked_transfer_encoding
+    https://en.wikipedia.org/wiki/Chunked_transfer_encoding
     """
     body, h, t = '', '', chunked_body
scrapy/utils/log.py

@@ -154,7 +154,7 @@ class StreamLogger(object):
     """Fake file-like stream object that redirects writes to a logger instance

     Taken from:
-        http://www.electricmonk.nl/log/2011/08/14/redirect-stdout-and-stderr-to-a-logger-in-python/
+        https://www.electricmonk.nl/log/2011/08/14/redirect-stdout-and-stderr-to-a-logger-in-python/
     """
     def __init__(self, logger, log_level=logging.INFO):
         self.logger = logger
scrapy/utils/url.py

@@ -47,7 +47,7 @@ def parse_url(url, encoding=None):
 def escape_ajax(url):
     """
     Return the crawleable url according to:
-    http://code.google.com/web/ajaxcrawling/docs/getting-started.html
+    https://developers.google.com/webmasters/ajax-crawling/docs/getting-started

     >>> escape_ajax("www.example.com/ajax.html#!key=value")
     'www.example.com/ajax.html?_escaped_fragment_=key%3Dvalue'
sep/sep-001.rst

@@ -61,7 +61,7 @@ ItemForm
 --------
 Pros:
-- same API used for Items (see http://doc.scrapy.org/en/latest/topics/items.html)
+- same API used for Items (see https://doc.scrapy.org/en/latest/topics/items.html)
 - some people consider setitem API more elegant than methods API
 Cons:
sep/sep-006.rst

@@ -16,7 +16,7 @@ Motivation
 ==========
 When you use Selectors in Scrapy, your final goal is to "extract" the data that
-you've selected, as the [http://doc.scrapy.org/en/latest/topics/selectors.html
+you've selected, as the [https://doc.scrapy.org/en/latest/topics/selectors.html
 XPath Selectors documentation] says (bolding by me):
 When you’re scraping web pages, the most common task you need to perform is

@@ -58,7 +58,7 @@ As the name of the method for performing selection (the ``x`` method) is not
 descriptive nor mnemotechnic enough and clearly clashes with ``extract`` method
 (x sounds like a short for extract in english), we propose to rename it to
 `select`, `sel` (is shortness if required), or `xpath` after `lxml's
-<http://codespeak.net/lxml/xpathxslt.html>`_ ``xpath`` method.
+<http://lxml.de/xpathxslt.html>`_ ``xpath`` method.
 Bonus (ItemBuilder)
 ===================

@@ -71,5 +71,5 @@ webpage or set of pages.
 References
 ==========
-1. XPath Selectors (http://doc.scrapy.org/topics/selectors.html)
-2. XPath and XSLT with lxml (http://codespeak.net/lxml/xpathxslt.html)
+1. XPath Selectors (https://doc.scrapy.org/topics/selectors.html)
+2. XPath and XSLT with lxml (http://lxml.de/xpathxslt.html)
sep/sep-013.rst

@@ -44,7 +44,7 @@ Overview of changes proposed
 Most of the inconsistencies come from the fact that middlewares don't follow
 the typical
-[http://twistedmatrix.com/projects/core/documentation/howto/defer.html
+[https://twistedmatrix.com/projects/core/documentation/howto/defer.html
 deferred] callback/errback chaining logic. Twisted logic is fine and quite
 intuitive, and also fits middlewares very well. Due to some bad design choices
 the integration between middleware calls and deferred is far from optional. So
sep/sep-017.rst

@@ -13,7 +13,7 @@ SEP-017: Spider Contracts
 The motivation for Spider Contracts is to build a lightweight mechanism for
 testing your spiders, and be able to run the tests quickly without having to
 wait for all the spider to run. It's partially based on the
-[http://en.wikipedia.org/wiki/Design_by_contract Design by contract] approach
+[https://en.wikipedia.org/wiki/Design_by_contract Design by contract] approach
 (hence its name) where you define certain conditions that spider callbacks must
 met, and you give example testing pages.
sep/sep-020.rst

@@ -29,7 +29,7 @@ the rows and the further embedded ``<td>`` elements denoting the individual
 fields.
 One pattern that is particularly well suited for auto-populating an Item Loader
-is the `definition list <http://www.w3.org/TR/html401/struct/lists.html#h-10.3>`_::
+is the `definition list <https://www.w3.org/TR/html401/struct/lists.html#h-10.3>`_::
     <div class="geeks">
         <dl>
setup.py

@@ -29,7 +29,7 @@ if has_environment_marker_platform_impl_support():
 setup(
     name='Scrapy',
     version=version,
-    url='http://scrapy.org',
+    url='https://scrapy.org',
     description='A high-level Web Crawling and Web Scraping framework',
     long_description=open('README.rst').read(),
     author='Scrapy developers',
tests/__init__.py

 """
 tests: this package contains all Scrapy unittests

-see http://doc.scrapy.org/en/latest/contributing.html#running-tests
+see https://doc.scrapy.org/en/latest/contributing.html#running-tests
 """

 import os
tests/keys/example-com.conf

-# this is copied from http://stackoverflow.com/a/27931596
+# this is copied from https://stackoverflow.com/a/27931596
 [ req ]
 default_bits        = 2048
 default_keyfile     = server-key.pem
tox.ini

-# Tox (http://tox.testrun.org/) is a tool for running tests
+# Tox (https://tox.readthedocs.io/) is a tool for running tests
 # in multiple virtualenvs. This configuration file will run the
 # test suite on all supported python versions. To use it, "pip install tox"
 # and then run "tox" from this directory.