Commit 5ff64ad0
Authored Nov 15, 2016 by Eugenio Lacuesta

handle relative sitemap urls in robots.txt

Parent: 2086ff40

Showing 4 changed files with 18 additions and 6 deletions (+18 -6):

    scrapy/spiders/sitemap.py      +1 -1
    scrapy/utils/sitemap.py        +5 -2
    tests/test_spider.py           +5 -1
    tests/test_utils_sitemap.py    +7 -2
scrapy/spiders/sitemap.py @ 5ff64ad0

@@ -32,7 +32,7 @@ class SitemapSpider(Spider):
     def _parse_sitemap(self, response):
         if response.url.endswith('/robots.txt'):
-            for url in sitemap_urls_from_robots(response.text):
+            for url in sitemap_urls_from_robots(response.text, base_url=response.url):
                 yield Request(url, callback=self._parse_sitemap)
         else:
             body = self._get_sitemap_body(response)
scrapy/utils/sitemap.py @ 5ff64ad0

@@ -4,7 +4,9 @@ Module for processing Sitemaps.
 Note: The main purpose of this module is to provide support for the
 SitemapSpider, its API is subject to change without notice.
 """
 import lxml.etree
+from six.moves.urllib.parse import urljoin
 
 
 class Sitemap(object):

@@ -34,10 +36,11 @@ class Sitemap(object):
             yield d
 
 
-def sitemap_urls_from_robots(robots_text):
+def sitemap_urls_from_robots(robots_text, base_url=None):
     """Return an iterator over all sitemap urls contained in the given
     robots.txt file
     """
     for line in robots_text.splitlines():
         if line.lstrip().lower().startswith('sitemap:'):
-            yield line.split(':', 1)[1].strip()
+            url = line.split(':', 1)[1].strip()
+            yield urljoin(base_url, url)
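Stripped of the Scrapy packaging, the patched helper can be sketched as a standalone function (a minimal approximation, not the exact Scrapy source: the `six` import is replaced by the Python 3 stdlib, and `base_url or ''` guards the `base_url=None` default, which `urljoin` itself would reject):

```python
from urllib.parse import urljoin

def sitemap_urls_from_robots(robots_text, base_url=None):
    """Yield every sitemap URL listed in a robots.txt body,
    resolving relative entries against base_url."""
    for line in robots_text.splitlines():
        # The Sitemap: directive name is case-insensitive, hence lower().
        if line.lstrip().lower().startswith('sitemap:'):
            # Split only on the first colon so the URL's own "://" survives.
            url = line.split(':', 1)[1].strip()
            # urljoin leaves absolute URLs untouched and resolves
            # relative ones against the robots.txt location.
            yield urljoin(base_url or '', url)

robots = """# Sitemap files
Sitemap: http://example.com/sitemap.xml
Sitemap: /sitemap-relative-url.xml
"""
print(list(sitemap_urls_from_robots(robots, base_url='http://example.com')))
# → ['http://example.com/sitemap.xml', 'http://example.com/sitemap-relative-url.xml']
```

Note that `urljoin` also normalizes the scheme case, which is why the commit's tests expect `HTTP://example.com/sitemap-uppercase.xml` to come out as `http://...`.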
tests/test_spider.py @ 5ff64ad0

@@ -332,13 +332,17 @@ class SitemapSpiderTest(SpiderTest):
         robots = b"""# Sitemap files
 Sitemap: http://example.com/sitemap.xml
 Sitemap: http://example.com/sitemap-product-index.xml
+Sitemap: HTTP://example.com/sitemap-uppercase.xml
+Sitemap: /sitemap-relative-url.xml
 """
         r = TextResponse(url="http://www.example.com/robots.txt", body=robots)
         spider = self.spider_class("example.com")
         self.assertEqual([req.url for req in spider._parse_sitemap(r)],
                          ['http://example.com/sitemap.xml',
-                          'http://example.com/sitemap-product-index.xml'])
+                          'http://example.com/sitemap-product-index.xml',
+                          'http://example.com/sitemap-uppercase.xml',
+                          'http://www.example.com/sitemap-relative-url.xml'])
 
 
 class BaseSpiderDeprecationTest(unittest.TestCase):
tests/test_utils_sitemap.py @ 5ff64ad0

@@ -119,13 +119,18 @@ Disallow: /s*/*tags
 # Sitemap files
 Sitemap: http://example.com/sitemap.xml
 Sitemap: http://example.com/sitemap-product-index.xml
+Sitemap: HTTP://example.com/sitemap-uppercase.xml
+Sitemap: /sitemap-relative-url.xml
 
 # Forums
 Disallow: /forum/search/
 Disallow: /forum/active/
 """
-        self.assertEqual(list(sitemap_urls_from_robots(robots)),
-                         ['http://example.com/sitemap.xml',
-                          'http://example.com/sitemap-product-index.xml'])
+        self.assertEqual(list(sitemap_urls_from_robots(robots, base_url='http://example.com')),
+                         ['http://example.com/sitemap.xml',
+                          'http://example.com/sitemap-product-index.xml',
+                          'http://example.com/sitemap-uppercase.xml',
+                          'http://example.com/sitemap-relative-url.xml'])
 
     def test_sitemap_blanklines(self):
         """Assert we can deal with starting blank lines before <xml> tag"""