Commit 0d368c5d authored by Paul Tremberth

Merge pull request #1724 from scrapy/robotstxt-default

[MRG+1] Enable robots.txt handling by default for new projects.
@@ -750,8 +750,8 @@ Default: ``60.0``
 Scope: ``scrapy.extensions.memusage``
 The :ref:`Memory usage extension <topics-extensions-ref-memusage>`
-checks the current memory usage, versus the limits set by
-:setting:`MEMUSAGE_LIMIT_MB` and :setting:`MEMUSAGE_WARNING_MB`,
+checks the current memory usage, versus the limits set by
+:setting:`MEMUSAGE_LIMIT_MB` and :setting:`MEMUSAGE_WARNING_MB`,
 at fixed time intervals.
 This sets the length of these intervals, in seconds.
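
The settings referenced in this hunk belong to the memory usage extension, which polls the process memory at the configured interval and acts when the warning or hard limit is crossed. A minimal sketch of how these settings might sit together in a project's ``settings.py`` (all values except the 60-second interval are illustrative, not documented defaults)::

    # Memory usage extension (POSIX-only); values below are examples.
    MEMUSAGE_ENABLED = True                     # turn the extension on
    MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0      # how often memory is sampled
    MEMUSAGE_WARNING_MB = 1024                  # send a warning above this threshold
    MEMUSAGE_LIMIT_MB = 2048                    # close the spider above this one
    MEMUSAGE_NOTIFY_MAIL = ['dev@example.com']  # where notifications are sent
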
@@ -877,7 +877,13 @@ Default: ``False``
 Scope: ``scrapy.downloadermiddlewares.robotstxt``
 If enabled, Scrapy will respect robots.txt policies. For more information see
-:ref:`topics-dlmw-robots`
+:ref:`topics-dlmw-robots`.
+
+.. note::
+
+    While the default value is ``False`` for historical reasons, this option
+    is enabled by default in the ``settings.py`` file generated by the
+    ``scrapy startproject`` command.
 .. setting:: SCHEDULER
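
Because the new default only takes effect in projects generated from the updated template, existing projects keep their current behaviour, and the value can still be overridden per spider. A minimal sketch, using a hypothetical spider, of turning robots.txt handling off for one spider while the project-wide ``ROBOTSTXT_OBEY = True`` stays in place::

    import scrapy


    class ExampleSpider(scrapy.Spider):
        # Hypothetical spider, shown only to illustrate the override.
        name = 'example'
        start_urls = ['http://example.com/']

        # custom_settings takes precedence over the project settings.py,
        # so this spider ignores robots.txt even though the project obeys it.
        custom_settings = {'ROBOTSTXT_OBEY': False}

        def parse(self, response):
            yield {
                'url': response.url,
                'title': response.css('title::text').extract_first(),
            }
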
@@ -1036,7 +1042,7 @@ TEMPLATES_DIR
 Default: ``templates`` dir inside scrapy module
 The directory where to look for templates when creating new projects with
-:command:`startproject` command and new spiders with :command:`genspider`
+:command:`startproject` command and new spiders with :command:`genspider`
 command.
 The project name must not conflict with the name of custom files or directories
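
``TEMPLATES_DIR`` is the setting behind this: pointing it at a custom directory lets a team ship its own project and spider templates. A minimal sketch, assuming a hypothetical path, of overriding it in ``settings.py``::

    # Hypothetical path, shown only as an example. The directory is expected
    # to mirror the built-in layout: a 'project' subdirectory used by
    # startproject and a 'spiders' subdirectory used by genspider.
    TEMPLATES_DIR = '/home/user/scrapy-templates'

With this in place, ``scrapy genspider`` should pick up ``*.tmpl`` spider templates from the ``spiders`` subdirectory of that path.
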
@@ -18,6 +18,9 @@ NEWSPIDER_MODULE = '$project_name.spiders'
 # Crawl responsibly by identifying yourself (and your website) on the user-agent
 #USER_AGENT = '$project_name (+http://www.yourdomain.com)'
+
+# Obey robots.txt rules
+ROBOTSTXT_OBEY = True
 # Configure maximum concurrent requests performed by Scrapy (default: 16)
 #CONCURRENT_REQUESTS = 32
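
The generated ``settings.py`` only covers project-based crawls; when a spider is run from a plain script, the same behaviour can be requested by passing the setting to the crawler. A minimal sketch, with hypothetical module and spider names, using ``CrawlerProcess``::

    from scrapy.crawler import CrawlerProcess

    from myproject.spiders.example import ExampleSpider  # hypothetical import

    process = CrawlerProcess(settings={
        'ROBOTSTXT_OBEY': True,  # same behaviour the new template enables
        'USER_AGENT': 'mybot (+http://www.example.com)',
    })
    process.crawl(ExampleSpider)
    process.start()  # blocks until crawling is finished
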