Commit 4bd8eb17 authored by Daniel Graña

Merge pull request #705 from Curita/sep19-update

Per-spider settings and api cleanup: sep#19 update
======= ===================
Created 2013-03-07
Status  Draft
======= ===================
======================================================
SEP-019: Per-spider settings and Crawl Process Cleanup
======================================================
This is a proposal to add support for overriding settings per spider in a
consistent way, while taking the chance to refactor the settings population
and the whole crawl workflow.

In short, you will be able to override settings (on a per-spider basis) by
implementing a class method in your spider::

    class MySpider(BaseSpider):

        @classmethod
        def custom_settings(cls):
            return {
                "DOWNLOAD_DELAY": 5.0,
                "RETRY_ENABLED": False,
            }
What this solves
================
1. support for truly overridable per-spider settings, from both command-line
   usage and library mode
2. support for accessing settings from spiders (currently not supported
   without hacky code)
3. avoiding the mistaken belief that you can change settings after they have
   been populated (you can, but they won't have any effect)
Proposed changes
================
- new ``custom_settings`` class method will be added to spiders, to give them
a chance to override settings *before* they're used to instantiate the crawler
- new ``from_crawler`` class method will be added to spiders, to give spiders a
chance to access settings, stats, or the crawler core components themselves
- Crawler object constructor will receive a spider class as its (required)
  first argument
- spider managers will keep the functionality of loading spider classes (with
  a new ``load`` method that returns a spider class given its name), but
  spider initialization will be delegated to crawlers (via a new
  ``from_crawler`` class method in spiders, which will give them direct access
  to the crawler)
- spider manager will be stripped out of the Crawler class, as it will no
  longer need it
- ``SPIDER_MODULES`` and ``SPIDER_MANAGER_CLASS`` settings will be removed and
  replaced by entries in ``scrapy.cfg``. Thus spider managers won't need
  project settings to configure themselves
- CrawlerProcess will be removed, since crawlers will be created independently
  with a required spider class and an optional ``SettingsReader`` instance
- Settings class will be split into two classes: ``SettingsLoader`` and
``SettingsReader``, and a new concept of "setting priority" will be added
Settings
========
Settings class will be split into two classes, ``SettingsLoader`` and
``SettingsReader``. The former will be used to settle all the different levels
of settings across the project, and the latter will be a frozen version of the
already loaded settings and the preferred way to access them. This will avoid
the current possible misconception that you can change settings after they
have been populated. There will be a new concept of settings priorities, and
``settings.overrides`` will be deprecated in favor of explicitly loaded
settings with priorities, which makes setting overrides independent of the
order in which they are applied.
Because of this, ``CrawlerSettings`` (with its overrides, settings_module and
defaults) will be removed, but its interface could be maintained for backward
compatibility in ``SettingsReader`` (since in ``SettingsLoader``, an overrides
dictionary and settings with priorities cannot be reconciled in a consistent
implementation). Maintaining these attributes and their functionality is not
advisable, though, since it breaks the read-only nature of the class.
With the new per-spider settings, there's a need for a helper function that
will take a spider and return a ``SettingsReader`` instance populated with
default, project and per-spider settings. The motive behind this is that
``get_project_settings`` can't continue being used to get a settings instance
for crawler usage when using the API directly, as the project is not the only
source of settings anymore. ``get_project_settings`` will become an internal
function because of that.
SettingsLoader
--------------
``SettingsLoader`` is going to populate settings at startup, then it'll be
converted to a ``SettingsReader`` instance and discarded afterwards.

It is supposed to be write-only, but many previously loaded settings need to
be accessed before freezing them. For example, the ``COMMANDS_MODULE`` setting
allows loading more command default settings. Another example is that we need
to read the ``LOG_*`` settings early, because we must be able to log errors
during the settings loading process. ``ScrapyCommands`` may be configured
based upon current settings, as users can plug in custom commands. These are
some of the reasons that suggest that we need read-write access for this class
(see the sketch after the list below).
- Will have a method ``set(name, value, priority)`` to register a setting with
  a given priority. A ``setdict(dict, priority)`` method may come in handy for
  loading project and per-spider settings.
- Will have current Settings getter functions (``get``, ``getint``,
``getfloat``, ``getdict``, etc.) (See above for reasons behind this).
- Will have a ``freeze`` method that returns an instance of
``SettingsReader``, with a copy of the current state of settings (already
prioritized).
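To make the intended behaviour concrete, here is a minimal, illustrative
sketch of such a loader. This is not the proposed implementation: the class
name and method signatures follow the description above, but the internal
storage and the tie-breaking rule (a later call with equal priority wins) are
assumptions::

    class SettingsLoader(object):
        """Illustrative sketch only: stores (value, priority) pairs."""

        def __init__(self):
            self._data = {}  # name -> (value, priority)

        def set(self, name, value, priority):
            # keep the value with the highest priority; on ties, the
            # latest call wins (see "Setting priorities" below)
            if name not in self._data or priority >= self._data[name][1]:
                self._data[name] = (value, priority)

        def setdict(self, values, priority):
            for name, value in values.items():
                self.set(name, value, priority)

        def get(self, name, default=None):
            # read access is still needed before freezing (e.g. LOG_* settings)
            return self._data[name][0] if name in self._data else default

        def freeze(self):
            # in the proposal this would return a read-only SettingsReader;
            # here it simply returns a copy of the final, prioritized values
            return dict((name, value) for name, (value, priority)
                        in self._data.items())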
SettingsReader
--------------
It's intended to be the one used by core, extensions, and all components that
use settings without modifying them. Because there are objects that do change
settings, such as ``ScrapyCommands``, the use cases of each settings class
need to be comprehensively explained.
New crawlers will be created with an instance of this class (the one returned
by the ``freeze`` method of the already populated ``SettingsLoader``), because
they are not expected to alter the settings.
It'll be read-only, keeping the same getter methods of the current
``Settings`` class (``get``, ``getint``, ``getfloat``, ``getdict``, etc.).
There could be a ``set`` method that throws a descriptive error for debugging
and backward compatibility, to avoid its inadvertent usage, as sketched below.
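A matching sketch of the read-only counterpart, again only illustrative (the
error message and the exact set of getters provided are assumptions)::

    class SettingsReader(object):
        """Frozen, read-only view over already prioritized settings."""

        def __init__(self, values):
            self._values = dict(values)  # final values, e.g. from freeze()

        def get(self, name, default=None):
            return self._values.get(name, default)

        def getint(self, name, default=0):
            return int(self.get(name, default))

        def getbool(self, name, default=False):
            return bool(self.get(name, default))

        def set(self, name, value, priority=None):
            # fails loudly to catch accidental post-population changes
            raise TypeError("SettingsReader is immutable: settings can only "
                            "be changed before they are populated")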
Setting priorities
------------------
There will be 5 setting priorities used by default:

- 0: global defaults (those in ``scrapy.settings.default_settings``)
- 10: per-command defaults (each command can provide its own default settings)
- 20: project settings (those in ``settings.py``)
- 30: per-spider settings (those returned by the ``Spider.custom_settings``
  class method)
- 40: command line arguments (those passed in the command line)
There are a couple of issues here:

- ``SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE`` and ``SCRAPY_{settings}`` are
  environment variables that need to be deprecated: they can be kept, with a
  new or an existing priority.
- We could have different priorities for settings given with the ``-s``
  argument and other named arguments in the command line (for example, ``-s
  LOG_ENABLED=False --loglevel=ERROR`` will end up with ``LOG_ENABLED`` set to
  True, because named options are processed later in the current
  implementation), but because the processing of command line options is done
  in one place, we could leave them with the same priority and depend on the
  order of the ``set`` calls just for this case, as the toy example below
  illustrates.
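The following toy example (plain Python, not Scrapy code) illustrates the
order-independence that priorities buy, and how ties fall back to call order::

    store = {}  # name -> (value, priority)

    def set_setting(name, value, priority):
        # higher priority wins; on equal priority the later call wins
        if name not in store or priority >= store[name][1]:
            store[name] = (value, priority)

    set_setting('LOG_ENABLED', False, priority=40)  # command line (-s)
    set_setting('LOG_ENABLED', True, priority=20)   # project settings.py
    assert store['LOG_ENABLED'][0] is False         # 40 still wins, despite order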
Deprecated code
---------------
The ``scrapy.conf.settings`` singleton is a deprecated way of loading and
accessing settings. It could be maintained as it is, but the singleton should
implement the new ``SettingsReader`` interface in order to work.
Spider manager
==============
Currently, the spider manager is part of the crawler, which creates a cyclic
dependency between settings and spiders, and it shouldn't belong there. The spiders
should be loaded outside and passed to the crawler object, which will require a
spider class to be instantiated.
This new spider manager will not have access to the settings (they won't be
loaded yet) so it will use scrapy.cfg to configure itself.
The ``scrapy.cfg`` would look like this::

    manager = scrapy.spidermanager.SpiderManager
    modules = myproject.spiders
- ``manager`` replaces ``SPIDER_MANAGER_CLASS`` setting and will, if omitted,
default to ``scrapy.spidermanager.SpiderManager``
- ``modules`` replaces ``SPIDER_MODULES`` setting and will be required
These ideas translate to the following changes on the ``SpiderManager`` class:
- ``__init__(spider_modules)`` -> ``__init__()``. ``spider_modules`` will be
  looked up in ``scrapy.cfg``.
- ``create('spider_name', **spider_kwargs)`` -> ``load('spider_name')``. This
  will return a spider class, not an instance (see the sketch below this
  list). It's basically a lookup in ``self._spiders``.
- All remaining functions should be deprecated or removed accordingly, since a
  crawler reference is no longer needed.
- New helper ``get_spider_manager_class_from_scrapycfg`` in
``scrapy/utils/spidermanager.py``.
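A rough sketch of the resulting class is shown below. It is only illustrative:
how the modules are read from ``scrapy.cfg`` and how spider classes are
recognized inside a module are assumptions, not part of this proposal::

    from importlib import import_module

    class SpiderManager(object):

        def __init__(self):
            self._spiders = {}
            # the real lookup of the "modules" entry in scrapy.cfg is assumed
            for modname in self._spider_modules_from_scrapycfg():
                module = import_module(modname)
                for obj in vars(module).values():
                    name = getattr(obj, 'name', None)
                    if isinstance(obj, type) and name:
                        self._spiders[name] = obj

        def _spider_modules_from_scrapycfg(self):
            return ['myproject.spiders']  # hypothetical placeholder

        def load(self, spider_name):
            """Return the spider *class* registered under spider_name."""
            return self._spiders[spider_name]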
Spiders
=======
A new class method ``custom_settings`` is proposed, which could be used to
override project and default settings before they're used to instantiate the
crawler::

    class MySpider(BaseSpider):

        @classmethod
        def custom_settings(cls):
            return {
                "DOWNLOAD_DELAY": 5.0,
                "RETRY_ENABLED": False,
            }
This will only involve a ``set`` call with the corresponding priority when
populating ``SettingsLoader``.
Contributing to the API changes, a new ``from_crawler`` class method will be
added to spiders to give them a chance to access settings, stats, or the
crawler core components themselves. This should be the new way to create a
spider from now on, instead of instantiating it directly as is currently done.
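A sketch of what that construction path could look like; the attribute names
``spider.crawler`` and ``spider.settings`` are assumptions for illustration::

    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):

        name = 'myspider'

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = cls(*args, **kwargs)
            spider.crawler = crawler            # access to core components
            spider.settings = crawler.settings  # frozen SettingsReader
            return spider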
Scrapy commands
===============
As already stated, ``ScrapyCommands`` modify the settings, so they need a
``SettingsLoader`` instance reference in order to do that.
Present ``process_options`` implementations in the base and other commands
read and override settings. These overrides should be changed to ``set`` calls
with the allocated priority, as in the sketch below.

Each command with a custom ``run`` method should be modified to reflect the
new refactored API (particularly the ``crawl`` command).
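A hypothetical sketch of that change in a command. The extra ``settings``
parameter and the option attributes are illustrative assumptions, not the
existing ``ScrapyCommand`` signature::

    class CrawlCommand(object):

        def process_options(self, args, opts, settings):
            # opts.overrides is assumed to be a {name: value} dict parsed from
            # repeated -s NAME=VALUE options
            for name, value in opts.overrides.items():
                settings.set(name, value, priority=40)
            if opts.loglevel:
                # a named option mapped to its corresponding settings
                settings.set('LOG_ENABLED', True, priority=40)
                settings.set('LOG_LEVEL', opts.loglevel, priority=40)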
CrawlerProcess
==============
``CrawlerProcess`` should be removed because the Scrapy ``crawl`` command no
longer supports running multiple spiders. The preferred way of doing this is
using the API manually, instantiating a separate Crawler for each spider, so
``CrawlerProcess`` has lost its utility.
This change is not directly related to the project (it's not focused on
settings, but it fits in the API cleanup task), yet it's a great opportunity
to consider it since we're changing the crawl startup flow.
This class will be deleted and its attributes and methods will be merged into
``Crawler``. To that effect, these are the specific merges and removals (see
the sketch after this list):

- ``self.crawlers`` doesn't make sense in this new setup; each reference will
  be replaced with ``self``.
- ``create_crawler`` will become ``Crawler.__init__``.
- ``_start_crawler`` will be merged into ``Crawler.start``.
- ``start`` will be merged into ``Crawler.crawl``, but the latter will need an
  extra boolean parameter ``start_reactor`` (default: True) to crawl with or
  without starting the Twisted reactor (this is required in ``commands.shell``
  in order to start the reactor in another thread).
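A rough sketch of the merged class under the assumptions above (method bodies
are placeholders, and the scheduling of the actual crawl is omitted)::

    from twisted.internet import reactor

    class Crawler(object):

        def __init__(self, spidercls, settings=None):
            self.spidercls = spidercls
            self.settings = settings
            self.configure()  # load extensions, middlewares, pipelines

        def configure(self):
            pass  # omitted in this sketch

        def crawl(self, start_reactor=True, **spider_kwargs):
            self.spider = self.spidercls.from_crawler(self, **spider_kwargs)
            # ... scheduling of the spider's requests is omitted ...
            if start_reactor:
                reactor.run()  # blocks until the crawl is stopped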
Startup process
===============
This summarizes the current and new proposed mechanisms for starting up a
Scrapy crawler. Imports and non-representative functions are omitted for
brevity. Most of the code here happens in the ``scrapy.cmdline`` and
``scrapy.commands.crawl`` modules.
Current (old) startup process
-----------------------------
::
    # execute in cmdline

    # loads settings.py, returns CrawlerSettings(settings_module)
    settings = get_project_settings()
    settings.defaults.update(cmd.default_settings)

    cmd.crawler_process = CrawlerProcess(settings)

    cmd.run # (In a _run_print_help call)

    # Command.run in commands/crawl.py
    crawler = self.crawler_process.create_crawler()
    spider = crawler.spiders.create(spider_name, **spider_kwargs)
    crawler.crawl(spider)
    self.crawler_process.start() # starts crawling spider

    # CrawlerProcess._start_crawler in crawler.py
    crawler.configure()
    # load extensions, middlewares, pipelines
Proposed (new) startup process
------------------------------
::
    # execute in cmdline
    smcls = get_spider_manager_class_from_scrapycfg()
    sm = smcls() # loads spiders from module defined in scrapy.cfg
    spidercls = sm.load(spider_name) # returns spider class, not instance

    settings = get_project_settings() # loads settings.py
    settings.setdict(cmd.default_settings, priority=40)
    settings.setdict(spidercls.custom_settings(), priority=30)
    settings = settings.freeze()

    cmd.crawler = Crawler(spidercls, settings=settings)

    # Crawler.__init__ in crawler.py
    self.configure()
    # load extensions, middlewares, pipelines

    cmd.run # (In a _run_print_help call)

    # Command.run in commands/crawl.py
    self.crawler.crawl(**spider_kwargs)

    # Crawler.crawl in crawler.py
    spider = self.spidercls.from_crawler(self, **spider_kwargs)
    # starts crawling spider