Commit 94ead94b authored by Pablo Hoffman

Improved documentation of Scrapy command-line tool

--HG--
rename : docs/topics/cmdline.rst => docs/topics/commands.rst
Parent 34554da2
......@@ -9,3 +9,8 @@ def setup(app):
rolename = "signal",
indextemplate = "pair: %s; signal",
)
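    # register a "command" cross-reference type so the docs can use
    # ".. command::" directives and :command:`...` roles (used in topics/commands.rst)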
app.add_crossref_type(
directivename = "command",
rolename = "command",
indextemplate = "pair: %s; command",
)
......@@ -50,6 +50,7 @@ Scraping basics
.. toctree::
:hidden:
topics/commands
topics/items
topics/spiders
topics/link-extractors
......@@ -59,6 +60,9 @@ Scraping basics
topics/item-pipeline
topics/feed-exports
:doc:`topics/commands`
Learn about the command-line tool used to manage your Scrapy project.
:doc:`topics/items`
Define the data you want to scrape.
......@@ -169,15 +173,14 @@ Reference
.. toctree::
:hidden:
topics/cmdline
topics/request-response
topics/settings
topics/signals
topics/exceptions
topics/exporters
:doc:`topics/cmdline`
Understand the command-line tool used to control your Scrapy project.
:doc:`topics/commands`
Learn about the command-line tool and see all :ref:`available commands <topics-commands-ref>`.
:doc:`topics/request-response`
Understand the classes used to represent HTTP requests and responses.
......
.. _topics-cmdline:
========================
Scrapy command line tool
========================
Scrapy is controlled through the ``scrapy`` command, which we'll refer to as
the "Scrapy tool" from now on to differentiate it from Scrapy commands.
The Scrapy tool provides several commands, for different purposes. Each command
supports its own particular syntax. In other words, each command supports a
different set of arguments and options.
This page doesn't describe each command and its syntax, but instead provides an
introduction to how the ``scrapy`` tool is used. After you learn the basics,
you can get help for each particular command using the ``scrapy`` tool itself.
Using the ``scrapy`` tool
=========================
The first thing you would do with the ``scrapy`` tool is to create your Scrapy
project::
scrapy startproject myproject
That will create a Scrapy project under the ``myproject`` directory.
Next, you go inside the new project directory::
cd myproject
And you're ready to use the ``scrapy`` command to manage and control your
project from there. For example, to create a new spider::
scrapy genspider mydomain mydomain.com
See all available commands
--------------------------
To see all available commands type::
scrapy -h
That will print a summary of all available Scrapy commands.
The first line will print the currently active project, if you're inside a
Scrapy project.
Example (with an active project)::
Scrapy X.X.X - project: myproject
Usage
=====
...
Example (with no active project)::
Scrapy X.X.X - no active project
Usage
=====
...
Get help for a particular command
---------------------------------
To get help about a particular command, including its description, usage, and
available options type::
scrapy <command> -h
Example::
scrapy crawl -h
Using the ``scrapy`` tool outside your project
==============================================
Not all commands must be run from "inside" a Scrapy project. You can, for
example, use the ``fetch`` command to download a page (using Scrapy's built-in
downloader) from outside a project. Other commands that can be used outside a
project are ``startproject`` (obviously) and ``shell``, to launch a
:ref:`Scrapy Shell <topics-shell>`.
Also, keep in mind that some commands may have slightly different behaviours
when run from inside projects. For example, the fetch command will use
spider-defined overrides (such as the ``user_agent`` attribute) if the URL being
fetched is handled by some specific project spider that happens to define a
custom ``user_agent`` attribute. This is a feature, as the ``fetch`` command is
meant to download pages as they would be downloaded by the spider.
.. _topics-commands:
=================
Command line tool
=================
Scrapy is controlled through the ``scrapy`` command-line tool, referred to here
as the "Scrapy tool" to differentiate it from its sub-commands, which we just
call "commands" or "Scrapy commands".
The Scrapy tool provides several commands, for multiple purposes, and each one
accepts a different set of arguments and options.
Using the ``scrapy`` tool
=========================
You can start by running the Scrapy tool with no arguments and it will print
some usage help and the available commands::
Scrapy X.Y - no active project
Usage
=====
To run a command:
scrapy <command> [options] [args]
To get help:
scrapy <command> -h
Available commands
==================
[...]
The first line will print the currently active project, if you're inside a
Scrapy project. In this example, it was run from outside a project. If run from inside
a project it would have printed something like this::
Scrapy X.Y - project: myproject
Usage
=====
[...]
Using the ``scrapy`` tool to create projects
============================================
The first thing you typically do with the ``scrapy`` tool is create your Scrapy
project::
scrapy startproject myproject
That will create a Scrapy project under the ``myproject`` directory.
Next, you go inside the new project directory::
cd myproject
And you're ready to use the ``scrapy`` command to manage and control your
project from there.
Using the ``scrapy`` tool to control projects
=============================================
You use the ``scrapy`` tool from inside your projects to control and manage
them.
For example, to create a new spider::
scrapy genspider mydomain mydomain.com
Some Scrapy commands (like :command:`crawl`) must be run from inside a Scrapy
project. See the :ref:`commands reference <topics-commands-ref>` below for more
information on which commands must be run from inside projects, and which ones don't.
Also keep in mind that some commands may have slightly different behaviours
when run from inside projects. For example, the fetch command will use
spider-overridden behaviours (such as a custom ``user_agent`` attribute) if the
URL being fetched is associated with some specific spider. This is intentional,
as the ``fetch`` command is meant to be used to check how your spiders download
pages.
.. _topics-commands-ref:
Available tool commands
=======================
Here's a list of available built-in commands with a description and some usage
examples. Remember you can always get more info about each command by running::
scrapy <command> -h
And you can check all available commands with::
scrapy -h
.. command:: startproject
startproject
------------
+-------------------+----------------------------------------+
| Syntax: | ``scrapy startproject <project_name>`` |
+-------------------+----------------------------------------+
| Requires project: | *no* |
+-------------------+----------------------------------------+
Creates a new Scrapy project named ``project_name``, under the ``project_name``
directory.
Usage example::
$ scrapy startproject myproject
.. command:: genspider
genspider
---------
+-------------------+--------------------------------------+
| Syntax: | ``scrapy genspider <name> <domain>`` |
+-------------------+--------------------------------------+
| Requires project: | *yes* |
+-------------------+--------------------------------------+
Create a new spider in the current project.
This is just a convenient shortcut command for creating spiders based on
pre-defined templates, but certainly not the only way to create spiders. You
can just create the spider source code files yourself.
Usage example::
$ scrapy genspider example example.com
Created spider 'example' using template 'crawl' in module:
jobsbot.spiders.example
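If you prefer to write the file by hand, a minimal spider could look roughly
like the following sketch (names are illustrative, and the exact import path
depends on your Scrapy version)::

    # myproject/spiders/example.py -- a hypothetical hand-written spider
    from scrapy.spider import BaseSpider   # exposed as scrapy.Spider in newer versions

    class ExampleSpider(BaseSpider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            # extract data from the response here
            self.log("Visited %s" % response.url)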
.. command:: crawl
crawl
-----
+-------------------+-------------------------------+
| Syntax: | ``scrapy crawl <spider|url>`` |
+-------------------+-------------------------------+
| Requires project: | *yes* |
+-------------------+-------------------------------+
Start crawling a spider. If a URL is passed instead of a spider, it will start
crawling from that URL instead of the spider's start URLs.
Usage examples::
$ scrapy crawl example.com
[ ... example.com spider starts crawling ... ]
$ scrapy crawl myspider
[ ... myspider starts crawling ... ]
$ scrapy crawl http://example.com/some/page.html
[ ... spider that handles example.com starts crawling from that url ... ]
.. command:: start
start
-----
+-------------------+------------------+
| Syntax: | ``scrapy start`` |
+-------------------+------------------+
| Requires project: | *yes* |
+-------------------+------------------+
Start Scrapy in server mode.
Usage example::
$ scrapy start
[ ... scrapy starts and stays idle waiting for spiders to get scheduled ... ]
.. command:: list
list
----
+-------------------+-----------------+
| Syntax: | ``scrapy list`` |
+-------------------+-----------------+
| Requires project: | *yes* |
+-------------------+-----------------+
List all available spiders in the current project. The output is one spider per
line.
Usage example::
$ scrapy list
spider1
spider2
.. command:: fetch
fetch
-----
+-------------------+------------------------+
| Syntax: | ``scrapy fetch <url>`` |
+-------------------+------------------------+
| Requires project: | *no* |
+-------------------+------------------------+
Downloads the given URL using the Scrapy downloader and writes the contents to
standard output.
The interesting thing about this command is that it fetches the page the way
the spider would download it. For example, if the spider has a ``user_agent``
attribute which overrides the User Agent, it will use that one.
So this command can be used to "see" how your spider would fetch a certain page.
If used outside a project, no particular per-spider behaviour is applied and it
will just use the default Scrapy downloader settings.
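For example, a spider that overrides the User Agent for its requests might
declare the attribute roughly like this (a hypothetical sketch; the exact
import path depends on your Scrapy version)::

    from scrapy.spider import BaseSpider

    class ExampleSpider(BaseSpider):
        name = "example.com"
        allowed_domains = ["example.com"]
        # fetch uses this override when the requested URL is handled by this spider
        user_agent = "MyBot/1.0 (+http://www.example.com/bot.html)"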
Usage examples::
$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]
$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
'Age': ['1263 '],
'Connection': ['close '],
'Content-Length': ['596'],
'Content-Type': ['text/html; charset=UTF-8'],
'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
'Etag': ['"573c1-254-48c9c87349680"'],
'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
'Server': ['Apache/2.2.3 (CentOS)']}
.. command:: view
view
----
+-------------------+-----------------------+
| Syntax: | ``scrapy view <url>`` |
+-------------------+-----------------------+
| Requires project: | *no* |
+-------------------+-----------------------+
Opens the given URL in a browser, as your Scrapy spider would "see" it.
Sometimes spiders see pages differently from regular users, so this can be used
to check what the spider "sees" and confirm it's what you expect.
Usage example::
$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]
.. command:: shell
shell
-----
+-------------------+------------------------+
| Syntax: | ``scrapy shell [url]`` |
+-------------------+------------------------+
| Requires project: | *no* |
+-------------------+------------------------+
Starts the Scrapy shell for the given URL (if given) or empty if no URL is
given. See :ref:`topics-shell` for more info.
Usage example::
$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]
.. command:: parse
parse
-----
+-------------------+----------------------------------+
| Syntax: | ``scrapy parse <url> [options]`` |
+-------------------+----------------------------------+
| Requires project: | *yes* |
+-------------------+----------------------------------+
Fetches the given URL and parses it with the spider that handles it, using the
method passed with the ``--callback`` option, or ``parse`` if not given.
Supported options:
* ``--callback`` or ``-c``: spider method to use as callback for parsing the
response
* ``--noitems``: don't show scraped items
* ``--nolinks``: don't show extracted links
Usage example::
$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]
# Scraped Items - callback: parse ------------------------------------------------------------
MyItem({'name': u"Example item",
'category': u'Furniture',
'length': u'12 cm'}
)
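For reference, the ``MyItem`` class and ``parse_item`` callback shown above
could be declared roughly as follows (a hypothetical sketch; the field names
are taken from the example output, and ``parse_item`` would be a method of the
spider that handles www.example.com)::

    from scrapy.item import Item, Field

    class MyItem(Item):
        name = Field()
        category = Field()
        length = Field()

    # spider callback selected with --callback/-c
    def parse_item(self, response):
        item = MyItem()
        item['name'] = u"Example item"
        item['category'] = u'Furniture'
        item['length'] = u'12 cm'
        return [item]   # the parse command prints the items returned by the callback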
.. command:: settings
settings
--------
+-------------------+-------------------------------+
| Syntax: | ``scrapy settings [options]`` |
+-------------------+-------------------------------+
| Requires project: | *no* |
+-------------------+-------------------------------+
Get the value of a Scrapy setting.
If used inside a project it'll show the project setting value, otherwise it'll
show the default Scrapy value for that setting.
Example usage::
$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
.. command:: runspider
runspider
---------
+-------------------+---------------------------------------+
| Syntax: | ``scrapy runspider <spider_file.py>`` |
+-------------------+---------------------------------------+
| Requires project: | *no* |
+-------------------+---------------------------------------+
Run a spider self-contained in a Python file, without having to create a
project.
Example usage::
$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
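Such a file only needs to define the spider class, with everything the spider
needs, in a single module, for example (a minimal hypothetical sketch; the
exact import path depends on your Scrapy version)::

    # myspider.py -- run with "scrapy runspider myspider.py", no project required
    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
        name = "myspider"
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            self.log("Crawled %s" % response.url)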
......@@ -24,7 +24,8 @@ Designating the settings
When you use Scrapy, you have to tell it which settings you're using. You can
do this by using an environment variable, ``SCRAPY_SETTINGS_MODULE``, or the
``--settings`` argument of the :doc:`scrapy command </topics/cmdline>`.
``--settings`` argument of the :doc:`scrapy command-line tool
</topics/commands>`.
The value of ``SCRAPY_SETTINGS_MODULE`` should be in Python path syntax, e.g.
``myproject.settings``. Note that the settings module should be on the
......@@ -89,9 +90,10 @@ It's where most of your custom settings will be populated.
4. Default settings per-command
-------------------------------
Each :doc:`/topics/cmdline` command can have its own default settings, which
override the global default settings. Those custom command settings are
specified in the ``default_settings`` attribute of the command class.
Each :doc:`Scrapy tool </topics/commands>` command can have its own default
settings, which override the global default settings. Those custom command
settings are specified in the ``default_settings`` attribute of the command
class.
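For illustration, a custom command could declare such defaults roughly like
this (a hypothetical sketch; the base class and its import path are assumptions
that may differ between Scrapy versions)::

    from scrapy.command import ScrapyCommand   # assumed import path

    class Command(ScrapyCommand):
        requires_project = True
        # applied only while this command runs, overriding the global defaults
        default_settings = {'LOG_ENABLED': False}

        def run(self, args, opts):
            pass  # command logic goes here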
5. Default global settings
--------------------------
......@@ -223,8 +225,7 @@ project name). This will be used to construct the User-Agent by default, and
also for logging.
It's automatically populated with your project name when you create your
project with the :doc:`scrapy </topics/cmdline>` ``startproject``
command.
project with the :command:`startproject` command.
.. setting:: BOT_VERSION
......@@ -720,7 +721,7 @@ NEWSPIDER_MODULE
Default: ``''``
Module where to create new spiders using the ``genspider`` command.
Module where to create new spiders using the :command:`genspider` command.
Example::
......@@ -996,8 +997,8 @@ TEMPLATES_DIR
Default: ``templates`` dir inside scrapy module
The directory where to look for template when creating new projects with
:doc:`scrapy startproject </topics/cmdline>` command.
The directory where to look for templates when creating new projects with the
:command:`startproject` command.
.. setting:: URLLENGTH_LIMIT
......