.. _news:

Release notes
=============

Scrapy 1.2.2 (2016-12-XX)
-------------------------

Bug fixes
~~~~~~~~~

- Fix a cryptic traceback when a pipeline fails on ``open_spider()``
  (:issue:`2011`)
- Fix embedded IPython shell variables (fixing :issue:`396`, which
  re-appeared in 1.2.0; fixed by :issue:`2418`)
- A couple of patches when dealing with robots.txt:

  - handle (non-standard) relative sitemap URLs (:issue:`2390`)
  - handle non-ASCII URLs and User-Agents in Python 2 (:issue:`2373`)

Documentation
~~~~~~~~~~~~~

- Document ``"download_latency"`` key in ``Request``'s ``meta`` dict
  (:issue:`2033`)
- Remove page on (deprecated & unsupported) Ubuntu packages from ToC
  (:issue:`2335`)
- A few typo fixes (:issue:`2346`, :issue:`2369`, :issue:`2380`)
  and clarifications (:issue:`2354`, :issue:`2325`)

Other changes
~~~~~~~~~~~~~

- Advertise `conda-forge`_ as Scrapy's official conda channel (:issue:`2387`)
- More helpful error messages when trying to use ``.css()`` or ``.xpath()``
  on non-text responses (:issue:`2264`)
- ``startproject`` command now generates a sample ``middlewares.py`` file
  (:issue:`2335`)
- Add more dependencies' version info in ``scrapy version`` verbose output
  (:issue:`2404`)
- Remove all ``*.pyc`` files from source distribution (:issue:`2386`)

.. _conda-forge: https://anaconda.org/conda-forge/scrapy


Scrapy 1.2.1 (2016-10-21)
-------------------------

Bug fixes
~~~~~~~~~

- Include OpenSSL's more permissive default ciphers when establishing
  TLS/SSL connections (:issue:`2314`).
- Fix "Location" HTTP header decoding on non-ASCII URL redirects
  (:issue:`2321`).

Documentation
~~~~~~~~~~~~~

- Fix JsonWriterPipeline example (:issue:`2302`).
- Various notes: :issue:`2330` on spider names, :issue:`2329` on the
  processing order of middleware methods, :issue:`2327` on getting
  multi-valued HTTP headers as lists.

Other changes
~~~~~~~~~~~~~

- Removed ``www.`` from ``start_urls`` in built-in spider templates
  (:issue:`2299`).


Scrapy 1.2.0 (2016-10-03)
-------------------------

New features
~~~~~~~~~~~~

- New :setting:`FEED_EXPORT_ENCODING` setting to customize the encoding
  used when writing items to a file.
  This can be used to turn off ``\uXXXX`` escapes in JSON output.
  It is also useful for those wanting something other than UTF-8
  for XML or CSV output (:issue:`2034`); see the sketch after this list.
- ``startproject`` command now supports an optional destination directory
  to override the default one based on the project name (:issue:`2005`).
- New :setting:`SCHEDULER_DEBUG` setting to log requests serialization
  failures (:issue:`1610`).
- JSON encoder now supports serialization of ``set`` instances
  (:issue:`2058`).
- Interpret ``application/json-amazonui-streaming`` as ``TextResponse``
  (:issue:`1503`).
- ``scrapy`` is imported by default when using shell tools
  (:command:`shell`, :ref:`inspect_response <topics-shell-inspect-response>`)
  (:issue:`2248`).
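As a minimal sketch of how :setting:`FEED_EXPORT_ENCODING` might be used in
a project's ``settings.py`` (the feed file name here is illustrative)::

    # settings.py
    FEED_URI = 'items.json'
    FEED_FORMAT = 'json'
    # Write feed items as plain UTF-8 instead of \uXXXX escapes.
    FEED_EXPORT_ENCODING = 'utf-8'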
Bug fixes
~~~~~~~~~

- DefaultRequestHeaders middleware now runs before UserAgent middleware
  (:issue:`2088`). **Warning: this is technically backwards incompatible**,
  though we consider this a bug fix.
- HTTP cache extension and plugins that use the ``.scrapy`` data directory
  now work outside projects (:issue:`1581`). **Warning: this is technically
  backwards incompatible**, though we consider this a bug fix.
- ``Selector`` does not allow passing both ``response`` and ``text`` anymore
  (:issue:`2153`).
- Fixed logging of wrong callback name with ``scrapy parse`` (:issue:`2169`).
- Fix for an odd gzip decompression bug (:issue:`1606`).
- Fix for selected callbacks when using ``CrawlSpider`` with
  :command:`parse` (:issue:`2225`).
- Fix for invalid JSON and XML files when spider yields no items
  (:issue:`872`).
- Implement ``flush()`` for ``StreamLogger``, avoiding a warning in logs
  (:issue:`2125`).

Refactoring
~~~~~~~~~~~

- ``canonicalize_url`` has been moved to `w3lib.url`_ (:issue:`2168`).

.. _w3lib.url: http://w3lib.readthedocs.io/en/latest/w3lib.html#w3lib.url.canonicalize_url

Tests & Requirements
~~~~~~~~~~~~~~~~~~~~

Scrapy's new requirements baseline is Debian 8 "Jessie". It was previously
Ubuntu 12.04 Precise.
What this means in practice is that we run continuous integration tests
with these (main) package versions at a minimum: Twisted 14.0,
pyOpenSSL 0.14, lxml 3.4.

Scrapy may very well work with older versions of these packages (the code
base still has switches for older Twisted versions for example) but it is
not guaranteed (because it's not tested anymore).

Documentation
~~~~~~~~~~~~~

- Grammar fixes: :issue:`2128`, :issue:`1566`.
- Download stats badge removed from README (:issue:`2160`).
- New Scrapy :ref:`architecture diagram <topics-architecture>`
  (:issue:`2165`).
- Updated ``Response`` parameters documentation (:issue:`2197`).
- Reworded misleading :setting:`RANDOMIZE_DOWNLOAD_DELAY` description
  (:issue:`2190`).
- Add StackOverflow as a support channel (:issue:`2257`).


Scrapy 1.1.3 (2016-09-22)
-------------------------

Bug fixes
~~~~~~~~~

- Class attributes for subclasses of ``ImagesPipeline`` and
  ``FilesPipeline`` work as they did before 1.1.1
  (:issue:`2243`, fixes :issue:`2198`)

Documentation
~~~~~~~~~~~~~

- :ref:`Overview <intro-overview>` and :ref:`tutorial <intro-tutorial>`
  rewritten to use http://toscrape.com websites
  (:issue:`2236`, :issue:`2249`, :issue:`2252`).


Scrapy 1.1.2 (2016-08-18)
-------------------------

Bug fixes
~~~~~~~~~

- Introduce a missing :setting:`IMAGES_STORE_S3_ACL` setting to override
  the default ACL policy in ``ImagesPipeline`` when uploading images to S3
  (note that the default ACL policy is "private" -- instead of
  "public-read" -- since Scrapy 1.1.0); a settings sketch follows this list
- :setting:`IMAGES_EXPIRES` default value set back to 90 (the regression
  was introduced in 1.1.1)
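A minimal sketch of how these image-pipeline settings might look in a
project's ``settings.py`` (the bucket name and path are illustrative)::

    # settings.py
    ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
    IMAGES_STORE = 's3://my-bucket/images/'
    # Make uploaded images publicly readable again; the default ACL is
    # "private" since Scrapy 1.1.0.
    IMAGES_STORE_S3_ACL = 'public-read'
    # Skip re-downloading images fetched within the last 90 days.
    IMAGES_EXPIRES = 90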
Scrapy 1.1.1 (2016-07-13)
-------------------------

Bug fixes
~~~~~~~~~

- Add "Host" header in CONNECT requests to HTTPS proxies (:issue:`2069`)
- Use response ``body`` when choosing response class
  (:issue:`2001`, fixes :issue:`2000`)
- Do not fail on canonicalizing URLs with wrong netlocs
  (:issue:`2038`, fixes :issue:`2010`)
- A few fixes for ``HttpCompressionMiddleware`` (and ``SitemapSpider``):

  - Do not decode HEAD responses (:issue:`2008`, fixes :issue:`1899`)
  - Handle charset parameter in gzip Content-Type header
    (:issue:`2050`, fixes :issue:`2049`)
  - Do not decompress gzip octet-stream responses
    (:issue:`2065`, fixes :issue:`2063`)

- Catch (and ignore with a warning) exceptions when verifying certificates
  against IP-address hosts (:issue:`2094`, fixes :issue:`2092`)
- Make ``FilesPipeline`` and ``ImagesPipeline`` backward compatible again
  regarding the use of legacy class attributes for customization
  (:issue:`1989`, fixes :issue:`1985`)

New features
~~~~~~~~~~~~

- Enable genspider command outside project folder (:issue:`2052`)
- Retry HTTPS CONNECT ``TunnelError`` by default (:issue:`1974`)

Documentation
~~~~~~~~~~~~~

- Document ``FEED_TEMPDIR`` setting at its lexicographical position
  (:commit:`9b3c72c`)
- Use idiomatic ``.extract_first()`` in overview (:issue:`1994`)
- Update years in copyright notice (:commit:`c2c8036`)
- Add information and example on errbacks (:issue:`1995`)
- Use "url" variable in downloader middleware example (:issue:`2015`)
- Grammar fixes (:issue:`2054`, :issue:`2120`)
- New FAQ entry on using BeautifulSoup in spider callbacks (:issue:`2048`)
- Add notes about Scrapy not working on Windows with Python 3
  (:issue:`2060`)
- Encourage complete titles in pull requests (:issue:`2026`)

Tests
~~~~~

- Upgrade py.test requirement on Travis CI and pin pytest-cov to 2.2.1
  (:issue:`2095`)


Scrapy 1.1.0 (2016-05-11)
-------------------------

This 1.1 release brings a lot of interesting features and bug fixes:

- Scrapy 1.1 has beta Python 3 support (requires Twisted >= 15.5). See
  :ref:`news_betapy3` for more details and some limitations.
- Hot new features:

  - Item loaders now support nested loaders (:issue:`1467`).
  - ``FormRequest.from_response`` improvements (:issue:`1382`,
    :issue:`1137`).
  - Added setting :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` and improved
    AutoThrottle docs (:issue:`1324`).
  - Added ``response.text`` to get body as unicode (:issue:`1730`).
  - Anonymous S3 connections (:issue:`1358`).
  - Deferreds in downloader middlewares (:issue:`1473`). This enables better
    robots.txt handling (:issue:`1471`).
  - HTTP caching now follows RFC2616 more closely, added settings
    :setting:`HTTPCACHE_ALWAYS_STORE` and
    :setting:`HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS` (:issue:`1151`).
  - Selectors were extracted to the parsel_ library (:issue:`1409`).
    This means you can use Scrapy Selectors without Scrapy and also
    upgrade the selectors engine without needing to upgrade Scrapy.
  - HTTPS downloader now does TLS protocol negotiation by default,
    instead of forcing TLS 1.0. You can also set the SSL/TLS method
    using the new :setting:`DOWNLOADER_CLIENT_TLS_METHOD`.

- These bug fixes may require your attention (a settings sketch follows
  this list):

  - Don't retry bad requests (HTTP 400) by default (:issue:`1289`).
    If you need the old behavior, add ``400`` to
    :setting:`RETRY_HTTP_CODES`.
  - Fix shell files argument handling (:issue:`1710`, :issue:`1550`).
    If you try ``scrapy shell index.html`` it will try to load the URL
    http://index.html; use ``scrapy shell ./index.html`` to load a local
    file instead.
  - Robots.txt compliance is now enabled by default for newly-created
    projects (:issue:`1724`). Scrapy will also wait for robots.txt to be
    downloaded before proceeding with the crawl (:issue:`1735`). If you
    want to disable this behavior, update :setting:`ROBOTSTXT_OBEY` in
    ``settings.py`` after creating a new project.
  - Exporters now work on unicode, instead of bytes, by default
    (:issue:`1080`). If you use ``PythonItemExporter``, you may want to
    update your code to disable binary mode, which is now deprecated.
  - Accept XML node names containing dots as valid (:issue:`1533`).
  - When uploading files or images to S3 (with ``FilesPipeline`` or
    ``ImagesPipeline``), the default ACL policy is now "private" instead
    of "public" **Warning: backwards incompatible!**.
    You can use :setting:`FILES_STORE_S3_ACL` to change it.
  - We've reimplemented ``canonicalize_url()`` for more correct output,
    especially for URLs with non-ASCII characters (:issue:`1947`).
    This could change link extractors output compared to previous Scrapy
    versions. This may also invalidate some cache entries you could still
    have from pre-1.1 runs. **Warning: backwards incompatible!**.
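A minimal ``settings.py`` sketch restoring the pre-1.1 behavior described in
the first and third items above (only useful if you rely on the old
defaults)::

    # settings.py
    from scrapy.settings.default_settings import RETRY_HTTP_CODES

    # Retry HTTP 400 responses again; Scrapy 1.1 stopped retrying them
    # by default, so extend the default list rather than replace it.
    RETRY_HTTP_CODES = RETRY_HTTP_CODES + [400]

    # Do not download or obey robots.txt; new projects now default to True.
    ROBOTSTXT_OBEY = False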
Keep reading for more details on other improvements and bug fixes.

.. _news_betapy3:

Beta Python 3 Support
~~~~~~~~~~~~~~~~~~~~~

We have been hard at work to make Scrapy run on Python 3. As a result, now
you can run spiders on Python 3.3, 3.4 and 3.5 (Twisted >= 15.5 required).

Some features are still missing (and some may never be ported).
Almost all built-in extensions/middlewares are expected to work.
However, we are aware of some limitations in Python 3:

- Scrapy does not work on Windows with Python 3
- Sending emails is not supported
- FTP download handler is not supported
- Telnet console is not supported

Additional New Features and Enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Scrapy now has a `Code of Conduct`_ (:issue:`1681`).
- Command line tool now has completion for zsh (:issue:`934`).
- Improvements to ``scrapy shell``:

  - Support for bpython and configuring the preferred Python shell via
    ``SCRAPY_PYTHON_SHELL`` (:issue:`1100`, :issue:`1444`).
  - Support URLs without scheme (:issue:`1498`)
    **Warning: backwards incompatible!**
  - Bring back support for relative file paths (:issue:`1710`,
    :issue:`1550`).

- Added :setting:`MEMUSAGE_CHECK_INTERVAL_SECONDS` setting to change default
  check interval (:issue:`1282`).
- Download handlers are now lazy-loaded on first request using their scheme
  (:issue:`1390`, :issue:`1421`).
- HTTPS download handlers do not force TLS 1.0 anymore; instead, OpenSSL's
  ``SSLv23_method()/TLS_method()`` is used, allowing them to negotiate the
  highest TLS protocol version the remote host supports
  (:issue:`1794`, :issue:`1629`).
- ``RedirectMiddleware`` now skips the status codes from
  ``handle_httpstatus_list``, whether set as a spider attribute or in
  ``Request``'s ``meta`` key (:issue:`1334`, :issue:`1364`, :issue:`1447`);
  see the sketch below.
- Form submission:

  - now works with ``
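Illustrating the ``handle_httpstatus_list`` change above, a minimal spider
sketch (the URL and callback logic are illustrative)::

    import scrapy

    class StatusSpider(scrapy.Spider):
        name = 'status'
        # RedirectMiddleware now leaves 301/302 responses alone for this
        # spider, so the callback receives them instead of the redirect
        # target.
        handle_httpstatus_list = [301, 302]

        def start_requests(self):
            yield scrapy.Request('http://example.com/moved',
                                 callback=self.parse)

        def parse(self, response):
            # The raw redirect response is handed to the spider.
            self.logger.info('Got status %s', response.status)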