1. 04 March 2016 (8 commits)
  2. 15 February 2016 (1 commit)
    • nbd: Create size change events for userspace · 37091fdd
      Markus Pargmann committed
      Userspace needs to know when nbd devices are ready for use. Currently
      no events are generated for userspace, which doesn't work for systemd.
      
      See the discussion here: https://github.com/systemd/systemd/pull/358
      
      This patch uses a central point to set up the nbd-internal sizes. An
      ioctl that sets a size does not lead to a visible size change. The
      size of the block device is kept at 0 until nbd is connected. As soon
      as it connects, the size is changed to the real value and a uevent is
      created. When disconnecting, the block device is set back to size 0
      and another uevent is generated.
      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      37091fdd
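      A minimal sketch of the central size-update idea described above
      (helper and field names are illustrative, not necessarily the patch's
      exact code):
      
        static void nbd_size_update(struct nbd_device *nbd,
                                    struct block_device *bdev)
        {
                bd_set_size(bdev, nbd->bytesize);
                set_capacity(nbd->disk, nbd->bytesize >> 9);
                /* emit a change uevent so userspace (e.g. systemd) can
                 * notice that the device now has its real size */
                kobject_uevent(&disk_to_dev(nbd->disk)->kobj, KOBJ_CHANGE);
        }
      
      Connect and disconnect both funnel through one such helper: connect
      sets bytesize to the negotiated value, disconnect sets it back to 0,
      and each transition generates a uevent.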
  3. 05 February 2016 (5 commits)
    • nbd: ratelimit error msgs after socket close · da6ccaaa
      Dan Streetman committed
      Make the "Attempted send on closed socket" error messages generated in
      nbd_request_handler() ratelimited.
      
      When the nbd socket is shut down, nbd_request_handler() emits an
      error message for every request remaining in its queue. If the queue
      is large, this spams a large number of messages to the log. There's
      no need for a separate error message per request, so this patch
      ratelimits them.
      
      In the specific case this was found, the system was virtual and the error
      messages were logged to the serial port, which overwhelmed it.
      
      Fixes: 4d48a542 ("nbd: fix I/O hang on disconnected nbds")
      Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      da6ccaaa
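      The change is essentially a switch to the ratelimited device-error
      helper; a sketch (message text taken from the description above):
      
        /* before: one error line per queued request after socket close */
        dev_err(disk_to_dev(nbd->disk),
                "Attempted send on closed socket\n");
      
        /* after: ratelimited, so a long queue cannot flood the log */
        dev_err_ratelimited(disk_to_dev(nbd->disk),
                            "Attempted send on closed socket\n");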
    • nbd: Move flag parsing to a function · d02cf531
      Markus Pargmann committed
      nbd changes properties of the block device depending on the flags it
      received. This patch moves the flag parsing into a separate function,
      nbd_parse_flags().
      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      d02cf531
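      A sketch of what such a helper can look like (NBD protocol flag
      names; the exact body is an assumption based on the description):
      
        static void nbd_parse_flags(struct nbd_device *nbd,
                                    struct block_device *bdev)
        {
                if (nbd->flags & NBD_FLAG_READ_ONLY)
                        set_device_ro(bdev, true);
                if (nbd->flags & NBD_FLAG_SEND_TRIM)
                        queue_flag_set_unlocked(QUEUE_FLAG_DISCARD,
                                                nbd->disk->queue);
                if (nbd->flags & NBD_FLAG_SEND_FLUSH)
                        blk_queue_flush(nbd->disk->queue, REQ_FLUSH);
                else
                        blk_queue_flush(nbd->disk->queue, 0);
        }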
    • nbd: Cleanup reset of nbd and bdev after a disconnect · 0e4f0f6f
      Markus Pargmann committed
      Group all variables that are reset after a disconnect into reset
      functions. This patch adds two of these functions, nbd_reset() and
      nbd_bdev_reset().
      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      0e4f0f6f
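      A sketch of the two helpers (the field choices are illustrative, not
      necessarily the patch's exact set):
      
        static void nbd_reset(struct nbd_device *nbd)
        {
                nbd->runtime_flags = 0;
                nbd->blksize = 1024;
                nbd->bytesize = 0;
                set_capacity(nbd->disk, 0);
                nbd->flags = 0;
        }
      
        static void nbd_bdev_reset(struct block_device *bdev)
        {
                set_device_ro(bdev, false);
                bdev->bd_inode->i_size = 0;
        }
      
      Grouping the resets this way keeps the disconnect path and the
      device-close path from drifting apart.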
    • nbd: Timeouts are not user requested disconnects · 1f7b5cf1
      Markus Pargmann committed
      It may be useful for the client to know that a connection timed out.
      The current code returns success for a timeout.
      
      This patch reports the error code -ETIMEDOUT for a timeout.
      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      1f7b5cf1
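      Sketched, at the point where the connection ends (the timed-out flag
      is hypothetical):
      
        /* a user-requested disconnect still returns 0; only an actual
         * timeout reports an error back to the client */
        return nbd->timedout ? -ETIMEDOUT : 0;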
    • nbd: Remove signal usage · 23272a67
      Markus Pargmann committed
      As discussed on the mailing list, the use of signals for timeout
      handling has a lot of potential issues. The nbd driver used signals
      for timeouts for some time; these signals were able to get the
      threads out of their blocking socket operations.
      
      This patch removes all signal usage and uses a socket shutdown
      instead. The socket descriptor itself is cleared later, when the
      whole nbd device is closed.
      
      The tasks_lock is removed as we no longer depend on it. Instead, a
      new lock for the socket is introduced so we can safely work with the
      socket in the timeout handler, outside of the two main threads.
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      23272a67
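      A sketch of the shutdown-based timeout handling (lock and field names
      are illustrative):
      
        static void sock_shutdown(struct nbd_device *nbd)
        {
                spin_lock(&nbd->sock_lock);
                if (nbd->sock) {
                        dev_warn(disk_to_dev(nbd->disk),
                                 "shutting down socket\n");
                        /* unblocks the send/receive threads without
                         * signals; the socket itself is released later,
                         * when the nbd device is closed */
                        kernel_sock_shutdown(nbd->sock, SHUT_RDWR);
                }
                spin_unlock(&nbd->sock_lock);
        }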
  4. 03 February 2016 (1 commit)
    • nbd: Fix debugfs error handling · 27ea43fe
      Markus Pargmann committed
      A static checker complains about the implemented error handling, and
      it is indeed wrong: we don't care about the return values of created
      debugfs files.
      
      We only have to check the return values of created directories for a
      NULL pointer. If we use a NULL pointer as the parent directory for
      files, the files may end up in the wrong place in debugfs.
      Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
      27ea43fe
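      In short, only directory creation needs a NULL check; a sketch with
      hypothetical nbd debugfs entries:
      
        static int nbd_dev_dbg_init(struct nbd_device *nbd)
        {
                struct dentry *dir;
      
                dir = debugfs_create_dir(nbd_name(nbd), nbd_dbg_dir);
                if (!dir)      /* a NULL parent would misplace the files */
                        return -EIO;
                /* file return values are deliberately not checked */
                debugfs_create_u64("size_bytes", 0444, dir, &nbd->bytesize);
                debugfs_create_u32("flags", 0444, dir, &nbd->flags);
                return 0;
        }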
  5. 23 January 2016 (2 commits)
  6. 22 January 2016 (1 commit)
  7. 16 January 2016 (6 commits)
    • mm, dax, pmem: introduce pfn_t · 34c0fd54
      Dan Williams committed
      For the purpose of communicating the optional presence of a 'struct
      page' for the pfn returned from ->direct_access(), introduce a type
      that encapsulates a page-frame-number plus flags. These flags contain
      the historical "page_link" encoding for a scatterlist entry, but can
      also denote "device memory": a set of pfns that are not part of the
      kernel's linear mapping by default, but are accessed via the same
      memory controller as RAM.
      
      The motivation for this new type is large capacity persistent memory
      that needs struct page entries in the 'memmap' to support 3rd party DMA
      (i.e.  O_DIRECT I/O with a persistent memory source/target).  However,
      we also need it in support of maintaining a list of mapped inodes which
      need to be unmapped at driver teardown or freeze_bdev() time.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34c0fd54
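      A sketch of the type, following the description (flag bit positions
      are illustrative):
      
        typedef struct {
                u64 val;
        } pfn_t;
      
        /* the topmost bits carry the flags */
        #define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1))
        #define PFN_SG_LAST  (1ULL << (BITS_PER_LONG_LONG - 2))
        #define PFN_DEV      (1ULL << (BITS_PER_LONG_LONG - 3)) /* device memory */
      
        static inline pfn_t pfn_to_pfn_t(unsigned long pfn)
        {
                return (pfn_t) { .val = pfn };
        }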
    • zram: don't call idr_remove() from zram_remove() · 17ec4cd9
      Jerome Marchand committed
      The use of idr_remove() is forbidden in the callback functions of
      idr_for_each(). It is therefore unsafe to call idr_remove() in
      zram_remove().
      
      This patch moves the call to idr_remove() from zram_remove() to
      hot_remove_store(). In the destroy_devices() path, idrs are removed
      by idr_destroy(). This solves a use-after-free detected by KASan.
      
      [akpm@linux-foundation.org: fix coding style, per Sergey]
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17ec4cd9
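      A sketch of the now-safe call site (based on the description; the
      exact locking in zram may differ):
      
        /* hot_remove_store() is not an idr_for_each() callback, so
         * removing the idr entry here is allowed */
        mutex_lock(&zram_index_mutex);
        zram = idr_find(&zram_index_idr, dev_id);
        if (!zram) {
                ret = -ENODEV;
        } else {
                ret = zram_remove(zram); /* no longer calls idr_remove() */
                if (!ret)
                        idr_remove(&zram_index_idr, dev_id);
        }
        mutex_unlock(&zram_index_mutex);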
    • zram/zcomp: do not zero out zcomp private pages · e02d238c
      Sergey Senozhatsky committed
      Do not __GFP_ZERO the allocated zcomp ->private pages. We keep
      allocated streams around and use them for read/write requests, so we
      supply a zeroed-out ->private to the compression algorithm as a
      scratch buffer only once -- the first time we use that stream. For
      the rest of the IO requests served by this stream, ->private usually
      contains temporary data from previous requests.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e02d238c
    • zram: pass gfp from zcomp frontend to backend · 75d8947a
      Minchan Kim committed
      Each zcomp backend uses its own gfp flag, but this is pointless
      because the context they are called in is driven by the upper layer
      (i.e. the zcomp frontend). Moreover, the zcomp frontend can call them
      in different contexts: in one (zram init) the allocation must be made
      sure to succeed, while in another (further stream allocation to
      accelerate I/O) it is merely optional. So let's pass gfp down from
      the driver (the zcomp frontend), following normal MM convention.
      
      [sergey.senozhatsky@gmail.com: add missing __vmalloc zero and highmem gfps]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75d8947a
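      A sketch of the convention: the caller picks the flags that match its
      context and passes them all the way down (function shapes are
      assumptions based on the description):
      
        static struct zcomp_strm *zcomp_strm_alloc(struct zcomp *comp,
                                                   gfp_t flags)
        {
                struct zcomp_strm *zstrm = kmalloc(sizeof(*zstrm), flags);
      
                if (!zstrm)
                        return NULL;
                /* the backend no longer hardcodes its own gfp flags */
                zstrm->private = comp->backend->create(flags);
                if (!zstrm->private) {
                        kfree(zstrm);
                        return NULL;
                }
                return zstrm;
        }
      
        /* init path: allocation should succeed, so GFP_KERNEL */
        zstrm = zcomp_strm_alloc(comp, GFP_KERNEL);
        /* IO path: an extra stream is optional, failure is fine */
        zstrm = zcomp_strm_alloc(comp, GFP_NOIO | __GFP_NOWARN);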
    • zram: try vmalloc() after kmalloc() · d913897a
      Kyeongdon Kim committed
      When we're using LZ4 multi compression streams for zram swap, we saw
      page allocation failure messages while running tests -- not just
      once, but a few times (2-5 per test). Some of the failures also kept
      occurring for order-3 allocation attempts.
      
      To build the per-stream compression private data, we call kzalloc()
      with order 2/3 at runtime (lzo/lz4). If no order-2/3 memory is
      available at that moment, the page allocation fails. This patch uses
      vmalloc() as a fallback for kmalloc(), which prevents the page
      allocation failure warning.
      
      With this change we never saw the warning message during testing, and
      it also reduced process startup latency by about 60-120ms in each
      case.
      
      For reference, a call trace:
      
          Binder_1: page allocation failure: order:3, mode:0x10c0d0
          CPU: 0 PID: 424 Comm: Binder_1 Tainted: GW 3.10.49-perf-g991d02b-dirty #20
          Call trace:
            dump_backtrace+0x0/0x270
            show_stack+0x10/0x1c
            dump_stack+0x1c/0x28
            warn_alloc_failed+0xfc/0x11c
            __alloc_pages_nodemask+0x724/0x7f0
            __get_free_pages+0x14/0x5c
            kmalloc_order_trace+0x38/0xd8
            zcomp_lz4_create+0x2c/0x38
            zcomp_strm_alloc+0x34/0x78
            zcomp_strm_multi_find+0x124/0x1ec
            zcomp_strm_find+0xc/0x18
            zram_bvec_rw+0x2fc/0x780
            zram_make_request+0x25c/0x2d4
            generic_make_request+0x80/0xbc
            submit_bio+0xa4/0x15c
            __swap_writepage+0x218/0x230
            swap_writepage+0x3c/0x4c
            shrink_page_list+0x51c/0x8d0
            shrink_inactive_list+0x3f8/0x60c
            shrink_lruvec+0x33c/0x4cc
            shrink_zone+0x3c/0x100
            try_to_free_pages+0x2b8/0x54c
            __alloc_pages_nodemask+0x514/0x7f0
            __get_free_pages+0x14/0x5c
            proc_info_read+0x50/0xe4
            vfs_read+0xa0/0x12c
            SyS_read+0x44/0x74
          DMA: 3397*4kB (MC) 26*8kB (RC) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
               0*512kB 0*1024kB 0*2048kB 0*4096kB = 13796kB
      
      [minchan@kernel.org: change vmalloc gfp and adding comment about gfp]
      [sergey.senozhatsky@gmail.com: tweak comments and styles]
      Signed-off-by: Kyeongdon Kim <kyeongdon.kim@lge.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d913897a
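      A sketch of the fallback (the real patch also tunes the gfp flags of
      each attempt):
      
        static void *zcomp_lz4_create(gfp_t flags)
        {
                void *ret;
      
                /* try physically contiguous memory first, quietly */
                ret = kzalloc(LZ4_MEM_COMPRESS, flags | __GFP_NOWARN);
                if (!ret)
                        /* order-2/3 pages may be unavailable even when
                         * plenty of memory is free, so fall back to
                         * virtually contiguous memory */
                        ret = __vmalloc(LZ4_MEM_COMPRESS,
                                        flags | __GFP_ZERO | __GFP_HIGHMEM,
                                        PAGE_KERNEL);
                return ret;
        }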
    • zram/zcomp: use GFP_NOIO to allocate streams · 3d5fe03a
      Sergey Senozhatsky committed
      We can end up allocating a new compression stream with GFP_KERNEL
      from within the IO path, which may result in nested (recursive) IO
      operations. That can introduce problems if the IO path in question is
      a reclaimer holding locks that the nested IO will deadlock on.
      
      Allocate streams and working memory using the GFP_NOIO flag,
      forbidding recursive IO and FS operations.
      
      An example:
      
        inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
        git/20158 [HC0[0]:SC0[0]:HE1:SE1] takes:
         (jbd2_handle){+.+.?.}, at:  start_this_handle+0x4ca/0x555
        {IN-RECLAIM_FS-W} state was registered at:
           __lock_acquire+0x8da/0x117b
           lock_acquire+0x10c/0x1a7
           start_this_handle+0x52d/0x555
           jbd2__journal_start+0xb4/0x237
           __ext4_journal_start_sb+0x108/0x17e
           ext4_dirty_inode+0x32/0x61
           __mark_inode_dirty+0x16b/0x60c
           iput+0x11e/0x274
           __dentry_kill+0x148/0x1b8
           shrink_dentry_list+0x274/0x44a
           prune_dcache_sb+0x4a/0x55
           super_cache_scan+0xfc/0x176
           shrink_slab.part.14.constprop.25+0x2a2/0x4d3
           shrink_zone+0x74/0x140
           kswapd+0x6b7/0x930
           kthread+0x107/0x10f
           ret_from_fork+0x3f/0x70
        irq event stamp: 138297
        hardirqs last  enabled at (138297):  debug_check_no_locks_freed+0x113/0x12f
        hardirqs last disabled at (138296):  debug_check_no_locks_freed+0x33/0x12f
        softirqs last  enabled at (137818):  __do_softirq+0x2d3/0x3e9
        softirqs last disabled at (137813):  irq_exit+0x41/0x95
      
                     other info that might help us debug this:
         Possible unsafe locking scenario:
               CPU0
               ----
          lock(jbd2_handle);
          <Interrupt>
            lock(jbd2_handle);
      
                      *** DEADLOCK ***
        5 locks held by git/20158:
         #0:  (sb_writers#7){.+.+.+}, at: [<ffffffff81155411>] mnt_want_write+0x24/0x4b
         #1:  (&type->i_mutex_dir_key#2/1){+.+.+.}, at: [<ffffffff81145087>] lock_rename+0xd9/0xe3
         #2:  (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8114f8e2>] lock_two_nondirectories+0x3f/0x6b
         #3:  (&sb->s_type->i_mutex_key#11/4){+.+.+.}, at: [<ffffffff8114f909>] lock_two_nondirectories+0x66/0x6b
         #4:  (jbd2_handle){+.+.?.}, at: [<ffffffff811e31db>] start_this_handle+0x4ca/0x555
      
                     stack backtrace:
        CPU: 2 PID: 20158 Comm: git Not tainted 4.1.0-rc7-next-20150615-dbg-00016-g8bdf555-dirty #211
        Call Trace:
          dump_stack+0x4c/0x6e
          mark_lock+0x384/0x56d
          mark_held_locks+0x5f/0x76
          lockdep_trace_alloc+0xb2/0xb5
          kmem_cache_alloc_trace+0x32/0x1e2
          zcomp_strm_alloc+0x25/0x73 [zram]
          zcomp_strm_multi_find+0xe7/0x173 [zram]
          zcomp_strm_find+0xc/0xe [zram]
          zram_bvec_rw+0x2ca/0x7e0 [zram]
          zram_make_request+0x1fa/0x301 [zram]
          generic_make_request+0x9c/0xdb
          submit_bio+0xf7/0x120
          ext4_io_submit+0x2e/0x43
          ext4_bio_write_page+0x1b7/0x300
          mpage_submit_page+0x60/0x77
          mpage_map_and_submit_buffers+0x10f/0x21d
          ext4_writepages+0xc8c/0xe1b
          do_writepages+0x23/0x2c
          __filemap_fdatawrite_range+0x84/0x8b
          filemap_flush+0x1c/0x1e
          ext4_alloc_da_blocks+0xb8/0x117
          ext4_rename+0x132/0x6dc
          ? mark_held_locks+0x5f/0x76
          ext4_rename2+0x29/0x2b
          vfs_rename+0x540/0x636
          SyS_renameat2+0x359/0x44d
          SyS_rename+0x1e/0x20
          entry_SYSCALL_64_fastpath+0x12/0x6f
      
      [minchan@kernel.org: add stable mark]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Kyeongdon Kim <kyeongdon.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d5fe03a
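      The fix itself is in the allocation flags; a sketch:
      
        /* before: reclaim triggered from here can recurse into FS/IO */
        zstrm = kmalloc(sizeof(*zstrm), GFP_KERNEL);
      
        /* after: reclaim may still run, but cannot issue nested IO */
        zstrm = kmalloc(sizeof(*zstrm), GFP_NOIO);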
  8. 14 January 2016 (1 commit)
    • null_blk: use sector_div instead of do_div · e93d12ae
      Arnd Bergmann committed
      Dividing a sector_t number should be done using sector_div rather than do_div
      to optimize the 32-bit sector_t case, and with the latest do_div optimizations,
      we now get a compile-time warning for this:
      
      arch/arm/include/asm/div64.h:32:95: note: expected 'uint64_t * {aka long long unsigned int *}' but argument is of type 'sector_t * {aka long unsigned int *}'
      drivers/block/null_blk.c:521:81: warning: comparison of distinct pointer types lacks a cast
      
      This changes the newly added code to use sector_div. It is a simplified version
      of the original patch, as Linus Torvalds pointed out that we should not be using
      an expensive division function in the first place.
      
      This version was suggested by Matias Bjorling.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Matias Bjorling <m@bjorling.me>
      Fixes: b2b7e001 ("null_blk: register as a LightNVM device")
      Signed-off-by: Jens Axboe <axboe@fb.com>
      e93d12ae
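      The difference, sketched as before/after (sector_div() divides in
      place and returns the remainder, and stays a cheap 32-bit division
      when sector_t is 32 bits; variable names are illustrative):
      
        sector_t size = get_capacity(disk);
      
        /* before: do_div() wants a u64 lvalue, hence the distinct
         * pointer-types warning when sector_t is 32-bit */
        do_div(size, bs);
      
        /* after: the matching helper for sector_t */
        sector_div(size, bs);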
  9. 12 January 2016 (1 commit)
    • lightnvm: refactor end_io functions for sync · 91276162
      Matias Bjørling committed
      To implement sync I/O support within the LightNVM core, the end_io
      functions are refactored to take an end_io function pointer, instead
      of testing for an initialized media manager and then calling its
      end_io function.
      
      Sync I/O can then be implemented using a callback that signals I/O
      completion. This is similar to the logic found in blk_to_execute_io().
      By implementing it this way, the underlying device I/O submission
      logic is abstracted away from the core, targets, and media managers.
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      91276162
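      A sketch of the completion-based pattern this enables (names are
      illustrative, not the LightNVM API):
      
        struct sync_wait {
                struct completion done;
                int error;
        };
      
        static void sync_end_io(void *private, int error)
        {
                struct sync_wait *wait = private;
      
                wait->error = error;
                complete(&wait->done);          /* wake the submitter */
        }
      
        /* submitter: hand the callback to the async submission path,
         * then block until it fires */
        struct sync_wait wait = { .error = 0 };
      
        init_completion(&wait.done);
        submit_io_async(dev, rqd, sync_end_io, &wait); /* hypothetical */
        wait_for_completion_io(&wait.done);
        return wait.error;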
  10. 09 January 2016 (2 commits)
  11. 06 January 2016 (2 commits)
  12. 05 January 2016 (10 commits)
    • cciss: print max outstanding commands as a hex value · a8036dfb
      Colin Ian King committed
      The max outstanding commands value is printed with a 0x prefix,
      suggesting a hex value, when in fact the decimal %d format specifier
      is used, which is confusing. Use %x instead to match the preceding 0x
      prefix.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      a8036dfb
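      The change in a sketch (the surrounding message text is illustrative):
      
        /* before: hex prefix, decimal value - confusing */
        dev_info(&h->pdev->dev, "0x%d commands supported\n", max_cmds);
      
        /* after: the format now matches the 0x prefix */
        dev_info(&h->pdev->dev, "0x%x commands supported\n", max_cmds);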
    • xen/blkfront: Fix crash if backend doesn't follow the right states. · c31ecf6c
      Konrad Rzeszutek Wilk committed
      We have split the setting up of all the resources into two steps:
      1) talk_to_blkback - figures out num_ring_pages (from the default
         value of zero), sets up the shadow structures, and so on.
      2) blkfront_connect - does the real part of filling out the
         internal structures.
      
      The problem is that if we bypass step 1) and go straight to 2), we
      call blkfront_setup_indirect, where we use the macro BLK_RING_SIZE -
      which returns a negative value (because sz is zero, since
      num_ring_pages is zero, since it has never been set).
      
      We can fix this by making sure that we always call talk_to_blkback
      before going to blkfront_connect.
      
      Or we could set info->nr_ring_pages = 1 in blkfront_probe as a
      default value. But that looks odd, as we haven't actually negotiated
      any ring size.
      
      This patch changes the XenbusStateConnected handling to detect
      whether the initial handshake has happened - and if not, to continue
      on as if we were in the XenbusStateInitWait state.
      
      We also roll the error recovery (freeing the structure) into the
      talk_to_blkback error path - which is safe since that function is
      only called from blkback_changed.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      c31ecf6c
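      The idea, sketched inside blkback_changed() (the guard condition is
      an assumption built from the description: nr_ring_pages is still 0 if
      talk_to_blkback() never ran):
      
        case XenbusStateConnected:
                if (!info->nr_ring_pages) {
                        /* no initial handshake yet: behave as if we
                         * were still in XenbusStateInitWait */
                        if (talk_to_blkback(dev, info))
                                break;
                }
                blkfront_connect(info);
                break;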
    • xen/blkback: Fix two memory leaks. · 93bb277f
      Bob Liu committed
      This patch fixes two memory leaks:
        backtrace:
          [<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50
          [<ffffffff81205e3b>] kmem_cache_alloc+0xbb/0x1d0
          [<ffffffff81534028>] xen_blkbk_probe+0x58/0x230
          [<ffffffff8146adb6>] xenbus_dev_probe+0x76/0x130
          [<ffffffff81511716>] driver_probe_device+0x166/0x2c0
          [<ffffffff815119bc>] __device_attach_driver+0xac/0xb0
          [<ffffffff8150fa57>] bus_for_each_drv+0x67/0x90
          [<ffffffff81511ab7>] __device_attach+0xc7/0x120
          [<ffffffff81511b23>] device_initial_probe+0x13/0x20
          [<ffffffff8151059a>] bus_probe_device+0x9a/0xb0
          [<ffffffff8150f0a1>] device_add+0x3b1/0x5c0
          [<ffffffff8150f47e>] device_register+0x1e/0x30
          [<ffffffff8146a9e8>] xenbus_probe_node+0x158/0x170
          [<ffffffff8146abaf>] xenbus_dev_changed+0x1af/0x1c0
          [<ffffffff8146b1bb>] backend_changed+0x1b/0x20
          [<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160
      unreferenced object 0xffff880007ba8ef8 (size 224):
      
        backtrace:
          [<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50
          [<ffffffff81205c73>] __kmalloc+0xd3/0x1e0
          [<ffffffff81534d87>] frontend_changed+0x2c7/0x580
          [<ffffffff8146af12>] xenbus_otherend_changed+0xa2/0xb0
          [<ffffffff8146b2c0>] frontend_changed+0x10/0x20
          [<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160
          [<ffffffff810d3e97>] kthread+0xd7/0xf0
          [<ffffffff817c4a9f>] ret_from_fork+0x3f/0x70
          [<ffffffffffffffff>] 0xffffffffffffffff
      unreferenced object 0xffff8800048dcd38 (size 224):
      
      The first leak is caused by not putting the be->blkif reference that
      we took in xen_blkif_alloc(), while the second is caused by not
      freeing blkif->rings in the right place.
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      93bb277f
    • xen/blkback: make st_ statistics per ring · db6fbc10
      Bob Liu committed
      Make the st_* statistics per-ring, and have the VBD sysfs code
      iterate over all the rings.
      
      Note: xenvbd_sysfs_delif() is called in xen_blkbk_remove() before all
      rings are torn down, so it's safe.
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      ---
      v2: Aligned the variables in the same column.
      db6fbc10
    • xen/blkfront: Handle non-indirect grant with 64KB pages · 6cc56833
      Julien Grall committed
      The minimal size of a request in the block framework is always
      PAGE_SIZE. This means that when 64KB guests are supported, a request
      will be at least 64KB.
      
      However, if the backend doesn't support indirect descriptors (such as
      QDISK in QEMU), a ring request can only accommodate 11 segments of
      4KB (i.e. 44KB).
      
      The current frontend assumes that an I/O request will always fit in a
      ring request. This is no longer true when using 64KB page
      granularity, and the guest will therefore crash during boot.
      
      On ARM64, the ABI is completely neutral to the page granularity used
      by the domU. The guest has the choice between the different page
      granularities supported by the processor (for instance on ARM64: 4KB,
      16KB, 64KB). This can't be enforced by the hypervisor, so it's
      possible to run guests using different page granularities.
      
      So we can't mandate that the block backend support indirect
      descriptors when the frontend is using 64KB page granularity, and we
      have to fix this properly in the frontend.
      
      The solution below modifies the frontend guest directly rather than
      asking the block framework to support smaller sizes (i.e.
      < PAGE_SIZE), because the changes in the block framework are not
      trivial: everything seems to rely on a struct page (see [1]). It may
      be that someone succeeds in doing this in the future, in which case
      we would be able to use it.
      
      Given that a block request may not fit in a single ring request, a
      second ring request is introduced for the data that cannot fit in the
      first one. This means that the second ring request should never be
      used on Linux if the page size is smaller than 44KB.
      
      To support the extra ring request, the block queue size is divided by
      two. The ring will therefore always contain enough space to
      accommodate two ring requests. While this reduces overall
      performance, it keeps the implementation more contained. The way
      forward to better performance is to implement indirect descriptors or
      multi-grant rings in the backend.
      
      Note that the blk_queue_max_* helper parameters haven't been updated.
      The block code will set the minimum size supported, and we may be
      able to directly support any future change in the block framework
      that lowers the minimal size of a request.
      
      [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
      Signed-off-by: Julien Grall <julien.grall@citrix.com>
      Acked-by: Roger Pau Monné <roger.pau@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      6cc56833
    • xen-blkfront: Introduce blkif_ring_get_request · 2e073969
      Julien Grall committed
      The code to get a request is always the same. Therefore we can factor
      it out into a single function.
      Signed-off-by: Julien Grall <julien.grall@citrix.com>
      Acked-by: Roger Pau Monné <roger.pau@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      2e073969
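      A sketch of the factored helper (close to what the description
      implies; the shadow bookkeeping detail is an assumption):
      
        static struct blkif_request *blkif_ring_get_request(
                        struct blkfront_ring_info *rinfo,
                        struct request *req, unsigned long *id)
        {
                *id = get_id_from_freelist(rinfo);
                rinfo->shadow[*id].request = req;
      
                return RING_GET_REQUEST(&rinfo->ring,
                                        rinfo->ring.req_prod_pvt++);
        }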
    • xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule() · a6e7af12
      Jiri Kosina committed
      The xen_blkif_schedule() kthread calls try_to_freeze() at the
      beginning of every attempt to purge the LRU. This operation can't
      ever succeed, though, as the kthread hasn't marked itself freezable.
      
      Before kthread freezing (hopefully, eventually) gets converted to
      filesystem freezing, we'd rather mark xen_blkif_schedule() freezable
      (as it can generate I/O during suspend).
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      a6e7af12
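      The fix is a single set_freezable() call; sketched in context:
      
        int xen_blkif_schedule(void *arg)
        {
                struct xen_blkif *blkif = arg;
      
                /* kthreads are not freezable by default; without this
                 * the try_to_freeze() below can never succeed */
                set_freezable();
      
                while (!kthread_should_stop()) {
                        if (try_to_freeze())
                                continue;
                        /* ... wait for and process ring requests,
                         * purge the persistent-grant LRU ... */
                }
                return 0;
        }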
    • xen/blkback: Free resources if connect_ring failed. · 2d0382fa
      Konrad Rzeszutek Wilk committed
      With the multi-queue support we could fail to set up some of the
      rings and fail the connection. That meant that all the resources tied
      to rings[0..n-1] (where n is the ring that failed to be set up) were
      left allocated. Eventually the frontend would switch states and we
      would call xen_blkif_disconnect.
      
      However, we do not want to be at the mercy of the frontend deciding
      when to change states. This patch allows us to do the cleanup right
      away and free the resources.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      2d0382fa
    • xen/blocks: Return -EXX instead of -1 · bde21f73
      Konrad Rzeszutek Wilk committed
      Let's return sensible error values instead of -1.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      bde21f73
    • xen/blkback: make pool of persistent grants and free pages per-queue · d4bf0065
      Bob Liu committed
      Make the pool of persistent grants and free pages per-queue/ring
      instead of per-device, to get better scalability.
      
      The test was done using the null_blk driver:
      dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
      domu: v4.2-rc8 16vcpus 10GB
      
      [test]
      rw=read
      direct=1
      ioengine=libaio
      bs=4k
      time_based
      runtime=30
      filename=/dev/xvdb
      numjobs=16
      iodepth=64
      iodepth_batch=64
      iodepth_batch_complete=64
      group_reporting
      
      Results:
      iops1: After patch "xen/blkfront: make persistent grants per-queue".
      iops2: After this patch.
      
      Queues:          1        4             8             16
      Iops orig(k):  810     1064           780            700
      Iops1(k):      810     1230 (~20%)   1024 (~20%)     850 (~20%)
      Iops2(k):      810     1410 (~35%)   1354 (~75%)    1440 (~100%)
      
      With 4 queues, this commit yields a ~75% increase in IOPS over the
      single-queue baseline, and performance won't drop as the number of
      queues increases.
      
      Please find the respective chart at this link:
      https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      d4bf0065