1. 23 12月, 2015 1 次提交
  2. 18 12月, 2015 2 次提交
  3. 09 12月, 2015 1 次提交
  4. 08 12月, 2015 1 次提交
  5. 04 12月, 2015 1 次提交
  6. 02 12月, 2015 3 次提交
    • A
      null_blk: change type of completion_nsec to unsigned long · dbac1175
      Arianna Avanzini 提交于
      This commit at least doubles the maximum value for
      completion_nsec. This helps in special cases where one wants/needs to
      emulate an extremely slow I/O (for example to spot bugs).
      Signed-off-by: NPaolo Valente <paolo.valente@unimore.it>
      Signed-off-by: NArianna Avanzini <avanzini@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dbac1175
    • A
      null_blk: guarantee device restart in all irq modes · cf8ecc5a
      Arianna Avanzini 提交于
      In single-queue (block layer) mode,the function null_rq_prep_fn stops
      the device if alloc_cmd fails. Then, once stopped, the device must be
      restarted on the next command completion, so that the request(s) for
      which alloc_cmd failed can be requeued. Otherwise the device hangs.
      
      Unfortunately, device restart is currently performed only for delayed
      completions, i.e., in irqmode==2. This fact causes hangs, for the
      above reasons, with the other irqmodes in combination with single-queue
      block layer.
      
      This commits addresses this issue by making sure that, if stopped, the
      device is properly restarted for all irqmodes on completions.
      Signed-off-by: NPaolo Valente <paolo.valente@unimore.it>
      Signed-off-by: NArianna AVanzini <avanzini@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      cf8ecc5a
    • P
      null_blk: set a separate timer for each command · 3c395a96
      Paolo Valente 提交于
      For the Timer IRQ mode (i.e., when command completions are delayed),
      there is one timer for each CPU. Each of these timers
      . has a completion queue associated with it, containing all the
        command completions to be executed when the timer fires;
      . is set, and a new completion-to-execute is inserted into its
        completion queue, every time the dispatch code for a new command
        happens to be executed on the CPU related to the timer.
      
      This implies that, if the dispatch of a new command happens to be
      executed on a CPU whose timer has already been set, but has not yet
      fired, then the timer is set again, to the completion time of the
      newly arrived command. When the timer eventually fires, all its queued
      completions are executed.
      
      This way of handling delayed command completions entails the following
      problem: if more than one command completion is inserted into the
      queue of a timer before the timer fires, then the expiration time for
      the timer is moved forward every time each of these completions is
      enqueued. As a consequence, only the last completion enqueued enjoys a
      correct execution time, while all previous completions are unjustly
      delayed until the last completion is executed (and at that time they
      are executed all together).
      
      Specifically, if all the above completions are enqueued almost at the
      same time, then the problem is negligible. On the opposite end, if
      every completion is enqueued a while after the previous completion was
      enqueued (in the extreme case, it is enqueued only right before the
      timer would have expired), then every enqueued completion, except for
      the last one, experiences an inflated delay, proportional to the number
      of completions enqueued after it. In the end, commands, and thus I/O
      requests, may be completed at an arbitrarily lower rate than the
      desired one.
      
      This commit addresses this issue by replacing per-CPU timers with
      per-command timers, i.e., by associating an individual timer with each
      command.
      Signed-off-by: NPaolo Valente <paolo.valente@unimore.it>
      Signed-off-by: NArianna Avanzini <avanzini@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      3c395a96
  7. 20 11月, 2015 4 次提交
  8. 17 11月, 2015 1 次提交
    • M
      null_blk: register as a LightNVM device · b2b7e001
      Matias Bjørling 提交于
      Add support for registering as a LightNVM device. This allows us to
      evaluate the performance of the LightNVM subsystem.
      
      In /drivers/Makefile, LightNVM is moved above block device drivers
      to make sure that the LightNVM media managers have been initialized
      before drivers under /drivers/block are initialized.
      Signed-off-by: NMatias Bjørling <m@bjorling.me>
      Fix by Jens Axboe to remove unneeded slab cache and the following
      memory leak.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b2b7e001
  9. 12 11月, 2015 1 次提交
  10. 08 11月, 2015 1 次提交
  11. 07 11月, 2015 6 次提交
    • O
      signal: turn dequeue_signal_lock() into kernel_dequeue_signal() · be0e6f29
      Oleg Nesterov 提交于
      1. Rename dequeue_signal_lock() to kernel_dequeue_signal(). This
         matches another "for kthreads only" kernel_sigaction() helper.
      
      2. Remove the "tsk" and "mask" arguments, they are always current
         and current->blocked. And it is simply wrong if tsk != current.
      
      3. We could also remove the 3rd "siginfo_t *info" arg but it looks
         potentially useful. However we can simplify the callers if we
         change kernel_dequeue_signal() to accept info => NULL.
      
      4. Remove _irqsave, it is never called from atomic context.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Markus Pargmann <mpa@pengutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be0e6f29
    • G
      zram: make is_partial_io/valid_io_request/page_zero_filled return boolean · 1c53e0d2
      Geliang Tang 提交于
      Make is_partial_io()/valid_io_request()/page_zero_filled() return boolean,
      since each function only uses either one or zero as its return value.
      Signed-off-by: NGeliang Tang <geliangtang@163.com>
      Reviewed-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c53e0d2
    • S
      zram: keep the exact overcommited value in mem_used_max · 12372755
      Sergey SENOZHATSKY 提交于
      `mem_used_max' is designed to store the max amount of memory zram consumed
      to store the data.  However, it does not represent the actual
      'overcommited' (max) value.  The existing code goes to -ENOMEM
      overcommited case before it updates `->stats.max_used_pages', which hides
      the reason we went to -ENOMEM in the first place -- we actually used more
      memory than `->limit_pages':
      
              alloced_pages = zs_get_total_pages(meta->mem_pool);
              if (zram->limit_pages && alloced_pages > zram->limit_pages) {
                      zs_free(meta->mem_pool, handle);
                      ret = -ENOMEM;
                      goto out;
              }
      
              update_used_max(zram, alloced_pages);
      
      Which is misleading.  User will see -ENOMEM, check `->limit_pages', check
      `->stats.max_used_pages', which will keep the value BEFORE zram passed
      `->limit_pages', and see:
      	`->stats.max_used_pages' < `->limit_pages'
      
      Move update_used_max() before we do `->limit_pages' check, so that
      user will see:
      	`->stats.max_used_pages' > `->limit_pages'
      should the overcommit and -ENOMEM happen.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12372755
    • L
      zram: introduce comp algorithm fallback functionality · 1d5b43bf
      Luis Henriques 提交于
      When the user supplies an unsupported compression algorithm, keep the
      previously selected one (knowingly supported) or the default one (if the
      compression algorithm hasn't been changed yet).
      
      Note that previously this operation (i.e. setting an invalid algorithm)
      would result in no algorithm being selected, which means that this
      represents a small change in the default behaviour.
      
      Minchan said:
      
      For initializing zram, we need to set up 3 optional parameters in advance.
      
      1. the number of compression streams
      2. memory limitation
      3. compression algorithm
      
      Although user pass completely wrong value to set up for 1 and 2
      parameters, it's okay because they have default value so zram will be
      initialized with the default value (of course, when user passes a wrong
      value via *echo*, sysfs returns -EINVAL so the user can notice it).
      
      But 3 is not consistent with other optional parameters.  IOW, if the
      user passes a wrong value to set up 3 parameter, zram's initialization
      would fail unlike other optional parameters.
      
      So this patch makes them consistent.
      Signed-off-by: NLuis Henriques <luis.henriques@canonical.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d5b43bf
    • M
      mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM · 71baba4b
      Mel Gorman 提交于
      __GFP_WAIT was used to signal that the caller was in atomic context and
      could not sleep.  Now it is possible to distinguish between true atomic
      context and callers that are not willing to sleep.  The latter should
      clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
      __GFP_WAIT behaves differently, there is a risk that people will clear the
      wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
      indicate what it does -- setting it allows all reclaim activity, clearing
      them prevents it.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71baba4b
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  12. 03 11月, 2015 5 次提交
  13. 31 10月, 2015 1 次提交
    • R
      rbd: require stable pages if message data CRCs are enabled · bae818ee
      Ronny Hegewald 提交于
      rbd requires stable pages, as it performs a crc of the page data before
      they are send to the OSDs.
      
      But since kernel 3.9 (patch 1d1d1a76
      "mm: only enforce stable page writes if the backing device requires
      it") it is not assumed anymore that block devices require stable pages.
      
      This patch sets the necessary flag to get stable pages back for rbd.
      
      In a ceph installation that provides multiple ext4 formatted rbd
      devices "bad crc" messages appeared regularly (ca 1 message every 1-2
      minutes on every OSD that provided the data for the rbd) in the
      OSD-logs before this patch. After this patch this messages are pretty
      much gone (only ca 1-2 / month / OSD).
      
      Cc: stable@vger.kernel.org # 3.9+, needs backporting
      Signed-off-by: NRonny Hegewald <Ronny.Hegewald@online.de>
      [idryomov@gmail.com: require stable pages only in crc case, changelog]
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      bae818ee
  14. 24 10月, 2015 2 次提交
    • I
      rbd: prevent kernel stack blow up on rbd map · 6d69bb53
      Ilya Dryomov 提交于
      Mapping an image with a long parent chain (e.g. image foo, whose parent
      is bar, whose parent is baz, etc) currently leads to a kernel stack
      overflow, due to the following recursion in the reply path:
      
        rbd_osd_req_callback()
          rbd_obj_request_complete()
            rbd_img_obj_callback()
              rbd_img_parent_read_callback()
                rbd_obj_request_complete()
                  ...
      
      Limit the parent chain to 16 images, which is ~5K worth of stack.  When
      the above recursion is eliminated, this limit can be lifted.
      
      Fixes: http://tracker.ceph.com/issues/12538
      
      Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NJosh Durgin <jdurgin@redhat.com>
      6d69bb53
    • I
      rbd: don't leak parent_spec in rbd_dev_probe_parent() · 1f2c6651
      Ilya Dryomov 提交于
      Currently we leak parent_spec and trigger a "parent reference
      underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
      The problem is we take the !parent out_err branch and that only drops
      refcounts; parent_spec that would've been freed had we called
      rbd_dev_unparent() remains and triggers rbd_warn() in
      rbd_dev_parent_put() - at that point we have parent_spec != NULL and
      parent_ref == 0, so counter ends up being -1 after the decrement.
      
      Redo rbd_dev_probe_parent() to fix this.
      
      Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      1f2c6651
  15. 23 10月, 2015 6 次提交
  16. 16 10月, 2015 3 次提交
    • I
      rbd: use writefull op for object size writes · e30b7577
      Ilya Dryomov 提交于
      This covers only the simplest case - an object size sized write, but
      it's still useful in tiering setups when EC is used for the base tier
      as writefull op can be proxied, saving an object promotion.
      
      Even though updating ceph_osdc_new_request() to allow writefull should
      just be a matter of fixing an assert, I didn't do it because its only
      user is cephfs.  All other sites were updated.
      
      Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      e30b7577
    • I
      rbd: set max_sectors explicitly · 0d9fde4f
      Ilya Dryomov 提交于
      Commit 30e2bc08 ("Revert "block: remove artifical max_hw_sectors
      cap"") restored a clamp on max_sectors.  It's now 2560 sectors instead
      of 1024, but it's not good enough: we set max_hw_sectors to rbd object
      size because we don't want object sized I/Os to be split, and the
      default object size is 4M.
      
      So, set max_sectors to max_hw_sectors in rbd at queue init time.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      0d9fde4f
    • K
      NVMe: Fix memory leak on retried commands · 0dfc70c3
      Keith Busch 提交于
      Resources are reallocated for requeued commands, so unmap and release
      the iod for the failed command.
      
      It's a pretty bad memory leak and causes a kernel hang if you remove a
      drive because of a busy dma pool. You'll get messages spewing like this:
      
        nvme 0000:xx:xx.x: dma_pool_destroy prp list 256, ffff880420dec000 busy
      
      and lock up pci and the driver since removal never completes while
      holding a lock.
      
      Cc: stable@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 4.0.x-
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0dfc70c3
  17. 15 10月, 2015 1 次提交