1. 04 June 2014: 9 commits
  2. 27 May 2014: 2 commits
  3. 21 May 2014: 1 commit
    • M
      dm thin: add 'no_space_timeout' dm-thin-pool module param · 80c57893
      Committed by Mike Snitzer
      Commit 85ad643b ("dm thin: add timeout to stop out-of-data-space mode
      holding IO forever") introduced a fixed 60 second timeout.  Users may
      want to either disable or modify this timeout.
      
      Allow the out-of-data-space timeout to be configured using the
      'no_space_timeout' dm-thin-pool module param.  Setting it to 0 will
      disable the timeout, resulting in IO being queued until more data space
      is added to the thin-pool.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.14+
      80c57893
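      The shape of such a module parameter is sketched below. This is an illustrative,
      simplified example, not the dm-thin-pool source (the helper and the license tag are
      assumptions): a writable unsigned parameter where 0 means the timeout is disabled.

      #include <linux/module.h>
      #include <linux/jiffies.h>

      /* Illustrative only: a module-level timeout knob, 0 = never time out. */
      static unsigned int no_space_timeout = 60;
      module_param(no_space_timeout, uint, 0644);
      MODULE_PARM_DESC(no_space_timeout,
                       "Out of data space timeout in seconds (0 disables the timeout)");

      static unsigned long no_space_timeout_jiffies(void)
      {
              /* Callers skip arming the timeout entirely when this returns 0,
               * so IO stays queued until data space is added to the pool. */
              return no_space_timeout * HZ;
      }

      MODULE_LICENSE("GPL");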
  4. 15 May 2014: 4 commits
    • M
      dm mpath: fix lock order inconsistency in multipath_ioctl · 4cdd2ad7
      Committed by Mike Snitzer
      Commit 3e9f1be1 ("dm mpath: remove process_queued_ios()") did not
      consistently take the multipath device's spinlock (m->lock) before
      calling dm_table_run_md_queue_async() -- which takes the q->queue_lock.
      
      Found by code inspection, using a hint from the reported lockdep warning.
      Reported-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      4cdd2ad7
    • J
      dm thin: add timeout to stop out-of-data-space mode holding IO forever · 85ad643b
      Committed by Joe Thornber
      If the pool runs out of data space, dm-thin can be configured to
      either error IOs that would trigger provisioning, or hold those IOs
      until the pool is resized.  Unfortunately, holding IOs until the pool is
      resized can result in a cascade of tasks hitting the hung_task_timeout,
      which may render the system unavailable.
      
      Add a fixed timeout so IOs can only be held for a maximum of 60 seconds.
      If LVM is going to resize a thin-pool that is out of data space it needs
      to be prompt about it.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.14+
      85ad643b
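      A minimal sketch of the timeout mechanism described above, assuming a delayed work
      item armed when the pool enters out-of-data-space mode; struct and function names
      here are illustrative stand-ins, not the dm-thin internals.

      #include <linux/kernel.h>
      #include <linux/types.h>
      #include <linux/workqueue.h>
      #include <linux/jiffies.h>

      #define NO_SPACE_TIMEOUT (60 * HZ)        /* fixed 60 second cap on held IO */

      struct pool_sketch {
              struct workqueue_struct *wq;
              struct delayed_work no_space_timeout;   /* INIT_DELAYED_WORK() at pool creation */
              bool out_of_data_space;
      };

      static void do_no_space_timeout(struct work_struct *ws)
      {
              struct pool_sketch *pool = container_of(to_delayed_work(ws),
                                                      struct pool_sketch, no_space_timeout);

              if (pool->out_of_data_space) {
                      /* Nobody resized the pool in time: stop holding the queued
                       * IO and start erroring it instead of hanging forever. */
                      pool->out_of_data_space = false;
              }
      }

      static void enter_out_of_data_space(struct pool_sketch *pool)
      {
              pool->out_of_data_space = true;
              queue_delayed_work(pool->wq, &pool->no_space_timeout, NO_SPACE_TIMEOUT);
      }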
    • J
      dm thin: allow metadata commit if pool is in PM_OUT_OF_DATA_SPACE mode · 8d07e8a5
      Committed by Joe Thornber
      Commit 3e1a0699 ("dm thin: fix out of data space handling") introduced
      a regression in the metadata commit() method by returning an error if
      the pool is in PM_OUT_OF_DATA_SPACE mode.  This oversight caused a thin
      device to return errors even if the default queue_if_no_space ENOSPC
      handling mode is used.
      
      Fix commit() to only fail if pool is in PM_READ_ONLY or PM_FAIL mode.
      
      Reported-by: qindehua@163.com
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.14+
      8d07e8a5
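      A hedged sketch of the rule the fix enforces: refuse a metadata commit only when the
      pool is read-only or failed, not when it is merely out of data space. The enum values
      match the commit text; the struct and the commit hook are simplified stand-ins.

      #include <linux/errno.h>

      enum pool_mode {
              PM_WRITE,                 /* normal operation */
              PM_OUT_OF_DATA_SPACE,     /* data space exhausted, metadata still writable */
              PM_READ_ONLY,             /* metadata may not change */
              PM_FAIL,                  /* all IO fails */
      };

      struct thin_pool_sketch {
              enum pool_mode mode;
              int (*metadata_commit)(struct thin_pool_sketch *pool);  /* real commit hook */
      };

      static int commit(struct thin_pool_sketch *pool)
      {
              /* Only PM_READ_ONLY and PM_FAIL forbid committing; a pool that is
               * out of data space can (and must) still commit its metadata. */
              if (pool->mode == PM_READ_ONLY || pool->mode == PM_FAIL)
                      return -EINVAL;

              return pool->metadata_commit(pool);
      }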
    • M
      dm crypt: fix cpu hotplug crash by removing per-cpu structure · 610f2de3
      Committed by Mikulas Patocka
      The DM crypt target used per-cpu structures to hold pointers to an
      ablkcipher_request structure.  The code assumed that the work item keeps
      executing on a single CPU, so it didn't use synchronization when
      accessing this structure.
      
      If a CPU is disabled by writing 0 to /sys/devices/system/cpu/cpu*/online,
      the work item could be moved to another CPU.  This causes dm-crypt
      crashes, like the following, because the code starts using an incorrect
      ablkcipher_request:
      
       smpboot: CPU 7 is now offline
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000130
       IP: [<ffffffffa1862b3d>] crypt_convert+0x12d/0x3c0 [dm_crypt]
       ...
       Call Trace:
        [<ffffffffa1864415>] ? kcryptd_crypt+0x305/0x470 [dm_crypt]
        [<ffffffff81062060>] ? finish_task_switch+0x40/0xc0
        [<ffffffff81052a28>] ? process_one_work+0x168/0x470
        [<ffffffff8105366b>] ? worker_thread+0x10b/0x390
        [<ffffffff81053560>] ? manage_workers.isra.26+0x290/0x290
        [<ffffffff81058d9f>] ? kthread+0xaf/0xc0
        [<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120
        [<ffffffff813464ac>] ? ret_from_fork+0x7c/0xb0
        [<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120
      
      Fix this bug by removing the per-cpu definition.  The structure
      ablkcipher_request is accessed via a pointer from convert_context.
      Consequently, if the work item is rescheduled to a different CPU, the
      thread still uses the same ablkcipher_request.
      
      This change may undermine performance improvements intended by commit
      c0297721 ("dm crypt: scale to multiple cpus") on select hardware.  In
      practice no performance difference was observed on recent hardware.  But
      regardless, correctness is more important than performance.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      610f2de3
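      The core idea of the fix can be sketched as follows; this is a simplified
      illustration with stand-in struct members, not the dm-crypt definitions: the request
      pointer lives in the per-IO conversion context, so it survives the work item
      migrating between CPUs, whereas a per-cpu slot does not.

      #include <linux/percpu.h>

      struct convert_context_sketch {
              /* ... sector, bvec iterators ... */
              void *req;        /* stands in for struct ablkcipher_request * */
      };

      /* The pattern the fix removes: a per-cpu slot is only valid while the
       * work item stays on the CPU that filled it, which CPU hotplug breaks. */
      static DEFINE_PER_CPU(void *, percpu_req);

      static void *get_req_buggy(void)
      {
              return *this_cpu_ptr(&percpu_req);   /* wrong CPU means wrong or NULL request */
      }

      /* After the fix: always reach the request through the per-IO context. */
      static void *get_req_fixed(struct convert_context_sketch *ctx)
      {
              return ctx->req;
      }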
  5. 06 May 2014: 2 commits
    • N
      md: avoid possible spinning md thread at shutdown. · 0f62fb22
      Committed by NeilBrown
      If an md array with externally managed metadata (e.g. DDF or IMSM)
      is in use, then we should not set safemode==2 at shutdown because:
      
      1/ this is ineffective: user-space needs to be involved in any 'safemode' handling,
      2/ The safemode management code doesn't cope with safemode==2 on external metadata
         and md_check_recovery enters an infinite loop.
      
      Even at shutdown, an infinite-looping process can be problematic, so this
      could cause shutdown to hang.
      
      Cc: stable@vger.kernel.org (any kernel)
      Signed-off-by: NeilBrown <neilb@suse.de>
      0f62fb22
    • N
      md/raid10: call wait_barrier() for each request submitted. · cc13b1d1
      Committed by NeilBrown
      wait_barrier() includes a counter, so we must call it precisely once
      (unless balanced by allow_barrier()) for each request submitted.
      
      Since commit 20d0189b ("block: Introduce new bio_split()") in 3.14-rc1,
      we don't call it for the extra requests generated when we need to split
      a bio.
      
      When this happens the counter goes negative, any resync/recovery will
      never start, and "mdadm --stop" will hang.
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Fixes: 20d0189b
      Cc: stable@vger.kernel.org (3.14+)
      Cc: Kent Overstreet <kmo@daterainc.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      cc13b1d1
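      The invariant being restored can be illustrated with a simplified counter-based
      barrier (assumed, trimmed-down fields; the real r10conf is richer): every
      wait_barrier() bumps nr_pending, so each submitted request, including the extra ones
      produced by splitting, must go through it exactly once and be balanced by
      allow_barrier().

      #include <linux/spinlock.h>
      #include <linux/wait.h>

      struct r10conf_sketch {
              spinlock_t resync_lock;           /* spin_lock_init() at setup */
              wait_queue_head_t wait_barrier;   /* init_waitqueue_head() at setup */
              int barrier;                      /* non-zero while resync holds the barrier */
              int nr_pending;                   /* in-flight regular requests */
      };

      static void wait_barrier(struct r10conf_sketch *conf)
      {
              spin_lock_irq(&conf->resync_lock);
              wait_event_lock_irq(conf->wait_barrier, !conf->barrier, conf->resync_lock);
              conf->nr_pending++;               /* one increment per submitted request */
              spin_unlock_irq(&conf->resync_lock);
      }

      static void allow_barrier(struct r10conf_sketch *conf)
      {
              spin_lock_irq(&conf->resync_lock);
              conf->nr_pending--;               /* must balance a prior wait_barrier() */
              wake_up(&conf->wait_barrier);
              spin_unlock_irq(&conf->resync_lock);
      }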
  6. 02 May 2014: 1 commit
  7. 29 April 2014: 1 commit
  8. 17 April 2014: 1 commit
  9. 16 April 2014: 1 commit
  10. 09 April 2014: 5 commits
    • S
      raid5: get_active_stripe avoids device_lock · e240c183
      Committed by Shaohua Li
      For sequential workloads (or workloads with a large request size),
      get_active_stripe() can find a cached stripe.  In this case we always
      hold device_lock, which exposes a lot of lock contention for such
      workloads.  If the stripe's count isn't 0 we don't actually need to hold
      the lock, since we just increase its count, and this is the hot code
      path for such workloads.  Unfortunately we must delete the BUG_ON.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      e240c183
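      A rough sketch of the hot-path optimisation (simplified types; the real
      get_active_stripe() does considerably more): bump the count without device_lock when
      it is already non-zero, and fall back to the lock only for the 0 to 1 transition.

      #include <linux/atomic.h>
      #include <linux/spinlock.h>

      struct stripe_sketch {
              atomic_t count;
              spinlock_t *device_lock;     /* points at the per-device lock */
      };

      static void stripe_ref(struct stripe_sketch *sh)
      {
              if (atomic_inc_not_zero(&sh->count))
                      return;                        /* hot path: no device_lock needed */

              /* Slow path: a zero count means the stripe may sit on an inactive
               * list, so the transition from 0 must happen under device_lock. */
              spin_lock(sh->device_lock);
              if (!atomic_read(&sh->count)) {
                      /* ... detach from the inactive/lru list ... */
              }
              atomic_inc(&sh->count);
              spin_unlock(sh->device_lock);
      }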
    • S
      raid5: make_request does less prepare wait · 27c0f68f
      Committed by Shaohua Li
      On a NUMA machine, the prepare_to_wait/finish_wait calls in make_request
      expose a lot of contention for sequential workloads (or workloads with a
      large request size).  For such workloads each bio covers several
      stripes, so we can do prepare_to_wait/finish_wait once for the whole bio
      instead of once per stripe.  This removes the lock contention entirely
      for such workloads.  Random workloads might show similar lock
      contention, but I haven't seen it yet, maybe because my storage is still
      not fast enough.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      27c0f68f
    • N
      md: avoid oops on unload if some process is in poll or select. · e2f23b60
      Committed by NeilBrown
      If md-mod is unloaded while some process is in poll() or select(), that
      process keeps a pointer to md_event_waiters, and when it tries to unlink
      from that list it will oops.
      
      The procfs infrastructure ensures that ->poll won't be called after
      remove_proc_entry, but doesn't provide a wait_queue_head for us to
      use, and the waitqueue code doesn't provide a way to remove all
      listeners from a waitqueue.
      
      So we need to:
       1/ make sure no further references to md_event_waiters are taken (by
          setting md_unloading)
       2/ wake up all processes currently waiting, and
       3/ wait until all those processes have disconnected from our
          wait_queue_head.
      Reported-by: N"majianpeng" <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      e2f23b60
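      The three steps listed above roughly translate to the sketch below (assumed,
      simplified code, not the md.c patch itself): the poll method bails out once unloading
      starts, and the exit path wakes the remaining waiters and then waits them out.

      #include <linux/fs.h>
      #include <linux/poll.h>
      #include <linux/wait.h>
      #include <linux/delay.h>

      static DECLARE_WAIT_QUEUE_HEAD(md_event_waiters);
      static bool md_unloading;

      static unsigned int mdstat_poll_sketch(struct file *filp, poll_table *wait)
      {
              if (md_unloading)
                      /* 1: no new waiters; report everything so poll() returns now */
                      return POLLIN | POLLRDNORM | POLLERR | POLLPRI;

              poll_wait(filp, &md_event_waiters, wait);
              return 0;
      }

      static void md_exit_sketch(void)
      {
              int delay = 1;

              md_unloading = true;
              while (waitqueue_active(&md_event_waiters)) {
                      /* 2: wake everyone currently waiting ...            */
                      wake_up(&md_event_waiters);
                      /* 3: ... and wait until they have all left the queue */
                      msleep(delay);
                      if (delay < 1000)
                              delay *= 2;
              }
      }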
    • N
      md/raid1: r1buf_pool_alloc: free allocate pages when subsequent allocation fails. · da1aab3d
      Committed by NeilBrown
      When performing a user-requested check/repair (MD_RECOVERY_REQUEST is set)
      on a raid1, we allocate multiple bios, each with its own set of pages.
      
      If the page allocation for one bio fails, we currently do *not* free
      the pages allocated for the previous bios, nor do we free the bio itself.
      
      This patch frees all the already-allocated pages, and makes sure that
      all the bios are freed as well.
      
      This bug can cause a memory leak which can ultimately OOM a machine.
      It was introduced in 3.10-rc1.
      
      Fixes: a0787606
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: stable@vger.kernel.org (3.10+)
      Reported-by: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Signed-off-by: NeilBrown <neilb@suse.de>
      da1aab3d
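      The fix follows the standard unwind-on-failure pattern, sketched generically below
      (this is not the r1buf_pool_alloc() code, just the shape of the fix): if one
      allocation in the batch fails, everything allocated so far is released before
      returning an error.

      #include <linux/gfp.h>
      #include <linux/mm.h>
      #include <linux/errno.h>

      static int alloc_page_array(struct page **pages, int nr, gfp_t gfp)
      {
              int i, j;

              for (i = 0; i < nr; i++) {
                      pages[i] = alloc_page(gfp);
                      if (!pages[i])
                              goto out_free;
              }
              return 0;

      out_free:
              /* Free the pages already allocated for the earlier slots. */
              for (j = 0; j < i; j++)
                      __free_page(pages[j]);
              return -ENOMEM;
      }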
    • N
      md/bitmap: don't abuse i_writecount for bitmap files. · 035328c2
      Committed by NeilBrown
      The md bitmap code currently tries to use i_writecount to stop any other
      process from writing to our bitmap file.  But that is really an abuse
      and has bit-rotted, so the locking is all wrong.
      
      So discard that - root should be allowed to shoot self in foot.
      
      Still use it in a much less intrusive way to stop the same file being
      used as a bitmap on two different arrays, and apply other checks to
      ensure the file is at least vaguely usable for bitmap storage
      (it is regular and open for write; support for ->bmap is already checked
      elsewhere).
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NeilBrown <neilb@suse.de>
      035328c2
  11. 08 April 2014: 2 commits
    • J
      dm thin: fix rcu_read_lock being held in code that can sleep · b10ebd34
      Committed by Joe Thornber
      Commit c140e1c4 ("dm thin: use per thin device deferred bio lists")
      introduced the use of an rculist for all active thin devices.  The use
      of rcu_read_lock() in process_deferred_bios() can result in a BUG if a
      dm_bio_prison_cell must be allocated as a side-effect of bio_detain():
      
       BUG: sleeping function called from invalid context at mm/mempool.c:203
       in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u8:0
       3 locks held by kworker/u8:0/6:
         #0:  ("dm-" "thin"){.+.+..}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550
         #1:  ((&pool->worker)){+.+...}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550
         #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff816360b5>] do_worker+0x5/0x4d0
      
      We can't process deferred bios with the rcu lock held, since
      dm_bio_prison_cell allocation may block if the bio-prison's cell mempool
      is exhausted.
      
      To fix:
      
      - Introduce a refcount and completion field to each thin_c
      
      - Add thin_get/put methods for adjusting the refcount.  If the refcount
        hits zero then the completion is triggered.
      
      - Initialise refcount to 1 when creating thin_c
      
      - When iterating the active_thins list we thin_get() whilst the rcu
        lock is held.
      
      - After the rcu lock is dropped we process the deferred bios for that
        thin.
      
      - When destroying a thin_c we thin_put() and then wait for the
        completion -- to avoid a race between the worker thread iterating
        from that thin_c and destroying the thin_c.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      b10ebd34
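      Putting the bullet points above together, the lifetime scheme looks roughly like this
      (trimmed structs and helpers; the iteration helpers and field names are assumptions,
      not the actual patch): a reference is taken while the rcu lock is held, the blocking
      work happens after the rcu lock is dropped, and destruction waits on the completion.

      #include <linux/atomic.h>
      #include <linux/completion.h>
      #include <linux/rculist.h>

      struct thin_c_sketch {
              struct list_head list;          /* on the pool's rcu-protected active_thins */
              atomic_t refcount;              /* initialised to 1 when the thin_c is created */
              struct completion can_destroy;  /* init_completion() at creation */
      };

      static void thin_get(struct thin_c_sketch *tc)
      {
              atomic_inc(&tc->refcount);
      }

      static void thin_put(struct thin_c_sketch *tc)
      {
              if (atomic_dec_and_test(&tc->refcount))
                      complete(&tc->can_destroy);
      }

      /* Pin the first active thin under the rcu lock, then drop the rcu lock
       * before doing anything that may block (e.g. mempool allocation). */
      static struct thin_c_sketch *get_first_thin(struct list_head *active_thins)
      {
              struct thin_c_sketch *tc;

              rcu_read_lock();
              tc = list_first_or_null_rcu(active_thins, struct thin_c_sketch, list);
              if (tc)
                      thin_get(tc);
              rcu_read_unlock();
              return tc;
      }

      static void thin_destroy(struct thin_c_sketch *tc)
      {
              /* Drop the initial reference and wait for any worker still using
               * this thin device, so destruction cannot race with the worker. */
              thin_put(tc);
              wait_for_completion(&tc->can_destroy);
      }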
    • J
      dm thin: irqsave must always be used with the pool->lock spinlock · 5e3283e2
      Committed by Joe Thornber
      Commit c140e1c4 ("dm thin: use per thin device deferred bio lists")
      incorrectly stopped disabling irqs when taking the pool's spinlock.
      
      Irqs must be disabled when taking the pool's spinlock otherwise a thread
      could spin_lock(), then get interrupted to service thin_endio() in
      interrupt context, which would then deadlock in spin_lock_irqsave().
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      5e3283e2
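      The deadlock and its remedy fit in a few lines; the sketch below is illustrative (a
      bare lock and empty critical sections) rather than the dm-thin code: both the
      thread-context path and the endio path take the same lock, so the thread-context path
      must use the irq-saving variant.

      #include <linux/spinlock.h>
      #include <linux/list.h>

      static DEFINE_SPINLOCK(pool_lock);

      static void worker_path(struct list_head *deferred)
      {
              unsigned long flags;

              spin_lock_irqsave(&pool_lock, flags);     /* NOT plain spin_lock() */
              /* ... splice bios onto/off the deferred list ... */
              spin_unlock_irqrestore(&pool_lock, flags);
      }

      static void endio_path(void)
      {
              unsigned long flags;

              /* Runs in interrupt context.  If it interrupts a thread that took
               * pool_lock without disabling irqs, this spins forever: deadlock. */
              spin_lock_irqsave(&pool_lock, flags);
              /* ... release cells, requeue deferred bios ... */
              spin_unlock_irqrestore(&pool_lock, flags);
      }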
  12. 05 April 2014: 2 commits
    • J
      dm cache: fix a lock-inversion · 0596661f
      Committed by Joe Thornber
      When suspending a cache the policy is walked and the individual policy
      hints written to the metadata via sync_metadata().  This led to this
      lock order:
      
            policy->lock
              cache_metadata->root_lock
      
      When loading the cache target the policy is populated while the metadata
      lock is held:
      
            cache_metadata->root_lock
               policy->lock
      
      Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by
      ensuring the cache_metadata root_lock is held whilst all the hints are
      written, rather than being repeatedly locked while policy->lock is held
      (as was the case with each callout that policy_walk_mappings() made to
      the old save_hint() method).
      
      Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove
      locking correctness") build option.  However, it is not clear how the
      LOCKDEP reported paths can lead to a deadlock since the two paths,
      suspending a target and loading a target, never occur at the same time.
      But that doesn't mean the same lock-inversion couldn't have occurred
      elsewhere.
      Reported-by: Marian Csontos <mcsontos@redhat.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      0596661f
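      The inversion can be reduced to two locks taken in opposite orders on two paths; the
      sketch below models them as mutexes purely for illustration (the real locks are the
      cache metadata root_lock and the policy lock): after the fix both paths acquire the
      metadata lock before the policy lock.

      #include <linux/mutex.h>

      static DEFINE_MUTEX(metadata_root_lock);
      static DEFINE_MUTEX(policy_lock);

      /* Load path: metadata lock, then policy lock. */
      static void load_mappings(void)
      {
              mutex_lock(&metadata_root_lock);
              mutex_lock(&policy_lock);
              /* ... populate the policy from the metadata ... */
              mutex_unlock(&policy_lock);
              mutex_unlock(&metadata_root_lock);
      }

      /* Suspend path after the fix: hold the metadata lock across all the hint
       * writes instead of re-taking it while the policy lock is already held. */
      static void sync_metadata(void)
      {
              mutex_lock(&metadata_root_lock);
              mutex_lock(&policy_lock);
              /* ... walk the policy and write every hint ... */
              mutex_unlock(&policy_lock);
              mutex_unlock(&metadata_root_lock);
      }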
    • M
      dm thin: sort the per thin deferred bios using an rb_tree · 67324ea1
      Committed by Mike Snitzer
      A thin-pool will allocate blocks using FIFO order for all thin devices
      which share the thin-pool.  Because of this simplistic allocation the
      thin-pool's space can become fragmented quite easily; especially when
      multiple threads are requesting blocks in parallel.
      
      Sort each thin device's deferred_bio_list based on logical sector to
      help reduce fragmentation of the thin-pool's ondisk layout.
      
      The following tables illustrate the realized gains/potential offered by
      sorting each thin device's deferred_bio_list.  An "io size"-sized random
      read of the device would result in "seeks/io" fragments being read, with
      an average "distance/seek" between each fragment.
      
      Data was written to a single thin device using multiple threads via
      iozone (8 threads, 64K for both the block_size and io_size).
      
      unsorted:
      
           io size   seeks/io distance/seek
        --------------------------------------
                4k    0.000   0b
               16k    0.013   11m
               64k    0.065   11m
              256k    0.274   10m
                1m    1.109   10m
                4m    4.411   10m
               16m    17.097  11m
               64m    60.055  13m
              256m    148.798 25m
                1g    809.929 21m
      
      sorted:
      
           io size   seeks/io distance/seek
        --------------------------------------
                4k    0.000   0b
               16k    0.000   1g
               64k    0.001   1g
              256k    0.003   1g
                1m    0.011   1g
                4m    0.045   1g
               16m    0.181   1g
               64m    0.747   1011m
              256m    3.299   1g
                1g    14.373  1g
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      67324ea1
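      Sorting the per-thin deferred bios amounts to a standard rbtree insert keyed on the
      bio's logical sector; below is a generic sketch, assuming a post-3.14 bio with
      bi_iter (the node struct is illustrative, not the dm-thin layout).

      #include <linux/rbtree.h>
      #include <linux/bio.h>

      struct deferred_bio {
              struct rb_node rb_node;
              struct bio *bio;
      };

      static void deferred_bio_insert_sorted(struct rb_root *root, struct deferred_bio *new)
      {
              struct rb_node **link = &root->rb_node, *parent = NULL;
              sector_t sector = new->bio->bi_iter.bi_sector;

              while (*link) {
                      struct deferred_bio *this;

                      parent = *link;
                      this = rb_entry(parent, struct deferred_bio, rb_node);

                      if (sector < this->bio->bi_iter.bi_sector)
                              link = &parent->rb_left;
                      else
                              link = &parent->rb_right;
              }

              rb_link_node(&new->rb_node, parent, link);
              rb_insert_color(&new->rb_node, root);
      }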
  13. 01 April 2014: 1 commit
    • M
      dm thin: use per thin device deferred bio lists · c140e1c4
      Committed by Mike Snitzer
      The thin-pool previously only had a single deferred_bios list that would
      collect bios for all thin devices in the pool.  Split this per-pool
      deferred_bios list out to per-thin deferred_bios_list -- doing so
      enables increased parallelism when processing deferred bios.  And now
      that each thin device has its own deferred_bios_list we can sort all
      bios in the list by logical sector.  The requeue code in the error
      handling path is also cleaner as a side-effect.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      c140e1c4
  14. 31 March 2014: 1 commit
  15. 29 March 2014: 1 commit
  16. 28 March 2014: 6 commits