1. 09 Apr 2014 (3 commits)
    • md: avoid oops on unload if some process is in poll or select. · e2f23b60
      Authored by NeilBrown
      If md-mod is unloaded while some process is in poll() or select(),
      then that process maintains a pointer to md_event_waiters, and when
      it tries to unlink from that list, it will oops.
      
      The procfs infrastructure ensures that ->poll won't be called after
      remove_proc_entry, but doesn't provide a wait_queue_head for us to
      use, and the waitqueue code doesn't provide a way to remove all
      listeners from a waitqueue.
      
      So we need to:
       1/ make sure no further references to md_event_waiters are taken (by
          setting md_unloading)
       2/ wake up all processes currently waiting, and
       3/ wait until all those processes have disconnected from our
          wait_queue_head.
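      A minimal sketch of that sequence, assuming a module-scope
      md_unloading flag and the md_event_waiters wait_queue_head named
      above (details are illustrative, not the exact driver code):

      	md_unloading = 1;                        /* 1: refuse new references */
      	wake_up(&md_event_waiters);              /* 2: wake current waiters */
      	while (waitqueue_active(&md_event_waiters))
      		msleep(100);                     /* 3: wait for them to leave */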
      Reported-by: N"majianpeng" <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      e2f23b60
    • md/raid1: r1buf_pool_alloc: free allocated pages when subsequent allocation fails. · da1aab3d
      Authored by NeilBrown
      When performing a user-requested check/repair (MD_RECOVERY_REQUEST is set)
      on a raid1, we allocate multiple bios each with their own set of pages.
      
      If the page allocation for one bio fails, we currently do *not* free
      the pages allocated for the previous bios, nor do we free the bio itself.
      
      This patch frees all the already-allocated pages and makes sure that
      all the bios are freed as well.
      
      This bug can cause a memory leak which can ultimately OOM a machine.
      It was introduced in 3.10-rc1.
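      A hedged sketch of the error-path pattern (index and field names are
      illustrative, not the exact driver code):

      out_free_pages:
      	/* release the pages of every bio allocated so far */
      	while (--j >= 0) {
      		struct bio_vec *bv;
      		int idx;

      		bio_for_each_segment_all(bv, r1_bio->bios[j], idx)
      			__free_page(bv->bv_page);
      	}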
      
      Fixes: a0787606
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: stable@vger.kernel.org (3.10+)
      Reported-by: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/bitmap: don't abuse i_writecount for bitmap files. · 035328c2
      Authored by NeilBrown
      The md bitmap code currently tries to use i_writecount to stop any
      other process from writing to our bitmap file.  But that is really an
      abuse and has bit-rotted, so the locking is all wrong.

      So discard that - root should be allowed to shoot self in foot.

      Still use it in a much less intrusive way to stop the same file being
      used as a bitmap on two different arrays, and apply other checks to
      ensure the file is at least vaguely usable for bitmap storage
      (is regular, is open for write.  Support for ->bmap is already checked
      elsewhere).
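      A hedged sketch of the kind of validity checks described (context
      illustrative):

      	struct inode *inode = file_inode(file);

      	if (!S_ISREG(inode->i_mode))        /* must be a regular file */
      		return -EBADF;
      	if (!(file->f_mode & FMODE_WRITE))  /* must be open for write */
      		return -EBADF;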
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 13 Mar 2014 (2 commits)
    • dm cache: fix access beyond end of origin device · e893fba9
      Authored by Heinz Mauelshagen
      In order to avoid wasting cache space, a partial block at the end of
      the origin device is not cached.  Unfortunately, the check for such a
      partial block at the end of the origin device was flawed.
      
      Fix accesses beyond the end of the origin device that occurred due to
      attempted promotion of an undetected partial block by:
      
      - initializing the per bio data struct to allow cache_end_io to work properly
      - recognizing access to the partial block at the end of the origin device
      - avoiding out of bounds access to the discard bitset
      
      Otherwise, users can experience errors like the following:
      
       attempt to access beyond end of device
       dm-5: rw=0, want=20971520, limit=20971456
       ...
       device-mapper: cache: promotion failed; couldn't copy block
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • dm cache: fix truncation bug when copying a block to/from >2TB fast device · 8b9d9666
      Authored by Heinz Mauelshagen
      During demotion or promotion to a cache's >2TB fast device we must not
      truncate the cache block's associated sector to 32bits.  The 32bit
      temporary result of from_cblock() caused a 32bit multiplication when
      calculating the sector of the fast device in issue_copy_real().
      
      Use an intermediate 64bit type to store the 32bit from_cblock() to allow
      for proper 64bit multiplication.
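      A hedged illustration of the overflow (field names are assumptions
      based on the changelog):

      	/* from_cblock() yields a 32-bit value, so this multiply is
      	 * performed in 32 bits and wraps beyond 2TB: */
      	sector_t dest = from_cblock(cblock) * cache->sectors_per_block;

      	/* fix: widen to 64 bits before multiplying */
      	sector_t block = from_cblock(cblock);
      	sector_t dest_ok = block * cache->sectors_per_block;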
      
      Here is an example of how this bug manifests on an ext4 filesystem:
      
       EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt.
       JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  3. 08 Mar 2014 (1 commit)
    • dm space map metadata: fix refcount decrement below 0 which caused corruption · cebc2de4
      Authored by Joe Thornber
      This has been a relatively long-standing issue that wasn't nailed down
      until Teng-Feng Yang's meticulous bug report to dm-devel on 3/7/2014,
      see: http://www.redhat.com/archives/dm-devel/2014-March/msg00021.html
      
      From that report:
        "When decreasing the reference count of a metadata block with its
        reference count equals 3, we will call dm_btree_remove() to remove
        this enrty from the B+tree which keeps the reference count info in
        metadata device.
      
        The B+tree will try to rebalance the entry of the child nodes in each
        node it traversed, and the rebalance process contains the following
        steps.
      
        (1) Finding the corresponding children in current node (shadow_current(s))
        (2) Shadow the children block (issue BOP_INC)
        (3) redistribute keys among children, and free children if necessary (issue BOP_DEC)
      
        Since the update of a metadata block's reference count could be
        recursive, we will stash these reference count update operations in
        smm->uncommitted and then process them in a FILO fashion.
      
        The problem is that step(3) could free the children which is created
        in step(2), so the BOP_DEC issued in step(3) will be carried out
        before the BOP_INC issued in step(2) since these BOPs will be
        processed in FILO fashion. Once the BOP_DEC from step(3) tries to
        decrease the reference count of newly shadow block, it will report
        failure for its reference equals 0 before decreasing. It looks like we
        can solve this issue by processing these BOPs in a FIFO fashion
        instead of FILO."
      
      Commit 5b564d80 ("dm space map: disallow decrementing a reference count
      below zero") changed the code to report an error for this temporary
      refcount decrement below zero.  So what was previously a harmless
      invalid refcount became a hard failure due to the new error path:
      
       device-mapper: space map common: unable to decrement a reference count below 0
       device-mapper: thin: 253:6: dm_thin_insert_block() failed: error = -22
       device-mapper: thin: 253:6: switching pool to read-only mode
      
      This bug is in dm persistent-data code that is common to the DM thin and
      cache targets.  So any users of those targets should apply this fix.
      
      Fix this by applying recursive space map operations in FIFO order rather
      than FILO.
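      A hedged sketch of the FIFO drain (the ring-buffer helpers are
      assumptions, not the exact dm code):

      	/* process stashed block ops oldest-first */
      	while (!brb_empty(&smm->uncommitted)) {
      		struct block_op bop;

      		brb_pop(&smm->uncommitted, &bop);  /* oldest entry out first */
      		r = commit_bop(smm, &bop);
      		if (r)
      			break;
      	}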
      
      Resolves: https://bugzilla.kernel.org/show_bug.cgi?id=68801
      Reported-by: Apollon Oikonomopoulos <apoikos@debian.org>
      Reported-by: edwillam1007@gmail.com
      Reported-by: Teng-Feng Yang <shinrairis@gmail.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.13+
  4. 06 Mar 2014 (4 commits)
    • dm thin: fix noflush suspend IO queueing · 738211f7
      Authored by Joe Thornber
      i) By the time DM core calls the postsuspend hook, the dm_noflush flag
      has been cleared, so the old thin_postsuspend did nothing.  We need to
      use the presuspend hook instead.
      
      ii) There was a race between bios leaving DM core and arriving in the
      deferred queue.
      
      thin_presuspend now sets a 'requeue' flag causing all bios destined for
      that thin to be requeued back to DM core.  Then it requeues all held IO,
      and all IO on the deferred queue (destined for that thin).  Finally
      postsuspend clears the 'requeue' flag.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: fix deadlock in __requeue_bio_list · 18adc577
      Authored by Joe Thornber
      The spin lock in requeue_io() was held for too long, allowing deadlock.
      Don't worry, due to other issues addressed in the following "dm thin:
      fix noflush suspend IO queueing" commit, this code was never called.
      
      Fix this by taking the spin lock for a much shorter period of time.
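      A hedged sketch of the usual fix pattern (lock and list names are
      assumptions): splice the list aside under the lock, then requeue with
      the lock dropped.

      	struct bio_list bios;
      	unsigned long flags;
      	struct bio *bio;

      	bio_list_init(&bios);
      	spin_lock_irqsave(&tc->lock, flags);
      	bio_list_merge(&bios, &tc->deferred_bio_list);
      	bio_list_init(&tc->deferred_bio_list);
      	spin_unlock_irqrestore(&tc->lock, flags);

      	while ((bio = bio_list_pop(&bios)))  /* lock no longer held */
      		bio_endio(bio, DM_ENDIO_REQUEUE);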
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: fix out of data space handling · 3e1a0699
      Authored by Joe Thornber
      Ideally a thin pool would never run out of data space; the low water
      mark would trigger userland to extend the pool before we completely run
      out of space.  However, many small random IOs to unprovisioned space can
      consume data space at an alarming rate.  Adjust your low water mark if
      you're frequently seeing "out-of-data-space" mode.
      
      Before this fix, if data space ran out the pool would be put in
      PM_READ_ONLY mode which also aborted the pool's current metadata
      transaction (data loss for any changes in the transaction).  This had a
      side-effect of needlessly compromising data consistency.  And retry of
      queued unserviceable bios, once the data pool was resized, could
      initiate changes to potentially inconsistent pool metadata.
      
      Now when the pool's data space is exhausted transition to a new pool
      mode (PM_OUT_OF_DATA_SPACE) that allows metadata to be changed but data
      may not be allocated.  This allows users to remove thin volumes or
      discard data to recover data space.
      
      The pool is no longer put in PM_READ_ONLY mode in response to the pool
      running out of data space.  And PM_READ_ONLY mode no longer aborts the
      pool's current metadata transaction.  Also, set_pool_mode() will now
      notify userspace when the pool mode is changed.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: ensure user takes action to validate data and metadata consistency · 07f2b6e0
      Authored by Mike Snitzer
      If a thin metadata operation fails, the current transaction will abort,
      potentially causing data loss for IO layers up the stack (e.g.
      filesystems).  As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the
      thin metadata's superblock which:
      1) requires the user verify the thin metadata is consistent (e.g. use
         thin_check, etc)
      2) suggests the user verify the thin data is consistent (e.g. use fsck)
      
      The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is
      to run thin_repair.
      
      On metadata operation failure: abort current metadata transaction, set
      pool in read-only mode, and now set the needs_check flag.
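      A hedged sketch of that failure path (helper names are assumptions
      based on the changelog):

      	/* on metadata operation failure */
      	dm_pool_abort_metadata(pool->pmd);            /* abort transaction */
      	set_pool_mode(pool, PM_READ_ONLY);            /* go read-only */
      	dm_pool_metadata_set_needs_check(pool->pmd);  /* persist the flag */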
      
      As part of this change, constraints are introduced or relaxed:
      * don't allow a pool to transition to write mode if needs_check is set
      * don't allow data or metadata space to be resized if needs_check is set
      * if a thin pool's metadata space is exhausted: the kernel will now
        force the user to take the pool offline for repair before the kernel
        will allow the metadata space to be extended.
      
      Also, update Documentation to include information about when the thin
      provisioning target commits metadata, how it handles metadata failures
      and running out of space.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
  5. 05 Mar 2014 (1 commit)
    • dm thin: synchronize the pool mode during suspend · cdc2b415
      Authored by Mike Snitzer
      Commit b5330655 ("dm thin: handle metadata failures more consistently")
      increased potential for the pool's mode to be changed in response to
      metadata operation failures.
      
      When the pool mode is changed it isn't synchronized with the mode in
      pool_features stored in the target's context (ti->private) that is used
      as the basis for (re)establishing the pool mode during resume via
      bind_control_target.
      
      It is important that we synchronize the pool mode when it is changed,
      otherwise the pool may experience an unexpected mode transition on the
      next resume (especially if there was no new table load).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
  6. 04 Mar 2014 (2 commits)
    • dm snapshot: fix metadata corruption · 2c945820
      Authored by Mikulas Patocka
      Commit 55494bf2 ("dm snapshot: use dm-bufio") broke snapshots.
      Before that 3.14-rc1 commit, loading a snapshot's list of exceptions
      involved reading exception areas one by one into ps->area and inserting
      those exceptions into the hash table.  Commit 55494bf2 changed
      it so that dm-bufio with prefetch is used to load exceptions in batches.
      Exceptions are loaded correctly, but ps->area is left uninitialized.
      When a new exception is allocated, it is stored in this uninitialized
      ps->area which will be written to the disk.  This causes metadata
      corruption.
      
      Fix this corruption by copying the last area that was read via dm-bufio
      into ps->area.
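      A hedged sketch of that fix (buffer-handling context is illustrative):

      	/* keep ps->area initialized: mirror the last area read via
      	 * dm-bufio so later exception writes start from valid data */
      	memcpy(ps->area, area, ps->store->chunk_size << SECTOR_SHIFT);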
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: fix Kconfig indentation · c64d240d
      Authored by Mike Snitzer
      Since DM_DEBUG_BLOCK_STACK_TRACING is a DM_PERSISTENT_DATA config option
      move it from drivers/md/Kconfig to drivers/md/persistent-data/Kconfig.
      
      Doing so fixes indentation for other DM config options.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  7. 01 Mar 2014 (1 commit)
    • dm cache mq: fix memory allocation failure for large cache devices · 14f398ca
      Authored by Heinz Mauelshagen
      The memory allocated for the multiqueue policy's hash table doesn't need
      to be physically contiguous.  Use vzalloc() instead of kzalloc().
      Fedora has been carrying this fix since 10/10/2013.
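      A hedged sketch of the change (table and bucket names are
      illustrative):

      	/* the hash table only needs to be virtually contiguous */
      	mq->table = vzalloc(mq->nr_buckets * sizeof(*mq->table));
      	if (!mq->table)
      		return -ENOMEM;
      	/* freed later with vfree(), not kfree() */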
      
      Failure seen during creation of a 10TB cached device with a 2048 sector
      block size and 411GB cache size:
      
       dmsetup: page allocation failure: order:9, mode:0x10c0d0
       CPU: 11 PID: 29235 Comm: dmsetup Not tainted 3.10.4 #3
       Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1a       12/30/2011
        000000000010c0d0 ffff880090941898 ffffffff81387ab4 ffff880090941928
        ffffffff810bb26f 0000000000000009 000000000010c0d0 ffff880090941928
        ffffffff81385dbc ffffffff815f3840 ffffffff00000000 000002000010c0d0
       Call Trace:
        [<ffffffff81387ab4>] dump_stack+0x19/0x1b
        [<ffffffff810bb26f>] warn_alloc_failed+0x110/0x124
        [<ffffffff81385dbc>] ? __alloc_pages_direct_compact+0x17c/0x18e
        [<ffffffff810bda2e>] __alloc_pages_nodemask+0x6c7/0x75e
        [<ffffffff810bdad7>] __get_free_pages+0x12/0x3f
        [<ffffffff810ea148>] kmalloc_order_trace+0x29/0x88
        [<ffffffff810ec1fd>] __kmalloc+0x36/0x11b
        [<ffffffffa031eeed>] ? mq_create+0x1dc/0x2cf [dm_cache_mq]
        [<ffffffffa031efc0>] mq_create+0x2af/0x2cf [dm_cache_mq]
        [<ffffffffa0314605>] dm_cache_policy_create+0xa7/0xd2 [dm_cache]
        [<ffffffffa0312530>] ? cache_ctr+0x245/0xa13 [dm_cache]
        [<ffffffffa031263e>] cache_ctr+0x353/0xa13 [dm_cache]
        [<ffffffffa012b916>] dm_table_add_target+0x227/0x2ce [dm_mod]
        [<ffffffffa012e8e4>] table_load+0x286/0x2ac [dm_mod]
        [<ffffffffa012e65e>] ? dev_wait+0x8a/0x8a [dm_mod]
        [<ffffffffa012e324>] ctl_ioctl+0x39a/0x3c2 [dm_mod]
        [<ffffffffa012e35a>] dm_ctl_ioctl+0xe/0x12 [dm_mod]
        [<ffffffff81101181>] vfs_ioctl+0x21/0x34
        [<ffffffff811019d3>] do_vfs_ioctl+0x3b1/0x3f4
        [<ffffffff810f4d2e>] ? ____fput+0x9/0xb
        [<ffffffff81050b6c>] ? task_work_run+0x7e/0x92
        [<ffffffff81101a68>] SyS_ioctl+0x52/0x82
        [<ffffffff81391d92>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  8. 28 Feb 2014 (2 commits)
    • dm cache: fix truncation bug when mapping I/O to >2TB fast device · e0d849fa
      Authored by Heinz Mauelshagen
      When remapping a block to the cache's fast device that is larger than
      2TB we must not truncate the destination sector to 32bits.  The 32bit
      temporary result of from_cblock() was being overflowed in
      remap_to_cache() due to the logical left shift.
      
      Use an intermediate 64bit type to store the 32bit from_cblock() result
      to fix the overflow.
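      A hedged illustration (field names assumed from the changelog):

      	sector_t block = from_cblock(cblock);  /* widen first */

      	/* wrong: the 32-bit from_cblock() result wraps on the shift:
      	 * bio->bi_iter.bi_sector = from_cblock(cblock) << shift; */
      	bio->bi_iter.bi_sector = block << cache->sectors_per_block_shift;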
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • dm thin: allow metadata space larger than supported to go unused · 7d48935e
      Authored by Mike Snitzer
      It was always intended that a user could provide a thin metadata device
      that is larger than the max supported by the on-disk format.  The extra
      space would just go unused.
      
      Unfortunately that never worked.  If the user attempted to use a larger
      metadata device on creation they would get an error like the following:
      
       device-mapper: space map common: space map too large
       device-mapper: transaction manager: couldn't create metadata space map
       device-mapper: thin metadata: tm_create_with_sm failed
       device-mapper: table: 252:17: thin-pool: Error creating metadata object
       device-mapper: ioctl: error adding target to table
      
      Fix this by allowing the initial metadata space map creation to cap its
      size at the max number of blocks supported (DM_SM_METADATA_MAX_BLOCKS).
      get_metadata_dev_size() must also impose DM_SM_METADATA_MAX_BLOCKS (via
      THIN_METADATA_MAX_SECTORS), otherwise extending metadata would cap at
      THIN_METADATA_MAX_SECTORS_WARNING (which is larger than supported).
      
      Also, the calculation for THIN_METADATA_MAX_SECTORS didn't account for
      the size of the disk_bitmap_header.  So the supported maximum metadata
      size is a bit smaller (reduced from 33423360 to 33292800 sectors).
      
      Lastly, remove the "excess space will not be used" warning message from
      get_metadata_dev_size(); it resulted in printing the warning multiple
      times.  Factor out warn_if_metadata_device_too_big(), call it from
      pool_ctr() and maybe_resize_metadata_dev().
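      A hedged sketch of the cap (the surrounding size calculation is
      illustrative):

      	dm_block_t nr_blocks = div_u64(dev_size, metadata_block_size);

      	if (nr_blocks > DM_SM_METADATA_MAX_BLOCKS)
      		nr_blocks = DM_SM_METADATA_MAX_BLOCKS;  /* excess goes unused */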
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
  9. 26 Feb 2014 (1 commit)
  10. 25 Feb 2014 (1 commit)
    • dm thin: fix the error path for the thin device constructor · 1acacc07
      Authored by Mike Snitzer
      dm_pool_close_thin_device() must be called if dm_set_target_max_io_len()
      fails in thin_ctr().  Otherwise __pool_destroy() will fail because the
      pool will still have an open thin device:
      
       device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open
       device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed.
      
      Also, an error code must be established when failing thin_ctr() because
      the pool is in fail_io mode.
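      A hedged sketch of the corrected error path (labels and context are
      illustrative):

      	r = dm_set_target_max_io_len(ti, tc->pool->sectors_per_block);
      	if (r)
      		goto bad;  /* previously returned without closing */
      	return 0;
      bad:
      	dm_pool_close_thin_device(tc->td);  /* leave no open thin device */
      	return r;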
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org
  11. 18 Feb 2014 (5 commits)
    • dm raid1: fix immutable biovec related BUG when retrying read bio · f3a44fe0
      Authored by Mikulas Patocka
      When restoring bi_end_io, increase bi_remaining before retrying the bio
      to avoid BUG_ON(atomic_read(&bio->bi_remaining) <= 0) in bio_endio().
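      A hedged sketch (the saved-endio bookkeeping is an assumption):

      	bio->bi_end_io = saved_end_io;   /* restore the original endio */
      	atomic_inc(&bio->bi_remaining);  /* balance the extra bio_endio() */
      	generic_make_request(bio);       /* retry the read */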
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm io: fix I/O to multiple destinations · d73f9907
      Authored by Mikulas Patocka
      Commit 003b5c57 ("block: Convert drivers
      to immutable biovecs") broke dm-mirror due to dm-io breakage.
      
      dm-io had three possible iterators (DM_IO_PAGE_LIST, DM_IO_BVEC,
      DM_IO_VMA) that iterate over pages where the I/O should be performed.
      
      The switch to immutable biovecs changed the DM_IO_BVEC iterator to
      DM_IO_BIO.  Before this change the iterator stored the pointer to a bio
      vector in the dpages structure.  The iterator incremented the pointer in
      the dpages structure as it advanced over the pages.  After the immutable
      biovecs change, the DM_IO_BIO iterator stores a pointer to the bio in
      the dpages structure and uses bio_advance to change the bio as it
      advances.
      
      The problem is that the function dispatch_io stores the content of the
      dpages structure into the variable old_pages and restores it before
      issuing I/O to each of the devices.  Before the change, the statement
      "*dp = old_pages;" restored the iterator to its starting position.
      After the change, struct dpages holds a pointer to the bio, thus the
      statement "*dp = old_pages;" doesn't restore the iterator.
      
      Consequently, in the context of dm-mirror, only the first mirror leg is
      written correctly; the kernel locks up when trying to write the other
      mirror legs because the number of sectors to write in the where->count
      variable doesn't match the number of sectors returned by the iterator.
      
      This patch fixes the bug by partially reverting the original patch - it
      changes the code so that struct dpages holds a pointer to the bio vector,
      so that the statement "*dp = old_pages;" restores the iterator correctly.
      
      The field "context_u" holds the offset from the beginning of the current
      bio vector entry, just like the "bio->bi_iter.bi_bvec_done" field.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: avoid metadata commit if a pool's thin devices haven't changed · 4d1662a3
      Authored by Mike Snitzer
      Commit 905e51b3 ("dm thin: commit outstanding data every second")
      introduced a periodic commit.  This commit occurs regardless of whether
      any thin devices have made changes.
      
      Fix the periodic commit to check if any of a pool's thin devices have
      changed using dm_pool_changed_this_transaction().
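      A hedged sketch of the guarded periodic commit (the commit helper name
      is illustrative):

      	/* skip the commit when no thin device changed anything */
      	if (dm_pool_changed_this_transaction(pool->pmd))
      		commit_or_fallback(pool);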
      Reported-by: Alexander Larsson <alexl@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org
    • dm cache: do not add migration to completed list before unhooking bio · 80ae49aa
      Authored by Mike Snitzer
      When completing an overwrite bio, in overwrite_endio(), the associated
      migration should not be added to the 'completed_migrations' until the
      bio's fields are restored with dm_unhook_bio().
      
      Otherwise, do_worker() can race to process 'completed_migrations' before
      dm_unhook_bio() -- so the bio's bi_end_io is incorrect.  This is
      unlikely to cause any problems given the current code but should be
      fixed on the basis of correctness.
      
      Also, the cache's spinlock only needs to be held when manipulating the
      'completed_migrations' list -- other changes don't need protection.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
    • dm cache: move hook_info into common portion of per_bio_data structure · c6eda5e8
      Authored by Mike Snitzer
      Commit c9d28d5d ("dm cache: promotion optimisation for writes")
      incorrectly placed the 'hook_info' member in the writethrough-only
      portion of the per_bio_data structure.
      
      Given that the overwrite optimization may be used for writeback the
      'hook_info' member must be placed above the 'cache' member of the
      per_bio_data structure.  Any members above 'cache' are available from
      both writeback and writethrough modes' per_bio_data structure.
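      A hedged sketch of the layout rule (types simplified from the real
      structure):

      	struct per_bio_data {
      		bool tick;
      		struct dm_hook_info hook_info;  /* shared: sits above 'cache' */

      		/* members below are only valid in writethrough mode */
      		struct cache *cache;
      		dm_cblock_t cblock;
      	};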
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org # 3.13+
  12. 13 Feb 2014 (1 commit)
    • md/raid5: Fix CPU hotplug callback registration · 789b5e03
      Authored by Oleg Nesterov
      Subsystems that want to register CPU hotplug callbacks, as well as perform
      initialization for the CPUs that are already online, often do it as shown
      below:
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	put_online_cpus();
      
      This is wrong, since it is prone to ABBA deadlocks involving the
      cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
      with CPU hotplug operations).
      
      Interestingly, the raid5 code can actually prevent double initialization and
      hence can use the following simplified form of callback registration:
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	put_online_cpus();
      
      A hotplug operation that occurs between registering the notifier and
      calling get_online_cpus() won't disrupt anything, because the code
      takes care to perform the memory allocations only once.
      
      So reorganize the code in raid5 this way to fix the deadlock with callback
      registration.
      
      Cc: linux-raid@vger.kernel.org
      Cc: stable@vger.kernel.org (v2.6.32+)
      Fixes: 36d1c647
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      [Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
      free_scratch_buffer() helper to condense code further and wrote the changelog.]
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  13. 11 Feb 2014 (1 commit)
  14. 05 Feb 2014 (1 commit)
    • md/raid1: restore ability for check and repair to fix read errors. · 1877db75
      Authored by NeilBrown
      Commit 30bc9b53 ("md/raid1: fix bio handling problems in
      process_checks()") moved the bio_reset() to a point before where
      BIO_UPTODATE is checked, so that check now always reports that the bio
      is uptodate, even if it is not.

      This causes process_checks() to sometimes treat read errors as
      successful matches so the good data isn't written out.
      
      This patch preserves the flag until it is needed.
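      A hedged sketch of the idea (variable and handler names are
      illustrative):

      	int uptodate = test_bit(BIO_UPTODATE, &sbio->bi_flags);

      	bio_reset(sbio);  /* wipes bi_flags, including BIO_UPTODATE */
      	if (!uptodate)    /* use the saved value, not the wiped flag */
      		handle_read_error(sbio);  /* hypothetical handler */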
      
      The bug was introduced in 3.11 but backported to 3.10-stable (as it
      fixed an even worse bug), so the fix is suitable for any -stable kernel
      since 3.10.
      Reported-and-tested-by: Michael Tokarev <mjt@tls.msk.ru>
      Cc: stable@vger.kernel.org (3.10+)
      Fixes: 30bc9b53
      Signed-off-by: NeilBrown <neilb@suse.de>
  15. 30 Jan 2014 (3 commits)
  16. 22 Jan 2014 (3 commits)
    • dm log userspace: allow mark requests to piggyback on flush requests · 5066a4df
      Authored by Dongmao Zhang
      In a cluster environment, cluster writes have poor performance because
      userspace_flush() has to contact a userspace program (cmirrord) for
      clear/mark/flush requests.  Both mark and flush requests require
      cmirrord to communicate the message to all the cluster nodes for each
      flush call.  This behaviour is really slow.
      
      To address this we now merge mark and flush requests together to reduce
      the kernel-userspace-kernel time.  We allow a new directive,
      "integrated_flush" that can be used to instruct the kernel log code to
      combine flush and mark requests when directed by userspace.  If not
      directed by userspace (due to an older version of the userspace code
      perhaps), the kernel will function as it did previously - preserving
      backwards compatibility.  Additionally, flush requests are performed
      lazily when only clear requests exist.
      Signed-off-by: Dongmao Zhang <dmzhang@suse.com>
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • md/raid5: close recently introduced race in stripe_head management. · 7da9d450
      Authored by NeilBrown
      As release_stripe and __release_stripe decrement ->count and then
      manipulate ->lru both under ->device_lock, it is important that
      get_active_stripe() increments ->count and clears ->lru also under
      ->device_lock.
      
      However we currently list_del_init ->lru under the lock, but increment
      the ->count outside the lock.  This can lead to races and list
      corruption.
      
      So move the atomic_inc(&sh->count) up inside the ->device_lock
      protected region.
      
      Note that we still increment ->count without device lock in the case
      where get_free_stripe() was called, and in fact don't take
      ->device_lock at all in that path.
      This is safe because if the stripe_head can be found by
      get_free_stripe(), then the hash lock assures us that no-one else could
      possibly be calling release_stripe() at the same time.
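      A hedged sketch of the reordering (surrounding logic elided):

      	spin_lock(&conf->device_lock);
      	if (!list_empty(&sh->lru))
      		list_del_init(&sh->lru);
      	atomic_inc(&sh->count);  /* moved inside ->device_lock */
      	spin_unlock(&conf->device_lock);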
      
      Fixes: 566c09c5
      Cc: stable@vger.kernel.org (3.13)
      Reported-and-tested-by: Ian Kumlien <ian.kumlien@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • dm space map metadata: fix bug in resizing of thin metadata · fca02843
      Authored by Joe Thornber
      This bug was introduced in commit 7e664b3d ("dm space map metadata:
      fix extending the space map").
      
      When extending a dm-thin metadata volume we:
      
      - Switch the space map into a simple bootstrap mode, which allocates
        all space linearly from the newly added space.
      - Add new bitmap entries for the new space
      - Increment the reference counts for those newly allocated bitmap
        entries
      - Commit changes to disk
      - Switch back out of bootstrap mode.
      
      But the disk commit may allocate space itself; if so, this fact will be
      lost when switching out of bootstrap mode.
      
      The bug exhibited itself as an error when the bitmap_root, with an
      erroneous ref count of 0, was subsequently decremented as part of a
      later disk commit.  This would cause the disk commit to fail, and thinp
      to enter read_only mode.  The metadata was not damaged (thin_check
      passed).
      
      The fix is to put the increments + commit into a loop, running until
      the commit has not allocated extra space.  In practice this loop only
      runs twice.
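      A hedged sketch of the loop (helper names are hypothetical):

      	dm_block_t before;

      	do {
      		before = sm_nr_allocated(sm);   /* hypothetical helper */
      		increment_new_bitmap_refs(sm);  /* hypothetical helper */
      		commit(sm);                     /* may allocate more space */
      	} while (sm_nr_allocated(sm) != before);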
      
      With this fix the following device mapper testsuite test passes:
       dmtest run --suite thin-provisioning -n thin_remove_works_after_resize
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # depends on commit 7e664b3d
  17. 17 Jan 2014 (1 commit)
    • dm cache: add policy name to status output · 2e68c4e6
      Authored by Mike Snitzer
      The cache's policy may have been established using the "default" alias,
      which is currently the "mq" policy but the default policy may change in
      the future.  It is useful to know exactly which policy is being used.
      
      Add a 'real' member to the dm_cache_policy_type structure and have the
      "default" dm_cache_policy_type point to the real "mq"
      dm_cache_policy_type.  Update dm_cache_policy_get_name() to check
      whether 'real' is set and, if so, report the name of the real policy
      (not the alias).
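      A hedged sketch of the lookup (struct handling simplified):

      	const char *dm_cache_policy_get_name(struct dm_cache_policy *p)
      	{
      		struct dm_cache_policy_type *t = p->private;

      		/* if this type is an alias, report the real policy */
      		return t->real ? t->real->name : t->name;
      	}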
      Requested-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  18. 16 Jan 2014 (3 commits)
    • dm thin: fix pool feature parsing · 74aa45c3
      Authored by Mike Snitzer
      Commit 787a996c ("dm thin: add error_if_no_space feature")
      mistakenly forgot to increase the number of feature args supported.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • md/raid5: fix long-standing problem with bitmap handling on write failure. · 9f97e4b1
      Authored by NeilBrown
      Before a write starts we set a bit in the write-intent bitmap.
      When the write completes we clear that bit if the write was successful
      to all devices.  However if the write wasn't fully successful we
      should not clear the bit.  If the faulty drive is subsequently
      re-added, the fact that the bit is still set ensures that we will
      re-write the data that is missing.
      
      This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
      bitmap bit when this flag is not set.
      Currently we correctly set the flag if a write starts when some
      devices are failed or missing.  But we do *not* set the flag if some
      device failed during the write attempt.
      This is wrong and can result in clearing the bit inappropriately.
      
      So: set the flag when a write fails.
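      A hedged sketch of the change (error-path context assumed):

      	/* in the write-completion path, on failure: */
      	if (!uptodate) {
      		set_bit(STRIPE_DEGRADED, &sh->state);  /* keep the bitmap bit */
      		md_error(conf->mddev, rdev);
      	}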
      
      This bug has been present since bitmaps were introduced, so the fix is
      suitable for any -stable kernel.
      Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: check command validity early in md_ioctl(). · cb335f88
      Authored by Nicolas Schichan
      Verify that the cmd parameter passed to md_ioctl() is valid before
      doing anything.
      
      This fixes mddev->hold_active being set to 0 when an invalid ioctl
      command is passed to md_ioctl() before the array has been configured.
      
      Clearing mddev->hold_active in that case can lead to a livelock when
      one process passes an invalid ioctl number to md_ioctl() while the
      mddev is being opened by another process:
      
      Process 1				Process 2
      ---------				---------
      
      md_alloc()
        mddev_find()
        -> returns a new mddev with
           hold_active == UNTIL_IOCTL
        add_disk()
        -> sends KOBJ_ADD uevent
      
      					(sees KOBJ_ADD uevent for device)
                          			md_open()
                          			md_ioctl(INVALID_IOCTL)
                          			-> returns ENODEV and clears
                             			   mddev->hold_active
                          			md_release()
                            			md_put()
                            			-> deletes the mddev as
                               		   hold_active is 0
      
      md_open()
        mddev_find()
        -> returns a newly
          allocated mddev with
          mddev->gendisk == NULL
      -> returns with ERESTARTSYS
         (kernel restarts the open syscall)
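      A hedged sketch of the early check (the command list is abbreviated):

      	/* reject unknown commands before touching mddev state */
      	switch (cmd) {
      	case GET_ARRAY_INFO:
      	case SET_ARRAY_INFO:
      	case RUN_ARRAY:
      		/* ...remaining known commands... */
      		break;
      	default:
      		return -ENOTTY;
      	}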
      Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
      Signed-off-by: NeilBrown <neilb@suse.de>
  19. 15 Jan 2014 (4 commits)
    • dm sysfs: fix a module unload race · 2995fa78
      Authored by Mikulas Patocka
      This reverts commit be35f486 ("dm: wait until embedded kobject is
      released before destroying a device") and provides an improved fix.
      
      The kobject release code that calls the completion must be placed in a
      non-module file, otherwise there is a module unload race (if the process
      calling dm_kobject_release is preempted and the DM module unloaded after
      the completion is triggered, but before dm_kobject_release returns).
      
      To fix this race, this patch moves the completion code to dm-builtin.c
      which is always compiled directly into the kernel if BLK_DEV_DM is
      selected.
      
      The patch introduces a new dm_kobject_holder structure, its purpose is
      to keep the completion and kobject in one place, so that it can be
      accessed from non-module code without the need to export the layout of
      struct mapped_device to that code.
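      A hedged sketch of that holder, mirroring the description above:

      	struct dm_kobject_holder {
      		struct kobject kobj;
      		struct completion completion;
      	};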
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • dm snapshot: use dm-bufio prefetch · 55b082e6
      Authored by Mikulas Patocka
      This patch modifies dm-snapshot so that it prefetches the buffers when
      loading the exceptions.
      
      The number of buffers read ahead is specified in the DM_PREFETCH_CHUNKS
      macro.  The current value for DM_PREFETCH_CHUNKS (12) was found to
      provide the best performance on a single 15k SCSI spindle.  In the
      future we may modify this default or make it configurable.
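      A hedged sketch of the prefetch call (chunk arithmetic is
      illustrative):

      	#define DM_PREFETCH_CHUNKS 12

      	/* read ahead a window of exception-store chunks */
      	dm_bufio_prefetch(client, chunk, DM_PREFETCH_CHUNKS);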
      
      Also, introduce the function dm_bufio_set_minimum_buffers to set up
      bufio's number of internal buffers before freeing happens.  dm-bufio may
      hold more buffers if enough memory is available.  There is no guarantee
      that the specified number of buffers will be available - if you need a
      guarantee, use the argument reserved_buffers for
      dm_bufio_client_create.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm snapshot: use dm-bufio · 55494bf2
      Authored by Mikulas Patocka
      Use dm-bufio for initial loading of the exceptions.
      Introduce a new function dm_bufio_forget that frees the given buffer.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm snapshot: prepare for switch to using dm-bufio · 2cadabd5
      Authored by Mikulas Patocka
      Change the functions get_exception, read_exception and insert_exceptions
      so that ps->area is passed as an argument.
      
      This patch doesn't change any functionality, but it refactors the code
      to allow for a cleaner switch over to using dm-bufio.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>