1. 03 6月, 2012 5 次提交
    • J
      dm thin: provide userspace access to pool metadata · cc8394d8
      Joe Thornber 提交于
      This patch implements two new messages that can be sent to the thin
      pool target allowing it to take a snapshot of the _metadata_.  This,
      read-only snapshot can be accessed by userland, concurrently with the
      live target.
      
      Only one metadata snapshot can be held at a time.  The pool's status
      line will give the block location for the current msnap.
      
      Since version 0.1.5 of the userland thin provisioning tools, the
      thin_dump program displays the msnap as follows:
      
          thin_dump -m <msnap root> <metadata dev>
      
      Available here: https://github.com/jthornber/thin-provisioning-tools
      
      Now that userland can access the metadata we can do various things
      that have traditionally been kernel side tasks:
      
           i) Incremental backups.
      
           By using metadata snapshots we can work out what blocks have
           changed over time.  Combined with data snapshots we can ensure
           the data doesn't change while we back it up.
      
           A short proof of concept script can be found here:
      
           https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb
      
           ii) Migration of thin devices from one pool to another.
      
           iii) Merging snapshots back into an external origin.
      
           iv) Asyncronous replication.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      cc8394d8
    • M
      dm thin: use slab mempools · a24c2569
      Mike Snitzer 提交于
      Use dedicated caches prefixed with a "dm_" name rather than relying on
      kmalloc mempools backed by generic slab caches so the memory usage of
      thin provisioning (and any leaks) can be accounted for independently.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      a24c2569
    • M
      dm mpath: allow ioctls to trigger pg init · 35991652
      Mikulas Patocka 提交于
      After the failure of a group of paths, any alternative paths that
      need initialising do not become available until further I/O is sent to
      the device.  Until this has happened, ioctls return -EAGAIN.
      
      With this patch, new paths are made available in response to an ioctl
      too.  The processing of the ioctl gets delayed until this has happened.
      
      Instead of returning an error, we submit a work item to kmultipathd
      (that will potentially activate the new path) and retry in ten
      milliseconds.
      
      Note that the patch doesn't retry an ioctl if the ioctl itself fails due
      to a path failure.  Such retries should be handled intelligently by the
      code that generated the ioctl in the first place, noting that some SCSI
      commands should not be retried because they are not idempotent (XOR write
      commands).  For commands that could be retried, there is a danger that
      if the device rejected the SCSI command, the path could be errorneously
      marked as failed, and the request would be retried on another path which
      might fail too.  It can be determined if the failure happens on the
      device or on the SCSI controller, but there is no guarantee that all
      SCSI drivers set these flags correctly.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      35991652
    • M
      dm mpath: delay retry of bypassed pg · f220fd4e
      Mike Christie 提交于
      If I/O needs retrying and only bypassed priority groups are available,
      set the pg_init_delay_retry flag to wait before retrying.
      
      If, for example, the reason for the bypass is that the controller is
      getting reset or there is a firmware upgrade happening, retrying right
      away would cause a flood of log messages and retries for what could be a
      few seconds or even several minutes.
      Signed-off-by: NMike Christie <michaelc@cs.wisc.edu>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      f220fd4e
    • M
      dm mpath: reduce size of struct multipath · 1fbdd2b3
      Mike Snitzer 提交于
      Move multipath structure's 'lock' and 'queue_size' members to eliminate
      two 4-byte holes.  Also use a bit within a single unsigned int for each
      existing flag (saves 8-bytes).  This allows future flags to be added
      without each consuming an unsigned int.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Acked-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      1fbdd2b3
  2. 19 5月, 2012 2 次提交
    • M
      dm thin: fix table output when pool target disables discard passdown internally · f402693d
      Mike Snitzer 提交于
      When the thin pool target clears the discard_passdown parameter
      internally, it incorrectly changes the table line reported to userspace.
      This breaks dumb string comparisons on these table lines in generic
      userspace device-mapper library code and leads to tables being reloaded
      repeatedly when nothing is actually meant to be changing.
      
      This patch corrects this by no longer changing the table line when
      discard passdown was disabled.
      
      We can still tell when discard passdown is overridden by looking for the
      message "Discard unsupported by data device (sdX): Disabling discard passdown."
      
      This automatic detection is also moved from the 'load' to the 'resume'
      so that it is re-evaluated should the properties of underlying devices
      change.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Acked-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      f402693d
    • N
      md/raid10: fix transcription error in calc_sectors conversion. · b0d634d5
      NeilBrown 提交于
      The old code was
      		sector_div(stride, fc);
      the new code was
      		sector_dir(size, conf->near_copies);
      
      'size' is right (the stride various wasn't really needed), but
      'fc' means 'far_copies', and that is an important difference.
      
      Signed-off-by: NeilBrown <neilb@suse.de>       
      b0d634d5
  3. 17 5月, 2012 2 次提交
    • J
      MD: Add del_timer_sync to mddev_suspend (fix nasty panic) · 0d9f4f13
      Jonathan Brassow 提交于
      Use del_timer_sync to remove timer before mddev_suspend finishes.
      
      We don't want a timer going off after an mddev_suspend is called.  This is
      especially true with device-mapper, since it can call the destructor function
      immediately following a suspend.  This results in the removal (kfree) of the
      structures upon which the timer depends - resulting in a very ugly panic.
      Therefore, we add a del_timer_sync to mddev_suspend to prevent this.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0d9f4f13
    • N
      md/raid10: set dev_sectors properly when resizing devices in array. · 6508fdbf
      NeilBrown 提交于
      raid10 stores dev_sectors in 'conf' separately from the one in
      'mddev' because it can have a very significant effect on block
      addressing and so need to be updated carefully.
      
      However raid10_resize isn't updating it at all!
      
      To update it correctly, we need to make sure it is a proper
      multiple of the chunksize taking various details of the layout
      in to account.
      This calculation is currently done in setup_conf.   So split it
      out from there and call it from raid10_resize as well.
      Then set conf->dev_sectors properly.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      6508fdbf
  4. 12 5月, 2012 4 次提交
    • M
      dm mpath: check if scsi_dh module already loaded before trying to load · 510193a2
      Mike Snitzer 提交于
      If the requested scsi_dh module is already loaded then skip
      request_module().
      
      Multipath table loads can hang in an unnecessary __request_module.
      Reported-by: NBen Marzinski <bmarzins@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      510193a2
    • A
      dm thin: correct module description · 7cab8bf1
      Alasdair G Kergon 提交于
      Remove duplicate copy of string "device-mapper" (DM_NAME) from
      MODULE_DESCRIPTION.
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      7cab8bf1
    • M
      dm thin: fix unprotected use of prepared_discards list · c3a0ce2e
      Mike Snitzer 提交于
      Fix two places in commit 104655fd ("dm thin: support discards") that
      didn't use pool->lock to protect against concurrent changes to the
      prepared_discards list.
      
      Without this fix, thin_endio() can race with process_discard(), leading
      to concurrent list_add()s that result in the processes locking up with
      an error like the following:
      
      WARNING: at lib/list_debug.c:32 __list_add+0x8f/0xa0()
      ...
      list_add corruption. next->prev should be prev (ffff880323b96140), but was ffff8801d2c48440. (next=ffff8801d2c485c0).
      ...
      Pid: 17205, comm: kworker/u:1 Tainted: G        W  O 3.4.0-rc3.snitm+ #1
      Call Trace:
       [<ffffffff8103ca1f>] warn_slowpath_common+0x7f/0xc0
       [<ffffffff8103cb16>] warn_slowpath_fmt+0x46/0x50
       [<ffffffffa04f6ce6>] ? bio_detain+0xc6/0x210 [dm_thin_pool]
       [<ffffffff8124ff3f>] __list_add+0x8f/0xa0
       [<ffffffffa04f70d2>] process_discard+0x2a2/0x2d0 [dm_thin_pool]
       [<ffffffffa04f6a78>] ? remap_and_issue+0x38/0x50 [dm_thin_pool]
       [<ffffffffa04f7c3b>] process_deferred_bios+0x7b/0x230 [dm_thin_pool]
       [<ffffffffa04f7df0>] ? process_deferred_bios+0x230/0x230 [dm_thin_pool]
       [<ffffffffa04f7e42>] do_worker+0x52/0x60 [dm_thin_pool]
       [<ffffffff81056fa9>] process_one_work+0x129/0x450
       [<ffffffff81059b9c>] worker_thread+0x17c/0x3c0
       [<ffffffff81059a20>] ? manage_workers+0x120/0x120
       [<ffffffff8105eabe>] kthread+0x9e/0xb0
       [<ffffffff814ceda4>] kernel_thread_helper+0x4/0x10
       [<ffffffff8105ea20>] ? kthread_freezable_should_stop+0x70/0x70
       [<ffffffff814ceda0>] ? gs_change+0x13/0x13
      ---[ end trace 7e0a523bc5e52692 ]---
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      c3a0ce2e
    • M
      dm thin: reinstate missing mempool_free in cell_release_singleton · 03aaae7c
      Mike Snitzer 提交于
      Fix a significant memory leak inadvertently introduced during
      simplification of cell_release_singleton() in commit
      6f94a4c4 ("dm thin: fix stacked bi_next
      usage").
      
      A cell's hlist_del() must be accompanied by a mempool_free().
      Use __cell_release() to do this, like before.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      03aaae7c
  5. 11 5月, 2012 1 次提交
    • E
      connector/userns: replace netlink uses of cap_raised() with capable() · 38bf1953
      Eric W. Biederman 提交于
      In 2009 Philip Reiser notied that a few users of netlink connector
      interface needed a capability check and added the idiom
      cap_raised(nsp->eff_cap, CAP_SYS_ADMIN) to a few of them, on the premise
      that netlink was asynchronous.
      
      In 2011 Patrick McHardy noticed we were being silly because netlink is
      synchronous and removed eff_cap from the netlink_skb_params and changed
      the idiom to cap_raised(current_cap(), CAP_SYS_ADMIN).
      
      Looking at those spots with a fresh eye we should be calling
      capable(CAP_SYS_ADMIN).  The only reason I can see for not calling capable
      is that it once appeared we were not in the same task as the caller which
      would have made calling capable() impossible.
      
      In the initial user_namespace the only difference between between
      cap_raised(current_cap(), CAP_SYS_ADMIN) and capable(CAP_SYS_ADMIN) are a
      few sanity checks and the fact that capable(CAP_SYS_ADMIN) sets
      PF_SUPERPRIV if we use the capability.
      
      Since we are going to be using root privilege setting PF_SUPERPRIV seems
      the right thing to do.
      
      The motivation for this that patch is that in a child user namespace
      cap_raised(current_cap(),...) tests your capabilities with respect to that
      child user namespace not capabilities in the initial user namespace and
      thus will allow processes that should be unprivielged to use the kernel
      services that are only protected with cap_raised(current_cap(),..).
      
      To fix possible user_namespace issues and to just clean up the code
      replace cap_raised(current_cap(), CAP_SYS_ADMIN) with
      capable(CAP_SYS_ADMIN).
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Acked-by: NSerge E. Hallyn <serge.hallyn@canonical.com>
      Acked-by: NAndrew G. Morgan <morgan@kernel.org>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: David Howells <dhowells@redhat.com>
      Reviewed-by: NJames Morris <james.l.morris@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38bf1953
  6. 04 5月, 2012 1 次提交
  7. 24 4月, 2012 3 次提交
    • N
      md: fix possible corruption of array metadata on shutdown. · 30b8aa91
      NeilBrown 提交于
      commit c744a65c
        md: don't set md arrays to readonly on shutdown.
      
      removed the possibility of a 'BUG' when data is written to an array
      that has just been switched to read-only, but also introduced the
      possibility that the array metadata could be corrupted.
      
      If, when md_notify_reboot gets the mddev lock, the array is
      in a state where it is assembled but hasn't been started (as can
      happen if the personality module is not available, or in other unusual
      situations), then incorrect metadata will be written out making it
      impossible to re-assemble the array.
      
      So only call __md_stop_writes() if the array has actually been
      activated.
      
      This patch is needed for any stable kernel which has had the above
      commit applied.
      
      Cc: stable@vger.kernel.org
      Reported-by: NChristoph Nelles <evilazrael@evilazrael.de>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      30b8aa91
    • N
      md: don't call ->add_disk unless there is good reason. · ed209584
      NeilBrown 提交于
      Commit 7bfec5f3
      
         md/raid5: If there is a spare and a want_replacement device, start replacement.
      
      cause md_check_recovery to call ->add_disk much more often.
      Instead of only when the array is degraded, it is now called whenever
      md_check_recovery finds anything useful to do, which includes
      updating the metadata for clean<->dirty transition.
      This causes unnecessary work, and causes info messages from ->add_disk
      to be reported much too often.
      
      So refine md_check_recovery to only do any actual recovery checking
      (including ->add_disk) if MD_RECOVERY_NEEDED is set.
      
      This fix is suitable for 3.3.y:
      
      Cc: stable@vger.kernel.org
      Reported-by: NJan Ceuleers <jan.ceuleers@computer.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      ed209584
    • J
      DM RAID: Use safe version of rdev_for_each · a9ad8526
      Jonathan Brassow 提交于
      Fix segfault caused by using rdev_for_each instead of rdev_for_each_safe
      
      Commit dafb20fa mistakenly replaced a safe
      iterator with an unsafe one when making some macro changes.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a9ad8526
  8. 12 4月, 2012 3 次提交
  9. 03 4月, 2012 5 次提交
  10. 02 4月, 2012 3 次提交
  11. 29 3月, 2012 11 次提交
    • M
      dm: add verity target · a4ffc152
      Mikulas Patocka 提交于
      This device-mapper target creates a read-only device that transparently
      validates the data on one underlying device against a pre-generated tree
      of cryptographic checksums stored on a second device.
      
      Two checksum device formats are supported: version 0 which is already
      shipping in Chromium OS and version 1 which incorporates some
      improvements.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMandeep Singh Baines <msb@chromium.org>
      Signed-off-by: NWill Drewry <wad@chromium.org>
      Signed-off-by: NElly Jones <ellyjones@chromium.org>
      Cc: Milan Broz <mbroz@redhat.com>
      Cc: Olof Johansson <olofj@chromium.org>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      a4ffc152
    • M
      dm bufio: prefetch · a66cc28f
      Mikulas Patocka 提交于
      This patch introduces a new function dm_bufio_prefetch. It prefetches
      the specified range of blocks into dm-bufio cache without waiting
      for i/o completion.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      a66cc28f
    • J
      dm thin: add pool target flags to control discard · 67e2e2b2
      Joe Thornber 提交于
      Add dm thin target arguments to control discard support.
      
      ignore_discard: Disables discard support
      
      no_discard_passdown: Don't pass discards down to the underlying data
      device, but just remove the mapping within the thin provisioning target.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      67e2e2b2
    • J
      dm thin: support discards · 104655fd
      Joe Thornber 提交于
      Support discards in the thin target.
      
      On discard the corresponding mapping(s) are removed from the thin
      device.  If the associated block(s) are no longer shared the discard
      is passed to the underlying device.
      
      All bios other than discards now have an associated deferred_entry
      that is saved to the 'all_io_entry' in endio_hook.  When non-discard
      IO completes and associated mappings are quiesced any discards that
      were deferred, via ds_add_work() in process_discard(), will be queued
      for processing by the worker thread.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      
      drivers/md/dm-thin.c |  173 ++++++++++++++++++++++++++++++++++++++++++++++----
       drivers/md/dm-thin.c |  172 ++++++++++++++++++++++++++++++++++++++++++++++-----
       1 file changed, 158 insertions(+), 14 deletions(-)
      104655fd
    • J
      dm thin: prepare to support discard · eb2aa48d
      Joe Thornber 提交于
      This patch contains the ground work needed for dm-thin to support discard.
      
        - Adds endio function that replaces shared_read_endio.
      
        - Introduce an explicit 'quiesced' flag into the new_mapping structure.
          Before, this was implicitly indicated by m->list being empty.
      
        - The map_info->ptr remains constant for the duration of a bio's trip
          through the thin target.  Make it easier to reason about it.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      eb2aa48d
    • A
      dm thin: use dm_target_offset · 6efd6e83
      Alasdair G Kergon 提交于
      Use dm_target_offset wrapper instead of referencing the awkward ti->begin
      explicitly.
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      6efd6e83
    • J
      dm thin: support read only external snapshot origins · 2dd9c257
      Joe Thornber 提交于
      Support the use of an external _read only_ device as an origin for a thin
      device.
      
      Any read to an unprovisioned area of the thin device will be passed
      through to the origin.  Writes trigger allocation of new blocks as
      usual.
      
      One possible use case for this would be VM hosts that want to run
      guests on thinly-provisioned volumes but have the base image on another
      device (possibly shared between many VMs).
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      2dd9c257
    • M
      dm thin: relax hard limit on the maximum size of a metadata device · c4a69ecd
      Mike Snitzer 提交于
      The thin metadata format can only make use of a device that is <=
      THIN_METADATA_MAX_SECTORS (currently 15.9375 GB).  Therefore, there is no
      practical benefit to using a larger device.
      
      However, it may be that other factors impose a certain granularity for
      the space that is allocated to a device (E.g. lvm2 can impose a coarse
      granularity through the use of large, >= 1 GB, physical extents).
      
      Rather than reject a larger metadata device, during thin-pool device
      construction, switch to allowing it but issue a warning if a device
      larger than THIN_METADATA_MAX_SECTORS_WARNING (16 GB) is
      provided.  Any space over 15.9375 GB will not be used.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      c4a69ecd
    • J
      dm persistent data: remove space map ref_count entries if redundant · 71fd5ae2
      Joe Thornber 提交于
      Save space by removing entries from the space map ref_count tree if
      they're no longer needed.
      
      Ref counts are stored in two places: a bitmap if the ref_count is
      below 3, or a btree of uint32_t if 3 or above.
      
      When a ref_count that was above 3 drops below we can remove it from
      the tree and save some metadata space.  This removal was commented out
      before because I was unsure why this was causing under-populated btree
      nodes.  Earlier patches have fixed this issue.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      71fd5ae2
    • J
      dm thin: commit outstanding data every second · 905e51b3
      Joe Thornber 提交于
      Commit unwritten data every second to prevent too much building up.
      
      Released blocks don't become available until after the next commit
      (for crash resilience).  Prior to this patch commits were only
      triggered by a message to the target or a REQ_{FLUSH,FUA} bio.  This
      allowed far too big a position to build up.
      
      The interval is hard-coded to 1 second.  This is a sensible setting.
      I'm not making this user configurable, since there isn't much to be
      gained by tweaking this - and a lot lost by setting it far too high.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      905e51b3
    • M
      dm: reject trailing characters in sccanf input · 31998ef1
      Mikulas Patocka 提交于
      Device mapper uses sscanf to convert arguments to numbers. The problem is that
      the way we use it ignores additional unmatched characters in the scanned string.
      
      For example, this `if (sscanf(string, "%d", &number) == 1)' will match a number,
      but also it will match number with some garbage appended, like "123abc".
      
      As a result, device mapper accepts garbage after some numbers. For example
      the command `dmsetup create vg1-new --table "0 16384 linear 254:1bla 34816bla"'
      will pass without an error.
      
      This patch fixes all sscanf uses in device mapper. It appends "%c" with
      a pointer to a dummy character variable to every sscanf statement.
      
      The construct `if (sscanf(string, "%d%c", &number, &dummy) == 1)' succeeds
      only if string is a null-terminated number (optionally preceded by some
      whitespace characters). If there is some character appended after the number,
      sscanf matches "%c", writes the character to the dummy variable and returns 2.
      We check the return value for 1 and consequently reject numbers with some
      garbage appended.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      31998ef1