1. 08 6月, 2016 1 次提交
  2. 14 1月, 2016 1 次提交
    • D
      md/raid: only permit hot-add of compatible integrity profiles · 1501efad
      Dan Williams 提交于
      It is not safe for an integrity profile to be changed while i/o is
      in-flight in the queue.  Prevent adding new disks or otherwise online
      spares to an array if the device has an incompatible integrity profile.
      
      The original change to the blk_integrity_unregister implementation in
      md, commmit c7bfced9 "md: suspend i/o during runtime
      blk_integrity_unregister" introduced an immediate hang regression.
      
      This policy of disallowing changes the integrity profile once one has
      been established is shared with DM.
      
      Here is an abbreviated log from a test run that:
      1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [   59.076127]
      2/ Tries to add an integrity-disabled device (pmem1m) [   90.489209]
      3/ Retries with an integrity-enabled device (pmem1s) [  205.671277]
      
      [   59.076127] md/raid1:md0: active with 1 out of 2 mirrors
      [   59.078302] md: data integrity enabled on md0
      [..]
      [   90.489209] md0: incompatible integrity profile for pmem1m
      [..]
      [  205.671277] md: super_written gets error=-5
      [  205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
      [  205.677386] md/raid1:md0: Operation continuing on 1 devices.
      [  205.683037] RAID1 conf printout:
      [  205.684699]  --- wd:1 rd:2
      [  205.685972]  disk 0, wo:0, o:1, dev:pmem0s
      [  205.687562]  disk 1, wo:1, o:1, dev:pmem1s
      [  205.691717] md: recovery of RAID array md0
      
      Fixes: c7bfced9 ("md: suspend i/o during runtime blk_integrity_unregister")
      Cc: <stable@vger.kernel.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reported-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      1501efad
  3. 10 1月, 2016 1 次提交
  4. 07 1月, 2016 1 次提交
    • N
      md: Remove 'ready' field from mddev. · 274d8cbd
      NeilBrown 提交于
      This field is always set in tandem with ->pers, and when it is tested
      ->pers is also tested.  So ->ready is not needed.
      
      It was needed once, but code rearrangement and locking changes have
      removed that needed.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      274d8cbd
  5. 06 1月, 2016 3 次提交
    • D
      drivers: md: use ktime_get_real_seconds() · 9ebc6ef1
      Deepa Dinamani 提交于
      get_seconds() API is not y2038 safe on 32 bit systems and the API
      is deprecated. Replace it with calls to ktime_get_real_seconds()
      API instead. Change mddev structure types to time64_t accordingly.
      
      32 bit signed timestamps will overflow in the year 2038.
      
      Change the user interface mdu_array_info_s structure timestamps:
      ctime and utime values used in ioctls GET_ARRAY_INFO and
      SET_ARRAY_INFO to unsigned int. This will extend the field to last
      until the year 2106.
      The long term plan is to get rid of ctime and utime values in
      this structure as this information can be read from the on-disk
      meta data directly.
      
      Clamp the tim64_t timestamps to positive values with a max of U32_MAX
      when returning from GET_ARRAY_INFO ioctl to accommodate above changes
      in the data type of timestamps to unsigned int.
      
      v0.90 on disk meta data uses u32 for maintaining time stamps.
      So this will also last until year 2106.
      Assumption is that the usage of v0.90 will be deprecated by
      year 2106.
      
      Timestamp fields in the on disk meta data for v1.0 version already
      use 64 bit data types. Remove the truncation of the bits while
      writing to or reading from these from the disk.
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      9ebc6ef1
    • G
      md-cluster: Defer MD reloading to mddev->thread · 15858fa5
      Guoqing Jiang 提交于
      Reloading of superblock must be performed under reconfig_mutex. However,
      this cannot be done with md_reload_sb because it would deadlock with
      the message DLM lock. So, we defer it in md_check_recovery() which is
      executed by mddev->thread.
      
      This introduces a new flag, MD_RELOAD_SB, which if set, will reload the
      superblock. And good_device_nr is also added to 'struct mddev' which is
      used to get the num of the good device within cluster raid.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      15858fa5
    • G
      md-cluster: remove a disk asynchronously from cluster environment · 659b254f
      Guoqing Jiang 提交于
      For cluster raid, if one disk couldn't be reach in one node, then
      other nodes would receive the REMOVE message for the disk.
      
      In receiving node, we can't call md_kick_rdev_from_array to remove
      the disk from array synchronously since the disk might still be busy
      in this node. So let's set a ClusterRemove flag on the disk, then
      let the thread to do the removal job eventually.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      659b254f
  6. 18 12月, 2015 1 次提交
  7. 01 11月, 2015 2 次提交
  8. 24 10月, 2015 2 次提交
  9. 12 10月, 2015 1 次提交
    • G
      md-cluster: Improve md_reload_sb to be less error prone · 70bcecdb
      Goldwyn Rodrigues 提交于
      md_reload_sb is too simplistic and it explicitly needs to determine
      the changes made by the writing node. However, there are multiple areas
      where a simple reload could fail.
      
      Instead, read the superblock of one of the "good" rdevs and update
      the necessary information:
      
      - read the superblock into a newly allocated page, by temporarily
        swapping out rdev->sb_page and calling ->load_super.
      - if that fails return
      - if it succeeds, call check_sb_changes
        1. iterates over list of active devices and checks the matching
         dev_roles[] value.
         	If that is 'faulty', the device must be  marked as faulty
      	 - call md_error to mark the device as faulty. Make sure
      	   not to set CHANGE_DEVS and wakeup mddev->thread or else
      	   it would initiate a resync process, which is the responsibility
      	   of the "primary" node.
      	 - clear the Blocked bit
      	 - Call remove_and_add_spares() to hot remove the device.
      	If the device is 'spare':
      	 - call remove_and_add_spares() to get the number of spares
      	   added in this operation.
      	 - Reduce mddev->degraded to mark the array as not degraded.
        2. reset recovery_cp
      - read the rest of the rdevs to update recovery_offset. If recovery_offset
        is equal to MaxSector, call spare_active() to set it In_sync
      
      This required that recovery_offset be initialized to MaxSector, as
      opposed to zero so as to communicate the end of sync for a rdev.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      70bcecdb
  10. 14 8月, 2015 1 次提交
    • K
      block: kill merge_bvec_fn() completely · 8ae12666
      Kent Overstreet 提交于
      As generic_make_request() is now able to handle arbitrarily sized bios,
      it's no longer necessary for each individual block driver to define its
      own ->merge_bvec_fn() callback. Remove every invocation completely.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: NDongsu Park <dpark@posteo.net>
      Signed-off-by: NMing Lin <ming.l@ssi.samsung.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8ae12666
  11. 02 6月, 2015 1 次提交
    • T
      writeback: separate out include/linux/backing-dev-defs.h · 66114cad
      Tejun Heo 提交于
      With the planned cgroup writeback support, backing-dev related
      declarations will be more widely used across block and cgroup;
      unfortunately, including backing-dev.h from include/linux/blkdev.h
      makes cyclic include dependency quite likely.
      
      This patch separates out backing-dev-defs.h which only has the
      essential definitions and updates blkdev.h to include it.  c files
      which need access to more backing-dev details now include
      backing-dev.h directly.  This takes backing-dev.h off the common
      include dependency chain making it a lot easier to use it across block
      and cgroup.
      
      v2: fs/fat build failure fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      66114cad
  12. 22 4月, 2015 3 次提交
  13. 23 2月, 2015 5 次提交
    • G
      Add new disk to clustered array · 1aee41f6
      Goldwyn Rodrigues 提交于
      Algorithm:
      1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
         ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD)
      2. Node 1 sends NEWDISK with uuid and slot number
      3. Other nodes issue kobject_uevent_env with uuid and slot number
      (Steps 4,5 could be a udev rule)
      4. In userspace, the node searches for the disk, perhaps
         using blkid -t SUB_UUID=""
      5. Other nodes issue either of the following depending on whether the disk
         was found:
         ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
      	 disc.number set to slot number)
         ioctl(CLUSTERED_DISK_NACK)
      6. Other nodes drop lock on no-new-devs (CR) if device is found
      7. Node 1 attempts EX lock on no-new-devs
      8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
         as SpareLocal
      9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED
      10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message.
      Signed-off-by: NLidong Zhong <lzhong@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      1aee41f6
    • G
      Reload superblock if METADATA_UPDATED is received · 1d7e3e96
      Goldwyn Rodrigues 提交于
      Re-reads the devices by invalidating the cache.
      Since we don't write to faulty devices, this is detected using
      events recorded in the devices. If it is old as compared to the mddev
      mark it is faulty.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      1d7e3e96
    • G
      Add node recovery callbacks · cf921cc1
      Goldwyn Rodrigues 提交于
      DLM offers callbacks when a node fails and the lock remastery
      is performed:
      
      1. recover_prep: called when DLM discovers a node is down
      2. recover_slot: called when DLM identifies the node and recovery
      		can start
      3. recover_done: called when all nodes have completed recover_slot
      
      recover_slot() and recover_done() are also called when the node joins
      initially in order to inform the node with its slot number. These slot
      numbers start from one, so we deduct one to make it start with zero
      which the cluster-md code uses.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      cf921cc1
    • G
      Introduce md_cluster_info · c4ce867f
      Goldwyn Rodrigues 提交于
      md_cluster_info stores the cluster information in the MD device.
      
      The join() is called when mddev detects it is a clustered device.
      The main responsibilities are:
      	1. Setup a DLM lockspace
      	2. Setup all initial locks such as super block locks and bitmap lock (will come later)
      
      The leave() clears up the lockspace and all the locks held.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      c4ce867f
    • G
      Introduce md_cluster_operations to handle cluster functions · edb39c9d
      Goldwyn Rodrigues 提交于
      This allows dynamic registering of cluster hooks.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      edb39c9d
  14. 06 2月, 2015 5 次提交
  15. 04 2月, 2015 5 次提交
    • N
      md: protect ->pers changes with mddev->lock · 36d091f4
      NeilBrown 提交于
      ->pers is already protected by ->reconfig_mutex, and
      cannot possibly change when there are threads running or
      outstanding IO.
      
      However there are some places where we access ->pers
      not in a thread or IO context, and where ->reconfig_mutex
      is unnecessarily heavy-weight:  level_show and md_seq_show().
      
      So protect all changes, and those accesses, with ->lock.
      This is a step toward taking those accesses out from under
      reconfig_mutex.
      
      [Fixed missing "mddev->pers" -> "pers" conversion, thanks to
       Dan Carpenter <dan.carpenter@oracle.com>]
      Signed-off-by: NNeilBrown <neilb@suse.de>
      36d091f4
    • N
      md: rename ->stop to ->free · afa0f557
      NeilBrown 提交于
      Now that the ->stop function only frees the private data,
      rename is accordingly.
      
      Also pass in the private pointer as an arg rather than using
      mddev->private.  This flexibility will be useful in level_store().
      
      Finally, don't clear ->private.  It doesn't make sense to clear
      it seeing that isn't what we free, and it is no longer necessary
      to clear ->private (it was some time ago before  ->to_remove was
      introduced).
      
      Setting ->to_remove in ->free() is a bit of a wart, but not a
      big problem at the moment.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      afa0f557
    • N
      md: make merge_bvec_fn more robust in face of personality changes. · 64590f45
      NeilBrown 提交于
      There is no locking around calls to merge_bvec_fn(), so
      it is possible that calls which coincide with a level (or personality)
      change could go wrong.
      
      So create a central dispatch point for these functions and use
      rcu_read_lock().
      If the array is suspended, reject any merge that can be rejected.
      If not, we know it is safe to call the function.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      64590f45
    • N
      md: make ->congested robust against personality changes. · 5c675f83
      NeilBrown 提交于
      There is currently no locking around calls to the 'congested'
      bdi function.  If called at an awkward time while an array is
      being converted from one level (or personality) to another, there
      is a tiny chance of running code in an unreferenced module etc.
      
      So add a 'congested' function to the md_personality operations
      structure, and call it with appropriate locking from a central
      'mddev_congested'.
      
      When the array personality is changing the array will be 'suspended'
      so no IO is processed.
      If mddev_congested detects this, it simply reports that the
      array is congested, which is a safe guess.
      As mddev_suspend calls synchronize_rcu(), mddev_congested can
      avoid races by included the whole call inside an rcu_read_lock()
      region.
      This require that the congested functions for all subordinate devices
      can be run under rcu_lock.  Fortunately this is the case.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5c675f83
    • N
      md: rename mddev->write_lock to mddev->lock · 85572d7c
      NeilBrown 提交于
      This lock is used for (slightly) more than helping with writing
      superblocks, and it will soon be extended further.  So the
      name is inappropriate.
      
      Also, the _irq variant hasn't been needed since 2.6.37 as it is
      never taking from interrupt or bh context.
      
      So:
        -rename write_lock to lock
        -document what it protects
        -remove _irq ... except in md_flush_request() as there
           is no wait_event_lock() (with no _irq).  This can be
           cleaned up after appropriate changes to wait.h.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      85572d7c
  16. 14 10月, 2014 1 次提交
  17. 09 4月, 2014 1 次提交
    • N
      md/bitmap: don't abuse i_writecount for bitmap files. · 035328c2
      NeilBrown 提交于
      md bitmap code currently tries to use i_writecount to stop any other
      process from writing to out bitmap file.  But that is really an abuse
      and has bit-rotted so locking is all wrong.
      
      So discard that - root should be allowed to shoot self in foot.
      
      Still use it in a much less intrusive way to stop the same file being
      used as bitmap on two different array, and apply other checks to
      ensure the file is at least vaguely usable for bitmap storage
      (is regular, is open for write.  Support for ->bmap is already checked
      elsewhere).
      Reported-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      035328c2
  18. 14 1月, 2014 1 次提交
    • N
      md: fix problem when adding device to read-only array with bitmap. · 8313b8e5
      NeilBrown 提交于
      If an array is started degraded, and then the missing device
      is found it can be re-added and a minimal bitmap-based recovery
      will bring it fully up-to-date.
      
      If the array is read-only a recovery would not be allowed.
      But also if the array is read-only and the missing device was
      present very recently, then there could be no need for any
      recovery at all, so we simply include the device in the read-only
      array without any recovery.
      
      However... if the missing device was removed a little longer ago
      it could be missing some updates, but if a bitmap is present it will
      be conditionally accepted pending a bitmap-based update.  We don't
      currently detect this case properly and will include that old
      device into the read-only array with no recovery even though it really
      needs a recovery.
      
      This patch keeps track of whether a bitmap-based-recovery is really
      needed or not in the new Bitmap_sync rdev flag.  If that is set,
      then the device will not be added to a read-only array.
      
      Cc: Andrei Warkentin <andreiw@vmware.com>
      Fixes: d70ed2e4
      Cc: stable@vger.kernel.org (3.2+)
      Signed-off-by: NNeilBrown <neilb@suse.de>
      8313b8e5
  19. 12 12月, 2013 1 次提交
    • T
      kernfs: s/sysfs_dirent/kernfs_node/ and rename its friends accordingly · 324a56e1
      Tejun Heo 提交于
      kernfs has just been separated out from sysfs and we're already in
      full conflict mode.  Nothing can make the situation any worse.  Let's
      take the chance to name things properly.
      
      This patch performs the following renames.
      
      * s/sysfs_elem_dir/kernfs_elem_dir/
      * s/sysfs_elem_symlink/kernfs_elem_symlink/
      * s/sysfs_elem_attr/kernfs_elem_file/
      * s/sysfs_dirent/kernfs_node/
      * s/sd/kn/ in kernfs proper
      * s/parent_sd/parent/
      * s/target_sd/target/
      * s/dir_sd/parent/
      * s/to_sysfs_dirent()/rb_to_kn()/
      * misc renames of local vars when they conflict with the above
      
      Because md, mic and gpio dig into sysfs details, this patch ends up
      modifying them.  All are sysfs_dirent renames and trivial.  While we
      can avoid these by introducing a dummy wrapping struct sysfs_dirent
      around kernfs_node, given the limited usage outside kernfs and sysfs
      proper, I don't think such workaround is called for.
      
      This patch is strictly rename only and doesn't introduce any
      functional difference.
      
      - mic / gpio renames were missing.  Spotted by kbuild test robot.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      324a56e1
  20. 09 11月, 2013 1 次提交
  21. 27 9月, 2013 1 次提交
    • T
      sysfs: clean up sysfs_get_dirent() · 388975cc
      Tejun Heo 提交于
      The pre-existing sysfs interfaces which take explicit namespace
      argument are weird in that they place the optional @ns in front of
      @name which is contrary to the established convention.  For example,
      we end up forcing vast majority of sysfs_get_dirent() users to do
      sysfs_get_dirent(parent, NULL, name), which is silly and error-prone
      especially as @ns and @name may be interchanged without causing
      compilation warning.
      
      This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the
      positions of @name and @ns, and sysfs_get_dirent() is now a wrapper
      around sysfs_get_dirent_ns().  This makes confusions a lot less
      likely.
      
      There are other interfaces which take @ns before @name.  They'll be
      updated by following patches.
      
      This patch doesn't introduce any functional changes.
      
      v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol
          error on module builds.  Reported by build test robot.  Fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      388975cc
  22. 27 8月, 2013 1 次提交
    • N
      md: avoid deadlock when dirty buffers during md_stop. · 260fa034
      NeilBrown 提交于
      When the last process closes /dev/mdX sync_blockdev will be called so
      that all buffers get flushed.
      So if it is then opened for the STOP_ARRAY ioctl to be sent there will
      be nothing to flush.
      
      However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
      moments before some other process which was writing closes their file
      descriptor, then there won't be a 'last close' and the buffers might
      not get flushed.
      
      So do_md_stop() calls sync_blockdev().  However at this point it is
      holding ->reconfig_mutex.  So if the array is currently 'clean' then
      the writes from sync_blockdev() will not complete until the array
      can be marked dirty and that won't happen until some other thread
      can get ->reconfig_mutex.  So we deadlock.
      
      We need to move the sync_blockdev() call to before we take
      ->reconfig_mutex.
      However then some other thread could open /dev/mdX and write to it
      after we call sync_blockdev() and before we actually stop the array.
      This can leave dirty data in the page cache which is awkward.
      
      So introduce new flag MD_STILL_CLOSED.  Set it before calling
      sync_blockdev(), clear it if anyone does open the file, and abort the
      STOP_ARRAY attempt if it gets set before we lock against further
      opens.
      
      It is still possible to get problems if you open /dev/mdX, write to
      it, then issue the STOP_ARRAY ioctl.  Just don't do that.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      260fa034