1. 26 7月, 2010 1 次提交
    • N
      md: be more careful setting MD_CHANGE_CLEAN · 676e42d8
      NeilBrown 提交于
      When MD_CHANGE_CLEAN is set we might block in md_write_start.
      So we should only set it when fairly sure that something will clear
      it.
      
      There are two places where it is set so as to encourage a metadata
      update to record the progress of resync/recovery.  This should only
      be done if the internal metadata update mechanisms are in use, which
      can be tested by by inspecting '->persistent'.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      676e42d8
  2. 22 5月, 2010 2 次提交
    • C
      sanitize vfs_fsync calling conventions · 8018ab05
      Christoph Hellwig 提交于
      Now that the last user passing a NULL file pointer is gone we can remove
      the redundant dentry argument and associated hacks inside vfs_fsynmc_range.
      
      The next step will be removig the dentry argument from ->fsync, but given
      the luck with the last round of method prototype changes I'd rather
      defer this until after the main merge window.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8018ab05
    • E
      sysfs: Implement sysfs tagged directory support. · 3ff195b0
      Eric W. Biederman 提交于
      The problem.  When implementing a network namespace I need to be able
      to have multiple network devices with the same name.  Currently this
      is a problem for /sys/class/net/*, /sys/devices/virtual/net/*, and
      potentially a few other directories of the form /sys/ ... /net/*.
      
      What this patch does is to add an additional tag field to the
      sysfs dirent structure.  For directories that should show different
      contents depending on the context such as /sys/class/net/, and
      /sys/devices/virtual/net/ this tag field is used to specify the
      context in which those directories should be visible.  Effectively
      this is the same as creating multiple distinct directories with
      the same name but internally to sysfs the result is nicer.
      
      I am calling the concept of a single directory that looks like multiple
      directories all at the same path in the filesystem tagged directories.
      
      For the networking namespace the set of directories whose contents I need
      to filter with tags can depend on the presence or absence of hotplug
      hardware or which modules are currently loaded.  Which means I need
      a simple race free way to setup those directories as tagged.
      
      To achieve a reace free design all tagged directories are created
      and managed by sysfs itself.
      
      Users of this interface:
      - define a type in the sysfs_tag_type enumeration.
      - call sysfs_register_ns_types with the type and it's operations
      - sysfs_exit_ns when an individual tag is no longer valid
      
      - Implement mount_ns() which returns the ns of the calling process
        so we can attach it to a sysfs superblock.
      - Implement ktype.namespace() which returns the ns of a syfs kobject.
      
      Everything else is left up to sysfs and the driver layer.
      
      For the network namespace mount_ns and namespace() are essentially
      one line functions, and look to remain that.
      
      Tags are currently represented a const void * pointers as that is
      both generic, prevides enough information for equality comparisons,
      and is trivial to create for current users, as it is just the
      existing namespace pointer.
      
      The work needed in sysfs is more extensive.  At each directory
      or symlink creating I need to check if the directory it is being
      created in is a tagged directory and if so generate the appropriate
      tag to place on the sysfs_dirent.  Likewise at each symlink or
      directory removal I need to check if the sysfs directory it is
      being removed from is a tagged directory and if so figure out
      which tag goes along with the name I am deleting.
      
      Currently only directories which hold kobjects, and
      symlinks are supported.  There is not enough information
      in the current file attribute interfaces to give us anything
      to discriminate on which makes it useless, and there are
      no potential users which makes it an uninteresting problem
      to solve.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NBenjamin Thery <benjamin.thery@bull.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      3ff195b0
  3. 18 5月, 2010 3 次提交
  4. 14 12月, 2009 10 次提交
  5. 16 10月, 2009 1 次提交
  6. 23 9月, 2009 1 次提交
  7. 26 5月, 2009 1 次提交
    • N
      md: bitmap: improve bitmap maintenance code. · be512691
      NeilBrown 提交于
      The code for checking which bits in the bitmap can be cleared
      has 2 problems:
       1/ it repeatedly takes and drops a spinlock, where it would make
          more sense to just hold on to it most of the time.
       2/ it doesn't make use of some opportunities to skip large sections
          of the bitmap
      
      This patch fixes those.  It will only affect CPU consumption, not
      correctness.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      be512691
  8. 23 5月, 2009 1 次提交
  9. 07 5月, 2009 2 次提交
    • N
      md: fix some (more) errors with bitmaps on devices larger than 2TB. · db305e50
      NeilBrown 提交于
      If a write intent bitmap covers more than 2TB, we sometimes work with
      values beyond 32bit, so these need to be sector_t.  This patches
      add the required casts to some unsigned longs that are being shifted
      up.
      
      This will affect any raid10 larger than 2TB, or any raid1/4/5/6 with
      member devices that are larger than 2TB.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reported-by: N"Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
      Cc: stable@kernel.org
      db305e50
    • N
      md: fix loading of out-of-date bitmap. · b74fd282
      NeilBrown 提交于
      When md is loading a bitmap which it knows is out of date, it fills
      each page with 1s and writes it back out again.  However the
      write_page call makes used of bitmap->file_pages and
      bitmap->last_page_size which haven't been set correctly yet.  So this
      can sometimes fail.
      
      Move the setting of file_pages and last_page_size to before the call
      to write_page.
      
      This bug can cause the assembly on an array to fail, thus making the
      data inaccessible.  Hence I think it is a suitable candidate for
      -stable.
      
      Cc: stable@kernel.org
      Reported-by: NVojtech Pavlik <vojtech@suse.cz>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b74fd282
  10. 20 4月, 2009 1 次提交
  11. 14 4月, 2009 1 次提交
    • N
      md: improve usefulness and accuracy of sysfs file md/sync_completed. · acb180b0
      NeilBrown 提交于
      The sync_completed file reports how much of a resync (or recovery or
      reshape) has been completed.
      However due to the possibility of out-of-order completion of writes,
      it is not certain to be accurate.
      
      We have an internal value - mddev->curr_resync_completed - which is an
      accurate value (though it might not always be quite so uptodate).
      
      So:
       - make curr_resync_completed be uptodate a little more often,
         particularly when raid5 reshape updates status in the metadata
       - report curr_resync_completed in the sysfs file
       - allow poll/select to report all updates to md/sync_completed.
      
      This makes sync_completed completed usable by any external metadata
      handler that wants to record this status information in its metadata.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      acb180b0
  12. 31 3月, 2009 8 次提交
    • A
      md: Make mddev->size sector-based. · 58c0fed4
      Andre Noll 提交于
      This patch renames the "size" field of struct mddev_s to "dev_sectors"
      and stores the number of 512-byte sectors instead of the number of
      1K-blocks in it.
      
      All users of that field, including raid levels 1,4-6,10, are adjusted
      accordingly. This simplifies the code a bit because it allows to get
      rid of a couple of divisions/multiplications by two.
      
      In order to make checkpatch happy, some minor coding style issues
      have also been addressed. In particular, size_store() now uses
      strict_strtoull() instead of simple_strtoull().
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      58c0fed4
    • N
      md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash · 97e4f42d
      NeilBrown 提交于
      Version 1.x metadata has the ability to record the status of a
      partially completed drive recovery.
      However we only update that record on a clean shutdown.
      It would be nice to update it on unclean shutdowns too, particularly
      when using a bitmap that removes much to the 'sync' effort after an
      unclean shutdown.
      
      One complication with checkpointing recovery is that we only know
      where we are up to in terms of IO requests started, not which ones
      have completed.  And we need to know what has completed to record
      how much is recovered.  So occasionally pause the recovery until all
      submitted requests are completed, then update the record of where
      we are up to.
      
      When we have a bitmap, we already do that pause occasionally to keep
      the bitmap up-to-date.  So enhance that code to record the recovery
      offset and schedule a superblock update.
      And when there is no bitmap, just pause 16 times during the resync to
      do a checkpoint.
      '16' is a fairly arbitrary number.  But we don't really have any good
      way to judge how often is acceptable, and it seems like a reasonable
      number for now.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      97e4f42d
    • N
      md: move md_k.h from include/linux/raid/ to drivers/md/ · 43b2e5d8
      NeilBrown 提交于
      It really is nicer to keep related code together..
      Signed-off-by: NNeilBrown <neilb@suse.de>
      43b2e5d8
    • N
      md: move lots of #include lines out of .h files and into .c · bff61975
      NeilBrown 提交于
      This makes the includes more explicit, and is preparation for moving
      md_k.h to drivers/md/md.h
      
      Remove include/raid/md.h as its only remaining use was to #include
      other files.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      bff61975
    • C
      md: move headers out of include/linux/raid/ · ef740c37
      Christoph Hellwig 提交于
      Move the headers with the local structures for the disciplines and
      bitmap.h into drivers/md/ so that they are more easily grepable for
      hacking and not far away.  md.h is left where it is for now as there
      are some uses from the outside.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      ef740c37
    • N
      md: write bitmap information to devices that are undergoing recovery. · 355a43e6
      NeilBrown 提交于
      When we add some spares to an array and start recovery, and we have
      a bitmap which is stored 'internally' on all devices, we call
      bitmap_write_all to make sure the bitmap is correct on the new
      device(s).
      However that doesn't work as write_sb_page only writes to
      'In_sync' devices, and devices undergoing recovery are not
      'In_sync' until recovery finishes.
      
      So extend write_sb_page (actually next_active_rdev) to include devices
      that are under recovery.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      355a43e6
    • N
      md: never clear bit from the write-intent bitmap when the array is degraded. · d0a4bb49
      NeilBrown 提交于
      
      It is safe to clear a bit from the write-intent bitmap for a raid1
      if we know the data has been written to all devices, which is
      what the current test does.
      
      But it is not always safe to update the 'events_cleared' counter in
      that case.  This is because one request could complete successfully
      after some other request has partially failed.
      
      So simply disable the clearing and updating of events_cleared whenever
      the array is degraded.  This might end up not clearing some bits that
      could safely be cleared, but it is safest approach.
      
      Note that the bug fixed here did not risk corrupting data by letting
      the array get out-of-sync.  Rather it meant that when a device is
      removed and re-added to the array, it might incorrectly require a full
      recovery rather than just recovering based on the bitmap.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d0a4bb49
    • N
      md: Allow write-intent bitmaps to have chunksize < PAGE_SIZE · 1187cf0a
      NeilBrown 提交于
      md currently insists that the chunk size used for write-intent
      bitmaps (the amount of data that corresponds to one chunk)
      be at least one page.
      
      The reason for this restriction is lost in the mists of time,
      but a review of the code (and a vague memory) suggests that the only
      problem would be related to resync.  Resync tries very hard to
      work in multiples of a page, but also needs to sync with units
      of a bitmap_chunk too.
      
      This connection comes out in the bitmap_start_sync call.
      
      So change bitmap_start_sync to always work in multiples of a page.
      If the bitmap chunk size is less that one page, we flag multiple
      chunks as 'syncing' and generally make them all appear to the
      resync routines like one chunk.
      
      All other code either already works with data ranges that could
      span multiple chunks, or explicitly only cares about a single chunk.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      1187cf0a
  13. 09 1月, 2009 2 次提交
    • C
      md: use list_for_each_entry macro directly · 159ec1fc
      Cheng Renquan 提交于
      The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
      list_for_each_entry_safe, from <linux/list.h>, it should be defined to
      use list_for_each_entry_safe, instead of reinventing the wheel.
      
      But some calls to each_entry_safe don't really need a safe version,
      just a direct list_for_each_entry is enough, this could save a temp
      variable (tmp) in every function that used rdev_for_each.
      
      In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
      totally save many tmp vars; and only in the other situations that will call
      list_del to delete an entry, the safe version is used.
      Signed-off-by: NCheng Renquan <crquan@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      159ec1fc
    • N
      md: fix bitmap-on-external-file bug. · 53845270
      NeilBrown 提交于
      commit a2ed9615
      fixed a bug with 'internal' bitmaps, but in the process broke
      'in a file' bitmaps.  So they are broken in 2.6.28
      
      This fixes it, and needs to go in 2.6.28-stable.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
      53845270
  14. 19 12月, 2008 1 次提交
    • N
      md: Don't read past end of bitmap when reading bitmap. · a2ed9615
      NeilBrown 提交于
      When we read the write-intent-bitmap off the device, we currently
      read a whole number of pages.
      When PAGE_SIZE is 4K, this works due to the alignment we enforce
      on the superblock and bitmap.
      When PAGE_SIZE is 64K, this case read past the end-of-device
      which causes an error.
      
      When we write the superblock, we ensure to clip the last page
      to just be the required size.  Copy that code into the read path
      to just read the required number of sectors.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Cc: stable@kernel.org
      a2ed9615
  15. 01 9月, 2008 1 次提交
    • N
      Fix problem with waiting while holding rcu read lock in md/bitmap.c · b2d2c4ce
      NeilBrown 提交于
      A recent patch to protect the rdev list with rcu locking leaves us
      with a problem because we can sleep on memalloc while holding the
      rcu lock.
      
      The rcu lock is only needed while walking the linked list as
      uninteresting devices (failed or spares) can be removed at any time.
      
      So only take the rcu lock while actually walking the linked list.
      Take a refcount on the rdev during the time when we drop the lock
      and do the memalloc to start IO.
      When we return to the locked code, all the interesting devices
      on the list will not have moved, so we can simply use
      list_for_each_continue_rcu to pick up where we left off.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b2d2c4ce
  16. 02 8月, 2008 1 次提交
  17. 21 7月, 2008 1 次提交
    • N
      md: Protect access to mddev->disks list using RCU · 4b80991c
      NeilBrown 提交于
      All modifications and most access to the mddev->disks list are made
      under the reconfig_mutex lock.  However there are three places where
      the list is walked without any locking.  If a reconfig happens at this
      time, havoc (and oops) can ensue.
      
      So use RCU to protect these accesses:
        - wrap them in rcu_read_{,un}lock()
        - use list_for_each_entry_rcu
        - add to the list with list_add_rcu
        - delete from the list with list_del_rcu
        - delay the 'free' with call_rcu rather than schedule_work
      
      Note that export_rdev did a list_del_init on this list.  In almost all
      cases the entry was not in the list anymore so it was a no-op and so
      safe.  It is no longer safe as after list_del_rcu we may not touch
      the list_head.
      An audit shows that export_rdev is called:
        - after unbind_rdev_from_array, in which case the delete has
           already been done,
        - after bind_rdev_to_array fails, in which case the delete isn't needed.
        - before the device has been put on a list at all (e.g. in
            add_new_disk where reading the superblock fails).
        - and in autorun devices after a failure when the device is on a
            different list.
      
      So remove the list_del_init call from export_rdev, and add it back
      immediately before the called to export_rdev for that last case.
      
      Note also that ->same_set is sometimes used for lists other than
      mddev->list (e.g. candidates).  In these cases rcu is not needed.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      4b80991c
  18. 11 7月, 2008 1 次提交
  19. 28 6月, 2008 1 次提交
    • N
      Improve setting of "events_cleared" for write-intent bitmaps. · a0da84f3
      Neil Brown 提交于
      When an array is degraded, bits in the write-intent bitmap are not
      cleared, so that if the missing device is re-added, it can be synced
      by only updated those parts of the device that have changed since
      it was removed.
      
      The enable this a 'events_cleared' value is stored. It is the event
      counter for the array the last time that any bits were cleared.
      
      Sometimes - if a device disappears from an array while it is 'clean' -
      the events_cleared value gets updated incorrectly (there are subtle
      ordering issues between updateing events in the main metadata and the
      bitmap metadata) resulting in the missing device appearing to require
      a full resync when it is re-added.
      
      With this patch, we update events_cleared precisely when we are about
      to clear a bit in the bitmap.  We record events_cleared when we clear
      the bit internally, and copy that to the superblock which is written
      out before the bit on storage.  This makes it more "obviously correct".
      
      We also need to update events_cleared when the event_count is going
      backwards (as happens on a dirty->clean transition of a non-degraded
      array).
      
      Thanks to Mike Snitzer for identifying this problem and testing early
      "fixes".
      
      Cc:  "Mike Snitzer" <snitzer@gmail.com>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      a0da84f3