1. 31 Mar, 2009 27 commits
    • md: add ->takeover method to support changing the personality managing an array · 245f46c2
      NeilBrown committed
      Implement this for RAID6 to be able to 'takeover' a RAID5 array.  The
      new RAID6 will use a layout which places Q on the last device, and
      that device will be missing.
      If there are any available spares, one will immediately have Q
      recovered onto it.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: enable suspend/resume of md devices. · 409c57f3
      NeilBrown committed
      To be able to change the 'level' of an md/raid array, we need to
      suspend the device so that no requests are active - then move some
      pointers around etc.
      
      The code already keeps counts of active requests and the ->quiesce
      function can be used to wait until those counts hit zero.
      However the quiesce function blocks new requests once they are
      already 'inside' the personality module, and that is too late if we want
      to replace the personality modules.
      
      So make all md requests come in through a common md_make_request
      function that keeps track of how many requests have entered the
      modules but may not yet be on the internal reference counts.
      Allow md_make_request to be blocked when we want to suspend the
      device, and make it possible to wait for all those in-transit requests
      to be added to internal lists so that ->quiesce can wait for them.
      
      There is still a problem that when a request completes, we drop the
      ref count inside the personality code so there is a short time between
      when the refcount hits zero, and when the personality code is no
      longer being used.
      The personality code never blocks (schedule or spinlock) between
      dropping the refcount and exiting the routine, so this should be safe
      (as put_module calls synchronize_sched() before unmapping the module
      code).
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: md_unregister_thread should cope with being passed NULL · e0cf8f04
      NeilBrown committed
      Mostly md_unregister_thread is only called when we know that the
      thread is not NULL, but sometimes we need to check first.  It is safer
      to put the check inside md_unregister_thread itself.
      Signed-off-by: NeilBrown <neilb@suse.de>
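The NULL-tolerant teardown described above can be sketched in user-space C. This is only an illustration of the pattern, not the kernel code: `struct worker` and `worker_unregister` are hypothetical names, and `free()` stands in for whatever the real function does to stop and release the thread.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's thread descriptor. */
struct worker {
    int running;
};

/*
 * Tear down a worker. A NULL pointer is accepted and ignored, so callers
 * do not need their own check (the same convention as free(NULL)).
 * Returns 1 if a worker was released, 0 if there was nothing to do.
 */
static int worker_unregister(struct worker *w)
{
    if (!w)
        return 0;
    w->running = 0;
    free(w);
    return 1;
}
```

Putting the check in one place means a caller on an error path can call the teardown unconditionally.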
    • md/raid5: refactor raid5 "run" · 91adb564
      NeilBrown committed
      .. so that the code to create the private data structures is separate.
      This will help with future code to change the level of an active
      array.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make sure new_level, new_chunksize, new_layout always have sensible values. · 34817e8c
      NeilBrown committed
      When an md array is undergoing a change, we have new_* fields that
      show the new values.
      When no change is happening, it is least confusing if these have
      the same value as the normal fields.
      This is true in most cases, but not when the values are set via sysfs.
      
      So fix this up.
      
      A subsequent patch will BUG_ON if these things aren't consistent.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: finish support for DDF/raid6 · 67cc2b81
      NeilBrown committed
      DDF requires RAID6 calculations over different devices in a different
      order.
      For md/raid6, we calculate over just the data devices, starting
      immediately after the 'Q' block.
      For ddf/raid6 we calculate over all devices, using zeros in place of
      the P and Q blocks.
      
      This requires unfortunately complex loops...
      Signed-off-by: NeilBrown <neilb@suse.de>
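Why the order of devices matters for Q but not for P can be seen in a byte-sized sketch. It assumes the usual RAID6 construction (P is the XOR of the data, Q is the sum of g^i * D_i over GF(2^8) with generator 2 and polynomial 0x11d); the function names are hypothetical and this is not the md implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply by the generator {02} in GF(2^8), reduction polynomial 0x11d. */
static uint8_t gf_mul2(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
}

/* P is a plain XOR, so it does not depend on the order of the blocks. */
static uint8_t compute_p(const uint8_t *d, size_t n)
{
    uint8_t p = 0;
    for (size_t i = 0; i < n; i++)
        p ^= d[i];
    return p;
}

/*
 * Q = sum over i of g^i * d[i], evaluated with Horner's rule starting
 * from the highest index. Each block is weighted by its position, so
 * permuting the blocks generally changes the result.
 */
static uint8_t compute_q(const uint8_t *d, size_t n)
{
    uint8_t q = 0;
    for (size_t i = n; i-- > 0; )
        q = (uint8_t)(gf_mul2(q) ^ d[i]);
    return q;
}
```

With one byte per device, the data {1, 2, 3} and the reversed data {3, 2, 1} give the same P but different Q values, which is why two formats that sum the same blocks in a different order are not interchangeable.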
    • md/raid5: Add support for new layouts for raid5 and raid6. · 99c0fb5f
      NeilBrown committed
      DDF uses different layouts for P and Q blocks than current md/raid6
      so add those that are missing.
      Also add support for RAID6 layouts that are identical to various
      raid5 layouts with the simple addition of one device to hold all of
      the 'Q' blocks.
      Finally add 'raid5' layouts to match raid4.
      These last two will allow online level conversion.
      
      Note that this does not provide correct support for DDF/raid6 yet
      as the order in which data blocks are summed to produce the Q block
      is significant and different between current md code and DDF
      requirements.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: simplify raid5_compute_sector interface · 911d4ee8
      NeilBrown committed
      Rather than passing 'pd_idx' and 'qd_idx' to be filled in, pass
      a 'struct stripe_head *' and fill in the relevant fields.  This is
      more extensible.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid6: remove expectation that Q device is immediately after P device. · d0dabf7e
      NeilBrown committed
      
      Code currently assumes that the devices in a raid6 stripe are
        0 1 ... N-1 P Q
      in some rotated order.  We will shortly add new layouts in which
      this strict pattern is broken.
      So remove this expectation.  We still assume that the data disks
      are roughly in-order.  However P and Q can be inserted anywhere within
      that order.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: change raid5_compute_sector and stripe_to_pdidx to take a 'previous' argument · 112bf897
      NeilBrown committed
      This is similar to the recent change to get_active_stripe.
      There is no functional change, just some rearrangement to make
      future patches cleaner.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: simplify interface for init_stripe and get_active_stripe · b5663ba4
      NeilBrown committed
      Rather than passing 'pd_idx' and 'disks' to these functions, just pass
      'previous' which tells whether to use the 'previous' or 'current'
      geometry during a reshape, and let init_stripe calculate
      disks and pd_idx and anything else it might need.
      
      This is not a substantial simplification and even adds a division.
      However we will shortly be adding more complexity to init_stripe
      to handle more interesting 'reshape' activities, and without this
      change, the interface to these functions would get very complex.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Represent raid device size in sectors. · dd8ac336
      Andre Noll committed
      This patch renames the "size" field of struct mdk_rdev_s to
      "sectors" and changes this field to store sectors instead of
      blocks.
      
      All users of this field, linear.c, raid0.c and md.c, are fixed up
      accordingly which gets rid of many multiplications and divisions.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Make mddev->size sector-based. · 58c0fed4
      Andre Noll committed
      This patch renames the "size" field of struct mddev_s to "dev_sectors"
      and stores the number of 512-byte sectors instead of the number of
      1K-blocks in it.
      
      All users of that field, including raid levels 1,4-6,10, are adjusted
      accordingly. This simplifies the code a bit because it gets rid of
      a couple of divisions/multiplications by two.
      
      In order to make checkpatch happy, some minor coding style issues
      have also been addressed. In particular, size_store() now uses
      strict_strtoull() instead of simple_strtoull().
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
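The "divisions/multiplications by two" come from converting between 1K blocks and 512-byte sectors. A minimal sketch of that arithmetic follows; the helper names are hypothetical and `sector_t` here is a local stand-in for the kernel type.

```c
#include <stdint.h>

typedef uint64_t sector_t; /* local stand-in, not the kernel typedef */

/* One 1K block is two 512-byte sectors. */
static sector_t kib_to_sectors(uint64_t kib)
{
    return kib * 2;
}

/* Going back to 1K blocks rounds down when the sector count is odd. */
static uint64_t sectors_to_kib(sector_t sectors)
{
    return sectors / 2;
}
```

Storing the size in sectors means code that already works in sectors no longer needs these conversions at each use.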
    • md: be more consistent about setting WriteMostly flag when adding a drive to an array · 575a80fa
      NeilBrown committed
      When a drive is added to an array using ADD_NEW_DISK, there are two
      places we can get certain flags from:  the metadata on the disk or the
      flags passed through the IOCTL.
      
      For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value
      from either of those sources depending on if it is set (i.e. we
      effectively 'or' the two sources together).
      
      This makes it awkward to clear, and is at best inconsistent.
      
      As documented code (in mdadm) requires that setting
      MD_DISK_WRITEMOSTLY in the ioctl will be effective, we resolve the
      inconsistency by always using the value for this flag from the ioctl,
      and ignoring the value on disk.
      Signed-off-by: NeilBrown <neilb@suse.de>
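The difference between the old and new behaviour can be reduced to two tiny predicates. This is a sketch under the assumption that each source contributes a single boolean; the function names are hypothetical and do not appear in md.

```c
#include <stdbool.h>

/* Old behaviour: the flag is set if either source sets it. */
static bool write_mostly_old(bool on_disk, bool from_ioctl)
{
    return on_disk || from_ioctl;
}

/* New behaviour: only the ioctl value counts; the on-disk value is ignored. */
static bool write_mostly_new(bool on_disk, bool from_ioctl)
{
    (void)on_disk;
    return from_ioctl;
}
```

With the old rule a flag recorded on disk could not be cleared through the ioctl, since `write_mostly_old(true, false)` is still true; with the new rule the same call yields false.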
    • md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash · 97e4f42d
      NeilBrown committed
      Version 1.x metadata has the ability to record the status of a
      partially completed drive recovery.
      However we only update that record on a clean shutdown.
      It would be nice to update it on unclean shutdowns too, particularly
      when using a bitmap that removes much of the 'sync' effort after an
      unclean shutdown.
      
      One complication with checkpointing recovery is that we only know
      where we are up to in terms of IO requests started, not which ones
      have completed.  And we need to know what has completed to record
      how much is recovered.  So occasionally pause the recovery until all
      submitted requests are completed, then update the record of where
      we are up to.
      
      When we have a bitmap, we already do that pause occasionally to keep
      the bitmap up-to-date.  So enhance that code to record the recovery
      offset and schedule a superblock update.
      And when there is no bitmap, just pause 16 times during the resync to
      do a checkpoint.
      '16' is a fairly arbitrary number.  But we don't really have any good
      way to judge how often is acceptable, and it seems like a reasonable
      number for now.
      Signed-off-by: NeilBrown <neilb@suse.de>
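The "pause 16 times" rule for the no-bitmap case can be sketched as a simple progress test. The names and the exact comparison are assumptions made for illustration; the real code also has to wait for outstanding requests before recording the offset.

```c
#include <stdbool.h>
#include <stdint.h>

#define CHECKPOINTS_PER_RESYNC 16 /* the "fairly arbitrary" count from the text */

/*
 * Decide whether enough work has been submitted since the last recorded
 * checkpoint to justify pausing and recording a new one.
 * All quantities are in sectors.
 */
static bool should_checkpoint(uint64_t submitted, uint64_t last_checkpoint,
                              uint64_t total)
{
    return submitted > last_checkpoint &&
           (submitted - last_checkpoint) > total / CHECKPOINTS_PER_RESYNC;
}
```

For a 16000-sector recovery this asks for a checkpoint once more than 1000 sectors have been submitted past the last recorded position.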
    • md: move md_k.h from include/linux/raid/ to drivers/md/ · 43b2e5d8
      NeilBrown committed
      It really is nicer to keep related code together..
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: move lots of #include lines out of .h files and into .c · bff61975
      NeilBrown committed
      This makes the includes more explicit, and is preparation for moving
      md_k.h to drivers/md/md.h
      
      Remove include/raid/md.h as its only remaining use was to #include
      other files.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: move most content from md.h to md_k.h · 92022950
      NeilBrown committed
      The extern function definitions are kernel-internal definitions, so
      they belong in md_k.h
      
      The MD_*_VERSION values could reasonably go in a number of places,
      but md_u.h seems most reasonable.
      
      This leaves almost nothing in md.h.  It will go soon.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: move LEVEL_* definition from md_k.h to md_u.h · 8b2b5c21
      NeilBrown committed
      .. as they are part of the user-space interface.
      Also move MdpMinorShift into there so we can remove duplication.
      
      Lastly move mdp_major in.  It is less obviously part of the user-space
      interface, but do_mounts_md.c uses it, and it is acting a bit like
      user-space.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: move headers out of include/linux/raid/ · ef740c37
      Christoph Hellwig committed
      Move the headers with the local structures for the disciplines and
      bitmap.h into drivers/md/ so that they are more easily grepable for
      hacking and not far away.  md.h is left where it is for now as there
      are some uses from the outside.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • cleanup drivers/md/Makefile · 2a40a8ae
      Christoph Hellwig committed
      Use the -y variables instead of the old -objs so we can easily add
      conditional objects to the modules.  Also always use += to add
      subobjects to avoid problems when placing additional objects in
      some place in the file.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: stop defining MAJOR_NR · 3dbd8c2e
      Christoph Hellwig committed
      MAJOR_NR was only required for magic in linux/blk.h in 2.4 or earlier
      kernels, so no need to keep it around.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD data integrity support · 3f9d99c1
      Martin K. Petersen committed
      md: Add support for data integrity to MD

      If all subdevices support the same protection format the MD device is
      flagged as integrity capable.
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: write bitmap information to devices that are undergoing recovery. · 355a43e6
      NeilBrown committed
      When we add some spares to an array and start recovery, and we have
      a bitmap which is stored 'internally' on all devices, we call
      bitmap_write_all to make sure the bitmap is correct on the new
      device(s).
      However that doesn't work as write_sb_page only writes to
      'In_sync' devices, and devices undergoing recovery are not
      'In_sync' until recovery finishes.
      
      So extend write_sb_page (actually next_active_rdev) to include devices
      that are under recovery.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: never clear bit from the write-intent bitmap when the array is degraded. · d0a4bb49
      NeilBrown committed
      
      It is safe to clear a bit from the write-intent bitmap for a raid1
      if we know the data has been written to all devices, which is
      what the current test does.
      
      But it is not always safe to update the 'events_cleared' counter in
      that case.  This is because one request could complete successfully
      after some other request has partially failed.
      
      So simply disable the clearing and updating of events_cleared whenever
      the array is degraded.  This might end up not clearing some bits that
      could safely be cleared, but it is the safest approach.
      
      Note that the bug fixed here did not risk corrupting data by letting
      the array get out-of-sync.  Rather it meant that when a device is
      removed and re-added to the array, it might incorrectly require a full
      recovery rather than just recovering based on the bitmap.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Allow write-intent bitmaps to have chunksize < PAGE_SIZE · 1187cf0a
      NeilBrown committed
      md currently insists that the chunk size used for write-intent
      bitmaps (the amount of data that corresponds to one chunk)
      be at least one page.
      
      The reason for this restriction is lost in the mists of time,
      but a review of the code (and a vague memory) suggests that the only
      problem would be related to resync.  Resync tries very hard to
      work in multiples of a page, but also needs to sync with units
      of a bitmap_chunk too.
      
      This connection comes out in the bitmap_start_sync call.
      
      So change bitmap_start_sync to always work in multiples of a page.
      If the bitmap chunk size is less than one page, we flag multiple
      chunks as 'syncing' and generally make them all appear to the
      resync routines like one chunk.
      
      All other code either already works with data ranges that could
      span multiple chunks, or explicitly only cares about a single chunk.
      Signed-off-by: Neil Brown <neilb@suse.de>
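How many bitmap chunks end up grouped into one page-sized resync unit can be sketched with a little arithmetic. The 4096-byte page size and the function name are assumptions for illustration only; the kernel works with its own PAGE_SIZE and chunk representation.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for this sketch */

/*
 * Number of bitmap chunks that must be flagged together so that resync
 * always works on at least one full page of data.
 * Returns 0 for an invalid (zero) chunk size.
 */
static size_t chunks_per_sync_unit(size_t chunk_bytes)
{
    if (chunk_bytes == 0)
        return 0;
    if (chunk_bytes >= SKETCH_PAGE_SIZE)
        return 1;
    return (SKETCH_PAGE_SIZE + chunk_bytes - 1) / chunk_bytes;
}
```

For example, a 1024-byte chunk means four chunks are treated as one unit, while any chunk of a page or more is handled on its own as before.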
    • md: Fix is_mddev_idle test (again). · eea1bf38
      NeilBrown committed
      There are two problems with is_mddev_idle.
      
      1/ sync_io is 'atomic_t' and hence 'int'.  curr_events and all the
         rest are 'long'.
         So if sync_io were to wrap on a 64bit host, the value of
         curr_events would go very negative suddenly, and take a very
         long time to return to positive.
      
         So do all calculations as 'int'.  That gives us plenty of precision
         for what we need.
      
      2/ To initialise rdev->last_events we simply call is_mddev_idle, on
         the assumption that it will make sure that last_events is in a
         suitable range.  It used to do this, but now it does not.
         So now we need to be more explicit about initialisation.
      Signed-off-by: NeilBrown <neilb@suse.de>
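The first problem is a general hazard of mixing a wrapping 32-bit counter with wider arithmetic. The sketch below is a simplified illustration, not the is_mddev_idle code: it compares a difference taken after widening to 64 bits with one taken in 32-bit modular arithmetic, and both helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Difference computed after widening to 64 bits. If the 32-bit counter
 * wrapped between the two samples, the result is off by 2^32.
 */
static int64_t delta_wide(int32_t now, int32_t last)
{
    return (int64_t)now - (int64_t)last;
}

/*
 * Difference computed in 32-bit modular arithmetic. A wrap between the
 * two samples cancels out, so small real differences stay small.
 */
static int32_t delta_narrow(int32_t now, int32_t last)
{
    return (int32_t)((uint32_t)now - (uint32_t)last);
}
```

If the counter advances by 10 across the wrap point, `delta_narrow` reports 10 while `delta_wide` reports a large negative number, which matches the "very negative suddenly" behaviour described in point 1.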
  2. 10 Mar, 2009 7 commits
  3. 09 Mar, 2009 6 commits
    • Btrfs: fix spinlock assertions on UP systems · b9447ef8
      Chris Mason committed
      btrfs_tree_locked was being used to make sure a given extent_buffer was
      properly locked in a few places.  But, it wasn't correct for UP compiled
      kernels.
      
      This switches it to using assert_spin_locked instead, and renames it to
      btrfs_assert_tree_locked to better reflect how it was really being used.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Fix fixpoint divide exception in acct_update_integrals · 6d5b5acc
      Heiko Carstens committed
      Frans Pop reported the crash below when running an s390 kernel under Hercules:
      
        Kernel BUG at 000738b4 [verbose debug info unavailable]
        fixpoint divide exception: 0009 [#1] SMP
        Modules linked in: nfs lockd nfs_acl sunrpc ctcm fsm tape_34xx
           cu3088 tape ccwgroup tape_class ext3 jbd mbcache dm_mirror dm_log dm_snapshot
           dm_mod dasd_eckd_mod dasd_mod
        CPU: 0 Not tainted 2.6.27.19 #13
        Process awk (pid: 2069, task: 0f9ed9b8, ksp: 0f4f7d18)
        Krnl PSW : 070c1000 800738b4 (acct_update_integrals+0x4c/0x118)
                   R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0
        Krnl GPRS: 00000000 000007d0 7fffffff fffff830
                   00000000 ffffffff 00000002 0f9ed9b8
                   00000000 00008ca0 00000000 0f9ed9b8
                   0f9edda4 8007386e 0f4f7ec8 0f4f7e98
        Krnl Code: 800738aa: a71807d0         lhi     %r1,2000
                   800738ae: 8c200001         srdl    %r2,1
                   800738b2: 1d21             dr      %r2,%r1
                  >800738b4: 5810d10e         l       %r1,270(%r13)
                   800738b8: 1823             lr      %r2,%r3
                   800738ba: 4130f060         la      %r3,96(%r15)
                   800738be: 0de1             basr    %r14,%r1
                   800738c0: 5800f060         l       %r0,96(%r15)
        Call Trace:
        ([<000000000004fdea>] blocking_notifier_call_chain+0x1e/0x2c)
         [<0000000000038502>] do_exit+0x106/0x7c0
         [<0000000000038c36>] do_group_exit+0x7a/0xb4
         [<0000000000038c8e>] SyS_exit_group+0x1e/0x30
         [<0000000000021c28>] sysc_do_restart+0x12/0x16
         [<0000000077e7e924>] 0x77e7e924
      
      Reason for this is that cpu time accounting usually only happens from
      interrupt context, but acct_update_integrals gets also called from
      process context with interrupts enabled.
      
      So in acct_update_integrals we may end up with the following scenario:
      
      Between reading tsk->stime/tsk->utime and tsk->acct_timexpd an interrupt
      happens which updates accounting values.  This causes acct_timexpd to be
      greater than the former stime + utime.  The subsequent calculation of
      
      	dtime = cputime_sub(time, tsk->acct_timexpd);
      
      will be negative and the division performed by
      
      	cputime_to_jiffies(dtime)
      
      will generate an exception since the result won't fit into a 32 bit
      register.
      
      In order to fix this just always disable interrupts while accessing any
      of the accounting values.
      
      Reported by: Frans Pop <elendil@planet.nl>
      Tested by: Frans Pop <elendil@planet.nl>
      Cc: stable@kernel.org
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lguest: fix for CONFIG_SPARSE_IRQ=y · 6db6a5f3
      Rusty Russell committed
      Impact: remove lots of lguest boot WARN_ON() when CONFIG_SPARSE_IRQ=y
      
      We now need to call irq_to_desc_alloc_cpu() before
      set_irq_chip_and_handler_name(), but we can't do that from init_IRQ (no
      kmalloc available).
      
      So do it as we use interrupts instead.  Also means we only alloc for
      irqs we use, which was the intent of CONFIG_SPARSE_IRQ anyway.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Ingo Molnar <mingo@redhat.com>
    • lguest: fix crash 'unhandled trap 13 at <native_read_msr_safe>' · cbd88c8e
      Rusty Russell committed
      Impact: fix lguest boot crash on modern Intel machines
      
      The code in early_init_intel does:
      
      	if (c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xd)) {
      		u64 misc_enable;
      
      		rdmsrl(MSR_IA32_MISC_ENABLE, misc_enable);
      
      And that rdmsr faults (not allowed from non-0 PL).  We can get around
      this by mugging the family ID part of the cpuid.  5 seems like a good
      number.
      
      Of course, this is a hack (how very lguest!).  We could just indicate
      that we don't support MSRs, or implement lguest_rdmsr.
      Reported-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Tested-by: Patrick McHardy <kaber@trash.net>
    • 7a203f3b
    • Merge branch 'core-fixes-for-linus' of... · dbb9be8a
      Linus Torvalds committed
      Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
      
      * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
        rcu: increment quiescent state counter in ksoftirqd()