1. 18 Jun 2009, 1 commit
  2. 16 Jun 2009, 21 commits
  3. 10 Jun 2009, 1 commit
    • tracing/events: convert block trace points to TRACE_EVENT() · 55782138
      Authored by Li Zefan
      TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
      these new capabilities to this tracepoint:
      
        - zero-copy and per-cpu splice() tracing
        - binary tracing without printf overhead
        - structured logging records exposed under /debug/tracing/events
        - trace events embedded in function tracer output and other plugins
        - user-defined, per tracepoint filter expressions
        ...
      
      Cons:
      
        - no dev_t info for the output of plug, unplug_timer and unplug_io events.
          no dev_t info for getrq and sleeprq events if bio == NULL.
          no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
      
          This is mainly because we can't get the device from a request queue.
          But this may change in the future.
      
        - A packet command is converted to a string in TP_assign, not TP_print,
          while blktrace does the conversion just before output.
      
          Since pc requests should be rather rare, this is not a big issue.
      
        - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
          has a single format, which means we have some unused data in a trace entry.

          The overhead is minimized by using __dynamic_array() instead of __array();
          a sketch of such a definition follows below.
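
      To make the above concrete, here is a minimal sketch of what such a
      TRACE_EVENT() definition can look like, with __dynamic_array() holding the
      hex dump of a packet command.  The field layout and the blk_cmd_buf_len() /
      blk_dump_cmd() helpers are illustrative assumptions, not a copy of
      include/trace/events/block.h:

      /* Illustrative sketch only -- not the exact in-tree definition. */
      TRACE_EVENT(block_rq_complete_sketch,

              TP_PROTO(struct request_queue *q, struct request *rq),

              TP_ARGS(q, rq),

              TP_STRUCT__entry(
                      __field(dev_t,          dev)
                      __field(sector_t,       sector)
                      __field(unsigned int,   nr_sector)
                      __array(char,           rwbs, 6)
                      /* sized per event: tiny for fs requests, larger for pc */
                      __dynamic_array(char,   cmd, blk_cmd_buf_len(rq))
              ),

              TP_fast_assign(
                      __entry->dev       = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
                      __entry->sector    = blk_rq_pos(rq);
                      __entry->nr_sector = blk_rq_sectors(rq);
                      blk_fill_rwbs_rq(__entry->rwbs, rq);
                      /* the string is built at assign time, not at print time */
                      blk_dump_cmd((char *)__get_dynamic_array(cmd), rq);
              ),

              TP_printk("%d,%d %s (%s) %llu + %u",
                        MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
                        (char *)__get_dynamic_array(cmd),
                        (unsigned long long)__entry->sector, __entry->nr_sector)
      );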
      
      I've benchmarked the ioctl blktrace vs the splice-based TRACE_EVENT tracing:
      
            dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
      1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
      2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
      3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s
      
      So the overhead of tracing is very small, and there is no regression when
      using these trace events instead of blktrace.
      
      And the binary output of TRACE_EVENT is much smaller than that of blktrace:
      
       # ls -l -h
       -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
       -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
       -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
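
      A splice-based capture of these events can be driven by a small user-space
      reader like the sketch below (not necessarily the tool used for the
      benchmark above).  It assumes debugfs is mounted on /debug as in the
      examples here, that the block events are enabled via events/block/enable,
      and that each CPU exposes per_cpu/cpuN/trace_pipe_raw; only CPU 0 is
      handled and error handling is trimmed:

      /* Sketch: enable block events, then splice the raw per-cpu trace pages
       * straight into an output file without copying them through user space. */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void)
      {
              int enable = open("/debug/tracing/events/block/enable", O_WRONLY);
              if (enable >= 0) {
                      write(enable, "1", 1);
                      close(enable);
              }

              int in  = open("/debug/tracing/per_cpu/cpu0/trace_pipe_raw", O_RDONLY);
              int out = open("trace_splice.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
              int pfd[2];

              if (in < 0 || out < 0 || pipe(pfd) < 0) {
                      perror("setup");
                      return 1;
              }

              for (;;) {
                      /* kernel buffer -> pipe -> file, page by page */
                      ssize_t n = splice(in, NULL, pfd[1], NULL, 65536, SPLICE_F_MOVE);
                      if (n <= 0)
                              break;
                      splice(pfd[0], NULL, out, NULL, n, SPLICE_F_MOVE);
              }
              return 0;
      }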
      
      Following are some comparisons between TRACE_EVENT and blktrace:
      
      plug:
        kjournald-480   [000]   303.084981: block_plug: [kjournald]
        kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]
      
      unplug_io:
        kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
        kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1
      
      remap:
        kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
        kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384
      
      bio_backmerge:
        kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
        kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]
      
      getrq:
        kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]
      
        bash-2066  [001]  1072.953770:   8,0    G   N [bash]
        bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
      
      rq_complete:
        konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
        konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]
      
        ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
        ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
      
      rq_insert:
        kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]
      
      Changelog from v2 -> v3:
      
      - use the newly introduced __dynamic_array().
      
      Changelog from v1 -> v2:
      
      - use __string() instead of __array() to minimize the memory required
        to store hex dump of rq->cmd().
      
      - support large pc requests.
      
      - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
      
      - some cleanups.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  4. 09 Jun 2009, 4 commits
    • md/raid5: fix bug in reshape code when chunk_size decreases. · 0e6e0271
      Authored by NeilBrown
      Now that we support changing the chunk size, we calculate
      "reshape_sectors" to be the maximum of the number of sectors in the old
      and the new chunk size.
      However there is one place where we still use 'chunksize'
      rather than 'reshape_sectors'.
      This causes a reshape that reduces the size of chunks to freeze.
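
      As an illustration of the pattern being fixed (a sketch of the idea only;
      the names and the loop below are not the actual raid5.c code):

      typedef unsigned long long sector_t;

      void reshape_one_window(sector_t sector, sector_t len);   /* hypothetical */

      static void reshape_loop(sector_t start, sector_t end,
                               sector_t old_chunk_sectors,
                               sector_t new_chunk_sectors)
      {
              /* The reshape must step by the larger of the two chunk sizes. */
              sector_t reshape_sectors = old_chunk_sectors > new_chunk_sectors
                                       ? old_chunk_sectors : new_chunk_sectors;
              sector_t sector;

              for (sector = start; sector < end; sector += reshape_sectors) {
                      /* Any leftover use of the old chunk size in this
                       * bookkeeping can stall the reshape once the new chunk
                       * is smaller; that is the kind of bug this patch fixes. */
                      reshape_one_window(sector, reshape_sectors);
              }
      }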
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5 - avoid deadlocks in get_active_stripe during reshape · a8c906ca
      Authored by NeilBrown
      md has functionality to 'quiesce' an array so that all pending
      IO completes and no new IO starts.  This is used to achieve a
      stable state before making internal changes.
      
      Currently this quiescing applies equally to normal IO, resync
      IO, and reshape IO.
      However there is a problem with applying it to reshape IO.
      Reshape can have multiple 'stripe_heads' that must be active together.
      If the quiesce comes between allocating the first and the last of
      such a collection, then we deadlock, as the last will not be allocated
      until the quiesce is lifted, the quiesce will not be lifted until the
      first (which has been allocated) gets used, and that first cannot be
      used until the last is allocated.
      
      It is not necessary to inhibit reshape IO when a quiesce is
      requested.  Those places in the code that require a full quiesce will
      ensure the reshape thread is not running at all.
      
      So allow reshape requests to get access to new stripe_heads without
      being blocked by a 'quiesce'.
      
      This only affects in-place reshapes (i.e. where the array does not
      grow or shrink) and these are only newly supported.  So this patch is
      not needed in earlier kernels.
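
      A sketch of the idea (the names below are illustrative, not the actual
      raid5.c code): stripe allocation normally waits for a quiesce to be lifted,
      and the fix is to let callers doing reshape bypass that wait, since the
      quiesce itself cannot complete until their whole collection of stripe_heads
      has been allocated:

      static struct stripe_head *get_stripe_sketch(struct raid5_private *conf,
                                                   sector_t sector, int for_reshape)
      {
              /* normal IO waits out the quiesce; reshape IO must not */
              wait_event(conf->wait_for_stripe,
                         conf->quiesce == 0 || for_reshape);

              return find_or_alloc_stripe(conf, sector);   /* hypothetical helper */
      }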
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: use conf->raid_disks in preference to mddev->raid_disk · f001a70c
      Authored by NeilBrown
      mddev->raid_disks can be changed at any time by a request from
      user-space.  It is a suggestion as to what number of raid_disks is
      desired.
      
      conf->raid_disks can only be changed by the raid5 module with suitable
      locks in place.  It is a statement as to the current number of
      raid_disks.
      
      There are two places where the latter should be used, but the former
      is used instead.  This can lead to a crash when reshaping an array.

      This patch changes those uses of mddev-> to conf->.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Revert "block: Fix bounce limit setting in DM" · 9df1bb9b
      Authored by Jens Axboe
      This reverts commit a05c0205.
      
      DM doesn't need to access the bounce_pfn directly.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  5. 03 Jun 2009, 1 commit
  6. 27 May 2009, 1 commit
  7. 26 May 2009, 7 commits
  8. 23 May 2009, 2 commits
  9. 07 May 2009, 2 commits
    • md: remove rd%d links immediately after stopping an array. · c4647292
      Authored by NeilBrown
      md maintains links in sys/mdXX/md/ to identify which device has
      which role in the array, e.g.
         rd2 -> dev-sda
      
      indicates that the device with role '2' in the array is sda.
      
      These links are only present when the array is active.  They are
      created immediately after ->run is called, and so should be removed
      immediately after ->stop is called.
      However they are currently removed a little bit later, and it is
      possible for ->run to be called again, thus adding these links, before
      they are removed.
      
      So move the removal earlier so they are consistently only present when
      the array is active.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: remove ability to explicitly set an inactive array to 'clean'. · 5bf29597
      Authored by NeilBrown
      Being able to write 'clean' to an 'array_state' of an inactive array
      to activate it in 'clean' mode is both unnecessary and inconvenient.
      
      It is unnecessary because the same can be achieved by writing
      'active'.  This activates an array, but it still remains 'clean'
      until the first write.
      
      It is inconvenient because writing 'clean' is more often used to
      cause an 'active' array to revert to 'clean' mode (thus blocking
      any writes until a 'write-pending' is promoted to 'active').
      
      Allowing 'clean' to both activate an array and mark an active array as
      clean can lead to races:  One program writes 'clean' to mark the
      active array as clean at the same time as another program writes
      'inactive' to deactivate (stop) an active array.  Depending on which
      write lands first, the array could be deactivated and immediately
      reactivated, which isn't what was desired.
      
      So just disable the use of 'clean' to activate an array.
      
      This avoids a race that can be triggered with mdadm-3.0 and external
      metadata, so it is suitable for -stable.
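
      For context, a small user-space sketch of the intended usage after this
      change (the sysfs path is the usual md one but should be treated as an
      assumption, and error handling is trimmed): write 'active' to bring up an
      inactive array, which then stays clean until the first write, and use
      'clean' only to revert an array that is already active:

      #include <fcntl.h>
      #include <string.h>
      #include <unistd.h>

      static void write_array_state(const char *state)
      {
              int fd = open("/sys/block/md0/md/array_state", O_WRONLY);
              if (fd >= 0) {
                      write(fd, state, strlen(state));
                      close(fd);
              }
      }

      int main(void)
      {
              write_array_state("active");  /* activate; the array remains clean */
              /* ... later, on the active array ... */
              write_array_state("clean");   /* block writes until write-pending -> active */
              return 0;
      }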
      Reported-by: Rafal Marszewski <rafal.marszewski@intel.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NeilBrown <neilb@suse.de>