1. 22 Apr 2015: 21 commits
    • md/raid5: change ->inactive_blocked to a bit-flag. · 5423399a
      NeilBrown committed
      This allows us to easily add more (atomic) flags.
      Signed-off-by: NeilBrown <neilb@suse.de>
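      For context, here is a minimal user-space sketch (not the kernel code; the struct and flag names are invented) of the pattern this change enables: several boolean conditions packed into one flags word, so further flags can be added later and, in the kernel, manipulated with the atomic set_bit()/test_bit() helpers.

      #include <stdio.h>

      /* Illustrative only: a lone boolean field becomes one bit in a flags word. */
      enum cache_state_bits {
              STATE_INACTIVE_BLOCKED = 0,     /* was a standalone field before */
              STATE_SOME_FUTURE_FLAG = 1,     /* room for more flags later */
      };

      struct cache_state {
              unsigned long flags;
      };

      int main(void)
      {
              struct cache_state st = { .flags = 0 };

              st.flags |= 1UL << STATE_INACTIVE_BLOCKED;      /* set */
              if (st.flags & (1UL << STATE_INACTIVE_BLOCKED)) /* test */
                      printf("inactive_blocked is set\n");
              st.flags &= ~(1UL << STATE_INACTIVE_BLOCKED);   /* clear */
              return 0;
      }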
    • md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe · 486f0644
      NeilBrown committed
      Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
      succeeds, do it inside the functions.
      
      Also choose the correct hash to handle next inside the functions.
      
      This removes duplication and will help with future new uses of
      {grow,drop}_one_stripe.
      
      This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
      message always said "0kB".
      Signed-off-by: NeilBrown <neilb@suse.de>
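      A hedged user-space sketch of the refactoring pattern described above; the names echo the commit text, but none of this is the kernel code. The stripe counter and the choice of which hash to handle next move inside the grow/drop helpers, so call sites no longer duplicate that bookkeeping.

      #include <stdbool.h>
      #include <stdio.h>

      #define NR_HASH 8

      struct cache {
              int max_nr_stripes;     /* now updated inside the helpers */
              int next_hash;          /* helpers also pick the hash to handle next */
      };

      static bool grow_one_stripe(struct cache *c)
      {
              /* ... allocate a stripe for hash c->next_hash here ... */
              c->next_hash = (c->next_hash + 1) % NR_HASH;
              c->max_nr_stripes++;            /* callers no longer do this */
              return true;
      }

      static bool drop_one_stripe(struct cache *c)
      {
              if (c->max_nr_stripes == 0)
                      return false;
              c->next_hash = (c->next_hash + NR_HASH - 1) % NR_HASH;
              c->max_nr_stripes--;
              return true;
      }

      int main(void)
      {
              struct cache c = { 0, 0 };

              while (c.max_nr_stripes < 4 && grow_one_stripe(&c))
                      ;                       /* call sites stay simple */
              drop_one_stripe(&c);
              printf("stripes: %d\n", c.max_nr_stripes);
              return 0;
      }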
    • md/raid5: pass gfp_t arg to grow_one_stripe() · a9683a79
      NeilBrown committed
      This is needed for future improvement to stripe cache management.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: introduce configuration option rmw_level · d06f191f
      Markus Stockhausen committed
      Depending on the available coding, we allow optimized rmw logic for write
      operations. To make testing easier, this patch allows manual control of the
      rmw/rcw decision through the interface /sys/block/mdX/md/rmw_level.
      
      The configuration can handle three levels of control.
      
      rmw_level=0: Disable rmw for all RAID types. Hardware-assisted P/Q
      calculation has no implementation path yet to factor in/out chunks of
      a syndrome. Enforcing this level can be beneficial for slow CPUs with
      hardware syndrome support and fast SSDs.
      
      rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
      save IOs. This equals the "old" unpatched behaviour and will be the
      default.
      
      rmw_level=2: Execute rmw even if the calculated IOs for rmw and rcw are
      equal. We might have higher CPU consumption because of calculating the
      parity twice, but it can be beneficial otherwise, e.g. RAID4 with a fast
      dedicated parity disk/SSD. The option is implemented just to be
      forward-looking and will ONLY work with this patch!
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
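      To make the three levels concrete, here is a small stand-alone C model of the rmw/rcw decision they control. It is a sketch of the behaviour described above, not the raid5 code, and the function and constant names are invented.

      #include <stdio.h>

      enum { RMW_DISABLE = 0, RMW_ENABLE = 1, RMW_PREFER = 2 };

      /* Given the estimated IO counts for a read-modify-write and a
       * reconstruct-write, pick one according to rmw_level. */
      static const char *choose_write_method(int rmw_ios, int rcw_ios, int rmw_level)
      {
              if (rmw_level == RMW_DISABLE)
                      return "rcw";           /* rmw_level=0: never read-modify-write */
              if (rmw_ios < rcw_ios)
                      return "rmw";           /* rmw_level=1: rmw only when it saves IOs */
              if (rmw_ios == rcw_ios && rmw_level == RMW_PREFER)
                      return "rmw";           /* rmw_level=2: prefer rmw on a tie */
              return "rcw";
      }

      int main(void)
      {
              printf("%s\n", choose_write_method(4, 6, RMW_ENABLE));   /* rmw */
              printf("%s\n", choose_write_method(6, 6, RMW_ENABLE));   /* rcw */
              printf("%s\n", choose_write_method(6, 6, RMW_PREFER));   /* rmw */
              printf("%s\n", choose_write_method(4, 6, RMW_DISABLE));  /* rcw */
              return 0;
      }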
    • md/raid5: activate raid6 rmw feature · 584acdd4
      Markus Stockhausen committed
      Glue it all together. The raid6 rmw path should work the same as the
      existing raid5 logic, so emulate the prexor handling/flags and split
      functions as needed.
      
      1) Enable xor_syndrome() in the async layer.
      
      2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
      at the start of an rmw run, as was done before for the single parity.
      
      3) Take care of rmw run in ops_run_reconstruct6(). Again process only
      the changed pages to get syndrome back into sync.
      
      4) Enhance set_syndrome_sources() to fill NULL pages if we are in an rmw
      run. The lower layers will calculate the start and end pages from that
      and call xor_syndrome() accordingly.
      
      5) Adapt the several places where we ignored Q handling up to now.
      
      Performance numbers for a single E5630 system with a mix of 10 7200k
      desktop/server disks. 300 seconds of random write with 8 threads onto a
      3.2TB (10*400GB) RAID6, 64K chunk, without spare (group_thread_cnt=4):
      
      bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
              skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
         4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
         8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
        16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
        32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
        64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
       128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
       256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
       512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 KB/s
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
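      As a rough illustration of the read-modify-write idea behind the prexor and reconstruct steps, here is a toy stand-alone C example for the P parity only (the Q syndrome additionally needs Galois-field multiplication, which xor_syndrome() handles in the kernel and is omitted here). All values are made up.

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              uint8_t d_old = 0x5a, d_new = 0x3c;
              uint8_t other = 0x77;                   /* xor of all untouched data blocks */
              uint8_t p_old = other ^ d_old;          /* parity before the write */

              uint8_t p_prexor = p_old ^ d_old;       /* prexor: xor the old data out of P */
              uint8_t p_new    = p_prexor ^ d_new;    /* reconstruct: xor the new data in */

              /* A full recompute gives the same result but must read every block. */
              printf("rmw P = 0x%02x, full recompute P = 0x%02x\n",
                     p_new, (uint8_t)(other ^ d_new));
              return 0;
      }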
    • raid5: handle expansion/resync case with stripe batching · dabc4ec6
      shli@kernel.org committed
      Expansion/resync can grab a stripe while the stripe is in a batch list. Since
      all stripes in a batch list must be in the same state, we can't allow some of
      them to run into expansion/resync. So we delay expansion/resync for stripes in
      a batch list.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: handle io error of batch list · 72ac7330
      shli@kernel.org committed
      If an IO error happens in any stripe of a batch list, the batch list will be
      split, and then normal processing will run for the stripes in the list.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • RAID5: batch adjacent full stripe write · 59fc630b
      shli@kernel.org committed
      The stripe cache works in 4k units, so even adjacent full stripe writes are
      handled in 4k units. Ideally we should use a bigger size for adjacent full
      stripe writes. A bigger stripe cache size means fewer stripes running in the
      state machine, which reduces CPU overhead, and it also lets bigger IOs be
      dispatched to the underlying disks.
      
      With the patch below, we automatically batch adjacent full stripe writes
      together. Such stripes are added to a batch list. Only the first stripe of
      the list is put on handle_list and so runs handle_stripe(). Some steps of
      handle_stripe() are extended to cover all stripes of the list, including
      ops_run_io, ops_run_biodrain and so on. With this patch, fewer stripes run
      in handle_stripe() and we send the IO of the whole stripe list together to
      increase IO size.
      
      Stripes added to a batch list have some limitations. A batch list can only
      include full stripe writes and can't cross a chunk boundary, to make sure the
      stripes have the same parity disks. Stripes in a batch list must be in the
      same state (no written, toread and so on). If a stripe is in a batch list, all
      new reads/writes in add_stripe_bio will be blocked as overlap conflicts until
      the batch list is handled. These limitations make sure the stripes in a batch
      list stay in exactly the same state over their life cycle.
      
      I ran a random-write test with a 160k block size on a RAID5 array with a 32k
      chunk size and 6 PCIe SSDs. This patch improves performance by around 30%,
      and the IO size to the underlying disks is exactly 32k. I also ran a 4k
      random-write test on the same array to make sure performance isn't changed
      by the patch.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
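      A hedged, user-space sketch of the batching criteria listed above (full stripe write only, same chunk so the parity disks match). The structure and helper names are invented for illustration and are not the kernel's.

      #include <stdbool.h>
      #include <stdio.h>

      struct stripe {
              unsigned long long sector;      /* start sector of the stripe head */
              int overwrites;                 /* data disks fully overwritten */
              int data_disks;
      };

      static bool full_stripe_write(const struct stripe *s)
      {
              return s->overwrites == s->data_disks;
      }

      static bool can_batch(const struct stripe *a, const struct stripe *b,
                            unsigned long long chunk_sectors)
      {
              /* Both must be full stripe writes and must sit in the same chunk,
               * so they share the same parity disks. */
              return full_stripe_write(a) && full_stripe_write(b) &&
                     a->sector / chunk_sectors == b->sector / chunk_sectors;
      }

      int main(void)
      {
              struct stripe a = { 1024, 5, 5 }, b = { 1032, 5, 5 };

              printf("batchable: %d\n", can_batch(&a, &b, 64));  /* 64 sectors = 32k chunk */
              return 0;
      }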
    • raid5: track overwrite disk count · 7a87f434
      shli@kernel.org committed
      Track overwrite disk count, so we can know if a stripe is a full stripe write.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: add a new flag to track if a stripe can be batched · da41ba65
      shli@kernel.org committed
      A freshly created stripe with a write request can be batched. Any time the
      stripe is handled or a new read is queued, the flag will be cleared.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: use flex_array for scribble data · 46d5b785
      shli@kernel.org committed
      Use a flex_array for the scribble data. The next patch will batch several
      stripes together, so the scribble data must be able to cover several stripes;
      this patch therefore allocates scribble data for stripes across a chunk.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid · 753f2856
      Heinz Mauelshagen committed
      The patch makes 3 references to mddev->queue in the raid0 personality
      conditional in order to allow it to be accessed from dm-raid. This is
      mandatory, because md instances underneath dm-raid don't manage a request
      queue of their own, which would lead to oopses without the patch.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: allow resync to go faster when there is competing IO. · ac8fa419
      NeilBrown committed
      When md notices non-sync IO happening while it is trying
      to resync (or reshape or recover), it slows down to the
      set minimum.
      
      The default minimum might have made sense many years ago,
      but drives have become faster. Changing the default to match
      the times isn't really a long-term solution.
      
      This patch changes the code so that instead of waiting until the speed
      has dropped to the target, it just waits until pending requests
      have completed.
      This means that the delay inserted is a function of the speed
      of the devices.
      
      Testing shows that:
       - for some loads, the resync speed is unchanged.  For those loads
         increasing the minimum doesn't change the speed either.
         So this is a good result.  To increase resync speed under such
         loads we would probably need to increase the resync window
         size.
      
       - for other loads, resync speed does increase to a reasonable
         fraction (e.g. 20%) of maximum possible, and throughput of
         the load only drops a little bit (e.g. 10%)
      
       - for other loads, throughput of the non-sync load drops quite a bit
         more.  These seem to be latency-sensitive loads.
      
      So it isn't a perfect solution, but it is mostly an improvement.
      Signed-off-by: NeilBrown <neilb@suse.de>
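      A toy simulation of the behavioural change, in stand-alone C; the constants and names are invented and no kernel API is used. The old strategy keeps sleeping until the measured resync speed decays below the configured minimum, while the new one only waits for the requests that are already in flight, so the delay scales with device speed.

      #include <stdio.h>

      int main(void)
      {
              int speed = 50000, speed_min = 1000;    /* KB/s, made-up numbers */
              int pending = 8;                        /* in-flight non-sync requests */
              int old_ticks = 0, new_ticks = 0;

              /* Old: wait while the rolling speed estimate is above the minimum. */
              for (int s = speed; s > speed_min; s = s * 9 / 10)
                      old_ticks++;

              /* New: just wait for the pending requests to drain. */
              for (int p = pending; p > 0; p--)
                      new_ticks++;

              printf("old wait ~%d ticks, new wait ~%d ticks\n", old_ticks, new_ticks);
              return 0;
      }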
    • md: remove 'go_faster' option from ->sync_request() · 09314799
      NeilBrown committed
      This option is not well justified and testing suggests that
      it hardly ever makes any difference.
      
      The comment suggests there might be a need to wait for non-resync
      activity indicated by ->nr_waiting, however raise_barrier()
      already waits for all of that.
      
      So just remove it to simplify reasoning about speed limiting.
      
      This allows us to remove a 'FIXME' comment from raid5.c as that
      never used the flag.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: don't require sync_min to be a multiple of chunk_size. · 50c37b13
      NeilBrown committed
      There is really no need for sync_min to be a multiple of
      chunk_size, and values read from here often aren't.
      That means you cannot read a value and expect to be able
      to write it back later.
      
      So remove the chunk_size check, and round down to a multiple
      of 4K, to be sure everything works with 4K-sector devices.
      Signed-off-by: NeilBrown <neilb@suse.de>
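      The rounding itself is ordinary power-of-two arithmetic; here is a one-line stand-alone example (the value and units are arbitrary, not md's internal representation):

      #include <stdio.h>

      int main(void)
      {
              unsigned long long value = 1234567;               /* arbitrary input */
              unsigned long long aligned = value & ~4095ULL;    /* round down to a 4K multiple */

              printf("%llu -> %llu\n", value, aligned);
              return 0;
      }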
    • md-cluster: re-add capabilities · 97f6cd39
      Goldwyn Rodrigues committed
      When "re-add" is written to /sys/block/mdXX/md/dev-YYY/state,
      the clustered md:
      
      1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
         clear the Faulty bit in their respective rdev->flags.
      2. The node initiating the re-add gathers the bitmaps of all nodes
         and copies them into the local bitmap. It does not clear the bitmap
         from which it is copying.
      3. The initiating node schedules an md recovery to sync the devices.
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
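      A hedged, user-space sketch of step 1 above; the message and structure names are illustrative and are not the md-cluster wire format. Each node that receives RE_ADD looks up the rdev by descriptor number and clears its Faulty flag (REMOVE, described in another commit below, is shown for contrast).

      #include <stdio.h>

      enum msg_type { MSG_RE_ADD, MSG_REMOVE };

      struct cluster_msg {
              enum msg_type type;
              int raid_slot;                  /* desc_nr of the device concerned */
      };

      struct rdev {
              int desc_nr;
              int faulty;                     /* models the Faulty bit in rdev->flags */
      };

      static void process_msg(struct rdev *rdevs, int n, const struct cluster_msg *msg)
      {
              for (int i = 0; i < n; i++) {
                      if (rdevs[i].desc_nr != msg->raid_slot)
                              continue;
                      if (msg->type == MSG_RE_ADD)
                              rdevs[i].faulty = 0;    /* clear Faulty on RE_ADD */
                      else if (msg->type == MSG_REMOVE)
                              rdevs[i].faulty = 1;    /* kick the device from the array */
              }
      }

      int main(void)
      {
              struct rdev rdevs[] = { { 0, 1 }, { 1, 0 } };
              struct cluster_msg msg = { MSG_RE_ADD, 0 };

              process_msg(rdevs, 2, &msg);
              printf("dev0 faulty=%d\n", rdevs[0].faulty);       /* 0 after RE_ADD */
              return 0;
      }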
    • md: re-add a failed disk · a6da4ef8
      Goldwyn Rodrigues committed
      This adds the capability of re-adding a failed disk by
      writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.
      
      This facilitates adding disks which have encountered a temporary
      error such as a network disconnection/hiccup in an iSCSI device,
      or a SAN cable disconnection which has been restored. In such
      a situation, you do not need to remove and re-add the device.
      Writing "re-add" to the failed device's state adds it back to the array
      and recovers only the blocks that were written after the device failed.
      
      This works for generic md and is not related to clustering. However, this
      patch is intended to ease the re-add operations listed above in clustered
      environments.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
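      A minimal user-space sketch of the how-to above. The device names (md0, dev-sdb) are placeholders for your own array and member device, and the program assumes it has permission to write md's sysfs attributes.

      #include <stdio.h>

      int main(void)
      {
              /* Placeholder path: substitute your array and member device. */
              const char *path = "/sys/block/md0/md/dev-sdb/state";
              FILE *f = fopen(path, "w");

              if (!f) {
                      perror(path);
                      return 1;
              }
              fputs("re-add", f);
              if (fclose(f) != 0) {           /* sysfs reports errors at write/close time */
                      perror(path);
                      return 1;
              }
              return 0;
      }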
    • md-cluster: remove capabilities · 88bcfef7
      Goldwyn Rodrigues committed
      This adds "remove" capabilities for the clustered environment.
      When a user initiates removal of a device from the array, a
      REMOVE message with the disk number in the array is sent to all
      the nodes, which kick the respective device in their own arrays.
      
      This facilitates the removal of failed devices.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Export and rename find_rdev_nr_rcu · 57d051dc
      Goldwyn Rodrigues committed
      This is required by the clustering module (patches to follow) to
      find the device to remove or re-add.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Export and rename kick_rdev_from_array · fb56dfef
      Goldwyn Rodrigues committed
      This export is required by the clustering module in order to
      co-ordinate removing/re-adding an rdev from all nodes.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md-cluster: correct the num for comparison · 8c58f02e
      Guoqing Jiang committed
      Since md-cluster node numbers start from zero and cinfo->slot_number
      represents the DLM slot number, there is no need to check for equality.
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 10 Apr 2015: 1 commit
  3. 08 Apr 2015: 1 commit
    • md: fix md io stats accounting broken · 74672d06
      Gu Zheng committed
      Simon reported the md io stats accounting issue:
      "
      I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
      It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
      the other two being write-mostly:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00   345.00    0.00    0.00    0.00   0.00 100.00
      md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00 58779.00    0.00    0.00    0.00   0.00 100.00
      md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00    12.00    0.00    0.00    0.00   0.00 100.00
      "
      The cause is that commit 18c0b223 switched to generic_start_io_acct() to
      account the disk stats rather than the open-coded version, but it also
      introduced an increment of .in_flight[rw], which md does not need. So we
      re-use the open code here to fix it.
      Reported-by: Simon Kirby <sim@hostway.ca>
      Cc: <stable@vger.kernel.org> 3.19
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  4. 07 Apr 2015: 2 commits
  5. 06 Apr 2015: 2 commits
  6. 03 Apr 2015: 2 commits
    • xen-netfront: transmit fully GSO-sized packets · 0c36820e
      Jonathan Davies committed
      xen-netfront limits transmitted skbs to be at most 44 segments in size. However,
      GSO permits up to 65536 bytes, which means a maximum of 45 segments of 1448
      bytes each. This slight reduction in the size of packets means a slight loss in
      efficiency.
      
      Since c/s 9ecd1a75, xen-netfront sets gso_max_size to
          XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER,
      where XEN_NETIF_MAX_TX_SIZE is 65535 bytes.
      
      The calculation used by tcp_tso_autosize (and also tcp_xmit_size_goal since c/s
      6c09fa09) in determining when to split an skb into two is
          sk->sk_gso_max_size - 1 - MAX_TCP_HEADER.
      
      So the maximum permitted size of an skb is calculated to be
          (XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER.
      
      Intuitively, this looks like the wrong formula -- we don't need two TCP headers.
      Instead, there is no need to deviate from the default gso_max_size of 65536 as
      this already accommodates the size of the header.
      
      Currently, the largest skb transmitted by netfront is 63712 bytes (44 segments
      of 1448 bytes each), as observed via tcpdump. This patch makes netfront send
      skbs of up to 65160 bytes (45 segments of 1448 bytes each).
      
      Similarly, the maximum allowable mtu does not need to subtract MAX_TCP_HEADER as
      it relates to the size of the whole packet, including the header.
      
      Fixes: 9ecd1a75 ("xen-netfront: reduce gso_max_size to account for max TCP header")
      Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
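      The arithmetic above is easy to check with a few lines of stand-alone C. MAX_TCP_HEADER depends on the kernel configuration; 320 below is only an assumed example value, so treat the output as illustrative.

      #include <stdio.h>

      int main(void)
      {
              const int XEN_NETIF_MAX_TX_SIZE = 65535;
              const int MAX_TCP_HEADER = 320;          /* assumption for illustration */
              const int mss = 1448;

              /* Old effective payload limit: the header ends up subtracted twice. */
              int old_limit = (XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER;
              /* New limit: the default gso_max_size already accounts for the header. */
              int new_limit = 65536;

              printf("old: limit %d -> %d segments -> %d bytes\n",
                     old_limit, old_limit / mss, (old_limit / mss) * mss);
              printf("new: limit %d -> %d segments -> %d bytes\n",
                     new_limit, new_limit / mss, (new_limit / mss) * mss);
              return 0;
      }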
    • IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic · 8494057a
      Shachar Raindel committed
      Properly verify that the resulting page aligned end address is larger
      than both the start address and the length of the memory area requested.
      
      Both the start and length arguments for ib_umem_get are controlled by
      the user. A misbehaving user can provide values which will cause an
      integer overflow when calculating the page aligned end address.
      
      This overflow can also cause miscalculation of the number of pages
      mapped, and additional logic issues.
      
      Addresses: CVE-2014-8159
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Shachar Raindel <raindel@mellanox.com>
      Signed-off-by: Jack Morgenstein <jackm@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
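      A self-contained sketch of the kind of check described above, for validating a user-supplied address/length pair; it mirrors the general pattern rather than the ib_umem_get source, and the macro values are illustrative.

      #include <stdint.h>
      #include <stdio.h>

      #define PAGE_SIZE ((uint64_t)4096)
      #define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

      /* Reject (addr, size) pairs whose page-aligned end address wraps around,
       * which would otherwise shrink the computed page count. */
      static int range_ok(uint64_t addr, uint64_t size)
      {
              if (size == 0)
                      return 0;
              if (addr + size < addr)                     /* addr + size overflowed */
                      return 0;
              if (PAGE_ALIGN(addr + size) < addr + size)  /* aligning the end overflowed */
                      return 0;
              return 1;
      }

      int main(void)
      {
              printf("%d\n", range_ok(0x1000, 0x2000));              /* 1: sane range */
              printf("%d\n", range_ok(UINT64_MAX - 0xfff, 0x2000));  /* 0: end wraps */
              printf("%d\n", range_ok(UINT64_MAX - 0x10, 0x10));     /* 0: PAGE_ALIGN wraps */
              return 0;
      }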
  7. 02 Apr 2015: 9 commits
  8. 01 Apr 2015: 2 commits