1. 20 Oct, 2014 1 commit
  2. 09 Aug, 2014 1 commit
  3. 22 May, 2014 3 commits
    • ore: Support for raid 6 · ce5d36aa
      Boaz Harrosh authored
      This simple patch adds support for raid6 to the ORE.
      Most operations and calculations were already written for the general
      case. The only things left:
      * call async_gen_syndrome() in the case of raid6, as sketched below
        (NOTE that the raid6 math is the one supported by the Linux Kernel,
         see: crypto/async_tx/async_pq.c)
      * call _ore_add_parity_unit() twice, with only the last call
        generating the redundancy pages
      
      * Fix a couple of BUGs in the old code:
        a. In reads, when parity==2 it can happen that per_dev->length=0
           but per_dev->offset was set and adjusted by _ore_add_sg_seg().
           Don't let it be overwritten.
        b. The whole 'cur_comp > starting_dev' test, used to determine
           whether "per_dev->offset is in the current stripe number or the
           next one", was a complete raid5/4 accident. When parity==2 it
           usually does not hold at all. All we need to do is increment
           si->ob_offset once we pass the first parity device.
           (This also greatly simplifies the code, amen.)
        c. Calculation of the si->dev rotation can overflow when parity==2.
      
      * Finally, enable raid6 in ore_verify_layout()
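      
      For illustration, a minimal sketch of the raid6 arm mentioned in the
      first bullet above. Only async_gen_syndrome() and init_async_submit()
      are the real crypto/async_tx APIs (in their 3.x-era form); the helper
      and its page-layout assumption (data columns followed by P and Q, as
      in the RAID5-write drawing further below) are illustrative:
      
          #include <linux/mm.h>
          #include <linux/async_tx.h>
      
          /* pages[] is assumed to hold data_devs data pages followed by
           * the P and Q pages of one stripe row. */
          static void sketch_gen_pq(struct page **pages, int data_devs)
          {
                  struct async_submit_ctl submit;
      
                  init_async_submit(&submit, ASYNC_TX_ACK, NULL, NULL, NULL, NULL);
                  /* disks = data_devs + 2 (P and Q); whole pages, offset 0 */
                  async_gen_syndrome(pages, 0, data_devs + 2, PAGE_SIZE, &submit);
          }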
      
      I want to deeply thank Daniel Gryniewicz, who first found all the
      bugs in the old raid code and inspired these patches:
      	Inspired-by: Daniel Gryniewicz <dang@linuxbox.com>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: Remove redundant dev_order(), more cleanups · 455682ce
      Boaz Harrosh authored
      Two cleanups:
      * si->cur_comp and si->cur_pg were always calculated after
        the call to ore_calc_stripe_info(), with the help of
        _dev_order(...). But these are already calculated by
        ore_calc_stripe_info() and can simply be set there.
        (This is left over from the time when si->cur_comp and si->cur_pg
         were only used by the raid code; now the main loop manages
         them anyway, even though they are ultimately unused in the
         non-raid code.)
      
      * si->cur_comp - for the very last stripe case, this was set inside
        _ore_add_parity_unit(). That is not clear, and would be wrong
        for the coming raid6, so move it to the only caller. Now si->cur_comp
        is manipulated only within _prepare_for_striping(), always next
        to the manipulation of cur_dev, which is much easier to
        understand and follow.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: (trivial) reformat some code · 101a6427
      Boaz Harrosh authored
      Rearrange some source lines. Nothing changed.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
  4. 03 Apr, 2014 1 commit
  5. 24 Mar, 2013 1 commit
    • block: Add bio_for_each_segment_all() · d74c6d51
      Kent Overstreet authored
      __bio_for_each_segment() iterates bvecs from the specified index
      instead of bio->bv_idx.  Currently, the only usage is to walk all the
      bvecs after the bio has been advanced, by specifying a 0 index.
      
      For immutable bvecs, we need to split these apart;
      bio_for_each_segment() is going to have a different implementation.
      This will also help document the intent of code that's using it -
      bio_for_each_segment_all() is only legal to use for code that owns the
      bio.
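      
      A minimal usage sketch; bio_for_each_segment_all() is quoted in its
      three-argument 3.x-era form, while the helper around it is
      hypothetical. It is legal only because the caller owns the bio, so
      every bvec may be walked from index 0:
      
          #include <linux/bio.h>
          #include <linux/highmem.h>
      
          static void zero_whole_bio(struct bio *bio)
          {
                  struct bio_vec *bvec;
                  int i;
      
                  /* walks all bvecs, ignoring how far the bio has advanced */
                  bio_for_each_segment_all(bvec, bio, i)
                          zero_user(bvec->bv_page, bvec->bv_offset, bvec->bv_len);
          }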
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Neil Brown <neilb@suse.de>
      CC: Boaz Harrosh <bharrosh@panasas.com>
  6. 04 Oct, 2012 1 commit
  7. 20 Jul, 2012 2 commits
    • ore: Unlock r4w pages in exact reverse order of locking · 537632e0
      Boaz Harrosh authored
      The read-4-write pages are locked in ascending address order,
      but were unlocked in whatever way was easiest to code. Fix that:
      locks should be released in the opposite order of locking, i.e.
      descending address order.
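      
      A sketch of the ordering rule with hypothetical names; this is an
      illustration, not the actual ORE code:
      
          #include <linux/pagemap.h>
      
          static void r4w_unlock_all(struct page **pages, int nr_pages)
          {
                  int i;
      
                  /* pages[] was locked in ascending address order, so
                   * unlock last-locked first: walk the array backwards */
                  for (i = nr_pages - 1; i >= 0; i--)
                          unlock_page(pages[i]);
          }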
      
      I have not hit this deadlock; it was found by inspecting the
      debug print-outs. I suspect there is a higher-level lock at the
      caller that protects us, but fix it regardless.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: Fix NFS crash by supporting any unaligned RAID IO · 9ff19309
      Boaz Harrosh authored
      In RAID_5/6 we used to not permit an IO whose end
      byte is not stripe_size aligned and which spans more than one stripe.
      I.e. the caller must check whether, after submission, the actual
      transferred byte count is shorter, and would need to resubmit
      a new IO with the remainder.
      
      Exofs supports this, and NFS was supposed to support it
      as well with its short-write mechanism. But late testing has
      exposed a CRASH when this is used with non-RPC layout-drivers.
      
      The change at NFS is deep and risky; in its place the fix
      at ORE to lift the limitation is actually clean and simple.
      So here it is below.
      
      The principle here is that in the case of an IO that is unaligned on
      both ends, beginning and end, we will send two read requests:
      one like the old code, before the calculation of the first stripe,
      and also a new one, before the calculation of the last stripe.
      If either "boundary" is aligned, or the complete IO is within a single
      stripe, we do a single read like before.
      
      The code is kept clean and simple by splitting the old _read_4_write
      into 3 even parts:
      1. _read_4_write_first_stripe
      2. _read_4_write_last_stripe
      3. _read_4_write_execute
      
      1+3 are called at the same place as before, 2+3 before the last
      stripe, and in the case of everything fitting in a single stripe,
      1+2+3 are performed additively.
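      
      A hedged sketch of how the parts combine; only the three helper names
      come from this patch, while the wrapper and its flags are
      illustrative:
      
          #include <linux/types.h>
          #include <scsi/osd_ore.h>
      
          int _read_4_write_first_stripe(struct ore_io_state *ios);
          int _read_4_write_last_stripe(struct ore_io_state *ios);
          int _read_4_write_execute(struct ore_io_state *ios);
      
          static int sketch_read_4_write(struct ore_io_state *ios,
                                         bool head_unaligned, bool tail_unaligned)
          {
                  int ret = 0;
      
                  if (head_unaligned)             /* IO starts mid-stripe */
                          ret = _read_4_write_first_stripe(ios);
                  if (!ret && tail_unaligned)     /* IO ends mid-stripe */
                          ret = _read_4_write_last_stripe(ios);
                  if (!ret)                       /* one combined read submission */
                          ret = _read_4_write_execute(ios);
                  return ret;
          }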
      
      Why did I not think of it before? Well, call it a stroke of
      genius: I have stared at this code for 2 years and did
      not find this simple solution until today. Not that I did not try.
      
      This solution is much better for NFS than the previously proposed
      one, because there the short write was dealt with out-of-band after
      IO_done, which would cause a seeky IO pattern, whereas here
      we execute in order. In both solutions we do 2 separate reads, only
      here we do it within a single IO request. (And actually combine two
      writes into a single submission.)
      
      NFS/exofs code need not change, since the ORE API communicates the new
      shorter length on return; what will happen is that this case will not
      occur anymore.
      
      hurray!!
      
      [Stable: this is an NFS bug since the 3.2 Kernel; should apply cleanly]
      CC: Stable Tree <stable@kernel.org>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
  8. 08 Jan, 2012 1 commit
    • ore: Must support none-PAGE-aligned IO · 724577ca
      Boaz Harrosh authored
      NFS might send us offsets that are not PAGE aligned. So
      we must read in the remainder of the first/last pages, in case
      we need it for the parity calculations.
      
      We only add sg segments to read the partial page, but
      we don't mark it as read=true, because it is a lock-for-write
      page.
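      
      Illustrative arithmetic only (not the ore.c code): the head and tail
      remainders that must be read in for an unaligned byte range:
      
          #include <linux/types.h>
          #include <linux/mm.h>
      
          static void partial_page_remainders(u64 offset, u64 length,
                                              unsigned int *head, unsigned int *tail)
          {
                  /* bytes that precede the IO in its first page */
                  *head = offset & ~PAGE_MASK;
                  /* bytes the IO occupies in its last page ... */
                  *tail = (offset + length) & ~PAGE_MASK;
                  /* ... so this many trailing bytes must be read in */
                  if (*tail)
                          *tail = PAGE_SIZE - *tail;
          }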
      
      TODO: In some cases (IO spans a single unit) we can just
      adjust the raid_unit offset/length, but this is left for
      later Kernels.
      
      [Bug in 3.2.0 Kernel]
      CC: Stable Tree <stable@kernel.org>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
  9. 06 Jan, 2012 1 commit
    • ore: fix BUG_ON, too few sgs when reading · 361aba56
      Boaz Harrosh authored
      When reading RAID5 files, in rare cases, we calculated too
      few sg segments. There should be two extra, for the beginning
      and end partial units.
      
      Also, "too few sg segments" should not be a BUG_ON; all the
      mechanics are in place to handle it, as a short read.
      So just return -ENOMEM and the rest of the code will gracefully
      split the IO.
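      
      A minimal sketch of the fix; the field names are recalled from the
      era's ORE structures and should be treated as assumptions:
      
          #include <scsi/osd_ore.h>
      
          static int add_sg_checked(struct ore_io_state *ios,
                                    struct ore_per_dev_state *per_dev)
          {
                  /* assumed fields: no room left in the preallocated sg table */
                  if (per_dev->cur_sg >= ios->sgs_per_dev)
                          return -ENOMEM; /* was a BUG_ON(); caller splits the IO */
      
                  /* ... append the sg segment as before ... */
                  return 0;
          }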
      
      [Bug in 3.2.0 Kernel]
      CC: Stable Tree <stable@kernel.org>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
  10. 25 Oct, 2011 2 commits
    • ore: RAID5 Write · 769ba8d9
      Boaz Harrosh authored
      This is finally the RAID5 write support.
      
      The bigger part of this patch is not the XOR engine itself, but the
      read4write logic, which is a complete mini prepare_for_striping
      reading engine that can read scattered pages of a stripe into cache
      so they can be used for the XOR calculation, in case the write was
      not stripe aligned.
      
      The main algorithm behind the XOR engine is built around the
      2-dimensional array:
      	struct __stripe_pages_2d.
      A drawing might save 1000 words:
      ---
      
      __stripe_pages_2d
             |
       n = pages_in_stripe_unit;
       w = group_width - parity;
             |                            pages array presented to the XOR lib
             |                                                |
             V                                                |
       __1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---|
             |                                                |
       __1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <---
             |
      ...    |                         ...
             |
       __1_page_stripe[n].pages --> [c0][c1]..[cw][c_par]
                                     ^
                                     |
                 data added columns first then row
      
      ---
      The pages are put into this array columns-first, i.e.:
      	p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ...
      So we are doing a corner turn of the pages.
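      
      For reference, the approximate shape of that structure (field names
      approximate; see fs/exofs/ore_raid.c for the real definition):
      
          #include <linux/mm_types.h>
      
          struct __1_page_stripe_sketch {
                  struct page **pages;    /* [group_width] incl. parity column(s) */
                  char *page_is_read;     /* which slots came from read-4-write */
          };
      
          struct __stripe_pages_2d_sketch {
                  unsigned int pages_in_unit;     /* n rows */
                  unsigned int data_devs;         /* w data columns */
                  unsigned int parity;            /* trailing parity columns */
                  struct __1_page_stripe_sketch _1p_stripes[];    /* n of them */
          };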
      
      Note that pages will zigzag down and left, but are put sequentially
      in growing order. So when the time comes to XOR the stripe, only the
      beginning and end of the array need be checked. We scan the array,
      and any NULL spot will be filled by pages-to-be-read.
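      
      A sketch of the per-row XOR step once a row is fully populated;
      async_xor() and its flags are the real crypto/async_tx API, while the
      helper and its column-layout assumption are illustrative:
      
          #include <linux/mm.h>
          #include <linux/async_tx.h>
      
          /* pages[0..data_devs-1] are the data columns of one row,
           * pages[data_devs] is that row's parity page */
          static void xor_one_row(struct page **pages, int data_devs)
          {
                  struct async_submit_ctl submit;
      
                  init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST | ASYNC_TX_ACK,
                                    NULL, NULL, NULL, NULL);
                  async_xor(pages[data_devs], pages, 0, data_devs, PAGE_SIZE,
                            &submit);
          }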
      
      An FS that wants to support RAID5 needs to supply an
      operations-vector that searches for a given page in cache and
      specifies whether the page is uptodate or needs reading. All these
      pages-to-be-read are put on a slave ore_io_state and read
      synchronously. All the pages of a stripe are read in one IO, using
      the scatter-gather mechanism.
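      
      The approximate shape of that operations-vector (hedged; the real
      r4w hooks live in the ORE headers):
      
          #include <linux/types.h>
          #include <linux/mm_types.h>
      
          struct r4w_ops_sketch {
                  /* look the page up in the FS cache; report uptodate state */
                  struct page *(*get_page)(void *priv, u64 page_index,
                                           bool *uptodate);
                  /* return a page obtained via get_page */
                  void (*put_page)(void *priv, struct page *page);
          };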
      
      In write we constrain our IO to be incomplete only on a single
      stripe. Meaning: either the complete IO is within a single stripe, so
      we might have pages to read at both the beginning and the end of the
      strip; or we have some reading to do at the beginning, but end at a
      strip boundary. The leftover pages are pushed to the next IO by the
      API already established by previous work, where an IO offset/length
      combination presented to the ORE might get the length truncated and
      the user must re-submit the leftover pages. (Both exofs and NFS
      support this.)
      
      But any ORE user should make its best effort to align its IO
      beforehand and avoid complications. A cached ore_layout->stripe_size
      member can be used for that calculation. (NOTE: the ORE demands
      that stripe_size may not be bigger than 32 bits.)
      
      What else? Well read it and tell me.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: RAID5 read · a1fec1db
      Boaz Harrosh authored
      This patch introduces the first stage of RAID5 support,
      mainly the skip-over-raid-units logic when reading. For
      writes it inserts BLANK units where the XOR blocks
      should be calculated and written to.
      
      It introduces the new "general raid maths", and the main
      additional parameters and components needed for raid5.
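      
      A simplified illustration of the kind of math involved (NOT the exact
      ore.c formulas): with one parity unit per stripe, the parity device
      rotates every stripe and a data unit's device index skips over it:
      
          /* unit_in_stripe is 0..group_width-2 (data units only) */
          static unsigned int dev_of_data_unit(unsigned int unit_in_stripe,
                                               unsigned int stripe_no,
                                               unsigned int group_width)
          {
                  unsigned int par_dev = (group_width - 1) - stripe_no % group_width;
                  unsigned int dev = unit_in_stripe;
      
                  if (dev >= par_dev)
                          dev++;  /* skip the rotating parity device */
                  return dev;
          }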
      
      Since at this stage it could corrupt future versions that
      actually do support raid5, the enablement of raid5
      mounting and the setting of parity-count > 0 are disabled, so
      the raid5 code will never be used. Mounting of raid5 is
      only enabled later, once the basic XOR write is also in.
      But if the patch "enable RAID5" is applied, this code has
      been tested to properly read raid5 volumes
      and is according to the standard.
      
      It has also been tested that the new maths still properly
      support RAID0 and the grouping code, just as before.
      (BTW: I have found more bugs in the pnfs-obj RAID math,
       fixed here.)
      
      The ore.c file is getting too big, so new ore_raid.[hc]
      files are added that will contain the special raid stuff
      that is not used in striping and mirrors. With future write
      support these will get bigger.
      When adding ore_raid.c to the Kbuild file I was forced to
      rename ore.ko to libore.ko. Is it possible to keep the source
      file, say ore.c, and the module file, ore.ko, named the same even
      if there are multiple files inside ore.ko?
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>