1. 01 Nov 2011, 2 commits
    • Cross Memory Attach · fcf63409
      Committed by Christopher Yeoh
      The basic idea behind cross memory attach is to allow MPI programs doing
      intra-node communication to do a single copy of the message rather than a
      double copy via shared memory.
      
      The following patch attempts to achieve this by allowing a destination
      process, given an address and size from a source process, to copy memory
      directly from the source process into its own address space via a system
      call.  There is also a symmetrical ability to copy from the current
      process's address space into a destination process's address space.
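
      For illustration, here is a minimal userspace sketch of the read side,
      assuming a process_vm_readv() wrapper with the signature from the
      proposed man page is available (glibc later exposes one in <sys/uio.h>
      when _GNU_SOURCE is defined; otherwise syscall(2) with the arch's
      syscall number would be used).  The pid and remote address are assumed
      to have been exchanged out of band, e.g. over MPI's own startup channel:

        #define _GNU_SOURCE
        #include <sys/types.h>
        #include <sys/uio.h>    /* struct iovec, process_vm_readv() */

        /* Copy 'len' bytes that live at 'remote_addr' inside process 'pid'
         * directly into 'buf' in this process: a single copy, with no shared
         * memory bounce buffer.  Both sides may be described by several
         * iovecs; one is used here for brevity. */
        static ssize_t read_from_peer(pid_t pid, void *remote_addr,
                                      void *buf, size_t len)
        {
                struct iovec local  = { .iov_base = buf,         .iov_len = len };
                struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

                /* flags must currently be 0 */
                return process_vm_readv(pid, &local, 1, &remote, 1, 0);
        }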
      
      - Use of /proc/pid/mem has been considered, but there are issues with
        using it:
        - It does not allow specifying iovecs for both src and dest; even
          assuming preadv or pwritev were implemented, either the area read
          from or the area written to would need to be contiguous.
        - Currently mem_read only allows processes that are currently
          ptrace'ing the target, and are still able to ptrace the target, to
          read from it. This check could possibly be moved to the open call,
          but it is not clear exactly what race this restriction is stopping
          (the reason appears to have been lost).
        - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a unix
          domain socket is a bit ugly from a userspace point of view,
          especially when you may have hundreds if not (eventually) thousands
          of processes that all need to do this with each other.
        - It does not allow for some future uses of the interface that we
          would like to add (see below).
        - Interestingly, reading from /proc/pid/mem currently actually
          involves two copies! (But this could be fixed pretty easily.)
      
      As mentioned previously, use of vmsplice was also considered, but it has
      problems.  Since the reader and writer need to work co-operatively, if
      the pipe is not drained then you block, which requires some wrapping to
      do non-blocking sends or polling on the receive side.  In all-to-all
      communication it requires ordering, otherwise you can deadlock.  And in
      the example of many MPI tasks writing to one MPI task, vmsplice
      serialises the copying.
      
      There are some cases of MPI collectives where even a single-copy
      interface does not get us all the performance gain we could have.  For
      example, in an MPI_Reduce, rather than copy the data from the source we
      would like to use it directly in a math op (say the reduce is doing a
      sum), as this would save us doing a copy; we don't need to keep a copy
      of the data from the source.  I haven't implemented this, but I think
      this interface could do all of that in the future through the use of
      the flags: e.g. they could specify the math operation and type, and
      the kernel, rather than just copying the data, would apply the
      specified operation between the source and destination and store the
      result in the destination.
      
      Although we don't have a "second user" of the interface yet (though
      I've had some nibbles from people who may be interested in using it for
      intra-process messaging which is not MPI), this interface is something
      hardware vendors are already implementing in their custom drivers for
      fast local communication.  So in addition to being useful for OpenMPI,
      it would mean the driver maintainers don't have to fix things up when
      the mm changes.
      
      There was some discussion about how much faster a true zero copy would
      go. Here's a link back to the email with some testing I did on that:
      
      http://marc.info/?l=linux-mm&m=130105930902915&w=2
      
      There is a basic man page for the proposed interface here:
      
      http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
      
      This has been implemented for x86 and powerpc; other architectures
      should mainly (I think) just need to add syscall numbers for
      process_vm_readv and process_vm_writev.  There are 32-bit compatibility
      versions for 64-bit kernels.
      
      For arch maintainers, there are some simple tests here to quickly
      verify that the syscalls are working correctly:
      
      http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

      Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <linux-man@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • /proc/self/numa_maps: restore "huge" tag for hugetlb vmas · fc360bd9
      Committed by Andrew Morton
      The display of the "huge" tag was accidentally removed in 29ea2f69 ("mm:
      use walk_page_range() instead of custom page table walking code").
      Reported-by: Stephen Hemminger <shemminger@vyatta.com>
      Tested-by: Stephen Hemminger <shemminger@vyatta.com>
      Reviewed-by: Stephen Wilson <wilsons@start.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: <stable@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 28 Oct 2011, 20 commits
  3. 27 Oct 2011, 1 commit
  4. 26 Oct 2011, 11 commits
  5. 25 Oct 2011, 6 commits
    • sysfs: Remove support for tagged directories with untagged members (again) · b9e2780d
      Committed by Eric W. Biederman
      In commit 8a9ea323 ("Merge git://.../davem/net-next"), where my sysfs
      changes from the net tree were merged with the sysfs rbtree changes
      from Mikulas Patocka, the conflict resolution failed to preserve the
      simplified property that was the point of my changes.
      
      That is, sysfs_find_dirent can now say something is a match if and only
      if s_name and s_ns match what we are looking for, and sysfs_readdir can
      simply return all of the directory entries where s_ns matches the
      directory that we should be returning.
      
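      As a rough sketch of the exact-match rule being restored (using a
      stand-in for the kernel's sysfs_dirent with just the two fields the
      text mentions; the real rbtree walk and locking are omitted):

        #include <string.h>

        /* Minimal stand-in for the kernel's sysfs_dirent: only the two
         * fields the restored matching rule looks at. */
        struct sysfs_dirent {
                const void *s_ns;       /* namespace tag of the entry */
                const char *s_name;     /* entry name */
        };

        /* An entry matches if and only if BOTH the namespace tag and the
         * name are equal; there is no fallback to untagged entries in a
         * tagged directory. */
        static int sysfs_dirent_matches(const struct sysfs_dirent *sd,
                                        const void *ns, const char *name)
        {
                return sd->s_ns == ns && strcmp(sd->s_name, name) == 0;
        }
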
      Now that we are back to exact matches we can tweak sysfs_find_dirent
      and the name rb_tree to order sysfs_dirents by s_ns then s_name and
      remove the second loop in sysfs_find_dirent.  However, that change
      seems a bit much for a conflict resolution, so it can come later.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ore: Enable RAID5 mounts · 44231e68
      Committed by Boaz Harrosh
      Now that we support RAID5, enable it at mount.  RAID6 will come next;
      RAID4 is not in demand, so it will probably not be enabled (until
      someone wants it).
      
      NOTE: mkfs.exofs has had support for raid5/6 for a long time.
      (Making an empty raidX FS is just as easy as raid0 ;-} )
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • exofs: Support for RAID5 read-4-write interface. · dd296619
      Committed by Boaz Harrosh
      The ORE needs to be supplied an r4w_get_page/r4w_put_page API by the
      filesystem so it can get cache pages to read into when writing partial
      stripes.
      
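      The shape of such an operations vector might look roughly like the
      sketch below; the struct and member names are illustrative, not the
      actual ORE definition.  The idea is that the filesystem hands the ORE
      a way to look up a cached page for a given index, report whether it is
      already uptodate, and release it once the read-for-write is done:

        #include <stdbool.h>
        #include <stdint.h>

        struct page;    /* kernel page descriptor, opaque in this sketch */

        /* Illustrative read-4-write callbacks a filesystem could supply.
         * get_page: return the cache page at @page_index and set *uptodate
         *           so the caller knows whether it still must read the page.
         * put_page: release a page obtained from get_page when the partial
         *           stripe has been written out. */
        struct r4w_page_ops {
                struct page *(*get_page)(void *priv, uint64_t page_index,
                                         bool *uptodate);
                void (*put_page)(void *priv, struct page *page);
        };
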
      Also, I commented out and NULLed the .writepage (singular) vector,
      because it gives a terrible write pattern to raid and is apparently not
      needed.  Even in OOM conditions the system copes (even better) without
      it.
      
      TODO: How to specify to write_cache_pages() to start
            or include a certain page?
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: RAID5 Write · 769ba8d9
      Committed by Boaz Harrosh
      This is finally the RAID5 Write support.
      
      The bigger part of this patch is not the XOR engine itself, but the
      read4write logic, which is a complete mini prepare_for_striping reading
      engine that can read scattered pages of a stripe into cache so they can
      be used for the XOR calculation when the write is not stripe aligned.
      
      The main algorithm behind the XOR engine is the 2 dimensional array:
      	struct __stripe_pages_2d.
      A drawing might save 1000 words
      ---
      
      __stripe_pages_2d
             |
       n = pages_in_stripe_unit;
       w = group_width - parity;
             |                            pages array presented to the XOR lib
             |                                                |
             V                                                |
       __1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---|
             |                                                |
       __1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <---
             |
      ...    |                         ...
             |
       __1_page_stripe[n].pages --> [c0][c1]..[cw][c_par]
                                     ^
                                     |
                 data added columns first then row
      
      ---
      The pages are put on this array columns first, i.e.:
      	p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ...
      So we are doing a corner turn of the pages.
      
      Note that pages will zigzag down and left, but are put sequentially in
      growing order.  So when the time comes to XOR the stripe, only the
      beginning and end of the array need be checked.  We scan the array, and
      any NULL spot will be filled by pages-to-be-read.
      
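      As a small illustration of the column-first ("corner turn") fill
      described above, with names mirroring the drawing rather than the
      actual kernel structures, and assuming rows[r] points at the pages
      array of __1_page_stripe[r]:

        struct page;    /* opaque in this sketch */

        /* Pages arrive in file order (p0-of-c0, p1-of-c0, ..., pn-of-c0,
         * p0-of-c1, ...) and are placed column by column into the per-row
         * page arrays.  Slots left NULL are later filled by pages-to-be-read. */
        static void fill_columns_first(struct page **rows[],
                                       struct page *pages[], int npages,
                                       int pages_in_unit /* n in the drawing */)
        {
                int i;

                for (i = 0; i < npages; i++) {
                        int col = i / pages_in_unit;    /* which unit (column)    */
                        int row = i % pages_in_unit;    /* offset inside the unit */

                        rows[row][col] = pages[i];
                }
        }
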
      The FS that wants to support RAID5 needs to supply an operations vector
      that searches for a given page in cache and specifies whether the page
      is uptodate or needs reading.  All these pages-to-be-read are put on a
      slave ore_io_state and synchronously read.  All the pages of a stripe
      are read in one IO, using the scatter-gather mechanism.
      
      In write we constrain our IO to only be incomplete on a single stripe,
      meaning either the complete IO is within a single stripe, so we might
      have pages to read at both the beginning and the end of the stripe, or
      we have some reading to do at the beginning but the IO ends at a stripe
      boundary.  The left-over pages are pushed to the next IO by the API
      already established by previous work, where an IO offset/length
      combination presented to the ORE might get the length truncated and the
      user must re-submit the leftover pages.  (Both exofs and NFS support
      this.)
      
      But any ORE user should make its best effort to align its IO beforehand
      and avoid complications.  A cached ore_layout->stripe_size member can
      be used for that calculation.  (NOTE: the ORE demands that stripe_size
      not be bigger than 32 bits.)
      
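      As a minimal sketch of that pre-alignment calculation (the helper name
      is illustrative; only the use of the cached stripe_size is taken from
      the text above):

        #include <stdint.h>

        /* Trim an IO so it ends on a stripe boundary, leaving the remainder
         * for the next submission, which the ORE API already allows. */
        static uint64_t trim_to_stripe_end(uint64_t offset, uint64_t length,
                                           uint32_t stripe_size)
        {
                uint64_t end = offset + length;
                uint64_t aligned_end = end - (end % stripe_size);

                /* The IO does not reach the next stripe boundary: submit it
                 * as-is and let the ORE deal with the partial stripe. */
                if (aligned_end <= offset)
                        return length;

                return aligned_end - offset;
        }
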
      What else? Well, read it and tell me.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: RAID5 read · a1fec1db
      Committed by Boaz Harrosh
      This patch introduces the first stage of RAID5 support, mainly the
      skip-over-raid-units logic when reading.  For writes it inserts BLANK
      units where XOR blocks should be calculated and written to.
      
      It introduces the new "general raid maths", and the main
      additional parameters and components needed for raid5.
      
      Since at this stage it could corrupt future versions that actually do
      support raid5, the enablement of raid5 mounting and the setting of
      parity-count > 0 are disabled, so the raid5 code will never be used.
      Mounting of raid5 is only enabled later, once the basic XOR write is
      also in.  But if the "enable RAID5" patch is applied, this code has
      been tested to properly read raid5 volumes according to the standard.
      
      It has also been tested that the new maths still properly support RAID0
      and the grouping code just as before.  (BTW: I have found more bugs in
      the pnfs-obj RAID math, fixed here.)
      
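      As a purely illustrative example of the kind of "raid maths" involved
      (a textbook rotating-parity placement, not necessarily the exact layout
      the ORE implements), the read path has to skip the parity unit in every
      stripe:

        #include <stdint.h>

        /* In a rotating-parity layout, for stripe number 'stripe_no' across
         * 'width' devices, one device holds the parity unit per stripe and
         * the rest hold data units. */
        static int parity_device(uint64_t stripe_no, int width)
        {
                return (int)(stripe_no % width);
        }

        /* Map the k-th data unit of a stripe to a physical device,
         * skipping over the parity device. */
        static int data_device(uint64_t stripe_no, int k, int width)
        {
                int pdev = parity_device(stripe_no, width);

                return (k < pdev) ? k : k + 1;
        }
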
      The ore.c file is getting too big, so new ore_raid.[hc] files are added
      that will hold the special raid code not used in striping and mirrors.
      With future write support these will get bigger.  When adding
      ore_raid.c to the Kbuild file I was forced to rename ore.ko to
      libore.ko.  Is it possible to keep the source file, say ore.c, and the
      module file, ore.ko, named the same even if there are multiple files
      inside ore.ko?
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • fs/Makefile: Always inspect exofs/ · 3e335672
      Committed by Boaz Harrosh
      The fs/exofs directory has multiple targets now, of which ore.ko will
      be needed by the pnfs-objects-layout driver (fs/nfs/objlayout).
      
      As suggested by Michal Marek <mmarek@suse.cz>, convert the inclusion of
      exofs/ from obj-$(CONFIG_EXOFS_FS) => obj-$(y), so the ORE can also be
      selected from fs/nfs/Kconfig.
      
      CC: Michal Marek <mmarek@suse.cz>
      CC: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>