Commits · c9f01245b6a7d77d17deaa71af10f6aca14fa24e · openeuler / raspberrypi-kernel

01 Nov, 2011 3 commits

oom: remove oom_disable_count · c9f01245

Committed by David Rientjes on Oct 31, 2011

This removes mm->oom_disable_count entirely since it's unnecessary and
currently buggy.  The counter was intended to be per-process but it's
currently decremented in the exit path for each thread that exits, causing
it to underflow.

The count was originally intended to prevent oom killing threads that
share memory with threads that cannot be killed since it doesn't lead to
future memory freeing.  The counter could be fixed to represent all
threads sharing the same mm, but it's better to remove the count since:

 - it is possible that the OOM_DISABLE thread sharing memory with the
   victim is waiting on that thread to exit and will actually cause
   future memory freeing, and

 - there is no guarantee that a thread is disabled from oom killing just
   because another thread sharing its mm is oom disabled.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Cross Memory Attach · fcf63409

Committed by Christopher Yeoh on Oct 31, 2011

The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.

The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call.  There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.
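
To make the call shape concrete, here is a minimal sketch (not part of the patch) that invokes process_vm_readv from Python through ctypes. It assumes Linux and a libc that exports a process_vm_readv wrapper; the helper name read_process_memory and the IOVec class are illustrative only. The argument order (pid, local iovec array and count, remote iovec array and count, flags) follows the process_vm_readv(2) man page.

```python
import ctypes
import os


class IOVec(ctypes.Structure):
    # Mirrors struct iovec from <sys/uio.h>.
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]


def read_process_memory(pid, address, length):
    """Copy `length` bytes at `address` in process `pid` into a local buffer."""
    libc = ctypes.CDLL(None, use_errno=True)
    fn = libc.process_vm_readv  # AttributeError if the libc has no wrapper
    fn.restype = ctypes.c_ssize_t
    fn.argtypes = [
        ctypes.c_int,                         # pid
        ctypes.POINTER(IOVec), ctypes.c_ulong,  # local iovecs, count
        ctypes.POINTER(IOVec), ctypes.c_ulong,  # remote iovecs, count
        ctypes.c_ulong,                       # flags (must be 0)
    ]
    dest = ctypes.create_string_buffer(length)
    local = IOVec(ctypes.addressof(dest), length)
    remote = IOVec(address, length)
    copied = fn(pid, ctypes.byref(local), 1, ctypes.byref(remote), 1, 0)
    if copied < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return dest.raw[:copied]
```

A quick self-test is to read from the calling process itself: create a buffer with ctypes.create_string_buffer(b"single copy"), pass os.getpid() and ctypes.addressof(buffer), and compare the result. Reading another process additionally requires ptrace-level permission over that process.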

- Use of /proc/pid/mem has been considered, but there are issues with
  using it:
  - Does not allow for specifying iovecs for both src and dest. Assuming
    preadv or pwritev were implemented, either the area read from or
    the area written to would need to be contiguous.
  - Currently mem_read allows only processes that are currently
    ptrace'ing the target and are still able to ptrace the target to read
    from the target. This check could possibly be moved to the open call,
    but it's not clear exactly what race this restriction is stopping
    (the reason appears to have been lost).
  - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a unix
    domain socket is a bit ugly from a userspace point of view,
    especially when you may have hundreds if not (eventually) thousands
    of processes that all need to do this with each other.
  - Doesn't allow for some future uses of the interface that we would
    like to consider adding (see below).
  - Interestingly, reading from /proc/pid/mem currently actually
    involves two copies! (But this could be fixed pretty easily.)

As mentioned previously, use of vmsplice instead was considered, but it has
problems.  Since the reader and writer need to work co-operatively, you
block if the pipe is not drained.  This requires some wrapping to do
non-blocking on the send side or polling on the receive side.  In all-to-all
communication it requires ordering, otherwise you can deadlock.  And in the
example of many MPI tasks writing to one MPI task, vmsplice serialises the
copying.

There are some cases of MPI collectives where even a single copy interface
does not get us all the performance gain we could have.  For example, in an
MPI_Reduce, rather than copying the data from the source we would like to
use it directly in a math operation (say the reduce is doing a sum), as
this would save us a copy.  We don't need to keep a copy of the data
from the source.  I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags, e.g. they
could specify the math operation and type, and the kernel, rather than just
copying the data, would apply the specified operation between the source
and destination and store the result in the destination.

We don't yet have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI).  However, this interface is
something which hardware vendors are already doing for their custom
drivers to implement fast local communication.  So in addition to this
being useful for OpenMPI, it would mean the driver maintainers don't have
to fix things up when the mm changes.

There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:

http://marc.info/?l=linux-mm&m=130105930902915&w=2

There is a basic man page for the proposed interface here:

http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

This has been implemented for x86 and powerpc; other architectures should
mainly (I think) just need to add syscall numbers for process_vm_readv
and process_vm_writev.  There are 32-bit compatibility versions for
64-bit kernels.

For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:

http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

/proc/self/numa_maps: restore "huge" tag for hugetlb vmas · fc360bd9

Committed by Andrew Morton on Oct 31, 2011

The display of the "huge" tag was accidentally removed in 29ea2f69 ("mm:
use walk_page_range() instead of custom page table walking code").

Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Stephen Wilson <wilsons@start.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

28 Oct, 2011 20 commits

leases: fix write-open/read-lease race · f3c7691e

Committed by J. Bruce Fields on Sep 21, 2011

In setlease, we use i_writecount to decide whether we can give out a
read lease.

In open, we break leases before incrementing i_writecount.

There is therefore a window between the break lease and the i_writecount
increment when setlease could add a new read lease.

This would leave us with a simultaneous write open and read lease, which
shouldn't happen.
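
The window described above can be replayed deterministically. The following is a toy model, not kernel code: the Inode class, setlease_read, and racy_write_open are hypothetical stand-ins that only mirror the ordering the message describes (break leases first, bump i_writecount second). It shows the bug, not the fix.

```python
class Inode:
    def __init__(self):
        self.i_writecount = 0   # number of write opens
        self.read_leases = 0    # outstanding read leases


def setlease_read(inode):
    """Grant a read lease only if nobody has the file open for write."""
    if inode.i_writecount > 0:
        return False
    inode.read_leases += 1
    return True


def break_leases(inode):
    inode.read_leases = 0


def racy_write_open(inode, between=None):
    break_leases(inode)          # step 1: break existing leases
    if between is not None:
        between(inode)           # whatever runs inside the window
    inode.i_writecount += 1      # step 2: only now is the writer visible


inode = Inode()
racy_write_open(inode, between=setlease_read)
print(inode.i_writecount, inode.read_leases)  # 1 1
```

The final state has one write open and one read lease at the same time, which is exactly the combination the message says must not occur.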

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

nfs: drop unnecessary locking in llseek · 79835a71

Committed by Andi Kleen on Sep 15, 2011

This makes NFS follow the standard generic_file_llseek locking scheme.

Cc: Trond.Myklebust@netapp.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

ext4: replace cut'n'pasted llseek code with generic_file_llseek_size · 4cce0e28

Committed by Andi Kleen on Sep 15, 2011

This gives ext4 the benefits of unlocked llseek.

Cc: tytso@mit.edu
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

vfs: add generic_file_llseek_size · 5760495a

Committed by Andi Kleen on Sep 15, 2011

Add a generic_file_llseek variant to the VFS that allows passing in
the maximum file size of the file system, instead of always
using maxbytes from the superblock.

This can be used to eliminate some cut'n'paste seek code in ext4.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

vfs: do (nearly) lockless generic_file_llseek · ef3d0fd2

Committed by Andi Kleen on Sep 15, 2011

The i_mutex lock use of generic_file_llseek hurts.  Independent processes
accessing the same file synchronize over a single lock, even though
they have no need for synchronization at all.

Under high utilization this can cause llseek to scale very poorly on larger
systems.

This patch does some rethinking of the llseek locking model:

First the 64bit f_pos is not necessarily atomic without locks
on 32bit systems. This can already cause races with read() today.
This was discussed on linux-kernel in the past and deemed acceptable.
The patch does not change that.

Let's look at the different seek variants:

SEEK_SET: Doesn't really need any locking.
If there's a race one writer wins, the other loses.

For 32bit the non atomic update races against read()
stay the same. Without a lock they can also happen
against write() now.  The read() race was deemed
acceptable in past discussions, and I think if it's
ok for read it's ok for write too.

=> Don't need a lock.

SEEK_END: This behaves like SEEK_SET plus it reads
the maximum size too. Reading the maximum size would have the
32bit atomic problem. But luckily we already have a way to read
the maximum size without locking (i_size_read), so we
can just use that instead.

Without i_mutex there is no synchronization with write() anymore,
however since the write() update is atomic on 64bit it just behaves
like another racy SEEK_SET.  On non atomic 32bit it's the same
as SEEK_SET.

=> Don't need a lock, but need to use i_size_read()

SEEK_CUR: This has a read-modify-write race window
on the same file. One could argue that any application
doing unsynchronized seeks on the same file is already broken.
But for the sake of not adding a regression here I'm
using the file->f_lock to synchronize this. Using this
lock is much better than the inode mutex because it doesn't
synchronize between processes.

=> So still need a lock, but can use a f_lock.
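
The three cases above can be summarised in a small model. This is a Python sketch of the locking decisions only, not the kernel implementation: the File and Inode classes are hypothetical, threading.Lock stands in for file->f_lock, and the 32-bit non-atomic f_pos issue is out of scope.

```python
import threading

SEEK_SET, SEEK_CUR, SEEK_END = 0, 1, 2


class Inode:
    def __init__(self, size):
        self._size = size

    def i_size_read(self):
        # Stand-in for the kernel helper that reads i_size without i_mutex.
        return self._size


class File:
    def __init__(self, inode):
        self.inode = inode
        self.f_pos = 0
        self.f_lock = threading.Lock()  # per-file lock, not per-inode


def _store_pos(file, offset):
    if offset < 0:
        raise ValueError("EINVAL")
    file.f_pos = offset
    return offset


def llseek(file, offset, whence):
    if whence == SEEK_SET:
        # No lock: if two seeks race, one simply wins.
        return _store_pos(file, offset)
    if whence == SEEK_END:
        # No lock, but the size is read through i_size_read().
        return _store_pos(file, file.inode.i_size_read() + offset)
    if whence == SEEK_CUR:
        # Read-modify-write of f_pos: serialise on the per-file lock.
        with file.f_lock:
            return _store_pos(file, file.f_pos + offset)
    raise ValueError("EINVAL")


f = File(Inode(size=100))
print(llseek(f, 10, SEEK_SET), llseek(f, 5, SEEK_CUR), llseek(f, -20, SEEK_END))
# 10 15 80
```

Because f_lock belongs to the open file rather than the inode, two processes that opened the same file independently never contend on it, which is the scalability point the message makes.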

This patch implements this new scheme in generic_file_llseek.
I dropped generic_file_llseek_unlocked and changed all callers.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

direct-io: merge direct_io_walker into __blockdev_direct_IO · 847cc637

Committed by Andi Kleen on Aug 01, 2011

This doesn't change anything for the compiler, but hch thought it would
make the code clearer.

I moved the reference counting into its own little inline.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

direct-io: inline the complete submission path · ba253fbf

Committed by Andi Kleen on Aug 01, 2011

Add inlines to all the submission path functions. While this increases
code size it also gives gcc a lot of optimization opportunities
in this critical hotpath.

In particular -- together with some other changes -- this
allows gcc to get rid of the unnecessary clearing of
sdio at the beginning and optimize the messy parameter passing.
Leaving any function that takes an sdio parameter out of line
would break this optimization, because it cannot be done once the
address of the structure is taken.

Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING
and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off.

This gives about 2.2% improvement on a large database benchmark
with a high IOPS rate.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

direct-io: separate map_bh from dio · 18772641

Committed by Andi Kleen on Aug 01, 2011

Only a single b_private field in the map_bh buffer head is needed after
the submission path. Move map_bh out separately to avoid storing
this information in the long term slab.

This avoids the weird 104 byte hole in struct dio_submit, which also needed
to be memset early.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

direct-io: use a slab cache for struct dio · 6e8267f5

Committed by Andi Kleen on Aug 01, 2011

A direct slab call is slightly faster than kmalloc and can be better cached
per CPU. It also avoids rounding to the next kmalloc slab.

In addition this enforces cache line alignment for struct dio to avoid
any false sharing.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

direct-io: rearrange fields in dio/dio_submit to avoid holes · 0dc2bc49

Committed by Andi Kleen on Aug 01, 2011

Fix most problems reported by pahole.

There is still a weird 104 byte hole after map_bh. I'm not sure what
causes this.
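
For readers unfamiliar with what pahole reports, the effect of field ordering on padding can be reproduced with ctypes, which lays structures out using the platform C ABI. The two structures below are made-up examples, not the real struct dio; padding_bytes is a hypothetical helper that counts all padding, including tail padding.

```python
import ctypes


class Holey(ctypes.Structure):
    # Small fields interleaved with large ones force alignment padding.
    _fields_ = [
        ("flag", ctypes.c_uint8),
        ("offset", ctypes.c_uint64),
        ("state", ctypes.c_uint8),
        ("count", ctypes.c_uint32),
    ]


class Rearranged(ctypes.Structure):
    # Same fields, sorted from largest to smallest.
    _fields_ = [
        ("offset", ctypes.c_uint64),
        ("count", ctypes.c_uint32),
        ("flag", ctypes.c_uint8),
        ("state", ctypes.c_uint8),
    ]


def padding_bytes(struct):
    """Bytes in the struct that belong to no field."""
    used = sum(ctypes.sizeof(ftype) for _, ftype in struct._fields_)
    return ctypes.sizeof(struct) - used


for s in (Holey, Rearranged):
    print(s.__name__, ctypes.sizeof(s), padding_bytes(s))
```

The exact numbers depend on the ABI, but the rearranged layout should carry less padding than the interleaved one on common platforms.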

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

direct-io: fix a wrong comment · cde1ecb3

Committed by Andi Kleen on Aug 01, 2011

There's nothing on the stack, even before my changes.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

direct-io: separate fields only used in the submission path from struct dio · eb28be2b

Committed by Andi Kleen on Aug 01, 2011

This large, but largely mechanical, patch moves all fields in struct dio
that are only used in the submission path into a separate on-stack
data structure. This has the advantage that the memory is very likely
cache hot, which is not guaranteed for memory fresh out of kmalloc.

This also gives gcc more optimization potential because it can more easily
determine that there are no external aliases for these variables.

The sdio initialization is now an initializer instead of a memset.
This allows gcc to break sdio into individual fields and optimize
away unnecessary zeroing (after all the functions are inlined).

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

vfs: fix spinning prevention in prune_icache_sb · 62a3ddef

Committed by Christoph Hellwig on Oct 28, 2011

We need to move the inode to the end of the list to actually make the
spinning prevention explained in the comment above it work.  With a
plain list_move it will simply stay in place as we're always reclaiming
from the head of the list.
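
The difference between the two list operations is easy to see in a toy model. The sketch below uses a deque as the LRU and a hypothetical prune() helper; appendleft plays the role of a plain list_move (back to the head) and append plays the role of moving to the tail. It is an illustration of the message, not the kernel code.

```python
from collections import deque


def prune(lru, busy, to_tail, max_scan):
    """Scan from the head; free what we can, requeue inodes we cannot lock."""
    freed = []
    for _ in range(max_scan):
        if not lru:
            break
        inode = lru.popleft()
        if inode in busy:
            if to_tail:
                lru.append(inode)       # move to the end: scan can progress
            else:
                lru.appendleft(inode)   # stays at the head: scan spins on it
            continue
        freed.append(inode)
    return freed


print(prune(deque(["a", "b", "c"]), {"a"}, to_tail=False, max_scan=3))  # []
print(prune(deque(["a", "b", "c"]), {"a"}, to_tail=True, max_scan=3))   # ['b', 'c']
```

With the head re-insert, the whole scan budget is spent on the same busy inode; moving it to the tail lets the remaining entries be reclaimed.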

Signed-off-by: Christoph Hellwig <hch@lst.de>

vfs: add a comment to inode_permission() · 948409c7

Committed by Andreas Gruenbacher on Oct 23, 2011

Acked-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruen@kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

vfs: pass all mask flags check_acl and posix_acl_permission · d124b60a

Committed by Andreas Gruenbacher on Oct 23, 2011

Acked-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruen@kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

vfs: indicate that the permission functions take all the MAY_* flags · 8fd90c8d

Committed by Andreas Gruenbacher on Oct 23, 2011

Acked-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruen@kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

compat: sync compat_stats with statfs. · 1448c721

Committed by Eric W. Biederman on Oct 17, 2011

This was found by inspection while tracking a similar
bug in compat_statfs64 that has been fixed in mainline
since December.

- This fixes a bug where not all of the f_spare fields
  were cleared on mips and s390.
- Add the f_flags field to struct compat_statfs
- Copy f_flags to userspace in case someone cares.
- Use __clear_user to copy the f_spare field to userspace
  to ensure that all of the elements of f_spare are cleared.
  On some architectures f_spare has 5 ints and on some
  architectures f_spare only has 4 ints.  Which makes
  the previous technique of clearing each int individually
  broken.
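
The f_spare point can be illustrated with ctypes. This is a sketch under assumptions: make_compat_statfs builds a cut-down, hypothetical structure with only f_flags and f_spare, and ctypes.memset over the whole array stands in for clearing the full field in one call, as the message describes for __clear_user.

```python
import ctypes


def make_compat_statfs(spare_len):
    """Build a cut-down struct whose f_spare has `spare_len` ints."""
    class CompatStatfs(ctypes.Structure):
        _fields_ = [
            ("f_flags", ctypes.c_int32),
            ("f_spare", ctypes.c_int32 * spare_len),
        ]
    return CompatStatfs


def fill_with_garbage(buf):
    for i in range(len(buf.f_spare)):
        buf.f_spare[i] = -1


def clear_each_int(buf):
    # Old technique: hard-codes four elements.
    for i in range(4):
        buf.f_spare[i] = 0


def clear_whole_array(buf):
    # New technique: clear sizeof(f_spare) bytes, whatever the length is.
    spare = buf.f_spare
    ctypes.memset(ctypes.addressof(spare), 0, ctypes.sizeof(spare))


Statfs5 = make_compat_statfs(5)
old, new = Statfs5(), Statfs5()
fill_with_garbage(old)
fill_with_garbage(new)
clear_each_int(old)
clear_whole_array(new)
print(list(old.f_spare))  # [0, 0, 0, 0, -1]
print(list(new.f_spare))  # [0, 0, 0, 0, 0]
```

On a layout with five spare ints the per-element approach leaves the last one uncleared, while sizing the clear from the array itself works for both the four-int and five-int variants.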

I don't expect anyone actually uses the old statfs system
call anymore but if they do let them benefit from having
the compat and the native version working the same.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

vfs: add "device" tag to /proc/self/mountstats · a877ee03

Committed by Bryan Schumaker on Oct 07, 2011

nfsiostat was failing to find mounted filesystems on kernels after
2.6.38 because of changes to show_vfsstat() by commit
c7f404b4.  This patch adds back the
"device" tag before the nfs server entry so scripts can parse the
mountstats file correctly.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
CC: stable@kernel.org [>=2.6.39]
Signed-off-by: Christoph Hellwig <hch@lst.de>

cleanup: vfs: small comment fix for block_invalidatepage · 814e1d25

Committed by Wang Sheng-Hui on Sep 01, 2011

The patch is against 3.1-rc3.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

Add definition for share encryption · 96814ecb

Committed by Steve French on Oct 24, 2011

Samba supports a setfs info level to negotiate encrypted
shares.  This patch adds the defines so we recognize
this info level.  Later patches will add the enablement
for it.

Acked-by: Jeremy Allison <jra@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>

27 Oct, 2011 1 commit

fs/Makefile: Stupid typo breakage of exofs inclusion · 60325f0c

Committed by Boaz Harrosh on Oct 26, 2011

In my last patch I made a stupid mistake and broke the exofs
compilation completely. Fix it ASAP.

Instead of obj-y I wrote obj-$(y).

Really Really sorry. Me totally blushing :-{|

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

26 Oct, 2011 11 commits

libceph: fix double-free of page vector · 33957340

Committed by Sage Weil on Oct 24, 2011

ceph_release_page_vector() kfrees the vector; we shouldn't do it here too.

Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
Signed-off-by: Sage Weil <sage@newdream.net>

ceph: fix 32-bit ino numbers · 3310f754

Committed by Amon Ott on Oct 20, 2011

Fix 32-bit ino generation to not always be 1.

Signed-off-by: Amon Ott <a.ott@m-privacy.de>

ceph: let the set_layout ioctl set single traits · a35eca95

Committed by Greg Farnum on Aug 25, 2011

Previously we were validating the passed-in stripe unit, object size,
and stripe count against each other (and not testing most other stuff).
Instead, make sure that the composed previous layout and new values are valid,
and only send the new values to the MDS. This lets users change the
pool without setting the whole layout, for instance.

Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>

Revert "ceph: don't truncate dirty pages in invalidate work thread" · 83eaea22

Committed by Sage Weil on Aug 24, 2011

This reverts commit c9af9fb6.

We need to block and truncate all pages in order to reliably invalidate
them.  Otherwise, we could:

 - have some uptodate pages in the cache
 - queue an invalidate
 - write(2) locks some pages
 - invalidate_work skips them
 - write(2) only overwrites part of the page
 - page now dirty and uptodate
 -> partial leakage of invalidated data

It's not entirely clear why we started skipping locked pages in the first
place.  I just ran this through fsx and didn't see any problems.

Signed-off-by: Sage Weil <sage@newdream.net>

ceph: replace leading spaces with tabs · 80db8bea

Committed by Noah Watkins on Aug 22, 2011

Trivial formatting fix.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>

libceph: don't complain on msgpool alloc failures · b61c2763

Committed by Sage Weil on Aug 09, 2011

The pool allocation failures are masked by the pool; there is no need to
spam the console about them.  (That's the whole point of having the pool
in the first place.)

Mark msg allocations whose failure is safely handled as such.

Signed-off-by: Sage Weil <sage@newdream.net>

libceph: create messenger with client · 6ab00d46

Committed by Sage Weil on Aug 09, 2011

This simplifies the init/shutdown paths, and makes client->msgr available
during the rest of the setup process.

Signed-off-by: Sage Weil <sage@newdream.net>

ceph: document ioctls · 6a8ea470

Committed by Sage Weil on Aug 04, 2011

...after some prodding by Christoph.

Signed-off-by: Sage Weil <sage@newdream.net>

ceph: implement (optional) max read size · 0d66a487

Committed by Sage Weil on Aug 04, 2011

The 'rsize' mount option limits the maximum size of an individual
read(ahead) operation that is sent off to an OSD.  This is distinct from
'rasize', which controls the size of the readahead window.
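
As a rough illustration of the distinction, the sketch below splits one readahead window into individual reads that each respect a maximum size. The function split_reads and the convention that rsize=None means "no limit" are assumptions made for this example; they are not the ceph client code.

```python
def split_reads(offset, length, rsize=None):
    """Split one window into reads of at most `rsize` bytes each."""
    reads = []
    end = offset + length
    while offset < end:
        remaining = end - offset
        chunk = remaining if rsize is None else min(rsize, remaining)
        reads.append((offset, chunk))
        offset += chunk
    return reads


# A 10-byte window with a 4-byte cap per read sent to an OSD.
print(split_reads(0, 10, rsize=4))  # [(0, 4), (4, 4), (8, 2)]
# Without a cap the window goes out as a single read.
print(split_reads(0, 10))           # [(0, 10)]
```

Here the window length plays the role of 'rasize' and the per-read cap plays the role of 'rsize'.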

Signed-off-by: Sage Weil <sage@newdream.net>

ceph: rename rsize -> rasize · 83817e35

Committed by Sage Weil on Aug 04, 2011

It controls readahead.

Signed-off-by: Sage Weil <sage@newdream.net>

ceph: make readpages fully async · 7c272194

Committed by Sage Weil on Aug 03, 2011

When we get a ->readpages() aop, submit async reads for all page ranges
in the provided page list.  Lock the pages immediately, so that VFS/MM
will block until the reads complete.

Signed-off-by: Sage Weil <sage@newdream.net>

25 Oct, 2011 5 commits

sysfs: Remove support for tagged directories with untagged members (again) · b9e2780d

Committed by Eric W. Biederman on Oct 25, 2011

In commit 8a9ea323 ("Merge git://.../davem/net-next"), where my sysfs
changes from the net tree merged with the sysfs rbtree changes from
Mikulas Patocka, the conflict resolution failed to preserve the
simplified property that was the point of my changes.

That is, sysfs_find_dirent can now say something is a match if and only if
s_name and s_ns match what we are looking for, and sysfs_readdir can
simply return all of the directory entries where s_ns matches the
directory that we should be returning.
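
The exact-match rule can be written down in a few lines. This is a simplified model: Dirent is a hypothetical two-field record and the entries live in a flat list rather than the rbtree the message mentions.

```python
from collections import namedtuple

# Only the two fields that take part in matching.
Dirent = namedtuple("Dirent", ["s_name", "s_ns"])


def find_dirent(entries, name, ns):
    """Match if and only if both the name and the namespace tag are equal."""
    for d in entries:
        if d.s_name == name and d.s_ns == ns:
            return d
    return None


def readdir(entries, ns):
    """Return every entry whose namespace tag equals the one requested."""
    return [d.s_name for d in entries if d.s_ns == ns]


entries = [
    Dirent("eth0", "netA"),
    Dirent("eth0", "netB"),
    Dirent("lo", "netA"),
]
print(find_dirent(entries, "eth0", "netB"))  # Dirent(s_name='eth0', s_ns='netB')
print(readdir(entries, "netA"))              # ['eth0', 'lo']
```

There is no fallback to untagged entries: a lookup with a tag that no entry carries simply finds nothing.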

Now that we are back to exact matches we can tweak sysfs_find_dirent and
the name rb_tree to order sysfs_dirents by s_ns s_name and remove the
second loop in sysfs_find_dirent. However that change seems a bit much
for a conflict resolution so it can come later.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ore: Enable RAID5 mounts · 44231e68

Committed by Boaz Harrosh on Nov 21, 2010

Now that we support RAID5, enable it at mount. RAID6 will come next;
RAID4 is not in demand, so it will probably not be enabled
(until someone wants it).

NOTE: mkfs.exofs has supported raid5/6 for a long time.
(Making an empty raidX FS is just as easy as raid0 ;-} )

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

exofs: Support for RAID5 read-4-write interface. · dd296619

Committed by Boaz Harrosh on Oct 12, 2011

The ORE needs to be supplied with an r4w_get_page/r4w_put_page API
by the filesystem so it can get cache pages to read into when
writing partial stripes.

Also, I commented out and NULLed the .writepage (singular)
vector, because it gives a terrible write pattern to RAID
and is apparently not needed. Even in OOM conditions the
system copes (even better) without it.

TODO: How to specify to write_cache_pages() to start
      or include a certain page?

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

ore: RAID5 Write · 769ba8d9

Committed by Boaz Harrosh on Oct 14, 2011

This is finally the RAID5 Write support.

The bigger part of this patch is not the XOR engine itself, But the
read4write logic, which is a complete mini prepare_for_striping
reading engine that can read scattered pages of a stripe into cache
so it can be used for XOR calculation. That is, if the write was not
stripe aligned.

The main algorithm behind the XOR engine is the 2 dimensional array:
	struct __stripe_pages_2d.
A drawing might save 1000 words
---

__stripe_pages_2d
       |
 n = pages_in_stripe_unit;
 w = group_width - parity;
       |                            pages array presented to the XOR lib
       |                                                |
       V                                                |
 __1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---|
       |                                                |
 __1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <---
       |
...    |                         ...
       |
 __1_page_stripe[n].pages --> [c0][c1]..[cw][c_par]
                               ^
                               |
           data added columns first then row

---
The pages are put on this array columns first, i.e.:
	p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ...
So we are doing a corner turn of the pages.
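
The corner turn and the per-row XOR can be sketched in a few lines of Python. This is a toy model under assumptions: pages are short byte strings, place_pages and xor_parity are hypothetical helpers, n is the number of rows (pages per stripe unit) and w the number of data columns, as in the drawing.

```python
def place_pages(pages, n, w):
    """Fill an n-row by w-column grid column first: p0-of-c0, p1-of-c0, ..."""
    grid = [[None] * w for _ in range(n)]
    for i, page in enumerate(pages):
        col, row = divmod(i, n)
        grid[row][col] = page
    return grid


def xor_parity(row):
    """XOR all data pages of one row into a parity page."""
    parity = bytearray(len(row[0]))
    for page in row:
        for i, byte in enumerate(page):
            parity[i] ^= byte
    return bytes(parity)


# Six 2-byte pages with values 1..6, two pages per stripe unit, three columns.
pages = [bytes([k, k]) for k in range(1, 7)]
grid = place_pages(pages, n=2, w=3)
print([p[0] for p in grid[0]], [p[0] for p in grid[1]])  # [1, 3, 5] [2, 4, 6]
print(xor_parity(grid[0]), xor_parity(grid[1]))          # b'\x07\x07' b'\x00\x00'
```

Each row is what gets handed to the XOR step, and XORing the parity with all but one page of a row gives back the missing page, which is what makes the scheme useful for recovery.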

Note that pages will zigzag down and left, but are put sequentially
in growing order. So when the time comes to XOR the stripe, only the
beginning and end of the array need be checked. We scan the array
and any NULL spot will be filled by pages-to-be-read.

The FS that wants to support RAID5 needs to supply an
operations-vector that searches for a given page in cache, and specifies
if the page is uptodate or needs reading. All these pages to be read
are put on a slave ore_io_state and synchronously read. All the pages
of a stripe are read in one IO, using the scatter gather mechanism.

In write we constrain our IO to only be incomplete on a single
stripe. This means either the complete IO is within a single stripe, so
we might have pages to read from both the beginning and the end of the
stripe, or we have some reading to do at the beginning but end at a stripe
boundary. The leftover pages are pushed to the next IO by the API
already established by previous work, where an IO offset/length
combination presented to the ORE might get the length truncated and
the user must re-submit the leftover pages. (Both exofs and NFS
support this)

But any ORE user should make its best effort to align its IO
beforehand and avoid complications. A cached ore_layout->stripe_size
member can be used for that calculation. (NOTE: ORE demands
that stripe_size not be bigger than 32 bits.)

What else? Well read it and tell me.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

ore: RAID5 read · a1fec1db

Committed by Boaz Harrosh on Oct 12, 2011

This patch introduces the first stage of RAID5 support,
mainly the skip-over-raid-units when reading. For
writes it inserts BLANK units where XOR blocks
should be calculated and written.

It introduces the new "general raid maths", and the main
additional parameters and components needed for raid5.

Since at this stage it could corrupt future versions that
actually do support raid5, the enablement of raid5
mounting and setting of parity-count > 0 is disabled, so
the raid5 code will never be used. Mounting of raid5 is
only enabled later once the basic XOR write is also in.
But if the patch "enable RAID5" is applied, this code has
been tested to be able to properly read raid5 volumes
and conforms to the standard.

Also it has been tested that the new maths still properly
supports RAID0 and grouping code just as before.
(BTW: I have found more bugs in the pnfs-obj RAID math
 fixed here)

The ore.c file is getting too big, so new ore_raid.[hc]
files are added that will include the special raid stuff
that is not used in striping and mirrors. With future write
support these will get bigger.
When adding ore_raid.c to the Kbuild file I was forced to
rename ore.ko to libore.ko. Is it possible to keep the source
file, say ore.c, and the module file ore.ko the same even if there
are multiple files inside ore.ko?

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>