Commits · 89e6054836a7b1e7500cd70a14b5579e752c9250 · openeuler / raspberrypi-kernel

07 Jan, 2011 27 commits

fs: dcache reduce prune_one_dentry locking · 89e60548

Committed by Nick Piggin on Jan 07, 2011

prune_one_dentry can avoid quite a bit of locking in the common case where
ancestors have an elevated refcount. Alternatively, we could have gone the
other way and made fewer trylocks in the case where d_count goes to zero, but
that is probably less common.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: dcache reduce d_parent locking · a734eb45

Committed by Nick Piggin on Jan 07, 2011

Use RCU to simplify locking in dget_parent.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: dcache rationalise dget variants · dc0474be

Committed by Nick Piggin on Jan 07, 2011

dget_locked was a shortcut to avoid the lazy lru manipulation when we already
held dcache_lock (lru manipulation was relatively cheap at that point).
However, now that the lru lock is an innermost one, we never hold it at any
caller, so the lock cost can now be avoided. We already have a well-working lazy
dcache LRU, so it should be fine to defer LRU manipulations to scan time.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: dcache reduce dcache_inode_lock · 357f8e65

Committed by Nick Piggin on Jan 07, 2011

dcache_inode_lock can be avoided in d_delete() and d_materialise_unique()
in cases where it is not required.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: dcache reduce locking in d_alloc · 89ad485f

Committed by Nick Piggin on Jan 07, 2011

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: dcache reduce dput locking · 61f3dee4

Committed by Nick Piggin on Jan 07, 2011

It is possible to run dput without taking data structure locks up-front. In
many cases where we don't kill the dentry anyway, these locks are not required.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: dcache avoid starvation in dcache multi-step operations · 58db63d0

Committed by Nick Piggin on Jan 07, 2011

Long-lived dcache "multi-step" operations which retry on the rename seq can
be starved by heavy rename activity. If they fail after the first pass,
take the rename_lock for writing to avoid further starvation.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
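The fallback described in this commit can be sketched in userspace C11 as follows. This is only an illustration under stated assumptions: `seqlock_sketch`, `walk_with_fallback()` and the demo callback are hypothetical names, the lock is a toy spin flag, and none of this is the kernel's actual rename_lock code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for a seqlock such as rename_lock. */
struct seqlock_sketch {
    atomic_flag lock;       /* serialises writers */
    atomic_uint sequence;   /* bumped before and after each write */
};

static void write_lock_sketch(struct seqlock_sketch *sl)
{
    while (atomic_flag_test_and_set_explicit(&sl->lock, memory_order_acquire))
        ;
    atomic_fetch_add(&sl->sequence, 1);     /* now odd: write in progress */
}

static void write_unlock_sketch(struct seqlock_sketch *sl)
{
    atomic_fetch_add(&sl->sequence, 1);     /* even again */
    atomic_flag_clear_explicit(&sl->lock, memory_order_release);
}

/*
 * Run walk() locklessly once; if a writer interfered, repeat it with the
 * write side held so that no further writer can force another retry.
 * Returns the number of passes made (1 or 2).
 */
static int walk_with_fallback(struct seqlock_sketch *sl,
                              void (*walk)(void *), void *arg)
{
    unsigned seq;

    /* Wait for an even (writer-free) sequence before the lockless pass. */
    while ((seq = atomic_load(&sl->sequence)) & 1)
        ;
    walk(arg);
    if (atomic_load(&sl->sequence) == seq)
        return 1;

    write_lock_sketch(sl);
    walk(arg);
    write_unlock_sketch(sl);
    return 2;
}

/* Demo callback: simulates one concurrent rename during its first call. */
struct demo_state_sketch {
    struct seqlock_sketch *sl;
    int calls;
};

static void demo_walk_sketch(void *arg)
{
    struct demo_state_sketch *st = arg;

    if (st->calls++ == 0) {
        write_lock_sketch(st->sl);
        write_unlock_sketch(st->sl);
    }
}
```

The second pass is bounded because it excludes writers entirely, which is the trade the commit message describes: one optimistic attempt, then guaranteed progress at the cost of briefly blocking renames.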

fs: dcache remove dcache_lock · b5c84bf6

Committed by Nick Piggin on Jan 07, 2011

dcache_lock no longer protects anything. Remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: Use rename lock and RCU for multi-step operations · 949854d0

Committed by Nick Piggin on Jan 07, 2011

The remaining uses of dcache_lock are to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree,
and to walk in the leaf->root direction in the tree, where we don't have
a natural d_lock ordering.

This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.

Solve this instead by using the rename seqlock for multi-step read-side
operations, retrying in case of a rename so we don't walk up the wrong parent.
Concurrent dentry insertions are not serialised against. Concurrent deletes
are tricky when walking up the directory: our parent might have been deleted
when dropping locks, so we also need to check and retry for that.

We can also use the rename lock in cases where livelock is a worry (this
is introduced in a subsequent patch).

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
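The read-side retry pattern this commit relies on can be sketched in userspace C11 as follows. The names (`seqcount_sketch`, `read_begin_sketch()` and friends) are hypothetical stand-ins rather than kernel APIs, and writer-versus-writer exclusion is assumed to come from an outer lock that is not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the sequence half of rename_lock. */
struct seqcount_sketch {
    atomic_uint sequence;   /* odd while a writer is mid-update */
};

/* Reader: wait for an even (writer-free) value and remember it. */
static unsigned read_begin_sketch(struct seqcount_sketch *s)
{
    unsigned seq;

    while ((seq = atomic_load_explicit(&s->sequence, memory_order_acquire)) & 1)
        ;
    return seq;
}

/* Reader: true if any write started since read_begin_sketch(). */
static bool read_retry_sketch(struct seqcount_sketch *s, unsigned seq)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&s->sequence, memory_order_relaxed) != seq;
}

/* Writers are assumed to be serialised by some outer lock (not shown). */
static void write_begin_sketch(struct seqcount_sketch *s)
{
    unsigned old = atomic_fetch_add(&s->sequence, 1);

    assert((old & 1) == 0);   /* no other writer may be active */
    (void)old;
}

static void write_end_sketch(struct seqcount_sketch *s)
{
    unsigned old = atomic_fetch_add(&s->sequence, 1);

    assert((old & 1) == 1);   /* must pair with write_begin_sketch() */
    (void)old;
}
```

A multi-step reader would wrap its tree walk in a loop of the form `do { seq = read_begin_sketch(&s); /* walk */ } while (read_retry_sketch(&s, seq));`, discarding any result computed while a rename was in flight.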

fs: increase d_name lock coverage · 9abca360

Committed by Nick Piggin on Jan 07, 2011

Cover d_name with d_lock in more cases, where there may be concurrent
modification to it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: scale inode alias list · b23fb0a6

Committed by Nick Piggin on Jan 07, 2011

Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
from concurrent modification. d_alias is also protected by d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: dcache scale subdirs · 2fd6b7f5

Committed by Nick Piggin on Jan 07, 2011

Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (e.g. using i_mutex).

Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: dcache scale d_unhashed · da502956

Committed by Nick Piggin on Jan 07, 2011

Protect the d_unhashed(dentry) condition with d_lock. This means keeping the
DCACHE_UNHASHED bit in sync with hash manipulations.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: dcache scale dentry refcount · b7ab39f6

Committed by Nick Piggin on Jan 07, 2011

Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
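The idea of a plain integer refcount guarded by a per-object lock can be sketched in userspace C11 as follows. Everything here is an assumption made for illustration: `dentry_sketch`, the toy spin flag, and the "only take a reference if one is already held" policy are hypothetical and do not reproduce the kernel's dget()/dput().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified dentry: only the lock and the refcount. */
struct dentry_sketch {
    atomic_flag d_lock;   /* stand-in for the per-dentry spinlock */
    unsigned d_count;     /* plain integer, only touched under d_lock */
};

static void d_lock_sketch(struct dentry_sketch *d)
{
    while (atomic_flag_test_and_set_explicit(&d->d_lock, memory_order_acquire))
        ;
}

static void d_unlock_sketch(struct dentry_sketch *d)
{
    atomic_flag_clear_explicit(&d->d_lock, memory_order_release);
}

/* Take a reference only if one is already held; a zero count stays zero. */
static bool dget_if_live_sketch(struct dentry_sketch *d)
{
    bool live;

    d_lock_sketch(d);
    live = d->d_count > 0;
    if (live)
        d->d_count++;
    d_unlock_sketch(d);
    return live;
}

/* Drop a reference and report how many remain. */
static unsigned dput_sketch(struct dentry_sketch *d)
{
    unsigned left;

    d_lock_sketch(d);
    assert(d->d_count > 0);
    left = --d->d_count;
    d_unlock_sketch(d);
    return left;
}
```

Because the check and the increment happen under the same lock, any code that holds the lock and observes a zero count knows the count cannot change until it unlocks, which is the property the commit message points at.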

fs: dcache scale lru · 23044507

Committed by Nick Piggin on Jan 07, 2011

Add a new lock, dcache_lru_lock, to protect the dcache LRU list from concurrent
modification. d_lru is also protected by d_lock, which allows LRU lists to be
accessed without the lru lock, using RCU in future patches.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: dcache scale hash · 789680d1

Committed by Nick Piggin on Jan 07, 2011

Add a new lock, dcache_hash_lock, to protect the dcache hash table from
concurrent modification. d_hash is also protected by d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

hostfs: simplify locking · ec2447c2

Committed by Nick Piggin on Jan 07, 2011

Remove dcache_lock locking from the hostfs filesystem, and move it into dcache
helpers. All that is required is a coherent path name. Protection from
concurrent modification of the namespace after path name generation is not
provided in current code, because dcache_lock is dropped before the path is
used.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: change d_hash for rcu-walk · b1e6a015

Committed by Nick Piggin on Jan 07, 2011

Change d_hash so it may be called from lock-free RCU lookups. See similar
patch for d_compare for details.

For in-tree filesystems, this is just a mechanical change.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: change d_compare for rcu-walk · 621e155a

Committed by Nick Piggin on Jan 07, 2011

Change d_compare so it may be called from lock-free RCU lookups. This
does put significant restrictions on what may be done from the callback;
however, there don't seem to have been any problems with in-tree fses.
If some strange use case pops up that _really_ cannot cope with the
rcu-walk rules, we can just add new rcu-unaware callbacks, which would
cause name lookup to drop out of rcu-walk mode.

For in-tree filesystems, this is just a mechanical change.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: name case update method · fb2d5b86

Committed by Nick Piggin on Jan 07, 2011

smbfs and ncpfs want to update a live dentry name in-place. Rather than
have them open code the locking, provide a documented dcache API.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

jfs: dont overwrite dentry name in d_revalidate · 2bc334dc

Committed by Nick Piggin on Jan 07, 2011

Use vfat's method for dealing with negative dentries to preserve case,
rather than overwrite dentry name in d_revalidate, which is a bit ugly
and also gets in the way of doing lock-free path walking.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

cifs: dont overwrite dentry name in d_revalidate · 79eb4dde

Committed by Nick Piggin on Jan 07, 2011

fs: change d_delete semantics · fe15ce44

Committed by Nick Piggin on Jan 07, 2011

Change d_delete from a dentry deletion notification to a dentry caching
advice, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.

This makes fine-grained dentry locking of dput and dentry lru scanning
much simpler.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

configfs: avoid switching ->d_op on live dentry · fbc8d4c0

Committed by Nick Piggin on Jan 07, 2011

Switching d_op on a live dentry is racy in general, so avoid it. In this case
it is a negative dentry, which is safer, but there are still concurrent ops
which may be called on d_op in that case (e.g. d_revalidate). So in general
a filesystem may not do this. Fix configfs so as not to do this.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: use fast counters for vfs caches · 3e880fb5

Committed by Nick Piggin on Jan 07, 2011

The percpu_counter library generates quite nasty code, so unless you need
to dynamically allocate counters or take a fast approximate value, a
simple per-cpu set of counters is much better.

The percpu_counter can never be made to work as well, because it has an
indirection from pointer to percpu memory, and it can't use direct
this_cpu_inc interfaces because it doesn't use static PER_CPU data, so
code will always be worse.

In the fastpath, it is the difference between this:

        incl %gs:nr_dentry      # nr_dentry

and this:

        movl    percpu_counter_batch(%rip), %edx        # percpu_counter_batch,
        movl    $1, %esi        #,
        movq    $nr_dentry, %rdi        #,
        call    __percpu_counter_add    # (plus I clobber registers)

__percpu_counter_add:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,
        subq    $32, %rsp       #,
        movq    %rbx, -24(%rbp) #,
        movq    %r12, -16(%rbp) #,
        movq    %r13, -8(%rbp)  #,
        movq    %rdi, %rbx      # fbc, fbc
#APP
# 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
        movq %gs:kernel_stack,%rax      #, pfo_ret__
# 0 "" 2
#NO_APP
        incl    -8124(%rax)     # <variable>.preempt_count
        movq    32(%rdi), %r12  # <variable>.counters, tcp_ptr__
#APP
# 78 "lib/percpu_counter.c" 1
        add %gs:this_cpu_off, %r12      # this_cpu_off, tcp_ptr__
# 0 "" 2
#NO_APP
        movslq  (%r12),%r13     #* tcp_ptr__, tmp73
        movslq  %edx,%rax       # batch, batch
        addq    %rsi, %r13      # amount, count
        cmpq    %rax, %r13      # batch, count
        jge     .L27    #,
        negl    %edx    # tmp76
        movslq  %edx,%rdx       # tmp76, tmp77
        cmpq    %rdx, %r13      # tmp77, count
        jg      .L28    #,
.L27:
        movq    %rbx, %rdi      # fbc,
        call    _raw_spin_lock  #
        addq    %r13, 8(%rbx)   # count, <variable>.count
        movq    %rbx, %rdi      # fbc,
        movl    $0, (%r12)      #,* tcp_ptr__
        call    _raw_spin_unlock        #
.L29:
#APP
# 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
        movq %gs:kernel_stack,%rax      #, pfo_ret__
# 0 "" 2
#NO_APP
        decl    -8124(%rax)     # <variable>.preempt_count
        movq    -8136(%rax), %rax       #, D.14625
        testb   $8, %al #, D.14625
        jne     .L32    #,
.L31:
        movq    -24(%rbp), %rbx #,
        movq    -16(%rbp), %r12 #,
        movq    -8(%rbp), %r13  #,
        leave
        ret
        .p2align 4,,10
        .p2align 3
.L28:
        movl    %r13d, (%r12)   # count,*
        jmp     .L29    #
.L32:
        call    preempt_schedule        #
        .p2align 4,,6
        jmp     .L31    #
        .size   __percpu_counter_add, .-__percpu_counter_add
        .p2align 4,,15
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
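The "simple per-cpu set of counters" approach can be sketched in plain C as follows. This is a single-threaded illustration under loud assumptions: the fixed `NR_CPUS_SKETCH`, the explicit `cpu` argument and all function names are hypothetical, and a real kernel implementation would use static per-cpu data and this_cpu operations instead of an array index.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4   /* hypothetical fixed CPU count */

/* One slot per CPU; each CPU only ever writes its own slot. */
static long nr_dentry_sketch[NR_CPUS_SKETCH];

/* Fast path: a single add to the local slot, no lock and no batching. */
static void nr_dentry_add_sketch(int cpu, long delta)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    nr_dentry_sketch[cpu] += delta;
}

/*
 * Slow path: sum every slot. An individual slot may be negative when an
 * object was counted on one CPU and uncounted on another. The total is
 * clamped at zero as a guard for a concurrent setting, where a racy sum
 * could transiently come out negative.
 */
static long nr_dentry_sum_sketch(void)
{
    long sum = 0;

    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += nr_dentry_sketch[cpu];
    return sum < 0 ? 0 : sum;
}
```

The design choice is to make the common increment as cheap as possible and accept that reading the total costs a loop over all CPUs, which suits counters that are updated often and read rarely.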

vfs: revert per-cpu nr_unused counters for dentry and inodes · 86c8749e

Committed by Nick Piggin on Jan 07, 2011

The nr_unused counters count the number of objects on an LRU, and as such they
are synchronized with LRU object insertion and removal and scanning, and
protected under the LRU lock.

Making it per-cpu does not actually get any concurrency improvements because of
this lock, and summing the counter is much slower, and
incrementing/decrementing it costs more code size and is slower too.

These counters should stay per-LRU, which currently means global.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fs: d_validate fixes · 786a5e15

Committed by Nick Piggin on Jan 07, 2011

d_validate has been broken for a long time.

kmem_ptr_validate does not guarantee that a pointer can be dereferenced
if it can go away at any time. Even rcu_read_lock doesn't help, because
the pointer might be queued in RCU callbacks but not executed yet.

So the parent cannot be checked, nor the name hashed. The dentry pointer
cannot be touched until it can be verified under lock. Hashing simply
cannot be used.

Instead, verify the parent/child relationship by traversing parent's
d_child list. It's slow, but only ncpfs and the destaged smbfs care
about it, at this point.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
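The list-traversal check described above can be sketched in plain C as follows. The structure and names (`node_sketch`, `validate_child_sketch()`) are hypothetical, and the locking that would protect the parent's child list in the kernel is deliberately left out; the point shown is only that the untrusted pointer is compared, never dereferenced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified tree node with a singly linked child list. */
struct node_sketch {
    struct node_sketch *first_child;
    struct node_sketch *next_sibling;
};

/*
 * Return true only if `candidate` is currently linked as a child of
 * `parent`. The candidate is compared by address and never dereferenced,
 * so a stale or bogus pointer is harmless here. A real implementation
 * would hold the parent's lock across the walk (not shown).
 */
static bool validate_child_sketch(const struct node_sketch *parent,
                                  const void *candidate)
{
    assert(parent != NULL);
    for (const struct node_sketch *c = parent->first_child; c != NULL;
         c = c->next_sibling) {
        if ((const void *)c == candidate)
            return true;
    }
    return false;
}
```

The walk is linear in the number of children, which matches the commit's remark that the approach is slow but acceptable for its few callers.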

05 Jan, 2011 1 commit

Revert "fs: use RCU read side protection in d_validate" · d3a23e16

Committed by Nick Piggin on Jan 05, 2011

This reverts commit 3825bdb7.

You cannot dget() a dentry without having a reference, or holding
a lock that guarantees it remains valid.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

24 Dec, 2010 1 commit

ext4: fix on-line resizing regression · 8a7411a2

Committed by Theodore Ts'o on Dec 20, 2010

https://bugzilla.kernel.org/show_bug.cgi?id=25352

This regression was caused by commit a31437b8: "ext4: use
sb_issue_zeroout in setup_new_group_blocks", by accidentally dropping
the code which reserved the block group descriptor and inode table
blocks.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

23 Dec, 2010 2 commits

logfs: fix "Kernel BUG at readwrite.c:1193" · f06328d7

Committed by Prasad Joshi on Dec 21, 2010

This happens when __logfs_create() tries to write a new inode to a disk
which is full.

__logfs_create() associates the transaction pointer with the inode. During
the logfs_write_inode() function call chain this transaction pointer is
moved from the inode to page->private using move_inode_to_page()
(do_write_inode() -> inode_to_page() -> move_inode_to_page()).

When the inode write fails, the transaction is aborted and iput is called
on the failed inode. During delete_inode the same transaction pointer
associated with the page is used, which causes the kernel BUG.

The patch checks for an error in write_inode() and restores page->private
to NULL.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=20162

Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Florian Mickler <florian@mickler.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

logfs: fix deadlock in logfs_get_wblocks, hold and wait on super->s_write_mutex · eabb26ca

Committed by Prasad Joshi on Dec 21, 2010

do_logfs_journal_wl_pass() should use GFP_NOFS for memory allocation.

GC code calls btree_insert32 with GFP_KERNEL while holding the mutex
super->s_write_mutex.

The same mutex is used in address_space_operations->writepage(), and a
call to writepage() could be triggered as a result of memory allocation
in btree_insert32, causing a deadlock.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=20342

Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Florian Mickler <florian@mickler.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

22 Dec, 2010 1 commit

ocfs2: Fix system inodes cache overflow. · 7d8f9876

Committed by Tao Ma on Dec 22, 2010

When we store the system inodes cache in ocfs2_super,
we use an array for global system inodes. Unfortunately,
the range is calculated wrongly, which makes it overflow and
pollute ocfs2_super->local_system_inodes.
This patch fixes it by setting the range properly.

The corresponding bug is ossbug1303.
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1303

Cc: stable@kernel.org
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>

21 Dec, 2010 1 commit

Fix btrfs b0rkage · 3cb50ddf

Committed by Al Viro on Dec 20, 2010

Buggered-in: 76dda93c ("Btrfs: add snapshot/subvolume destroy
ioctl")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

18 Dec, 2010 2 commits

ceph: mark user pages dirty on direct-io reads · b6aa5901

Committed by Henry C Chang on Dec 15, 2010

For a read operation, we have to set the argument _write_ of get_user_pages
to 1 since we will write data to pages. Also, we need to SetPageDirty before
releasing these pages.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>

ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport · 92cf7652

Committed by Sage Weil on Dec 17, 2010

The fh_to_dentry etc. methods use ceph_init_dentry(), which assumes that
d_parent is defined. It isn't for those callers, so check!

Signed-off-by: Sage Weil <sage@newdream.net>

16 Dec, 2010 5 commits

ocfs2: Hold ip_lock when set/clear flags for indexed dir. · 8ac33dc8

Committed by Tao Ma on Dec 15, 2010

When we set/clear the dyn_features for an inode we hold the ip_lock.
So do it when we set/clear OCFS2_INDEXED_DIR_FL also.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>

ocfs2: Adjust masklog flag values · 41b41a26

Committed by Sunil Mushran on Dec 09, 2010

Two masklogs had the same flag value.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>

nilfs2: fix regression of garbage collection ioctl · 947b10ae

Committed by Ryusuke Konishi on Dec 16, 2010

On 2.6.37-rc1, the garbage collection ioctl of nilfs was broken by
commit 263d90ce ("nilfs2: remove own inode hash used for GC"),
leading to filesystem corruption.

That patch doesn't queue gc-inodes for the log writer if they are reused
through the vfs inode cache. Here, a gc-inode is an inode which
buffers blocks to be relocated on GC. That patch queues gc-inodes in
the nilfs_init_gcinode() function, but this function is not called when
they don't have the I_NEW flag. Thus, some live blocks are wrongly
overwritten without being moved to new logs.

This resolves the problem by moving the gc-inode queueing to an outer
function to ensure it's done right.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

ceph: fix direct-io on non-page-aligned buffers · ab226e21

Committed by Henry C Chang on Dec 15, 2010

The user buffer may be 512-byte aligned, not page-aligned. We were
assuming the buffer was page-aligned and only accounting for
non-page-aligned io offsets.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
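The arithmetic behind this fix can be illustrated with a small C helper that counts how many pages a buffer touches when its start address is not page-aligned. This is a sketch under assumptions: a 4 KiB page size is assumed, and `pages_spanned_sketch()` is a hypothetical name, not the function the ceph code uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12   /* assumed 4 KiB pages */

/*
 * Number of pages touched by the byte range [addr, addr + len).
 * Counting from the first to the last page index takes the offset of
 * `addr` within its page into account, which a plain len / page_size
 * rounding would miss.
 */
static size_t pages_spanned_sketch(uintptr_t addr, size_t len)
{
    uintptr_t first, last;

    if (len == 0)
        return 0;
    assert(len - 1 <= UINTPTR_MAX - addr);   /* range must not wrap */
    first = addr >> PAGE_SHIFT_SKETCH;
    last = (addr + len - 1) >> PAGE_SHIFT_SKETCH;
    return (size_t)(last - first + 1);
}
```

For example, a 4096-byte buffer starting at the 512-byte-aligned address 0x1200 ends at 0x21ff and therefore spans two pages, whereas the same length starting at 0x1000 fits in one.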

install_special_mapping skips security_file_mmap check. · 462e635e

Committed by Tavis Ormandy on Dec 09, 2010

The install_special_mapping routine (used, for example, to set up the
vdso) skips the security check before insert_vm_struct, allowing a local
attacker to bypass the mmap_min_addr security restriction by limiting
the available pages for special mappings.

bprm_mm_init() also skips the check, and although I don't think this can
be used to bypass any restrictions, I don't see any reason not to have
the security check.

  $ uname -m
  x86_64
  $ cat /proc/sys/vm/mmap_min_addr
  65536
  $ cat install_special_mapping.s
  section .bss
      resb BSS_SIZE
  section .text
      global _start
      _start:
          mov     eax, __NR_pause
          int     0x80
  $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
  $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
  $ ./install_special_mapping &
  [1] 14303
  $ cat /proc/14303/maps
  0000f000-00010000 r-xp 00000000 00:00 0                                  [vdso]
  00010000-00011000 r-xp 00001000 00:19 2453665                            /home/taviso/install_special_mapping
  00011000-ffffe000 rwxp 00000000 00:00 0                                  [stack]

It's worth noting that Red Hat are shipping with mmap_min_addr set to
4096.
Signed-off-by: Tavis Ormandy <taviso@google.com>
Acked-by: Kees Cook <kees@ubuntu.com>
Acked-by: Robert Swiecki <swiecki@google.com>
[ Changed to not drop the error code - akpm ]
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>