1. 07 1月, 2011 19 次提交
    • N
      fs: Use rename lock and RCU for multi-step operations · 949854d0
      Nick Piggin 提交于
      The remaining usages for dcache_lock is to allow atomic, multi-step read-side
      operations over the directory tree by excluding modifications to the tree.
      Also, to walk in the leaf->root direction in the tree where we don't have
      a natural d_lock ordering.
      
      This could be accomplished by taking every d_lock, but this would mean a
      huge number of locks and actually gets very tricky.
      
      Solve this instead by using the rename seqlock for multi-step read-side
      operations, retry in case of a rename so we don't walk up the wrong parent.
      Concurrent dentry insertions are not serialised against.  Concurrent deletes
      are tricky when walking up the directory: our parent might have been deleted
      when dropping locks so also need to check and retry for that.
      
      We can also use the rename lock in cases where livelock is a worry (and it
      is introduced in subsequent patch).
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      949854d0
    • N
      fs: increase d_name lock coverage · 9abca360
      Nick Piggin 提交于
      Cover d_name with d_lock in more cases, where there may be concurrent
      modification to it.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      9abca360
    • N
      fs: scale inode alias list · b23fb0a6
      Nick Piggin 提交于
      Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
      from concurrent modification. d_alias is also protected by d_lock.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b23fb0a6
    • N
      fs: dcache scale subdirs · 2fd6b7f5
      Nick Piggin 提交于
      Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
      using dcache_lock for these anyway (eg. using i_mutex).
      
      Note: if we change the locking rule in future so that ->d_child protection is
      provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
      But it would be an exception to an otherwise regular locking scheme, so we'd
      have to see some good results. Probably not worthwhile.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      2fd6b7f5
    • N
      fs: dcache scale d_unhashed · da502956
      Nick Piggin 提交于
      Protect d_unhashed(dentry) condition with d_lock. This means keeping
      DCACHE_UNHASHED bit in synch with hash manipulations.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      da502956
    • N
      fs: dcache scale dentry refcount · b7ab39f6
      Nick Piggin 提交于
      Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
      0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
      we start protecting many other dentry members with d_lock.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b7ab39f6
    • N
      fs: dcache scale lru · 23044507
      Nick Piggin 提交于
      Add a new lock, dcache_lru_lock, to protect the dcache LRU list from concurrent
      modification. d_lru is also protected by d_lock, which allows LRU lists to be
      accessed without the lru lock, using RCU in future patches.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      23044507
    • N
      fs: dcache scale hash · 789680d1
      Nick Piggin 提交于
      Add a new lock, dcache_hash_lock, to protect the dcache hash table from
      concurrent modification. d_hash is also protected by d_lock.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      789680d1
    • N
      hostfs: simplify locking · ec2447c2
      Nick Piggin 提交于
      Remove dcache_lock locking from hostfs filesystem, and move it into dcache
      helpers. All that is required is a coherent path name. Protection from
      concurrent modification of the namespace after path name generation is not
      provided in current code, because dcache_lock is dropped before the path is
      used.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      ec2447c2
    • N
      fs: change d_hash for rcu-walk · b1e6a015
      Nick Piggin 提交于
      Change d_hash so it may be called from lock-free RCU lookups. See similar
      patch for d_compare for details.
      
      For in-tree filesystems, this is just a mechanical change.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b1e6a015
    • N
      fs: change d_compare for rcu-walk · 621e155a
      Nick Piggin 提交于
      Change d_compare so it may be called from lock-free RCU lookups. This
      does put significant restrictions on what may be done from the callback,
      however there don't seem to have been any problems with in-tree fses.
      If some strange use case pops up that _really_ cannot cope with the
      rcu-walk rules, we can just add new rcu-unaware callbacks, which would
      cause name lookup to drop out of rcu-walk mode.
      
      For in-tree filesystems, this is just a mechanical change.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      621e155a
    • N
      fs: name case update method · fb2d5b86
      Nick Piggin 提交于
      smpfs and ncpfs want to update a live dentry name in-place. Rather than
      have them open code the locking, provide a documented dcache API.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fb2d5b86
    • N
      jfs: dont overwrite dentry name in d_revalidate · 2bc334dc
      Nick Piggin 提交于
      Use vfat's method for dealing with negative dentries to preserve case,
      rather than overwrite dentry name in d_revalidate, which is a bit ugly
      and also gets in the way of doing lock-free path walking.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      2bc334dc
    • N
      cifs: dont overwrite dentry name in d_revalidate · 79eb4dde
      Nick Piggin 提交于
      Use vfat's method for dealing with negative dentries to preserve case,
      rather than overwrite dentry name in d_revalidate, which is a bit ugly
      and also gets in the way of doing lock-free path walking.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      79eb4dde
    • N
      fs: change d_delete semantics · fe15ce44
      Nick Piggin 提交于
      Change d_delete from a dentry deletion notification to a dentry caching
      advise, more like ->drop_inode. Require it to be constant and idempotent,
      and not take d_lock. This is how all existing filesystems use the callback
      anyway.
      
      This makes fine grained dentry locking of dput and dentry lru scanning
      much simpler.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fe15ce44
    • N
      config fs: avoid switching ->d_op on live dentry · fbc8d4c0
      Nick Piggin 提交于
      Switching d_op on a live dentry is racy in general, so avoid it. In this case
      it is a negative dentry, which is safer, but there are still concurrent ops
      which may be called on d_op in that case (eg. d_revalidate). So in general
      a filesystem may not do this. Fix configfs so as not to do this.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fbc8d4c0
    • N
      fs: use fast counters for vfs caches · 3e880fb5
      Nick Piggin 提交于
      percpu_counter library generates quite nasty code, so unless you need
      to dynamically allocate counters or take fast approximate value, a
      simple per cpu set of counters is much better.
      
      The percpu_counter can never be made to work as well, because it has an
      indirection from pointer to percpu memory, and it can't use direct
      this_cpu_inc interfaces because it doesn't use static PER_CPU data, so
      code will always be worse.
      
      In the fastpath, it is the difference between this:
      
              incl %gs:nr_dentry      # nr_dentry
      
      and this:
      
              movl    percpu_counter_batch(%rip), %edx        # percpu_counter_batch,
              movl    $1, %esi        #,
              movq    $nr_dentry, %rdi        #,
              call    __percpu_counter_add    # (plus I clobber registers)
      
      __percpu_counter_add:
              pushq   %rbp    #
              movq    %rsp, %rbp      #,
              subq    $32, %rsp       #,
              movq    %rbx, -24(%rbp) #,
              movq    %r12, -16(%rbp) #,
              movq    %r13, -8(%rbp)  #,
              movq    %rdi, %rbx      # fbc, fbc
      #APP
      # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
              movq %gs:kernel_stack,%rax      #, pfo_ret__
      # 0 "" 2
      #NO_APP
              incl    -8124(%rax)     # <variable>.preempt_count
              movq    32(%rdi), %r12  # <variable>.counters, tcp_ptr__
      #APP
      # 78 "lib/percpu_counter.c" 1
              add %gs:this_cpu_off, %r12      # this_cpu_off, tcp_ptr__
      # 0 "" 2
      #NO_APP
              movslq  (%r12),%r13     #* tcp_ptr__, tmp73
              movslq  %edx,%rax       # batch, batch
              addq    %rsi, %r13      # amount, count
              cmpq    %rax, %r13      # batch, count
              jge     .L27    #,
              negl    %edx    # tmp76
              movslq  %edx,%rdx       # tmp76, tmp77
              cmpq    %rdx, %r13      # tmp77, count
              jg      .L28    #,
      .L27:
              movq    %rbx, %rdi      # fbc,
              call    _raw_spin_lock  #
              addq    %r13, 8(%rbx)   # count, <variable>.count
              movq    %rbx, %rdi      # fbc,
              movl    $0, (%r12)      #,* tcp_ptr__
              call    _raw_spin_unlock        #
      .L29:
      #APP
      # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
              movq %gs:kernel_stack,%rax      #, pfo_ret__
      # 0 "" 2
      #NO_APP
              decl    -8124(%rax)     # <variable>.preempt_count
              movq    -8136(%rax), %rax       #, D.14625
              testb   $8, %al #, D.14625
              jne     .L32    #,
      .L31:
              movq    -24(%rbp), %rbx #,
              movq    -16(%rbp), %r12 #,
              movq    -8(%rbp), %r13  #,
              leave
              ret
              .p2align 4,,10
              .p2align 3
      .L28:
              movl    %r13d, (%r12)   # count,*
              jmp     .L29    #
      .L32:
              call    preempt_schedule        #
              .p2align 4,,6
              jmp     .L31    #
              .size   __percpu_counter_add, .-__percpu_counter_add
              .p2align 4,,15
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      3e880fb5
    • N
      vfs: revert per-cpu nr_unused counters for dentry and inodes · 86c8749e
      Nick Piggin 提交于
      The nr_unused counters count the number of objects on an LRU, and as such they
      are synchronized with LRU object insertion and removal and scanning, and
      protected under the LRU lock.
      
      Making it per-cpu does not actually get any concurrency improvements because of
      this lock, and summing the counter is much slower, and
      incrementing/decrementing it costs more code size and is slower too.
      
      These counters should stay per-LRU, which currently means global.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      86c8749e
    • N
      fs: d_validate fixes · 786a5e15
      Nick Piggin 提交于
      d_validate has been broken for a long time.
      
      kmem_ptr_validate does not guarantee that a pointer can be dereferenced
      if it can go away at any time. Even rcu_read_lock doesn't help, because
      the pointer might be queued in RCU callbacks but not executed yet.
      
      So the parent cannot be checked, nor the name hashed. The dentry pointer
      can not be touched until it can be verified under lock. Hashing simply
      cannot be used.
      
      Instead, verify the parent/child relationship by traversing parent's
      d_child list. It's slow, but only ncpfs and the destaged smbfs care
      about it, at this point.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      786a5e15
  2. 05 1月, 2011 1 次提交
  3. 24 12月, 2010 1 次提交
  4. 23 12月, 2010 2 次提交
  5. 22 12月, 2010 1 次提交
  6. 21 12月, 2010 1 次提交
  7. 18 12月, 2010 2 次提交
  8. 16 12月, 2010 6 次提交
  9. 15 12月, 2010 2 次提交
    • A
      ext4: fix typo which broke '..' detection in ext4_find_entry() · 6d5c3aa8
      Aaro Koskinen 提交于
      There should be a check for the NUL character instead of '0'.
      
      Fortunately the only thing that cares about this is NFS serving, which
      is why we didn't notice this in the merge window testing.
      Reported-by: NPhil Carmody <ext-phil.2.carmody@nokia.com>
      Signed-off-by: NAaro Koskinen <aaro.koskinen@nokia.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      6d5c3aa8
    • T
      ext4: Turn off multiple page-io submission by default · 1449032b
      Theodore Ts'o 提交于
      Jon Nelson has found a test case which causes postgresql to fail with
      the error:
      
      psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581
      
      Under memory pressure, it looks like part of a file can end up getting
      replaced by zero's.  Until we can figure out the cause, we'll roll
      back the change and use block_write_full_page() instead of
      ext4_bio_write_page().  The new, more efficient writing function can
      be used via the mount option mblk_io_submit, so we can test and fix
      the new page I/O code.
      
      To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
      memory such that the system just at the end of triggering writeback
      before running the following sql script:
      
      begin;
      create temporary table foo as select x as a, ARRAY[x] as b FROM
      generate_series(1, 10000000 ) AS x;
      create index foo_a_idx on foo (a);
      create index foo_b_idx on foo USING GIN (b);
      rollback;
      
      If the temporary table is created on a hard drive partition which is
      encrypted using dm_crypt, then under memory pressure, approximately
      30-40% of the time, pgsql will issue the above failure.
      
      This patch should fix this problem, and the problem will come back if
      the file system is mounted with the mblk_io_submit mount option.
      Reported-by: NJon Nelson <jnelson@jamponi.net>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      1449032b
  10. 14 12月, 2010 3 次提交
    • C
      Btrfs: prevent RAID level downgrades when space is low · 83a50de9
      Chris Mason 提交于
      The extent allocator has code that allows us to fill
      allocations from any available block group, even if it doesn't
      match the raid level we've requested.
      
      This was put in because adding a new drive to a filesystem
      made with the default mkfs options actually upgrades the metadata from
      single spindle dup to full RAID1.
      
      But, the code also allows us to allocate from a raid0 chunk when we
      really want a raid1 or raid10 chunk.  This can cause big trouble because
      mkfs creates a small (4MB) raid0 chunk for data and metadata which then
      goes unused for raid1/raid10 installs.
      
      The allocator will happily wander in and allocate from that chunk when
      things get tight, which is not correct.
      
      The fix here is to make sure that we provide duplication when the
      caller has asked for it.  It does all the dups to be any raid level,
      which preserves the dup->raid1 upgrade abilities.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      83a50de9
    • C
      Btrfs: account for missing devices in RAID allocation profiles · cd02dca5
      Chris Mason 提交于
      When we mount in RAID degraded mode without adding a new device to
      replace the failed one, we can end up using the wrong RAID flags for
      allocations.
      
      This results in strange combinations of block groups (raid1 in a raid10
      filesystem) and corruptions when we try to allocate blocks from single
      spindle chunks on drives that are actually missing.
      
      The first device has two small 4MB chunks in it that mkfs creates and
      these are usually unused in a raid1 or raid10 setup.  But, in -o degraded,
      the allocator will fall back to these because the mask of desired raid groups
      isn't correct.
      
      The fix here is to count the missing devices as we build up the list
      of devices in the system.  This count is used when picking the
      raid level to make sure we continue using the same levels that were
      in place before we lost a drive.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      cd02dca5
    • C
      Btrfs: EIO when we fail to read tree roots · 68433b73
      Chris Mason 提交于
      If we just get a plain IO error when we read tree roots, the code
      wasn't properly sending that error up the chain.  This allowed mounts to
      continue when they should failed, and allowed operations
      on partially setup root structs.  The end result was usually oopsen
      on spinlocks that hadn't been spun up correctly.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      68433b73
  11. 11 12月, 2010 2 次提交