1. 07 1月, 2011 7 次提交
    • N
      jfs: dont overwrite dentry name in d_revalidate · 2bc334dc
      Nick Piggin 提交于
      Use vfat's method for dealing with negative dentries to preserve case,
      rather than overwrite dentry name in d_revalidate, which is a bit ugly
      and also gets in the way of doing lock-free path walking.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      2bc334dc
    • N
      cifs: dont overwrite dentry name in d_revalidate · 79eb4dde
      Nick Piggin 提交于
      Use vfat's method for dealing with negative dentries to preserve case,
      rather than overwrite dentry name in d_revalidate, which is a bit ugly
      and also gets in the way of doing lock-free path walking.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      79eb4dde
    • N
      fs: change d_delete semantics · fe15ce44
      Nick Piggin 提交于
      Change d_delete from a dentry deletion notification to a dentry caching
      advise, more like ->drop_inode. Require it to be constant and idempotent,
      and not take d_lock. This is how all existing filesystems use the callback
      anyway.
      
      This makes fine grained dentry locking of dput and dentry lru scanning
      much simpler.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fe15ce44
    • N
      config fs: avoid switching ->d_op on live dentry · fbc8d4c0
      Nick Piggin 提交于
      Switching d_op on a live dentry is racy in general, so avoid it. In this case
      it is a negative dentry, which is safer, but there are still concurrent ops
      which may be called on d_op in that case (eg. d_revalidate). So in general
      a filesystem may not do this. Fix configfs so as not to do this.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fbc8d4c0
    • N
      fs: use fast counters for vfs caches · 3e880fb5
      Nick Piggin 提交于
      percpu_counter library generates quite nasty code, so unless you need
      to dynamically allocate counters or take fast approximate value, a
      simple per cpu set of counters is much better.
      
      The percpu_counter can never be made to work as well, because it has an
      indirection from pointer to percpu memory, and it can't use direct
      this_cpu_inc interfaces because it doesn't use static PER_CPU data, so
      code will always be worse.
      
      In the fastpath, it is the difference between this:
      
              incl %gs:nr_dentry      # nr_dentry
      
      and this:
      
              movl    percpu_counter_batch(%rip), %edx        # percpu_counter_batch,
              movl    $1, %esi        #,
              movq    $nr_dentry, %rdi        #,
              call    __percpu_counter_add    # (plus I clobber registers)
      
      __percpu_counter_add:
              pushq   %rbp    #
              movq    %rsp, %rbp      #,
              subq    $32, %rsp       #,
              movq    %rbx, -24(%rbp) #,
              movq    %r12, -16(%rbp) #,
              movq    %r13, -8(%rbp)  #,
              movq    %rdi, %rbx      # fbc, fbc
      #APP
      # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
              movq %gs:kernel_stack,%rax      #, pfo_ret__
      # 0 "" 2
      #NO_APP
              incl    -8124(%rax)     # <variable>.preempt_count
              movq    32(%rdi), %r12  # <variable>.counters, tcp_ptr__
      #APP
      # 78 "lib/percpu_counter.c" 1
              add %gs:this_cpu_off, %r12      # this_cpu_off, tcp_ptr__
      # 0 "" 2
      #NO_APP
              movslq  (%r12),%r13     #* tcp_ptr__, tmp73
              movslq  %edx,%rax       # batch, batch
              addq    %rsi, %r13      # amount, count
              cmpq    %rax, %r13      # batch, count
              jge     .L27    #,
              negl    %edx    # tmp76
              movslq  %edx,%rdx       # tmp76, tmp77
              cmpq    %rdx, %r13      # tmp77, count
              jg      .L28    #,
      .L27:
              movq    %rbx, %rdi      # fbc,
              call    _raw_spin_lock  #
              addq    %r13, 8(%rbx)   # count, <variable>.count
              movq    %rbx, %rdi      # fbc,
              movl    $0, (%r12)      #,* tcp_ptr__
              call    _raw_spin_unlock        #
      .L29:
      #APP
      # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
              movq %gs:kernel_stack,%rax      #, pfo_ret__
      # 0 "" 2
      #NO_APP
              decl    -8124(%rax)     # <variable>.preempt_count
              movq    -8136(%rax), %rax       #, D.14625
              testb   $8, %al #, D.14625
              jne     .L32    #,
      .L31:
              movq    -24(%rbp), %rbx #,
              movq    -16(%rbp), %r12 #,
              movq    -8(%rbp), %r13  #,
              leave
              ret
              .p2align 4,,10
              .p2align 3
      .L28:
              movl    %r13d, (%r12)   # count,*
              jmp     .L29    #
      .L32:
              call    preempt_schedule        #
              .p2align 4,,6
              jmp     .L31    #
              .size   __percpu_counter_add, .-__percpu_counter_add
              .p2align 4,,15
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      3e880fb5
    • N
      vfs: revert per-cpu nr_unused counters for dentry and inodes · 86c8749e
      Nick Piggin 提交于
      The nr_unused counters count the number of objects on an LRU, and as such they
      are synchronized with LRU object insertion and removal and scanning, and
      protected under the LRU lock.
      
      Making it per-cpu does not actually get any concurrency improvements because of
      this lock, and summing the counter is much slower, and
      incrementing/decrementing it costs more code size and is slower too.
      
      These counters should stay per-LRU, which currently means global.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      86c8749e
    • N
      fs: d_validate fixes · 786a5e15
      Nick Piggin 提交于
      d_validate has been broken for a long time.
      
      kmem_ptr_validate does not guarantee that a pointer can be dereferenced
      if it can go away at any time. Even rcu_read_lock doesn't help, because
      the pointer might be queued in RCU callbacks but not executed yet.
      
      So the parent cannot be checked, nor the name hashed. The dentry pointer
      can not be touched until it can be verified under lock. Hashing simply
      cannot be used.
      
      Instead, verify the parent/child relationship by traversing parent's
      d_child list. It's slow, but only ncpfs and the destaged smbfs care
      about it, at this point.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      786a5e15
  2. 05 1月, 2011 1 次提交
  3. 24 12月, 2010 1 次提交
  4. 23 12月, 2010 2 次提交
  5. 22 12月, 2010 1 次提交
  6. 21 12月, 2010 1 次提交
  7. 18 12月, 2010 2 次提交
  8. 16 12月, 2010 6 次提交
  9. 15 12月, 2010 2 次提交
    • A
      ext4: fix typo which broke '..' detection in ext4_find_entry() · 6d5c3aa8
      Aaro Koskinen 提交于
      There should be a check for the NUL character instead of '0'.
      
      Fortunately the only thing that cares about this is NFS serving, which
      is why we didn't notice this in the merge window testing.
      Reported-by: NPhil Carmody <ext-phil.2.carmody@nokia.com>
      Signed-off-by: NAaro Koskinen <aaro.koskinen@nokia.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      6d5c3aa8
    • T
      ext4: Turn off multiple page-io submission by default · 1449032b
      Theodore Ts'o 提交于
      Jon Nelson has found a test case which causes postgresql to fail with
      the error:
      
      psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581
      
      Under memory pressure, it looks like part of a file can end up getting
      replaced by zero's.  Until we can figure out the cause, we'll roll
      back the change and use block_write_full_page() instead of
      ext4_bio_write_page().  The new, more efficient writing function can
      be used via the mount option mblk_io_submit, so we can test and fix
      the new page I/O code.
      
      To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
      memory such that the system just at the end of triggering writeback
      before running the following sql script:
      
      begin;
      create temporary table foo as select x as a, ARRAY[x] as b FROM
      generate_series(1, 10000000 ) AS x;
      create index foo_a_idx on foo (a);
      create index foo_b_idx on foo USING GIN (b);
      rollback;
      
      If the temporary table is created on a hard drive partition which is
      encrypted using dm_crypt, then under memory pressure, approximately
      30-40% of the time, pgsql will issue the above failure.
      
      This patch should fix this problem, and the problem will come back if
      the file system is mounted with the mblk_io_submit mount option.
      Reported-by: NJon Nelson <jnelson@jamponi.net>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      1449032b
  10. 14 12月, 2010 3 次提交
    • C
      Btrfs: prevent RAID level downgrades when space is low · 83a50de9
      Chris Mason 提交于
      The extent allocator has code that allows us to fill
      allocations from any available block group, even if it doesn't
      match the raid level we've requested.
      
      This was put in because adding a new drive to a filesystem
      made with the default mkfs options actually upgrades the metadata from
      single spindle dup to full RAID1.
      
      But, the code also allows us to allocate from a raid0 chunk when we
      really want a raid1 or raid10 chunk.  This can cause big trouble because
      mkfs creates a small (4MB) raid0 chunk for data and metadata which then
      goes unused for raid1/raid10 installs.
      
      The allocator will happily wander in and allocate from that chunk when
      things get tight, which is not correct.
      
      The fix here is to make sure that we provide duplication when the
      caller has asked for it.  It does all the dups to be any raid level,
      which preserves the dup->raid1 upgrade abilities.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      83a50de9
    • C
      Btrfs: account for missing devices in RAID allocation profiles · cd02dca5
      Chris Mason 提交于
      When we mount in RAID degraded mode without adding a new device to
      replace the failed one, we can end up using the wrong RAID flags for
      allocations.
      
      This results in strange combinations of block groups (raid1 in a raid10
      filesystem) and corruptions when we try to allocate blocks from single
      spindle chunks on drives that are actually missing.
      
      The first device has two small 4MB chunks in it that mkfs creates and
      these are usually unused in a raid1 or raid10 setup.  But, in -o degraded,
      the allocator will fall back to these because the mask of desired raid groups
      isn't correct.
      
      The fix here is to count the missing devices as we build up the list
      of devices in the system.  This count is used when picking the
      raid level to make sure we continue using the same levels that were
      in place before we lost a drive.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      cd02dca5
    • C
      Btrfs: EIO when we fail to read tree roots · 68433b73
      Chris Mason 提交于
      If we just get a plain IO error when we read tree roots, the code
      wasn't properly sending that error up the chain.  This allowed mounts to
      continue when they should failed, and allowed operations
      on partially setup root structs.  The end result was usually oopsen
      on spinlocks that hadn't been spun up correctly.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      68433b73
  11. 11 12月, 2010 8 次提交
  12. 10 12月, 2010 6 次提交
    • T
      Ocfs2: Teach 'coherency=full' O_DIRECT writes to correctly up_read i_alloc_sem. · 39c99f12
      Tristan Ye 提交于
      Due to newly-introduced 'coherency=full' O_DIRECT writes also takes the EX
      rw_lock like buffered writes did(rw_level == 1), it turns out messing the
      usage of 'level' in ocfs2_dio_end_io() up, which caused i_alloc_sem being
      failed to get up_read'd correctly.
      
      This patch tries to teach ocfs2_dio_end_io to understand well on all locking
      stuffs by explicitly introducing a new bit for i_alloc_sem in iocb's private
      data, just like what we did for rw_lock.
      Signed-off-by: NTristan Ye <tristan.ye@oracle.com>
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      39c99f12
    • S
      ocfs2/dlm: Migrate lockres with no locks if it has a reference · 388c4bcb
      Sunil Mushran 提交于
      o2dlm was not migrating resources with zero locks because it assumed that that
      resource would get purged by dlm_thread. However, some usage patterns involve
      creating and dropping locks at a high rate leading to the migrate thread seeing
      zero locks but the purge thread seeing an active reference. When this happens,
      the dlm_thread cannot purge the resource and the migrate thread sees no reason
      to migrate that resource. The spell is broken when the migrate thread catches
      the resource with a lock.
      
      The fix is to make the migrate thread also consider the reference map.
      
      This usage pattern can be triggered by userspace on userdlm locks and flocks.
      Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      388c4bcb
    • C
      xfs: log timestamp changes to the source inode in rename · 05340d4a
      Christoph Hellwig 提交于
      Now that we don't mark VFS inodes dirty anymore for internal
      timestamp changes, but rely on the transaction subsystem to push
      them out, we need to explicitly log the source inode in rename after
      updating it's timestamps to make sure the changes actually get
      forced out by sync/fsync or an AIL push.
      
      We already account for the fourth inode in the log reservation, as a
      rename of directories needs to update the nlink field, so just
      adding the xfs_trans_log_inode call is enough.
      
      This fixes the xfsqa 065 regression introduced by:
      
      	"xfs: don't use vfs writeback for pure metadata modifications"
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      05340d4a
    • J
      Btrfs: fixup return code for btrfs_del_orphan_item · 7e1fea73
      Josef Bacik 提交于
      If the orphan item doesn't exist, we return 1, which doesn't make any sense to
      the callers.  Instead return -ENOENT if we didn't find the item.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      7e1fea73
    • J
      Btrfs: do not do fast caching if we are allocating blocks for tree_root · b8399dee
      Josef Bacik 提交于
      Since the fast caching uses normal tree locking, we can possibly deadlock if we
      get to the caching via a btrfs_search_slot() on the tree_root.  So just check to
      see if the root we are on is the tree root, and just don't do the fast caching.
      Reported-by: NSage Weil <sage@newdream.net>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      b8399dee
    • J
      Btrfs: deal with space cache errors better · 2b20982e
      Josef Bacik 提交于
      Currently if the space cache inode generation number doesn't match the
      generation number in the space cache header we will just fail to load the space
      cache, but we won't mark the space cache as an error, so we'll keep getting that
      error each time somebody tries to cache that block group until we actually clear
      the thing.  Fix this by marking the space cache as having an error so we only
      get the message once.  This patch also makes it so that we don't try and setup
      space cache for a block group that isn't cached, since we won't be able to write
      it out anyway.  None of these problems are actual problems, they are just
      annoying and sub-optimal.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      2b20982e