1. 03 May, 2012 5 commits
  2. 26 Apr, 2012 1 commit
    • userns: Rework the user_namespace adding uid/gid mapping support · 22d917d8
      Eric W. Biederman committed
      - Convert the old uid mapping functions into compatibility wrappers
      - Add a uid/gid mapping layer from user space uid and gids to kernel
        internal uids and gids that is extent-based for simplicity and speed.
        * Working with number space after mapping uids/gids into their kernel
          internal version adds only mapping complexity over what we have today,
          leaving the kernel code easy to understand and test.
      - Add proc files /proc/self/uid_map /proc/self/gid_map
        These files display the mapping and allow a mapping to be added
        if a mapping does not exist.
      - Allow entering the user namespace without a uid or gid mapping.
        Since we are starting with an existing user, our uids and gids
        still have global mappings, so they are still valid and useful; they
        just don't have local mappings.  What things require in order to work
        is global uids and gids, so it is odd but perfectly fine not to have
        a local uid and gid mapping.
        Not requiring local uid and gid mappings greatly simplifies
        the logic of setting up the uid and gid mappings by allowing
        the mappings to be set after the namespace is created, which makes the
        slight weirdness worth it.
      - Make the mappings in the initial user namespace to the global
        uid/gid space explicit.  Today it is an identity mapping
        but in the future we may want to twist this for debugging, similar
        to what we do with jiffies.
      - Document the memory ordering requirements of setting the uid and
        gid mappings.  We only allow the mappings to be set once
        and there are no pointers involved so the requirements are
        trivial but a little atypical.
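
The extent-based mapping described in the list above can be modelled in user space. The following is a minimal Python sketch, not the kernel implementation: the `Extent` fields, the function names, and the three-integer line format (first id inside the namespace, first id in the lower space, count) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Extent:
    first: int        # first id as seen inside the namespace (assumed name)
    lower_first: int  # first id in the lower, kernel-internal space (assumed name)
    count: int        # number of consecutive ids covered by this extent


def parse_map(text: str) -> List[Extent]:
    """Parse lines of three integers: first, lower_first, count (assumed format)."""
    extents = []
    for line in text.splitlines():
        if line.strip():
            first, lower_first, count = (int(field) for field in line.split())
            extents.append(Extent(first, lower_first, count))
    return extents


def map_id_down(extents: List[Extent], ns_id: int) -> Optional[int]:
    """Translate a namespace-local id to the lower id; None if unmapped."""
    for e in extents:
        if e.first <= ns_id < e.first + e.count:
            return e.lower_first + (ns_id - e.first)
    return None


def map_id_up(extents: List[Extent], lower_id: int) -> Optional[int]:
    """Translate a lower id back to the namespace-local id; None if unmapped."""
    for e in extents:
        if e.lower_first <= lower_id < e.lower_first + e.count:
            return e.first + (lower_id - e.lower_first)
    return None


if __name__ == "__main__":
    mapping = parse_map("0 100000 65536\n")  # hypothetical mapping
    print(map_id_down(mapping, 1000))   # id 1000 inside the namespace
    print(map_id_up(mapping, 100000))   # lower id 100000 seen from inside
    print(map_id_down(mapping, 70000))  # outside the extent, so unmapped
```

In this model an explicit identity mapping is a single extent whose `first` and `lower_first` are both 0. A lookup is a linear scan over a handful of extents with no pointers to chase, which is the kind of simplicity the commit message argues for.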
      
      Performance:
      
      In this scheme, the performance of the permission checks is expected to
      stay the same, as the actual machine instructions should remain the same.
      
      The worst case I could think of is ls -l on a large directory where
      all of the stat results need to be translated from kuids and
      kgids to uids and gids.  So I benchmarked that case on my laptop
      with a dual-core hyperthreaded Intel i5-2520M cpu with 3M of cpu cache.
      
      My benchmark consisted of going to single user mode where nothing else
      was running, then, on an ext4 filesystem, opening 1,000,000 files and
      looping through all of the files 1000 times, calling fstat on the
      individual files.  This was to ensure I was benchmarking stat times
      where the inodes were in the kernel's cache, but the inode values were
      not in the processor's cache.  My results:
      
      v3.4-rc1:         ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
      v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
      v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
      
      All of the configurations ran in roughly 120ns when I performed tests
      that ran in the cpu cache.
      
      So in summary the performance impact is:
      1ns improvement in the worst case with user namespace support compiled out.
      8ns aka 5% slowdown in the worst case with user namespace support compiled in.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      22d917d8
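
The performance summary in the commit above follows directly from the three reported fstat timings. A short Python sketch of the arithmetic, using only the figures quoted in the commit message (the function and variable names are illustrative):

```python
# Per-call fstat timings quoted in the commit message, in nanoseconds.
BASELINE_NS = 156          # unmodified v3.4-rc1, user namespaces disabled
PATCHED_DISABLED_NS = 155  # patched, user namespaces disabled
PATCHED_ENABLED_NS = 164   # patched, user namespaces enabled


def delta(baseline: int, measured: int):
    """Return (absolute difference in ns, relative difference in percent)."""
    diff = measured - baseline
    return diff, 100.0 * diff / baseline


if __name__ == "__main__":
    for label, value in (("userns-", PATCHED_DISABLED_NS),
                         ("userns+", PATCHED_ENABLED_NS)):
        diff, pct = delta(BASELINE_NS, value)
        print(f"{label}: {diff:+d} ns ({pct:+.1f}%)")
```

The enabled case works out to 8 ns over a 156 ns baseline, roughly 5.1%, which matches the "8ns aka 5%" figure in the summary.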
  3. 08 Apr, 2012 3 commits
  4. 03 Apr, 2012 1 commit
  5. 01 Apr, 2012 20 commits
  6. 30 Mar, 2012 3 commits
    • Revert "ext4: don't release page refs in ext4_end_bio()" · 6268b325
      Linus Torvalds committed
      This reverts commit b43d17f3.
      
      Dave Jones reports that it causes lockups on his laptop, and his debug
      output showed a lot of processes hung waiting for page_writeback (or
      more commonly - processes hung waiting for a lock that was held during
      that writeback wait).
      
      The page_writeback hint made Ted suggest that Dave look at this commit,
      and Dave verified that reverting it makes his problems go away.
      
      Ted says:
       "That commit fixes a race which is seen when you write into fallocated
        (and hence uninitialized) disk blocks under *very* heavy memory
        pressure.  Furthermore, although theoretically it could trigger under
        normal direct I/O writes, it only seems to trigger if you are issuing
        a huge number of AIO writes, such that a just-written page can get
        evicted from memory, and then read back into memory, before the
        workqueue has a chance to update the extent tree.
      
        This race has been around for a little over a year, and no one noticed
        until two months ago; it only happens under fairly exotic conditions,
        and in fact even after trying very hard to create a simple repro under
        lab conditions, we could only reproduce the problem and confirm the
        fix on production servers running MySQL on very fast PCIe-attached
        flash devices.
      
        Given that Dave was able to hit this problem pretty quickly, if we
        confirm that this commit is at fault, the only reasonable thing to do
        is to revert it IMO."
      Reported-and-tested-by: Dave Jones <davej@redhat.com>
      Acked-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6268b325
    • pagemap: remove remaining unneeded spin_lock() · 10bdfb5e
      Naoya Horiguchi committed
      Commit 025c5b24 ("thp: optimize away unnecessary page table
      locking") moves spin_lock() into pmd_trans_huge_lock() in order to avoid
      locking unless pmd is for thp.  So this spin_lock() is a bug.
      Reported-by: Sasha Levin <levinsasha928@gmail.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10bdfb5e
    • Btrfs: update the checks for mixed block groups with big metadata blocks · bc3f116f
      Chris Mason committed
      Dave Sterba had put in patches to look for mixed data/metadata groups
      with metadata bigger than 4KB.  But these ended up in the wrong place
      and weren't testing the feature flag correctly.

      This updates the tests to make sure our sizes are matching.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      bc3f116f
  7. 29 Mar, 2012 7 commits
    • Btrfs: update to the right index of defragment · e1f041e1
      Liu Bo committed
      When we use autodefrag, we forget to update the index which indicates
      the last page we've dirtied, so we set dirty flags on the same set of
      pages again and again.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      e1f041e1
    • Btrfs: do not bother to defrag an extent if it is a big real extent · 66c26892
      Liu Bo committed
      $ mkfs.btrfs /dev/sdb7
      $ mount /dev/sdb7 /mnt/btrfs/ -oautodefrag
      $ dd if=/dev/zero of=/mnt/btrfs/foobar bs=4k count=10 oflag=direct 2>/dev/null
      $ filefrag -v /mnt/btrfs/foobar
      Filesystem type is: 9123683e
      File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096)
       ext logical physical expected length flags
         0       0     3072              10 eof
      /mnt/btrfs/foobar: 1 extent found
      
      Now we have a big real extent [0, 40960), but autodefrag will still defrag it.
      
      $ sync
      $ filefrag -v /mnt/btrfs/foobar
      Filesystem type is: 9123683e
      File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096)
       ext logical physical expected length flags
         0       0     3082              10 eof
      /mnt/btrfs/foobar: 1 extent found
      
      So if we find that the range is already a big real extent, there is nothing to gain; just skip it.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      66c26892
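
The rule in the commit above can be pictured as a simple predicate. This is a minimal Python sketch under stated assumptions, not the Btrfs code: the `Extent` type, the `is_real` flag, and the "already covers the whole range" criterion are hypothetical stand-ins for whatever test the patch actually applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    start: int     # byte offset of the extent in the file
    length: int    # extent length in bytes
    is_real: bool  # True for a regular on-disk extent (hypothetical flag)


def needs_defrag(extent: Extent, range_start: int, range_len: int) -> bool:
    """Return False when one real extent already covers the whole range."""
    range_end = range_start + range_len
    covers = (extent.start <= range_start
              and extent.start + extent.length >= range_end)
    return not (extent.is_real and covers)


if __name__ == "__main__":
    # The 10-block file from the commit message: one extent [0, 40960).
    whole_file = Extent(start=0, length=40960, is_real=True)
    print(needs_defrag(whole_file, 0, 40960))
```

For the file in the commit message the predicate reports that no defrag is needed, so the extent would not be rewritten to a new physical location on the next sync.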
    • Btrfs: add a check to decide if we should defrag the range · 17ce6ef8
      Liu Bo committed
      If our file's layout is as follows:
      | hole | data1 | hole | data2 |
      
      we do not need to defrag this file, because this file has holes and
      cannot be merged into one extent.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      17ce6ef8
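
The layout argument in the commit above can be checked with a toy model. The Python sketch below treats a file as an ordered list of segments and asks whether any two neighbouring segments are both data, which is the only case where merging could help. The `Segment` type and the function name are hypothetical, and the real check in Btrfs may use different criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    kind: str    # "hole" or "data" (hypothetical encoding)
    length: int  # segment length in bytes


def should_defrag(layout: List[Segment]) -> bool:
    """Return True only if two adjacent segments are both data."""
    for left, right in zip(layout, layout[1:]):
        if left.kind == "data" and right.kind == "data":
            return True
    return False


if __name__ == "__main__":
    # | hole | data1 | hole | data2 | from the commit message.
    sparse = [Segment("hole", 4096), Segment("data", 4096),
              Segment("hole", 4096), Segment("data", 4096)]
    print(should_defrag(sparse))
```

With the layout from the commit message every data segment is separated from the next by a hole, so the model reports that defragmenting would not reduce the extent count.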
    • Btrfs: fix recursive defragment with autodefrag option · 4cb13e5d
      Liu Bo committed
      $ mkfs.btrfs disk
      $ mount disk /mnt -o autodefrag
      $ dd if=/dev/zero of=/mnt/foobar bs=4k count=10 2>/dev/null && sync
      $ for i in `seq 9 -2 0`; do dd if=/dev/zero of=/mnt/foobar bs=4k count=1 \
        seek=$i conv=notrunc 2> /dev/null; done && sync
      
      then "foobar" gets defragged again and again.
      The same happens with the option "-o autodefrag,compress".

      Reasons:
      When the cleaner kthread fetches inodes from the defrag tree and defrags
      them, it dirties pages and submits them.  This leads to another data COW,
      in which the inode being processed is inserted into the defrag tree again.

      This patch sets a rule for the COW code: insert an inode only when we're
      really going to do some defragmenting.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      4cb13e5d
    • Btrfs: fix the mismatch of page->mapping · 1f12bd06
      Liu Bo committed
      Commit 600a45e1
      (Btrfs: fix deadlock on page lock when doing auto-defragment)
      fixes the deadlock on the page lock, but it also introduces another bug.

      A page may have been truncated between the unlock and the re-lock,
      so we need to look it up again to get the right one.

      And since we hold the i_mutex lock, the inode size remains unchanged and
      we can drop the isize overflow checks.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1f12bd06
    • Btrfs: fix race between direct io and autodefrag · ecb8bea8
      Liu Bo committed
      The bug is from running xfstests 209 with autodefrag.
      
      The race is as follows:
             t1                       t2(autodefrag)
         direct IO
           invalidate pagecache
           dio(old data)             add_inode_defrag
           invalidate pagecache
         endio
      
         direct IO
           invalidate pagecache
                                      run_defrag
                                        readpage(old data)
                                        set page dirty (old data)
           dio(new data, rewrite)
           invalidate pagecache (*)
           endio
      
      t2(autodefrag) will get old data into the pagecache via readpage and set
      the page dirty.  Meanwhile, invalidate pagecache(*) will fail due to the
      dirty flags on the pages.  So the old data may be flushed to disk by the
      flush thread, which will lead to data loss.

      The same applies to user-space defragment programs.

      The patch fixes this race by holding i_mutex when we readpage and set page dirty.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      ecb8bea8
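
The interleaving in the commit above can be replayed deterministically with a toy model of one file block. The Python sketch below is an illustration, not Btrfs code: `FileModel` and its methods are hypothetical, and the serialized variant assumes that the direct I/O write path also holds i_mutex and writes back dirty pages before invalidating them, neither of which is stated in the commit message.

```python
class FileModel:
    """One block of a file: on-disk content plus an optional cached page."""

    def __init__(self, on_disk):
        self.on_disk = on_disk
        self.page = None    # cached copy of the block, if any
        self.dirty = False  # whether the cached copy is marked dirty

    def flush(self):
        """Write a dirty page back to disk (the flush thread's job)."""
        if self.dirty:
            self.on_disk = self.page
            self.dirty = False

    def invalidate(self):
        """Drop the cached page; fails (returns False) if it is dirty."""
        if self.dirty:
            return False
        self.page = None
        return True

    def defrag_read_and_dirty(self):
        """Autodefrag: read the block into the cache and mark it dirty."""
        self.page = self.on_disk
        self.dirty = True


def racy_interleaving(f, new_data):
    """Replay the second half of the timeline from the commit message."""
    f.invalidate()                # t1: invalidate pagecache
    f.defrag_read_and_dirty()     # t2: readpage(old data), set page dirty
    f.on_disk = new_data          # t1: dio(new data, rewrite)
    invalidated = f.invalidate()  # t1: invalidate pagecache (*) fails
    f.flush()                     # flush thread writes the stale page
    return invalidated


def serialized(f, new_data):
    """Same work with t2 and t1 each running whole under one mutex."""
    f.defrag_read_and_dirty()     # t2 completes before t1 starts
    f.flush()                     # t1: assumed write-back before invalidate
    f.invalidate()
    f.on_disk = new_data          # t1: dio(new data)
    invalidated = f.invalidate()
    f.flush()
    return invalidated


if __name__ == "__main__":
    racy, safe = FileModel("old"), FileModel("old")
    racy_interleaving(racy, "new")
    serialized(safe, "new")
    print("racy on-disk content:", racy.on_disk)
    print("serialized on-disk content:", safe.on_disk)
```

In the racy replay the stale page survives the second invalidation and overwrites the new data on flush, which is the data loss the commit describes. In the serialized replay the new data remains on disk.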
    • Btrfs: fix deadlock during allocating chunks · 15d1ff81
      Liu Bo committed
      This deadlock comes from xfstests 251.

      We hold the chunk_mutex throughout the whole of a chunk allocation.
      If we find that we've used up the system chunk space, we need to allocate a
      new system chunk, but doing so recurses into chunk allocation and ends
      up deadlocking on chunk_mutex.
      So instead we need to allocate the system chunk first if we find we're in ENOSPC.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      15d1ff81
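
The recursion described in the commit above is the classic self-deadlock on a non-reentrant lock. The Python sketch below models it with `threading.Lock` and a non-blocking acquire so the problem is reported instead of hanging. The `ChunkAllocator` class, its method names, and the rule that each data chunk consumes one unit of system space are hypothetical simplifications, not the Btrfs allocator.

```python
import threading


class ChunkAllocator:
    def __init__(self, system_space):
        self.chunk_mutex = threading.Lock()  # non-reentrant, like a kernel mutex
        self.system_space = system_space     # units of free system chunk space
        self.log = []                        # order in which chunks were added

    def _add_system_chunk(self):
        # Caller must already hold chunk_mutex.
        self.system_space += 1
        self.log.append("system")

    def _add_data_chunk(self):
        # Caller must already hold chunk_mutex. Recording the new chunk
        # is modelled as consuming one unit of system space.
        self.system_space -= 1
        self.log.append("data")

    def alloc_system_chunk_recursive(self):
        # Re-enters the allocator and tries to take chunk_mutex again.
        if not self.chunk_mutex.acquire(blocking=False):
            raise RuntimeError("would deadlock: chunk_mutex is already held")
        try:
            self._add_system_chunk()
        finally:
            self.chunk_mutex.release()

    def alloc_data_chunk_buggy(self):
        with self.chunk_mutex:
            if self.system_space == 0:
                self.alloc_system_chunk_recursive()  # recursion under the lock
            self._add_data_chunk()

    def alloc_data_chunk_fixed(self):
        with self.chunk_mutex:
            if self.system_space == 0:
                self._add_system_chunk()  # allocate system chunk first, no re-entry
            self._add_data_chunk()


if __name__ == "__main__":
    allocator = ChunkAllocator(system_space=0)
    try:
        allocator.alloc_data_chunk_buggy()
    except RuntimeError as exc:
        print("buggy path:", exc)
    allocator.alloc_data_chunk_fixed()
    print("fixed path order:", allocator.log)
```

The fixed path makes the system chunk allocation an explicit first step inside the same critical section, so the lock is taken exactly once per allocation.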