1. 31 3月, 2014 6 次提交
    • J
      locks: add new fcntl cmd values for handling file private locks · 5d50ffd7
      Jeff Layton 提交于
      Due to some unfortunate history, POSIX locks have very strange and
      unhelpful semantics. The thing that usually catches people by surprise
      is that they are dropped whenever the process closes any file descriptor
      associated with the inode.
      
      This is extremely problematic for people developing file servers that
      need to implement byte-range locks. Developers often need a "lock
      management" facility to ensure that file descriptors are not closed
      until all of the locks associated with the inode are finished.
      
      Additionally, "classic" POSIX locks are owned by the process. Locks
      taken between threads within the same process won't conflict with one
      another, which renders them useless for synchronization between threads.
      
      This patchset adds a new type of lock that attempts to address these
      issues. These locks conflict with classic POSIX read/write locks, but
      have semantics that are more like BSD locks with respect to inheritance
      and behavior on close.
      
      This is implemented primarily by changing how fl_owner field is set for
      these locks. Instead of having them owned by the files_struct of the
      process, they are instead owned by the filp on which they were acquired.
      Thus, they are inherited across fork() and are only released when the
      last reference to a filp is put.
      
      These new semantics prevent them from being merged with classic POSIX
      locks, even if they are acquired by the same process. These locks will
      also conflict with classic POSIX locks even if they are acquired by
      the same process or on the same file descriptor.
      
      The new locks are managed using a new set of cmd values to the fcntl()
      syscall. The initial implementation of this converts these values to
      "classic" cmd values at a fairly high level, and the details are not
      exposed to the underlying filesystem. We may eventually want to push
      this handing out to the lower filesystem code but for now I don't
      see any need for it.
      
      Also, note that with this implementation the new cmd values are only
      available via fcntl64() on 32-bit arches. There's little need to
      add support for legacy apps on a new interface like this.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      5d50ffd7
    • J
      locks: pass the cmd value to fcntl_getlk/getlk64 · c1e62b8f
      Jeff Layton 提交于
      Once we introduce file private locks, we'll need to know what cmd value
      was used, as that affects the ownership and whether a conflict would
      arise.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      c1e62b8f
    • J
      locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT" · c918d42a
      Jeff Layton 提交于
      In a later patch, we'll be adding a new type of lock that's owned by
      the struct file instead of the files_struct. Those sorts of locks
      will be flagged with a new FL_FILE_PVT flag.
      
      Report these types of locks as "FLPVT" in /proc/locks to distinguish
      them from "classic" POSIX locks.
      Acked-by: NJ. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      c918d42a
    • J
      locks: rename locks_remove_flock to locks_remove_file · 78ed8a13
      Jeff Layton 提交于
      This function currently removes leases in addition to flock locks and in
      a later patch we'll have it deal with file-private locks too. Rename it
      to locks_remove_file to indicate that it removes locks that are
      associated with a particular struct file, and not just flock locks.
      Acked-by: NJ. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      78ed8a13
    • J
      locks: fix posix lock range overflow handling · ef12e72a
      J. Bruce Fields 提交于
      In the 32-bit case fcntl assigns the 64-bit f_pos and i_size to a 32-bit
      off_t.
      
      The existing range checks also seem to depend on signed arithmetic
      wrapping when it overflows.  In practice maybe that works, but we can be
      more careful.  That also allows us to make a more reliable distinction
      between -EINVAL and -EOVERFLOW.
      
      Note that in the 32-bit case SEEK_CUR or SEEK_END might allow the caller
      to set a lock with starting point no longer representable as a 32-bit
      value.  We could return -EOVERFLOW in such cases, but the locks code is
      capable of handling such ranges, so we choose to be lenient here.  The
      only problem is that subsequent GETLK calls on such a lock will fail
      with EOVERFLOW.
      
      While we're here, do some cleanup including consolidating code for the
      flock and flock64 cases.
      Signed-off-by: NJ. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      ef12e72a
    • J
      locks: close potential race between setlease and open · 24cbe784
      Jeff Layton 提交于
      As Al Viro points out, there is an unlikely, but possible race between
      opening a file and setting a lease on it. generic_add_lease is done with
      the i_lock held, but the inode->i_flock check in break_lease is
      lockless. It's possible for another task doing an open to do the entire
      pathwalk and call break_lease between the point where generic_add_lease
      checks for a conflicting open and adds the lease to the list. If this
      occurs, we can end up with a lease set on the file with a conflicting
      open.
      
      To guard against that, check again for a conflicting open after adding
      the lease to the i_flock list. If the above race occurs, then we can
      simply unwind the lease setting and return -EAGAIN.
      
      Because we take dentry references and acquire write access on the file
      before calling break_lease, we know that if the i_flock list is empty
      when the open caller goes to check it then the necessary refcounts have
      already been incremented. Thus the additional check for a conflicting
      open will see that there is one and the setlease call will fail.
      
      Cc: Bruce Fields <bfields@fieldses.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Reported-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@fieldses.org>
      24cbe784
  2. 01 2月, 2014 1 次提交
  3. 31 1月, 2014 6 次提交
    • Z
      xen/grant-table: Avoid m2p_override during mapping · 08ece5bb
      Zoltan Kiss 提交于
      The grant mapping API does m2p_override unnecessarily: only gntdev needs it,
      for blkback and future netback patches it just cause a lock contention, as
      those pages never go to userspace. Therefore this series does the following:
      - the original functions were renamed to __gnttab_[un]map_refs, with a new
        parameter m2p_override
      - based on m2p_override either they follow the original behaviour, or just set
        the private flag and call set_phys_to_machine
      - gnttab_[un]map_refs are now a wrapper to call __gnttab_[un]map_refs with
        m2p_override false
      - a new function gnttab_[un]map_refs_userspace provides the old behaviour
      
      It also removes a stray space from page.h and change ret to 0 if
      XENFEAT_auto_translated_physmap, as that is the only possible return value
      there.
      
      v2:
      - move the storing of the old mfn in page->index to gnttab_map_refs
      - move the function header update to a separate patch
      
      v3:
      - a new approach to retain old behaviour where it needed
      - squash the patches into one
      
      v4:
      - move out the common bits from m2p* functions, and pass pfn/mfn as parameter
      - clear page->private before doing anything with the page, so m2p_find_override
        won't race with this
      
      v5:
      - change return value handling in __gnttab_[un]map_refs
      - remove a stray space in page.h
      - add detail why ret = 0 now at some places
      
      v6:
      - don't pass pfn to m2p* functions, just get it locally
      Signed-off-by: NZoltan Kiss <zoltan.kiss@citrix.com>
      Suggested-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NStefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      08ece5bb
    • D
      mm: sl[uo]b: fix misleading comments · 433a91ff
      Dave Hansen 提交于
      On x86, SLUB creates and handles <=8192-byte allocations internally.
      It passes larger ones up to the allocator.  Saying "up to order 2" is,
      at best, ambiguous.  Is that order-1?  Or (order-2 bytes)?  Make
      it more clear.
      
      SLOB commits a similar sin.  It *handles* page-size requests, but the
      comment says that it passes up "all page size and larger requests".
      
      SLOB also swaps around the order of the very-similarly-named
      KMALLOC_SHIFT_HIGH and KMALLOC_SHIFT_MAX #defines.  Make it
      consistent with the order of the other two allocators.
      
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: NChristoph Lameter <cl@linux-foundation.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      433a91ff
    • M
      zsmalloc: add copyright · 31fc00bb
      Minchan Kim 提交于
      Add my copyright to the zsmalloc source code which I maintain.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31fc00bb
    • M
      zsmalloc: move it under mm · bcf1647d
      Minchan Kim 提交于
      This patch moves zsmalloc under mm directory.
      
      Before that, description will explain why we have needed custom
      allocator.
      
      Zsmalloc is a new slab-based memory allocator for storing compressed
      pages.  It is designed for low fragmentation and high allocation success
      rate on large object, but <= PAGE_SIZE allocations.
      
      zsmalloc differs from the kernel slab allocator in two primary ways to
      achieve these design goals.
      
      zsmalloc never requires high order page allocations to back slabs, or
      "size classes" in zsmalloc terms.  Instead it allows multiple
      single-order pages to be stitched together into a "zspage" which backs
      the slab.  This allows for higher allocation success rate under memory
      pressure.
      
      Also, zsmalloc allows objects to span page boundaries within the zspage.
      This allows for lower fragmentation than could be had with the kernel
      slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.  With the
      kernel slab allocator, if a page compresses to 60% of it original size,
      the memory savings gained through compression is lost in fragmentation
      because another object of the same size can't be stored in the leftover
      space.
      
      This ability to span pages results in zsmalloc allocations not being
      directly addressable by the user.  The user is given an
      non-dereferencable handle in response to an allocation request.  That
      handle must be mapped, using zs_map_object(), which returns a pointer to
      the mapped region that can be used.  The mapping is necessary since the
      object data may reside in two different noncontigious pages.
      
      The zsmalloc fulfills the allocation needs for zram perfectly
      
      [sjenning@linux.vnet.ibm.com: borrow Seth's quote]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NNitin Gupta <ngupta@vflare.org>
      Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bcf1647d
    • C
      kernel: use lockless list for smp_call_function_single · 6897fc22
      Christoph Hellwig 提交于
      Make smp_call_function_single and friends more efficient by using a
      lockless list.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6897fc22
    • Y
      memblock, bootmem: restore goal for alloc_low · 07bacb38
      Yinghai Lu 提交于
      Now we have memblock_virt_alloc_low to replace original bootmem api in
      swiotlb.
      
      But we should not use BOOTMEM_LOW_LIMIT for arch that does not support
      CONFIG_NOBOOTMEM, as old api take 0.
      
      | #define alloc_bootmem_low(x) \
      |        __alloc_bootmem_low(x, SMP_CACHE_BYTES, 0)
      |#define alloc_bootmem_low_pages_nopanic(x) \
      |        __alloc_bootmem_low_nopanic(x, PAGE_SIZE, 0)
      
      and we have
       #define BOOTMEM_LOW_LIMIT __pa(MAX_DMA_ADDRESS)
      for CONFIG_NOBOOTMEM.
      
      Restore goal to 0 to fix ia64 crash, that Tony found.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Reported-by: NTony Luck <tony.luck@gmail.com>
      Tested-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07bacb38
  4. 30 1月, 2014 6 次提交
  5. 29 1月, 2014 7 次提交
    • J
      fsnotify: Do not return merged event from fsnotify_add_notify_event() · 83c0e1b4
      Jan Kara 提交于
      The event returned from fsnotify_add_notify_event() cannot ever be used
      safely as the event may be freed by the time the function returns (after
      dropping notification_mutex). So change the prototype to just return
      whether the event was added or merged into some existing event.
      Reported-and-tested-by: NJiri Kosina <jkosina@suse.cz>
      Reported-and-tested-by: NDave Jones <davej@fedoraproject.org>
      Signed-off-by: NJan Kara <jack@suse.cz>
      83c0e1b4
    • F
      Btrfs: add support for inode properties · 63541927
      Filipe David Borba Manana 提交于
      This change adds infrastructure to allow for generic properties for
      inodes. Properties are name/value pairs that can be associated with
      inodes for different purposes. They are stored as xattrs with the
      prefix "btrfs."
      
      Properties can be inherited - this means when a directory inode has
      inheritable properties set, these are added to new inodes created
      under that directory. Further, subvolumes can also have properties
      associated with them, and they can be inherited from their parent
      subvolume. Naturally, directory properties have priority over subvolume
      properties (in practice a subvolume property is just a regular
      property associated with the root inode, objectid 256, of the
      subvolume's fs tree).
      
      This change also adds one specific property implementation, named
      "compression", whose values can be "lzo" or "zlib" and it's an
      inheritable property.
      
      The corresponding changes to btrfs-progs were also implemented.
      A patch with xfstests for this feature will follow once there's
      agreement on this change/feature.
      
      Further, the script at the bottom of this commit message was used to
      do some benchmarks to measure any performance penalties of this feature.
      
      Basically the tests correspond to:
      
      Test 1 - create a filesystem and mount it with compress-force=lzo,
      then sequentially create N files of 64Kb each, measure how long it took
      to create the files, unmount the filesystem, mount the filesystem and
      perform an 'ls -lha' against the test directory holding the N files, and
      report the time the command took.
      
      Test 2 - create a filesystem and don't use any compression option when
      mounting it - instead set the compression property of the subvolume's
      root to 'lzo'. Then create N files of 64Kb, and report the time it took.
      The unmount the filesystem, mount it again and perform an 'ls -lha' like
      in the former test. This means every single file ends up with a property
      (xattr) associated to it.
      
      Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
      compression property, have no real effect other than adding more work
      when inheriting properties and taking more btree leaf space.
      
      Test 4 - same as test 3 but with 10 properties per file.
      
      Results (in seconds, and averages of 5 runs each), for different N
      numbers of files follow.
      
      * Without properties (test 1)
      
                          file creation time        ls -lha time
      10 000 files              3.49                   0.76
      100 000 files            47.19                   8.37
      1 000 000 files         518.51                 107.06
      
      * With 1 property (compression property set to lzo - test 2)
      
                          file creation time        ls -lha time
      10 000 files              3.63                    0.93
      100 000 files            48.56                    9.74
      1 000 000 files         537.72                  125.11
      
      * With 4 properties (test 3)
      
                          file creation time        ls -lha time
      10 000 files              3.94                    1.20
      100 000 files            52.14                   11.48
      1 000 000 files         572.70                  142.13
      
      * With 10 properties (test 4)
      
                          file creation time        ls -lha time
      10 000 files              4.61                    1.35
      100 000 files            58.86                   13.83
      1 000 000 files         656.01                  177.61
      
      The increased latencies with properties are essencialy because of:
      
      *) When creating an inode, we now synchronously write 1 more item
         (an xattr item) for each property inherited from the parent dir
         (or subvolume). This could be done in an asynchronous way such
         as we do for dir intex items (delayed-inode.c), which could help
         reduce the file creation latency;
      
      *) With properties, we now have larger fs trees. For this particular
         test each xattr item uses 75 bytes of leaf space in the fs tree.
         This could be less by using a new item for xattr items, instead of
         the current btrfs_dir_item, since we could cut the 'location' and
         'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
         total of 26 bytes per xattr item) from the btrfs_dir_item type.
      
      Also tried batching the xattr insertions (ignoring proper hash
      collision handling, since it didn't exist) when creating files that
      inherit properties from their parent inode/subvolume, but the end
      results were (surprisingly) essentially the same.
      
      Test script:
      
      $ cat test.pl
        #!/usr/bin/perl -w
      
        use strict;
        use Time::HiRes qw(time);
        use constant NUM_FILES => 10_000;
        use constant FILE_SIZES => (64 * 1024);
        use constant DEV => '/dev/sdb4';
        use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
        use constant TEST_DIR => (MNT_POINT . '/testdir');
      
        system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
      
        # following line for testing without properties
        #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        # following 2 lines for testing with properties
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
        system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
      
        system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
        my ($t1, $t2);
      
        $t1 = time();
        for (my $i = 1; $i <= NUM_FILES; $i++) {
            my $p = TEST_DIR . '/file_' . $i;
            open(my $f, '>', $p) or die "Error opening file!";
            $f->autoflush(1);
            for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
                print $f ('A' x 4096) or die "Error writing to file!";
            }
            close($f);
        }
        $t2 = time();
        print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        $t1 = time();
        system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
        $t2 = time();
        print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      63541927
    • J
      rwsem: add rwsem_is_contended · 4a444b1f
      Josef Bacik 提交于
      Btrfs needs a simple way to know if it needs to let go of it's read lock on a
      rwsem.  Introduce rwsem_is_contended to check to see if there are any waiters on
      this rwsem currently.  This is just a hueristic, it is meant to be light and not
      100% accurate and called by somebody already holding on to the rwsem in either
      read or write.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      4a444b1f
    • L
      Btrfs/tracepoint: update new flags for ordered extent TP · 792ddef0
      Liu Bo 提交于
      Flag BTRFS_ORDERED_TRUNCATED is a new one, update the tracepoint to
      support it.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      792ddef0
    • L
      Btrfs/tracepoint: fix to report right flags for ordered extent · 9d04a8ce
      Liu Bo 提交于
      We use set_bit() to assign ordered extent's flags, but in the related
      tracepoint we don't do the same thing, which makes the trace output
      not to parse flags correctly.
      
      Also, since the flags are bits stuff, we change to use __print_flags with
      a 'delim' instead of __print_symbolic.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9d04a8ce
    • J
      btrfs: add ioctl to export size of global metadata reservation · 01e219e8
      Jeff Mahoney 提交于
      btrfs filesystem df output will show the size of the metadata space
      and how much of it is used, and the user assumes that the difference
      is all usable space. Since that's not actually the case due to the
      global metadata reservation, we should provide the full picture to the
      user.
      
      This patch adds an ioctl that exports the size of the global metadata
      reservation so that btrfs filesystem df can report it.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      01e219e8
    • J
      btrfs: add ioctls to query/change feature bits online · 2eaa055f
      Jeff Mahoney 提交于
      There are some feature bits that require no offline setup and can
      be enabled online. I've only reviewed extended irefs, but there will
      probably be more.
      
      We introduce three new ioctls:
      - BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features.
      - BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs
        basis, as well as querying for which features are changeable with mounted.
      - BTRFS_IOC_SET_FEATURES: change features on a per-fs basis.
      
      We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that
      allow us to define which features are safe to change at runtime.
      
      The failure modes for BTRFS_IOC_SET_FEATURES are as follows:
      - Enabling a completely unsupported feature: warns and returns -ENOTSUPP
      - Enabling a feature that can only be done offline: warns and returns -EPERM
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2eaa055f
  6. 28 1月, 2014 14 次提交