1. 08 Feb, 2014 14 commits
    • kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends · 3eef34ad
      Tejun Heo committed
      kernfs_node->parent and ->name are currently marked as "published"
      indicating that kernfs users may access them directly; however, those
      fields may get updated by kernfs_rename[_ns]() and unrestricted access
      may lead to erroneous values or oops.
      
      Protect ->parent and ->name updates with an irq-safe spinlock
      kernfs_rename_lock and implement the following accessors for these
      fields.
      
      * kernfs_name()		- format the node's name into the specified buffer
      * kernfs_path()		- format the node's path into the specified buffer
      * pr_cont_kernfs_name()	- pr_cont a node's name (doesn't need buffer)
      * pr_cont_kernfs_path()	- pr_cont a node's path (doesn't need buffer)
      * kernfs_get_parent()	- pin and return a node's parent
      
      All can be called under any context.  The recursive sysfs_pathname()
      in fs/sysfs/dir.c is replaced with kernfs_path() and
      sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
      dereferencing parent directly.
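
As a rough illustration of how a recursive path builder can be replaced by an iterative one, the user-space sketch below walks parent pointers and fills the buffer from its end. `struct node` and `node_path()` are hypothetical stand-ins rather than the kernfs implementation, and the locking that kernfs_rename_lock provides in the kernel is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a kernfs node: a name plus a parent pointer. */
struct node {
    const char *name;
    const struct node *parent;   /* NULL for the root */
};

/*
 * Format the absolute path of @n into @buf without recursion by filling
 * the buffer from the end and walking towards the root.
 * Returns a pointer into @buf at the start of the path, or NULL if
 * @buflen is too small.  In the kernel the walk would have to run under
 * a lock that keeps ->name and ->parent stable; that is omitted here.
 */
static char *node_path(const struct node *n, char *buf, size_t buflen)
{
    char *p;

    assert(n != NULL && buf != NULL);
    if (buflen < 2)
        return NULL;

    p = buf + buflen;
    *--p = '\0';

    /* The root itself is rendered as "/". */
    if (n->parent == NULL) {
        *--p = '/';
        return p;
    }

    for (; n->parent != NULL; n = n->parent) {
        size_t len = strlen(n->name);

        /* Need room for the name and its leading '/'. */
        if ((size_t)(p - buf) < len + 1)
            return NULL;
        p -= len;
        memcpy(p, n->name, len);
        *--p = '/';
    }
    return p;
}
```

For a chain root -> "devices" -> "cpu", node_path() on the last node produces "/devices/cpu". Because no recursion is involved, stack usage stays constant regardless of depth.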
      
      v2: The dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
          static inline, which caused a lot of build warnings.  Add it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: implement kernfs_node_from_dentry(), kernfs_root_from_sb() and kernfs_rename() · 0c23b225
      Tejun Heo committed
      Implement helpers to determine node from dentry and root from
      super_block.  Also add a kernfs_rename_ns() wrapper which assumes NULL
      namespace.  These generally make sense and will be used by cgroup.
      
      v2: Some dummy implementations for !CONFIG_SYSFS were missing.  Fixed.
          Reported by kbuild test robot.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: add kernfs_open_file->priv · 2536390d
      Tejun Heo committed
      Add a private data field to be used by kernfs file operations.  This
      generally makes sense and will be used by cgroup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: implement kernfs_ops->atomic_write_len · 4d3773c4
      Tejun Heo committed
      A write to a kernfs_node is buffered through a kernel buffer.  Writes
      <= PAGE_SIZE are performed atomically, while larger ones are executed
      in PAGE_SIZE chunks.  While this is enough for sysfs, cgroup which is
      scheduled to be converted to use kernfs needs a bit more control over
      it.
      
      This patch adds kernfs_ops->atomic_write_len.  If not set (zero), the
      behavior stays the same.  If set, writes up to the size are executed
      atomically and larger writes are rejected with -E2BIG.
      
      A different implementation strategy would be to allow configuring
      the chunk size while making the original write size available to the
      write method; however, such a strategy, while being more complicated,
      doesn't really buy anything.  If the write implementation has to
      handle chunking, the specific chunk size shouldn't matter all that
      much.
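
The size rule described above can be sketched in user space as follows. This is only a model of the stated behavior, not the kernfs write path: `first_chunk_len()` and `SKETCH_PAGE_SIZE` are hypothetical names, and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)   /* assumed page size */

/*
 * Decide how many bytes of a @count-byte write are handed to the write
 * method in its first invocation.
 *
 * @atomic_write_len == 0: legacy behavior, at most one page per call
 *                         (larger writes are processed in page-sized chunks).
 * @atomic_write_len  > 0: the whole write is passed at once if it fits,
 *                         otherwise it is rejected with -E2BIG.
 */
static long first_chunk_len(size_t count, size_t atomic_write_len)
{
    assert(count <= (size_t)LONG_MAX);

    if (atomic_write_len) {
        if (count > atomic_write_len)
            return -E2BIG;
        return (long)count;
    }
    return (long)(count < SKETCH_PAGE_SIZE ? count : SKETCH_PAGE_SIZE);
}
```

With the field left at zero, a 10000-byte write is first seen as a 4096-byte chunk. With a limit of 16384, the same write arrives whole, while a 20000-byte write fails with -E2BIG.
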
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: allow nodes to be created in the deactivated state · d35258ef
      Tejun Heo committed
      Currently, kernfs_nodes are made visible to userland on creation,
      which makes it difficult for kernfs users to atomically succeed or
      fail creation of multiple nodes.  In addition, if something fails
      after creating some nodes, the created nodes might already be in use
      and their active refs need to be drained for removal, which has the
      potential to introduce tricky reverse locking dependency on active_ref
      depending on how the error path is synchronized.
      
      This patch introduces per-root flag KERNFS_ROOT_CREATE_DEACTIVATED.
      If set, all nodes under the root are created in the deactivated state
      and stay invisible to userland until explicitly enabled by the new
      kernfs_activate() API.  Also, nodes which have never been activated
      are guaranteed to bypass draining on removal thus allowing error paths
      to not worry about locking dependency on active_ref draining.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: implement kernfs_syscall_ops->remount_fs() and ->show_options() · 6a7fed4e
      Tejun Heo committed
      Add two super_block related syscall callbacks ->remount_fs() and
      ->show_options() to kernfs_syscall_ops.  These simply forward the
      matching super_operations.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: rename kernfs_dir_ops to kernfs_syscall_ops · 90c07c89
      Tejun Heo committed
      We're gonna need non-dir syscall callbacks, which will make dir_ops a
      misnomer.  Let's rename kernfs_dir_ops to kernfs_syscall_ops.
      
      This is pure rename.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: invoke dir_ops while holding active ref of the target node · 07c7530d
      Tejun Heo committed
      kernfs_dir_ops are currently being invoked without any active
      reference, which makes it tricky for the invoked operations to
      determine whether the objects associated with those nodes are safe to
      access and will remain that way for the duration of such operations.
      
      kernfs already has active_ref mechanism to deal with this which makes
      the removal of a given node the synchronization point for gating the
      file operations.  There's no reason for dir_ops to be any different.
      Update the dir_ops handling so that active_ref is held while the
      dir_ops are executing.  This guarantees that while a dir_ops is
      executing the target nodes stay alive.
      
      As kernfs_dir_ops doesn't have any in-kernel user at this point, this
      doesn't affect anybody.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner() · ce8b04aa
      Tejun Heo committed
      All device_schedule_callback_owner() users are converted to use
      device_remove_file_self().  Remove now unused
      {sysfs|device}_schedule_callback_owner().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers · 6b0afc2a
      Tejun Heo committed
      Sometimes it's necessary to implement a node which wants to delete
      nodes including itself.  This isn't straightforward because of kernfs
      active reference.  While a file operation is in progress, an active
      reference is held and kernfs_remove() waits for all such references to
      drain before completing.  For a self-deleting node, this is a deadlock
      as kernfs_remove() ends up waiting for an active reference that it is
      itself sitting on top of.
      
      This currently is worked around in the sysfs layer using
      sysfs_schedule_callback() which makes such removals asynchronous.
      While it works, it's rather cumbersome and inherently breaks
      synchronicity of the operation - the file operation which triggered
      the operation may complete before the removal is finished (or even
      started) and the removal may fail asynchronously.  If a removal
      operation is immediately followed by another operation which expects
      the specific name to be available (e.g. removal followed by rename
      onto the same name), there's no way to make the latter operation
      reliable.
      
      The thing is there's no inherent reason for this to be asynchronous.
      All that's necessary to make this synchronous is a dedicated operation
      which drops its own active ref and deactivates self.  This patch
      implements kernfs_remove_self() and its wrappers in sysfs and driver
      core.  kernfs_remove_self() is to be called from one of the file
      operations, drops the active ref the task is holding, removes the self
      node, and restores active ref to the dead node so that the ref is
      balanced afterwards.  __kernfs_remove() is updated so that it takes an
      early exit if the target node is already fully removed so that the
      active ref restored by kernfs_remove_self() after removal doesn't
      confuse the deactivation path.
      
      This makes implementing self-deleting nodes very easy.  The normal
      removal path doesn't even need to be changed to use
      kernfs_remove_self() for the self-deleting node.  The method can
      invoke kernfs_remove_self() on itself before proceeding with the normal
      removal path.  kernfs_remove() invoked on the node by the normal
      deletion path will simply be ignored.
      
      This will replace sysfs_schedule_callback().  A subtle feature of
      sysfs_schedule_callback() is that it collapses multiple invocations -
      even if multiple removals are triggered, the removal callback is run
      only once.  An equivalent effect can be achieved by testing the return
      value of kernfs_remove_self() - only the one which gets %true return
      value should proceed with actual deletion.  All other instances of
      kernfs_remove_self() will wait till the enclosing kernfs operation
      which invoked the winning instance of kernfs_remove_self() finishes
      and then return %false.  This trivially makes all users of
      kernfs_remove_self() automatically show correct synchronous behavior
      even when there are multiple concurrent operations - all "echo 1 >
      delete" instances will finish only after the whole operation is
      completed by one of the instances.
      
      Note that manipulation of active ref is implemented in separate public
      functions - kernfs_[un]break_active_protection().
      kernfs_remove_self() is the only user at the moment but this will be
      used to cater to more complex cases.
      
      v2: For !CONFIG_SYSFS, the dummy version of kernfs_remove_self() was
          missing and sysfs_remove_file_self() had an incorrect return type.
          Fix it.
          Reported by kbuild test bot.
      
      v3: kernfs_[un]break_active_protection() separated out from
          kernfs_remove_self() and exposed as public API.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: remove KERNFS_REMOVED · 81c173cb
      Tejun Heo committed
      KERNFS_REMOVED is used to mark half-initialized and dying nodes so
      that they don't show up in lookups and deny adding new nodes under or
      renaming it; however, its role overlaps that of deactivation.
      
      It's necessary to deny addition of new children while removal is in
      progress; however, this role considerably intersects with deactivation
      - KERNFS_REMOVED prevents new children while deactivation prevents new
      file operations.  There's no reason to have them separate making
      things more complex than necessary.
      
      This patch removes KERNFS_REMOVED.
      
      * Instead of KERNFS_REMOVED, each node now starts its life
        deactivated.  This means that we now use both atomic_add() and
        atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN.  The compiler
        generates an overflow warning when negating INT_MIN as the negation
        can't be represented as a positive number.  Nothing is actually
        broken but let's bump BIAS by one to avoid the warnings for archs
        which negate the subtrahend.
      
      * A new helper kernfs_active() which tests whether kn->active >= 0 is
        added for convenience and lockdep annotation.  All KERNFS_REMOVED
        tests are replaced with negated kernfs_active() tests.
      
      * __kernfs_remove() is updated to deactivate, but not drain, all nodes
        in the subtree instead of setting KERNFS_REMOVED.  This removes
        deactivation from kernfs_deactivate(), which is now renamed to
        kernfs_drain().
      
      * Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
        checks on the active ref.
      
      * Some comment style updates in the affected area.
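
The counter scheme in the list above can be modelled in a few lines of user-space C. This is a single-threaded sketch with hypothetical names (`struct node`, `node_*()`, `DEACTIVATED_BIAS`); the kernel uses atomic operations on kn->active, which a plain `int` does not capture.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Bias applied to the counter while a node is deactivated.  INT_MIN + 1
 * is used instead of INT_MIN so that -BIAS is still a valid int.
 */
#define DEACTIVATED_BIAS (INT_MIN + 1)

/* Hypothetical node: active >= 0 means active, and the value counts refs. */
struct node {
    int active;
};

/* Every node starts its life deactivated. */
static void node_init(struct node *n)
{
    n->active = DEACTIVATED_BIAS;
}

static bool node_active(const struct node *n)
{
    return n->active >= 0;
}

/* Make the node usable: subtract the bias. */
static void node_activate(struct node *n)
{
    assert(!node_active(n));
    n->active -= DEACTIVATED_BIAS;
}

/* Block new active refs: add the bias back. */
static void node_deactivate(struct node *n)
{
    assert(node_active(n));
    n->active += DEACTIVATED_BIAS;
}

/* Take an active ref; fails once the node is deactivated. */
static bool node_get_active(struct node *n)
{
    if (!node_active(n))
        return false;
    n->active++;
    return true;
}

static void node_put_active(struct node *n)
{
    n->active--;
}

/* A deactivated node is drained when no active refs remain. */
static bool node_drained(const struct node *n)
{
    return n->active == DEACTIVATED_BIAS;
}
```

A single sign test therefore answers "may new operations start?", while the distance from the bias still counts the operations in flight that a remover has to wait for.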
      
      v2: Reordered before removal path restructuring.  kernfs_active()
          dropped and kernfs_get/put_active() used instead.  RB_EMPTY_NODE()
          used in the lookup paths.
      
      v3: Reverted most of v2 except for creating a new node with
          KN_DEACTIVATED_BIAS.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep() · 182fd64b
      Tejun Heo committed
      There currently are two mechanisms gating active ref lockdep
      annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask.
      The former disables lockdep annotations in kernfs_get/put_active()
      while the latter disables all of kernfs_deactivate().
      
      While KERNFS_ACTIVE_REF also behaves as an optimization to skip the
      deactivation step for non-file nodes, the benefit is marginal and it
      needlessly diverges code paths.  Let's drop KERNFS_ACTIVE_REF.
      
      While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP
      flag so that it's more convenient and the related code can be compiled
      out when not enabled.
      
      v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
          KERNFS_LOCKDEP flag").  As the earlier patch already added
          KERNFS_LOCKDEP tests to kernfs_deactivate(), those additions are
          dropped from this patch and the existing ones are simply converted
          to kernfs_lockdep().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: remove kernfs_addrm_cxt · 988cd7af
      Tejun Heo committed
      kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
      added because there were operations which should be performed outside
      kernfs_mutex after adding and removing kernfs_nodes.  The necessary
      operations were recorded in kernfs_addrm_cxt and performed by
      kernfs_addrm_finish(); however, after the recent changes which
      relocated deactivation and unmapping so that they're performed
      directly during removal, the only operation kernfs_addrm_finish()
      performs is kernfs_put(), which can be moved inside the removal path
      too.
      
      This patch moves the kernfs_put() of the base ref to __kernfs_remove()
      and removes kernfs_addrm_cxt and kernfs_addrm_start/finish().
      
      * kernfs_add_one() is updated to grab and release kernfs_mutex itself.
        sysfs_addrm_start/finish() invocations around it are removed from
        all users.
      
      * __kernfs_remove() puts an unlinked node directly instead of chaining
        it to kernfs_addrm_cxt.  Its callers are updated to grab and release
        kernfs_mutex instead of calling kernfs_addrm_start/finish() around
        it.
      
      v2: Rebased on top of "kernfs: associate a new kernfs_node with its
          parent on creation" which dropped @parent from kernfs_add_one().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq · abd54f02
      Tejun Heo committed
      kernfs_node->u.completion is used to notify deactivation completion
      from kernfs_put_active() to kernfs_deactivate().  We now allow
      multiple racing removals of the same node and the current removal
      scheme is no longer correct - kernfs_remove() invocation may return
      before the node is properly deactivated if it races against another
      removal.  The removal path will be restructured to address the issue.
      
      To help such restructuring, which requires supporting multiple waiters,
      this patch replaces kernfs_node->u.completion with
      kernfs_root->deactivate_waitq.  This makes deactivation event
      notifications share a per-root waitqueue_head; however, the wait path
      is quite cold and this will also allow shaving one pointer off
      kernfs_node.
      
      v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
          KERNFS_LOCKDEP flag").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 01 Feb, 2014 1 commit
  3. 31 Jan, 2014 6 commits
    • xen/grant-table: Avoid m2p_override during mapping · 08ece5bb
      Zoltan Kiss committed
      The grant mapping API does m2p_override unnecessarily: only gntdev needs it;
      for blkback and future netback patches it just causes lock contention, as
      those pages never go to userspace. Therefore this series does the following:
      - the original functions were renamed to __gnttab_[un]map_refs, with a new
        parameter m2p_override
      - based on m2p_override either they follow the original behaviour, or just set
        the private flag and call set_phys_to_machine
      - gnttab_[un]map_refs are now a wrapper to call __gnttab_[un]map_refs with
        m2p_override false
      - a new function gnttab_[un]map_refs_userspace provides the old behaviour
      
      It also removes a stray space from page.h and changes ret to 0 if
      XENFEAT_auto_translated_physmap, as that is the only possible return value
      there.
      
      v2:
      - move the storing of the old mfn in page->index to gnttab_map_refs
      - move the function header update to a separate patch
      
      v3:
      - a new approach to retain old behaviour where it is needed
      - squash the patches into one
      
      v4:
      - move out the common bits from m2p* functions, and pass pfn/mfn as parameter
      - clear page->private before doing anything with the page, so m2p_find_override
        won't race with this
      
      v5:
      - change return value handling in __gnttab_[un]map_refs
      - remove a stray space in page.h
      - add detail why ret = 0 now at some places
      
      v6:
      - don't pass pfn to m2p* functions, just get it locally
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Suggested-by: David Vrabel <david.vrabel@citrix.com>
      Acked-by: David Vrabel <david.vrabel@citrix.com>
      Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • mm: sl[uo]b: fix misleading comments · 433a91ff
      Dave Hansen committed
      On x86, SLUB creates and handles <=8192-byte allocations internally.
      It passes larger ones up to the allocator.  Saying "up to order 2" is,
      at best, ambiguous.  Is that order-1?  Or (order-2 bytes)?  Make
      it more clear.
      
      SLOB commits a similar sin.  It *handles* page-size requests, but the
      comment says that it passes up "all page size and larger requests".
      
      SLOB also swaps around the order of the very-similarly-named
      KMALLOC_SHIFT_HIGH and KMALLOC_SHIFT_MAX #defines.  Make it
      consistent with the order of the other two allocators.
      
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • zsmalloc: add copyright · 31fc00bb
      Minchan Kim committed
      Add my copyright to the zsmalloc source code which I maintain.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: move it under mm · bcf1647d
      Minchan Kim committed
      This patch moves zsmalloc under the mm directory.
      
      Before that, this description explains why we needed a custom
      allocator.
      
      Zsmalloc is a new slab-based memory allocator for storing compressed
      pages.  It is designed for low fragmentation and a high allocation
      success rate on large (but <= PAGE_SIZE) allocations.
      
      zsmalloc differs from the kernel slab allocator in two primary ways to
      achieve these design goals.
      
      zsmalloc never requires high order page allocations to back slabs, or
      "size classes" in zsmalloc terms.  Instead it allows multiple
      single-order pages to be stitched together into a "zspage" which backs
      the slab.  This allows for higher allocation success rate under memory
      pressure.
      
      Also, zsmalloc allows objects to span page boundaries within the zspage.
      This allows for lower fragmentation than could be had with the kernel
      slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.  With the
      kernel slab allocator, if a page compresses to 60% of its original size,
      the memory savings gained through compression are lost in fragmentation
      because another object of the same size can't be stored in the leftover
      space.
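
The 60% case can be checked with simple arithmetic. The sketch below is only a capacity model under stated assumptions: a 4096-byte page, a 2458-byte object (60% of a page, rounded up), a three-page region, and no allocator metadata. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)   /* assumed page size */

/* Objects of @obj_size bytes that fit in @pages pages when every object
 * must stay within a single page. */
static size_t objs_page_bound(size_t obj_size, size_t pages)
{
    assert(obj_size > 0);
    return (SKETCH_PAGE_SIZE / obj_size) * pages;
}

/* Objects that fit when the pages are treated as one region and
 * objects may span page boundaries. */
static size_t objs_spanning(size_t obj_size, size_t pages)
{
    assert(obj_size > 0);
    return (SKETCH_PAGE_SIZE * pages) / obj_size;
}

/* Bytes left unused out of @pages pages after storing @objs objects. */
static size_t wasted_bytes(size_t objs, size_t obj_size, size_t pages)
{
    assert(objs * obj_size <= SKETCH_PAGE_SIZE * pages);
    return SKETCH_PAGE_SIZE * pages - objs * obj_size;
}
```

Under these assumptions, three pages hold three objects when each object is confined to a page (4914 bytes unused) and four objects when spanning is allowed (2456 bytes unused).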
      
      This ability to span pages results in zsmalloc allocations not being
      directly addressable by the user.  The user is given a
      non-dereferenceable handle in response to an allocation request.  That
      handle must be mapped, using zs_map_object(), which returns a pointer to
      the mapped region that can be used.  The mapping is necessary since the
      object data may reside in two different noncontiguous pages.
      
      zsmalloc fulfills the allocation needs for zram perfectly.
      
      [sjenning@linux.vnet.ibm.com: borrow Seth's quote]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Nitin Gupta <ngupta@vflare.org>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel: use lockless list for smp_call_function_single · 6897fc22
      Christoph Hellwig committed
      Make smp_call_function_single and friends more efficient by using a
      lockless list.
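
For readers unfamiliar with the technique, a lockless singly linked list of this general shape can be sketched with C11 atomics as below. This is a user-space illustration with hypothetical names (`struct lnode`, `struct lhead`, `lpush()`, `ltake_all()`), not the kernel code touched by the commit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct lnode {
    struct lnode *next;
};

struct lhead {
    _Atomic(struct lnode *) first;
};

static void linit(struct lhead *h)
{
    atomic_init(&h->first, (struct lnode *)NULL);
}

/*
 * Push @n onto the list with a compare-and-swap loop.  Returns non-zero
 * if the list was empty beforehand, which tells the producer that it is
 * the one responsible for notifying the consumer.
 */
static int lpush(struct lhead *h, struct lnode *n)
{
    struct lnode *old;

    assert(h != NULL && n != NULL);
    old = atomic_load_explicit(&h->first, memory_order_relaxed);
    do {
        /* On CAS failure @old is refreshed, so relink and retry. */
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&h->first, &old, n,
                                                   memory_order_release,
                                                   memory_order_relaxed));
    return old == NULL;
}

/* Detach the whole list in one atomic exchange (newest entry first). */
static struct lnode *ltake_all(struct lhead *h)
{
    return atomic_exchange_explicit(&h->first, (struct lnode *)NULL,
                                    memory_order_acquire);
}
```

Producers never block each other, and the consumer takes every queued entry with a single exchange, so no lock is needed on either side.
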
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock, bootmem: restore goal for alloc_low · 07bacb38
      Yinghai Lu committed
      Now we have memblock_virt_alloc_low to replace the original bootmem API
      in swiotlb.
      
      But we should not use BOOTMEM_LOW_LIMIT for archs that do not support
      CONFIG_NOBOOTMEM, as the old API takes 0.
      
      | #define alloc_bootmem_low(x) \
      |        __alloc_bootmem_low(x, SMP_CACHE_BYTES, 0)
      | #define alloc_bootmem_low_pages_nopanic(x) \
      |        __alloc_bootmem_low_nopanic(x, PAGE_SIZE, 0)
      
      and we have
       #define BOOTMEM_LOW_LIMIT __pa(MAX_DMA_ADDRESS)
      for CONFIG_NOBOOTMEM.
      
      Restore goal to 0 to fix the ia64 crash that Tony found.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Reported-by: Tony Luck <tony.luck@gmail.com>
      Tested-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 30 Jan, 2014 6 commits
  5. 29 Jan, 2014 7 commits
    • fsnotify: Do not return merged event from fsnotify_add_notify_event() · 83c0e1b4
      Jan Kara committed
      The event returned from fsnotify_add_notify_event() cannot ever be used
      safely as the event may be freed by the time the function returns (after
      dropping notification_mutex). So change the prototype to just return
      whether the event was added or merged into some existing event.
      Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
      Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
    • Btrfs: add support for inode properties · 63541927
      Filipe David Borba Manana committed
      This change adds infrastructure to allow for generic properties for
      inodes. Properties are name/value pairs that can be associated with
      inodes for different purposes. They are stored as xattrs with the
      prefix "btrfs."
      
      Properties can be inherited - this means when a directory inode has
      inheritable properties set, these are added to new inodes created
      under that directory. Further, subvolumes can also have properties
      associated with them, and they can be inherited from their parent
      subvolume. Naturally, directory properties have priority over subvolume
      properties (in practice a subvolume property is just a regular
      property associated with the root inode, objectid 256, of the
      subvolume's fs tree).
      
      This change also adds one specific property implementation, named
      "compression", whose values can be "lzo" or "zlib" and it's an
      inheritable property.
      
      The corresponding changes to btrfs-progs were also implemented.
      A patch with xfstests for this feature will follow once there's
      agreement on this change/feature.
      
      Further, the script at the bottom of this commit message was used to
      do some benchmarks to measure any performance penalties of this feature.
      
      Basically the tests correspond to:
      
      Test 1 - create a filesystem and mount it with compress-force=lzo,
      then sequentially create N files of 64KiB each, measure how long it took
      to create the files, unmount the filesystem, mount the filesystem and
      perform an 'ls -lha' against the test directory holding the N files, and
      report the time the command took.
      
      Test 2 - create a filesystem and don't use any compression option when
      mounting it - instead set the compression property of the subvolume's
      root to 'lzo'. Then create N files of 64KiB, and report the time it took.
      Then unmount the filesystem, mount it again and perform an 'ls -lha' like
      in the former test. This means every single file ends up with a property
      (xattr) associated with it.
      
      Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
      compression property, have no real effect other than adding more work
      when inheriting properties and taking more btree leaf space.
      
      Test 4 - same as test 3 but with 10 properties per file.
      
      Results (in seconds, and averages of 5 runs each), for different N
      numbers of files follow.
      
      * Without properties (test 1)
      
                          file creation time        ls -lha time
      10 000 files              3.49                   0.76
      100 000 files            47.19                   8.37
      1 000 000 files         518.51                 107.06
      
      * With 1 property (compression property set to lzo - test 2)
      
                          file creation time        ls -lha time
      10 000 files              3.63                    0.93
      100 000 files            48.56                    9.74
      1 000 000 files         537.72                  125.11
      
      * With 4 properties (test 3)
      
                          file creation time        ls -lha time
      10 000 files              3.94                    1.20
      100 000 files            52.14                   11.48
      1 000 000 files         572.70                  142.13
      
      * With 10 properties (test 4)
      
                          file creation time        ls -lha time
      10 000 files              4.61                    1.35
      100 000 files            58.86                   13.83
      1 000 000 files         656.01                  177.61
      
      The increased latencies with properties are essentially because of:
      
      *) When creating an inode, we now synchronously write 1 more item
         (an xattr item) for each property inherited from the parent dir
         (or subvolume). This could be done in an asynchronous way such
         as we do for dir index items (delayed-inode.c), which could help
         reduce the file creation latency;
      
      *) With properties, we now have larger fs trees. For this particular
         test each xattr item uses 75 bytes of leaf space in the fs tree.
         This could be less by using a new item for xattr items, instead of
         the current btrfs_dir_item, since we could cut the 'location' and
         'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
         total of 26 bytes per xattr item) from the btrfs_dir_item type.
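
The byte counts quoted above can be reproduced with a small calculation. The field sizes below are assumptions about the on-disk layout of that era (a 17-byte disk key, a 25-byte item header, a 30-byte btrfs_dir_item header); the commit message itself only states the totals 75, 18 and 26. `xattr_leaf_bytes()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* On-disk sizes in bytes, as assumed for this sketch. */
enum {
    /* objectid (8) + type (1) + offset (8) */
    DISK_KEY_SIZE    = 8 + 1 + 8,
    /* key + data offset (4) + data size (4) */
    ITEM_HEADER_SIZE = DISK_KEY_SIZE + 4 + 4,
    /* location key + transid (8) + data_len (2) + name_len (2) + type (1) */
    DIR_ITEM_SIZE    = DISK_KEY_SIZE + 8 + 2 + 2 + 1
};

/* Leaf space used by one xattr stored as a dir item: item header,
 * dir item header, then the name and value bytes. */
static size_t xattr_leaf_bytes(const char *name, const char *value)
{
    assert(name != NULL && value != NULL);
    return ITEM_HEADER_SIZE + DIR_ITEM_SIZE + strlen(name) + strlen(value);
}
```

With the name "btrfs.compression" and the value "lzo" this gives 25 + 30 + 17 + 3 = 75 bytes. Dropping 'location' (17) and 'type' (1) saves 18 bytes, and also dropping 'transid' (8) saves 26, leaving 49 bytes per item.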
      
      Also tried batching the xattr insertions (ignoring proper hash
      collision handling, since it didn't exist) when creating files that
      inherit properties from their parent inode/subvolume, but the end
      results were (surprisingly) essentially the same.
      
      Test script:
      
      $ cat test.pl
        #!/usr/bin/perl -w
      
        use strict;
        use IO::Handle;            # autoflush() on lexical filehandles
        use Time::HiRes qw(time);
        use constant NUM_FILES => 10_000;
        use constant FILE_SIZES => (64 * 1024);
        use constant DEV => '/dev/sdb4';
        use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
        use constant TEST_DIR => (MNT_POINT . '/testdir');
      
        system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
      
        # following line for testing without properties
        #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        # following 2 lines for testing with properties
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
        system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
      
        system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
        my ($t1, $t2);
      
        $t1 = time();
        for (my $i = 1; $i <= NUM_FILES; $i++) {
            my $p = TEST_DIR . '/file_' . $i;
            open(my $f, '>', $p) or die "Error opening file!";
            $f->autoflush(1);
            for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
                print $f ('A' x 4096) or die "Error writing to file!";
            }
            close($f);
        }
        $t2 = time();
        print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        $t1 = time();
        system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
        $t2 = time();
        print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      63541927
    • J
      rwsem: add rwsem_is_contended · 4a444b1f
      Josef Bacik committed
      Btrfs needs a simple way to know if it needs to let go of its read lock on a
      rwsem.  Introduce rwsem_is_contended to check whether there are currently any
      waiters on this rwsem.  This is just a heuristic: it is meant to be lightweight
      rather than 100% accurate, and to be called by somebody already holding the
      rwsem for either read or write.  Thanks,
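
A minimal userspace sketch of the idea, assuming the check amounts to "is the wait list non-empty". The toy_* types and scan_until_contended() are hypothetical stand-ins written for illustration; they are not the kernel's rw_semaphore, and the toy is single-threaded with no real locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins, not the kernel's types. */
struct toy_waiter {
    struct toy_waiter *next;
};

struct toy_rwsem {
    int readers;                    /* number of current read holders */
    struct toy_waiter *wait_list;   /* tasks queued behind the holders */
};

/* Heuristic: report contention if anybody is queued on the semaphore. */
static bool toy_rwsem_is_contended(const struct toy_rwsem *sem)
{
    assert(sem != NULL);
    return sem->wait_list != NULL;
}

/*
 * Process up to n items while holding the lock for read, but stop early
 * once a waiter shows up.  Returns the number of items processed.
 */
static size_t scan_until_contended(struct toy_rwsem *sem, size_t n)
{
    size_t done = 0;

    sem->readers++;                 /* stand-in for taking the read lock */
    while (done < n) {
        done++;                     /* one unit of work under the lock */
        if (toy_rwsem_is_contended(sem))
            break;                  /* let the waiter make progress */
    }
    sem->readers--;                 /* stand-in for dropping the read lock */
    return done;
}
```

A long-running reader would call the check between units of work and release the lock as soon as it returns true, which is the usage the message describes.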
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      4a444b1f
    • L
      Btrfs/tracepoint: update new flags for ordered extent TP · 792ddef0
      Liu Bo committed
      Flag BTRFS_ORDERED_TRUNCATED is a new one, update the tracepoint to
      support it.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      792ddef0
    • L
      Btrfs/tracepoint: fix to report right flags for ordered extent · 9d04a8ce
      Liu Bo committed
      We use set_bit() to assign an ordered extent's flags, but the related
      tracepoint does not do the same thing, so the trace output does not
      parse the flags correctly.
      
      Also, since the flags are bit flags, switch to __print_flags with
      a 'delim' instead of __print_symbolic.
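
The mismatch can be illustrated in userspace. set_bit(nr, ...) sets bit number nr, so the stored word contains 1 << nr and several flags can be set at once; a decoder therefore has to test masks and join the matching names with a delimiter. The sketch below assumes illustrative bit numbers and names (the ORD_* values, toy_set_bit() and format_flags() are hypothetical, not the kernel's definitions or its tracing macros).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative bit numbers, in the style passed to set_bit(). */
enum { ORD_IO_DONE = 0, ORD_COMPLETE = 1, ORD_NOCOW = 2, ORD_COMPRESSED = 3 };

struct flag_name {
    unsigned long mask;
    const char *name;
};

/* The decode table is keyed by mask (1 << bit), not by bit number. */
static const struct flag_name ordered_flag_names[] = {
    { 1UL << ORD_IO_DONE,    "IO_DONE"    },
    { 1UL << ORD_COMPLETE,   "COMPLETE"   },
    { 1UL << ORD_NOCOW,      "NOCOW"      },
    { 1UL << ORD_COMPRESSED, "COMPRESSED" },
};

/* Userspace stand-in for set_bit(): sets bit number nr in *word. */
static void toy_set_bit(int nr, unsigned long *word)
{
    *word |= 1UL << nr;
}

/* Join the names of all set flags with delim, in table order. */
static void format_flags(unsigned long flags, const char *delim,
                         char *buf, size_t len)
{
    size_t used = 0;
    size_t count = sizeof(ordered_flag_names) / sizeof(ordered_flag_names[0]);

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        if (!(flags & ordered_flag_names[i].mask))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? delim : "", ordered_flag_names[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';       /* drop a name that does not fit */
            return;
        }
        used += (size_t)n;
    }
}
```

A symbolic lookup that maps one value to one name cannot describe a word with two bits set, which is why a flags-style formatter with a delimiter fits better here.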
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9d04a8ce
    • J
      btrfs: add ioctl to export size of global metadata reservation · 01e219e8
      Jeff Mahoney committed
      btrfs filesystem df output will show the size of the metadata space
      and how much of it is used, and the user assumes that the difference
      is all usable space. Since that's not actually the case due to the
      global metadata reservation, we should provide the full picture to the
      user.
      
      This patch adds an ioctl that exports the size of the global metadata
      reservation so that btrfs filesystem df can report it.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      01e219e8
    • J
      btrfs: add ioctls to query/change feature bits online · 2eaa055f
      Jeff Mahoney committed
      There are some feature bits that require no offline setup and can
      be enabled online. I've only reviewed extended irefs, but there will
      probably be more.
      
      We introduce three new ioctls:
      - BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features.
      - BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs
        basis, as well as for which features can be changed while mounted.
      - BTRFS_IOC_SET_FEATURES: change features on a per-fs basis.
      
      We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that
      allow us to define which features are safe to change at runtime.
      
      The failure modes for BTRFS_IOC_SET_FEATURES are as follows:
      - Enabling a completely unsupported feature: warns and returns -ENOTSUPP
      - Enabling a feature that can only be done offline: warns and returns -EPERM
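
The mask logic described above might look roughly like the following sketch. It is a simplified illustration under stated assumptions, not the patch's code: the function name, the enum and the way the request is expressed (a full desired flag word compared against the current one) are all hypothetical. FEATURE_UNSUPPORTED corresponds to the -ENOTSUPP case and FEATURE_OFFLINE_ONLY to the -EPERM case listed above.

```c
#include <assert.h>
#include <stdint.h>

enum feature_check {
    FEATURE_OK = 0,
    FEATURE_UNSUPPORTED,    /* message: warns and returns -ENOTSUPP */
    FEATURE_OFFLINE_ONLY    /* message: warns and returns -EPERM */
};

/*
 * Decide whether moving from 'current' to 'requested' feature bits is
 * allowed at runtime.  'supported' is everything the kernel knows about;
 * 'safe_set' and 'safe_clear' play the role of the _SAFE_SET and
 * _SAFE_CLEAR masks for one feature set.
 */
static enum feature_check check_feature_change(uint64_t current,
                                               uint64_t requested,
                                               uint64_t supported,
                                               uint64_t safe_set,
                                               uint64_t safe_clear)
{
    uint64_t to_set = requested & ~current;     /* bits being enabled */
    uint64_t to_clear = current & ~requested;   /* bits being disabled */

    assert((safe_set & ~supported) == 0);       /* safe bits must be known */

    if (to_set & ~supported)
        return FEATURE_UNSUPPORTED;
    if ((to_set & ~safe_set) || (to_clear & ~safe_clear))
        return FEATURE_OFFLINE_ONLY;
    return FEATURE_OK;
}
```

Keeping separate set and clear masks lets a feature be safe to turn on while mounted without implying that it is also safe to turn off.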
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      2eaa055f
  6. 28 Jan, 2014 6 commits