1. 22 1月, 2014 2 次提交
    • J
      fsnotify: do not share events between notification groups · 7053aee2
      Jan Kara 提交于
      Currently fsnotify framework creates one event structure for each
      notification event and links this event into all interested notification
      groups.  This is done so that we save memory when several notification
      groups are interested in the event.  However the need for event
      structure shared between inotify & fanotify bloats the event structure
      so the result is often higher memory consumption.
      
      Another problem is that fsnotify framework keeps path references with
      outstanding events so that fanotify can return open file descriptors
      with its events.  This has the undesirable effect that filesystem cannot
      be unmounted while there are outstanding events - a regression for
      inotify compared to a situation before it was converted to fsnotify
      framework.  For fanotify this problem is hard to avoid and users of
      fanotify should kind of expect this behavior when they ask for file
      descriptors from notified files.
      
      This patch changes fsnotify and its users to create separate event
      structure for each group.  This allows for much simpler code (~400 lines
      removed by this patch) and also smaller event structures.  For example
      on 64-bit system original struct fsnotify_event consumes 120 bytes, plus
      additional space for file name, additional 24 bytes for second and each
      subsequent group linking the event, and additional 32 bytes for each
      inotify group for private data.  After the conversion inotify event
      consumes 48 bytes plus space for file name which is considerably less
      memory unless file names are long and there are several groups
      interested in the events (both of which are uncommon).  Fanotify event
      fits in 56 bytes after the conversion (fanotify doesn't care about file
      names so its events don't have to have it allocated).  A win unless
      there are four or more fanotify groups interested in the event.
      
      The conversion also solves the problem with unmount when only inotify is
      used as we don't have to grab path references for inotify events.
      
      [hughd@google.com: fanotify: fix corruption preventing startup]
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7053aee2
    • J
      inotify: provide function for name length rounding · e9fe6904
      Jan Kara 提交于
      Rounding of name length when passing it to userspace was done in several
      places.  Provide a function to do it and use it in all places.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9fe6904
  2. 18 1月, 2014 1 次提交
    • T
      kernfs: associate a new kernfs_node with its parent on creation · db4aad20
      Tejun Heo 提交于
      Once created, a kernfs_node is always destroyed by kernfs_put().
      Since ba7443bc ("sysfs, kernfs: implement
      kernfs_create/destroy_root()"), kernfs_put() depends on kernfs_root()
      to locate the ino_ida.  kernfs_root() in turn depends on
      kernfs_node->parent being set for !dir nodes.  This means that
      kernfs_put() of a !dir node requires its ->parent to be initialized.
      
      This leads to oops when a newly created !dir node is destroyed without
      going through kernfs_add_one() or after failing kernfs_add_one()
      before ->parent is set.  kernfs_root() invoked from kernfs_put() will
      try to dereference NULL parent.
      
      Fix it by moving parent association to kernfs_new_node() from
      kernfs_add_one().  kernfs_new_node() now takes @parent instead of
      @root and determines the root from the parent and also sets the new
      node's parent properly.  @parent parameter is removed from
      kernfs_add_one().  As there's no parent when creating the root node,
      __kernfs_new_node() which takes @root as before and doesn't set the
      parent is used in that case.
      
      This ensures that a kernfs_node in any stage in its life has its
      parent associated and thus can be put.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      db4aad20
  3. 15 1月, 2014 2 次提交
    • A
      nilfs2: fix segctor bug that causes file system corruption · 70f2fe3a
      Andreas Rohner 提交于
      There is a bug in the function nilfs_segctor_collect, which results in
      active data being written to a segment, that is marked as clean.  It is
      possible, that this segment is selected for a later segment
      construction, whereby the old data is overwritten.
      
      The problem shows itself with the following kernel log message:
      
        nilfs_sufile_do_cancel_free: segment 6533 must be clean
      
      Usually a few hours later the file system gets corrupted:
      
        NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0
        NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660)
      
      The issue can be reproduced with a file system that is nearly full and
      with the cleaner running, while some IO intensive task is running.
      Although it is quite hard to reproduce.
      
      This is what happens:
      
       1. The cleaner starts the segment construction
       2. nilfs_segctor_collect is called
       3. sc_stage is on NILFS_ST_SUFILE and segments are freed
       4. sc_stage is on NILFS_ST_DAT current segment is full
       5. nilfs_segctor_extend_segments is called, which
          allocates a new segment
       6. The new segment is one of the segments freed in step 3
       7. nilfs_sufile_cancel_freev is called and produces an error message
       8. Loop around and the collection starts again
       9. sc_stage is on NILFS_ST_SUFILE and segments are freed
          including the newly allocated segment, which will contain active
          data and can be allocated at a later time
      10. A few hours later another segment construction allocates the
          segment and causes file system corruption
      
      This can be prevented by simply reordering the statements.  If
      nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments
      the freed segments are marked as dirty and cannot be allocated any more.
      Signed-off-by: NAndreas Rohner <andreas.rohner@gmx.net>
      Reviewed-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: NAndreas Rohner <andreas.rohner@gmx.net>
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70f2fe3a
    • T
      kernfs: fix get_active failure handling in kernfs_seq_*() · bb305947
      Tejun Heo 提交于
      When kernfs_seq_start() fails to obtain an active reference, it
      returns ERR_PTR(-ENODEV).  kernfs_seq_stop() is then invoked with the
      error pointer value; however, it still proceeds to invoke
      kernfs_put_active() on the node leading to unbalanced put.
      
      If kernfs_seq_stop() is called even after active ref failure, it
      should skip invocation of @ops->seq_stop() and put_active.
      Unfortunately, this is a bit complicated because active ref failure
      isn't the only thing which may fail with ERR_PTR(-ENODEV).
      @ops->seq_start/next() may also fail with the error value and
      kernfs_seq_stop() doesn't have a way to tell apart those failures.
      
      Work it around by factoring out the active part of kernfs_seq_stop()
      into kernfs_seq_stop_active() and invoking it directly if
      @ops->seq_start/next() fail with ERR_PTR(-ENODEV) and updating
      kernfs_seq_stop() to skip kernfs_seq_stop_active() on
      ERR_PTR(-ENODEV).  This is a bit nasty but ensures that the active put
      is skipped iff get_active failed in kernfs_seq_start().
      
      tj: This was originally committed as d92d2e6b but got reverted by
          683bb276 along with other kernfs self removal patches.
          However, this one is an independent fix and shouldn't have been
          reverted together.  Reinstate the change.  Sorry about the mess.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bb305947
  4. 14 1月, 2014 12 次提交
  5. 12 1月, 2014 1 次提交
    • T
      kernfs: remove unnecessary NULL check in __kernfs_remove() · 88533f99
      Tejun Heo 提交于
      895a068a ("kernfs: make kernfs_get_active() block if the node is
      deactivated but not removed") added "struct kernfs_root *root =
      kernfs_root(kn);" at the head of the function; however, the parameter
      @kn is checked for later implying that the function may be called with
      NULL.  This means that we may end up invoking kernfs_root() with NULL
      which will oops.  None of the existing users invokes removal with NULL
      @kn, so this bug doesn't actually trigger.
      
      We can relocate kernfs_root() invocation after NULL check; however,
      allowing NULL param tends to cause more confusion than actually
      helping anything.  As there's no existing user, let's remove the
      spurious NULL check.
      
      This bug was detected by smatch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      88533f99
  6. 11 1月, 2014 13 次提交
    • T
      sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner() · d1ba277e
      Tejun Heo 提交于
      All device_schedule_callback_owner() users are converted to use
      device_remove_file_self().  Remove now unused
      {sysfs|device}_schedule_callback_owner().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d1ba277e
    • T
      kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers · 1ae06819
      Tejun Heo 提交于
      Sometimes it's necessary to implement a node which wants to delete
      nodes including itself.  This isn't straightforward because of kernfs
      active reference.  While a file operation is in progress, an active
      reference is held and kernfs_remove() waits for all such references to
      drain before completing.  For a self-deleting node, this is a deadlock
      as kernfs_remove() ends up waiting for an active reference that itself
      is sitting on top of.
      
      This currently is worked around in the sysfs layer using
      sysfs_schedule_callback() which makes such removals asynchronous.
      While it works, it's rather cumbersome and inherently breaks
      synchronicity of the operation - the file operation which triggered
      the operation may complete before the removal is finished (or even
      started) and the removal may fail asynchronously.  If a removal
      operation is immmediately followed by another operation which expects
      the specific name to be available (e.g. removal followed by rename
      onto the same name), there's no way to make the latter operation
      reliable.
      
      The thing is there's no inherent reason for this to be asynchrnous.
      All that's necessary to do this synchronous is a dedicated operation
      which drops its own active ref and deactivates self.  This patch
      implements kernfs_remove_self() and its wrappers in sysfs and driver
      core.  kernfs_remove_self() is to be called from one of the file
      operations, drops the active ref and deactivates using
      __kernfs_deactivate_self(), removes the self node, and restores active
      ref to the dead node using __kernfs_reactivate_self() so that the ref
      is balanced afterwards.  __kernfs_remove() is updated so that it takes
      an early exit if the target node is already fully removed so that the
      active ref restored by kernfs_remove_self() after removal doesn't
      confuse the deactivation path.
      
      This makes implementing self-deleting nodes very easy.  The normal
      removal path doesn't even need to be changed to use
      kernfs_remove_self() for the self-deleting node.  The method can
      invoke kernfs_remove_self() on itself before proceeding the normal
      removal path.  kernfs_remove() invoked on the node by the normal
      deletion path will simply be ignored.
      
      This will replace sysfs_schedule_callback().  A subtle feature of
      sysfs_schedule_callback() is that it collapses multiple invocations -
      even if multiple removals are triggered, the removal callback is run
      only once.  An equivalent effect can be achieved by testing the return
      value of kernfs_remove_self() - only the one which gets %true return
      value should proceed with actual deletion.  All other instances of
      kernfs_remove_self() will wait till the enclosing kernfs operation
      which invoked the winning instance of kernfs_remove_self() finishes
      and then return %false.  This trivially makes all users of
      kernfs_remove_self() automatically show correct synchronous behavior
      even when there are multiple concurrent operations - all "echo 1 >
      delete" instances will finish only after the whole operation is
      completed by one of the instances.
      
      v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing
          and sysfs_remove_file_self() had incorrect return type.  Fix it.
          Reported by kbuild test bot.
      
      v3: Updated to use __kernfs_{de|re}activate_self().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1ae06819
    • T
      kernfs: implement kernfs_{de|re}activate[_self]() · 9f010c2a
      Tejun Heo 提交于
      This patch implements four functions to manipulate deactivation state
      - deactivate, reactivate and the _self suffixed pair.  A new fields
      kernfs_node->deact_depth is added so that concurrent and nested
      deactivations are handled properly.  kernfs_node->hash is moved so
      that it's paired with the new field so that it doesn't increase the
      size of kernfs_node.
      
      A kernfs user's lock would normally nest inside active ref but during
      removal the user may want to perform kernfs_remove() while holding the
      said lock, which would introduce a reverse locking dependency.  This
      function can be used to break such reverse dependency by allowing
      deactivation step to performed separately outside user's critical
      section.
      
      This will also be used implement kernfs_remove_self().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9f010c2a
    • T
      kernfs: make kernfs_get_active() block if the node is deactivated but not removed · 895a068a
      Tejun Heo 提交于
      Currently, kernfs_get_active() fails if the target node is
      deactivated.  This is fine as a node always gets removed after
      deactivation; however, we're gonna add reactivation so the assumption
      won't hold.  It'd be incorrect for kernfs_get_active() to fail for a
      node which was deactivated only temporarily.
      
      This patch makes kernfs_get_active() block if the node is deactivated
      but not removed.  If the node gets reactivated (not yet implemented),
      it will be retried and succeed.  If the node gets removed, it will be
      woken up and fail.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      895a068a
    • T
      kernfs: remove kernfs_addrm_cxt · 99177a34
      Tejun Heo 提交于
      kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
      added because there were operations which should be performed outside
      kernfs_mutex after adding and removing kernfs_nodes.  The necessary
      operations were recorded in kernfs_addrm_cxt and performed by
      kernfs_addrm_finish(); however, after the recent changes which
      relocated deactivation and unmapping so that they're performed
      directly during removal, the only operation kernfs_addrm_finish()
      performs is kernfs_put(), which can be moved inside the removal path
      too.
      
      This patch moves the kernfs_put() of the base ref to __kernfs_remove()
      and remove kernfs_addrm_cxt and kernfs_addrm_start/finish().
      
      * kernfs_add_one() is updated to grab and release the parent's active
        ref and kernfs_mutex itself.  kernfs_get/put_active() and
        kernfs_addrm_start/finish() invocations around it are removed from
        all users.
      
      * __kernfs_remove() puts an unlinked node directly instead of chaining
        it to kernfs_addrm_cxt.  Its callers are updated to grab and release
        kernfs_mutex instead of calling kernfs_addrm_start/finish() around
        it.
      
      v2: Updated to fit the v2 restructuring of removal path.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      99177a34
    • T
      kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove() · f601f9a2
      Tejun Heo 提交于
      kernfs_unmap_bin_file() is supposed to unmap all memory mappings of
      the target file before kernfs_remove() finishes; however, it currently
      is being called from kernfs_addrm_finish() and has the same race
      problem as the original implementation of deactivation when there are
      multiple removers - only the remover which snatches the node to its
      addrm_cxt->removed list is guaranteed to wait for its completion
      before returning.
      
      It can be fixed by moving kernfs_unmap_bin_file() invocation from
      kernfs_addrm_finish() to __kernfs_remove().  The function may be
      called multiple times but that shouldn't do any harm.
      
      We end up dropping kernfs_mutex in the removal loop and the node may
      be removed inbetween by someone else.  kernfs_unlink_sibling() is
      updated to test whether the node has already been removed and return
      accordingly.  __kernfs_remove() in turn performs post-unlinking
      cleanup only if it actually unlinked the node.
      
      KERNFS_HAS_MMAP test is moved out of the unmap function into
      __kernfs_remove() so that we don't unlock kernfs_mutex unnecessarily.
      While at it, drop the now meaningless "bin" qualifier from the
      function name.
      
      v2: Rewritten to fit the v2 restructuring of removal path.  HAS_MMAP
          test relocated.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f601f9a2
    • T
      kernfs: restructure removal path to fix possible premature return · 45a140e5
      Tejun Heo 提交于
      The recursive nature of kernfs_remove() means that, even if
      kernfs_remove() is not allowed to be called multiple times on the same
      node, there may be race conditions between removal of parent and its
      descendants.  While we can claim that kernfs_remove() shouldn't be
      called on one of the descendants while the removal of an ancestor is
      in progress, such rule is unnecessarily restrictive and very difficult
      to enforce.  It's better to simply allow invoking kernfs_remove() as
      the caller sees fit as long as the caller ensures that the node is
      accessible.
      
      The current behavior in such situations is broken.  Whoever enters
      removal path first takes the node off the hierarchy and then
      deactivates.  Following removers either return as soon as it notices
      that it's not the first one or can't even find the target node as it
      has already been removed from the hierarchy.  In both cases, the
      following removers may finish prematurely while the nodes which should
      be removed and drained are still being processed by the first one.
      
      This patch restructures so that multiple removers, whether through
      recursion or direction invocation, always follow the following rules.
      
      * When there are multiple concurrent removers, only one puts the base
        ref.
      
      * Regardless of which one puts the base ref, all removers are blocked
        until the target node is fully deactivated and removed.
      
      To achieve the above, removal path now first deactivates the subtree,
      drains it and then unlinks one-by-one.  __kernfs_deactivate() is
      called directly from __kernfs_removal() and drops and regrabs
      kernfs_mutex for each descendant to drain active refs.  As this means
      that multiple removers can enter __kernfs_deactivate() for the same
      node, the function is updated so that it can handle multiple
      deactivators of the same node - only one actually deactivates but all
      wait till drain completion.
      
      The restructured removal path guarantees that a removed node gets
      unlinked only after the node is deactivated and drained.  Combined
      with proper multiple deactivator handling, this guarantees that any
      invocation of kernfs_remove() returns only after the node itself and
      all its descendants are deactivated, drained and removed.
      
      v2: Draining separated into a separate loop (used to be in the same
          loop as unlink) and done from __kernfs_deactivate().  This is to
          allow exposing deactivation as a separate interface later.
      
          Root node removal was broken in v1 patch.  Fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      45a140e5
    • T
      kernfs: remove KERNFS_REMOVED · ae34372e
      Tejun Heo 提交于
      KERNFS_REMOVED is used to mark half-initialized and dying nodes so
      that they don't show up in lookups and deny adding new nodes under or
      renaming it; however, its role overlaps those of deactivation and
      removal from rbtree.
      
      It's necessary to deny addition of new children while removal is in
      progress; however, this role considerably intersects with deactivation
      - KERNFS_REMOVED prevents new children while deactivation prevents new
      file operations.  There's no reason to have them separate making
      things more complex than necessary.
      
      KERNFS_REMOVED is also used to decide whether a node is still visible
      to vfs layer, which is rather redundant as equivalent determination
      can be made by testing whether the node is on its parent's children
      rbtree or not.
      
      This patch removes KERNFS_REMOVED.
      
      * Instead of KERNFS_REMOVED, each node now starts its life
        deactivated.  This means that we now use both atomic_add() and
        atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN.  The compiler
        generates an overflow warnings when negating INT_MIN as the negation
        can't be represented as a positive number.  Nothing is actually
        broken but let's bump BIAS by one to avoid the warnings for archs
        which negates the subtrahend..
      
      * KERNFS_REMOVED tests in add and rename paths are replaced with
        kernfs_get/put_active() of the target nodes.  Due to the way the add
        path is structured now, active ref handling is done in the callers
        of kernfs_add_one().  This will be consolidated up later.
      
      * kernfs_remove_one() is updated to deactivate instead of setting
        KERNFS_REMOVED.  This removes deactivation from kernfs_deactivate(),
        which is now renamed to kernfs_drain().
      
      * kernfs_dop_revalidate() now tests RB_EMPTY_NODE(&kn->rb) instead of
        KERNFS_REMOVED and KERNFS_REMOVED test in kernfs_dir_pos() is
        dropped.  A node which is removed from the children rbtree is not
        included in the iteration in the first place.  This means that a
        node may be visible through vfs a bit longer - it's now also visible
        after deactivation until the actual removal.  This slightly enlarged
        window difference doesn't make any difference to the userland.
      
      * Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
        checks on the active ref.
      
      * Some comment style updates in the affected area.
      
      v2: Reordered before removal path restructuring.  kernfs_active()
          dropped and kernfs_get/put_active() used instead.  RB_EMPTY_NODE()
          used in the lookup paths.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ae34372e
    • T
      kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep() · a69d001c
      Tejun Heo 提交于
      There currently are two mechanisms gating active ref lockdep
      annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask.
      The former disables lockdep annotations in kernfs_get/put_active()
      while the latter disables all of kernfs_deactivate().
      
      While KERNFS_ACTIVE_REF also behaves as an optimization to skip the
      deactivation step for non-file nodes, the benefit is marginal and it
      needlessly diverges code paths.  Let's drop KERNFS_ACTIVE_REF and use
      KERNFS_LOCKDEP in kernfs_deactivate() too.
      
      While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP
      flag so that it's more convenient and the related code can be compiled
      out when not enabled.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a69d001c
    • T
      kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq · ea1c472d
      Tejun Heo 提交于
      kernfs_node->u.completion is used to notify deactivation completion
      from kernfs_put_active() to kernfs_deactivate().  We now allow
      multiple racing removals of the same node and the current removal
      scheme is no longer correct - kernfs_remove() invocation may return
      before the node is properly deactivated if it races against another
      removal.  The removal path will be restructured to address the issue.
      
      To help such restructure which requires supporting multiple waiters,
      this patch replaces kernfs_node->u.completion with
      kernfs_root->deactivate_waitq.  This makes deactivation event
      notifications share a per-root waitqueue_head; however, the wait path
      is quite cold and this will also allow shaving one pointer off
      kernfs_node.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ea1c472d
    • T
      kernfs: fix get_active failure handling in kernfs_seq_*() · d92d2e6b
      Tejun Heo 提交于
      When kernfs_seq_start() fails to obtain an active reference, it
      returns ERR_PTR(-ENODEV).  kernfs_seq_stop() is then invoked with the
      error pointer value; however, it still proceeds to invoke
      kernfs_put_active() on the node leading to unbalanced put.
      
      If kernfs_seq_stop() is called even after active ref failure, it
      should skip invocation of @ops->seq_stop() and put_active.
      Unfortunately, this is a bit complicated because active ref failure
      isn't the only thing which may fail with ERR_PTR(-ENODEV).
      @ops->seq_start/next() may also fail with the error value and
      kernfs_seq_stop() doesn't have a way to tell apart those failures.
      
      Work it around by factoring out the active part of kernfs_seq_stop()
      into kernfs_seq_stop_active() and invoking it directly if
      @ops->seq_start/next() fail with ERR_PTR(-ENODEV) and updating
      kernfs_seq_stop() to skip kernfs_seq_stop_active() on
      ERR_PTR(-ENODEV).  This is a bit nasty but ensures that the active put
      is skipped iff get_active failed in kernfs_seq_start().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d92d2e6b
    • C
      xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK() · 1f4a63bf
      Chuansheng Liu 提交于
      In case CONFIG_DEBUG_OBJECTS_WORK is defined, it is needed to
      call destroy_work_on_stack() which frees the debug object to pair
      with INIT_WORK_ONSTACK().
      Signed-off-by: NLiu, Chuansheng <chuansheng.liu@intel.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 6f96b306)
      1f4a63bf
    • J
      xfs: fix off-by-one error in xfs_attr3_rmt_verify · bba719b5
      Jie Liu 提交于
      With CRC check is enabled, if trying to set an attributes value just
      equal to the maximum size of XATTR_SIZE_MAX would cause the v3 remote
      attr write verification procedure failure, which would yield the back
      trace like below:
      
      <snip>
      XFS (sda7): Internal error xfs_attr3_rmt_write_verify at line 191 of file fs/xfs/xfs_attr_remote.c
      <snip>
      Call Trace:
      [<ffffffff816f0042>] dump_stack+0x45/0x56
      [<ffffffffa0d99c8b>] xfs_error_report+0x3b/0x40 [xfs]
      [<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
      [<ffffffffa0d99ce5>] xfs_corruption_error+0x55/0x80 [xfs]
      [<ffffffffa0dbef6b>] xfs_attr3_rmt_write_verify+0x14b/0x1a0 [xfs]
      [<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
      [<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
      [<ffffffffa0d96edd>] _xfs_buf_ioapply+0x6d/0x390 [xfs]
      [<ffffffff81184cda>] ? vm_map_ram+0x31a/0x460
      [<ffffffff81097230>] ? wake_up_state+0x20/0x20
      [<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
      [<ffffffffa0d9726b>] xfs_buf_iorequest+0x6b/0xc0 [xfs]
      [<ffffffffa0d97315>] xfs_bdstrat_cb+0x55/0xb0 [xfs]
      [<ffffffffa0d97906>] xfs_bwrite+0x46/0x80 [xfs]
      [<ffffffffa0dbfa94>] xfs_attr_rmtval_set+0x334/0x490 [xfs]
      [<ffffffffa0db84aa>] xfs_attr_leaf_addname+0x24a/0x410 [xfs]
      [<ffffffffa0db8893>] xfs_attr_set_int+0x223/0x470 [xfs]
      [<ffffffffa0db8b76>] xfs_attr_set+0x96/0xb0 [xfs]
      [<ffffffffa0db13b2>] xfs_xattr_set+0x42/0x70 [xfs]
      [<ffffffff811df9b2>] generic_setxattr+0x62/0x80
      [<ffffffff811e0213>] __vfs_setxattr_noperm+0x63/0x1b0
      [<ffffffff81307afe>] ? evm_inode_setxattr+0xe/0x10
      [<ffffffff811e0415>] vfs_setxattr+0xb5/0xc0
      [<ffffffff811e054e>] setxattr+0x12e/0x1c0
      [<ffffffff811c6e82>] ? final_putname+0x22/0x50
      [<ffffffff811c708b>] ? putname+0x2b/0x40
      [<ffffffff811cc4bf>] ? user_path_at_empty+0x5f/0x90
      [<ffffffff811bdfd9>] ? __sb_start_write+0x49/0xe0
      [<ffffffff81168589>] ? vm_mmap_pgoff+0x99/0xc0
      [<ffffffff811e07df>] SyS_setxattr+0x8f/0xe0
      [<ffffffff81700c2d>] system_call_fastpath+0x1a/0x1f
      
      Tests:
          setfattr -n user.longxattr -v `perl -e 'print "A"x65536'` testfile
      
      This patch fix it to check the remote EA size is greater than the
      XATTR_SIZE_MAX rather than more than or equal to it, because it's
      valid if the specified EA value size is equal to the limitation as
      per VFS setxattr interface.
      Signed-off-by: NJie Liu <jeff.liu@oracle.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      (cherry picked from commit 85dd0707)
      bba719b5
  7. 07 1月, 2014 1 次提交
    • E
      ext4: fix bigalloc regression · d0abafac
      Eric Whitney 提交于
      Commit f5a44db5 introduced a regression on filesystems created with
      the bigalloc feature (cluster size > blocksize).  It causes xfstests
      generic/006 and /013 to fail with an unexpected JBD2 failure and
      transaction abort that leaves the test file system in a read only state.
      Other xfstests run on bigalloc file systems are likely to fail as well.
      
      The cause is the accidental use of a cluster mask where a cluster
      offset was needed in ext4_ext_map_blocks().
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      d0abafac
  8. 03 1月, 2014 1 次提交
    • J
      epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL · 4ff36ee9
      Jason Baron 提交于
      The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock.
      That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with
      epoll_ctl(b, EPOLL_CTL_DEL, a, x).  The deadlock was introduced with
      commmit 67347fe4 ("epoll: do not take global 'epmutex' for simple
      topologies").
      
      The acquistion of the ep->mtx for the destination 'ep' was added such
      that a concurrent EPOLL_CTL_ADD operation would see the correct state of
      the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links')
      
      However, by simply not acquiring the lock, we do not serialize behind
      the ep->mtx from the add path, and thus may perform a full path check
      when if we had waited a little longer it may not have been necessary.
      However, this is a transient state, and performing the full loop
      checking in this case is not harmful.
      
      The important point is that we wouldn't miss doing the full loop
      checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
      its operating upon.  The reason we don't need to do lock ordering in the
      add path, is that we are already are holding the global 'epmutex'
      whenever we do the double lock.  Further, the original posting of this
      patch, which was tested for the intended performance gains, did not
      perform this additional locking.
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Cc: Nathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4ff36ee9
  9. 02 1月, 2014 1 次提交
  10. 28 12月, 2013 3 次提交
  11. 23 12月, 2013 1 次提交
    • L
      aio: clean up and fix aio_setup_ring page mapping · 3dc9acb6
      Linus Torvalds 提交于
      Since commit 36bc08cc ("fs/aio: Add support to aio ring pages
      migration") the aio ring setup code has used a special per-ring backing
      inode for the page allocations, rather than just using random anonymous
      pages.
      
      However, rather than remembering the pages as it allocated them, it
      would allocate the pages, insert them into the file mapping (dirty, so
      that they couldn't be free'd), and then forget about them.  And then to
      look them up again, it would mmap the mapping, and then use
      "get_user_pages()" to get back an array of the pages we just created.
      
      Now, not only is that incredibly inefficient, it also leaked all the
      pages if the mmap failed (which could happen due to excessive number of
      mappings, for example).
      
      So clean it all up, making it much more straightforward.  Also remove
      some left-overs of the previous (broken) mm_populate() usage that was
      removed in commit d6c355c7 ("aio: fix race in ring buffer page
      lookup introduced by page migration support") but left the pointless and
      now misleading MAP_POPULATE flag around.
      Tested-and-acked-by: NBenjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3dc9acb6
  12. 22 12月, 2013 2 次提交
    • B
      aio/migratepages: make aio migrate pages sane · 8e321fef
      Benjamin LaHaise 提交于
      The arbitrary restriction on page counts offered by the core
      migrate_page_move_mapping() code results in rather suspicious looking
      fiddling with page reference counts in the aio_migratepage() operation.
      To fix this, make migrate_page_move_mapping() take an extra_count parameter
      that allows aio to tell the code about its own reference count on the page
      being migrated.
      
      While cleaning up aio_migratepage(), make it validate that the old page
      being passed in is actually what aio_migratepage() expects to prevent
      misbehaviour in the case of races.
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      8e321fef
    • B
      aio: fix kioctx leak introduced by "aio: Fix a trinity splat" · 1881686f
      Benjamin LaHaise 提交于
      e34ecee2 reworked the percpu reference
      counting to correct a bug trinity found.  Unfortunately, the change lead
      to kioctxes being leaked because there was no final reference count to
      put.  Add that reference count back in to fix things.
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
      1881686f