1. 02 4月, 2014 3 次提交
    • A
      ncpfs: switch to sockfd_lookup()/sockfd_put() · 44ba8406
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      44ba8406
    • A
      reduce m_start() cost... · c7999c36
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c7999c36
    • A
      smarter propagate_mnt() · f2ebb3a9
      Al Viro 提交于
      The current mainline has copies propagated to *all* nodes, then
      tears down the copies we made for nodes that do not contain
      counterparts of the desired mountpoint.  That sets the right
      propagation graph for the copies (at teardown time we move
      the slaves of removed node to a surviving peer or directly
      to master), but we end up paying a fairly steep price in
      useless allocations.  It's fairly easy to create a situation
      where N calls of mount(2) create exactly N bindings, with
      O(N^2) vfsmounts allocated and freed in process.
      
      Fortunately, it is possible to avoid those allocations/freeings.
      The trick is to create copies in the right order and find which
      one would've eventually become a master with the current algorithm.
      It turns out to be possible in O(nodes getting propagation) time
      and with no extra allocations at all.
      
      One part is that we need to make sure that eventual master will be
      created before its slaves, so we need to walk the propagation
      tree in a different order - by peer groups.  And iterate through
      the peers before dealing with the next group.
      
      Another thing is finding the (earlier) copy that will be a master
      of one we are about to create; to do that we are (temporary) marking
      the masters of mountpoints we are attaching the copies to.
      
      Either we are in a peer of the last mountpoint we'd dealt with,
      or we have the following situation: we are attaching to mountpoint M,
      the last copy S_0 had been attached to M_0 and there are sequences
      S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
      S_{i} mounted on M{i} and we need to create a slave of the first S_{k}
      such that M is getting propagation from M_{k}.  It means that the master
      of M_{k} will be among the sequence of masters of M.  On the
      other hand, the nearest marked node in that sequence will either
      be the master of M_{k} or the master of M_{k-1} (the latter -
      in the case if M_{k-1} is a slave of something M gets propagation
      from, but in a wrong peer group).
      
      So we go through the sequence of masters of M until we find
      a marked one (P).  Let N be the one before it.  Then we go through
      the sequence of masters of S_0 until we find one (say, S) mounted
      on a node D that has P as master and check if D is a peer of N.
      If it is, S will be the master of new copy, if not - the master of S
      will be.
      
      That's it for the hard part; the rest is fairly simple.  Iterator
      is in next_group(), handling of one prospective mountpoint is
      propagate_one().
      
      It seems to survive all tests and gives a noticably better performance
      than the current mainline for setups that are seriously using shared
      subtrees.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f2ebb3a9
  2. 31 3月, 2014 4 次提交
  3. 29 3月, 2014 1 次提交
    • S
      ocfs2: check if cluster name exists before deref · d9060742
      Sasha Levin 提交于
      Commit c74a3bdd ("ocfs2: add clustername to cluster connection") is
      trying to strlcpy a string which was explicitly passed as NULL in the
      very same patch, triggering a NULL ptr deref.
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: strlcpy (lib/string.c:388 lib/string.c:151)
        CPU: 19 PID: 19426 Comm: trinity-c19 Tainted: G        W     3.14.0-rc7-next-20140325-sasha-00014-g9476368-dirty #274
        RIP:  strlcpy (lib/string.c:388 lib/string.c:151)
        Call Trace:
         ocfs2_cluster_connect (fs/ocfs2/stackglue.c:350)
         ocfs2_cluster_connect_agnostic (fs/ocfs2/stackglue.c:396)
         user_dlm_register (fs/ocfs2/dlmfs/userdlm.c:679)
         dlmfs_mkdir (fs/ocfs2/dlmfs/dlmfs.c:503)
         vfs_mkdir (fs/namei.c:3467)
         SyS_mkdirat (fs/namei.c:3488 fs/namei.c:3472)
         tracesys (arch/x86/kernel/entry_64.S:749)
      
      akpm: this patch probably disables the feature.  A temporary thing to
      avoid triviel oopses.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9060742
  4. 28 3月, 2014 1 次提交
  5. 26 3月, 2014 2 次提交
  6. 23 3月, 2014 4 次提交
    • A
      rcuwalk: recheck mount_lock after mountpoint crossing attempts · b37199e6
      Al Viro 提交于
      We can get false negative from __lookup_mnt() if an unrelated vfsmount
      gets moved.  In that case legitimize_mnt() is guaranteed to fail,
      and we will fall back to non-RCU walk... unless we end up running
      into a hard error on a filesystem object we wouldn't have reached
      if not for that false negative.  IOW, delaying that check until
      the end of pathname resolution is wrong - we should recheck right
      after we attempt to cross the mountpoint.  We don't need to recheck
      unless we see d_mountpoint() being true - in that case even if
      we have just raced with mount/umount, we can simply go on as if
      we'd come at the moment when the sucker wasn't a mountpoint; if we
      run into a hard error as the result, it was a legitimate outcome.
      __lookup_mnt() returning NULL is different in that respect, since
      it might've happened due to operation on completely unrelated
      mountpoint.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b37199e6
    • A
      make prepend_name() work correctly when called with negative *buflen · e825196d
      Al Viro 提交于
      In all callchains leading to prepend_name(), the value left in *buflen
      is eventually discarded unused if prepend_name() has returned a negative.
      So we are free to do what prepend() does, and subtract from *buflen
      *before* checking for underflow (which turns into checking the sign
      of subtraction result, of course).
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e825196d
    • E
      vfs: Don't let __fdget_pos() get FMODE_PATH files · 99aea681
      Eric Biggers 提交于
      Commit bd2a31d5 ("get rid of fget_light()") introduced the
      __fdget_pos() function, which returns the resulting file pointer and
      fdput flags combined in an 'unsigned long'.  However, it also changed the
      behavior to return files with FMODE_PATH set, which shouldn't happen
      because read(), write(), lseek(), etc. aren't allowed on such files.
      This commit restores the old behavior.
      
      This regression actually had no effect on read() and write() since
      FMODE_READ and FMODE_WRITE are not set on file descriptors opened with
      O_PATH, but it did cause lseek() on a file descriptor opened with O_PATH
      to fail with ESPIPE rather than EBADF.
      Signed-off-by: NEric Biggers <ebiggers3@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      99aea681
    • E
      vfs: atomic f_pos access in llseek() · d7a15f8d
      Eric Biggers 提交于
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") changed
      several system calls to use fdget_pos() instead of fdget(), but missed
      sys_llseek().  Fix it.
      Signed-off-by: NEric Biggers <ebiggers3@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d7a15f8d
  7. 11 3月, 2014 2 次提交
  8. 10 3月, 2014 3 次提交
    • A
      get rid of fget_light() · bd2a31d5
      Al Viro 提交于
      instead of returning the flags by reference, we can just have the
      low-level primitive return those in lower bits of unsigned long,
      with struct file * derived from the rest.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      bd2a31d5
    • L
      vfs: atomic f_pos accesses as per POSIX · 9c225f26
      Linus Torvalds 提交于
      Our write() system call has always been atomic in the sense that you get
      the expected thread-safe contiguous write, but we haven't actually
      guaranteed that concurrent writes are serialized wrt f_pos accesses, so
      threads (or processes) that share a file descriptor and use "write()"
      concurrently would quite likely overwrite each others data.
      
      This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:
      
       "2.9.7 Thread Interactions with Regular File Operations
      
        All of the following functions shall be atomic with respect to each
        other in the effects specified in POSIX.1-2008 when they operate on
        regular files or symbolic links: [...]"
      
      and one of the effects is the file position update.
      
      This unprotected file position behavior is not new behavior, and nobody
      has ever cared.  Until now.  Yongzhi Pan reported unexpected behavior to
      Michael Kerrisk that was due to this.
      
      This resolves the issue with a f_pos-specific lock that is taken by
      read/write/lseek on file descriptors that may be shared across threads
      or processes.
      Reported-by: NYongzhi Pan <panyongzhi@gmail.com>
      Reported-by: NMichael Kerrisk <mtk.manpages@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9c225f26
    • A
      ocfs2 syncs the wrong range... · 1b56e989
      Al Viro 提交于
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1b56e989
  9. 06 3月, 2014 4 次提交
  10. 04 3月, 2014 3 次提交
    • V
      hfsplus: fix remount issue · bd2c0035
      Vyacheslav Dubeyko 提交于
      Current implementation of HFS+ driver has small issue with remount
      option.  Namely, for example, you are unable to remount from RO mode
      into RW mode by means of command "mount -o remount,rw /dev/loop0
      /mnt/hfsplus".  Trying to execute sequence of commands results in an
      error message:
      
        mount /dev/loop0 /mnt/hfsplus
        mount -o remount,ro /dev/loop0 /mnt/hfsplus
        mount -o remount,rw /dev/loop0 /mnt/hfsplus
      
        mount: you must specify the filesystem type
      
        mount -t hfsplus -o remount,rw /dev/loop0 /mnt/hfsplus
      
        mount: /mnt/hfsplus not mounted or bad option
      
      The reason of such issue is failure of mount syscall:
      
        mount("/dev/loop0", "/mnt/hfsplus", 0x2282a60, MS_MGC_VAL|MS_REMOUNT, NULL) = -1 EINVAL (Invalid argument)
      
      Namely, hfsplus_parse_options_remount() method receives empty "input"
      argument and return false in such case.  As a result, hfsplus_remount()
      returns -EINVAL error code.
      
      This patch fixes the issue by means of return true for the case of empty
      "input" argument in hfsplus_parse_options_remount() method.
      Signed-off-by: NVyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bd2c0035
    • J
      ocfs2: fix quota file corruption · 15c34a76
      Jan Kara 提交于
      Global quota files are accessed from different nodes.  Thus we cannot
      cache offset of quota structure in the quota file after we drop our node
      reference count to it because after that moment quota structure may be
      freed and reallocated elsewhere by a different node resulting in
      corruption of quota file.
      
      Fix the problem by clearing dq_off when we are releasing dquot structure.
      We also remove the DB_READ_B handling because it is useless -
      DQ_ACTIVE_B is set iff DQ_READ_B is set.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15c34a76
    • D
      mm: close PageTail race · 668f9abb
      David Rientjes 提交于
      Commit bf6bddf1 ("mm: introduce compaction and migration for
      ballooned pages") introduces page_count(page) into memory compaction
      which dereferences page->first_page if PageTail(page).
      
      This results in a very rare NULL pointer dereference on the
      aforementioned page_count(page).  Indeed, anything that does
      compound_head(), including page_count() is susceptible to racing with
      prep_compound_page() and seeing a NULL or dangling page->first_page
      pointer.
      
      This patch uses Andrea's implementation of compound_trans_head() that
      deals with such a race and makes it the default compound_head()
      implementation.  This includes a read memory barrier that ensures that
      if PageTail(head) is true that we return a head page that is neither
      NULL nor dangling.  The patch then adds a store memory barrier to
      prep_compound_page() to ensure page->first_page is set.
      
      This is the safest way to ensure we see the head page that we are
      expecting, PageTail(page) is already in the unlikely() path and the
      memory barriers are unfortunately required.
      
      Hugetlbfs is the exception, we don't enforce a store memory barrier
      during init since no race is possible.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      668f9abb
  11. 03 3月, 2014 1 次提交
  12. 02 3月, 2014 1 次提交
  13. 01 3月, 2014 1 次提交
  14. 25 2月, 2014 4 次提交
    • L
      sysfs: fix namespace refcnt leak · fed95bab
      Li Zefan 提交于
      As mount() and kill_sb() is not a one-to-one match, we shoudn't get
      ns refcnt unconditionally in sysfs_mount(), and instead we should
      get the refcnt only when kernfs_mount() allocated a new superblock.
      
      v2:
      - Changed the name of the new argument, suggested by Tejun.
      - Made the argument optional, suggested by Tejun.
      
      v3:
      - Make the new argument as second-to-last arg, suggested by Tejun.
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NTejun Heo <tj@kernel.org>
       ---
       fs/kernfs/mount.c      | 8 +++++++-
       fs/sysfs/mount.c       | 5 +++--
       include/linux/kernfs.h | 9 +++++----
       3 files changed, 15 insertions(+), 7 deletions(-)
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fed95bab
    • J
      fsnotify: Allocate overflow events with proper type · ff57cd58
      Jan Kara 提交于
      Commit 7053aee2 "fsnotify: do not share events between notification
      groups" used overflow event statically allocated in a group with the
      size of the generic notification event. This causes problems because
      some code looks at type specific parts of event structure and gets
      confused by a random data it sees there and causes crashes.
      
      Fix the problem by allocating overflow event with type corresponding to
      the group type so code cannot get confused.
      Signed-off-by: NJan Kara <jack@suse.cz>
      ff57cd58
    • J
      fanotify: Handle overflow in case of permission events · 482ef06c
      Jan Kara 提交于
      If the event queue overflows when we are handling permission event, we
      will never get response from userspace. So we must avoid waiting for it.
      Change fsnotify_add_notify_event() to return whether overflow has
      happened so that we can detect it in fanotify_handle_event() and act
      accordingly.
      Signed-off-by: NJan Kara <jack@suse.cz>
      482ef06c
    • J
      fsnotify: Fix detection whether overflow event is queued · 2513190a
      Jan Kara 提交于
      Currently we didn't initialize event's list head when we removed it from
      the event list. Thus a detection whether overflow event is already
      queued wasn't working. Fix it by always initializing the list head when
      deleting event from a list.
      Signed-off-by: NJan Kara <jack@suse.cz>
      2513190a
  15. 24 2月, 2014 2 次提交
  16. 22 2月, 2014 2 次提交
  17. 21 2月, 2014 2 次提交
    • J
      quota: Fix race between dqput() and dquot_scan_active() · 1362f4ea
      Jan Kara 提交于
      Currently last dqput() can race with dquot_scan_active() causing it to
      call callback for an already deactivated dquot. The race is as follows:
      
      CPU1					CPU2
        dqput()
          spin_lock(&dq_list_lock);
          if (atomic_read(&dquot->dq_count) > 1) {
           - not taken
          if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
            spin_unlock(&dq_list_lock);
            ->release_dquot(dquot);
              if (atomic_read(&dquot->dq_count) > 1)
               - not taken
      					  dquot_scan_active()
      					    spin_lock(&dq_list_lock);
      					    if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
      					     - not taken
      					    atomic_inc(&dquot->dq_count);
      					    spin_unlock(&dq_list_lock);
              - proceeds to release dquot
      					    ret = fn(dquot, priv);
      					     - called for inactive dquot
      
      Fix the problem by making sure possible ->release_dquot() is finished by
      the time we call the callback and new calls to it will notice reference
      dquot_scan_active() has taken and bail out.
      
      CC: stable@vger.kernel.org # >= 2.6.29
      Signed-off-by: NJan Kara <jack@suse.cz>
      1362f4ea
    • J
      udf: Fix data corruption on file type conversion · 09ebb17a
      Jan Kara 提交于
      UDF has two types of files - files with data stored in inode (ICB in
      UDF terminology) and files with data stored in external data blocks. We
      convert file from in-inode format to external format in
      udf_file_aio_write() when we find out data won't fit into inode any
      longer. However the following race between two O_APPEND writes can happen:
      
      CPU1					CPU2
      udf_file_aio_write()			udf_file_aio_write()
        down_write(&iinfo->i_data_sem);
        checks that i_size + count1 fits within inode
          => no need to convert
        up_write(&iinfo->i_data_sem);
      					  down_write(&iinfo->i_data_sem);
      					  checks that i_size + count2 fits
      					    within inode => no need to convert
      					  up_write(&iinfo->i_data_sem);
        generic_file_aio_write()
          - extends file by count1 bytes
      					  generic_file_aio_write()
      					    - extends file by count2 bytes
      
      Clearly if count1 + count2 doesn't fit into the inode, we overwrite
      kernel buffers beyond inode, possibly corrupting the filesystem as well.
      
      Fix the problem by acquiring i_mutex before checking whether write fits
      into the inode and using __generic_file_aio_write() afterwards which
      puts check and write into one critical section.
      Reported-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NJan Kara <jack@suse.cz>
      09ebb17a