1. 24 12月, 2014 1 次提交
  2. 19 12月, 2014 6 次提交
    • J
      ocfs2: fix journal commit deadlock · 136f49b9
      Junxiao Bi 提交于
      For buffer write, page lock will be got in write_begin and released in
      write_end, in ocfs2_write_end_nolock(), before it unlock the page in
      ocfs2_free_write_ctxt(), it calls ocfs2_run_deallocs(), this will ask
      for the read lock of journal->j_trans_barrier.  Holding page lock and
      ask for journal->j_trans_barrier breaks the locking order.
      
      This will cause a deadlock with journal commit threads, ocfs2cmt will
      get write lock of journal->j_trans_barrier first, then it wakes up
      kjournald2 to do the commit work, at last it waits until done.  To
      commit journal, kjournald2 needs flushing data first, it needs get the
      cache page lock.
      
      Since some ocfs2 cluster locks are holding by write process, this
      deadlock may hung the whole cluster.
      
      unlock pages before ocfs2_run_deallocs() can fix the locking order, also
      put unlock before ocfs2_commit_trans() to make page lock is unlocked
      before j_trans_barrier to preserve unlocking order.
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: NWengang Wang <wen.gang.wang@oracle.com>
      Cc: <stable@vger.kernel.org>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      136f49b9
    • J
      ocfs2/dlm: fix race between dispatched_work and dlm_lockres_grab_inflight_worker · 1e589581
      Joseph Qi 提交于
      Commit ac4fef4d ("ocfs2/dlm: do not purge lockres that is queued for
      assert master") may have the following possible race case:
      
        dlm_dispatch_assert_master       dlm_wq
        ========================================================================
        queue_work(dlm->quedlm_worker,
            &dlm->dispatched_work);
                                       dispatch work,
                                       dlm_lockres_drop_inflight_worker
                                       *BUG_ON(res->inflight_assert_workers == 0)*
        dlm_lockres_grab_inflight_worker
        inflight_assert_workers++
      
      So ensure inflight_assert_workers to be increased first.
      Signed-off-by: NJoseph Qi <joseph.qi@huawei.com>
      Signed-off-by: NXue jiufei <xuejiufei@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e589581
    • J
      ocfs2: reflink: fix slow unlink for refcounted file · f62f12b3
      Junxiao Bi 提交于
      When running ocfs2 test suite multiple nodes reflink stress test, for a
      4 nodes cluster, every unlink() for refcounted file needs about 700s.
      
      The slow unlink is caused by the contention of refcount tree lock since
      all nodes are unlink files using the same refcount tree.  When the
      unlinking file have many extents(over 1600 in our test), most of the
      extents has refcounted flag set.  In ocfs2_commit_truncate(), it will
      execute the following call trace for every extents.  This means it needs
      get and released refcount tree lock about 1600 times.  And when several
      nodes are do this at the same time, the performance will be very low.
      
        ocfs2_remove_btree_range()
        --  ocfs2_lock_refcount_tree()
        ----  ocfs2_refcount_lock()
        ------  __ocfs2_cluster_lock()
      
      ocfs2_refcount_lock() is costly, move it to ocfs2_commit_truncate() to
      do lock/unlock once can improve a lot performance.
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Wengang <wen.gang.wang@oracle.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f62f12b3
    • P
      fs/proc/meminfo.c: include cma info in proc/meminfo · 47f8f929
      Pintu Kumar 提交于
      This patch include CMA info (CMATotal, CMAFree) in /proc/meminfo.
      Currently, in a CMA enabled system, if somebody wants to know the total
      CMA size declared, there is no way to tell, other than the dmesg or
      /var/log/messages logs.
      
      With this patch we are showing the CMA info as part of meminfo, so that it
      can be determined at any point of time.  This will be populated only when
      CMA is enabled.
      
      Below is the sample output from a ARM based device with RAM:512MB and CMA:16MB.
      
        MemTotal:         471172 kB
        MemFree:          111712 kB
        MemAvailable:     271172 kB
        .
        .
        .
        CmaTotal:          16384 kB
        CmaFree:            6144 kB
      
      This patch also fix below checkpatch errors that were found during these changes.
      
        ERROR: space required after that ',' (ctx:ExV)
        199: FILE: fs/proc/meminfo.c:199:
        +       ,atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
                ^
      
        ERROR: space required after that ',' (ctx:ExV)
        202: FILE: fs/proc/meminfo.c:202:
        +       ,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
                ^
      
        ERROR: space required after that ',' (ctx:ExV)
        206: FILE: fs/proc/meminfo.c:206:
        +       ,K(totalcma_pages)
                ^
      
        total: 3 errors, 0 warnings, 2 checks, 236 lines checked
      Signed-off-by: NPintu Kumar <pintu.k@samsung.com>
      Signed-off-by: NVishnu Pratap Singh <vishnu.ps@samsung.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      47f8f929
    • S
      hfsplus: fix longname handling · 89ac9b4d
      Sougata Santra 提交于
      Longname is not correctly handled by hfsplus driver.  If an attempt to
      create a longname(>255) file/directory is made, it succeeds by creating a
      file/directory with HFSPLUS_MAX_STRLEN and incorrect catalog key.  Thus
      leaving the volume in an inconsistent state.  This patch fixes this issue.
      
      Although lookup is always called first to create a negative entry, so just
      doing a check in lookup would probably fix this issue.  I choose to
      propagate error to other iops as well.
      
      Please NOTE: I have factored out hfsplus_cat_build_key_with_cnid from
      hfsplus_cat_build_key, to avoid unncessary branching.
      
      Thanks a lot.
      
        TEST:
        ------
        dir="TEST_DIR"
        cdir=`pwd`
        name255="_123456789_123456789_123456789_123456789_123456789_123456789\
        _123456789_123456789_123456789_123456789_123456789_123456789_123456789\
        _123456789_123456789_123456789_123456789_123456789_123456789_123456789\
        _123456789_123456789_123456789_123456789_123456789_1234"
        name256="${name255}5"
      
        mkdir $dir
        cd $dir
        touch $name255
        rm -f $name255
        touch $name256
        ls -la
        cd $cdir
        rm -rf $dir
      
        RESULT:
        -------
        [sougata@ultrabook tmp]$ cdir=`pwd`
        [sougata@ultrabook tmp]$
        name255="_123456789_123456789_123456789_123456789_123456789_123456789\
         > _123456789_123456789_123456789_123456789_123456789_123456789_123456789\
         > _123456789_123456789_123456789_123456789_123456789_123456789_123456789\
         > _123456789_123456789_123456789_123456789_123456789_1234"
        [sougata@ultrabook tmp]$ name256="${name255}5"
        [sougata@ultrabook tmp]$
        [sougata@ultrabook tmp]$ mkdir $dir
        [sougata@ultrabook tmp]$ cd $dir
        [sougata@ultrabook TEST_DIR]$ touch $name255
        [sougata@ultrabook TEST_DIR]$ rm -f $name255
        [sougata@ultrabook TEST_DIR]$ touch $name256
        [sougata@ultrabook TEST_DIR]$ ls -la
        ls: cannot access
        _123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234:
        No such file or directory
        total 0
        drwxrwxr-x 1 sougata sougata 3 Feb 20 19:56 .
        drwxrwxrwx 1 root    root    6 Feb 20 19:56 ..
        -????????? ? ?       ?       ?            ?
        _123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234
        [sougata@ultrabook TEST_DIR]$ cd $cdir
        [sougata@ultrabook tmp]$ rm -rf $dir
        rm: cannot remove `TEST_DIR': Directory not empty
      
      -ENAMETOOLONG returned from hfsplus_asc2uni was not propaged to iops.
      This allowed hfsplus to create files/directories with HFSPLUS_MAX_STRLEN
      and incorrect keys, leaving the FS in an inconsistent state.  This patch
      fixes this issue.
      Signed-off-by: NSougata Santra <sougata@tuxera.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89ac9b4d
    • E
      mnt: Fix a memory stomp in umount · c297abfd
      Eric W. Biederman 提交于
      While reviewing the code of umount_tree I realized that when we append
      to a preexisting unmounted list we do not change pprev of the former
      first item in the list.
      
      Which means later in namespace_unlock hlist_del_init(&mnt->mnt_hash) on
      the former first item of the list will stomp unmounted.first leaving
      it set to some random mount point which we are likely to free soon.
      
      This isn't likely to hit, but if it does I don't know how anyone could
      track it down.
      
      [ This happened because we don't have all the same operations for
        hlist's as we do for normal doubly-linked lists. In particular,
        list_splice() is easy on our standard doubly-linked lists, while
        hlist_splice() doesn't exist and needs both start/end entries of the
        hlist.  And commit 38129a13 incorrectly open-coded that missing
        hlist_splice().
      
        We should think about making these kinds of "mindless" conversions
        easier to get right by adding the missing hlist helpers   - Linus ]
      
      Fixes: 38129a13 switch mnt_hash to hlist
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c297abfd
  3. 18 12月, 2014 20 次提交
  4. 17 12月, 2014 6 次提交
  5. 15 12月, 2014 1 次提交
    • J
      isofs: Fix infinite looping over CE entries · f54e18f1
      Jan Kara 提交于
      Rock Ridge extensions define so called Continuation Entries (CE) which
      define where is further space with Rock Ridge data. Corrupted isofs
      image can contain arbitrarily long chain of these, including a one
      containing loop and thus causing kernel to end in an infinite loop when
      traversing these entries.
      
      Limit the traversal to 32 entries which should be more than enough space
      to store all the Rock Ridge data.
      Reported-by: NP J P <ppandit@redhat.com>
      CC: stable@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      f54e18f1
  6. 14 12月, 2014 6 次提交
    • F
      aio: Skip timer for io_getevents if timeout=0 · 5f785de5
      Fam Zheng 提交于
      In this case, it is basically a polling. Let's not involve timer at all
      because that would hurt performance for application event loops.
      
      In an arbitrary test I've done, io_getevents syscall elapsed time
      reduces from 50000+ nanoseconds to a few hundereds.
      Signed-off-by: NFam Zheng <famz@redhat.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      5f785de5
    • P
      aio: Make it possible to remap aio ring · e4a0d3e7
      Pavel Emelyanov 提交于
      There are actually two issues this patch addresses. Let me start with
      the one I tried to solve in the beginning.
      
      So, in the checkpoint-restore project (criu) we try to dump tasks'
      state and restore one back exactly as it was. One of the tasks' state
      bits is rings set up with io_setup() call. There's (almost) no problems
      in dumping them, there's a problem restoring them -- if I dump a task
      with aio ring originally mapped at address A, I want to restore one
      back at exactly the same address A. Unfortunately, the io_setup() does
      not allow for that -- it mmaps the ring at whatever place mm finds
      appropriate (it calls do_mmap_pgoff() with zero address and without
      the MAP_FIXED flag).
      
      To make restore possible I'm going to mremap() the freshly created ring
      into the address A (under which it was seen before dump). The problem is
      that the ring's virtual address is passed back to the user-space as the
      context ID and this ID is then used as search key by all the other io_foo()
      calls. Reworking this ID to be just some integer doesn't seem to work, as
      this value is already used by libaio as a pointer using which this library
      accesses memory for aio meta-data.
      
      So, to make restore work we need to make sure that
      
      a) ring is mapped at desired virtual address
      b) kioctx->user_id matches this value
      
      Having said that, the patch makes mremap() on aio region update the
      kioctx's user_id and mmap_base values.
      
      Here appears the 2nd issue I mentioned in the beginning of this mail.
      If (regardless of the C/R dances I do) someone creates an io context
      with io_setup(), then mremap()-s the ring and then destroys the context,
      the kill_ioctx() routine will call munmap() on wrong (old) address.
      This will result in a) aio ring remaining in memory and b) some other
      vma get unexpectedly unmapped.
      
      What do you think?
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Acked-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      e4a0d3e7
    • J
      fsnotify: remove destroy_list from fsnotify_mark · 37d469e7
      Jan Kara 提交于
      destroy_list is used to track marks which still need waiting for srcu
      period end before they can be freed.  However by the time mark is added to
      destroy_list it isn't in group's list of marks anymore and thus we can
      reuse fsnotify_mark->g_list for queueing into destroy_list.  This saves
      two pointers for each fsnotify_mark.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      37d469e7
    • J
      fsnotify: unify inode and mount marks handling · 0809ab69
      Jan Kara 提交于
      There's a lot of common code in inode and mount marks handling.  Factor it
      out to a common helper function.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0809ab69
    • H
      fallocate: create FAN_MODIFY and IN_MODIFY events · 820c12d5
      Heinrich Schuchardt 提交于
      The fanotify and the inotify API can be used to monitor changes of the
      file system.  System call fallocate() modifies files.  Hence it should
      trigger the corresponding fanotify (FAN_MODIFY) and inotify (IN_MODIFY)
      events.  The most interesting case is FALLOC_FL_COLLAPSE_RANGE because
      this value allows to create arbitrary file content from random data.
      
      This patch adds the missing call to fsnotify_modify().
      
      The FAN_MODIFY and IN_MODIFY event will be created when fallocate()
      succeeds.  It will even be created if the file length remains unchanged,
      e.g.  when calling fanotify with flag FALLOC_FL_KEEP_SIZE.
      
      This logic was primarily chosen to keep the coding simple.
      
      It resembles the logic of the write() system call.
      
      When we call write() we always create a FAN_MODIFY event, even in the case
      of overwriting with identical data.
      
      Events FAN_MODIFY and IN_MODIFY do not provide any guarantee that data was
      actually changed.
      
      Furthermore even if if the filesize remains unchanged, fallocate() may
      influence whether a subsequent write() will succeed and hence the
      fallocate() call may be considered a modification.
      
      The fallocate(2) man page teaches: After a successful call, subsequent
      writes into the range specified by offset and len are guaranteed not to
      fail because of lack of disk space.
      
      So calling fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) may result in
      different outcomes of a subsequent write depending on the values of offset
      and len.
      Signed-off-by: NHeinrich Schuchardt <xypron.glpk@gmx.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: John McCutchan <john@johnmccutchan.com>
      Cc: Robert Love <rlove@rlove.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      820c12d5
    • F
      fs/affs/file.c: remove obsolete pagesize check · 92cab82b
      Fabian Frederick 提交于
      linux kernel doesn't manage page sizes below 4kb.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      92cab82b