1. 04 Jul 2013, 10 commits
  2. 03 Jul 2013, 2 commits
    • vfs: export lseek_execute() to modules · 46a1c2c7
      Committed by Jie Liu
      For the file systems (btrfs/ext4/ocfs2/tmpfs) that support
      SEEK_DATA/SEEK_HOLE, we end up doing much the same work in
      lseek_execute(): updating the current file offset to the desired
      offset if it is valid. ceph does something similar in ceph_llseek().

      To reduce the duplication, this patch makes lseek_execute()
      publicly accessible so that the underlying file systems can call
      it directly (a sketch of the resulting helper follows this entry).
      
      Thanks Dave Chinner for this suggestion.
      
      [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
      
      v2->v1:
      - Add kernel-doc comments for lseek_execute()
      - Call lseek_execute() in ceph->llseek()
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Ted Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Sage Weil <sage@inktank.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      46a1c2c7
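
      A minimal sketch, assuming the post-rename form described above (Al
      Viro's vfs_setpos()); the exact bounds handling in fs/read_write.c may
      differ slightly, so treat this as illustrative rather than verbatim:

        /*
         * Sketch: validate the requested offset against the filesystem's
         * maximum and, if it actually moved, update f_pos and clear
         * f_version so stale cached directory state is dropped.
         */
        loff_t vfs_setpos(struct file *file, loff_t offset, loff_t maxsize)
        {
            if (offset < 0 || offset > maxsize)
                return -EINVAL;

            if (offset != file->f_pos) {
                file->f_pos = offset;
                file->f_version = 0;
            }
            return offset;
        }
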
    • sync: don't block the flusher thread waiting on IO · 7747bd4b
      Committed by Dave Chinner
      When sync does its WB_SYNC_ALL writeback, it issues data IO and
      then immediately waits for IO completion. This is done in the
      context of the flusher thread, and hence completely ties up the
      flusher thread for the backing device until all the dirty inodes
      have been synced. On filesystems that are dirtying inodes constantly
      and quickly, this means the flusher thread can be tied up for
      minutes per sync call and hence badly affects system-level write IO
      performance, as the page cache cannot be cleaned quickly.
      
      We already have a wait loop for IO completion for sync(2), so cut
      this out of the flusher thread and delegate it to wait_sb_inodes()
      (a simplified sketch of that wait loop follows this entry). Hence
      we can do rapid IO submission, and then wait for it all to
      complete.
      
      Effect of sync on fsmark before the patch:
      
      FSUse%        Count         Size    Files/sec     App Overhead
      .....
           0       640000         4096      35154.6          1026984
           0       720000         4096      36740.3          1023844
           0       800000         4096      36184.6           916599
           0       880000         4096       1282.7          1054367
           0       960000         4096       3951.3           918773
           0      1040000         4096      40646.2           996448
           0      1120000         4096      43610.1           895647
           0      1200000         4096      40333.1           921048
      
      And a single sync pass took:
      
        real    0m52.407s
        user    0m0.000s
        sys     0m0.090s
      
      After the patch, there is no impact on fsmark results, and each
      individual sync(2) operation run concurrently with the same fsmark
      workload takes roughly 7s:
      
        real    0m6.930s
        user    0m0.000s
        sys     0m0.039s
      
      IOWs, sync is 7-8x faster on a busy filesystem and does not have an
      adverse impact on ongoing async data write operations.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7747bd4b
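
      A simplified sketch, not the exact fs/fs-writeback.c code, of the wait
      side that sync(2) now relies on instead of blocking the flusher thread:
      after the flusher has only submitted the WB_SYNC_ALL IO, the caller
      walks the superblock's inodes and waits for any outstanding writeback.

        /*
         * Sketch of wait_sb_inodes(): the real code also takes inode
         * references and the proper locks (inode_sb_list_lock,
         * inode->i_lock) while walking the list; those are omitted here.
         */
        static void wait_sb_inodes_sketch(struct super_block *sb)
        {
            struct inode *inode;

            list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                struct address_space *mapping = inode->i_mapping;

                if (mapping->nrpages == 0)
                    continue;                   /* nothing cached, nothing to wait for */

                filemap_fdatawait(mapping);     /* wait for pages under writeback */
            }
        }
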
  3. 01 Jul 2013, 1 commit
    • jbd2: invalidate handle if jbd2_journal_restart() fails · 41a5b913
      Committed by Theodore Ts'o
      If jbd2_journal_restart() fails, the handle will have been disconnected
      from the current transaction.  In this situation, the handle must not
      be used for any jbd2 function other than jbd2_journal_stop().
      Enforce this by treating a handle which has a NULL transaction
      pointer as an aborted handle, and issue a kernel warning if
      jbd2_journal_extend(), jbd2_journal_get_write_access(),
      jbd2_journal_dirty_metadata(), etc. is called with an invalid handle
      (a rough sketch of that check follows this entry).
      
      This commit also fixes a bug where jbd2_journal_stop() would trip over
      a kernel jbd2 assertion check when trying to free an invalid handle.
      
      Also move the responsibility of setting current->journal_info to
      start_this_handle(), simplifying the three users of this function.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reported-by: Younger Liu <younger.liu@huawei.com>
      Cc: Jan Kara <jack@suse.cz>
      41a5b913
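
      A rough sketch, not the exact fs/jbd2 code, of the invariant described
      above: a handle whose transaction pointer is NULL is treated like an
      aborted handle, so the journalling entry points warn and bail out
      instead of dereferencing a stale transaction (helper names below are
      illustrative).

        /* Sketch: once jbd2_journal_restart() has failed, h_transaction is
         * NULL and only jbd2_journal_stop() may be called on the handle. */
        static int jbd2_handle_usable(handle_t *handle)
        {
            if (WARN_ON(!handle->h_transaction))
                return 0;           /* disconnected handle: treat as aborted */
            return !is_handle_aborted(handle);
        }

        int jbd2_journal_extend_sketch(handle_t *handle, int nblocks)
        {
            if (!jbd2_handle_usable(handle))
                return -EROFS;      /* caller must unwind via jbd2_journal_stop() */
            /* ... normal extend logic ... */
            return 0;
        }
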
  4. 29 Jun 2013, 20 commits
  5. 28 Jun 2013, 1 commit
    • cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when comparing mount options · 0ce6cba3
      Committed by Tejun Heo
      1672d040 ("cgroup: fix cgroupfs_root early destruction path")
      introduced CGRP_ROOT_SUBSYS_BOUND which is used to mark completion of
      subsys binding on a new root; however, this broke remounts.
      cgroup_remount() doesn't allow changing root options via remount and
      CGRP_ROOT_SUBSYS_BOUND, which is set on all fully initialized roots,
      makes the function reject all remounts.
      
      Fix it by putting the options part in the lower 16 bits of root->flags
      and masking the comparisons (a sketch of that masking follows this
      entry).  While at it, make cgroup_remount() emit an error message
      explaining why it's rejecting a remount request, so that it's less of
      a mystery.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      0ce6cba3
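
      A hedged sketch of the comparison the fix describes; the mask name
      below is illustrative (the real definitions live in the cgroup code),
      but the idea is that only the low 16 option bits participate in the
      remount check while internal state bits such as CGRP_ROOT_SUBSYS_BOUND
      sit above them.

        /* Illustrative: option flags occupy the lower 16 bits of root->flags;
         * internal bits above them are masked off before comparing. */
        #define CGRP_ROOT_OPTION_MASK   0xffffUL

        static bool cgroup_remount_options_changed(unsigned long root_flags,
                                                   unsigned long new_flags)
        {
            return ((root_flags ^ new_flags) & CGRP_ROOT_OPTION_MASK) != 0;
        }
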
  6. 27 Jun 2013, 5 commits
    • sched: Fix typo in struct sched_avg member description · 239003ea
      Committed by Kamalesh Babulal
      Remove an extra 'for' from the description of a struct sched_avg
      member.
      Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Cc: pjt@google.com
      Cc: peterz@infradead.org
      Link: http://lkml.kernel.org/r/20130627060409.GB18582@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      239003ea
    • Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" · 141965c7
      Committed by Alex Shi
      Remove the CONFIG_FAIR_GROUP_SCHED guard that covers the runnable
      info, so that the runnable load variables can be used.

      Also remove two CONFIG_FAIR_GROUP_SCHED settings which are not in the
      reverted patch (they were introduced in 9ee474f5) but also need to be
      reverted.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/51CA76A3.3050207@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      141965c7
    • net: fix kernel deadlock with interface rename and netdev name retrieval. · 5dbe7c17
      Committed by Nicolas Schichan
      When the kernel (compiled with CONFIG_PREEMPT=n) is performing the
      rename of a network interface, it can end up waiting for a workqueue
      to complete. If userland is able to invoke a SIOCGIFNAME ioctl or a
      SO_BINDTODEVICE getsockopt in between, the kernel will deadlock
      because read_seqcount_begin() will spin forever waiting for the
      writer process (the one doing the interface rename) to update the
      devnet_rename_seq sequence.
      
      This patch fixes the problem by adding a helper (netdev_get_name())
      and using it in the code handling the SIOCGIFNAME ioctl and
      SO_BINDTODEVICE getsockopt (a sketch of the helper follows this
      entry).
      
      The netdev_get_name() helper uses raw_seqcount_begin() to avoid
      spinning forever, waiting for devnet_rename_seq->sequence to become
      even. cond_resched() is used in the contended case, before retrying
      the access to give the writer process a chance to finish.
      
      The use of raw_seqcount_begin() will incur some unneeded work in the
      reader process in the contended case, but this is better than
      deadlocking the system.
      Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5dbe7c17
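
      A hedged sketch of the helper's retry loop; the exact error handling in
      net/core/dev.c may differ, but the lockless raw seqcount read plus
      cond_resched() retry is the point.

        /* Sketch of netdev_get_name(): copy the device name under a raw
         * (non-spinning) seqcount read; if a rename raced with us, yield
         * and retry instead of spinning forever. */
        static int netdev_get_name_sketch(struct net *net, char *name, int ifindex)
        {
            struct net_device *dev;
            unsigned int seq;

        retry:
            seq = raw_seqcount_begin(&devnet_rename_seq);
            rcu_read_lock();
            dev = dev_get_by_index_rcu(net, ifindex);
            if (!dev) {
                rcu_read_unlock();
                return -ENODEV;
            }
            strcpy(name, dev->name);
            rcu_read_unlock();
            if (read_seqcount_retry(&devnet_rename_seq, seq)) {
                cond_resched();     /* give the renaming writer a chance to finish */
                goto retry;
            }
            return 0;
        }
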
    • cgroup: fix RCU accesses to task->cgroups · 14611e51
      Committed by Tejun Heo
      task->cgroups is an RCU pointer pointing to struct css_set.  A task
      switches to a different css_set on cgroup migration, but a css_set
      doesn't change once created, and its pointers to cgroup_subsys_states
      aren't RCU protected.

      task_subsys_state[_check]() is the macro to acquire a css given a task
      and subsys_id pair.  It RCU-dereferences task->cgroups->subsys[], not
      task->cgroups, so the RCU pointer task->cgroups ends up being
      dereferenced without read_barrier_depends() after it.  It's broken.
      
      Fix it by introducing task_css_set[_check](), which does the RCU
      dereference on task->cgroups.  task_subsys_state[_check]() is
      reimplemented to directly dereference ->subsys[] of the css_set
      returned from task_css_set[_check]() (a rough sketch of these
      accessors follows this entry).

      This removes some of the sparse RCU warnings in cgroup.
      
      v2: Fixed an unbalanced parenthesis, and there's no need to use
          rcu_dereference_raw() when !CONFIG_PROVE_RCU.  Both spotted by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      14611e51
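
      A rough sketch of the accessors described above, simplified: the real
      _check() variants also take lockdep conditions, and the kernel versions
      are macros rather than inline functions.

        /* Sketch: task_css_set() RCU-dereferences task->cgroups itself, and
         * task_subsys_state() is layered on top of it, so the css_set pointer
         * is read with the proper RCU barrier before ->subsys[] is used. */
        static inline struct css_set *task_css_set(struct task_struct *task)
        {
            return rcu_dereference(task->cgroups);
        }

        static inline struct cgroup_subsys_state *
        task_subsys_state(struct task_struct *task, int subsys_id)
        {
            return task_css_set(task)->subsys[subsys_id];
        }
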
    • cgroup: fix cgroupfs_root early destruction path · 1672d040
      Committed by Tejun Heo
      cgroupfs_root used to have ->actual_subsys_mask in addition to
      ->subsys_mask.  a8a648c4 ("cgroup: remove
      cgroup->actual_subsys_mask") removed it, noting that the subsys_mask is
      essentially temporary and doesn't belong in cgroupfs_root; however,
      the patch made it impossible to tell whether a cgroupfs_root actually
      has the subsystems bound or just has the bits set, leading to the
      following BUG when trying to mount with subsystems which are already
      mounted elsewhere.
      
       kernel BUG at kernel/cgroup.c:1038!
       invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       ...
       CPU: 1 PID: 7973 Comm: mount Tainted: G        W    3.10.0-rc7-next-20130625-sasha-00011-g1c1dc0e #1105
       task: ffff880fc0ae8000 ti: ffff880fc0b9a000 task.ti: ffff880fc0b9a000
       RIP: 0010:[<ffffffff81249b29>]  [<ffffffff81249b29>] rebind_subsystems+0x409/0x5f0
       ...
       Call Trace:
        [<ffffffff8124bd4f>] cgroup_kill_sb+0xff/0x210
        [<ffffffff813d21af>] deactivate_locked_super+0x4f/0x90
        [<ffffffff8124f3b3>] cgroup_mount+0x673/0x6e0
        [<ffffffff81257169>] cpuset_mount+0xd9/0x110
        [<ffffffff813d2580>] mount_fs+0xb0/0x2d0
        [<ffffffff81404afd>] vfs_kern_mount+0xbd/0x180
        [<ffffffff814070b5>] do_new_mount+0x145/0x2c0
        [<ffffffff814085d6>] do_mount+0x356/0x3c0
        [<ffffffff8140873d>] SyS_mount+0xfd/0x140
        [<ffffffff854eb600>] tracesys+0xdd/0xe2
      
      We still want rebind_subsystems() to take added/removed masks, so
      let's fix it by marking whether a cgroupfs_root has finished binding
      or not (a loose sketch of that marking follows this entry).  Also,
      document what's going on around ->subsys_mask initialization so that
      similar mistakes aren't repeated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      1672d040
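
      A loose sketch of the idea, with helper names invented for illustration:
      the "binding finished" flag is only set once rebind_subsystems() has
      actually bound the subsystems, and the unmount path only unbinds when
      that flag is present.

        /* Sketch only: mark completion of binding after a successful rebind,
         * and key the unmount-time unbind on that marker so a root that never
         * finished binding is not unbound again (which triggered the BUG). */
        static int cgroup_bind_subsystems_sketch(struct cgroupfs_root *root,
                                                 unsigned long subsys_mask)
        {
            int ret = rebind_subsystems(root, subsys_mask, 0);

            if (!ret)
                root->flags |= CGRP_ROOT_SUBSYS_BOUND;
            return ret;
        }

        static void cgroup_unbind_on_kill_sketch(struct cgroupfs_root *root)
        {
            if (root->flags & CGRP_ROOT_SUBSYS_BOUND)
                rebind_subsystems(root, 0, root->subsys_mask);
        }
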
  7. 26 Jun 2013, 1 commit
    • mutex: Add w/w mutex slowpath debugging · 23010027
      Committed by Daniel Vetter
      Injects EDEADLK conditions at pseudo-random intervals, with
      exponential backoff up to UINT_MAX (to ensure that every lock
      operation still completes in a reasonable time); a sketch of the
      injection helper follows this entry.
      
      This way we can test the wound slowpath even for ww mutex users
      where contention is never expected, and the ww deadlock
      avoidance algorithm is only needed for correctness against
      malicious userspace. An example would be protecting kernel
      modesetting properties, which thanks to single-threaded X isn't
      really expected to contend, ever.
      
      I've looked into using the CONFIG_FAULT_INJECTION
      infrastructure, but decided against it for two reasons:
      
      - EDEADLK handling is mandatory for ww mutex users and should
        never affect the outcome of a syscall. This is in contrast to -ENOMEM
        injection. So fine configurability isn't required.
      
      - The fault injection framework only allows setting a simple
        probability for failure. Now the probability that a ww mutex acquire
        stage with N locks will never complete (due to too many injected
        EDEADLK backoffs) is zero. But the expected number of ww_mutex_lock
        operations for the completely uncontended case would be O(exp(N)).
        The per-acquire-ctx exponential backoff solution chosen here only
        results in O(log N) overhead due to injection and so O(log N * N)
        lock operations. This way we can fail with high probability (and so
        have good test coverage even for fancy backoff and lock acquisition
        paths) without running into pathological cases.
      
      Note that EDEADLK will only ever be injected when we managed to
      acquire the lock. This prevents any behaviour changes for users
      which rely on the EALREADY semantics.
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: dri-devel@lists.freedesktop.org
      Cc: linaro-mm-sig@lists.linaro.org
      Cc: rostedt@goodmis.org
      Cc: daniel@ffwll.ch
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20130620113117.4001.21681.stgit@patser
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      23010027
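
      A hedged sketch of the injection helper; the Kconfig and field names
      below follow the shape of the patch but should be treated as
      illustrative. Every Nth successful acquire within an acquire context
      returns -EDEADLK, and N grows exponentially so even long acquire
      sequences still terminate quickly.

        #ifdef CONFIG_DEBUG_WW_MUTEX_SLOWPATH
        /* Sketch: after a ww_mutex has actually been acquired, occasionally
         * pretend we lost a wound race.  The interval doubles (saturating at
         * UINT_MAX) so the expected per-context overhead stays small. */
        static bool ww_mutex_inject_deadlock(struct ww_mutex *lock,
                                             struct ww_acquire_ctx *ctx)
        {
            if (ctx->deadlock_inject_countdown-- != 0)
                return false;

            if (ctx->deadlock_inject_interval > UINT_MAX / 2)
                ctx->deadlock_inject_interval = UINT_MAX;
            else
                ctx->deadlock_inject_interval *= 2;
            ctx->deadlock_inject_countdown = ctx->deadlock_inject_interval;

            ctx->contending_lock = lock;    /* lock the caller must back off from */
            ww_mutex_unlock(lock);          /* injected only after a real acquire */
            return true;                    /* caller then returns -EDEADLK */
        }
        #endif
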