1. 18 12月, 2011 3 次提交
    • W
      writeback: dirty ratelimit - think time compensation · 83712358
      Wu Fengguang 提交于
      Compensate the task's think time when computing the final pause time,
      so that ->dirty_ratelimit can be executed accurately.
      
              think time := time spend outside of balance_dirty_pages()
      
      In the rare case that the task slept longer than the 200ms period time
      (result in negative pause time), the sleep time will be compensated in
      the following periods, too, if it's less than 1 second.
      
      Accumulated errors are carefully avoided as long as the max pause area
      is not hitted.
      
      Pseudo code:
      
              period = pages_dirtied / task_ratelimit;
              think = jiffies - dirty_paused_when;
              pause = period - think;
      
      1) normal case: period > think
      
              pause = period - think
              dirty_paused_when = jiffies + pause
              nr_dirtied = 0
      
                                   period time
                    |===============================>|
                        think time      pause time
                    |===============>|==============>|
              ------|----------------|---------------|------------------------
              dirty_paused_when   jiffies
      
      2) no pause case: period <= think
      
              don't pause; reduce future pause time by:
              dirty_paused_when += period
              nr_dirtied = 0
      
                                 period time
                    |===============================>|
                                        think time
                    |===================================================>|
              ------|--------------------------------+-------------------|----
              dirty_paused_when                                       jiffies
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      83712358
    • W
      writeback: fix dirtied pages accounting on redirty · 2f800fbd
      Wu Fengguang 提交于
      De-account the accumulative dirty counters on page redirty.
      
      Page redirties (very common in ext4) will introduce mismatch between
      counters (a) and (b)
      
      a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
      b) NR_WRITTEN, BDI_WRITTEN
      
      This will introduce systematic errors in balanced_rate and result in
      dirty page position errors (ie. the dirty pages are no longer balanced
      around the global/bdi setpoints).
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      2f800fbd
    • W
      writeback: charge leaked page dirties to active tasks · 54848d73
      Wu Fengguang 提交于
      It's a years long problem that a large number of short-lived dirtiers
      (eg. gcc instances in a fast kernel build) may starve long-run dirtiers
      (eg. dd) as well as pushing the dirty pages to the global hard limit.
      
      The solution is to charge the pages dirtied by the exited gcc to the
      other random dirtying tasks. It sounds not perfect, however should
      behave good enough in practice, seeing as that throttled tasks aren't
      actually running so those that are running are more likely to pick it up
      and get throttled, therefore promoting an equal spread.
      
      Randy: fix compile error: 'dirty_throttle_leaks' undeclared in exit.c
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NRandy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      54848d73
  2. 17 12月, 2011 1 次提交
    • E
      iommu: Export intel_iommu_enabled to signal when iommu is in use · 8bc1f85c
      Eugeni Dodonov 提交于
      In i915 driver, we do not enable either rc6 or semaphores on SNB when dmar
      is enabled. The new 'intel_iommu_enabled' variable signals when the
      iommu code is in operation.
      
      Cc: Ted Phelps <phelps@gnusto.com>
      Cc: Peter <pab1612@gmail.com>
      Cc: Lukas Hejtmanek <xhejtman@fi.muni.cz>
      Cc: Andrew Lutomirski <luto@mit.edu>
      CC: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Eugeni Dodonov <eugeni.dodonov@intel.com>
      Signed-off-by: NKeith Packard <keithp@keithp.com>
      8bc1f85c
  3. 13 12月, 2011 1 次提交
    • L
      linux/log2.h: Fix rounddown_pow_of_two(1) · 13c07b02
      Linus Torvalds 提交于
      Exactly like roundup_pow_of_two(1), the rounddown version was buggy for
      the case of a compile-time constant '1' argument.  Probably because it
      originated from the same code, sharing history with the roundup version
      from before the bugfix (for that one, see commit 1a06a52e: "Fix
      roundup_pow_of_two(1)").
      
      However, unlike the roundup version, the fix for rounddown is to just
      remove the broken special case entirely.  It's simply not needed - the
      generic code
      
          1UL << ilog2(n)
      
      does the right thing for the constant '1' argment too.  The only reason
      roundup needed that special case was because rounding up does so by
      subtracting one from the argument (and then adding one to the result)
      causing the obvious problems with "ilog2(0)".
      
      But rounddown doesn't do any of that, since ilog2() naturally truncates
      (ie "rounds down") to the right rounded down value.  And without the
      ilog2(0) case, there's no reason for the special case that had the wrong
      value.
      
      tl;dr: rounddown_pow_of_two(1) should be 1, not 0.
      Acked-by: NDmitry Torokhov <dtor@vmware.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13c07b02
  4. 11 12月, 2011 1 次提交
  5. 09 12月, 2011 1 次提交
  6. 07 12月, 2011 1 次提交
    • A
      fix apparmor dereferencing potentially freed dentry, sanitize __d_path() API · 02125a82
      Al Viro 提交于
      __d_path() API is asking for trouble and in case of apparmor d_namespace_path()
      getting just that.  The root cause is that when __d_path() misses the root
      it had been told to look for, it stores the location of the most remote ancestor
      in *root.  Without grabbing references.  Sure, at the moment of call it had
      been pinned down by what we have in *path.  And if we raced with umount -l, we
      could have very well stopped at vfsmount/dentry that got freed as soon as
      prepend_path() dropped vfsmount_lock.
      
      It is safe to compare these pointers with pre-existing (and known to be still
      alive) vfsmount and dentry, as long as all we are asking is "is it the same
      address?".  Dereferencing is not safe and apparmor ended up stepping into
      that.  d_namespace_path() really wants to examine the place where we stopped,
      even if it's not connected to our namespace.  As the result, it looked
      at ->d_sb->s_magic of a dentry that might've been already freed by that point.
      All other callers had been careful enough to avoid that, but it's really
      a bad interface - it invites that kind of trouble.
      
      The fix is fairly straightforward, even though it's bigger than I'd like:
      	* prepend_path() root argument becomes const.
      	* __d_path() is never called with NULL/NULL root.  It was a kludge
      to start with.  Instead, we have an explicit function - d_absolute_root().
      Same as __d_path(), except that it doesn't get root passed and stops where
      it stops.  apparmor and tomoyo are using it.
      	* __d_path() returns NULL on path outside of root.  The main
      caller is show_mountinfo() and that's precisely what we pass root for - to
      skip those outside chroot jail.  Those who don't want that can (and do)
      use d_path().
      	* __d_path() root argument becomes const.  Everyone agrees, I hope.
      	* apparmor does *NOT* try to use __d_path() or any of its variants
      when it sees that path->mnt is an internal vfsmount.  In that case it's
      definitely not mounted anywhere and dentry_path() is exactly what we want
      there.  Handling of sysctl()-triggered weirdness is moved to that place.
      	* if apparmor is asked to do pathname relative to chroot jail
      and __d_path() tells it we it's not in that jail, the sucker just calls
      d_absolute_path() instead.  That's the other remaining caller of __d_path(),
      BTW.
              * seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway -
      the normal seq_file logics will take care of growing the buffer and redoing
      the call of ->show() just fine).  However, if it gets path not reachable
      from root, it returns SEQ_SKIP.  The only caller adjusted (i.e. stopped
      ignoring the return value as it used to do).
      Reviewed-by: NJohn Johansen <john.johansen@canonical.com>
      ACKed-by: NJohn Johansen <john.johansen@canonical.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      02125a82
  7. 06 12月, 2011 2 次提交
  8. 05 12月, 2011 1 次提交
    • P
      perf: Fix loss of notification with multi-event · 10c6db11
      Peter Zijlstra 提交于
      When you do:
              $ perf record -e cycles,cycles,cycles noploop 10
      
      You expect about 10,000 samples for each event, i.e., 10s at
      1000samples/sec. However, this is not what's happening. You
      get much fewer samples, maybe 3700 samples/event:
      
      $ perf report -D | tail -15
      Aggregated stats:
                 TOTAL events:      10998
                  MMAP events:         66
                  COMM events:          2
                SAMPLE events:      10930
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      cycles stats:
                 TOTAL events:       3642
                SAMPLE events:       3642
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      
      On a Intel Nehalem or even AMD64, there are 4 counters capable
      of measuring cycles, so there is plenty of space to measure those
      events without multiplexing (even with the NMI watchdog active).
      And even with multiplexing, we'd expect roughly the same number
      of samples per event.
      
      The root of the problem was that when the event that caused the buffer
      to become full was not the first event passed on the cmdline, the user
      notification would get lost. The notification was sent to the file
      descriptor of the overflowed event but the perf tool was not polling
      on it.  The perf tool aggregates all samples into a single buffer,
      i.e., the buffer of the first event. Consequently, it assumes
      notifications for any event will come via that descriptor.
      
      The seemingly straight forward solution of moving the waitq into the
      ringbuffer object doesn't work because of life-time issues. One could
      perf_event_set_output() on a fd that you're also blocking on and cause
      the old rb object to be freed while its waitq would still be
      referenced by the blocked thread -> FAIL.
      
      Therefore link all events to the ringbuffer and broadcast the wakeup
      from the ringbuffer object to all possible events that could be waited
      upon. This is rather ugly, and we're open to better solutions but it
      works for now.
      Reported-by: NStephane Eranian <eranian@google.com>
      Finished-by: NStephane Eranian <eranian@google.com>
      Reviewed-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111126014731.GA7030@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>
      10c6db11
  9. 04 12月, 2011 1 次提交
  10. 29 11月, 2011 4 次提交
  11. 24 11月, 2011 2 次提交
  12. 23 11月, 2011 4 次提交
  13. 19 11月, 2011 1 次提交
    • D
      hugetlb: remove dummy definitions of HPAGE_MASK and HPAGE_SIZE · a5c86e98
      David Rientjes 提交于
      Dummy, non-zero definitions for HPAGE_MASK and HPAGE_SIZE were added in
      51c6f666 ("mm: ZAP_BLOCK causes redundant work") to avoid a divide
      by zero in generic kernel code.
      
      That code has since been removed, but probably should never have been
      added in the first place: we don't want HPAGE_SIZE to act like PAGE_SIZE
      for code that is working with hugepages, for example, when the
      dependency on CONFIG_HUGETLB_PAGE has not been fulfilled.
      
      Because hugepage size can differ from architecture to architecture, each
      is required to have their own definitions for both HPAGE_MASK and
      HPAGE_SIZE.  This is always done in arch/*/include/asm/page.h.
      
      So, just remove the dummy and dangerous definitions since they are no
      longer needed and reveals the correct dependencies.  Tested on
      architectures using the definitions with allyesconfig: x86 (even with
      thp), hppa, mips, powerpc, s390, sh3, sh4, sparc, and sparc64, and with
      defconfig on ia64.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5c86e98
  14. 18 11月, 2011 2 次提交
    • K
      pstore: pass allocated memory region back to caller · f6f82851
      Kees Cook 提交于
      The buf_lock cannot be held while populating the inodes, so make the backend
      pass forward an allocated and filled buffer instead. This solves the following
      backtrace. The effect is that "buf" is only ever used to notify the backends
      that something was written to it, and shouldn't be used in the read path.
      
      To replace the buf_lock during the read path, isolate the open/read/close
      loop with a separate mutex to maintain serialized access to the backend.
      
      Note that is is up to the pstore backend to cope if the (*write)() path is
      called in the middle of the read path.
      
      [   59.691019] BUG: sleeping function called from invalid context at .../mm/slub.c:847
      [   59.691019] in_atomic(): 0, irqs_disabled(): 1, pid: 1819, name: mount
      [   59.691019] Pid: 1819, comm: mount Not tainted 3.0.8 #1
      [   59.691019] Call Trace:
      [   59.691019]  [<810252d5>] __might_sleep+0xc3/0xca
      [   59.691019]  [<810a26e6>] kmem_cache_alloc+0x32/0xf3
      [   59.691019]  [<810b53ac>] ? __d_lookup_rcu+0x6f/0xf4
      [   59.691019]  [<810b68b1>] alloc_inode+0x2a/0x64
      [   59.691019]  [<810b6903>] new_inode+0x18/0x43
      [   59.691019]  [<81142447>] pstore_get_inode.isra.1+0x11/0x98
      [   59.691019]  [<81142623>] pstore_mkfile+0xae/0x26f
      [   59.691019]  [<810a2a66>] ? kmem_cache_free+0x19/0xb1
      [   59.691019]  [<8116c821>] ? ida_get_new_above+0x140/0x158
      [   59.691019]  [<811708ea>] ? __init_rwsem+0x1e/0x2c
      [   59.691019]  [<810b67e8>] ? inode_init_always+0x111/0x1b0
      [   59.691019]  [<8102127e>] ? should_resched+0xd/0x27
      [   59.691019]  [<8137977f>] ? _cond_resched+0xd/0x21
      [   59.691019]  [<81142abf>] pstore_get_records+0x52/0xa7
      [   59.691019]  [<8114254b>] pstore_fill_super+0x7d/0x91
      [   59.691019]  [<810a7ff5>] mount_single+0x46/0x82
      [   59.691019]  [<8114231a>] pstore_mount+0x15/0x17
      [   59.691019]  [<811424ce>] ? pstore_get_inode.isra.1+0x98/0x98
      [   59.691019]  [<810a8199>] mount_fs+0x5a/0x12d
      [   59.691019]  [<810b9174>] ? alloc_vfsmnt+0xa4/0x14a
      [   59.691019]  [<810b9474>] vfs_kern_mount+0x4f/0x7d
      [   59.691019]  [<810b9d7e>] do_kern_mount+0x34/0xb2
      [   59.691019]  [<810bb15f>] do_mount+0x5fc/0x64a
      [   59.691019]  [<810912fb>] ? strndup_user+0x2e/0x3f
      [   59.691019]  [<810bb3cb>] sys_mount+0x66/0x99
      [   59.691019]  [<8137b537>] sysenter_do_call+0x12/0x26
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      f6f82851
    • R
      PM Sleep: Do not extend wakeup paths to devices with ignore_children set · 8b258cc8
      Rafael J. Wysocki 提交于
      Commit 4ca46ff3 (PM / Sleep: Mark
      devices involved in wakeup signaling during suspend) introduced
      the power.wakeup_path field in struct dev_pm_info to mark devices
      whose children are enabled to wake up the system from sleep states,
      so that power domains containing the parents that provide their
      children with wakeup power and/or relay their wakeup signals are not
      turned off.  Unfortunately, that introduced a PM regression on SH7372
      whose power consumption in the system "memory sleep" state increased
      as a result of it, because it prevented the power domain containing
      the I2C controller from being turned off when some children of that
      controller were enabled to wake up the system, although the
      controller was not necessary for them to signal wakeup.
      
      To fix this issue use the observation that devices whose
      power.ignore_children flag is set for runtime PM should be treated
      analogously during system suspend.  Namely, they shouldn't be
      included in wakeup paths going through their children.  Since the
      SH7372 I2C controller's power.ignore_children flag is set, doing so
      will restore the previous behavior of that SOC.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: NGreg Kroah-Hartman <gregkh@suse.de>
      8b258cc8
  15. 17 11月, 2011 4 次提交
  16. 16 11月, 2011 4 次提交
  17. 14 11月, 2011 2 次提交
  18. 12 11月, 2011 1 次提交
  19. 11 11月, 2011 4 次提交