1. 15 11月, 2013 1 次提交
    • K
      mm: rename USE_SPLIT_PTLOCKS to USE_SPLIT_PTE_PTLOCKS · 57c1ffce
      Kirill A. Shutemov 提交于
      We're going to introduce split page table lock for PMD level.  Let's
      rename existing split ptlock for PTE level to avoid confusion.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NAlex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57c1ffce
  2. 13 11月, 2013 23 次提交
    • M
      ipc, msg: fix message length check for negative values · 4e9b45a1
      Mathias Krause 提交于
      On 64 bit systems the test for negative message sizes is bogus as the
      size, which may be positive when evaluated as a long, will get truncated
      to an int when passed to load_msg().  So a long might very well contain a
      positive value but when truncated to an int it would become negative.
      
      That in combination with a small negative value of msg_ctlmax (which will
      be promoted to an unsigned type for the comparison against msgsz, making
      it a big positive value and therefore make it pass the check) will lead to
      two problems: 1/ The kmalloc() call in alloc_msg() will allocate a too
      small buffer as the addition of alen is effectively a subtraction.  2/ The
      copy_from_user() call in load_msg() will first overflow the buffer with
      userland data and then, when the userland access generates an access
      violation, the fixup handler copy_user_handle_tail() will try to fill the
      remainder with zeros -- roughly 4GB.  That almost instantly results in a
      system crash or reset.
      
        ,-[ Reproducer (needs to be run as root) ]--
        | #include <sys/stat.h>
        | #include <sys/msg.h>
        | #include <unistd.h>
        | #include <fcntl.h>
        |
        | int main(void) {
        |     long msg = 1;
        |     int fd;
        |
        |     fd = open("/proc/sys/kernel/msgmax", O_WRONLY);
        |     write(fd, "-1", 2);
        |     close(fd);
        |
        |     msgsnd(0, &msg, 0xfffffff0, IPC_NOWAIT);
        |
        |     return 0;
        | }
        '---
      
      Fix the issue by preventing msgsz from getting truncated by consistently
      using size_t for the message length.  This way the size checks in
      do_msgsnd() could still be passed with a negative value for msg_ctlmax but
      we would fail on the buffer allocation in that case and error out.
      
      Also change the type of m_ts from int to size_t to avoid similar nastiness
      in other code paths -- it is used in similar constructs, i.e.  signed vs.
      unsigned checks.  It should never become negative under normal
      circumstances, though.
      
      Setting msg_ctlmax to a negative value is an odd configuration and should
      be prevented.  As that might break existing userland, it will be handled
      in a separate commit so it could easily be reverted and reworked without
      reintroducing the above described bug.
      
      Hardening mechanisms for user copy operations would have catched that bug
      early -- e.g.  checking slab object sizes on user copy operations as the
      usercopy feature of the PaX patch does.  Or, for that matter, detect the
      long vs.  int sign change due to truncation, as the size overflow plugin
      of the very same patch does.
      
      [akpm@linux-foundation.org: fix i386 min() warnings]
      Signed-off-by: NMathias Krause <minipli@googlemail.com>
      Cc: Pax Team <pageexec@freemail.hu>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>	[ v2.3.27+ -- yes, that old ;) ]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e9b45a1
    • J
      rbtree: fix rbtree_postorder_for_each_entry_safe() iterator · 1310a5a9
      Jan Kara 提交于
      The iterator rbtree_postorder_for_each_entry_safe() relies on pointer
      underflow behavior when testing for loop termination.  In particular it
      expects that
      
        &rb_entry(NULL, type, field)->field
      
      is NULL.  But the result of this expression is not defined by a C standard
      and some gcc versions (e.g.  4.3.4) assume the above expression can never
      be equal to NULL.  The net result is an oops because the iteration is not
      properly terminated.
      
      Fix the problem by modifying the iterator to avoid pointer underflows.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: <stable@vger.kernel.org>		[3.12.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1310a5a9
    • K
      exec/ptrace: fix get_dumpable() incorrect tests · d049f74f
      Kees Cook 提交于
      The get_dumpable() return value is not boolean.  Most users of the
      function actually want to be testing for non-SUID_DUMP_USER(1) rather than
      SUID_DUMP_DISABLE(0).  The SUID_DUMP_ROOT(2) is also considered a
      protected state.  Almost all places did this correctly, excepting the two
      places fixed in this patch.
      
      Wrong logic:
          if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ }
              or
          if (dumpable == 0) { /* be protective */ }
              or
          if (!dumpable) { /* be protective */ }
      
      Correct logic:
          if (dumpable != SUID_DUMP_USER) { /* be protective */ }
              or
          if (dumpable != 1) { /* be protective */ }
      
      Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a
      user was able to ptrace attach to processes that had dropped privileges to
      that user.  (This may have been partially mitigated if Yama was enabled.)
      
      The macros have been moved into the file that declares get/set_dumpable(),
      which means things like the ia64 code can see them too.
      
      CVE-2013-2929
      Reported-by: NVasily Kulikov <segoon@openwall.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d049f74f
    • S
      rtc: s5m-rtc: add real-time clock driver for s5m8767 · 5bccae6e
      Sangbeom Kim 提交于
      Add real-time clock driver for s5m8767.
      Signed-off-by: NSangbeom Kim <sbkim73@samsung.com>
      Signed-off-by: NSachin Kamat <sachin.kamat@linaro.org>
      Cc: Todd Broch <tbroch@chromium.org>
      Cc: Mark Brown <broonie@kernel.org>
      Acked-by: Lee Jones <lee.jones@linaro.org>	[mfd parts]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5bccae6e
    • G
      init.h: document the existence of __initconst · 65321547
      Geert Uytterhoeven 提交于
      Initdata can be const since more than 5 years, using the __initconst
      keyword.
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65321547
    • O
      list: introduce list_last_entry(), use list_{first,last}_entry() · 93be3c2e
      Oleg Nesterov 提交于
      We already have list_first_entry(), it makes sense to also add
      list_last_entry() for consistency.  And we use both helpers in
      list_for_each_*().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Eilon Greenstein <eilong@broadcom.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      93be3c2e
    • O
      list: change list_for_each_entry*() to use list_*_entry() · 8120e2e5
      Oleg Nesterov 提交于
      Now that we have list_{next,prev}_entry() we can change
      list_for_each_entry*() and list_safe_reset_next() to use the new helpers
      to improve the readability.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Eilon Greenstein <eilong@broadcom.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8120e2e5
    • O
      list: introduce list_next_entry() and list_prev_entry() · 008208c6
      Oleg Nesterov 提交于
      Add two trivial helpers list_next_entry() and list_prev_entry(), they
      can have a lot of users including list.h itself.  In fact the 1st one is
      already defined in events/core.c and bnx2x_sp.c, so the patch simply
      moves the definition to list.h.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Eilon Greenstein <eilong@broadcom.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      008208c6
    • N
      lib/genalloc: add a helper function for DMA buffer allocation · 684f0d3d
      Nicolin Chen 提交于
      When using pool space for DMA buffer, there might be duplicated calling of
      gen_pool_alloc() and gen_pool_virt_to_phys() in each implementation.
      
      Thus it's better to add a simple helper function, a compatible one to the
      common dma_alloc_coherent(), to save some code.
      Signed-off-by: NNicolin Chen <b42378@freescale.com>
      Cc: "Hans J. Koch" <hjk@hansjkoch.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Eric Miao <eric.y.miao@gmail.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haojian Zhuang <haojian.zhuang@gmail.com>
      Cc: Jaroslav Kysela <perex@perex.cz>
      Cc: Kevin Hilman <khilman@deeprootsystems.com>
      Cc: Liam Girdwood <lgirdwood@gmail.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Mauro Carvalho Chehab <m.chehab@samsung.com>
      Cc: Rob Herring <rob.herring@calxeda.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Sekhar Nori <nsekhar@ti.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Vinod Koul <vinod.koul@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      684f0d3d
    • D
      backlight: lm3630: apply chip revision · 28e64a68
      Daniel Jeong 提交于
      The LM3630 chip was revised by TI and chip name was also changed to
      LM3630A.  And register map, default values and initial sequences are
      changed.  The files, lm3630_bl.{c,h} are replaced by lm3630a_bl.{c,h} You
      can find more information about LM3630A(datasheet, evm etc) at
      http://www.ti.com/product/lm3630aSigned-off-by: NDaniel Jeong <gshark.jeong@gmail.com>
      Signed-off-by: NJingoo Han <jg1.han@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28e64a68
    • M
      backlight: lp855x_bl: support new LP8555 device · 5812c13a
      Milo Kim 提交于
      LP8555 is one of the LP855x family devices.
      
      This device needs pre_init_device() and post_init_device() driver
      structure.  It's same as LP8557, so the device configuration code is
      shared with LP8557.  Backlight outputs are generated from dual DC-DC boost
      converters.  It's configurable EPROM settings which are defined in the
      platform data.
      
      Driver documentation and device tree bindings are updated.
      Signed-off-by: NMilo Kim <milo.kim@ti.com>
      Signed-off-by: NJingoo Han <jg1.han@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5812c13a
    • V
      sched: remove ARCH specific fpu_counter from task_struct · 27f69e68
      Vineet Gupta 提交于
      fpu_counter in task_struct was used only by sh/x86.  Both of these now
      carry it in ARCH specific thread_struct, hence this can now be removed
      from generic task_struct, shrinking it slightly for other arches.
      Signed-off-by: NVineet Gupta <vgupta@synopsys.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul Mundt <paul.mundt@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27f69e68
    • R
      jump_label: unlikely(x) > 0 · 261adc9a
      Roel Kluin 提交于
      if (unlikely(x) > 0) doesn't seem to help branch prediction
      Signed-off-by: NRoel Kluin <roel.kluin@gmail.com>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      261adc9a
    • A
      syscalls.h: use gcc alias instead of assembler aliases for syscalls · 83460ec8
      Andi Kleen 提交于
      Use standard gcc __attribute__((alias(foo))) to define the syscall aliases
      instead of custom assembler macros.
      
      This is far cleaner, and also fixes my LTO kernel build.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83460ec8
    • M
      mm: numa: return the number of base pages altered by protection changes · 72403b4a
      Mel Gorman 提交于
      Commit 0255d491 ("mm: Account for a THP NUMA hinting update as one
      PTE update") was added to account for the number of PTE updates when
      marking pages prot_numa.  task_numa_work was using the old return value
      to track how much address space had been updated.  Altering the return
      value causes the scanner to do more work than it is configured or
      documented to in a single unit of work.
      
      This patch reverts that commit and accounts for the number of THP
      updates separately in vmstat.  It is up to the administrator to
      interpret the pair of values correctly.  This is a straight-forward
      operation and likely to only be of interest when actively debugging NUMA
      balancing problems.
      
      The impact of this patch is that the NUMA PTE scanner will scan slower
      when THP is enabled and workloads may converge slower as a result.  On
      the flip size system CPU usage should be lower than recent tests
      reported.  This is an illustrative example of a short single JVM specjbb
      test
      
      specjbb
                             3.12.0                3.12.0
                            vanilla      acctupdates
      TPut 1      26143.00 (  0.00%)     25747.00 ( -1.51%)
      TPut 7     185257.00 (  0.00%)    183202.00 ( -1.11%)
      TPut 13    329760.00 (  0.00%)    346577.00 (  5.10%)
      TPut 19    442502.00 (  0.00%)    460146.00 (  3.99%)
      TPut 25    540634.00 (  0.00%)    549053.00 (  1.56%)
      TPut 31    512098.00 (  0.00%)    519611.00 (  1.47%)
      TPut 37    461276.00 (  0.00%)    474973.00 (  2.97%)
      TPut 43    403089.00 (  0.00%)    414172.00 (  2.75%)
      
                    3.12.0      3.12.0
                   vanillaacctupdates
      User         5169.64     5184.14
      System        100.45       80.02
      Elapsed       252.75      251.85
      
      Performance is similar but note the reduction in system CPU time.  While
      this showed a performance gain, it will not be universal but at least
      it'll be behaving as documented.  The vmstats are obviously different but
      here is an obvious interpretation of them from mmtests.
      
                                      3.12.0      3.12.0
                                     vanillaacctupdates
      NUMA page range updates        1408326    11043064
      NUMA huge PMD updates                0       21040
      NUMA PTE updates               1408326      291624
      
      "NUMA page range updates" == nr_pte_updates and is the value returned to
      the NUMA pte scanner.  NUMA huge PMD updates were the number of THP
      updates which in combination can be used to calculate how many ptes were
      updated from userspace.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NAlex Thorlton <athorlton@sgi.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72403b4a
    • J
      mm: factor commit limit calculation · 00619bcc
      Jerome Marchand 提交于
      The same calculation is currently done in three differents places.
      Factor that code so future changes has to be made at only one place.
      
      [akpm@linux-foundation.org: uninline vm_commit_limit()]
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00619bcc
    • T
      mm/memblock.c: introduce bottom-up allocation mode · 79442ed1
      Tang Chen 提交于
      The Linux kernel cannot migrate pages used by the kernel.  As a result,
      kernel pages cannot be hot-removed.  So we cannot allocate hotpluggable
      memory for the kernel.
      
      ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
      info.  But before SRAT is parsed, memblock has already started to allocate
      memory for the kernel.  So we need to prevent memblock from doing this.
      
      In a memory hotplug system, any numa node the kernel resides in should be
      unhotpluggable.  And for a modern server, each node could have at least
      16GB memory.  So memory around the kernel image is highly likely
      unhotpluggable.
      
      So the basic idea is: Allocate memory from the end of the kernel image and
      to the higher memory.  Since memory allocation before SRAT is parsed won't
      be too much, it could highly likely be in the same node with kernel image.
      
      The current memblock can only allocate memory top-down.  So this patch
      introduces a new bottom-up allocation mode to allocate memory bottom-up.
      And later when we use this allocation direction to allocate memory, we
      will limit the start address above the kernel.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79442ed1
    • J
      writeback: do not sync data dirtied after sync start · c4a391b5
      Jan Kara 提交于
      When there are processes heavily creating small files while sync(2) is
      running, it can easily happen that quite some new files are created
      between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2).  That can happen
      especially if there are several busy filesystems (remember that sync
      traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
      fs before starting it on another fs).  Because WB_SYNC_ALL pass is slow
      (e.g.  causes a transaction commit and cache flush for each inode in
      ext3), resulting sync(2) times are rather large.
      
      The following script reproduces the problem:
      
        function run_writers
        {
          for (( i = 0; i < 10; i++ )); do
            mkdir $1/dir$i
            for (( j = 0; j < 40000; j++ )); do
              dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
            done &
          done
        }
      
        for dir in "$@"; do
          run_writers $dir
        done
      
        sleep 40
        time sync
      
      Fix the problem by disregarding inodes dirtied after sync(2) was called
      in the WB_SYNC_ALL pass.  To allow for this, sync_inodes_sb() now takes
      a time stamp when sync has started which is used for setting up work for
      flusher threads.
      
      To give some numbers, when above script is run on two ext4 filesystems
      on simple SATA drive, the average sync time from 10 runs is 267.549
      seconds with standard deviation 104.799426.  With the patched kernel,
      the average sync time from 10 runs is 2.995 seconds with standard
      deviation 0.096.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NFengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4a391b5
    • N
      tools/vm/page-types.c: support KPF_SOFTDIRTY bit · 46c77e2b
      Naoya Horiguchi 提交于
      Soft dirty bit allows us to track which pages are written since the last
      clear_ref (by "echo 4 > /proc/pid/clear_refs".) This is useful for
      userspace applications to know their memory footprints.
      
      Note that the kernel exposes this flag via bit[55] of /proc/pid/pagemap,
      and the semantics is not a default one (scheduled to be the default in the
      near future.) However, it shifts to the new semantics at the first
      clear_ref, and the users of soft dirty bit always do it before utilizing
      the bit, so that's not a big deal.  Users must avoid relying on the bit in
      page-types before the first clear_ref.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46c77e2b
    • Z
      mm/sparsemem: use PAGES_PER_SECTION to remove redundant nr_pages parameter · 85b35fea
      Zhang Yanfei 提交于
      For below functions,
      
      - sparse_add_one_section()
      - kmalloc_section_memmap()
      - __kmalloc_section_memmap()
      - __kfree_section_memmap()
      
      they are always invoked to operate on one memory section, so it is
      redundant to always pass a nr_pages parameter, which is the page numbers
      in one section.  So we can directly use predefined macro PAGES_PER_SECTION
      instead of passing the parameter.
      Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      85b35fea
    • D
      mm, mempolicy: make mpol_to_str robust and always succeed · 948927ee
      David Rientjes 提交于
      mpol_to_str() should not fail.  Currently, it either fails because the
      string buffer is too small or because a string hasn't been defined for a
      mempolicy mode.
      
      If a new mempolicy mode is introduced and no string is defined for it,
      just warn and return "unknown".
      
      If the buffer is too small, just truncate the string and return, the
      same behavior as snprintf().
      
      This also fixes a bug where there was no NULL-byte termination when doing
      *p++ = '=' and *p++ ':' and maxlen has been reached.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Chen Gang <gang.chen@asianux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      948927ee
    • T
      cpu/mem hotplug: add try_online_node() for cpu_up() · 01b0f197
      Toshi Kani 提交于
      cpu_up() has #ifdef CONFIG_MEMORY_HOTPLUG code blocks, which call
      mem_online_node() to put its node online if offlined and then call
      build_all_zonelists() to initialize the zone list.
      
      These steps are specific to memory hotplug, and should be managed in
      mm/memory_hotplug.c.  lock_memory_hotplug() should also be held for the
      whole steps.
      
      For this reason, this patch replaces mem_online_node() with
      try_online_node(), which performs the whole steps with
      lock_memory_hotplug() held.  try_online_node() is named after
      try_offline_node() as they have similar purpose.
      
      There is no functional change in this patch.
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01b0f197
    • Q
      mm: add a helper function to check may oom condition · b9921ecd
      Qiang Huang 提交于
      Use helper function to check if we need to deal with oom condition.
      Signed-off-by: NQiang Huang <h.huangqiang@huawei.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9921ecd
  3. 12 11月, 2013 4 次提交
    • D
      random32: upgrade taus88 generator to taus113 from errata paper · a98814ce
      Daniel Borkmann 提交于
      Since we use prandom*() functions quite often in networking code
      i.e. in UDP port selection, netfilter code, etc, upgrade the PRNG
      from Pierre L'Ecuyer's original paper "Maximally Equidistributed
      Combined Tausworthe Generators", Mathematics of Computation, 65,
      213 (1996), 203--213 to the version published in his errata paper [1].
      
      The Tausworthe generator is a maximally-equidistributed generator,
      that is fast and has good statistical properties [1].
      
      The version presented there upgrades the 3 state LFSR to a 4 state
      LFSR with increased periodicity from about 2^88 to 2^113. The
      algorithm is presented in [1] by the very same author who also
      designed the original algorithm in [2].
      
      Also, by increasing the state, we make it a bit harder for attackers
      to "guess" the PRNGs internal state. See also discussion in [3].
      
      Now, as we use this sort of weak initialization discussed in [3]
      only between core_initcall() until late_initcall() time [*] for
      prandom32*() users, namely in prandom_init(), it is less relevant
      from late_initcall() onwards as we overwrite seeds through
      prandom_reseed() anyways with a seed source of higher entropy, that
      is, get_random_bytes(). In other words, a exhaustive keysearch of
      96 bit would be needed. Now, with the help of this patch, this
      state-search increases further to 128 bit. Initialization needs
      to make sure that s1 > 1, s2 > 7, s3 > 15, s4 > 127.
      
      taus88 and taus113 algorithm is also part of GSL. I added a test
      case in the next patch to verify internal behaviour of this patch
      with GSL and ran tests with the dieharder 3.31.1 RNG test suite:
      
      $ dieharder -g 052 -a -m 10 -s 1 -S 4137730333 #taus88
      $ dieharder -g 054 -a -m 10 -s 1 -S 4137730333 #taus113
      
      With this seed configuration, in order to compare both, we get
      the following differences:
      
      algorithm                 taus88           taus113
      rands/second [**]         1.61e+08         1.37e+08
      sts_serial(4, 1st run)    WEAK             PASSED
      sts_serial(9, 2nd run)    WEAK             PASSED
      rgb_lagged_sum(31)        WEAK             PASSED
      
      We took out diehard_sums test as according to the authors it is
      considered broken and unusable [4]. Despite that and the slight
      decrease in performance (which is acceptable), taus113 here passes
      all 113 tests (only rgb_minimum_distance_5 in WEAK, the rest PASSED).
      In general, taus/taus113 is considered "very good" by the authors
      of dieharder [5].
      
      The papers [1][2] states a single warm-up step is sufficient by
      running quicktaus once on each state to ensure proper initialization
      of ~s_{0}:
      
      Our selection of (s) according to Table 1 of [1] row 1 holds the
      condition L - k <= r - s, that is,
      
        (32 32 32 32) - (31 29 28 25) <= (25 27 15 22) - (18 2 7 13)
      
      with r = k - q and q = (6 2 13 3) as also stated by the paper.
      So according to [2] we are safe with one round of quicktaus for
      initialization. However we decided to include the warm-up phase
      of the PRNG as done in GSL in every case as a safety net. We also
      use the warm up phase to make the output of the RNG easier to
      verify by the GSL output.
      
      In prandom_init(), we also mix random_get_entropy() into it, just
      like drivers/char/random.c does it, jiffies ^ random_get_entropy().
      random-get_entropy() is get_cycles(). xor is entropy preserving so
      it is fine if it is not implemented by some architectures.
      
      Note, this PRNG is *not* used for cryptography in the kernel, but
      rather as a fast PRNG for various randomizations i.e. in the
      networking code, or elsewhere for debugging purposes, for example.
      
      [*]: In order to generate some "sort of pseduo-randomness", since
      get_random_bytes() is not yet available for us, we use jiffies and
      initialize states s1 - s3 with a simple linear congruential generator
      (LCG), that is x <- x * 69069; and derive s2, s3, from the 32bit
      initialization from s1. So the above quote from [3] accounts only
      for the time from core to late initcall, not afterwards.
      [**] Single threaded run on MacBook Air w/ Intel Core i5-3317U
      
       [1] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme2.ps
       [2] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme.ps
       [3] http://thread.gmane.org/gmane.comp.encryption.general/12103/
       [4] http://code.google.com/p/dieharder/source/browse/trunk/libdieharder/diehard_sums.c?spec=svn490&r=490#20
       [5] http://www.phy.duke.edu/~rgb/General/dieharder.php
      
      Joint work with Hannes Frederic Sowa.
      
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a98814ce
    • D
      random32: move rnd_state to linux/random.h · 38e9efcd
      Daniel Borkmann 提交于
      struct rnd_state got mistakenly pulled into uapi header. It is not
      used anywhere and does also not belong there!
      
      Commit 5960164f ("lib/random32: export pseudo-random number
      generator for modules"), the last commit on rnd_state before it
      got moved to uapi, says:
      
        This patch moves the definition of struct rnd_state and the inline
        __seed() function to linux/random.h.  It renames the static __random32()
        function to prandom32() and exports it for use in modules.
      
      Hence, the structure was moved from lib/random32.c to linux/random.h
      so that it can be used within modules (FCoE-related code in this
      case), but not from user space. However, it seems to have been
      mistakenly moved to uapi header through the uapi script. Since no-one
      should make use of it from the linux headers, move the structure back
      to the kernel for internal use, so that it can be modified on demand.
      
      Joint work with Hannes Frederic Sowa.
      
      Cc: Joe Eykholt <jeykholt@cisco.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38e9efcd
    • H
      random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized · 4af712e8
      Hannes Frederic Sowa 提交于
      The Tausworthe PRNG is initialized at late_initcall time. At that time the
      entropy pool serving get_random_bytes is not filled sufficiently. This
      patch adds an additional reseeding step as soon as the nonblocking pool
      gets marked as initialized.
      
      On some machines it might be possible that late_initcall gets called after
      the pool has been initialized. In this situation we won't reseed again.
      
      (A call to prandom_seed_late blocks later invocations of early reseed
      attempts.)
      
      Joint work with Daniel Borkmann.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4af712e8
    • D
      random32: fix off-by-one in seeding requirement · 51c37a70
      Daniel Borkmann 提交于
      For properly initialising the Tausworthe generator [1], we have
      a strict seeding requirement, that is, s1 > 1, s2 > 7, s3 > 15.
      
      Commit 697f8d03 ("random32: seeding improvement") introduced
      a __seed() function that imposes boundary checks proposed by the
      errata paper [2] to properly ensure above conditions.
      
      However, we're off by one, as the function is implemented as:
      "return (x < m) ? x + m : x;", and called with __seed(X, 1),
      __seed(X, 7), __seed(X, 15). Thus, an unwanted seed of 1, 7, 15
      would be possible, whereas the lower boundary should actually
      be of at least 2, 8, 16, just as GSL does. Fix this, as otherwise
      an initialization with an unwanted seed could have the effect
      that Tausworthe's PRNG properties cannot not be ensured.
      
      Note that this PRNG is *not* used for cryptography in the kernel.
      
       [1] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme.ps
       [2] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme2.ps
      
      Joint work with Hannes Frederic Sowa.
      
      Fixes: 697f8d03 ("random32: seeding improvement")
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51c37a70
  4. 11 11月, 2013 3 次提交
  5. 09 11月, 2013 9 次提交