1. 13 12月, 2012 5 次提交
  2. 12 12月, 2012 25 次提交
    • Z
      Thermal: Fix DEFAULT_THERMAL_GOVERNOR · 1f53ef17
      Zhang Rui 提交于
      Fix DEFAULT_THERMAL_GOVERNOR to be consistant with the
      default governor selected in kernel config file.
      Signed-off-by: NZhang Rui <rui.zhang@intel.com>
      1f53ef17
    • E
      pkt_sched: avoid requeues if possible · 1abbe139
      Eric Dumazet 提交于
      With BQL being deployed, we can more likely have following behavior :
      
      We dequeue a packet from qdisc in dequeue_skb(), then we realize target
      tx queue is in XOFF state in sch_direct_xmit(), and we have to hold the
      skb into gso_skb for later.
      
      This shows in stats (tc -s qdisc dev eth0) as requeues.
      
      Problem of these requeues is that high priority packets can not be
      dequeued as long as this (possibly low prio and big TSO packet) is not
      removed from gso_skb.
      
      At 1Gbps speed, a full size TSO packet is 500 us of extra latency.
      
      In some cases, we know that all packets dequeued from a qdisc are
      for a particular and known txq :
      
      - If device is non multi queue
      - For all MQ/MQPRIO slave qdiscs
      
      This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE to mark
      this capability, so that dequeue_skb() is allowed to dequeue a packet
      only if the associated txq is not stopped.
      
      This indeed reduce latencies for high prio packets (or improve fairness
      with sfq/fq_codel), and almost remove qdisc 'requeues'.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1abbe139
    • L
      mm, memory-hotplug: dynamic configure movable memory and portion memory · 511c2aba
      Lai Jiangshan 提交于
      Add online_movable and online_kernel for logic memory hotplug.  This is
      the dynamic version of "movablecore" & "kernelcore".
      
      We have the same reason to introduce it as to introduce "movablecore" &
      "kernelcore".  It has the same motive as "movablecore" & "kernelcore", but
      it is dynamic/running-time:
      
      o We can configure memory as kernelcore or movablecore after boot.
      
        Userspace workload is increased, we need more hugepage, we can't use
        "online_movable" to add memory and allow the system use more
        THP(transparent-huge-page), vice-verse when kernel workload is increase.
      
        Also help for virtualization to dynamic configure host/guest's memory,
        to save/(reduce waste) memory.
      
        Memory capacity on Demand
      
      o When a new node is physically online after boot, we need to use
        "online_movable" or "online_kernel" to configure/portion it as we
        expected when we logic-online it.
      
        This configuration also helps for physically-memory-migrate.
      
      o all benefit as the same as existed "movablecore" & "kernelcore".
      
      o Preparing for movable-node, which is very important for power-saving,
        hardware partitioning and high-available-system(hardware fault
        management).
      
      (Note, we don't introduce movable-node here.)
      
      Action behavior:
      When a memoryblock/memorysection is onlined by "online_movable", the kernel
      will not have directly reference to the page of the memoryblock,
      thus we can remove that memory any time when needed.
      
      When it is online by "online_kernel", the kernel can use it.
      When it is online by "online", the zone type doesn't changed.
      
      Current constraints:
      Only the memoryblock which is adjacent to the ZONE_MOVABLE
      can be online from ZONE_NORMAL to ZONE_MOVABLE.
      
      [akpm@linux-foundation.org: use min_t, cleanups]
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      511c2aba
    • J
      bootmem: fix wrong call parameter for free_bootmem() · 81df9bff
      Joonsoo Kim 提交于
      It is strange that alloc_bootmem() returns a virtual address and
      free_bootmem() requires a physical address.  Anyway, free_bootmem()'s
      first parameter should be physical address.
      
      There are some call sites for free_bootmem() with virtual address.  So fix
      them.
      
      [akpm@linux-foundation.org: improve free_bootmem() and free_bootmem_pate() documentation]
      Signed-off-by: NJoonsoo Kim <js1304@gmail.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81df9bff
    • M
      mm: cma: remove watermark hacks · bc357f43
      Marek Szyprowski 提交于
      Commits 2139cbe6 ("cma: fix counting of isolated pages") and
      d95ea5d1 ("cma: fix watermark checking") introduced a reliable
      method of free page accounting when memory is being allocated from CMA
      regions, so the workaround introduced earlier by commit 49f223a9
      ("mm: trigger page reclaim in alloc_contig_range() to stabilise
      watermarks") can be finally removed.
      Signed-off-by: NMarek Szyprowski <m.szyprowski@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc357f43
    • D
      mm, oom: fix race when specifying a thread as the oom origin · e1e12d2f
      David Rientjes 提交于
      test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
      specify that current should be killed first if an oom condition occurs in
      between the two calls.
      
      The usage is
      
      	short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
      	...
      	compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);
      
      to store the thread's oom_score_adj, temporarily change it to the maximum
      score possible, and then restore the old value if it is still the same.
      
      This happens to still be racy, however, if the user writes
      OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
      The compare_swap_oom_score_adj() will then incorrectly reset the old value
      prior to the write of OOM_SCORE_ADJ_MAX.
      
      To fix this, introduce a new oom_flags_t member in struct signal_struct
      that will be used for per-thread oom killer flags.  KSM and swapoff can
      now use a bit in this member to specify that threads should be killed
      first in oom conditions without playing around with oom_score_adj.
      
      This also allows the correct oom_score_adj to always be shown when reading
      /proc/pid/oom_score.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1e12d2f
    • D
      mm, oom: change type of oom_score_adj to short · a9c58b90
      David Rientjes 提交于
      The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
      so this range can be represented by the signed short type with no
      functional change.  The extra space this frees up in struct signal_struct
      will be used for per-thread oom kill flags in the next patch.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9c58b90
    • Y
      mm: cleanup register_node() · fa264375
      Yasuaki Ishimatsu 提交于
      register_node() is defined as extern in include/linux/node.h.  But the
      function is only called from register_one_node() in driver/base/node.c.
      
      So the patch defines register_node() as static.
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa264375
    • R
      mm: introduce putback_movable_pages() · 5733c7d1
      Rafael Aquini 提交于
      The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
      hacks around putback_lru_pages() in order to allow ballooned pages to be
      re-inserted on balloon page list as if a ballooned page was like a LRU page.
      
      As ballooned pages are not legitimate LRU pages, this patch introduces
      putback_movable_pages() to properly cope with cases where the isolated
      pageset contains ballooned pages and LRU pages, thus fixing the mentioned
      inelegant hack around putback_lru_pages().
      Signed-off-by: NRafael Aquini <aquini@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5733c7d1
    • R
      mm: introduce a common interface for balloon pages mobility · 18468d93
      Rafael Aquini 提交于
      Memory fragmentation introduced by ballooning might reduce significantly
      the number of 2MB contiguous memory blocks that can be used within a guest,
      thus imposing performance penalties associated with the reduced number of
      transparent huge pages that could be used by the guest workload.
      
      This patch introduces a common interface to help a balloon driver on
      making its page set movable to compaction, and thus allowing the system
      to better leverage the compation efforts on memory defragmentation.
      
      [akpm@linux-foundation.org: use PAGE_FLAGS_CHECK_AT_PREP, s/__balloon_page_flags/page_flags_cleared/, small cleanups]
      [rientjes@google.com: allow balloon compaction for any system with memory compaction enabled, which is the defconfig]
      Signed-off-by: NRafael Aquini <aquini@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18468d93
    • R
      mm: redefine address_space.assoc_mapping · 252aa6f5
      Rafael Aquini 提交于
      Overhaul struct address_space.assoc_mapping renaming it to
      address_space.private_data and its type is redefined to void*.  By this
      approach we consistently name the .private_* elements from struct
      address_space as well as allow extended usage for address_space
      association with other data structures through ->private_data.
      
      Also, all users of old ->assoc_mapping element are converted to reflect
      its new name and type change (->private_data).
      Signed-off-by: NRafael Aquini <aquini@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      252aa6f5
    • R
      mm: adjust address_space_operations.migratepage() return code · 78bd5209
      Rafael Aquini 提交于
      Memory fragmentation introduced by ballooning might reduce significantly
      the number of 2MB contiguous memory blocks that can be used within a
      guest, thus imposing performance penalties associated with the reduced
      number of transparent huge pages that could be used by the guest workload.
      
      This patch-set follows the main idea discussed at 2012 LSFMMS session:
      "Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/
      to introduce the required changes to the virtio_balloon driver, as well as
      the changes to the core compaction & migration bits, in order to make
      those subsystems aware of ballooned pages and allow memory balloon pages
      become movable within a guest, thus avoiding the aforementioned
      fragmentation issue
      
      Following are numbers that prove this patch benefits on allowing
      compaction to be more effective at memory ballooned guests.
      
      Results for STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite,
      running on a 4gB RAM KVM guest which was ballooning 512mB RAM in 64mB
      chunks, at every minute (inflating/deflating), while test was running:
      
      ===BEGIN stress-highalloc
      
      STRESS-HIGHALLOC
                       highalloc-3.7     highalloc-3.7
                           rc4-clean         rc4-patch
      Pass 1          55.00 ( 0.00%)    62.00 ( 7.00%)
      Pass 2          54.00 ( 0.00%)    62.00 ( 8.00%)
      while Rested    75.00 ( 0.00%)    80.00 ( 5.00%)
      
      MMTests Statistics: duration
                       3.7         3.7
                 rc4-clean   rc4-patch
      User         1207.59     1207.46
      System       1300.55     1299.61
      Elapsed      2273.72     2157.06
      
      MMTests Statistics: vmstat
                                      3.7         3.7
                                rc4-clean   rc4-patch
      Page Ins                    3581516     2374368
      Page Outs                  11148692    10410332
      Swap Ins                         80          47
      Swap Outs                      3641         476
      Direct pages scanned          37978       33826
      Kswapd pages scanned        1828245     1342869
      Kswapd pages reclaimed      1710236     1304099
      Direct pages reclaimed        32207       31005
      Kswapd efficiency               93%         97%
      Kswapd velocity             804.077     622.546
      Direct efficiency               84%         91%
      Direct velocity              16.703      15.682
      Percentage direct scans          2%          2%
      Page writes by reclaim        79252        9704
      Page writes file              75611        9228
      Page writes anon               3641         476
      Page reclaim immediate        16764       11014
      Page rescued immediate            0           0
      Slabs scanned               2171904     2152448
      Direct inode steals             385        2261
      Kswapd inode steals          659137      609670
      Kswapd skipped wait               1          69
      THP fault alloc                 546         631
      THP collapse alloc              361         339
      THP splits                      259         263
      THP fault fallback               98          50
      THP collapse fail                20          17
      Compaction stalls               747         499
      Compaction success              244         145
      Compaction failures             503         354
      Compaction pages moved       370888      474837
      Compaction move failure       77378       65259
      
      ===END stress-highalloc
      
      This patch:
      
      Introduce MIGRATEPAGE_SUCCESS as the default return code for
      address_space_operations.migratepage() method and documents the expected
      return code for the same method in failure cases.
      Signed-off-by: NRafael Aquini <aquini@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78bd5209
    • M
      mm: vm_unmapped_area() lookup function · db4fbfb9
      Michel Lespinasse 提交于
      Implement vm_unmapped_area() using the rb_subtree_gap and highest_vm_end
      information to look up for suitable virtual address space gaps.
      
      struct vm_unmapped_area_info is used to define the desired allocation
      request:
       - lowest or highest possible address matching the remaining constraints
       - desired gap length
       - low/high address limits that the gap must fit into
       - alignment mask and offset
      
      Also update the generic arch_get_unmapped_area[_topdown] functions to make
      use of vm_unmapped_area() instead of implementing a brute force search.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      db4fbfb9
    • R
      mm: rearrange vm_area_struct for fewer cache misses · e4c6bfd2
      Rik van Riel 提交于
      The kernel walks the VMA rbtree in various places, including the page
      fault path.  However, the vm_rb node spanned two cache lines, on 64 bit
      systems with 64 byte cache lines (most x86 systems).
      
      Rearrange vm_area_struct a little, so all the information we need to do a
      VMA tree walk is in the first cache line.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e4c6bfd2
    • M
      mm: augment vma rbtree with rb_subtree_gap · d3737187
      Michel Lespinasse 提交于
      Define vma->rb_subtree_gap as the largest gap between any vma in the
      subtree rooted at that vma, and their predecessor.  Or, for a recursive
      definition, vma->rb_subtree_gap is the max of:
      
       - vma->vm_start - vma->vm_prev->vm_end
       - rb_subtree_gap fields of the vmas pointed by vma->rb.rb_left and
         vma->rb.rb_right
      
      This will allow get_unmapped_area_* to find a free area of the right
      size in O(log(N)) time, instead of potentially having to do a linear
      walk across all the VMAs.
      
      Also define mm->highest_vm_end as the vm_end field of the highest vma,
      so that we can easily check if the following gap is suitable.
      
      This does have the potential to make unmapping VMAs more expensive,
      especially for processes with very large numbers of VMAs, where the VMA
      rbtree can grow quite deep.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3737187
    • A
      mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB · 42d7395f
      Andi Kleen 提交于
      There was some desire in large applications using MAP_HUGETLB or
      SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
      others.  This is useful together with NUMA policy: use 2MB interleaving
      on some mappings, but 1GB on local mappings.
      
      This patch extends the IPC/SHM syscall interfaces slightly to allow
      specifying the page size.
      
      It borrows some upper bits in the existing flag arguments and allows
      encoding the log of the desired page size in addition to the *_HUGETLB
      flag.  When 0 is specified the default size is used, this makes the
      change fully compatible.
      
      Extending the internal hugetlb code to handle this is straight forward.
      Instead of a single mount it just keeps an array of them and selects the
      right mount based on the specified page size.  When no page size is
      specified it uses the mount of the default page size.
      
      The change is not visible in /proc/mounts because internal mounts don't
      appear there.  It also has very little overhead: the additional mounts
      just consume a super block, but not more memory when not used.
      
      I also exported the new flags to the user headers (they were previously
      under __KERNEL__).  Right now only symbols for x86 and some other
      architecture for 1GB and 2MB are defined.  The interface should already
      work for all other architectures though.  Only architectures that define
      multiple hugetlb sizes actually need it (that is currently x86, tile,
      powerpc).  However tile and powerpc have user configurable hugetlb
      sizes, so it's not easy to add defines.  A program on those
      architectures would need to query sysfs and use the appropiate log2.
      
      [akpm@linux-foundation.org: cleanups]
      [rientjes@google.com: fix build]
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42d7395f
    • W
      mm: thp: set the accessed flag for old pages on access fault · a1dd450b
      Will Deacon 提交于
      On x86 memory accesses to pages without the ACCESSED flag set result in
      the ACCESSED flag being set automatically.  With the ARM architecture a
      page access fault is raised instead (and it will continue to be raised
      until the ACCESSED flag is set for the appropriate PTE/PMD).
      
      For normal memory pages, handle_pte_fault will call pte_mkyoung
      (effectively setting the ACCESSED flag).  For transparent huge pages,
      pmd_mkyoung will only be called for a write fault.
      
      This patch ensures that faults on transparent hugepages which do not
      result in a CoW update the access flags for the faulting pmd.
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Ni zhan Chen <nizhan.chen@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a1dd450b
    • L
      memory_hotplug: fix possible incorrect node_states[N_NORMAL_MEMORY] · d9713679
      Lai Jiangshan 提交于
      Currently memory_hotplug only manages the node_states[N_HIGH_MEMORY], it
      forgets to manage node_states[N_NORMAL_MEMORY].  This may cause
      node_states[N_NORMAL_MEMORY] to become incorrect.
      
      Example, if a node is empty before online, and we online a memory which is
      in ZONE_NORMAL.  And after online, node_states[N_HIGH_MEMORY] is correct,
      but node_states[N_NORMAL_MEMORY] is incorrect, the online code doesn't set
      the new online node to node_states[N_NORMAL_MEMORY].
      
      The same thing will happen when offlining (the offline code doesn't clear
      the node from node_states[N_NORMAL_MEMORY] when needed).  Some memory
      managment code depends node_states[N_NORMAL_MEMORY], so we have to fix up
      the node_states[N_NORMAL_MEMORY].
      
      We add node_states_check_changes_online() and
      node_states_check_changes_offline() to detect whether
      node_states[N_HIGH_MEMORY] and node_states[N_NORMAL_MEMORY] are changed
      while hotpluging.
      
      Also add @status_change_nid_normal to struct memory_notify, thus the
      memory hotplug callbacks know whether the node_states[N_NORMAL_MEMORY] are
      changed.  (We can add a @flags and reuse @status_change_nid instead of
      introducing @status_change_nid_normal, but it will add much more
      complexity in memory hotplug callback in every subsystem.  So introducing
      @status_change_nid_normal is better and it doesn't change the sematics of
      @status_change_nid)
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Rob Landley <rob@landley.net>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9713679
    • W
      numa: convert static memory to dynamically allocated memory for per node device · 8732794b
      Wen Congyang 提交于
      We use a static array to store struct node.  In many cases, we don't have
      too many nodes, and some memory will be unused.  Convert it to per-device
      dynamically allocated memory.
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8732794b
    • W
      memory-hotplug: skip HWPoisoned page when offlining pages · b023f468
      Wen Congyang 提交于
      hwpoisoned may be set when we offline a page by the sysfs interface
      /sys/devices/system/memory/soft_offline_page or
      /sys/devices/system/memory/hard_offline_page. If we don't clear
      this flag when onlining pages, this page can't be freed, and will
      not in free list. So we can't offline these pages again. So we
      should skip such page when offlining pages.
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b023f468
    • K
      mm: use IS_ENABLED(CONFIG_COMPACTION) instead of COMPACTION_BUILD · d84da3f9
      Kirill A. Shutemov 提交于
      We don't need custom COMPACTION_BUILD anymore, since we have handy
      IS_ENABLED().
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d84da3f9
    • K
      mm: use IS_ENABLED(CONFIG_NUMA) instead of NUMA_BUILD · e5adfffc
      Kirill A. Shutemov 提交于
      We don't need custom NUMA_BUILD anymore, since we have handy
      IS_ENABLED().
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e5adfffc
    • D
      mm, memcg: make mem_cgroup_out_of_memory() static · 19965460
      David Rientjes 提交于
      mem_cgroup_out_of_memory() is only referenced from within file scope, so
      it can be marked static.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19965460
    • N
      writeback: remove nr_pages_dirtied arg from balance_dirty_pages_ratelimited_nr() · d0e1d66b
      Namjae Jeon 提交于
      There is no reason to pass the nr_pages_dirtied argument, because
      nr_pages_dirtied value from the caller is unused in
      balance_dirty_pages_ratelimited_nr().
      Signed-off-by: NNamjae Jeon <linkinjeon@gmail.com>
      Signed-off-by: NVivek Trivedi <vtrivedi018@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0e1d66b
    • E
      net: fix a race in gro_cell_poll() · f8e8f97c
      Eric Dumazet 提交于
      Dmitry Kravkov reported packet drops for GRE packets since GRO support
      was added.
      
      There is a race in gro_cell_poll() because we call napi_complete()
      without any synchronization with a concurrent gro_cells_receive()
      
      Once bug was triggered, we queued packets but did not schedule NAPI
      poll.
      
      We can fix this issue using the spinlock protected the napi_skbs queue,
      as we have to hold it to perform skb dequeue anyway.
      
      As we open-code skb_dequeue(), we no longer need to mask IRQS, as both
      producer and consumer run under BH context.
      
      Bug added in commit c9e6bc64 (net: add gro_cells infrastructure)
      Reported-by: NDmitry Kravkov <dmitry@broadcom.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Tested-by: NDmitry Kravkov <dmitry@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8e8f97c
  3. 11 12月, 2012 4 次提交
    • V
      drivers: cma: represent physical addresses as phys_addr_t · 4009793e
      Vitaly Andrianov 提交于
      This commit changes the CMA early initialization code to use phys_addr_t
      for representing physical addresses instead of unsigned long.
      
      Without this change, among other things, dma_declare_contiguous() simply
      discards any memory regions whose address is not representable as unsigned
      long.
      
      This is a problem on 32-bit PAE machines where unsigned long is 32-bit
      but physical address space is larger.
      Signed-off-by: NVitaly Andrianov <vitalya@ti.com>
      Signed-off-by: NCyril Chemparathy <cyril@ti.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Signed-off-by: NMarek Szyprowski <m.szyprowski@samsung.com>
      4009793e
    • M
      clk: introduce optional disable_unused callback · 7c045a55
      Mike Turquette 提交于
      Some gate clocks have special needs which must be handled during the
      disable-unused clocks sequence.  These needs might be driven by software
      due to the fact that we're disabling a clock outside of the normal
      clk_disable path and a clk's enable_count will not be accurate.  On the
      other hand a specific hardware programming sequence might need to be
      followed for this corner case.
      
      This change is needed for the upcoming OMAP port to the common clock
      framework.  Specifically, it is undesirable to treat the disable-unused
      path identically to the normal clk_disable path since other software
      layers are involved.  In this case OMAP's clockdomain code throws WARNs
      and bails early due to the clock's enable_count being set to zero.  A
      custom callback mitigates this problem nicely.
      
      Cc: Paul Walmsley <paul@pwsan.com>
      Acked-by: NUlf Hansson <ulf.hansson@linaro.org>
      Acked-by: NLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: NMike Turquette <mturquette@linaro.org>
      7c045a55
    • G
      ath9k: allow to load EEPROM content via firmware API · ab5c4f71
      Gabor Juhos 提交于
      The calibration data for devices w/o a separate
      EEPROM chip can be specified via the 'eeprom_data'
      field of 'ath9k_platform_data'. The 'eeprom_data'
      is usually filled from board specific setup
      functions. It is easy if the EEPROM data is mapped
      to the memory, but it can be complicated if it is
      stored elsewhere.
      
      The patch adds support for loading of the EEPROM
      data via the firmware API to avoid this limitation.
      Signed-off-by: NGabor Juhos <juhosg@openwrt.org>
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      ab5c4f71
    • L
      Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damage · caf49191
      Linus Torvalds 提交于
      This reverts commits a5091539 and
      d7c3b937.
      
      This is a revert of a revert of a revert.  In addition, it reverts the
      even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the
      original commits in linux-next.
      
      It turns out that the original patch really was bogus, and that the
      original revert was the correct thing to do after all.  We thought we
      had fixed the problem, and then reverted the revert, but the problem
      really is fundamental: waking up kswapd simply isn't the right thing to
      do, and direct reclaim sometimes simply _is_ the right thing to do.
      
      When certain allocations fail, we simply should try some direct reclaim,
      and if that fails, fail the allocation.  That's the right thing to do
      for THP allocations, which can easily fail, and the GPU allocations want
      to do that too.
      
      So starting kswapd is sometimes simply wrong, and removing the flag that
      said "don't start kswapd" was a mistake.  Let's hope we never revisit
      this mistake again - and certainly not this many times ;)
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      caf49191
  4. 10 12月, 2012 1 次提交
  5. 09 12月, 2012 2 次提交
    • J
      virtio_net: multiqueue support · 986a4f4d
      Jason Wang 提交于
      This patch adds the multiqueue (VIRTIO_NET_F_MQ) support to virtio_net
      driver. VIRTIO_NET_F_MQ capable device could allow the driver to do packet
      transmission and reception through multiple queue pairs and does the packet
      steering to get better performance. By default, one one queue pair is used, user
      could change the number of queue pairs by ethtool in the next patch.
      
      When multiple queue pairs is used and the number of queue pairs is equal to the
      number of vcpus. Driver does the following optimizations to implement per-cpu
      virt queue pairs:
      
      - select the txq based on the smp processor id.
      - smp affinity hint to the cpu that owns the queue pairs.
      
      This could be used with the flow steering support of the device to guarantee the
      packets of a single flow is handled by the same cpu.
      Signed-off-by: NKrishna Kumar <krkumar2@in.ibm.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      986a4f4d
    • J
      net: Add support for hardware-offloaded encapsulation · 6a674e9c
      Joseph Gasparakis 提交于
      This patch adds support in the kernel for offloading in the NIC Tx and Rx
      checksumming for encapsulated packets (such as VXLAN and IP GRE).
      
      For Tx encapsulation offload, the driver will need to set the right bits
      in netdev->hw_enc_features. The protocol driver will have to set the
      skb->encapsulation bit and populate the inner headers, so the NIC driver will
      use those inner headers to calculate the csum in hardware.
      
      For Rx encapsulation offload, the driver will need to set again the
      skb->encapsulation flag and the skb->ip_csum to CHECKSUM_UNNECESSARY.
      In that case the protocol driver should push the decapsulated packet up
      to the stack, again with CHECKSUM_UNNECESSARY. In ether case, the protocol
      driver should set the skb->encapsulation flag back to zero. Finally the
      protocol driver should have NETIF_F_RXCSUM flag set in its features.
      Signed-off-by: NJoseph Gasparakis <joseph.gasparakis@intel.com>
      Signed-off-by: NPeter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a674e9c
  6. 08 12月, 2012 3 次提交
    • E
      net: gro: fix possible panic in skb_gro_receive() · c3c7c254
      Eric Dumazet 提交于
      commit 2e71a6f8 (net: gro: selective flush of packets) added
      a bug for skbs using frag_list. This part of the GRO stack is rarely
      used, as it needs skb not using a page fragment for their skb->head.
      
      Most drivers do use a page fragment, but some of them use GFP_KERNEL
      allocations for the initial fill of their RX ring buffer.
      
      napi_gro_flush() overwrite skb->prev that was used for these skb to
      point to the last skb in frag_list.
      
      Fix this using a separate field in struct napi_gro_cb to point to the
      last fragment.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3c7c254
    • Y
      tcp: bug fix Fast Open client retransmission · 93b174ad
      Yuchung Cheng 提交于
      If SYN-ACK partially acks SYN-data, the client retransmits the
      remaining data by tcp_retransmit_skb(). This increments lost recovery
      state variables like tp->retrans_out in Open state. If loss recovery
      happens before the retransmission is acked, it triggers the WARN_ON
      check in tcp_fastretrans_alert(). For example: the client sends
      SYN-data, gets SYN-ACK acking only ISN, retransmits data, sends
      another 4 data packets and get 3 dupacks.
      
      Since the retransmission is not caused by network drop it should not
      update the recovery state variables. Further the server may return a
      smaller MSS than the cached MSS used for SYN-data, so the retranmission
      needs a loop. Otherwise some data will not be retransmitted until timeout
      or other loss recovery events.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93b174ad
    • C
      bridge: export multicast database via netlink · ee07c6e7
      Cong Wang 提交于
      V5: fix two bugs pointed out by Thomas
          remove seq check for now, mark it as TODO
      
      V4: remove some useless #include
          some coding style fix
      
      V3: drop debugging printk's
          update selinux perm table as well
      
      V2: drop patch 1/2, export ifindex directly
          Redesign netlink attributes
          Improve netlink seq check
          Handle IPv6 addr as well
      
      This patch exports bridge multicast database via netlink
      message type RTM_GETMDB. Similar to fdb, but currently bridge-specific.
      We may need to support modify multicast database too (RTM_{ADD,DEL}MDB).
      
      (Thanks to Thomas for patient reviews)
      
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NCong Wang <amwang@redhat.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee07c6e7