1. 09 Oct, 2013 (6 commits)
    • sched/numa: Set the scan rate proportional to the memory usage of the task being scanned · 598f0ec0
      Mel Gorman committed
      The NUMA PTE scan rate is controlled by a combination of the
      numa_balancing_scan_period_min, numa_balancing_scan_period_max and
      numa_balancing_scan_size tunables. This scan rate is independent of
      the size of the task, and matters are further complicated by the fact
      that numa_balancing_scan_size controls how many pages are marked
      pte_numa, not how much virtual memory is scanned.
      
      In combination, it is almost impossible to meaningfully tune the min
      and max scan periods, and reasoning about performance is complex when
      the time to complete a full scan is partially a function of the task's
      memory size. This patch alters the semantics of the min and max
      tunables so that they tune the length of time it takes to complete a
      scan of a task's occupied virtual address space. Conceptually this is
      a lot easier to understand. There is a "sanity" check to ensure the
      scan rate is never extremely fast, based on the amount of virtual
      memory that should be scanned in a second. The default of 2.5G per
      second seems arbitrary, but it makes the maximum scan rate after the
      patch roughly match the maximum scan rate before the patch was applied.
      
      On a similar note, numa_scan_period is in milliseconds, not jiffies.
      Properly placed pages slow the scanning rate, but adding 10 jiffies
      to numa_scan_period means that the rate at which scanning slows
      depends on HZ, which is confusing. Get rid of the jiffies_to_msec
      conversion and treat it as ms.
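      
      As a hedged illustration of the new semantic (a sketch only: the
      MAX_SCAN_WINDOW constant and the helper shapes below are assumptions
      mirroring the description above, not the kernel's exact code), the
      per-window scan period can be derived from the task's occupied
      address space so that one full pass takes scan_period_min ms,
      floored so the rate never exceeds 2.5G of virtual memory a second:
      
      	/* Sanity cap: never scan more than this much memory per second. */
      	#define MAX_SCAN_WINDOW	2560U	/* MB/sec, the "2.5G" above */
      
      	/* How many scan_size-sized windows cover the task's occupied space. */
      	static unsigned int task_nr_scan_windows(unsigned long task_rss_mb,
      						 unsigned int scan_size_mb)
      	{
      		unsigned long windows = task_rss_mb / scan_size_mb;
      
      		return windows ? (unsigned int)windows : 1U;
      	}
      
      	/* ms between two scan windows: a full pass should take
      	 * scan_period_min_ms, floored by the MAX_SCAN_WINDOW rate
      	 * (assumes scan_size_mb <= MAX_SCAN_WINDOW). */
      	static unsigned int task_scan_min(unsigned long task_rss_mb,
      					  unsigned int scan_size_mb,
      					  unsigned int scan_period_min_ms)
      	{
      		unsigned int floor_ms = 1000U / (MAX_SCAN_WINDOW / scan_size_mb);
      		unsigned int scan_ms = scan_period_min_ms /
      			task_nr_scan_windows(task_rss_mb, scan_size_mb);
      
      		return scan_ms > floor_ms ? scan_ms : floor_ms;
      	}
      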
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Initialise numa_next_scan properly · 7e8d16b6
      Mel Gorman committed
      Scan delay logic and resets are currently initialised to start
      scanning immediately instead of delaying properly. Initialise them
      properly at fork time and catch the case where a new mm has been
      allocated.
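      
      A hedged sketch of the fork-time initialisation (field and sysctl
      names follow the NUMA balancing code of that era; treat the exact
      placement as an approximation, not the literal patch):
      
      	/* In the fork path: if this task received a fresh mm, make the
      	 * first NUMA PTE scan wait for the configured scan delay instead
      	 * of firing immediately. */
      	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
      		p->mm->numa_next_scan = jiffies +
      			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
      		p->mm->numa_scan_seq = 0;
      	}
      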
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-17-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node" · b726b7df
      Mel Gorman committed
      PTE scanning and NUMA hinting fault handling are expensive, so commit
      5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
      on a new node") deferred the PTE scan until a task had been scheduled on
      another node. The problem is that in the purely shared memory case this
      may never happen, and no NUMA hinting fault information will be
      captured. We are not ruling out the possibility that something better
      can be done here, but for now this patch needs to be reverted and depend
      entirely on the scan_delay to avoid punishing short-lived processes.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Continue PTE scanning even if migrate rate limited · 9e645ab6
      Peter Zijlstra committed
      Avoiding marking PTEs pte_numa because a particular NUMA node is
      migrate rate limited seems like a bad idea. Even if this node can't
      accept migrations any more, other nodes might, and we want up-to-date
      information to make balancing decisions. We already rate limit the
      actual migrations, which should leave enough bandwidth for the
      non-migrating scanning. I think it's important we keep up-to-date
      information if we're going to do placement based on it.
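      
      The change itself amounts to deleting an early bail-out; a hedged
      sketch of the removed lines (helper names from the mm code of that
      era, placement approximated):
      
      	/* Removed from task_numa_work(): scanning no longer gives up
      	 * when the local node is migrate rate limited, so the fault
      	 * statistics stay fresh even while migrations are throttled. */
      	if (migrate_ratelimited(numa_node_id()))
      		return;
      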
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-15-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Mitigate chance that same task always updates PTEs · 19a78d11
      Peter Zijlstra committed
      With a trace_printk("working\n"); right after the cmpxchg in
      task_numa_work() we can see that, in a 4-thread process, it's always
      the same task winning the race and doing the protection change.
      
      This is a problem since the task doing the protection change has a
      penalty for taking faults -- it is busy while marking the PTEs. If
      it's always the same task, the ->numa_faults[] statistics get
      severely skewed.
      
      Avoid this by delaying the task that did the protection change such
      that it is unlikely to win the privilege again.
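      
      A hedged sketch of that delay (the patch adds a bump along these
      lines in task_numa_work() after winning the cmpxchg; the exact
      spacing constant is an assumption):
      
      	/* We won the cmpxchg and will do the protection change; push
      	 * our next scan slot back so that a sibling thread is likely
      	 * to win the race next time around. */
      	p->node_stamp += 2 * TICK_NSEC;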
      
      Before:
      
      root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
            thread 0/0-3232  [022] ....   212.787402: task_numa_work: working
            thread 0/0-3232  [022] ....   212.888473: task_numa_work: working
            thread 0/0-3232  [022] ....   212.989538: task_numa_work: working
            thread 0/0-3232  [022] ....   213.090602: task_numa_work: working
            thread 0/0-3232  [022] ....   213.191667: task_numa_work: working
            thread 0/0-3232  [022] ....   213.292734: task_numa_work: working
            thread 0/0-3232  [022] ....   213.393804: task_numa_work: working
            thread 0/0-3232  [022] ....   213.494869: task_numa_work: working
            thread 0/0-3232  [022] ....   213.596937: task_numa_work: working
            thread 0/0-3232  [022] ....   213.699000: task_numa_work: working
            thread 0/0-3232  [022] ....   213.801067: task_numa_work: working
            thread 0/0-3232  [022] ....   213.903155: task_numa_work: working
            thread 0/0-3232  [022] ....   214.005201: task_numa_work: working
            thread 0/0-3232  [022] ....   214.107266: task_numa_work: working
            thread 0/0-3232  [022] ....   214.209342: task_numa_work: working
      
      After:
      
      root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
            thread 0/0-3253  [005] ....   136.865051: task_numa_work: working
            thread 0/2-3255  [026] ....   136.965134: task_numa_work: working
            thread 0/3-3256  [024] ....   137.065217: task_numa_work: working
            thread 0/3-3256  [024] ....   137.165302: task_numa_work: working
            thread 0/3-3256  [024] ....   137.265382: task_numa_work: working
            thread 0/0-3253  [004] ....   137.366465: task_numa_work: working
            thread 0/2-3255  [026] ....   137.466549: task_numa_work: working
            thread 0/0-3253  [004] ....   137.566629: task_numa_work: working
            thread 0/0-3253  [004] ....   137.666711: task_numa_work: working
            thread 0/1-3254  [028] ....   137.766799: task_numa_work: working
            thread 0/0-3253  [004] ....   137.866876: task_numa_work: working
            thread 0/2-3255  [026] ....   137.966960: task_numa_work: working
            thread 0/1-3254  [028] ....   138.067041: task_numa_work: working
            thread 0/2-3255  [026] ....   138.167123: task_numa_work: working
            thread 0/3-3256  [024] ....   138.267207: task_numa_work: working
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-14-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Fix comments · c69307d5
      Peter Zijlstra committed
      Fix an 80-column violation and a PTE vs PMD reference.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-4-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 06 Oct, 2013 (1 commit)
  3. 01 Oct, 2013 (4 commits)
    • irq: Force hardirq exit's softirq processing on its own stack · ded79754
      Frederic Weisbecker committed
      Commit facd8b80 ("irq: Sanitize invoke_softirq") converted the
      irq-exit calls of do_softirq() to __do_softirq() on all
      architectures, assuming do_softirq() was only used there for its
      irq disablement properties.
      
      But as a side effect, the softirqs processed at the end
      of the hardirq are always called on the inline current
      stack that is used by irq_exit(), instead of the softirq
      stack provided by the archs that override do_softirq().
      
      The result is mostly safe if the architecture runs irq_exit()
      on a separate irq stack, because then softirqs are processed
      on that same stack, which is near empty at this stage (assuming
      hardirqs aren't nesting).
      
      Otherwise irq_exit() runs on the task stack, and so does the softirq.
      The interrupted call stack can be randomly deep already, and
      the softirq can dig through it even further. To add insult to
      injury, this softirq can be interrupted by a new hardirq, maximizing
      the chances of a stack overrun, as reported on powerpc for example:
      
      	do_IRQ: stack overflow: 1920
      	CPU: 0 PID: 1602 Comm: qemu-system-ppc Not tainted 3.10.4-300.1.fc19.ppc64p7 #1
      	Call Trace:
      	[c0000000050a8740] .show_stack+0x130/0x200 (unreliable)
      	[c0000000050a8810] .dump_stack+0x28/0x3c
      	[c0000000050a8880] .do_IRQ+0x2b8/0x2c0
      	[c0000000050a8930] hardware_interrupt_common+0x154/0x180
      	--- Exception: 501 at .cp_start_xmit+0x3a4/0x820 [8139cp]
      		LR = .cp_start_xmit+0x390/0x820 [8139cp]
      	[c0000000050a8d40] .dev_hard_start_xmit+0x394/0x640
      	[c0000000050a8e00] .sch_direct_xmit+0x110/0x260
      	[c0000000050a8ea0] .dev_queue_xmit+0x260/0x630
      	[c0000000050a8f40] .br_dev_queue_push_xmit+0xc4/0x130 [bridge]
      	[c0000000050a8fc0] .br_dev_xmit+0x198/0x270 [bridge]
      	[c0000000050a9070] .dev_hard_start_xmit+0x394/0x640
      	[c0000000050a9130] .dev_queue_xmit+0x428/0x630
      	[c0000000050a91d0] .ip_finish_output+0x2a4/0x550
      	[c0000000050a9290] .ip_local_out+0x50/0x70
      	[c0000000050a9310] .ip_queue_xmit+0x148/0x420
      	[c0000000050a93b0] .tcp_transmit_skb+0x4e4/0xaf0
      	[c0000000050a94a0] .__tcp_ack_snd_check+0x7c/0xf0
      	[c0000000050a9520] .tcp_rcv_established+0x1e8/0x930
      	[c0000000050a95f0] .tcp_v4_do_rcv+0x21c/0x570
      	[c0000000050a96c0] .tcp_v4_rcv+0x734/0x930
      	[c0000000050a97a0] .ip_local_deliver_finish+0x184/0x360
      	[c0000000050a9840] .ip_rcv_finish+0x148/0x400
      	[c0000000050a98d0] .__netif_receive_skb_core+0x4f8/0xb00
      	[c0000000050a99d0] .netif_receive_skb+0x44/0x110
      	[c0000000050a9a70] .br_handle_frame_finish+0x2bc/0x3f0 [bridge]
      	[c0000000050a9b20] .br_nf_pre_routing_finish+0x2ac/0x420 [bridge]
      	[c0000000050a9bd0] .br_nf_pre_routing+0x4dc/0x7d0 [bridge]
      	[c0000000050a9c70] .nf_iterate+0x114/0x130
      	[c0000000050a9d30] .nf_hook_slow+0xb4/0x1e0
      	[c0000000050a9e00] .br_handle_frame+0x290/0x330 [bridge]
      	[c0000000050a9ea0] .__netif_receive_skb_core+0x34c/0xb00
      	[c0000000050a9fa0] .netif_receive_skb+0x44/0x110
      	[c0000000050aa040] .napi_gro_receive+0xe8/0x120
      	[c0000000050aa0c0] .cp_rx_poll+0x31c/0x590 [8139cp]
      	[c0000000050aa1d0] .net_rx_action+0x1dc/0x310
      	[c0000000050aa2b0] .__do_softirq+0x158/0x330
      	[c0000000050aa3b0] .irq_exit+0xc8/0x110
      	[c0000000050aa430] .do_IRQ+0xdc/0x2c0
      	[c0000000050aa4e0] hardware_interrupt_common+0x154/0x180
      	 --- Exception: 501 at .bad_range+0x1c/0x110
      		 LR = .get_page_from_freelist+0x908/0xbb0
      	[c0000000050aa7d0] .list_del+0x18/0x50 (unreliable)
      	[c0000000050aa850] .get_page_from_freelist+0x908/0xbb0
      	[c0000000050aa9e0] .__alloc_pages_nodemask+0x21c/0xae0
      	[c0000000050aaba0] .alloc_pages_vma+0xd0/0x210
      	[c0000000050aac60] .handle_pte_fault+0x814/0xb70
      	[c0000000050aad50] .__get_user_pages+0x1a4/0x640
      	[c0000000050aae60] .get_user_pages_fast+0xec/0x160
      	[c0000000050aaf10] .__gfn_to_pfn_memslot+0x3b0/0x430 [kvm]
      	[c0000000050aafd0] .kvmppc_gfn_to_pfn+0x64/0x130 [kvm]
      	[c0000000050ab070] .kvmppc_mmu_map_page+0x94/0x530 [kvm]
      	[c0000000050ab190] .kvmppc_handle_pagefault+0x174/0x610 [kvm]
      	[c0000000050ab270] .kvmppc_handle_exit_pr+0x464/0x9b0 [kvm]
      	[c0000000050ab320]  kvm_start_lightweight+0x1ec/0x1fc [kvm]
      	[c0000000050ab4f0] .kvmppc_vcpu_run_pr+0x168/0x3b0 [kvm]
      	[c0000000050ab9c0] .kvmppc_vcpu_run+0xc8/0xf0 [kvm]
      	[c0000000050aba50] .kvm_arch_vcpu_ioctl_run+0x5c/0x1a0 [kvm]
      	[c0000000050abae0] .kvm_vcpu_ioctl+0x478/0x730 [kvm]
      	[c0000000050abc90] .do_vfs_ioctl+0x4ec/0x7c0
      	[c0000000050abd80] .SyS_ioctl+0xd4/0xf0
      	[c0000000050abe30] syscall_exit+0x0/0x98
      
      Since this is a regression, this patch proposes a minimalistic
      and low-risk solution: blindly force the hardirq-exit processing of
      softirqs onto the softirq stack. This way we should significantly
      reduce the opportunities for task stack overflows dug by softirqs.
      
      Longer term solutions may involve extending the hardirq stack coverage to
      irq_exit(), etc...
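      
      A hedged sketch of the change in kernel/softirq.c (comment wording
      approximated): invoke_softirq() goes back to calling do_softirq(),
      so architectures that override it process pending softirqs on their
      dedicated softirq stack:
      
      	static inline void invoke_softirq(void)
      	{
      		if (!force_irqthreads) {
      			/*
      			 * do_softirq() rather than __do_softirq(): archs
      			 * that provide a separate softirq stack switch to
      			 * it here instead of digging into the task stack.
      			 */
      			do_softirq();
      		} else {
      			wakeup_softirqd();
      		}
      	}
      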
      Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: #3.9.. <stable@vger.kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@au1.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: James E.J. Bottomley <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
    • pidns: fix free_pid() to handle the first fork failure · 314a8ad0
      Oleg Nesterov committed
      "case 0" in free_pid() assumes that disable_pid_allocation() should
      clear PIDNS_HASH_ADDING before the last pid goes away.
      
      However this doesn't happen if the first fork() fails to create the
      child reaper which should call disable_pid_allocation().
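      
      A hedged sketch of the fix in free_pid()'s switch (reconstructed
      from the commit description; constant and field names follow the pid
      namespace code of that era):
      
      	switch (--ns->nr_hashed) {
      	case 2:
      	case 1:
      		/* Only the reaper is left; it may be sleeping in
      		 * zap_pid_ns_processes(), so wake it up. */
      		wake_up_process(ns->child_reaper);
      		break;
      	case PIDNS_HASH_ADDING:
      		/* The first fork() failed, so disable_pid_allocation()
      		 * never ran; do the "last pid" cleanup here as well. */
      		WARN_ON(ns->child_reaper);
      		ns->nr_hashed = 0;
      		/* fall through */
      	case 0:
      		schedule_work(&ns->proc_work);
      		break;
      	}
      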
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/kmod.c: check for NULL in call_usermodehelper_exec() · 4c1c7be9
      Tetsuo Handa committed
      If /proc/sys/kernel/core_pattern contains only "|", a NULL pointer
      dereference happens upon core dump because argv_split("") returns
      argv[0] == NULL.
      
      This bug was once fixed by commit 264b83c0 ("usermodehelper: check
      subprocess_info->path != NULL") but was erroneously reintroduced by
      commit 7f57cfa4 ("usermodehelper: kill the sub_info->path[0] check").
      
      This bug seems to have existed since 2.6.19 (the version in which
      core dump to a pipe was added).  Depending on kernel version and
      config, some side effect might happen immediately after this oops
      (e.g.  kernel panic with 2.6.32-358.18.1.el6).
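      
      A hedged sketch of the restored check at the top of
      call_usermodehelper_exec() (error-path details approximated):
      
      	/* A core_pattern of just "|" makes argv_split("") produce
      	 * argv[0] == NULL, so refuse a NULL path up front. */
      	if (!sub_info->path) {
      		retval = -EINVAL;
      		goto out;
      	}
      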
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • PM / hibernate: Fix user space driven resume regression · aab17289
      Rafael J. Wysocki committed
      Recent commit 8fd37a4c (PM / hibernate: Create memory bitmaps after
      freezing user space) broke the resume part of the user space driven
      hibernation (s2disk), because I forgot that the resume utility
      loads the image into memory without freezing user space (it still
      freezes tasks after loading the image).  This means that during user
      space driven resume we need to create the memory bitmaps at the
      "device open" time rather than at the "freeze tasks" time, so make
      that happen (that's a special case anyway, so it needs to be treated
      in a special way).
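      
      A hedged sketch of the special case in kernel/power/user.c's
      snapshot_open() (control flow approximated; only the idea that the
      write-side open path, used by the resume utility, now creates the
      bitmaps is taken from the commit text):
      
      	/* Resume side of the snapshot device: the image is loaded with
      	 * user space still running, so create the memory bitmaps at
      	 * device-open time rather than at freeze-tasks time. */
      	if ((filp->f_flags & O_ACCMODE) == O_WRONLY) {
      		error = create_basic_memory_bitmaps();
      		if (!error)
      			error = pm_notifier_call_chain(PM_RESTORE_PREPARE);
      	}
      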
      Reported-and-tested-by: Ronald <ronald645@gmail.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  4. 29 Sep, 2013 (1 commit)
  5. 27 Sep, 2013 (1 commit)
  6. 25 Sep, 2013 (12 commits)
  7. 20 Sep, 2013 (8 commits)
  8. 16 Sep, 2013 (1 commit)
  9. 13 Sep, 2013 (6 commits)