1. 07 5月, 2014 2 次提交
    • V
      sched: Rework sched_domain topology definition · 143e1e28
      Vincent Guittot 提交于
      We replace the old way to configure the scheduler topology with a new method
      which enables a platform to declare additionnal level (if needed).
      
      We still have a default topology table definition that can be used by platform
      that don't want more level than the SMT, MC, CPU and NUMA ones. This table can
      be overwritten by an arch which either wants to add new level where a load
      balance make sense like BOOK or powergating level or wants to change the flags
      configuration of some levels.
      
      For each level, we need a function pointer that returns cpumask for each cpu,
      a function pointer that returns the flags for the level and a name. Only flags
      that describe topology, can be set by an architecture. The current topology
      flags are:
      
       SD_SHARE_CPUPOWER
       SD_SHARE_PKG_RESOURCES
       SD_NUMA
       SD_ASYM_PACKING
      
      Then, each level must be a subset on the next one. The build sequence of the
      sched_domain will take care of removing useless levels like those with 1 CPU
      and those with the same CPU span and no more relevant information for
      load balancing than its children.
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Tested-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: NPreeti U Murthy <preeti@linux.vnet.ibm.com>
      Reviewed-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hanjun Guo <hanjun.guo@linaro.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux390@de.ibm.com
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Link: http://lkml.kernel.org/r/1397209481-28542-2-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      143e1e28
    • J
      sched/deadline: Fix sched_yield() behavior · 5bfd126e
      Juri Lelli 提交于
      yield_task_dl() is broken:
      
       o it forces current to be throttled setting its runtime to zero;
       o it sets current's dl_se->dl_new to one, expecting that dl_task_timer()
         will queue it back with proper parameters at replenish time.
      
      Unfortunately, dl_task_timer() has this check at the very beginning:
      
      	if (!dl_task(p) || dl_se->dl_new)
      		goto unlock;
      
      So, it just bails out and the task is never replenished. It actually
      yielded forever.
      
      To fix this, introduce a new flag indicating that the task properly yielded
      the CPU before its current runtime expired. While this is a little overdoing
      at the moment, the flag would be useful in the future to discriminate between
      "good" jobs (of which remaining runtime could be reclaimed, i.e. recycled)
      and "bad" jobs (for which dl_throttled task has been set) that needed to be
      stopped.
      Reported-by: Nyjay.kim <yjay.kim@lge.com>
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140429103953.e68eba1b2ac3309214e3dc5a@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5bfd126e
  2. 19 4月, 2014 2 次提交
  3. 18 4月, 2014 5 次提交
  4. 17 4月, 2014 5 次提交
  5. 16 4月, 2014 1 次提交
    • I
      x86: Remove the PCI reboot method from the default chain · 5be44a6f
      Ingo Molnar 提交于
      Steve reported a reboot hang and bisected it back to this commit:
      
        a4f1987e x86, reboot: Add EFI and CF9 reboot methods into the default list
      
      He heroically tested all reboot methods and found the following:
      
        reboot=t       # triple fault                  ok
        reboot=k       # keyboard ctrl                 FAIL
        reboot=b       # BIOS                          ok
        reboot=a       # ACPI                          FAIL
        reboot=e       # EFI                           FAIL   [system has no EFI]
        reboot=p       # PCI 0xcf9                     FAIL
      
      And I think it's pretty obvious that we should only try PCI 0xcf9 as a
      last resort - if at all.
      
      The other observation is that (on this box) we should never try
      the PCI reboot method, but close with either the 'triple fault'
      or the 'BIOS' (terminal!) reboot methods.
      
      Thirdly, CF9_COND is a total misnomer - it should be something like
      CF9_SAFE or CF9_CAREFUL, and 'CF9' should be 'CF9_FORCE' ...
      
      So this patch fixes the worst problems:
      
       - it orders the actual reboot logic to follow the reboot ordering
         pattern - it was in a pretty random order before for no good
         reason.
      
       - it fixes the CF9 misnomers and uses BOOT_CF9_FORCE and
         BOOT_CF9_SAFE flags to make the code more obvious.
      
       - it tries the BIOS reboot method before the PCI reboot method.
         (Since 'BIOS' is a terminal reboot method resulting in a hang
          if it does not work, this is essentially equivalent to removing
          the PCI reboot method from the default reboot chain.)
      
       - just for the miraculous possibility of terminal (resulting
         in hang) reboot methods of triple fault or BIOS returning
         without having done their job, there's an ordering between
         them as well.
      Reported-and-bisected-and-tested-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Li Aubrey <aubrey.li@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Link: http://lkml.kernel.org/r/20140404064120.GB11877@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5be44a6f
  6. 15 4月, 2014 1 次提交
    • D
      net: filter: seccomp: fix wrong decoding of BPF_S_ANC_SECCOMP_LD_W · 8c482cdc
      Daniel Borkmann 提交于
      While reviewing seccomp code, we found that BPF_S_ANC_SECCOMP_LD_W has
      been wrongly decoded by commit a8fc9277 ("sk-filter: Add ability to
      get socket filter program (v2)") into the opcode BPF_LD|BPF_B|BPF_ABS
      although it should have been decoded as BPF_LD|BPF_W|BPF_ABS.
      
      In practice, this should not have much side-effect though, as such
      conversion is/was being done through prctl(2) PR_SET_SECCOMP. Reverse
      operation PR_GET_SECCOMP will only return the current seccomp mode, but
      not the filter itself. Since the transition to the new BPF infrastructure,
      it's also not used anymore, so we can simply remove this as it's
      unreachable.
      
      Fixes: a8fc9277 ("sk-filter: Add ability to get socket filter program (v2)")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c482cdc
  7. 12 4月, 2014 1 次提交
    • D
      net: Fix use after free by removing length arg from sk_data_ready callbacks. · 676d2369
      David S. Miller 提交于
      Several spots in the kernel perform a sequence like:
      
      	skb_queue_tail(&sk->s_receive_queue, skb);
      	sk->sk_data_ready(sk, skb->len);
      
      But at the moment we place the SKB onto the socket receive queue it
      can be consumed and freed up.  So this skb->len access is potentially
      to freed up memory.
      
      Furthermore, the skb->len can be modified by the consumer so it is
      possible that the value isn't accurate.
      
      And finally, no actual implementation of this callback actually uses
      the length argument.  And since nobody actually cared about it's
      value, lots of call sites pass arbitrary values in such as '0' and
      even '1'.
      
      So just remove the length argument from the callback, that way there
      is no confusion whatsoever and all of these use-after-free cases get
      fixed as a side effect.
      
      Based upon a patch by Eric Dumazet and his suggestion to audit this
      issue tree-wide.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      676d2369
  8. 11 4月, 2014 7 次提交
  9. 10 4月, 2014 4 次提交
    • J
      block: fix regression with block enabled tagging · 360f92c2
      Jens Axboe 提交于
      Martin reported that his test system would not boot with
      current git, it oopsed with this:
      
      BUG: unable to handle kernel paging request at ffff88046c6c9e80
      IP: [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
      PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060
      Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
      Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm
      rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon
      CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246
      Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
      Workqueue: events_unbound async_run_entry_fn
      task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000
      RIP: 0010:[<ffffffff812971e0>]  [<ffffffff812971e0>]
      blk_queue_start_tag+0x90/0x150
      RSP: 0018:ffff880273d03a58  EFLAGS: 00010092
      RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6
      RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d
      RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410
      R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0
      R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e
      FS:  0000000000000000(0000) GS:ffff880277b00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0
      Stack:
       ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000
       ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62
       ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8
      Call Trace:
       [<ffffffff813b9e62>] scsi_request_fn+0xf2/0x510
       [<ffffffff81293167>] __blk_run_queue+0x37/0x50
       [<ffffffff8129ac43>] blk_execute_rq_nowait+0xb3/0x130
       [<ffffffff8129ad24>] blk_execute_rq+0x64/0xf0
       [<ffffffff8108d2b0>] ? bit_waitqueue+0xd0/0xd0
       [<ffffffff813bba35>] scsi_execute+0xe5/0x180
       [<ffffffff813bbe4a>] scsi_execute_req_flags+0x9a/0x110
       [<ffffffffa01b1304>] sd_spinup_disk+0x94/0x460 [sd_mod]
       [<ffffffff81160000>] ? __unmap_hugepage_range+0x200/0x2f0
       [<ffffffffa01b2b9a>] sd_revalidate_disk+0xaa/0x3f0 [sd_mod]
       [<ffffffffa01b2fb8>] sd_probe_async+0xd8/0x200 [sd_mod]
       [<ffffffff8107703f>] async_run_entry_fn+0x3f/0x140
       [<ffffffff8106a1c5>] process_one_work+0x175/0x410
       [<ffffffff8106b373>] worker_thread+0x123/0x400
       [<ffffffff8106b250>] ? manage_workers+0x160/0x160
       [<ffffffff8107104e>] kthread+0xce/0xf0
       [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
       [<ffffffff815f0bac>] ret_from_fork+0x7c/0xb0
       [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
      Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89
      df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 <48> 89
      58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d
      RIP  [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
       RSP <ffff880273d03a58>
      CR2: ffff88046c6c9e80
      
      Martin bisected and found this to be the problem patch;
      
      	commit 6d113398
      	Author: Jan Kara <jack@suse.cz>
      	Date:   Mon Feb 24 16:39:54 2014 +0100
      
      	    block: Stop abusing rq->csd.list in blk-softirq
      
      and the problem was immediately apparent. The patch states that
      it is safe to reuse queuelist at completion time, since it is
      no longer used. However, that is not true if a device is using
      block enabled tagging. If that is the case, then the queuelist
      is reused to keep track of busy tags. If a device also ended
      up using softirq completions, we'd reuse ->queuelist for the
      IPI handling while block tagging was still using it. Boom.
      
      Fix this by adding a new ipi_list list head, and share the
      memory used with the request hash table. The hash table is
      never used after the request is moved to the dispatch list,
      which happens long before any potential completion of the
      request. Add a new request bit for this, so we don't have
      cases that check rq->hash while it could potentially have
      been reused for the IPI completion.
      Reported-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Tested-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      360f92c2
    • M
      tracing: Fix anonymous unions in struct ftrace_event_call · abb43f69
      Mathieu Desnoyers 提交于
      gcc <= 4.5.x has significant limitations with respect to initialization
      of anonymous unions within structures. They need to be surrounded by
      brackets, _and_ they need to be initialized in the same order in which
      they appear in the structure declaration.
      
      Link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676
      Link: http://lkml.kernel.org/r/1397077568-3156-1-git-send-email-mathieu.desnoyers@efficios.comSigned-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      abb43f69
    • B
      x86: LLVMLinux: Fix "incomplete type const struct x86cpu_device_id" · c4586256
      Behan Webster 提交于
      Similar to the fix in 40413dcb
      
      MODULE_DEVICE_TABLE(x86cpu, ...) expects the struct to be called struct
      x86cpu_device_id, and not struct x86_cpu_id which is what is used in the rest
      of the kernel code.  Although gcc seems to ignore this error, clang fails
      without this define to fix the name.
      
      Code from drivers/thermal/x86_pkg_temp_thermal.c
      static const struct x86_cpu_id __initconst pkg_temp_thermal_ids[] = { ... };
      MODULE_DEVICE_TABLE(x86cpu, pkg_temp_thermal_ids);
      
      Error from clang:
      drivers/thermal/x86_pkg_temp_thermal.c:577:1: error: variable has
            incomplete type 'const struct x86cpu_device_id'
      MODULE_DEVICE_TABLE(x86cpu, pkg_temp_thermal_ids);
      ^
      include/linux/module.h:145:3: note: expanded from macro
            'MODULE_DEVICE_TABLE'
        MODULE_GENERIC_TABLE(type##_device, name)
        ^
      include/linux/module.h:87:32: note: expanded from macro
            'MODULE_GENERIC_TABLE'
      extern const struct gtype##_id __mod_##gtype##_table            \
                                     ^
      <scratch space>:143:1: note: expanded from here
      __mod_x86cpu_device_table
      ^
      drivers/thermal/x86_pkg_temp_thermal.c:577:1: note: forward declaration of
            'struct x86cpu_device_id'
      include/linux/module.h:145:3: note: expanded from macro
            'MODULE_DEVICE_TABLE'
        MODULE_GENERIC_TABLE(type##_device, name)
        ^
      include/linux/module.h:87:21: note: expanded from macro
            'MODULE_GENERIC_TABLE'
      extern const struct gtype##_id __mod_##gtype##_table            \
                          ^
      <scratch space>:141:1: note: expanded from here
      x86cpu_device_id
      ^
      1 error generated.
      Signed-off-by: NBehan Webster <behanw@converseincode.com>
      Signed-off-by: NJan-Simon Möller <dl9pf@gmx.de>
      Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c4586256
    • M
      LLVMLinux: Add support for clang to compiler.h and new compiler-clang.h · 565cbdc2
      Mark Charlebois 提交于
      Add a compiler-clang.h file to add specific macros needed for compiling the
      kernel with clang.
      
      Initially the only override required is the macro for silencing the
      compiler for a purposefully uninintialized variable.
      
      Author: Mark Charlebois <charlebm@gmail.com>
      Signed-off-by: NMark Charlebois <charlebm@gmail.com>
      Signed-off-by: NBehan Webster <behanw@converseincode.com>
      565cbdc2
  10. 09 4月, 2014 5 次提交
  11. 08 4月, 2014 7 次提交
    • A
      netfilter: nf_conntrack: flush net_gre->keymap_list only from gre helper · 8142b227
      Andrey Vagin 提交于
      nf_ct_gre_keymap_flush() removes a nf_ct_gre_keymap object from
      net_gre->keymap_list and frees the object. But it doesn't clean
      a reference on this object from ct_pptp_info->keymap[dir].
      Then nf_ct_gre_keymap_destroy() may release the same object again.
      
      So nf_ct_gre_keymap_flush() can be called only when we are sure that
      when nf_ct_gre_keymap_destroy will not be called.
      
      nf_ct_gre_keymap is created by nf_ct_gre_keymap_add() and the right way
      to destroy it is to call nf_ct_gre_keymap_destroy().
      
      This patch marks nf_ct_gre_keymap_flush() as static, so this patch can
      break compilation of third party modules, which use
      nf_ct_gre_keymap_flush. I'm not sure this is the right way to deprecate
      this function.
      
      [  226.540793] general protection fault: 0000 [#1] SMP
      [  226.541750] Modules linked in: nf_nat_pptp nf_nat_proto_gre
      nf_conntrack_pptp nf_conntrack_proto_gre ip_gre ip_tunnel gre
      ppp_deflate bsd_comp ppp_async crc_ccitt ppp_generic slhc xt_nat
      iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
      nf_conntrack veth tun bridge stp llc ppdev microcode joydev pcspkr
      serio_raw virtio_console virtio_balloon floppy parport_pc parport
      pvpanic i2c_piix4 virtio_net drm_kms_helper ttm ata_generic virtio_pci
      virtio_ring virtio drm i2c_core pata_acpi [last unloaded: ip_tunnel]
      [  226.541776] CPU: 0 PID: 49 Comm: kworker/u4:2 Not tainted 3.14.0-rc8+ #101
      [  226.541776] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  226.541776] Workqueue: netns cleanup_net
      [  226.541776] task: ffff8800371e0000 ti: ffff88003730c000 task.ti: ffff88003730c000
      [  226.541776] RIP: 0010:[<ffffffff81389ba9>]  [<ffffffff81389ba9>] __list_del_entry+0x29/0xd0
      [  226.541776] RSP: 0018:ffff88003730dbd0  EFLAGS: 00010a83
      [  226.541776] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8800374e6c40 RCX: dead000000200200
      [  226.541776] RDX: 6b6b6b6b6b6b6b6b RSI: ffff8800371e07d0 RDI: ffff8800374e6c40
      [  226.541776] RBP: ffff88003730dbd0 R08: 0000000000000000 R09: 0000000000000000
      [  226.541776] R10: 0000000000000001 R11: ffff88003730d92e R12: 0000000000000002
      [  226.541776] R13: ffff88007a4c42d0 R14: ffff88007aef0000 R15: ffff880036cf0018
      [  226.541776] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
      [  226.541776] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  226.541776] CR2: 00007f07f643f7d0 CR3: 0000000036fd2000 CR4: 00000000000006f0
      [  226.541776] Stack:
      [  226.541776]  ffff88003730dbe8 ffffffff81389c5d ffff8800374ffbe4 ffff88003730dc28
      [  226.541776]  ffffffffa0162a43 ffffffffa01627c5 ffff88007a4c42d0 ffff88007aef0000
      [  226.541776]  ffffffffa01651c0 ffff88007a4c45e0 ffff88007aef0000 ffff88003730dc40
      [  226.541776] Call Trace:
      [  226.541776]  [<ffffffff81389c5d>] list_del+0xd/0x30
      [  226.541776]  [<ffffffffa0162a43>] nf_ct_gre_keymap_destroy+0x283/0x2d0 [nf_conntrack_proto_gre]
      [  226.541776]  [<ffffffffa01627c5>] ? nf_ct_gre_keymap_destroy+0x5/0x2d0 [nf_conntrack_proto_gre]
      [  226.541776]  [<ffffffffa0162ab7>] gre_destroy+0x27/0x70 [nf_conntrack_proto_gre]
      [  226.541776]  [<ffffffffa0117de3>] destroy_conntrack+0x83/0x200 [nf_conntrack]
      [  226.541776]  [<ffffffffa0117d87>] ? destroy_conntrack+0x27/0x200 [nf_conntrack]
      [  226.541776]  [<ffffffffa0117d60>] ? nf_conntrack_hash_check_insert+0x2e0/0x2e0 [nf_conntrack]
      [  226.541776]  [<ffffffff81630142>] nf_conntrack_destroy+0x72/0x180
      [  226.541776]  [<ffffffff816300d5>] ? nf_conntrack_destroy+0x5/0x180
      [  226.541776]  [<ffffffffa011ef80>] ? kill_l3proto+0x20/0x20 [nf_conntrack]
      [  226.541776]  [<ffffffffa011847e>] nf_ct_iterate_cleanup+0x14e/0x170 [nf_conntrack]
      [  226.541776]  [<ffffffffa011f74b>] nf_ct_l4proto_pernet_unregister+0x5b/0x90 [nf_conntrack]
      [  226.541776]  [<ffffffffa0162409>] proto_gre_net_exit+0x19/0x30 [nf_conntrack_proto_gre]
      [  226.541776]  [<ffffffff815edf89>] ops_exit_list.isra.1+0x39/0x60
      [  226.541776]  [<ffffffff815eecc0>] cleanup_net+0x100/0x1d0
      [  226.541776]  [<ffffffff810a608a>] process_one_work+0x1ea/0x4f0
      [  226.541776]  [<ffffffff810a6028>] ? process_one_work+0x188/0x4f0
      [  226.541776]  [<ffffffff810a64ab>] worker_thread+0x11b/0x3a0
      [  226.541776]  [<ffffffff810a6390>] ? process_one_work+0x4f0/0x4f0
      [  226.541776]  [<ffffffff810af42d>] kthread+0xed/0x110
      [  226.541776]  [<ffffffff8173d4dc>] ? _raw_spin_unlock_irq+0x2c/0x40
      [  226.541776]  [<ffffffff810af340>] ? kthread_create_on_node+0x200/0x200
      [  226.541776]  [<ffffffff8174747c>] ret_from_fork+0x7c/0xb0
      [  226.541776]  [<ffffffff810af340>] ? kthread_create_on_node+0x200/0x200
      [  226.541776] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de
      48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48
      39 c8 74 7a <4c> 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89
      42 08
      [  226.541776] RIP  [<ffffffff81389ba9>] __list_del_entry+0x29/0xd0
      [  226.541776]  RSP <ffff88003730dbd0>
      [  226.612193] ---[ end trace 985ae23ddfcc357c ]---
      
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      8142b227
    • J
      lglock: map to spinlock when !CONFIG_SMP · 64b47e8f
      Josh Triplett 提交于
      When the system has only one CPU, lglock is effectively a spinlock; map
      it directly to spinlock to eliminate the indirection and duplicate code.
      
      In addition to removing overhead, this drops 1.6k of code with a
      defconfig modified to have !CONFIG_SMP, and 1.1k with a minimal config.
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64b47e8f
    • C
      percpu: add preemption checks to __this_cpu ops · 188a8140
      Christoph Lameter 提交于
      We define a check function in order to avoid trouble with the include
      files.  Then the higher level __this_cpu macros are modified to invoke
      the preemption check.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Tested-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      188a8140
    • C
      vmstat: use raw_cpu_ops to avoid false positives on preemption checks · 293b6a4c
      Christoph Lameter 提交于
      vm counters are allowed to be racy.  Use raw_cpu_ops to avoid the
      local_irq_disable overhead and to avoid preemption checks which will be
      added to the __this_cpu operations.
      
      [akpm@linux-foundation.org: Add comment.  Again.]
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Reported-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      293b6a4c
    • C
      mm: use raw_cpu ops for determining current NUMA node · dc322a99
      Christoph Lameter 提交于
      With the preempt checking logic for __this_cpu_ops we will get false
      positives from locations in the code that use numa_node_id.
      
      Before the __this_cpu ops where introduced there were no checks for
      preemption present either.  smp_raw_processor_id() was used.  See
      
        http://www.spinics.net/lists/linux-numa/msg00641.html
      
      Therefore we need to use raw_cpu_read here to avoid false postives.
      
      Note that this issue has been discussed in prior years.  If the process
      changes nodes after retrieving the current numa node then that is
      acceptable since most uses of numa_node etc are for optimization and not
      for correctness.
      
      There were suggestions to implement a raw_numa_node_id in order to do
      preempt checks for numa_node_id as well.  But I think we better defer
      that to another patch since that would mean investigating how
      numa_node_id() is used throughout the kernel which would increase the
      scope of this patchset significantly.  After all preemption was never
      checked before when numa_node_id() was used.
      
      Some sample traces:
      
      __this_cpu_read operation in preemptible [00000000] code: login/1456
      caller is __this_cpu_preempt_check+0x2b/0x2d
      CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3b-dirty #185
      Call Trace:
        dump_stack+0x4e/0x82
        check_preemption_disabled+0xc5/0xe0
        __this_cpu_preempt_check+0x2b/0x2d
        get_task_policy+0x1d/0x49
        get_vma_policy+0x14/0x76
        alloc_pages_vma+0x35/0xff
        handle_mm_fault+0x290/0x73b
        __do_page_fault+0x3fe/0x44d
        do_page_fault+0x9/0xc
        page_fault+0x22/0x30
        generic_file_aio_read+0x38e/0x624
        do_sync_read+0x54/0x73
        vfs_read+0x9d/0x12a
        SyS_read+0x47/0x7e
        cstar_dispatch+0x7/0x23
      
      caller is __this_cpu_preempt_check+0x2b/0x2d
      CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3b-dirty #185
      Call Trace:
        dump_stack+0x4e/0x82
        check_preemption_disabled+0xc5/0xe0
        __this_cpu_preempt_check+0x2b/0x2d
        alloc_pages_current+0x8f/0xbc
        __page_cache_alloc+0xb/0xd
        __do_page_cache_readahead+0xf4/0x219
        ra_submit+0x1c/0x20
        ondemand_readahead+0x28c/0x2b4
        page_cache_sync_readahead+0x38/0x3a
        generic_file_aio_read+0x261/0x624
        do_sync_read+0x54/0x73
        vfs_read+0x9d/0x12a
        SyS_read+0x47/0x7e
        cstar_dispatch+0x7/0x23
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Alex Shi <alex.shi@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc322a99
    • C
      percpu: add raw_cpu_ops · b3ca1c10
      Christoph Lameter 提交于
      The kernel has never been audited to ensure that this_cpu operations are
      consistently used throughout the kernel.  The code generated in many
      places can be improved through the use of this_cpu operations (which
      uses a segment register for relocation of per cpu offsets instead of
      performing address calculations).
      
      The patch set also addresses various consistency issues in general with
      the per cpu macros.
      
      A. The semantics of __this_cpu_ptr() differs from this_cpu_ptr only
         because checks are skipped. This is typically shown through a raw_
         prefix. So this patch set changes the places where __this_cpu_ptr()
         is used to raw_cpu_ptr().
      
      B. There has been the long term wish by some that __this_cpu operations
         would check for preemption. However, there are cases where preemption
         checks need to be skipped. This patch set adds raw_cpu operations that
         do not check for preemption and then adds preemption checks to the
         __this_cpu operations.
      
      C. The use of __get_cpu_var is always a reference to a percpu variable
         that can also be handled via a this_cpu operation. This patch set
         replaces all uses of __get_cpu_var with this_cpu operations.
      
      D. We can then use this_cpu RMW operations in various places replacing
         sequences of instructions by a single one.
      
      E. The use of this_cpu operations throughout will allow other arches than
         x86 to implement optimized references and RMV operations to work with
         per cpu local data.
      
      F. The use of this_cpu operations opens up the possibility to
         further optimize code that relies on synchronization through
         per cpu data.
      
      The patch set works in a couple of stages:
      
      I. Patch 1 adds the additional raw_cpu operations and raw_cpu_ptr().
          Also converts the existing __this_cpu_xx_# primitive in the x86
          code to raw_cpu_xx_#.
      
      II. Patch 2-4 use the raw_cpu operations in places that would give
           us false positives once they are enabled.
      
      III. Patch 5 adds preemption checks to __this_cpu operations to allow
          checking if preemption is properly disabled when these functions
          are used.
      
      IV. Patches 6-20 are patches that simply replace uses of __get_cpu_var
         with this_cpu_ptr. They do not depend on any changes to the percpu
         code. No preemption tests are skipped if they are applied.
      
      V. Patches 21-46 are conversion patches that use this_cpu operations
         in various kernel subsystems/drivers or arch code.
      
      VI.  Patches 47/48 (not included in this series) remove no longer used
          functions (__this_cpu_ptr and __get_cpu_var).  These should only be
          applied after all the conversion patches have made it and after we
          have done additional passes through the kernel to ensure that none of
          the uses of these functions remain.
      
      This patch (of 46):
      
      The patches following this one will add preemption checks to __this_cpu
      ops so we need to have an alternative way to use this_cpu operations
      without preemption checks.
      
      raw_cpu_ops will be the basis for all other ops since these will be the
      operations that do not implement any checks.
      
      Primitive operations are renamed by this patch from __this_cpu_xxx to
      raw_cpu_xxxx.
      
      Also change the uses of the x86 percpu primitives in preempt.h.
      These depend directly on asm/percpu.h (header #include nesting issue).
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Alex Shi <alex.shi@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bryan Wu <cooloney@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: David Daney <david.daney@cavium.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Hedi Berriche <hedi@sgi.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wim Van Sebroeck <wim@iguana.be>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3ca1c10
    • V
      slub: rework sysfs layout for memcg caches · 9a41707b
      Vladimir Davydov 提交于
      Currently, we try to arrange sysfs entries for memcg caches in the same
      manner as for global caches.  Apart from turning /sys/kernel/slab into a
      mess when there are a lot of kmem-active memcgs created, it actually
      does not work properly - we won't create more than one link to a memcg
      cache in case its parent is merged with another cache.  For instance, if
      A is a root cache merged with another root cache B, we will have the
      following sysfs setup:
      
        X
        A -> X
        B -> X
      
      where X is some unique id (see create_unique_id()).  Now if memcgs M and
      N start to allocate from cache A (or B, which is the same), we will get:
      
        X
        X:M
        X:N
        A -> X
        B -> X
        A:M -> X:M
        A:N -> X:N
      
      Since B is an alias for A, we won't get entries B:M and B:N, which is
      confusing.
      
      It is more logical to have entries for memcg caches under the
      corresponding root cache's sysfs directory.  This would allow us to keep
      sysfs layout clean, and avoid such inconsistencies like one described
      above.
      
      This patch does the trick.  It creates a "cgroup" kset in each root
      cache kobject to keep its children caches there.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a41707b