1. 20 8月, 2015 22 次提交
  2. 17 8月, 2015 1 次提交
  3. 14 8月, 2015 2 次提交
  4. 12 8月, 2015 2 次提交
    • M
      perf/x86/intel/cqm: Do not access cpu_data() from CPU_UP_PREPARE handler · d7a702f0
      Matt Fleming 提交于
      Tony reports that booting his 144-cpu machine with maxcpus=10 triggers
      the following WARN_ON():
      
      [   21.045727] WARNING: CPU: 8 PID: 647 at arch/x86/kernel/cpu/perf_event_intel_cqm.c:1267 intel_cqm_cpu_prepare+0x75/0x90()
      [   21.045744] CPU: 8 PID: 647 Comm: systemd-udevd Not tainted 4.2.0-rc4 #1
      [   21.045745] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0066.R00.1506021730 06/02/2015
      [   21.045747]  0000000000000000 0000000082771b09 ffff880856333ba8 ffffffff81669b67
      [   21.045748]  0000000000000000 0000000000000000 ffff880856333be8 ffffffff8107b02a
      [   21.045750]  ffff88085b789800 ffff88085f68a020 ffffffff819e2470 000000000000000a
      [   21.045750] Call Trace:
      [   21.045757]  [<ffffffff81669b67>] dump_stack+0x45/0x57
      [   21.045759]  [<ffffffff8107b02a>] warn_slowpath_common+0x8a/0xc0
      [   21.045761]  [<ffffffff8107b15a>] warn_slowpath_null+0x1a/0x20
      [   21.045762]  [<ffffffff81036725>] intel_cqm_cpu_prepare+0x75/0x90
      [   21.045764]  [<ffffffff81036872>] intel_cqm_cpu_notifier+0x42/0x160
      [   21.045767]  [<ffffffff8109a33d>] notifier_call_chain+0x4d/0x80
      [   21.045769]  [<ffffffff8109a44e>] __raw_notifier_call_chain+0xe/0x10
      [   21.045770]  [<ffffffff8107b538>] _cpu_up+0xe8/0x190
      [   21.045771]  [<ffffffff8107b65a>] cpu_up+0x7a/0xa0
      [   21.045774]  [<ffffffff8165e920>] cpu_subsys_online+0x40/0x90
      [   21.045777]  [<ffffffff81433b37>] device_online+0x67/0x90
      [   21.045778]  [<ffffffff81433bea>] online_store+0x8a/0xa0
      [   21.045782]  [<ffffffff81430e78>] dev_attr_store+0x18/0x30
      [   21.045785]  [<ffffffff8126b6ba>] sysfs_kf_write+0x3a/0x50
      [   21.045786]  [<ffffffff8126ad40>] kernfs_fop_write+0x120/0x170
      [   21.045789]  [<ffffffff811f0b77>] __vfs_write+0x37/0x100
      [   21.045791]  [<ffffffff811f38b8>] ? __sb_start_write+0x58/0x110
      [   21.045795]  [<ffffffff81296d2d>] ? security_file_permission+0x3d/0xc0
      [   21.045796]  [<ffffffff811f1279>] vfs_write+0xa9/0x190
      [   21.045797]  [<ffffffff811f2075>] SyS_write+0x55/0xc0
      [   21.045800]  [<ffffffff81067300>] ? do_page_fault+0x30/0x80
      [   21.045804]  [<ffffffff816709ae>] entry_SYSCALL_64_fastpath+0x12/0x71
      [   21.045805] ---[ end trace fe228b836d8af405 ]---
      
      The root cause is that CPU_UP_PREPARE is completely the wrong notifier
      action from which to access cpu_data(), because smp_store_cpu_info()
      won't have been executed by the target CPU at that point, which in turn
      means that ->x86_cache_max_rmid and ->x86_cache_occ_scale haven't been
      filled out.
      
      Instead let's invoke our handler from CPU_STARTING and rename it
      appropriately.
      Reported-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ashok Raj <ashok.raj@intel.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vikas Shivappa <vikas.shivappa@intel.com>
      Link: http://lkml.kernel.org/r/1438863163-14083-1-git-send-email-matt@codeblueprint.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d7a702f0
    • P
      perf/x86/intel: Fix memory leak on hot-plug allocation fail · dbc72b7a
      Peter Zijlstra 提交于
      We fail to free the shared_regs allocation if the constraint_list
      allocation fails.
      
      Cure this and be more consistent in NULL-ing the pointers after free.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      dbc72b7a
  5. 10 8月, 2015 1 次提交
    • J
      x86/xen: build "Xen PV" APIC driver for domU as well · fc5fee86
      Jason A. Donenfeld 提交于
      It turns out that a PV domU also requires the "Xen PV" APIC
      driver. Otherwise, the flat driver is used and we get stuck in busy
      loops that never exit, such as in this stack trace:
      
      (gdb) target remote localhost:9999
      Remote debugging using localhost:9999
      __xapic_wait_icr_idle () at ./arch/x86/include/asm/ipi.h:56
      56              while (native_apic_mem_read(APIC_ICR) & APIC_ICR_BUSY)
      (gdb) bt
       #0  __xapic_wait_icr_idle () at ./arch/x86/include/asm/ipi.h:56
       #1  __default_send_IPI_shortcut (shortcut=<optimized out>,
      dest=<optimized out>, vector=<optimized out>) at
      ./arch/x86/include/asm/ipi.h:75
       #2  apic_send_IPI_self (vector=246) at arch/x86/kernel/apic/probe_64.c:54
       #3  0xffffffff81011336 in arch_irq_work_raise () at
      arch/x86/kernel/irq_work.c:47
       #4  0xffffffff8114990c in irq_work_queue (work=0xffff88000fc0e400) at
      kernel/irq_work.c:100
       #5  0xffffffff8110c29d in wake_up_klogd () at kernel/printk/printk.c:2633
       #6  0xffffffff8110ca60 in vprintk_emit (facility=0, level=<optimized
      out>, dict=0x0 <irq_stack_union>, dictlen=<optimized out>,
      fmt=<optimized out>, args=<optimized out>)
          at kernel/printk/printk.c:1778
       #7  0xffffffff816010c8 in printk (fmt=<optimized out>) at
      kernel/printk/printk.c:1868
       #8  0xffffffffc00013ea in ?? ()
       #9  0x0000000000000000 in ?? ()
      
      Mailing-list-thread: https://lkml.org/lkml/2015/8/4/755Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      fc5fee86
  6. 08 8月, 2015 2 次提交
  7. 07 8月, 2015 2 次提交
    • H
      KVM: x86: Use adjustment in guest cycles when handling MSR_IA32_TSC_ADJUST · d7add054
      Haozhong Zhang 提交于
      When kvm_set_msr_common() handles a guest's write to
      MSR_IA32_TSC_ADJUST, it will calcuate an adjustment based on the data
      written by guest and then use it to adjust TSC offset by calling a
      call-back adjust_tsc_offset(). The 3rd parameter of adjust_tsc_offset()
      indicates whether the adjustment is in host TSC cycles or in guest TSC
      cycles. If SVM TSC scaling is enabled, adjust_tsc_offset()
      [i.e. svm_adjust_tsc_offset()] will first scale the adjustment;
      otherwise, it will just use the unscaled one. As the MSR write here
      comes from the guest, the adjustment is in guest TSC cycles. However,
      the current kvm_set_msr_common() uses it as a value in host TSC
      cycles (by using true as the 3rd parameter of adjust_tsc_offset()),
      which can result in an incorrect adjustment of TSC offset if SVM TSC
      scaling is enabled. This patch fixes this problem.
      Signed-off-by: NHaozhong Zhang <haozhong.zhang@intel.com>
      Cc: stable@vger.linux.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d7add054
    • P
      KVM: x86: zero IDT limit on entry to SMM · 18c3626e
      Paolo Bonzini 提交于
      The recent BlackHat 2015 presentation "The Memory Sinkhole"
      mentions that the IDT limit is zeroed on entry to SMM.
      
      This is not documented, and must have changed some time after 2010
      (see http://www.ssi.gouv.fr/uploads/IMG/pdf/IT_Defense_2010_final.pdf).
      KVM was not doing it, but the fix is easy.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      18c3626e
  8. 05 8月, 2015 1 次提交
  9. 31 7月, 2015 5 次提交
    • A
      x86/ldt: Make modify_ldt synchronous · 37868fe1
      Andy Lutomirski 提交于
      modify_ldt() has questionable locking and does not synchronize
      threads.  Improve it: redesign the locking and synchronize all
      threads' LDTs using an IPI on all modifications.
      
      This will dramatically slow down modify_ldt in multithreaded
      programs, but there shouldn't be any multithreaded programs that
      care about modify_ldt's performance in the first place.
      
      This fixes some fallout from the CVE-2015-5157 fixes.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Reviewed-by: NBorislav Petkov <bp@suse.de>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: security@kernel.org <security@kernel.org>
      Cc: <stable@vger.kernel.org>
      Cc: xen-devel <xen-devel@lists.xen.org>
      Link: http://lkml.kernel.org/r/4c6978476782160600471bd865b318db34c7b628.1438291540.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      37868fe1
    • A
      x86/xen: Probe target addresses in set_aliased_prot() before the hypercall · aa1acff3
      Andy Lutomirski 提交于
      The update_va_mapping hypercall can fail if the VA isn't present
      in the guest's page tables.  Under certain loads, this can
      result in an OOPS when the target address is in unpopulated vmap
      space.
      
      While we're at it, add comments to help explain what's going on.
      
      This isn't a great long-term fix.  This code should probably be
      changed to use something like set_memory_ro.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Vrabel <dvrabel@cantab.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: security@kernel.org <security@kernel.org>
      Cc: <stable@vger.kernel.org>
      Cc: xen-devel <xen-devel@lists.xen.org>
      Link: http://lkml.kernel.org/r/0b0e55b995cda11e7829f140b833ef932fcabe3a.1438291540.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      aa1acff3
    • J
      x86/irq: Use the caller provided polarity setting in mp_check_pin_attr() · 646c4b75
      Jiang Liu 提交于
      Commit d32932d0 ("x86/irq: Convert IOAPIC to use hierarchical
      irqdomain interfaces") introduced a regression which causes
      malfunction of interrupt lines.
      
      The reason is that the conversion of mp_check_pin_attr() missed to
      update the polarity selection of the interrupt pin with the caller
      provided setting and instead uses a stale attribute value. That in
      turn results in chosing the wrong interrupt flow handler.
      
      Use the caller supplied setting to configure the pin correctly which
      also choses the correct interrupt flow handler.
      
      This restores the original behaviour and on the affected
      machine/driver (Surface Pro 3, i2c controller) all IOAPIC IRQ
      configuration are identical to v4.1.
      
      Fixes: d32932d0 ("x86/irq: Convert IOAPIC to use hierarchical irqdomain interfaces")
      Reported-and-tested-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Reported-and-tested-by: NChen Yu <yu.c.chen@intel.com>
      Signed-off-by: NJiang Liu <jiang.liu@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Yu <yu.c.chen@intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1438242695-23531-1-git-send-email-jiang.liu@linux.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      646c4b75
    • R
      efi: Check for NULL efi kernel parameters · 9115c758
      Ricardo Neri 提交于
      Even though it is documented how to specifiy efi parameters, it is
      possible to cause a kernel panic due to a dereference of a NULL pointer when
      parsing such parameters if "efi" alone is given:
      
      PANIC: early exception 0e rip 10:ffffffff812fb361 error 0 cr2 0
      [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.2.0-rc1+ #450
      [ 0.000000]  ffffffff81fe20a9 ffffffff81e03d50 ffffffff8184bb0f 00000000000003f8
      [ 0.000000]  0000000000000000 ffffffff81e03e08 ffffffff81f371a1 64656c62616e6520
      [ 0.000000]  0000000000000069 000000000000005f 0000000000000000 0000000000000000
      [ 0.000000] Call Trace:
      [ 0.000000]  [<ffffffff8184bb0f>] dump_stack+0x45/0x57
      [ 0.000000]  [<ffffffff81f371a1>] early_idt_handler_common+0x81/0xae
      [ 0.000000]  [<ffffffff812fb361>] ? parse_option_str+0x11/0x90
      [ 0.000000]  [<ffffffff81f4dd69>] arch_parse_efi_cmdline+0x15/0x42
      [ 0.000000]  [<ffffffff81f376e1>] do_early_param+0x50/0x8a
      [ 0.000000]  [<ffffffff8106b1b3>] parse_args+0x1e3/0x400
      [ 0.000000]  [<ffffffff81f37a43>] parse_early_options+0x24/0x28
      [ 0.000000]  [<ffffffff81f37691>] ? loglevel+0x31/0x31
      [ 0.000000]  [<ffffffff81f37a78>] parse_early_param+0x31/0x3d
      [ 0.000000]  [<ffffffff81f3ae98>] setup_arch+0x2de/0xc08
      [ 0.000000]  [<ffffffff8109629a>] ? vprintk_default+0x1a/0x20
      [ 0.000000]  [<ffffffff81f37b20>] start_kernel+0x90/0x423
      [ 0.000000]  [<ffffffff81f37495>] x86_64_start_reservations+0x2a/0x2c
      [ 0.000000]  [<ffffffff81f37582>] x86_64_start_kernel+0xeb/0xef
      [ 0.000000] RIP 0xffffffff81ba2efc
      
      This panic is not reproducible with "efi=" as this will result in a non-NULL
      zero-length string.
      
      Thus, verify that the pointer to the parameter string is not NULL. This is
      consistent with other parameter-parsing functions which check for NULL pointers.
      Signed-off-by: NRicardo Neri <ricardo.neri-calderon@linux.intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      9115c758
    • D
      x86/efi: Use all 64 bit of efi_memmap in setup_e820() · 7cc03e48
      Dmitry Skorodumov 提交于
      The efi_info structure stores low 32 bits of memory map
      in efi_memmap and high 32 bits in efi_memmap_hi.
      
      While constructing pointer in the setup_e820(), need
      to take into account all 64 bit of the pointer.
      
      It is because on 64bit machine the function
      efi_get_memory_map() may return full 64bit pointer and before
      the patch that pointer was truncated.
      
      The issue is triggered on Parallles virtual machine and
      fixed with this patch.
      Signed-off-by: NDmitry Skorodumov <sdmitry@parallels.com>
      Cc: Denis V. Lunev <den@openvz.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      7cc03e48
  10. 30 7月, 2015 1 次提交
    • D
      ebpf, x86: fix general protection fault when tail call is invoked · 2482abb9
      Daniel Borkmann 提交于
      With eBPF JIT compiler enabled on x86_64, I was able to reliably trigger
      the following general protection fault out of an eBPF program with a simple
      tail call, f.e. tracex5 (or a stripped down version of it):
      
        [  927.097918] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
        [...]
        [  927.100870] task: ffff8801f228b780 ti: ffff880016a64000 task.ti: ffff880016a64000
        [  927.102096] RIP: 0010:[<ffffffffa002440d>]  [<ffffffffa002440d>] 0xffffffffa002440d
        [  927.103390] RSP: 0018:ffff880016a67a68  EFLAGS: 00010006
        [  927.104683] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: 0000000000000001
        [  927.105921] RDX: 0000000000000000 RSI: ffff88014e438000 RDI: ffff880016a67e00
        [  927.107137] RBP: ffff880016a67c90 R08: 0000000000000000 R09: 0000000000000001
        [  927.108351] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880016a67e00
        [  927.109567] R13: 0000000000000000 R14: ffff88026500e460 R15: ffff880220a81520
        [  927.110787] FS:  00007fe7d5c1f740(0000) GS:ffff880265000000(0000) knlGS:0000000000000000
        [  927.112021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  927.113255] CR2: 0000003e7bbb91a0 CR3: 000000006e04b000 CR4: 00000000001407e0
        [  927.114500] Stack:
        [  927.115737]  ffffc90008cdb000 ffff880016a67e00 ffff88026500e460 ffff880220a81520
        [  927.117005]  0000000100000000 000000000000001b ffff880016a67aa8 ffffffff8106c548
        [  927.118276]  00007ffcdaf22e58 0000000000000000 0000000000000000 ffff880016a67ff0
        [  927.119543] Call Trace:
        [  927.120797]  [<ffffffff8106c548>] ? lookup_address+0x28/0x30
        [  927.122058]  [<ffffffff8113d176>] ? __module_text_address+0x16/0x70
        [  927.123314]  [<ffffffff8117bf0e>] ? is_ftrace_trampoline+0x3e/0x70
        [  927.124562]  [<ffffffff810c1a0f>] ? __kernel_text_address+0x5f/0x80
        [  927.125806]  [<ffffffff8102086f>] ? print_context_stack+0x7f/0xf0
        [  927.127033]  [<ffffffff810f7852>] ? __lock_acquire+0x572/0x2050
        [  927.128254]  [<ffffffff810f7852>] ? __lock_acquire+0x572/0x2050
        [  927.129461]  [<ffffffff8119edfa>] ? trace_call_bpf+0x3a/0x140
        [  927.130654]  [<ffffffff8119ee4a>] trace_call_bpf+0x8a/0x140
        [  927.131837]  [<ffffffff8119edfa>] ? trace_call_bpf+0x3a/0x140
        [  927.133015]  [<ffffffff8119f008>] kprobe_perf_func+0x28/0x220
        [  927.134195]  [<ffffffff811a1668>] kprobe_dispatcher+0x38/0x60
        [  927.135367]  [<ffffffff81174b91>] ? seccomp_phase1+0x1/0x230
        [  927.136523]  [<ffffffff81061400>] kprobe_ftrace_handler+0xf0/0x150
        [  927.137666]  [<ffffffff81174b95>] ? seccomp_phase1+0x5/0x230
        [  927.138802]  [<ffffffff8117950c>] ftrace_ops_recurs_func+0x5c/0xb0
        [  927.139934]  [<ffffffffa022b0d5>] 0xffffffffa022b0d5
        [  927.141066]  [<ffffffff81174b91>] ? seccomp_phase1+0x1/0x230
        [  927.142199]  [<ffffffff81174b95>] seccomp_phase1+0x5/0x230
        [  927.143323]  [<ffffffff8102c0a4>] syscall_trace_enter_phase1+0xc4/0x150
        [  927.144450]  [<ffffffff81174b95>] ? seccomp_phase1+0x5/0x230
        [  927.145572]  [<ffffffff8102c0a4>] ? syscall_trace_enter_phase1+0xc4/0x150
        [  927.146666]  [<ffffffff817f9a9f>] tracesys+0xd/0x44
        [  927.147723] Code: 48 8b 46 10 48 39 d0 76 2c 8b 85 fc fd ff ff 83 f8 20 77 21 83
                             c0 01 89 85 fc fd ff ff 48 8d 44 d6 80 48 8b 00 48 83 f8 00 74
                             0a <48> 8b 40 20 48 83 c0 33 ff e0 48 89 d8 48 8b 9d d8 fd ff
                             ff 4c
        [  927.150046] RIP  [<ffffffffa002440d>] 0xffffffffa002440d
      
      The code section with the instructions that traps points into the eBPF JIT
      image of the root program (the one invoking the tail call instruction).
      
      Using bpf_jit_disasm -o on the eBPF root program image:
      
        [...]
        4e:   mov    -0x204(%rbp),%eax
              8b 85 fc fd ff ff
        54:   cmp    $0x20,%eax               <--- if (tail_call_cnt > MAX_TAIL_CALL_CNT)
              83 f8 20
        57:   ja     0x000000000000007a
              77 21
        59:   add    $0x1,%eax                <--- tail_call_cnt++
              83 c0 01
        5c:   mov    %eax,-0x204(%rbp)
              89 85 fc fd ff ff
        62:   lea    -0x80(%rsi,%rdx,8),%rax  <--- prog = array->prog[index]
              48 8d 44 d6 80
        67:   mov    (%rax),%rax
              48 8b 00
        6a:   cmp    $0x0,%rax                <--- check for NULL
              48 83 f8 00
        6e:   je     0x000000000000007a
              74 0a
        70:   mov    0x20(%rax),%rax          <--- GPF triggered here! fetch of bpf_func
              48 8b 40 20                              [ matches <48> 8b 40 20 ... from above ]
        74:   add    $0x33,%rax               <--- prologue skip of new prog
              48 83 c0 33
        78:   jmpq   *%rax                    <--- jump to new prog insns
              ff e0
        [...]
      
      The problem is that rax has 5a5a5a5a5a5a5a5a, which suggests a tail call
      jump to map slot 0 is pointing to a poisoned page. The issue is the following:
      
      lea instruction has a wrong offset, i.e. it should be ...
      
        lea    0x80(%rsi,%rdx,8),%rax
      
      ... but it actually seems to be ...
      
        lea   -0x80(%rsi,%rdx,8),%rax
      
      ... where 0x80 is offsetof(struct bpf_array, prog), thus the offset needs
      to be positive instead of negative. Disassembling the interpreter, we btw
      similarly do:
      
        [...]
        c88:  lea     0x80(%rax,%rdx,8),%rax  <--- prog = array->prog[index]
              48 8d 84 d0 80 00 00 00
        c90:  add     $0x1,%r13d
              41 83 c5 01
        c94:  mov     (%rax),%rax
              48 8b 00
        [...]
      
      Now the other interesting fact is that this panic triggers only when things
      like CONFIG_LOCKDEP are being used. In that case offsetof(struct bpf_array,
      prog) starts at offset 0x80 and in non-CONFIG_LOCKDEP case at offset 0x50.
      Reason is that the work_struct inside struct bpf_map grows by 48 bytes in my
      case due to the lockdep_map member (which also has CONFIG_LOCK_STAT enabled
      members).
      
      Changing the emitter to always use the 4 byte displacement in the lea
      instruction fixes the panic on my side. It increases the tail call instruction
      emission by 3 more byte, but it should cover us from various combinations
      (and perhaps other future increases on related structures).
      
      After patch, disassembly:
      
        [...]
        9e:   lea    0x80(%rsi,%rdx,8),%rax   <--- CONFIG_LOCKDEP/CONFIG_LOCK_STAT
              48 8d 84 d6 80 00 00 00
        a6:   mov    (%rax),%rax
              48 8b 00
        [...]
      
        [...]
        9e:   lea    0x50(%rsi,%rdx,8),%rax   <--- No CONFIG_LOCKDEP
              48 8d 84 d6 50 00 00 00
        a6:   mov    (%rax),%rax
              48 8b 00
        [...]
      
      Fixes: b52f00e6 ("x86: bpf_jit: implement bpf_tail_call() helper")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2482abb9
  11. 26 7月, 2015 1 次提交