1. 20 Mar 2018: 1 commit
  2. 09 Mar 2018: 5 commits
    • arm64: signal: Ensure si_code is valid for all fault signals · af40ff68
      Committed by Dave Martin
      Currently, as reported by Eric, an invalid si_code value 0 is
      passed in many signals delivered to userspace in response to faults
      and other kernel errors.  Typically 0 is passed when the fault is
      insufficiently diagnosable or when there does not appear to be any
      sensible alternative value to choose.
      
      This appears to violate POSIX, and is intuitively wrong for at
      least two reasons arising from the fact that 0 == SI_USER:
      
       1) si_code is a union selector, and SI_USER (and si_code <= 0 in
          general) implies the existence of a different set of fields
          (siginfo._kill) from that which exists for a fault signal
          (siginfo._sigfault).  However, the code raising the signal
          typically writes only the _sigfault fields, and the _kill
          fields make no sense in this case.
      
          Thus when userspace sees si_code == 0 (SI_USER) it may
          legitimately inspect fields in the inactive union member _kill
          and obtain garbage as a result.
      
          There appears to be software in the wild relying on this,
          albeit generally only for printing diagnostic messages.
      
       2) Software that wants to be robust against spurious signals may
          discard signals where si_code == SI_USER (or <= 0), or may
          filter such signals based on the si_uid and si_pid fields of
          siginfo._kill (see the sketch below).  In the case of fault
          signals, this means that important (and usually fatal) error
          conditions may be silently ignored.
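      
      To illustrate point (2), a minimal userspace sketch of such a
      defensive check (hypothetical handler, not part of this patch):
      
        #include <signal.h>
        #include <stdio.h>
        
        /* Pick the active siginfo union member based on si_code before
         * touching its fields (printf used for brevity; it is not
         * async-signal-safe). */
        static void handler(int sig, siginfo_t *si, void *uc)
        {
                (void)uc;
                if (si->si_code <= 0)
                        /* SI_USER and friends: siginfo._kill is active. */
                        printf("sig %d from pid %ld uid %ld\n", sig,
                               (long)si->si_pid, (long)si->si_uid);
                else
                        /* Kernel-generated fault: siginfo._sigfault is
                         * active, si_addr is valid. */
                        printf("sig %d at address %p\n", sig, si->si_addr);
        }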
      
      In practice, many of the faults for which arm64 passes si_code == 0
      are undiagnosable conditions such as exceptions with syndrome
      values in ESR_ELx to which the architecture does not yet assign any
      meaning, or conditions indicative of a bug or error in the kernel
      or system and thus that are unrecoverable and should never occur in
      normal operation.
      
      The approach taken in this patch is to translate all such
      undiagnosable or "impossible" synchronous fault conditions to
      SIGKILL, since these are at least probably localisable to a single
      process.  Some of these conditions should really result in a kernel
      panic, but due to the lack of diagnostic information it is
      difficult to be certain: this patch does not add any calls to
      panic(), but this could change later if justified.
      
      Although si_code will not reach userspace in the case of SIGKILL,
      it is still desirable to pass a nonzero value so that the common
      siginfo handling code can detect incorrect use of si_code == 0
      without false positives.  In this case the si_code dependent
      siginfo fields will not be correctly initialised, but since they
      are not passed to userspace I deem this not to matter.
      
      A few faults can reasonably occur in realistic userspace scenarios,
      and _should_ raise a regular, handleable (but perhaps not
      ignorable/blockable) signal: for these, this patch attempts to
      choose a suitable standard si_code value for the raised signal in
      each case instead of 0.
      
      arm64 was the only arch to define a BUS_FIXME code, so after this
      patch nobody defines it.  This patch therefore also removes the
      relevant code from siginfo_layout().
      
      Cc: James Morse <james.morse@arm.com>
      Reported-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Dave Martin <Dave.Martin@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64: Add support for new control bits CTR_EL0.DIC and CTR_EL0.IDC · 6ae4b6e0
      Committed by Shanker Donthineni
      The DCache clean and ICache invalidation requirements for instructions
      to be coherent with data are discoverable through new fields in
      CTR_EL0. The two control bits DIC and IDC were defined for this
      purpose. There is no need to perform point-of-unification cache
      maintenance operations from software on systems where the CPU caches
      are transparent.
      
      This patch optimizes the three functions __flush_cache_user_range(),
      clean_dcache_area_pou() and invalidate_icache_range() when the hardware
      reports CTR_EL0.IDC and/or CTR_EL0.DIC. Basically, it skips the two
      instructions 'DC CVAU' and 'IC IVAU', and the associated loop logic, in
      order to avoid the unnecessary overhead.
      
      CTR_EL0.DIC: Instruction cache invalidation requirements for
       instruction to data coherence. This is bit[29]:
        0: Instruction cache invalidation to the point of unification
           is required for instruction to data coherence.
        1: Instruction cache invalidation to the point of unification is
           not required for instruction to data coherence.
      
      CTR_EL0.IDC: Data cache clean requirements for instruction to data
       coherence. This is bit[28]:
        0: Data cache clean to the point of unification is required for
           instruction to data coherence, unless CLIDR_EL1.LoC == 0b000
           or (CLIDR_EL1.LoUIS == 0b000 && CLIDR_EL1.LoUU == 0b000).
        1: Data cache clean to the point of unification is not required
           for instruction to data coherence.
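      
      A minimal sketch of how software can consume these bits (illustrative
      helpers, not the kernel's actual implementation):
      
        #include <stdbool.h>
        #include <stdint.h>
        
        #define CTR_EL0_IDC     (1UL << 28)
        #define CTR_EL0_DIC     (1UL << 29)
        
        static inline uint64_t read_ctr_el0(void)
        {
                uint64_t ctr;
        
                asm volatile("mrs %0, ctr_el0" : "=r" (ctr));
                return ctr;
        }
        
        /* IDC == 1: 'DC CVAU' (D-cache clean to PoU) can be skipped. */
        static bool need_dcache_clean_pou(void)
        {
                return !(read_ctr_el0() & CTR_EL0_IDC);
        }
        
        /* DIC == 1: 'IC IVAU' (I-cache invalidate to PoU) can be skipped. */
        static bool need_icache_inval_pou(void)
        {
                return !(read_ctr_el0() & CTR_EL0_DIC);
        }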
      Co-authored-by: Philip Elcan <pelcan@codeaurora.org>
      Reviewed-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Shanker Donthineni <shankerd@codeaurora.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64/kernel: enable A53 erratum #843419 handling at runtime · ca79acca
      Committed by Ard Biesheuvel
      Omit patching of ADRP instructions at module load time if the current
      CPUs are not susceptible to the erratum.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      [will: Drop duplicate initialisation of .def_scope field]
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64/errata: add REVIDR handling to framework · e8002e02
      Committed by Ard Biesheuvel
      In some cases, core variants that are affected by a certain erratum
      also exist in versions that have the erratum fixed, and this fact is
      recorded in a dedicated bit in system register REVIDR_EL1.
      
      Since the architecture does not require that a certain bit retains
      its meaning across different variants of the same model, each such
      REVIDR bit is tightly coupled to a certain revision/variant value,
      and so we need a list of revidr_mask/midr pairs to carry this
      information.
      
      So add the struct member and the associated macros and handling to
      allow REVIDR fixes to be taken into account.
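      
      A sketch of the idea (struct and function names are illustrative, not
      the kernel's):
      
        struct midr_revidr_fix {
                unsigned int midr;         /* revision/variant it applies to */
                unsigned long revidr_mask; /* REVIDR_EL1 bit(s) for the fix */
        };
        
        static int erratum_fixed(unsigned int midr, unsigned long revidr,
                                 const struct midr_revidr_fix *fixes, int n)
        {
                int i;
        
                for (i = 0; i < n; i++)
                        if (fixes[i].midr == midr &&
                            (revidr & fixes[i].revidr_mask) ==
                                        fixes[i].revidr_mask)
                                return 1; /* this variant has the fix */
                return 0;
        }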
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 · a257e025
      Committed by Ard Biesheuvel
      Working around Cortex-A53 erratum #843419 involves special handling of
      ADRP instructions that end up in the last two instruction slots of a
      4k page, or whose output register gets overwritten without having been
      read. (Note that the latter instruction sequence is never emitted by
      a properly functioning compiler, which is why it is disregarded by the
      handling of the same erratum in the bfd.ld linker which we rely on for
      the core kernel)
      
      Normally, this gets taken care of by the linker, which can spot such
      sequences at final link time, and insert a veneer if the ADRP ends up
      at a vulnerable offset. However, linux kernel modules are partially
      linked ELF objects, and so there is no 'final link time' other than the
      runtime loading of the module, at which time all the static relocations
      are resolved.
      
      For this reason, we have implemented the #843419 workaround for modules
      by avoiding ADRP instructions altogether, by using the large C model,
      and by passing -mpc-relative-literal-loads to recent versions of GCC
      that may emit adrp/ldr pairs to perform literal loads. However, this
      workaround forces us to keep literal data mixed with the instructions
      in the executable .text segment, and literal data may inadvertently
      turn into an exploitable speculative gadget depending on the relative
      offsets of arbitrary symbols.
      
      So let's reimplement this workaround in a way that allows us to switch
      back to the small C model, and to drop the -mpc-relative-literal-loads
      GCC switch, by patching affected ADRP instructions at runtime (see the
      sketch after this list):
      - ADRP instructions that do not appear at 4k relative offset 0xff8 or
        0xffc are ignored
      - ADRP instructions that are within 1 MB of their target symbol are
        converted into ADR instructions
      - remaining ADRP instructions are redirected via a veneer that performs
        the load using an unaffected movn/movk sequence.
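      
      The decision logic boils down to the following sketch (simplified: the
      real patch also rewrites the instruction encodings):
      
        #include <stdint.h>
        
        enum adrp_action { ADRP_KEEP, ADRP_TO_ADR, ADRP_VIA_VENEER };
        
        static enum adrp_action classify_adrp(uint64_t pc, uint64_t target)
        {
                int64_t disp = (int64_t)(target - pc);
        
                /* Only the last two slots of a 4 KB page are affected. */
                if ((pc & 0xfff) != 0xff8 && (pc & 0xfff) != 0xffc)
                        return ADRP_KEEP;
        
                /* ADR reaches +/-1 MB, so nearby targets can use ADR. */
                if (disp >= -(1 << 20) && disp < (1 << 20))
                        return ADRP_TO_ADR;
        
                return ADRP_VIA_VENEER; /* movn/movk veneer */
        }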
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      [will: tidied up ADRP -> ADR instruction patching.]
      [will: use ULL suffix for 64-bit immediate]
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  3. 08 Mar 2018: 2 commits
    • arm64/kernel: kaslr: reduce module randomization range to 4 GB · f2b9ba87
      Committed by Ard Biesheuvel
      We currently have to rely on the GCC large code model for KASLR for
      two distinct but related reasons:
      - if we enable full randomization, modules will be loaded very far away
        from the core kernel, where they are out of range for ADRP instructions,
      - even without full randomization, the fact that the 128 MB module
        region is now no longer fully reserved for kernel modules means there
        is a small but real chance that the normal bottom-up allocation of
        other vmalloc regions collides with it, and uses up the range for
        other things.
      
      Large model code is suboptimal, given that each symbol reference involves
      a literal load that goes through the D-cache, reducing cache utilization.
      But more importantly, literals are not instructions but part of .text
      nonetheless, and hence mapped with executable permissions.
      
      So let's get rid of our dependency on the large model for KASLR, by:
      - reducing the full randomization range to 4 GB, thereby ensuring that
        ADRP references between modules and the kernel are always in range,
      - reducing the spillover range to 4 GB as well, so that we fall back to
        a region that is still guaranteed to be in range,
      - moving the randomization window of the core kernel to the middle of
        the VMALLOC space.
      
      Note that KASAN always uses the module region outside of the vmalloc space,
      so keep the kernel close to that if KASAN is enabled.
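      
      The gist of the range reduction, as a toy sketch (helper name and the
      centring policy are illustrative):
      
        #include <stdint.h>
        
        #define SZ_4G   (1ULL << 32)
        
        /* Keep the module region inside a 4 GB window around the kernel,
         * so ADRP (+/-4 GB reach) can always address core symbols. */
        static uint64_t pick_module_base(uint64_t kernel_base,
                                         uint64_t region_size, uint64_t rnd)
        {
                uint64_t window = SZ_4G - region_size;
        
                return kernel_base - window / 2 + (rnd % window);
        }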
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64: module: don't BUG when exceeding preallocated PLT count · 5e8307b9
      Committed by Ard Biesheuvel
      When PLTs are emitted at relocation time, we really should not exceed
      the number that we counted when parsing the relocation tables, and so
      currently, we BUG() on this condition. However, even though this is a
      clear bug in this particular piece of code, we can easily recover by
      failing to load the module.
      
      So instead, return 0 from module_emit_plt_entry() if this condition
      occurs; 0 is not a valid kernel address, and can hence serve as a
      flag value that makes the relocation routine bail out.
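      
      The recovery pattern looks roughly like this (illustrative structure;
      0 doubles as the "invalid address" flag):
      
        #include <stdint.h>
        
        struct plt_pool {
                uint64_t base;     /* address of first preallocated slot */
                unsigned int used; /* slots handed out so far */
                unsigned int max;  /* slots counted while parsing relocs */
        };
        
        static uint64_t emit_plt_entry(struct plt_pool *p)
        {
                if (p->used >= p->max)
                        return 0; /* not a valid address: caller bails out */
                return p->base + (uint64_t)p->used++ * 16; /* assumed size */
        }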
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  4. 07 Mar 2018: 14 commits
  5. 05 Mar 2018: 4 commits
  6. 23 Feb 2018: 13 commits
    • arm64: fix unwind_frame() for filtered out fn for function graph tracing · 9f416319
      Committed by Pratyush Anand
      do_task_stat() calls get_wchan(), which further does unwind_frame().
      unwind_frame() restores frame->pc to its original value in case the
      function graph tracer has modified a return address (LR) in a stack
      frame to hook a function return. However, if the function graph tracer
      has hit a filtered function, then we can't unwind it, as
      ftrace_push_return_trace() has biased the index (frame->graph) with a
      'huge negative' offset (-FTRACE_NOTRACE_DEPTH).
      
      Moreover, the arm64 stack walker defines the index (frame->graph) as an
      unsigned int, which cannot be compared against a negative number.
      
      A similar problem can occur when walk_stackframe() is called from
      save_stack_trace_tsk() or dump_backtrace().
      
      This patch fixes unwind_frame() to test the index for a negative value
      and restore it accordingly before restoring frame->pc.
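      
      The core of the fix can be sketched as follows (FTRACE_NOTRACE_DEPTH
      value assumed for illustration; simplified from the actual patch):
      
        #define FTRACE_NOTRACE_DEPTH 65536   /* assumed value */
        
        /* frame->graph must be treated as signed for this test to work. */
        static int sanitize_graph_index(int graph)
        {
                if (graph < -1)
                        graph += FTRACE_NOTRACE_DEPTH; /* strip the bias */
                return graph;
        }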
      
      Reproducer:
      
      cd /sys/kernel/debug/tracing/
      echo schedule > set_graph_notrace
      echo 1 > options/display-graph
      echo wakeup > current_tracer
      ps -ef | grep -i agent
      
      The above commands result in:
      Unable to handle kernel paging request at virtual address ffff801bd3d1e000
      pgd = ffff8003cbe97c00
      [ffff801bd3d1e000] *pgd=0000000000000000, *pud=0000000000000000
      Internal error: Oops: 96000006 [#1] SMP
      [...]
      CPU: 5 PID: 11696 Comm: ps Not tainted 4.11.0+ #33
      [...]
      task: ffff8003c21ba000 task.stack: ffff8003cc6c0000
      PC is at unwind_frame+0x12c/0x180
      LR is at get_wchan+0xd4/0x134
      pc : [<ffff00000808892c>] lr : [<ffff0000080860b8>] pstate: 60000145
      sp : ffff8003cc6c3ab0
      x29: ffff8003cc6c3ab0 x28: 0000000000000001
      x27: 0000000000000026 x26: 0000000000000026
      x25: 00000000000012d8 x24: 0000000000000000
      x23: ffff8003c1c04000 x22: ffff000008c83000
      x21: ffff8003c1c00000 x20: 000000000000000f
      x19: ffff8003c1bc0000 x18: 0000fffffc593690
      x17: 0000000000000000 x16: 0000000000000001
      x15: 0000b855670e2b60 x14: 0003e97f22cf1d0f
      x13: 0000000000000001 x12: 0000000000000000
      x11: 00000000e8f4883e x10: 0000000154f47ec8
      x9 : 0000000070f367c0 x8 : 0000000000000000
      x7 : 00008003f7290000 x6 : 0000000000000018
      x5 : 0000000000000000 x4 : ffff8003c1c03cb0
      x3 : ffff8003c1c03ca0 x2 : 00000017ffe80000
      x1 : ffff8003cc6c3af8 x0 : ffff8003d3e9e000
      
      Process ps (pid: 11696, stack limit = 0xffff8003cc6c0000)
      Stack: (0xffff8003cc6c3ab0 to 0xffff8003cc6c4000)
      [...]
      [<ffff00000808892c>] unwind_frame+0x12c/0x180
      [<ffff000008305008>] do_task_stat+0x864/0x870
      [<ffff000008305c44>] proc_tgid_stat+0x3c/0x48
      [<ffff0000082fde0c>] proc_single_show+0x5c/0xb8
      [<ffff0000082b27e0>] seq_read+0x160/0x414
      [<ffff000008289e6c>] __vfs_read+0x58/0x164
      [<ffff00000828b164>] vfs_read+0x88/0x144
      [<ffff00000828c2e8>] SyS_read+0x60/0xc0
      [<ffff0000080834a0>] __sys_trace_return+0x0/0x4
      
      Fixes: 20380bb3 (arm64: ftrace: fix a stack tracer's output under function graph tracer)
      Signed-off-by: Pratyush Anand <panand@redhat.com>
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      [catalin.marinas@arm.com: replace WARN_ON with WARN_ON_ONCE]
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    • x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations · 45967493
      Committed by Samuel Neves
      Without this fix, /proc/cpuinfo will display an incorrect number
      of CPU cores after bringing them offline and back online, as
      exemplified below:
      
        $ cat /proc/cpuinfo | grep cores
        cpu cores	: 4
        cpu cores	: 8
        cpu cores	: 8
        cpu cores	: 20
        cpu cores	: 4
        cpu cores	: 3
        cpu cores	: 2
        cpu cores	: 2
      
      This patch fixes this by always zeroing the booted_cores variable
      upon turning off a logical CPU.
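      
      A sketch of the fix's shape ('booted_cores' comes from the commit text;
      the surrounding structure is illustrative, not the kernel's):
      
        struct cpu_topo {
                int booted_cores;   /* cores brought up in this package */
        };
        
        static void on_cpu_offline(struct cpu_topo *c)
        {
                /* Zero the count so re-onlining recomputes it from the
                 * sibling maps instead of accumulating stale values. */
                c->booted_cores = 0;
        }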
      Tested-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jgross@suse.com
      Cc: luto@kernel.org
      Cc: prarit@redhat.com
      Cc: vkuznets@redhat.com
      Link: http://lkml.kernel.org/r/20180221205036.5244-1-sneves@dei.uc.pt
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/xchg/alpha: Fix xchg() and cmpxchg() memory ordering bugs · 472e8c55
      Committed by Andrea Parri
      Successful RMW operations are supposed to be fully ordered, but
      Alpha's xchg() and cmpxchg() do not meet this requirement.
      
      Will Deacon noticed the bug:
      
        > So MP using xchg:
        >
        > WRITE_ONCE(x, 1)
        > xchg(y, 1)
        >
        > smp_load_acquire(y) == 1
        > READ_ONCE(x) == 0
        >
        > would be allowed.
      
      ... which thus violates the above requirement.
      
      Fix it by adding a leading smp_mb() to the xchg() and cmpxchg() implementations.
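      
      Schematically, the fix gives both primitives a leading barrier in
      addition to the existing trailing one. A sketch of the pattern (the
      real code is Alpha assembly; __arch_xchg_relaxed() is a made-up
      stand-in for the load-locked/store-conditional sequence):
      
        #define xchg(ptr, val)                                          \
        ({                                                              \
                __typeof__(*(ptr)) __ret;                               \
                smp_mb();       /* added: orders prior accesses */      \
                __ret = __arch_xchg_relaxed((ptr), (val));              \
                smp_mb();       /* existing: orders later accesses */   \
                __ret;                                                  \
        })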
      Reported-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-alpha@vger.kernel.org
      Link: http://lkml.kernel.org/r/1519291488-5752-1-git-send-email-parri.andrea@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/xchg/alpha: Clean up barrier usage by using smp_mb() in place of __ASM__MB · 79d44246
      Committed by Andrea Parri
      Replace each occurrence of __ASM__MB with a (trailing) smp_mb() in
      xchg() and cmpxchg(), and remove the now-unused __ASM__MB definitions;
      this improves readability, with no additional synchronization cost.
      Suggested-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-alpha@vger.kernel.org
      Link: http://lkml.kernel.org/r/1519291469-5702-1-git-send-email-parri.andrea@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/intel_rdt: Fix incorrect returned value when creating rdtgroup sub-directory in resctrl file system · 36e74d35
      Committed by Wang Hui

      If no monitoring feature is detected, because all monitoring features
      are disabled at boot time or there is no monitoring feature in the
      hardware, creating an rdtgroup sub-directory with the "mkdir" command
      reports an error:
      
        mkdir: cannot create directory ‘/sys/fs/resctrl/p1’: No such file or directory
      
      But the sub-directory actually is created, and its content is correct:
      
        cpus  cpus_list  schemata  tasks
      
      The error occurs because rdtgroup_mkdir_ctrl_mon() returns a nonzero
      value after the sub-directory is created, and that value is reported
      as an error to the user.
      
      Clear the returned value to report to the user that the sub-directory
      was actually created successfully.
      Signed-off-by: Wang Hui <john.wanghui@huawei.com>
      Signed-off-by: Zhang Yanfei <yanfei.zhang@huawei.com>
      Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi V Shankar <ravi.v.shankar@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vikas <vikas.shivappa@intel.com>
      Cc: Xiaochen Shen <xiaochen.shen@intel.com>
      Link: http://lkml.kernel.org/r/1519356363-133085-1-git-send-email-fenghua.yu@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/apic/vector: Handle vector release on CPU unplug correctly · e84cf6aa
      Committed by Thomas Gleixner
      When an irq vector is replaced, then the previous vector is normally
      released when the first interrupt happens on the new vector. If the target
      CPU of the previous vector is already offline when the new vector is
      installed, then the previous vector is silently discarded, which leads to
      accounting issues causing suspend failures and other problems.
      
      Adjust the logic so that the previous vector is freed in the underlying
      matrix allocator to ensure that the accounting stays correct.
      
      Fixes: 69cde000 ("x86/vector: Use matrix allocator for vector assignment")
      Reported-by: Yuriy Vostrikov <delamonpansie@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Yuriy Vostrikov <delamonpansie@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180222112316.930791749@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • powerpc/powernv: Support firmware disable of RFI flush · eb0a2d26
      Committed by Michael Ellerman
      Some versions of firmware have a setting that can be configured to
      disable the RFI flush; add support for it.
      
      Fixes: 6e032b35 ("powerpc/powernv: Check device-tree for RFI flush settings")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/pseries: Support firmware disable of RFI flush · 582605a4
      Committed by Michael Ellerman
      Some versions of firmware have a setting that can be configured to
      disable the RFI flush; add support for it.
      
      Fixes: 8989d568 ("powerpc/pseries: Query hypervisor for RFI flush settings")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/drmem: Fix unexpected flag value in ibm,dynamic-memory-v2 · 2f7d03e0
      Committed by Bharata B Rao
      Memory addition and removal by count and indexed-count methods
      temporarily mark the LMBs that are being added/removed with a special
      flag value, DRMEM_LMB_RESERVED. Accessing the flags value directly in
      a few places without the proper accessor method causes two unexpected
      side-effects:
      
      - The DRMEM_LMB_RESERVED bit becomes part of the flags word of
        drconf_cell_v2 entries in the ibm,dynamic-memory-v2 DT property.
      - This results in extra drconf_cell entries in ibm,dynamic-memory-v2.
        For example, if 1 GB of memory is added, it leads to one entry for 3
        LMBs and a separate entry for the last LMB, when all 4 LMBs should
        be described by a single entry.
      
      Fix this by always accessing the flags through the accessor method
      drmem_lmb_flags().
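      
      A sketch of the accessor idea (flag value assumed for illustration;
      the real struct has more fields):
      
        #include <stdint.h>
        
        #define DRMEM_LMB_RESERVED 0x80000000u /* assumed internal-only bit */
        
        struct drmem_lmb {
                uint64_t base_addr;
                uint32_t flags;
        };
        
        /* Strip transient bookkeeping flags so they never leak into the
         * device tree. */
        static inline uint32_t drmem_lmb_flags(const struct drmem_lmb *lmb)
        {
                return lmb->flags & ~DRMEM_LMB_RESERVED;
        }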
      
      Fixes: 2b31e3ae ("powerpc/drmem: Add support for ibm, dynamic-memory-v2 property")
      Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • MIPS: boot: Define __ASSEMBLY__ for its.S build · 0f9da844
      Committed by Kees Cook
      The MIPS %.its.S compiler command did not define __ASSEMBLY__, which
      meant that when compiler_types.h was added to kconfig.h, unexpected
      things (e.g. struct declarations) appeared that should not have been
      present. As is done for the general %.S compiler command, __ASSEMBLY__
      is now defined here too.
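      
      For context, headers reached via kconfig.h typically hide C-only
      constructs behind exactly this guard (illustrative fragment):
      
        #ifndef __ASSEMBLY__
        /* C-only declarations: must stay invisible when the file is fed
         * to the assembler or, as with %.its.S, to the dtc toolchain. */
        struct example_attr_holder {
                int field;
        };
        #endif /* __ASSEMBLY__ */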
      
      The failure was:
      
          Error: arch/mips/boot/vmlinux.gz.its:201.1-2 syntax error
          FATAL ERROR: Unable to parse input tree
          /usr/bin/mkimage: Can't read arch/mips/boot/vmlinux.gz.itb.tmp: Invalid argument
          /usr/bin/mkimage Can't add hashes to FIT blob
      Reported-by: kbuild test robot <lkp@intel.com>
      Fixes: 28128c61 ("kconfig.h: Include compiler types to avoid missed struct attributes")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bpf, arm64: fix out of bounds access in tail call · 16338a9b
      Committed by Daniel Borkmann
      I recently noticed a crash on arm64 when feeding a bogus index
      into the BPF tail call helper. The crash would not occur when the
      interpreter is used, but only in the case of the JIT. Output looks
      as follows:
      
        [  347.007486] Unable to handle kernel paging request at virtual address fffb850e96492510
        [...]
        [  347.043065] [fffb850e96492510] address between user and kernel address ranges
        [  347.050205] Internal error: Oops: 96000004 [#1] SMP
        [...]
        [  347.190829] x13: 0000000000000000 x12: 0000000000000000
        [  347.196128] x11: fffc047ebe782800 x10: ffff808fd7d0fd10
        [  347.201427] x9 : 0000000000000000 x8 : 0000000000000000
        [  347.206726] x7 : 0000000000000000 x6 : 001c991738000000
        [  347.212025] x5 : 0000000000000018 x4 : 000000000000ba5a
        [  347.217325] x3 : 00000000000329c4 x2 : ffff808fd7cf0500
        [  347.222625] x1 : ffff808fd7d0fc00 x0 : ffff808fd7cf0500
        [  347.227926] Process test_verifier (pid: 4548, stack limit = 0x000000007467fa61)
        [  347.235221] Call trace:
        [  347.237656]  0xffff000002f3a4fc
        [  347.240784]  bpf_test_run+0x78/0xf8
        [  347.244260]  bpf_prog_test_run_skb+0x148/0x230
        [  347.248694]  SyS_bpf+0x77c/0x1110
        [  347.251999]  el0_svc_naked+0x30/0x34
        [  347.255564] Code: 9100075a d280220a 8b0a002a d37df04b (f86b694b)
        [...]
      
      In this case the index used in BPF r3 is the same as in r1
      at the time of the call, meaning we fed a pointer as index;
      here, it had the value 0xffff808fd7cf0500 which sits in x2.
      
      While I found tail calls to be working in general (also for
      hitting the error cases), I noticed the following in the code
      emission:
      
        # bpftool p d j i 988
        [...]
        38:   ldr     w10, [x1,x10]
        3c:   cmp     w2, w10
        40:   b.ge    0x000000000000007c              <-- signed cmp
        44:   mov     x10, #0x20                      // #32
        48:   cmp     x26, x10
        4c:   b.gt    0x000000000000007c
        50:   add     x26, x26, #0x1
        54:   mov     x10, #0x110                     // #272
        58:   add     x10, x1, x10
        5c:   lsl     x11, x2, #3
        60:   ldr     x11, [x10,x11]                  <-- faulting insn (f86b694b)
        64:   cbz     x11, 0x000000000000007c
        [...]
      
      Meaning, the tests passed because commit ddb55992 ("arm64:
      bpf: implement bpf_tail_call() helper") was using signed compares
      instead of unsigned, which as a result had the test wrongly passing.
      
      Change both this and the tail call count test to unsigned compares,
      and cap the index as u32. We did the latter in 90caccdd
      ("bpf: fix bpf_tail_call() x64 JIT") as well, and it is needed here
      in addition, too. Tested on HiSilicon Hi1616.
      
      Result after patch:
      
        # bpftool p d j i 268
        [...]
        38:	ldr	w10, [x1,x10]
        3c:	add	w2, w2, #0x0
        40:	cmp	w2, w10
        44:	b.cs	0x0000000000000080
        48:	mov	x10, #0x20                  	// #32
        4c:	cmp	x26, x10
        50:	b.hi	0x0000000000000080
        54:	add	x26, x26, #0x1
        58:	mov	x10, #0x110                 	// #272
        5c:	add	x10, x1, x10
        60:	lsl	x11, x2, #3
        64:	ldr	x11, [x10,x11]
        68:	cbz	x11, 0x0000000000000080
        [...]
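      
      In C terms, the check the JIT now emits mirrors the interpreter,
      roughly as below (sketch; 32 is the tail call limit visible in the
      emitted code above):
      
        #include <stddef.h>
        #include <stdint.h>
        
        static void *tail_call_target(void **progs, uint32_t max_entries,
                                      uint64_t index, uint64_t *cnt)
        {
                uint32_t idx = (uint32_t)index; /* cap the index as u32 */
        
                if (idx >= max_entries)         /* unsigned cmp (b.cs) */
                        return NULL;
                if (*cnt > 32)                  /* unsigned cmp (b.hi) */
                        return NULL;
                (*cnt)++;
                return progs[idx];
        }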
      
      Fixes: ddb55992 ("arm64: bpf: implement bpf_tail_call() helper")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, x64: implement retpoline for tail call · a493a87f
      Committed by Daniel Borkmann
      Implement a retpoline [0] for the BPF tail call JIT'ing that converts
      the indirect jump via jmp %rax that is used to make the long jump into
      another JITed BPF image. Since this is subject to speculative execution,
      we need to control the transient instruction sequence here as well
      when CONFIG_RETPOLINE is set, and direct it into a pause + lfence loop.
      The latter also aligns with what gcc / clang emit (e.g. [1]).
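      
      The trampoline shape can be sketched as GNU C inline assembly (an
      illustrative stand-in assuming a compiler that supports naked functions
      on x86-64; the kernel uses hand-written asm macros instead):
      
        __attribute__((naked)) void retpoline_rax(void)
        {
                /* 'call' pushes a return address pointing at the capture
                 * loop, so RSB-based speculation lands in pause/lfence. */
                asm("call   1f\n\t"
                    "0: pause\n\t"
                    "   lfence\n\t"
                    "   jmp    0b\n\t"
                    /* Architecturally: overwrite the pushed return address
                     * with the real target and 'return' to *%rax. */
                    "1: mov    %rax, (%rsp)\n\t"
                    "   ret");
        }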
      
      JIT dump after patch:
      
        # bpftool p d x i 1
         0: (18) r2 = map[id:1]
         2: (b7) r3 = 0
         3: (85) call bpf_tail_call#12
         4: (b7) r0 = 2
         5: (95) exit
      
      With CONFIG_RETPOLINE:
      
        # bpftool p d j i 1
        [...]
        33:	cmp    %edx,0x24(%rsi)
        36:	jbe    0x0000000000000072  |*
        38:	mov    0x24(%rbp),%eax
        3e:	cmp    $0x20,%eax
        41:	ja     0x0000000000000072  |
        43:	add    $0x1,%eax
        46:	mov    %eax,0x24(%rbp)
        4c:	mov    0x90(%rsi,%rdx,8),%rax
        54:	test   %rax,%rax
        57:	je     0x0000000000000072  |
        59:	mov    0x28(%rax),%rax
        5d:	add    $0x25,%rax
        61:	callq  0x000000000000006d  |+
        66:	pause                      |
        68:	lfence                     |
        6b:	jmp    0x0000000000000066  |
        6d:	mov    %rax,(%rsp)         |
        71:	retq                       |
        72:	mov    $0x2,%eax
        [...]
      
        * relative fall-through jumps in error case
        + retpoline for indirect jump
      
      Without CONFIG_RETPOLINE:
      
        # bpftool p d j i 1
        [...]
        33:	cmp    %edx,0x24(%rsi)
        36:	jbe    0x0000000000000063  |*
        38:	mov    0x24(%rbp),%eax
        3e:	cmp    $0x20,%eax
        41:	ja     0x0000000000000063  |
        43:	add    $0x1,%eax
        46:	mov    %eax,0x24(%rbp)
        4c:	mov    0x90(%rsi,%rdx,8),%rax
        54:	test   %rax,%rax
        57:	je     0x0000000000000063  |
        59:	mov    0x28(%rax),%rax
        5d:	add    $0x25,%rax
        61:	jmpq   *%rax               |-
        63:	mov    $0x2,%eax
        [...]
      
        * relative fall-through jumps in error case
        - plain indirect jump as before
      
        [0] https://support.google.com/faqs/answer/7625886
        [1] https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • x86: Treat R_X86_64_PLT32 as R_X86_64_PC32 · b21ebf2f
      Committed by H.J. Lu
      On i386, there are 2 types of PLTs, PIC and non-PIC.  PIE and shared
      objects must use the PIC PLT.  To use the PIC PLT, you need to load
      _GLOBAL_OFFSET_TABLE_ into EBX first.  There is no need for that on
      x86-64, since x86-64 uses a PC-relative PLT.
      
      On x86-64, for 32-bit PC-relative branches, we can generate a PLT32
      relocation instead of a PC32 relocation, which can also be used as
      a marker for 32-bit PC-relative branches.  The linker can always
      reduce a PLT32 relocation to PC32 if the function is defined locally.
      Local functions should use PC32 relocation.  As far as the Linux
      kernel is concerned, R_X86_64_PLT32 can be treated the same as
      R_X86_64_PC32, since the Linux kernel doesn't use a PLT.
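      
      In module-relocation terms, the change amounts to letting PLT32 share
      the PC32 path, roughly (sketch using <elf.h> constants):
      
        #include <elf.h>
        #include <stdint.h>
        
        /* Apply a 32-bit PC-relative fixup: S + A - P. */
        static void apply_reloc(uint32_t r_type, uint8_t *loc, uint64_t val)
        {
                switch (r_type) {
                case R_X86_64_PC32:
                case R_X86_64_PLT32: /* kernel has no PLT: same as PC32 */
                        *(uint32_t *)loc = (uint32_t)(val - (uint64_t)loc);
                        break;
                }
        }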
      
      R_X86_64_PLT32 for 32-bit PC-relative branches has been enabled in
      the binutils master branch, which will become binutils 2.31.
      
      [ hjl is working on having better documentation on this all, but a few
        more notes from him:
      
         "PLT32 relocation is used as marker for PC-relative branches. Because
          of EBX, it looks odd to generate PLT32 relocation on i386 when EBX
          doesn't have GOT.
      
         As for symbol resolution, PLT32 and PC32 relocations are almost
         interchangeable. But when the linker sees a PLT32 relocation against
         a protected symbol, it can be resolved locally at link-time, since
         it is used on a branch instruction. The linker can't do that for a
         PC32 relocation"
      
        but for the kernel use, the two are basically the same, and this
        commit gets things building and working with the current binutils
        master - Linus ]
      Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 22 Feb 2018: 1 commit