1. 20 February 2019, 4 commits
    • signal: Restore the stop PTRACE_EVENT_EXIT · a2b3e2c0
      Committed by Eric W. Biederman
      commit cf43a757fd49442bc38f76088b70c2299eed2c2f upstream.
      
      In the middle of do_exit() there is a call to
      "ptrace_event(PTRACE_EVENT_EXIT, code);".  That call places the process
      in TASK_TRACED, aka "(TASK_WAKEKILL | __TASK_TRACED)", and waits for
      the debugger to release the task or for SIGKILL to be delivered.
      
      Skipping past dequeue_signal when we know a fatal signal has already
      been delivered resulted in SIGKILL remaining pending and
      TIF_SIGPENDING remaining set.  This in turn caused the
      scheduler to not sleep in PTRACE_EVENT_EXIT as it figured
      a fatal signal was still pending.  It also caused ptrace_freeze_traced
      in ptrace_check_attach to fail, because the per thread
      SIGKILL left pending is exactly what fatal_signal_pending tests for.
      
      This difference in signal state caused strace to report
      strace: Exit of unknown pid NNNNN ignored
      
      Therefore update the signal handling state like dequeue_signal
      would when removing a per thread SIGKILL, by removing SIGKILL
      from the per thread signal mask and clearing TIF_SIGPENDING.
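
      A minimal sketch of the described state update in get_signal()
      (kernel/signal.c); this illustrates the idea rather than reproducing
      the exact upstream diff:

        if (signal_group_exit(signal)) {
                ksig->info.si_signo = signr = SIGKILL;
                /* drop the per-thread SIGKILL the way dequeue_signal() would */
                sigdelset(&current->pending.signal, SIGKILL);
                /* clears TIF_SIGPENDING when nothing else is pending */
                recalc_sigpending();
                goto fatal;
        }
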
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Ivan Delalande <colona@arista.com>
      Cc: stable@vger.kernel.org
      Fixes: 35634ffa1751 ("signal: Always notice exiting tasks")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a2b3e2c0
    • tracing/uprobes: Fix output for multiple string arguments · 45649b99
      Committed by Andreas Ziegler
      commit 0722069a5374b904ec1a67f91249f90e1cfae259 upstream.
      
      When printing multiple uprobe arguments as strings the output for the
      earlier arguments would also include all later string arguments.
      
      This is best explained in an example:
      
      Consider adding a uprobe to a function at offset 0xa0 in strlib.so
      that receives two strings as parameters, where we want to print both
      parameters when the uprobe is hit (on x86_64):
      
      $ echo 'p:func /lib/strlib.so:0xa0 +0(%di):string +0(%si):string' > \
          /sys/kernel/debug/tracing/uprobe_events
      
      When the function is called as func("foo", "bar") and we hit the probe,
      the trace file shows a line like the following:
      
        [...] func: (0x7f7e683706a0) arg1="foobar" arg2="bar"
      
      Note the extra "bar" printed as part of arg1. This behaviour stacks up
      for additional string arguments.
      
      The strings are stored in a dynamically growing part of the uprobe
      buffer by fetch_store_string() after copying them from userspace via
      strncpy_from_user(). The return value of strncpy_from_user() is then
      directly used as the required size for the string. However, this does
      not take the terminating null byte into account as the documentation
      for strncpy_from_user() clearly states that it "[...] returns the
      length of the string (not including the trailing NUL)" even though the
      null byte will be copied to the destination.
      
      Therefore, subsequent calls to fetch_store_string() will overwrite
      the terminating null byte of the most recently fetched string with
      the first character of the current string, leading to the
      "accumulation" of strings in earlier arguments in the output.
      
      Fix this by incrementing the return value of strncpy_from_user() by
      one if we did not hit the maximum buffer size.
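
      The off-by-one is easy to reproduce in plain user space. The sketch
      below packs two strings into one buffer the way the buggy and the
      fixed code size them; strncpy_from_user() is modelled with strlen(),
      which likewise excludes the trailing NUL:

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                char buf[32];
                const char *a = "foo", *b = "bar";
                size_t off;

                /* buggy sizing: advance by the NUL-less length only */
                off = 0;
                memcpy(buf + off, a, strlen(a) + 1);
                off += strlen(a);
                memcpy(buf + off, b, strlen(b) + 1);
                printf("buggy: arg1=\"%s\" arg2=\"%s\"\n", buf, buf + strlen(a));

                /* fixed sizing: reserve length + 1 so the terminator survives */
                off = 0;
                memcpy(buf + off, a, strlen(a) + 1);
                off += strlen(a) + 1;
                memcpy(buf + off, b, strlen(b) + 1);
                printf("fixed: arg1=\"%s\" arg2=\"%s\"\n", buf, buf + strlen(a) + 1);
                return 0;
        }

      The first printf reproduces the broken arg1="foobar" output shown
      above; the second keeps the two strings separate.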
      
      Link: http://lkml.kernel.org/r/20190116141629.5752-1-andreas.ziegler@fau.de
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 5baaa59e ("tracing/probes: Implement 'memory' fetch method for uprobes")
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      45649b99
    • perf/x86: Add check_period PMU callback · 74cbb754
      Committed by Jiri Olsa
      commit 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb upstream.
      
      Vince (and later on Ravi) reported crashes in the BTS code during
      fuzzing with the following backtrace:
      
        general protection fault: 0000 [#1] SMP PTI
        ...
        RIP: 0010:perf_prepare_sample+0x8f/0x510
        ...
        Call Trace:
         <IRQ>
         ? intel_pmu_drain_bts_buffer+0x194/0x230
         intel_pmu_drain_bts_buffer+0x160/0x230
         ? tick_nohz_irq_exit+0x31/0x40
         ? smp_call_function_single_interrupt+0x48/0xe0
         ? call_function_single_interrupt+0xf/0x20
         ? call_function_single_interrupt+0xa/0x20
         ? x86_schedule_events+0x1a0/0x2f0
         ? x86_pmu_commit_txn+0xb4/0x100
         ? find_busiest_group+0x47/0x5d0
         ? perf_event_set_state.part.42+0x12/0x50
         ? perf_mux_hrtimer_restart+0x40/0xb0
         intel_pmu_disable_event+0xae/0x100
         ? intel_pmu_disable_event+0xae/0x100
         x86_pmu_stop+0x7a/0xb0
         x86_pmu_del+0x57/0x120
         event_sched_out.isra.101+0x83/0x180
         group_sched_out.part.103+0x57/0xe0
         ctx_sched_out+0x188/0x240
         ctx_resched+0xa8/0xd0
         __perf_event_enable+0x193/0x1e0
         event_function+0x8e/0xc0
         remote_function+0x41/0x50
         flush_smp_call_function_queue+0x68/0x100
         generic_smp_call_function_single_interrupt+0x13/0x30
         smp_call_function_single_interrupt+0x3e/0xe0
         call_function_single_interrupt+0xf/0x20
         </IRQ>
      
      The reason is that while the event init code does several checks
      for BTS events and prevents several unwanted config bits for
      BTS events (like precise_ip), PERF_EVENT_IOC_PERIOD allows
      creating a BTS event without those checks being done.
      
      The following sequence will cause the crash:

      If we create an 'almost' BTS event with precise_ip and callchains,
      and then turn it into a BTS event via PERF_EVENT_IOC_PERIOD, it will
      crash in perf_prepare_sample(), because precise_ip events are
      expected to come in with callchain data initialized, which is not
      the case for the intel_pmu_drain_bts_buffer() caller.
      
      Add a check_period callback that is called before the period is
      changed via PERF_EVENT_IOC_PERIOD. It denies the change if the event
      would become a BTS event. Also add the limit_period check there.
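
      A hedged sketch of the callback's shape; the function and field names
      follow the changelog and common arch/x86/events naming, so treat the
      exact wiring as approximate rather than as the upstream diff:

        static int x86_pmu_check_period(struct perf_event *event, u64 value)
        {
                /* let the arch code veto a period that would turn this into BTS */
                if (x86_pmu.check_period && x86_pmu.check_period(event, value))
                        return -EINVAL;

                /* the additional limit_period check mentioned above */
                if (value && x86_pmu.limit_period &&
                    x86_pmu.limit_period(event, value) > value)
                        return -EINVAL;

                return 0;
        }
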
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      74cbb754
    • perf/core: Fix impossible ring-buffer sizes warning · d10e77c2
      Committed by Ingo Molnar
      commit 528871b456026e6127d95b1b2bd8e3a003dc1614 upstream.
      
      The following commit:
      
        9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes")
      
      results in perf recording failures with larger mmap areas:
      
        root@skl:/tmp# perf record -g -a
        failed to mmap with 12 (Cannot allocate memory)
      
      The root cause is that the following condition is buggy:
      
      	if (order_base_2(size) >= MAX_ORDER)
      		goto fail;
      
      The problem is that @size is in bytes and MAX_ORDER is in pages,
      so the right test is:
      
      	if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER)
      		goto fail;
      
      Fix it.
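
      A small user-space sketch of why the units matter. The MAX_ORDER and
      PAGE_SHIFT values below are assumptions for a common x86_64
      configuration, and order_base_2() is modelled as a ceiling log2:

        #include <stdio.h>

        #define MAX_ORDER  11
        #define PAGE_SHIFT 12

        static int order_base_2(unsigned long n)
        {
                int order = 0;

                while ((1UL << order) < n)
                        order++;
                return order;
        }

        int main(void)
        {
                unsigned long size = 512UL * 1024;  /* 512 KiB of ring buffer, in bytes */

                /* buggy: a byte-based order compared against a page-based limit */
                printf("buggy rejects: %d\n", order_base_2(size) >= MAX_ORDER);
                /* fixed: shift the limit by PAGE_SHIFT so both sides are in bytes */
                printf("fixed rejects: %d\n", order_base_2(size) >= PAGE_SHIFT + MAX_ORDER);
                return 0;
        }
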
      Reported-by: N"Jin, Yao" <yao.jin@linux.intel.com>
      Bisected-by: NBorislav Petkov <bp@alien8.de>
      Analyzed-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Julien Thierry <julien.thierry@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Fixes: 9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d10e77c2
2. 15 February 2019, 3 commits
    • tracing: uprobes: Fix typo in pr_fmt string · 7e44aab9
      Committed by Andreas Ziegler
      commit ea6eb5e7d15e1838de335609994b4546e2abcaaf upstream.
      
      The subsystem-specific message prefix for uprobes was also
      "trace_kprobe: " instead of "trace_uprobe: " as described in
      the original commit message.
      
      Link: http://lkml.kernel.org/r/20190117133023.19292-1-andreas.ziegler@fau.de
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Fixes: 72576341 ("tracing/probe: Show subsystem name in messages")
      Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7e44aab9
    • signal: Better detection of synchronous signals · 959e46af
      Committed by Eric W. Biederman
      commit 7146db3317c67b517258cb5e1b08af387da0618b upstream.
      
      Recently syzkaller was able to create unkillable processes by
      creating a timer that is delivered as a thread-local signal on SIGHUP,
      and receiving SIGHUP with SA_NODEFER set, ultimately causing a loop
      that keeps failing to deliver SIGHUP but always retries.
      
      When the stack overflows, delivery of SIGHUP fails and force_sigsegv is
      called.  Unfortunately, because SIGSEGV is numerically higher than
      SIGHUP, next_signal tries again to deliver a SIGHUP.
      
      From a quality of implementation standpoint attempting to deliver the
      timer SIGHUP signal is wrong.  We should attempt to deliver the
      synchronous SIGSEGV signal we just forced.
      
      We can make that happen in a fairly straightforward manner: instead
      of just looking at the signal number, also look at the si_code.  In
      particular, for exceptions (aka synchronous signals) the si_code is
      always greater than 0.
      
      That still has the potential to pick up a number of asynchronous
      signals as in a few cases the same si_codes that are used
      for synchronous signals are also used for asynchronous signals,
      and SI_KERNEL is also included in the list of possible si_codes.
      
      Still, the heuristic is much better and timer signals are definitely
      excluded, which is enough to prevent all known ways for someone to
      send a process signals fast enough to cause unexpected and
      arguably incorrect behavior.
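
      A user-space sketch of the si_code test described above; the list of
      exception signals here is an assumption for illustration, not the
      kernel's exact synchronous mask:

        #include <signal.h>
        #include <stdbool.h>
        #include <stdio.h>

        static bool looks_synchronous(const siginfo_t *info)
        {
                switch (info->si_signo) {
                case SIGSEGV: case SIGBUS: case SIGILL:
                case SIGFPE: case SIGTRAP: case SIGSYS:
                        /* kernel-generated fault codes are > 0; SI_USER,
                         * SI_QUEUE, SI_TIMER and friends are <= 0 */
                        return info->si_code > 0;
                default:
                        return false;
                }
        }

        int main(void)
        {
                siginfo_t fault = { .si_signo = SIGSEGV, .si_code = SEGV_MAPERR };
                siginfo_t timer = { .si_signo = SIGHUP,  .si_code = SI_TIMER };

                printf("fault synchronous: %d\n", looks_synchronous(&fault));
                printf("timer synchronous: %d\n", looks_synchronous(&timer));
                return 0;
        }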
      
      Cc: stable@vger.kernel.org
      Fixes: a27341cd ("Prioritize synchronous signals over 'normal' signals")
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      959e46af
    • signal: Always notice exiting tasks · f681f268
      Committed by Eric W. Biederman
      commit 35634ffa1751b6efd8cf75010b509dcb0263e29b upstream.
      
      Recently syzkaller was able to create unkillable processes by
      creating a timer that is delivered as a thread-local signal on SIGHUP,
      and receiving SIGHUP with SA_NODEFER set, ultimately causing a loop
      that keeps failing to deliver SIGHUP but always retries.
      
      Upon examination it turns out part of the problem is actually most of
      the solution.  Since 2.5, signal delivery has found all fatal signals,
      marked the signal group for death, and queued SIGKILL in every
      thread's signal queue, relying on signal->group_exit_code to preserve
      the information of which signal was the actual fatal one.
      
      The conversion of all fatal signals to SIGKILL results in the
      synchronous signal heuristic in next_signal kicking in and preferring
      SIGHUP to SIGKILL, which is especially problematic as all
      fatal signals have already been transformed into SIGKILL.
      
      Instead of dequeueing signals and depending upon SIGKILL to
      be the first signal dequeued, first test if the signal group
      has already been marked for death.  This guarantees that
      nothing in the signal queue can prevent a process that needs
      to exit from exiting.
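
      For reference, "marked for death" here means signal_group_exit()
      returns true; its definition (include/linux/sched/signal.h at the
      time) is roughly:

        static inline int signal_group_exit(const struct signal_struct *sig)
        {
                return (sig->flags & SIGNAL_GROUP_EXIT) ||
                       (sig->group_exit_task != NULL);
        }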
      
      Cc: stable@vger.kernel.org
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Ref: ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4")
      History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f681f268
3. 13 February 2019, 12 commits
    • perf/core: Don't WARN() for impossible ring-buffer sizes · 1aeeb176
      Committed by Mark Rutland
      commit 9dff0aa95a324e262ffb03f425d00e4751f3294e upstream.
      
      The perf tool uses /proc/sys/kernel/perf_event_mlock_kb to determine how
      large its ringbuffer mmap should be. This can be configured to arbitrary
      values, which can be larger than the maximum possible allocation from
      kmalloc.
      
      When this is configured to a suitably large value (e.g. thanks to the
      perf fuzzer), attempting to use perf record triggers a WARN_ON_ONCE() in
      __alloc_pages_nodemask():
      
         WARNING: CPU: 2 PID: 5666 at mm/page_alloc.c:4511 __alloc_pages_nodemask+0x3f8/0xbc8
      
      Let's avoid this by checking that the requested allocation is possible
      before calling kzalloc.
      Reported-by: Julien Thierry <julien.thierry@arm.com>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Julien Thierry <julien.thierry@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20190110142745.25495-1-mark.rutland@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1aeeb176
    • cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM · 97a7fa90
      Committed by Josh Poimboeuf
      commit b284909abad48b07d3071a9fc9b5692b3e64914b upstream.
      
      With the following commit:
      
        73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      
      ... the hotplug code attempted to detect when SMT was disabled by BIOS,
      in which case it reported SMT as permanently disabled.  However, that
      code broke a virt hotplug scenario, where the guest is booted with only
      primary CPU threads, and a sibling is brought online later.
      
      The problem is that there doesn't seem to be a way to reliably
      distinguish between the HW "SMT disabled by BIOS" case and the virt
      "sibling not yet brought online" case.  So the above-mentioned commit
      was a bit misguided, as it permanently disabled SMT for both cases,
      preventing future virt sibling hotplugs.
      
      Going back and reviewing the original problems that commit attempted
      to solve when SMT was disabled in BIOS:
      
        1) /sys/devices/system/cpu/smt/control showed "on" instead of
           "notsupported"; and
      
        2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.
      
      I'd propose that we instead consider #1 above to not actually be a
      problem.  Because, at least in the virt case, it's possible that SMT
      wasn't disabled by BIOS and a sibling thread could be brought online
      later.  So it makes sense to just always default the smt control to "on"
      to allow for that possibility (assuming cpuid indicates that the CPU
      supports SMT).
      
      The real problem is #2, which has a simple fix: change vmx_vm_init() to
      query the actual current SMT state -- i.e., whether any siblings are
      currently online -- instead of looking at the SMT "control" sysfs value.
      
      So fix it by:
      
        a) reverting the original "fix" and its followup fix:
      
           73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
           bc2d8d26 ("cpu/hotplug: Fix SMT supported evaluation")
      
           and
      
        b) changing vmx_vm_init() to query the actual current SMT state --
           instead of the sysfs control value -- to determine whether the L1TF
           warning is needed.  This also requires the 'sched_smt_present'
           variable to be exported, instead of 'cpu_smt_control' (see the
           sketch below).
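
      A hedged sketch of the vmx_vm_init() change described in b)
      (arch/x86/kvm/vmx.c; the surrounding L1TF mitigation logic is
      omitted). sched_smt_active() reads the exported 'sched_smt_present'
      key mentioned above:

        if (sched_smt_active())
                pr_warn_once(L1TF_MSG_SMT);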
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: Igor Mammedov <imammedo@redhat.com>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      97a7fa90
    • futex: Handle early deadlock return correctly · ee73954d
      Committed by Thomas Gleixner
      commit 1a1fb985f2e2b85ec0d3dc2e519ee48389ec2434 upstream.
      
      commit 56222b21 ("futex: Drop hb->lock before enqueueing on the
      rtmutex") changed the locking rules in the futex code so that the hash
      bucket lock is no longer held while the waiter is enqueued into the
      rtmutex wait list. This made the lock and the unlock path symmetric, but
      unfortunately the possible early exit from __rt_mutex_proxy_start() due to
      a detected deadlock was not updated accordingly. That allows a concurrent
      unlocker to observe inconsistent state which triggers the warning in the
      unlock path.
      
      futex_lock_pi()                         futex_unlock_pi()
        lock(hb->lock)
        queue(hb_waiter)				lock(hb->lock)
        lock(rtmutex->wait_lock)
        unlock(hb->lock)
                                              // acquired hb->lock
                                              hb_waiter = futex_top_waiter()
                                              lock(rtmutex->wait_lock)
        __rt_mutex_proxy_start()
           ---> fail
                remove(rtmutex_waiter);
           ---> returns -EDEADLOCK
        unlock(rtmutex->wait_lock)
                                              // acquired wait_lock
                                              wake_futex_pi()
                                              rt_mutex_next_owner()
      					  --> returns NULL
                                                --> WARN
      
        lock(hb->lock)
        unqueue(hb_waiter)
      
      The problem is caused by the remove(rtmutex_waiter) in the failure case of
      __rt_mutex_proxy_start() as this lets the unlocker observe a waiter in the
      hash bucket but no waiter on the rtmutex, i.e. inconsistent state.
      
      The original commit handles this correctly for the other early return cases
      (timeout, signal) by delaying the removal of the rtmutex waiter until the
      returning task reacquired the hash bucket lock.
      
      Treat the failure case of __rt_mutex_proxy_start() in the same way and let
      the existing cleanup code handle the eventual handover of the rtmutex
      gracefully. The regular rt_mutex_proxy_start() gains the rtmutex waiter
      removal for the failure case, so that the other callsites are still
      operating correctly.
      
      Add proper comments to the code so all these details are fully documented.
      
      Thanks to Peter for helping with the analysis and writing the really
      valuable code comments.
      
      Fixes: 56222b21 ("futex: Drop hb->lock before enqueueing on the rtmutex")
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Co-developed-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: linux-s390@vger.kernel.org
      Cc: Stefan Liebler <stli@linux.ibm.com>
      Cc: Sebastian Sewior <bigeasy@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1901292311410.1950@nanos.tec.linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ee73954d
    • kernel/kcov.c: mark write_comp_data() as notrace · 58e57bcb
      Committed by Anders Roxell
      [ Upstream commit 634724431607f6f46c495dfef801a1c8b44a96d9 ]
      
      Since __sanitizer_cov_trace_const_cmp4 is marked as notrace, the
      function called from __sanitizer_cov_trace_const_cmp4 shouldn't be
      traceable either.  ftrace_graph_caller() gets called every time
      write_comp_data() gets called if it isn't marked 'notrace'.  This is
      the backtrace from gdb:
      
       #0  ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179
       #1  0xffffff8010201920 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151
       #2  0xffffff8010439714 in write_comp_data (type=5, arg1=0, arg2=0, ip=18446743524224276596) at ../kernel/kcov.c:116
       #3  0xffffff8010439894 in __sanitizer_cov_trace_const_cmp4 (arg1=<optimized out>, arg2=<optimized out>) at ../kernel/kcov.c:188
       #4  0xffffff8010201874 in prepare_ftrace_return (self_addr=18446743524226602768, parent=0xffffff801014b918, frame_pointer=18446743524223531344) at ./include/generated/atomic-instrumented.h:27
       #5  0xffffff801020194c in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182
      
      Rework so that write_comp_data(), which is called from the
      __sanitizer_cov_trace_*_cmp*() functions, is marked as 'notrace'.
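
      A minimal sketch of the change in kernel/kcov.c; the signature
      matches the gdb frame above and the body is elided:

        static notrace void write_comp_data(u64 type, u64 arg1, u64 arg2, u64 ip)
        {
                ...
        }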
      
      Commit 903e8ff86753 ("kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace")
      failed to mark write_comp_data() as 'notrace'. When that patch was
      created, gcc-7 was used. In lib/Kconfig.debug:

      config KCOV_ENABLE_COMPARISONS
      	depends on $(cc-option,-fsanitize-coverage=trace-cmp)

      That code path isn't hit with gcc-7; however, it is with gcc-8.
      
      Link: http://lkml.kernel.org/r/20181206143011.23719-1-anders.roxell@linaro.org
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Co-developed-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      58e57bcb
    • kernel/hung_task.c: force console verbose before panic · f0d32c54
      Committed by Liu, Chuansheng
      [ Upstream commit 168e06f7937d96c7222037d8a05565e8a6eb00fe ]
      
      Based on commit 401c636a ("kernel/hung_task.c: show all hung tasks
      before panic"), we can get the call stack of a hung task.

      However, if the console loglevel is not high, we still cannot see the
      useful panic information in practice, and in most cases users don't set
      the console loglevel to a high level.

      This patch forces the console to be verbose before the system panics,
      so that the really useful information shows up on the console instead
      of output like the following, which doesn't include the hung task
      information.
      
        INFO: task init:1 blocked for more than 120 seconds.
              Tainted: G     U  W         4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        Kernel panic - not syncing: hung_task: blocked tasks
        CPU: 2 PID: 479 Comm: khungtaskd Tainted: G     U  W         4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1
        Call Trace:
         dump_stack+0x4f/0x65
         panic+0xde/0x231
         watchdog+0x290/0x410
         kthread+0x12c/0x150
         ret_from_fork+0x35/0x40
        reboot: panic mode set: p,w
        Kernel Offset: 0x34000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
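
      A hedged sketch of the change in kernel/hung_task.c (placement
      approximate): raise the console loglevel right before the hung-task
      panic so the traces above actually reach the console:

        if (sysctl_hung_task_panic) {
                console_verbose();      /* don't let the loglevel filter the dump */
                ...
                panic("hung_task: blocked tasks");
        }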
      
      Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A6015B675@SHSMSX101.ccr.corp.intel.com
      Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f0d32c54
    • proc/sysctl: fix return error for proc_doulongvec_minmax() · 9beb84c0
      Committed by Cheng Lin
      [ Upstream commit 09be178400829dddc1189b50a7888495dd26aa84 ]
      
      If the number of input parameters is less than the total parameters, an
      EINVAL error will be returned.
      
      For example, we use proc_doulongvec_minmax to pass up to two parameters
      with kern_table:
      
      {
      	.procname       = "monitor_signals",
      	.data           = &monitor_sigs,
      	.maxlen         = 2*sizeof(unsigned long),
      	.mode           = 0644,
      	.proc_handler   = proc_doulongvec_minmax,
      },
      
      Reproduce:
      
      When passing two parameters, it works normally.  But when passing only
      one parameter, an "Invalid argument" (EINVAL) error is returned.
      
        [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
        [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
        1       2
        [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
        -bash: echo: write error: Invalid argument
        [root@cl150 ~]# echo $?
        1
        [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
        3       2
        [root@cl150 ~]#
      
      The following is the result after applying this patch.  No error is
      returned when the number of input parameters is less than the total
      number of parameters.
      
        [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
        [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
        1       2
        [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
        [root@cl150 ~]# echo $?
        0
        [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
        3       2
        [root@cl150 ~]#
      
      There are three processing functions dealing with numeric parameters:
      __do_proc_dointvec/__do_proc_douintvec/__do_proc_doulongvec_minmax.

      This patch deals with __do_proc_doulongvec_minmax, adding a check on
      the 'left' parameter just as __do_proc_dointvec does.  The code of
      __do_proc_douintvec explicitly does not support multiple inputs:
      
      static int __do_proc_douintvec(...){
               ...
               /*
                * Arrays are not supported, keep this simple. *Do not* add
                * support for them.
                */
               if (vleft != 1) {
                       *lenp = 0;
                       return -EINVAL;
               }
               ...
      }
      
      So only __do_proc_doulongvec_minmax has the problem, and most users of
      proc_doulongvec_minmax/proc_doulongvec_ms_jiffies_minmax have just one
      parameter.
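
      A hedged sketch of the kind of check being added; the placement
      inside __do_proc_doulongvec_minmax()'s write loop is approximate,
      mirroring what __do_proc_dointvec already does:

        if (write) {
                if (!left)      /* input exhausted: stop, don't fail */
                        break;
                ...
        }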
      
      Link: http://lkml.kernel.org/r/1544081775-15720-1-git-send-email-cheng.lin130@zte.com.cn
      Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
      Acked-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9beb84c0
    • kernel/hung_task.c: break RCU locks based on jiffies · 9c8939b0
      Committed by Tetsuo Handa
      [ Upstream commit 304ae42739b108305f8d7b3eb3c1aec7c2b643a9 ]
      
      check_hung_uninterruptible_tasks() currently calls rcu_lock_break()
      once every 1024 threads.  But check_hung_task() is very slow if printk()
      was called, and very fast otherwise.

      If many threads within such a batch of 1024 called printk(), the RCU
      grace period might be extended enough to trigger RCU stall warnings.
      Therefore, calling rcu_lock_break() after a fixed number of jiffies
      is safer.
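
      A hedged sketch of the new condition in
      check_hung_uninterruptible_tasks(); the name of the interval
      constant is an assumption:

        if (time_after(jiffies, last_break + HUNG_TASK_LOCK_BREAK)) {
                if (!rcu_lock_break(g, t))
                        goto unlock;
                last_break = jiffies;
        }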
      
      Link: http://lkml.kernel.org/r/1544800658-11423-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9c8939b0
    • kdb: Don't back trace on a cpu that didn't round up · 3818c29a
      Committed by Douglas Anderson
      [ Upstream commit 162bc7f5afd75b72acbe3c5f3488ef7e64a3fe36 ]
      
      If you have a CPU that fails to round up and then run 'btc', you'll end
      up crashing in kdb because we dereference NULL.  Let's add a check.
      It's also wise to set the task to NULL when leaving the debugger, so
      that if we fail to round up on a later entry into the debugger we
      won't backtrace a stale task.
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3818c29a
    • cgroup: fix parsing empty mount option string · 4b5abffd
      Committed by Ondrej Mosnacek
      [ Upstream commit e250d91d65750a0c0c62483ac4f9f357e7317617 ]
      
      This fixes the case where all mount options specified are consumed by an
      LSM and all that's left is an empty string. In this case cgroupfs should
      accept the string and not fail.
      
      How to reproduce (with SELinux enabled):
      
          # umount /sys/fs/cgroup/unified
          # mount -o context=system_u:object_r:cgroup_t:s0 -t cgroup2 cgroup2 /sys/fs/cgroup/unified
          mount: /sys/fs/cgroup/unified: wrong fs type, bad option, bad superblock on cgroup2, missing codepage or helper program, or other error.
          # dmesg | tail -n 1
          [   31.575952] cgroup: cgroup2: unknown option ""
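
      A hedged sketch of the parsing change (parse_cgroup_root_flags();
      the exact in-tree diff may differ slightly): an empty, fully
      consumed option string is accepted as "no options" instead of
      being reported as an unknown option:

        if (!data || *data == '\0')
                return 0;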
      
      Fixes: 67e9c74b ("cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type")
      [NOTE: should apply on top of commit 5136f636 ("cgroup: implement "nsdelegate" mount option"), older versions need manual rebase]
      Suggested-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4b5abffd
    • kobject: return error code if writing /sys/.../uevent fails · f7debeeb
      Committed by Peter Rajnoha
      [ Upstream commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ]
      
      Propagate error code back to userspace if writing the /sys/.../uevent
      file fails. Before, the write operation always returned with success,
      even if we failed to recognize the input string or if we failed to
      generate the uevent itself.
      
      With the error codes properly propagated back to userspace, we are
      able to react in userspace accordingly by not assuming and awaiting
      a uevent that is not delivered.
      Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f7debeeb
    • timekeeping: Use proper seqcount initializer · 0105d80d
      Committed by Bart Van Assche
      [ Upstream commit ce10a5b3954f2514af726beb78ed8d7350c5e41c ]
      
      tk_core.seq is initialized open coded, which misses initializing the
      lockdep map when lockdep is enabled. Lockdep splats involving tk_core.seq
      consequently lack a name and are hard to read.

      Use the proper initializer, which takes care of the lockdep map
      initialization.
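
      A hedged sketch of the change in kernel/time/timekeeping.c, using
      the SEQCNT_ZERO() initializer so the seqcount gets its lockdep map
      (and thereby a name):

        static struct {
                seqcount_t              seq;
                struct timekeeper       timekeeper;
        } tk_core ____cacheline_aligned = {
                .seq = SEQCNT_ZERO(tk_core.seq),
        };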
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: peterz@infradead.org
      Cc: tj@kernel.org
      Cc: johannes.berg@intel.com
      Link: https://lkml.kernel.org/r/20181128234325.110011-12-bvanassche@acm.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      0105d80d
    • genirq/affinity: Spread IRQs to all available NUMA nodes · 46ed4f4f
      Committed by Long Li
      [ Upstream commit b82592199032bf7c778f861b936287e37ebc9f62 ]
      
      If the number of NUMA nodes exceeds the number of MSI/MSI-X interrupts
      which are allocated for a device, the interrupt affinity spreading code
      fails to spread them across all nodes.
      
      The reason is that the spreading code starts from node 0 and continues up
      to the number of interrupts requested for allocation. This leaves the nodes
      past the last interrupt unused.
      
      This results in interrupt concentration on the first nodes which violates
      the assumption of the block layer that all nodes are covered evenly. As a
      consequence the NUMA nodes above the number of interrupts are all assigned
      to hardware queue 0 and therefore NUMA node 0, which results in bad
      performance and has CPU hotplug implications, because queue 0 gets shut
      down when the last CPU of node 0 is offlined.
      
      Go over all NUMA nodes and assign them round-robin to all requested
      interrupts to solve this.
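
      A simplified illustration of the round-robin assignment (not the
      kernel's actual helper): with fewer vectors than nodes, every node
      still maps to some vector:

        for (unsigned int node = 0; node < nr_nodes; node++)
                node_to_vector[node] = node % nr_vectors;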
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Long Li <longli@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Link: https://lkml.kernel.org/r/20181102180248.13583-1-longli@linuxonhyperv.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      46ed4f4f
4. 07 February 2019, 1 commit
5. 31 January 2019, 13 commits
    • bpf: fix inner map masking to prevent oob under speculation · 37c9e3ee
      Committed by Daniel Borkmann
      [ commit 9d5564ddcf2a0f5ba3fa1c3a1f8a1b59ad309553 upstream ]
      
      During review I noticed that inner meta map setup for map in
      map is buggy in that it does not propagate all needed data
      from the reference map which the verifier is later accessing.
      
      In particular one such case is index masking to prevent out of
      bounds access under speculative execution due to missing the
      map's unpriv_array/index_mask field propagation. Fix this so that
      the verifier generates the correct code for inlined lookups in
      case of unprivileged use.
      
      Before patch (test_verifier's 'map in map access' dump):
      
        # bpftool prog dump xla id 3
           0: (62) *(u32 *)(r10 -4) = 0
           1: (bf) r2 = r10
           2: (07) r2 += -4
           3: (18) r1 = map[id:4]
           5: (07) r1 += 272                |
           6: (61) r0 = *(u32 *)(r2 +0)     |
           7: (35) if r0 >= 0x1 goto pc+6   | Inlined map in map lookup
           8: (54) (u32) r0 &= (u32) 0      | with index masking for
           9: (67) r0 <<= 3                 | map->unpriv_array.
          10: (0f) r0 += r1                 |
          11: (79) r0 = *(u64 *)(r0 +0)     |
          12: (15) if r0 == 0x0 goto pc+1   |
          13: (05) goto pc+1                |
          14: (b7) r0 = 0                   |
          15: (15) if r0 == 0x0 goto pc+11
          16: (62) *(u32 *)(r10 -4) = 0
          17: (bf) r2 = r10
          18: (07) r2 += -4
          19: (bf) r1 = r0
          20: (07) r1 += 272                |
          21: (61) r0 = *(u32 *)(r2 +0)     | Index masking missing (!)
          22: (35) if r0 >= 0x1 goto pc+3   | for inner map despite
          23: (67) r0 <<= 3                 | map->unpriv_array set.
          24: (0f) r0 += r1                 |
          25: (05) goto pc+1                |
          26: (b7) r0 = 0                   |
          27: (b7) r0 = 0
          28: (95) exit
      
      After patch:
      
        # bpftool prog dump xla id 1
           0: (62) *(u32 *)(r10 -4) = 0
           1: (bf) r2 = r10
           2: (07) r2 += -4
           3: (18) r1 = map[id:2]
           5: (07) r1 += 272                |
           6: (61) r0 = *(u32 *)(r2 +0)     |
           7: (35) if r0 >= 0x1 goto pc+6   | Same inlined map in map lookup
           8: (54) (u32) r0 &= (u32) 0      | with index masking due to
           9: (67) r0 <<= 3                 | map->unpriv_array.
          10: (0f) r0 += r1                 |
          11: (79) r0 = *(u64 *)(r0 +0)     |
          12: (15) if r0 == 0x0 goto pc+1   |
          13: (05) goto pc+1                |
          14: (b7) r0 = 0                   |
          15: (15) if r0 == 0x0 goto pc+12
          16: (62) *(u32 *)(r10 -4) = 0
          17: (bf) r2 = r10
          18: (07) r2 += -4
          19: (bf) r1 = r0
          20: (07) r1 += 272                |
          21: (61) r0 = *(u32 *)(r2 +0)     |
          22: (35) if r0 >= 0x1 goto pc+4   | Now fixed inlined inner map
          23: (54) (u32) r0 &= (u32) 0      | lookup with proper index masking
          24: (67) r0 <<= 3                 | for map->unpriv_array.
          25: (0f) r0 += r1                 |
          26: (05) goto pc+1                |
          27: (b7) r0 = 0                   |
          28: (b7) r0 = 0
          29: (95) exit
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      37c9e3ee
    • bpf: fix sanitation of alu op with pointer / scalar type from different paths · eed84f94
      Committed by Daniel Borkmann
      [ commit d3bd7413e0ca40b60cf60d4003246d067cafdeda upstream ]
      
      While 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer
      arithmetic") took care of rejecting alu op on pointer when e.g. pointer
      came from two different map values with different map properties such as
      value size, Jann reported that a case was not covered yet when a given
      alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
      different branches where we would incorrectly try to sanitize based
      on the pointer's limit. Catch this corner case and reject the program
      instead.
      
      Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      eed84f94
    • bpf: prevent out of bounds speculation on pointer arithmetic · f92a819b
      Committed by Daniel Borkmann
      [ commit 979d63d50c0c0f7bc537bf821e056cc9fe5abd38 upstream ]
      
      Jann reported that the original commit back in b2157399
      ("bpf: prevent out-of-bounds speculation") was not sufficient
      to stop the CPU from speculating out of bounds memory access:
      While b2157399 only focussed on masking array map access
      for unprivileged users for tail calls and data access such
      that the user provided index gets sanitized from BPF program
      and syscall side, there is still a more generic form affected
      from BPF programs that applies to most maps that hold user
      data in relation to dynamic map access when dealing with
      unknown scalars or "slow" known scalars as access offset, for
      example:
      
        - Load a map value pointer into R6
        - Load an index into R7
        - Do a slow computation (e.g. with a memory dependency) that
          loads a limit into R8 (e.g. load the limit from a map for
          high latency, then mask it to make the verifier happy)
        - Exit if R7 >= R8 (mispredicted branch)
        - Load R0 = R6[R7]
        - Load R0 = R6[R0]
      
      For unknown scalars there are two options in the BPF verifier
      where we could derive knowledge from in order to guarantee
      safe access to the memory: i) While </>/<=/>= variants won't
      allow to derive any lower or upper bounds from the unknown
      scalar where it would be safe to add it to the map value
      pointer, it is possible through ==/!= test however. ii) another
      option is to transform the unknown scalar into a known scalar,
      for example, through ALU ops combination such as R &= <imm>
      followed by R |= <imm> or any similar combination where the
      original information from the unknown scalar would be destroyed
      entirely leaving R with a constant. The initial slow load still
      precedes the latter ALU ops on that register, so the CPU
      executes speculatively from that point. Once we have the known
      scalar, any compare operation would work then. A third option
      only involving registers with known scalars could be crafted
      as described in [0] where a CPU port (e.g. Slow Int unit)
      would be filled with many dependent computations such that
      the subsequent condition depending on its outcome has to wait
      for evaluation on its execution port and thereby executing
      speculatively if the speculated code can be scheduled on a
      different execution port, or any other form of mistraining
      as described in [1], for example. Given this is not limited
      to only unknown scalars, not only map but also stack access
      is affected since both is accessible for unprivileged users
      and could potentially be used for out of bounds access under
      speculation.
      
      In order to prevent any of these cases, the verifier is now
      sanitizing pointer arithmetic on the offset such that any
      out of bounds speculation would be masked in a way where the
      pointer arithmetic result in the destination register will
      stay unchanged, meaning offset masked into zero similar as
      in array_index_nospec() case. With regards to implementation,
      there are three options that were considered: i) new insn
      for sanitation, ii) push/pop insn and sanitation as inlined
      BPF, iii) reuse of ax register and sanitation as inlined BPF.
      
      Option i) has the downside that we end up using from reserved
      bits in the opcode space, but also that we would require
      each JIT to emit masking as native arch opcodes meaning
      mitigation would have slow adoption till everyone implements
      it eventually which is counter-productive. Option ii) and iii)
      have both in common that a temporary register is needed in
      order to implement the sanitation as inlined BPF since we
      are not allowed to modify the source register. While a push /
      pop insn in ii) would be useful to have in any case, it
      requires once again that every JIT needs to implement it
      first. While possible, amount of changes needed would also
      be unsuitable for a -stable patch. Therefore, the path which
      has fewer changes, less BPF instructions for the mitigation
      and does not require anything to be changed in the JITs is
      option iii) which this work is pursuing. The ax register is
      already mapped to a register in all JITs (modulo arm32 where
      it's mapped to stack as various other BPF registers there)
      and used in constant blinding for JITs-only so far. It can
      be reused for verifier rewrites under certain constraints.
      The interpreter's tmp "register" has therefore been remapped
      into extending the register set with hidden ax register and
      reusing that for a number of instructions that needed the
      prior temporary variable internally (e.g. div, mod). This
      allows for zero increase in stack space usage in the interpreter,
      and enables (restricted) generic use in rewrites otherwise as
      long as such a patchlet does not make use of these instructions.
      The sanitation mask is dynamic and relative to the offset the
      map value or stack pointer currently holds.
      
      There are various cases that need to be taken under consideration
      for the masking, e.g. such operation could look as follows:
      ptr += val or val += ptr or ptr -= val. Thus, the value to be
      sanitized could reside either in source or in destination
      register, and the limit is different depending on whether
      the ALU op is addition or subtraction and depending on the
      current known and bounded offset. The limit is derived as
      follows: limit := max_value_size - (smin_value + off). For
      subtraction: limit := umax_value + off. This holds because
      we do not allow any pointer arithmetic that would
      temporarily go out of bounds or would have an unknown
      value with mixed signed bounds where it is unclear at
      verification time whether the actual runtime value would
      be either negative or positive. For example, we have a
      derived map pointer value with constant offset and bounded
      one, so limit based on smin_value works because the verifier
      requires that statically analyzed arithmetic on the pointer
      must be in bounds, and thus it checks if resulting
      smin_value + off and umax_value + off is still within map
      value bounds at time of arithmetic in addition to time of
      access. Similarly, for the case of stack access we derive
      the limit as follows: MAX_BPF_STACK + off for subtraction
      and -off for the case of addition where off := ptr_reg->off +
      ptr_reg->var_off.value. Subtraction is a special case for
      the masking which can be in form of ptr += -val, ptr -= -val,
      or ptr -= val. In the first two cases where we know that
      the value is negative, we need to temporarily negate the
      value in order to do the sanitation on a positive value
      where we later swap the ALU op, and restore original source
      register if the value was in source.
      
      The sanitation of pointer arithmetic alone is still not fully
      sufficient as is, since a scenario like the following could
      happen ...
      
        PTR += 0x1000 (e.g. K-based imm)
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        PTR += 0x1000
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        [...]
      
      ... which under speculation could end up as ...
      
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        [...]
      
      ... and therefore still access out of bounds. To prevent such
      case, the verifier is also analyzing safety for potential out
      of bounds access under speculative execution. Meaning, it is
      also simulating pointer access under truncation. We therefore
      "branch off" and push the current verification state after the
      ALU operation with known 0 to the verification stack for later
      analysis. Given the current path analysis succeeded it is
      likely that the one under speculation can be pruned. In any
      case, it is also subject to existing complexity limits and
      therefore anything beyond this point will be rejected. In
      terms of pruning, it needs to be ensured that the verification
      state from speculative execution simulation must never prune
      a non-speculative execution path, therefore, we mark verifier
      state accordingly at the time of push_stack(). If verifier
      detects out of bounds access under speculative execution from
      one of the possible paths that includes a truncation, it will
      reject such program.
      
      Given we mask every reg-based pointer arithmetic for
      unprivileged programs, we've been looking into how it could
      affect real-world programs in terms of size increase. As the
      majority of programs are targeted for privileged-only use
      case, we've unconditionally enabled masking (with its alu
      restrictions on top of it) for privileged programs for the
      sake of testing in order to check i) whether they get rejected
      in its current form, and ii) by how much the number of
      instructions and size will increase. We've tested this by
      using Katran, Cilium and test_l4lb from the kernel selftests.
      For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
      and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
      we've used test_l4lb.o as well as test_l4lb_noinline.o. We
      found that none of the programs got rejected by the verifier
      with this change, and that impact is rather minimal to none.
      balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
      7,797 bytes JITed before and after the change. Most complex
      program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
      and 18,538 bytes JITed before and after and none of the other
      tail call programs in bpf_lxc.o had any changes either. For
      the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
      increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
      before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
      after the change. Other programs from that object file had
      similar small increase. Both test_l4lb.o had no change and
      remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
      JITed and for test_l4lb_noinline.o constant at 5,080 bytes
      (634 insns) xlated and 3,313 bytes JITed. This can be explained
      in that LLVM typically optimizes stack based pointer arithmetic
      by using K-based operations and that use of dynamic map access
      is not overly frequent. However, in future we may decide to
      optimize the algorithm further under known guarantees from
      branch and value speculation. Latter seems also unclear in
      terms of prediction heuristics that today's CPUs apply as well
      as whether there could be collisions in e.g. the predictor's
      Value History/Pattern Table for triggering out of bounds access,
      thus masking is performed unconditionally at this point but could
      be subject to relaxation later on. We were generally also
      brainstorming various other approaches for mitigation, but the
      blocker was always lack of available registers at runtime and/or
      overhead for runtime tracking of limits belonging to a specific
      pointer. Thus, we found this to be minimally intrusive under
      given constraints.
      
      With that in place, a simple example with sanitized access on
      unprivileged load at post-verification time looks as follows:
      
        # bpftool prog dump xlated id 282
        [...]
        28: (79) r1 = *(u64 *)(r7 +0)
        29: (79) r2 = *(u64 *)(r7 +8)
        30: (57) r1 &= 15
        31: (79) r3 = *(u64 *)(r0 +4608)
        32: (57) r3 &= 1
        33: (47) r3 |= 1
        34: (2d) if r2 > r3 goto pc+19
        35: (b4) (u32) r11 = (u32) 20479  |
        36: (1f) r11 -= r2                | Dynamic sanitation for pointer
        37: (4f) r11 |= r2                | arithmetic with registers
        38: (87) r11 = -r11               | containing bounded or known
        39: (c7) r11 s>>= 63              | scalars in order to prevent
        40: (5f) r11 &= r2                | out of bounds speculation.
        41: (0f) r4 += r11                |
        42: (71) r4 = *(u8 *)(r4 +0)
        43: (6f) r4 <<= r1
        [...]
      
      For the case where the scalar sits in the destination register
      as opposed to the source register, the following code is emitted
      for the above example:
      
        [...]
        16: (b4) (u32) r11 = (u32) 20479
        17: (1f) r11 -= r2
        18: (4f) r11 |= r2
        19: (87) r11 = -r11
        20: (c7) r11 s>>= 63
        21: (5f) r2 &= r11
        22: (0f) r2 += r0
        23: (61) r0 = *(u32 *)(r2 +0)
        [...]
      
      JIT blinding example with non-conflicting use of r10:
      
        [...]
         d5:	je     0x0000000000000106    _
         d7:	mov    0x0(%rax),%edi       |
         da:	mov    $0xf153246,%r10d     | Index load from map value and
         e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
         e7:	and    %r10,%rdi            |_
         ea:	mov    $0x2f,%r10d          |
         f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
         f3:	or     %rdi,%r10            | but do not interfere with each
         f6:	neg    %r10                 | other. (Neither do these instructions
         f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
         fd:	and    %r10,%rdi            | in interpreter.)
        100:	add    %rax,%rdi            |_
        103:	mov    0x0(%rdi),%eax
       [...]
      
      Tested that it fixes Jann's reproducer, and also checked that test_verifier
      and test_progs suite with interpreter, JIT and JIT with hardening enabled
      on x86-64 and arm64 runs successfully.
      
        [0] Speculose: Analyzing the Security Implications of Speculative
            Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
            https://arxiv.org/pdf/1801.04084.pdf
      
        [1] A Systematic Evaluation of Transient Execution Attacks and
            Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
            Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
            Dmitry Evtyushkin, Daniel Gruss,
            https://arxiv.org/pdf/1811.05441.pdf
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f92a819b
    • D
      bpf: fix check_map_access smin_value test when pointer contains offset · 4f7f708d
      Committed by Daniel Borkmann
      [ commit b7137c4eab85c1cf3d46acdde90ce1163b28c873 upstream ]
      
      In check_map_access() we probe actual bounds through __check_map_access()
      with offset of reg->smin_value + off for lower bound and offset of
      reg->umax_value + off for the upper bound. However, even though the
      reg->smin_value could have a negative value, the final result of the
      sum with off could be positive when pointer arithmetic with known and
      unknown scalars is combined. In this case we reject the program with
      an error such as "R<x> min value is negative, either use unsigned index
      or do a if (index >=0) check." even though the access itself would be
      fine. Therefore extend the check to probe whether the actual resulting
      reg->smin_value + off is less than zero.
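
      A rough sketch of the adjusted lower-bound probe; the field names
      follow the description above, the surrounding context is elided
      and the exact hunk may differ:

        /* Only reject when the summed offset itself can be negative. */
        if (reg->smin_value + off < 0) {
                /* "R<x> min value is negative, ..." */
                return -EACCES;
        }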
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4f7f708d
    • D
      bpf: restrict unknown scalars of mixed signed bounds for unprivileged · 44f8fc64
      Committed by Daniel Borkmann
      [ commit 9d7eceede769f90b66cfa06ad5b357140d5141ed upstream ]
      
      For unknown scalars of mixed signed bounds, meaning their smin_value
      is negative and their smax_value is positive, we need to reject
      arithmetic with a pointer to map value. For unprivileged users the
      goal is to mask every map pointer arithmetic operation, and this
      cannot reliably be done when it is unknown at verification time
      whether the scalar value is negative or positive. Given this is a
      corner case, the likelihood of breaking existing programs should
      be very small.
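
      Sketched as a condition, with names and placement being
      illustrative rather than the exact upstream hunk:

        if (!env->allow_ptr_leaks &&       /* unprivileged load       */
            smin_val < 0 && smax_val > 0)  /* mixed signed bounds     */
                return -EACCES;            /* cannot be masked later  */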
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      44f8fc64
    • D
      bpf: restrict stack pointer arithmetic for unprivileged · 5332dda9
      Committed by Daniel Borkmann
      [ commit e4298d25830a866cc0f427d4bccb858e76715859 upstream ]
      
      Restrict stack pointer arithmetic for unprivileged users in that
      the arithmetic itself must not go out of bounds, as opposed to only
      the actual access later on. Therefore, after each
      adjust_ptr_min_max_vals() with a stack pointer as a destination we
      simulate a check_stack_access() of 1 byte on the destination, and
      once that fails the program is rejected for unprivileged program
      loads. This is analogous to map value pointer arithmetic and is
      needed for masking later on.
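
      In other words, roughly the following probe; the names follow the
      description above, but treat it as a sketch rather than the exact
      hunk:

        /* After adjust_ptr_min_max_vals() with a stack pointer as dst: */
        if (!env->allow_ptr_leaks && dst_reg->type == PTR_TO_STACK &&
            check_stack_access(env, dst_reg,
                               dst_reg->off + dst_reg->var_off.value, 1))
                return -EACCES;  /* the arithmetic itself went OOB */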
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5332dda9
    • D
      bpf: restrict map value pointer arithmetic for unprivileged · 9e57b296
      Committed by Daniel Borkmann
      [ commit 0d6303db7970e6f56ae700fa07e11eb510cda125 upstream ]
      
      Restrict map value pointer arithmetic for unprivileged users in
      that the arithmetic itself must not go out of bounds, as opposed to
      only the actual access later on. Therefore, after each
      adjust_ptr_min_max_vals() with a map value pointer as a destination
      it will simulate a check_map_access() of 1 byte on the destination,
      and once that fails the program is rejected for unprivileged
      program loads. We use this later on for masking any pointer
      arithmetic with the remainder of the map value space. The
      likelihood of breaking any existing real-world unprivileged eBPF
      program is very small for this corner case.
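
      The shape of the probe mirrors the stack pointer case above; again
      a sketch with names following the description, not necessarily the
      exact hunk:

        /* After adjust_ptr_min_max_vals() with a map value pointer as dst: */
        if (!env->allow_ptr_leaks && dst_reg->type == PTR_TO_MAP_VALUE &&
            check_map_access(env, dst /* reg number */, dst_reg->off, 1, false))
                return -EACCES;  /* the arithmetic itself went OOB */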
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9e57b296
    • D
      bpf: enable access to ax register also from verifier rewrite · 232ac70d
      Committed by Daniel Borkmann
      [ commit 9b73bfdd08e73231d6a90ae6db4b46b3fbf56c30 upstream ]
      
      Right now we are using the BPF ax register in the JIT for constant
      blinding as well as in the interpreter as a temporary variable. The
      verifier will not be able to use it simply because its use would
      get overridden by the former in bpf_jit_blind_insn(). However, it
      can be made to work in that blinding is skipped if there is a prior
      use of ax in either the source or destination register of the
      instruction. Taking the constraints of ax into account, the
      verifier is then open to use it in rewrites under some constraints.
      Note, the ax register already has mappings in every eBPF JIT.
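
      Put differently, the blinding path simply leaves such instructions
      alone; a sketch of the bail-out in bpf_jit_blind_insn(), with the
      surrounding context elided:

        /* The verifier rewrite owns ax on this insn, skip blinding it. */
        if (from->dst_reg == BPF_REG_AX || from->src_reg == BPF_REG_AX)
                goto out;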
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      232ac70d
    • D
      bpf: move tmp variable into ax register in interpreter · b855e310
      Committed by Daniel Borkmann
      [ commit 144cd91c4c2bced6eb8a7e25e590f6618a11e854 upstream ]
      
      This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
      into the hidden ax register. The latter is currently only used in JITs
      for constant blinding as a temporary scratch register, meaning the BPF
      interpreter will never see the use of ax. Therefore it is safe to use
      it for the cases where tmp has been used earlier. This is needed to later
      on allow restricted hidden use of ax in both interpreter and JITs.
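
      The gist is that the scratch value now lives in the ax slot of the
      register array instead of on the stack; schematically (a sketch of
      the idea, not the full diff):

        /* ___bpf_prog_run(): scratch space aliases the hidden ax register */
        #define AX      regs[BPF_REG_AX]

                ALU64_MOD_X:
                        div64_u64_rem(DST, SRC, &AX);
                        DST = AX;
                        CONT;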
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b855e310
    • D
      bpf: move {prev_,}insn_idx into verifier env · 333a31c8
      Committed by Daniel Borkmann
      [ commit c08435ec7f2bc8f4109401f696fd55159b4b40cb upstream ]
      
      Move prev_insn_idx and insn_idx from the do_check() function into
      the verifier environment, so they can be read inside the various
      helper functions for handling the instructions. It's easier to put
      this into the environment than to change all call-sites only to
      pass it along. insn_idx is useful in particular since it later on
      allows holding state in env->insn_aux_data[env->insn_idx].
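
      That is, the two indices simply become members of the verifier
      environment; sketched with the remaining members elided:

        struct bpf_verifier_env {
                u32 insn_idx;
                u32 prev_insn_idx;
                /* remaining members unchanged */
        };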
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      333a31c8
    • A
      bpf: add per-insn complexity limit · 43711294
      Committed by Alexei Starovoitov
      [ commit ceefbc96fa5c5b975d87bf8e89ba8416f6b764d9 upstream ]
      
      A malicious bpf program may try to force the verifier to remember
      a lot of distinct verifier states. Put a limit on the number of
      per-insn 'struct bpf_verifier_state'. Note that hitting the limit
      doesn't reject the program. It potentially makes the verifier do
      more steps to analyze the program. It means that malicious programs
      will hit BPF_COMPLEXITY_LIMIT_INSNS sooner instead of spending cpu
      time walking a long linked list.
      
      The limit of BPF_COMPLEXITY_LIMIT_STATES==64 affects cilium progs
      with a slight increase in the number of "steps" it takes to
      successfully verify the programs:
                             before    after
      bpf_lb-DLB_L3.o         1940      1940
      bpf_lb-DLB_L4.o         3089      3089
      bpf_lb-DUNKNOWN.o       1065      1065
      bpf_lxc-DDROP_ALL.o     28052  |  28162
      bpf_lxc-DUNKNOWN.o      35487  |  35541
      bpf_netdev.o            10864     10864
      bpf_overlay.o           6643      6643
      bpf_lcx_jit.o           38437     38437
      
      But it also makes a malicious program be rejected in 0.4 seconds
      instead of 6.5 seconds. Hence, apply this limit to unprivileged
      programs only.
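
      A sketch of the mechanism with the constant quoted above; the
      exact placement inside is_state_visited() and the variable names
      are illustrative:

        #define BPF_COMPLEXITY_LIMIT_STATES     64

                /* states_cnt was counted while walking this insn's state
                 * list. For unprivileged programs, stop parking new states
                 * once the cap is hit; the program then burns down the
                 * global insn limit instead of making the verifier walk a
                 * long list.
                 */
                if (!env->allow_ptr_leaks &&
                    states_cnt > BPF_COMPLEXITY_LIMIT_STATES)
                        return 0;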
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      43711294
    • A
      bpf: improve verifier branch analysis · 7da6cd69
      Committed by Alexei Starovoitov
      [ commit 4f7b3e82589e0de723780198ec7983e427144c0a upstream ]
      
      Pathological bpf programs may try to force the verifier to explode
      in the number of branch states:
        20: (d5) if r1 s<= 0x24000028 goto pc+0
        21: (b5) if r0 <= 0xe1fa20 goto pc+2
        22: (d5) if r1 s<= 0x7e goto pc+0
        23: (b5) if r0 <= 0xe880e000 goto pc+0
        24: (c5) if r0 s< 0x2100ecf4 goto pc+0
        25: (d5) if r1 s<= 0xe880e000 goto pc+1
        26: (c5) if r0 s< 0xf4041810 goto pc+0
        27: (d5) if r1 s<= 0x1e007e goto pc+0
        28: (b5) if r0 <= 0xe86be000 goto pc+0
        29: (07) r0 += 16614
        30: (c5) if r0 s< 0x6d0020da goto pc+0
        31: (35) if r0 >= 0x2100ecf4 goto pc+0
      
      Teach the verifier to recognize always-taken and never-taken
      branches. This analysis is already done for == and != comparisons.
      Expand it to all other branches.
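
      The decision boils down to comparing the branch immediate against
      the register's tracked bounds; below is a sketch for the unsigned
      '>' case, with the other opcodes following the same pattern (names
      mirror the verifier, but treat this as illustrative):

        /* 1: always taken, 0: never taken, -1: unknown at verification time */
        static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode)
        {
                switch (opcode) {
                case BPF_JGT:
                        if (reg->umin_value > val)
                                return 1;
                        if (reg->umax_value <= val)
                                return 0;
                        break;
                }
                return -1;
        }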
      
      It also helps real bpf programs to be verified faster:
                             before  after
      bpf_lb-DLB_L3.o         2003    1940
      bpf_lb-DLB_L4.o         3173    3089
      bpf_lb-DUNKNOWN.o       1080    1065
      bpf_lxc-DDROP_ALL.o     29584   28052
      bpf_lxc-DUNKNOWN.o      36916   35487
      bpf_netdev.o            11188   10864
      bpf_overlay.o           6679    6643
      bpf_lcx_jit.o           39555   38437
      Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7da6cd69
    • T
      posix-cpu-timers: Unbreak timer rearming · 21c0d162
      Committed by Thomas Gleixner
      commit 93ad0fc088c5b4631f796c995bdd27a082ef33a6 upstream.
      
      The recent commit which prevented a division by 0 issue in the alarm timer
      code broke posix CPU timers as an unwanted side effect.
      
      The reason is that the common rearm code now checks for
      timer->it_interval being 0. What went unnoticed is that the posix
      cpu timer setup does not initialize timer->it_interval as it stores
      the interval in CPU timer specific storage. The reason for the
      separate storage is historical, as the posix CPU timers always had
      a 64bit nanoseconds representation internally while
      timer->it_interval is of type ktime_t, which used to be a modified
      timespec representation on 32bit machines.
      
      Instead of reverting the offending commit and fixing the alarmtimer
      issue in the alarmtimer code, store the interval in
      timer->it_interval at CPU timer setup time so the common code check
      works. This also repairs the existing inconsistency of the posix
      CPU timer code, which kept a single shot timer armed despite the
      interval being 0.
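
      The resulting change is essentially a one-liner at setup time;
      sketched under the assumption of the usual timespec64/ktime
      helpers:

        /* posix_cpu_timer_set(): mirror the CPU-timer private interval
         * into the common field so the generic rearm check sees it.
         */
        timer->it_interval = timespec64_to_ktime(new->it_interval);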
      
      The separate storage can be removed in mainline, but that needs to be a
      separate commit as the current one has to be backported to stable kernels.
      
      Fixes: 0e334db6bb4b ("posix-timers: Fix division by zero bug")
      Reported-by: H.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190111133500.840117406@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      21c0d162
  6. 26 January 2019, 2 commits
    • J
      bpf: relax verifier restriction on BPF_MOV | BPF_ALU · 344b51e7
      Committed by Jiong Wang
      [ Upstream commit e434b8cdf788568ba65a0a0fd9f3cb41f3ca1803 ]
      
      Currently, the destination register is marked as unknown for 32-bit
      sub-register move (BPF_MOV | BPF_ALU) whenever the source register type is
      SCALAR_VALUE.
      
      This is too conservative, in that some valid cases will be
      rejected. In particular, this may turn a constant scalar value into
      an unknown value, which could break some assumptions of the
      verifier.
      
      For example, test_l4lb_noinline.c has the following C code:
      
          struct real_definition *dst
      
      1:  if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6))
      2:    return TC_ACT_SHOT;
      3:
      4:  if (dst->flags & F_IPV6) {
      
      get_packet_dst is responsible for initializing "dst" into a valid
      pointer and returning true (1); otherwise it returns false (0). The
      compiled instruction sequence using alu32 will be:
      
        412: (54) (u32) r7 &= (u32) 1
        413: (bc) (u32) r0 = (u32) r7
        414: (95) exit
      
      insn 413, a BPF_MOV | BPF_ALU, however, will turn r0 into an
      unknown value even though r7 contains SCALAR_VALUE 1.
      
      This causes trouble when the verifier is walking the code path that
      hasn't initialized "dst" inside get_packet_dst, in which case 0 is
      returned and we would then expect the verifier to conclude that
      line 1 in the above C code passes the "if" check and therefore skip
      the fall-through path starting at line 4. Now, because r0 returned
      from the callee has become an unknown value, the verifier won't
      skip analyzing the path starting at line 4, and "dst->flags"
      requires dereferencing the pointer "dst", which actually hasn't
      been initialized for this path.
      
      This patch relaxes the code marking the sub-register move
      destination. For a SCALAR_VALUE, it is safe to just copy the value
      from the source and then truncate it into 32-bit.
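
      Schematically, the handling of a scalar source then becomes
      something like the following; the names follow the verifier, but
      treat this as a sketch rather than the exact hunk:

        if (src_reg->type == SCALAR_VALUE) {
                /* Safe: copy the known scalar, then truncate to 32 bit. */
                *dst_reg = *src_reg;
                dst_reg->live |= REG_LIVE_WRITTEN;
        } else {
                mark_reg_unknown(env, regs, insn->dst_reg);
        }
        coerce_reg_to_size(dst_reg, 4);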
      
      A unit test is also included to demonstrate this issue. That test
      fails before this patch.
      
      This relaxation could let the verifier skip more paths for
      conditional comparisons against an immediate. It also lets the
      verifier record a more accurate/strict value for one register at
      one state; if this state ends up going through exit without
      rejection and is used for state comparison later, then it is
      possible that an inaccurate/permissive value would have been
      better. So the real impact on the number of insns processed by the
      verifier is complex. But all in all, without this fix, valid
      programs could be rejected.
      
      From real benchmarking on kernel selftests and Cilium bpf tests,
      there is no impact on the processed instruction number when tests
      are compiled with default compilation options. There are slight
      improvements when they are compiled with -mattr=+alu32 after this
      patch.
      
      Also, test_xdp_noinline/-mattr=+alu32 now passes verification. It
      was rejected before this fix.
      
      Insn processed before/after this patch:
      
                              default     -mattr=+alu32
      
      Kernel selftest
      
      ===
      test_xdp.o              371/371      369/369
      test_l4lb.o             6345/6345    5623/5623
      test_xdp_noinline.o     2971/2971    rejected/2727
      test_tcp_estats.o       429/429      430/430
      
      Cilium bpf
      ===
      bpf_lb-DLB_L3.o:        2085/2085     1685/1687
      bpf_lb-DLB_L4.o:        2287/2287     1986/1982
      bpf_lb-DUNKNOWN.o:      690/690       622/622
      bpf_lxc.o:              95033/95033   N/A
      bpf_netdev.o:           7245/7245     N/A
      bpf_overlay.o:          2898/2898     3085/2947
      
      NOTE:
        - bpf_lxc.o and bpf_netdev.o compiled by -mattr=+alu32 are rejected by
          verifier due to another issue inside verifier on supporting alu32
          binary.
        - Each cilium bpf program could generate several processed insn number,
          above number is sum of them.
      
      v1->v2:
       - Restrict the change on SCALAR_VALUE.
       - Update benchmark numbers on Cilium bpf tests.
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      344b51e7
    • A
      bpf: Allow narrow loads with offset > 0 · 464b01e4
      Committed by Andrey Ignatov
      [ Upstream commit 46f53a65d2de3e1591636c22b626b09d8684fd71 ]
      
      Currently BPF verifier allows narrow loads for a context field only with
      offset zero. E.g. if there is a __u32 field then only the following
      loads are permitted:
        * off=0, size=1 (narrow);
        * off=0, size=2 (narrow);
        * off=0, size=4 (full).
      
      On the other hand, LLVM can generate a load with an offset
      different from zero that makes sense from the program logic point
      of view, but the verifier doesn't accept it.
      
      E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code:
      
        #define DST_IP4			0xC0A801FEU /* 192.168.1.254 */
        ...
        	if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) &&
      
      where ctx is struct bpf_sock_addr.
      
      Some versions of LLVM can produce the following byte code for it:
      
             8:       71 12 07 00 00 00 00 00         r2 = *(u8 *)(r1 + 7)
             9:       67 02 00 00 18 00 00 00         r2 <<= 24
            10:       18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00         r3 = 4261412864 ll
            12:       5d 32 07 00 00 00 00 00         if r2 != r3 goto +7 <LBB0_6>
      
      where `*(u8 *)(r1 + 7)` means a narrow load of ctx->user_ip4 with
      size=1 and offset=3 (7 - sizeof(ctx->user_family) = 3). This load
      is currently rejected by the verifier.
      
      The verifier code that rejects such loads is in
      bpf_ctx_narrow_access_ok(), which means any is_valid_access
      implementation that uses the function works this way, e.g.
      bpf_skb_is_valid_access() for __sk_buff or
      sock_addr_is_valid_access() for bpf_sock_addr.
      
      The patch makes such loads supported. The offset can be in
      [0; size_default) but has to be a multiple of the load size (see
      the sketch after the list below). E.g. for a __u32 field the
      following loads are supported now:
        * off=0, size=1 (narrow);
        * off=1, size=1 (narrow);
        * off=2, size=1 (narrow);
        * off=3, size=1 (narrow);
        * off=0, size=2 (narrow);
        * off=2, size=2 (narrow);
        * off=0, size=4 (full).
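
      Stated as a compact predicate (a standalone sketch, not the
      upstream helper; the name narrow_load_ok is made up):

        static bool narrow_load_ok(u32 off, u32 size, u32 size_default)
        {
                /* stay within the field; the offset must be a multiple
                 * of the load size
                 */
                return size <= size_default &&
                       off + size <= size_default &&
                       off % size == 0;
        }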
      Reported-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      464b01e4
  7. 13 January 2019, 5 commits