1. 02 Dec 2020, 1 commit
    • kernel: Implement selective syscall userspace redirection · 1446e1df
      Gabriel Krisman Bertazi authored
      Introduce a mechanism to quickly disable/enable syscall handling for a
      specific process and redirect to userspace via SIGSYS.  This is useful
      for processes with parts that require syscall redirection and parts that
      don't, but that need to perform this boundary crossing really fast,
      without paying the cost of a system call to reconfigure syscall handling
      on each boundary transition.  This is particularly important for Windows
      games running over Wine.
      
      The proposed interface looks like this:
      
        prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
      
      The range [<off>,<off>+<length>) is a part of the process memory
      map that is allowed to bypass the redirection code and dispatch
      syscalls directly, such that in fast paths a process doesn't need to
      disable the trap, nor does the kernel have to check the selector.  This is
      essential to return from SIGSYS to a blocked area without triggering
      another SIGSYS from rt_sigreturn.
      
      selector is an optional pointer to a char-sized userspace memory region
      that acts as a key switch for the mechanism. This key switch is set to
      either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF to enable or disable the
      redirection without calling the kernel.
      
      The feature is meant to be set per-thread and it is disabled on
      fork/clone/execv.
      
      Internally, this doesn't add overhead to the syscall hot path, and it
      requires very little per-architecture support.  I avoided using seccomp,
      even though it duplicates some functionality, due to previous feedback
      that maybe it shouldn't mix with seccomp since it is not a security
      mechanism.  And obviously, this should never be considered a security
      mechanism, since any part of the program can bypass it by using the
      syscall dispatcher.
      
      For the sysinfo benchmark, which measures the overhead added to
      executing a native syscall that doesn't require interception, the
      overhead using only the direct dispatcher region to issue syscalls is
      pretty much irrelevant.  The overhead of using the selector goes around
      40ns for a native (unredirected) syscall in my system, and it is (as
      expected) dominated by the supervisor-mode user-address access.  In
      fact, with SMAP off, the overhead is consistently less than 5ns on my
      test box.
      Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
  2. 25 Nov 2020, 1 commit
  3. 19 Nov 2020, 1 commit
    • context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK · 179a9cf7
      Frederic Weisbecker authored
      The typical steps with context tracking are:
      
      1) Task runs in userspace
      2) Task enters the kernel (syscall/exception/IRQ)
      3) Task switches from context tracking state CONTEXT_USER to
         CONTEXT_KERNEL (user_exit())
      4) Task does stuff in kernel
      5) Task switches from context tracking state CONTEXT_KERNEL to
         CONTEXT_USER (user_enter())
      6) Task exits the kernel
      
      If an exception fires between 5) and 6), the pt_regs and the context
      tracking disagree on the context of the faulted/trapped instruction.
      CONTEXT_KERNEL must be set before the exception handler, that's
      unconditional for those handlers that want to be able to call into
      schedule(), but CONTEXT_USER must be restored when the exception exits
      whereas pt_regs tells that we are resuming to kernel space.
      
      This can't be fixed with storing the context tracking state in a per-cpu
      or per-task variable since another exception may fire onto the current
      one and overwrite the saved state. Also the task can schedule. So it
      has to be stored in a per task stack.
      
      This is how exception_enter()/exception_exit() paper over the problem:
      
      5) Task switches from context tracking state CONTEXT_KERNEL to
         CONTEXT_USER (user_enter())
      5.1) Exception fires
      5.2) prev_state = exception_enter() // save CONTEXT_USER to prev_state
                                          // and set CONTEXT_KERNEL
      5.3) Exception handler
      5.4) exception_exit(prev_state) // restore CONTEXT_USER
      5.5) Exception resumes
      6) Task exits the kernel
      
      The condition to live without exception_enter()/exception_exit() is to
      forbid exceptions and IRQs between 2) and 3) and between 5) and 6), or if
      any is allowed to trigger, it must not call into context tracking (e.g. NMIs)
      and it won't schedule. These requirements are met by architectures
      supporting CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK and those can
      therefore afford not to implement this hack.
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201117151637.259084-3-frederic@kernel.org
  4. 17 Nov 2020, 8 commits
  5. 16 Nov 2020, 1 commit
  6. 05 Nov 2020, 1 commit
  7. 31 Oct 2020, 1 commit
  8. 30 Oct 2020, 7 commits
  9. 29 Oct 2020, 6 commits
  10. 28 Oct 2020, 4 commits
    • module: use hidden visibility for weak symbol references · 13150bc5
      Ard Biesheuvel authored
      Geert reports that commit be288182 ("arm64/build: Assert for
      unwanted sections") results in build errors on arm64 for configurations
      that have CONFIG_MODULES disabled.
      
      The commit in question added ASSERT()s to the arm64 linker script to
      ensure that linker generated sections such as .got.plt etc are empty,
      but as it turns out, there are corner cases where the linker does emit
      content into those sections. More specifically, weak references to
      function symbols (which can remain unsatisfied, and can therefore not
      be emitted as relative references) will be emitted as GOT and PLT
      entries when linking the kernel in PIE mode (which is the case when
      CONFIG_RELOCATABLE is enabled, which is on by default).
      
      What happens is that code such as
      
      	struct device *(*fn)(struct device *dev);
      	struct device *iommu_device;
      
      	fn = symbol_get(mdev_get_iommu_device);
      	if (fn) {
      		iommu_device = fn(dev);
      		...
      	}
      
      essentially gets converted into the following when CONFIG_MODULES is off:
      
      	struct device *iommu_device;
      
      	if (&mdev_get_iommu_device) {
      		iommu_device = mdev_get_iommu_device(dev);
      		...
      	}
      
      where mdev_get_iommu_device is emitted as a weak symbol reference into
      the object file. The first reference is decorated with an ordinary
      ABS64 data relocation (which yields 0x0 if the reference remains
      unsatisfied). However, the indirect call is turned into a direct call
      covered by a R_AARCH64_CALL26 relocation, which is converted into a
      call via a PLT entry taking the target address from the associated
      GOT entry.
      
      Given that such GOT and PLT entries are unnecessary for fully linked
      binaries such as the kernel, let's give these weak symbol references
      hidden visibility, so that the linker knows that the weak reference
      via R_AARCH64_CALL26 can simply remain unsatisfied.
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Fangrui Song <maskray@google.com>
      Acked-by: Jessica Yu <jeyu@kernel.org>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Link: https://lore.kernel.org/r/20201027151132.14066-1-ardb@kernel.org
      Signed-off-by: Will Deacon <will@kernel.org>
    • usb: fix kernel-doc markups · cbdc0f54
      Mauro Carvalho Chehab authored
      An ordinary comment is marked, incorrectly, with kernel-doc
      notation.
      
      Also, some identifiers have different names between their
      prototypes and the kernel-doc markup.
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Acked-by: Felipe Balbi <balbi@kernel.org>
      Link: https://lore.kernel.org/r/0b964be3884def04fcd20ea5c12cb90d0014871c.1603469755.git.mchehab+huawei@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: arm64: ARM_SMCCC_ARCH_WORKAROUND_1 doesn't return SMCCC_RET_NOT_REQUIRED · 1de111b5
      Stephen Boyd authored
      According to the SMCCC spec[1](7.5.2 Discovery) the
      ARM_SMCCC_ARCH_WORKAROUND_1 function id only returns 0, 1, and
      SMCCC_RET_NOT_SUPPORTED.
      
       0 is "workaround required and safe to call this function"
       1 is "workaround not required but safe to call this function"
       SMCCC_RET_NOT_SUPPORTED is "might be vulnerable or might not be, who knows, I give up!"
      
      SMCCC_RET_NOT_SUPPORTED might as well mean "workaround required, except
      calling this function may not work because it isn't implemented in some
      cases". Wonderful. We map this SMC call to
      
       0 is SPECTRE_MITIGATED
       1 is SPECTRE_UNAFFECTED
       SMCCC_RET_NOT_SUPPORTED is SPECTRE_VULNERABLE
      
      For KVM hypercalls (hvc), we've implemented this function id to return
      SMCCC_RET_NOT_SUPPORTED, 0, and SMCCC_RET_NOT_REQUIRED. One of those
      isn't supposed to be there. Per the code we call
      arm64_get_spectre_v2_state() to figure out what to return for this
      feature discovery call.
      
       0 is SPECTRE_MITIGATED
       SMCCC_RET_NOT_REQUIRED is SPECTRE_UNAFFECTED
       SMCCC_RET_NOT_SUPPORTED is SPECTRE_VULNERABLE
      
      Let's clean this up so that KVM tells the guest this mapping:
      
       0 is SPECTRE_MITIGATED
       1 is SPECTRE_UNAFFECTED
       SMCCC_RET_NOT_SUPPORTED is SPECTRE_VULNERABLE
      
      Note: SMCCC_RET_NOT_AFFECTED is 1 but isn't part of the SMCCC spec
      
      Fixes: c118bbb5 ("arm64: KVM: Propagate full Spectre v2 workaround state to KVM guests")
      Signed-off-by: Stephen Boyd <swboyd@chromium.org>
      Acked-by: Marc Zyngier <maz@kernel.org>
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Andre Przywara <andre.przywara@arm.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://developer.arm.com/documentation/den0028/latest [1]
      Link: https://lore.kernel.org/r/20201023154751.1973872-1-swboyd@chromium.org
      Signed-off-by: Will Deacon <will@kernel.org>
    • cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS driver flag · 1c534352
      Rafael J. Wysocki authored
      Generally, a cpufreq driver may need to update some internal upper
      and lower frequency boundaries on policy max and min changes,
      respectively, but currently this does not work if the target
      frequency does not change along with the policy limit.
      
      Namely, if the target frequency does not change along with the
      policy min or max, the "target_freq == policy->cur" check in
      __cpufreq_driver_target() prevents driver callbacks from being
      invoked and they do not even have a chance to update the
      corresponding internal boundary.
      
      This particularly affects the "powersave" and "performance"
      governors that always set the target frequency to one of the
      policy limits and it never changes when the other limit is updated.
      
      To allow cpufreq drivers that need to update internal frequency
      boundaries on policy limit changes to avoid this issue, introduce
      a new driver flag, CPUFREQ_NEED_UPDATE_LIMITS, that (when set) will
      neutralize the check mentioned above.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
  11. 27 Oct 2020, 1 commit
    • RDMA/mlx5: Fix devlink deadlock on net namespace deletion · fbdd0049
      Parav Pandit authored
      When a mlx5 core devlink instance is reloaded in different net namespace,
      its associated IB device is deleted and recreated.
      
      Example sequence is:
      $ ip netns add foo
      $ devlink dev reload pci/0000:00:08.0 netns foo
      $ ip netns del foo
      
      The mlx5 IB device needs to attach and detach the netdevice through the
      netdev notifier chain during the load and unload sequence.  Below is a
      call graph of the unload flow.
      
      cleanup_net()
         down_read(&pernet_ops_rwsem); <- first sem acquired
           ops_pre_exit_list()
             pre_exit()
               devlink_pernet_pre_exit()
                 devlink_reload()
                   mlx5_devlink_reload_down()
                     mlx5_unload_one()
                     [...]
                       mlx5_ib_remove()
                         mlx5_ib_unbind_slave_port()
                           mlx5_remove_netdev_notifier()
                             unregister_netdevice_notifier()
                         down_write(&pernet_ops_rwsem); <- recursive lock
      
      Hence, when net namespace is deleted, mlx5 reload results in deadlock.
      
      When deadlock occurs, devlink mutex is also held. This not only deadlocks
      the mlx5 device under reload, but all the processes which attempt to
      access unrelated devlink devices are deadlocked.
      
      Hence, fix this by making the mlx5 IB driver register a per-net netdev
      notifier instead of the global one; the per-net notifier operates on the
      net namespace without holding the pernet_ops_rwsem.
      
      Fixes: 4383cfcc ("net/mlx5: Add devlink reload")
      Link: https://lore.kernel.org/r/20201026134359.23150-1-parav@nvidia.com
      Signed-off-by: Parav Pandit <parav@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  12. 26 Oct 2020, 5 commits
  13. 25 Oct 2020, 2 commits
    • random32: add noise from network and scheduling activity · 3744741a
      Willy Tarreau authored
      With the removal of the interrupt perturbations in previous random32
      change (random32: make prandom_u32() output unpredictable), the PRNG
      has become 100% deterministic again. While SipHash is expected to be
      way more robust against brute force than the previous Tausworthe LFSR,
      there's still the risk that whoever has even one temporary access to
      the PRNG's internal state is able to predict all subsequent draws till
      the next reseed (roughly every minute). This may happen through a side
      channel attack or any data leak.
      
      This patch restores the spirit of commit f227e3ec ("random32: update
      the net random state on interrupt and activity") in that it will perturb
      the internal PRNG's state using externally collected noise, except that
      it will not pick that noise from the random pool's bits nor upon
      interrupt, but will rather combine a few elements along the Tx path
      that are collectively hard to predict, such as dev, skb and txq
      pointers, packet length and jiffies values. These are combined
      using a single round of SipHash into a single long variable that is
      mixed with the net_rand_state upon each invocation.
      
      The operation was inlined because it produces very small and efficient
      code, typically 3 xor, 2 add and 2 rol. The performance was measured
      to be the same (even very slightly better) than before the switch to
      SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC
      (i40e), the connection rate dropped from 556k/s to 555k/s while the
      SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.
      
      Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
      Cc: George Spelvin <lkml@sdf.org>
      Cc: Amit Klein <aksecurity@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: tytso@mit.edu
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Marc Plumb <lkml.mplumb@gmail.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • random32: make prandom_u32() output unpredictable · c51f8f88
      George Spelvin authored
      Non-cryptographic PRNGs may have great statistical properties, but
      are usually trivially predictable to someone who knows the algorithm,
      given a small sample of their output.  An LFSR like prandom_u32() is
      particularly simple, even if the sample is widely scattered bits.
      
      It turns out the network stack uses prandom_u32() for some things like
      random port numbers which it would prefer are *not* trivially predictable.
      Predictability led to a practical DNS spoofing attack.  Oops.
      
      This patch replaces the LFSR with a homebrew cryptographic PRNG based
      on the SipHash round function, which is in turn seeded with 128 bits
      of strong random key.  (The authors of SipHash have *not* been consulted
      about this abuse of their algorithm.)  Speed is prioritized over security;
      attacks are rare, while performance is always wanted.
      
      Replacing all callers of prandom_u32() is the quick fix.
      Whether to reinstate a weaker PRNG for uses which can tolerate it
      is an open question.
      
      Commit f227e3ec ("random32: update the net random state on interrupt
      and activity") was an earlier attempt at a solution.  This patch replaces
      it.
      Reported-by: Amit Klein <aksecurity@gmail.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: tytso@mit.edu
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Marc Plumb <lkml.mplumb@gmail.com>
      Fixes: f227e3ec ("random32: update the net random state on interrupt and activity")
      Signed-off-by: George Spelvin <lkml@sdf.org>
      Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
      [ willy: partial reversal of f227e3ec; moved SIPROUND definitions
        to prandom.h for later use; merged George's prandom_seed() proposal;
        inlined siprand_u32(); replaced the net_rand_state[] array with 4
        members to fix a build issue; cosmetic cleanups to make checkpatch
        happy; fixed RANDOM32_SELFTEST build ]
      Signed-off-by: Willy Tarreau <w@1wt.eu>
  14. 23 Oct 2020, 1 commit