1. 10 January 2018, 1 commit
• bpf: avoid false sharing of map refcount with max_entries · be95a845
  By Daniel Borkmann
In addition to commit b2157399 ("bpf: prevent out-of-bounds
speculation"), also change the layout of struct bpf_map so that
false sharing of fast-path members like max_entries is avoided
when the map's reference counter is altered. To that end, enforce
that they are placed in separate cachelines.
      
      pahole dump after change:
      
        struct bpf_map {
              const struct bpf_map_ops  * ops;                 /*     0     8 */
              struct bpf_map *           inner_map_meta;       /*     8     8 */
              void *                     security;             /*    16     8 */
              enum bpf_map_type          map_type;             /*    24     4 */
              u32                        key_size;             /*    28     4 */
              u32                        value_size;           /*    32     4 */
              u32                        max_entries;          /*    36     4 */
              u32                        map_flags;            /*    40     4 */
              u32                        pages;                /*    44     4 */
              u32                        id;                   /*    48     4 */
              int                        numa_node;            /*    52     4 */
              bool                       unpriv_array;         /*    56     1 */
      
              /* XXX 7 bytes hole, try to pack */
      
              /* --- cacheline 1 boundary (64 bytes) --- */
              struct user_struct *       user;                 /*    64     8 */
              atomic_t                   refcnt;               /*    72     4 */
              atomic_t                   usercnt;              /*    76     4 */
              struct work_struct         work;                 /*    80    32 */
              char                       name[16];             /*   112    16 */
              /* --- cacheline 2 boundary (128 bytes) --- */
      
              /* size: 128, cachelines: 2, members: 17 */
              /* sum members: 121, holes: 1, sum holes: 7 */
        };
      
Now all entries in the first cacheline are read-only throughout
the lifetime of the map, set up once during map creation. The
overall struct size and number of cachelines don't change from
the reordering. struct bpf_map is usually the first member,
embedded in the map structs of specific map implementations, so
also avoid letting those members sit at the end, where they could
share a cacheline with the first map values, e.g. in the array,
since remote CPUs could trigger map updates just as well for those
(easily dirtying members like max_entries intentionally) while
keeping subsequent values in cache.
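An illustrative userspace sketch of the technique (the names and the
64-byte cacheline size below are assumptions, not the kernel's): forcing
the frequently written counters onto their own cacheline means bouncing
that line between cores never evicts the read-mostly bounds fields.

  #include <stdalign.h>
  #include <stdatomic.h>
  #include <stddef.h>
  #include <assert.h>

  #define CACHELINE 64

  struct map_hdr {
          /* read-mostly fast-path fields, written once at creation */
          unsigned int key_size;
          unsigned int value_size;
          unsigned int max_entries;
          /* frequently written; starts the next cacheline, so refcount
           * churn can't cause false sharing with max_entries above */
          alignas(CACHELINE) atomic_int refcnt;
          atomic_int usercnt;
  };

  static_assert(offsetof(struct map_hdr, refcnt) % CACHELINE == 0,
                "refcnt must start a cacheline");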
      
      Quoting from Google's Project Zero blog [1]:
      
        Additionally, at least on the Intel machine on which this was
        tested, bouncing modified cache lines between cores is slow,
        apparently because the MESI protocol is used for cache coherence
        [8]. Changing the reference counter of an eBPF array on one
        physical CPU core causes the cache line containing the reference
        counter to be bounced over to that CPU core, making reads of the
        reference counter on all other CPU cores slow until the changed
        reference counter has been written back to memory. Because the
        length and the reference counter of an eBPF array are stored in
        the same cache line, this also means that changing the reference
        counter on one physical CPU core causes reads of the eBPF array's
        length to be slow on other physical CPU cores (intentional false
        sharing).
      
While this doesn't 'control' the out-of-bounds speculation by
masking the index as in commit b2157399, triggering a manipulation
of the map's reference counter is trivial, so let's not allow it
to easily affect max_entries.
      
Splitting into separate cachelines also generally makes sense from
a performance perspective, in that the fast path won't take a
cache miss when the map gets pinned, reused in other progs, etc.
from the control path, which also avoids unintentional false
sharing.
      
  [1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2. 09 January 2018, 2 commits
• bpf: prevent out-of-bounds speculation · b2157399
  By Alexei Starovoitov
      Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
      memory accesses under a bounds check may be speculated even if the
      bounds check fails, providing a primitive for building a side channel.
      
To avoid leaking kernel data, round up array-based maps and mask the
index after the bounds check, so a speculated load with an out-of-bounds
index will load either a valid value from the array or zero from the
padded area.
      
Unconditionally mask the index for all array types, even when
max_entries is not rounded to a power of 2 for the root user.
When a map is created by an unprivileged user, generate a sequence
of bpf insns that includes an AND operation to make sure that the
JITed code includes the same 'index & index_mask' operation.
      
If a prog_array map is created by an unprivileged user, replace
  bpf_tail_call(ctx, map, index);
with
  if (index < max_entries) {
    index &= map->index_mask;
    bpf_tail_call(ctx, map, index);
  }
(along with rounding up to a power of 2) to prevent out-of-bounds
speculation.
There is a secondary, redundant 'if (index >= max_entries)' check in
the interpreter and in all JITs, but it can be optimized away later if
necessary.
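A rough userspace model of the mitigation (all names below are
illustrative; the array is assumed to be allocated with its entry
count rounded up to a power of two): the unconditional AND is what
constrains the speculative path.

  #include <stdint.h>

  /* round v up to the next power of two (v > 0) */
  static uint32_t roundup_pow2(uint32_t v)
  {
          v--;
          v |= v >> 1; v |= v >> 2; v |= v >> 4;
          v |= v >> 8; v |= v >> 16;
          return v + 1;
  }

  uint32_t lookup(const uint32_t *array, uint32_t max_entries,
                  uint32_t index)
  {
          uint32_t index_mask = roundup_pow2(max_entries) - 1;

          if (index >= max_entries)
                  return 0;       /* architectural rejection */
          /* even if the branch above is mispredicted, the AND keeps
           * the speculated load inside the padded array */
          return array[index & index_mask];
  }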
      
      Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
      cannot be used by unpriv, so no changes there.
      
      That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
      all architectures with and without JIT.
      
v2->v3:
Daniel noticed that the attack could potentially be crafted via syscall
commands without loading a program, so add masking to those paths as well.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
• sctp: fix the handling of ICMP Frag Needed for too small MTUs · b6c5734d
  By Marcelo Ricardo Leitner
      syzbot reported a hang involving SCTP, on which it kept flooding dmesg
      with the message:
      [  246.742374] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too
      low, using default minimum of 512
      
That happened because whenever SCTP hits an ICMP Frag Needed, it tries
to adjust to the new MTU and triggers an immediate retransmission. But
it didn't consider that MTUs smaller than the minimum SCTP MTU allowed
(512) would not cause the PMTU to change, and it issued the
retransmission anyway (thus leading to another ICMP Frag Needed, and so
on).
      
As the IPv4 (ip_rt_min_pmtu=556) and IPv6 (IPV6_MIN_MTU=1280) minimum
MTUs are higher than that, sctp_transport_update_pmtu() is changed to
re-fetch the PMTU that actually got set after our request and, with
that, detect whether there was a real change.
      
      The fix, thus, skips the immediate retransmission if the received ICMP
      resulted in no change, in the hope that SCTP will select another path.
      
      Note: The value being used for the minimum MTU (512,
      SCTP_DEFAULT_MINSEGMENT) is not right and instead it should be (576,
SCTP_MIN_PMTU), but such a change belongs to another patch.
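A hedged sketch of the detection (only sctp_transport_update_pmtu()
is named by the patch; the routing-layer helpers below are stand-ins
invented for illustration):

  #include <stdbool.h>
  #include <stdint.h>

  struct transport { uint32_t pathmtu; };

  /* stand-ins for the routing layer, which may clamp or ignore an
   * MTU below the allowed minimum */
  void route_update_pmtu(struct transport *t, uint32_t pmtu);
  uint32_t route_fetch_pmtu(const struct transport *t);

  /* returns true only if the ICMP actually changed the PMTU, letting
   * the caller skip the immediate retransmission otherwise */
  bool transport_update_pmtu(struct transport *t, uint32_t reported)
  {
          uint32_t old = t->pathmtu;

          route_update_pmtu(t, reported);
          t->pathmtu = route_fetch_pmtu(t);  /* re-fetch what was set */
          return t->pathmtu != old;
  }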
      
      Changes from v1:
      - do not disable PMTU discovery, in the light of commit
      06ad3919 ("[SCTP] Don't disable PMTU discovery when mtu is small")
      and as suggested by Xin Long.
      - changed the way to break the rtx loop by detecting if the icmp
        resulted in a change or not
      Changes from v2:
      none
      
See-also: https://lkml.org/lkml/2017/12/22/811
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3. 06 January 2018, 2 commits
4. 05 January 2018, 1 commit
5. 04 January 2018, 1 commit
6. 03 January 2018, 2 commits
• uapi libc compat: add fallback for unsupported libcs · c0bace79
  By Felix Janda
      libc-compat.h aims to prevent symbol collisions between uapi and libc
      headers for each supported libc. This requires continuous coordination
      between them.
      
      The goal of this commit is to improve the situation for libcs (such as
      musl) which are not yet supported and/or do not wish to be explicitly
      supported, while not affecting supported libcs. More precisely, with
      this commit, unsupported libcs can request the suppression of any
specific uapi definition by defining the corresponding _UAPI_DEF_*
      macro as 0. This can fix symbol collisions for them, as long as the
      libc headers are included before the uapi headers. Inclusion in the
      other order is outside the scope of this commit.
      
All the infrastructure needed to enable this fallback for unsupported
libcs is already in place, except that libc-compat.h unconditionally
defines all _UAPI_DEF_* macros to 1 for all unsupported libcs, so that
any previous definitions are ignored. To fix this, this commit merely
makes these definitions conditional.
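The conditional-default pattern, sketched (_UAPI_DEF_IN6_ADDR is a
real libc-compat.h macro name; the struct body here is illustrative):

  /* uapi header: define the type by default, but let a libc that
   * already provides it opt out by pre-defining the macro to 0 */
  #ifndef _UAPI_DEF_IN6_ADDR
  #define _UAPI_DEF_IN6_ADDR 1
  #endif

  #if _UAPI_DEF_IN6_ADDR
  struct in6_addr {
          unsigned char s6_addr[16];      /* illustrative body */
  };
  #endif

A libc like musl can then '#define _UAPI_DEF_IN6_ADDR 0' in its own
<netinet/in.h>, provided it is included before the uapi header.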
      
      This commit together with the musl libc commit
      
      http://git.musl-libc.org/cgit/musl/commit/?id=04983f2272382af92eb8f8838964ff944fbb8258
      
      fixes for example the following compiler errors when <linux/in6.h> is
      included after musl's <netinet/in.h>:
      
      ./linux/in6.h:32:8: error: redefinition of 'struct in6_addr'
      ./linux/in6.h:49:8: error: redefinition of 'struct sockaddr_in6'
      ./linux/in6.h:59:8: error: redefinition of 'struct ipv6_mreq'
      
The comments referencing glibc are still correct, but this file is no
longer used only for glibc.
Signed-off-by: Felix Janda <felix.janda@posteo.de>
Reviewed-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
• efi/capsule-loader: Reinstate virtual capsule mapping · f24c4d47
  By Ard Biesheuvel
      Commit:
      
        82c3768b ("efi/capsule-loader: Use a cached copy of the capsule header")
      
      ... refactored the capsule loading code that maps the capsule header,
      to avoid having to map it several times.
      
      However, as it turns out, the vmap() call we ended up removing did not
      just map the header, but the entire capsule image, and dropping this
      virtual mapping breaks capsules that are processed by the firmware
      immediately (i.e., without a reboot).
      
      Unfortunately, that change was part of a larger refactor that allowed
      a quirk to be implemented for Quark, which has a non-standard memory
      layout for capsules, and we have slightly painted ourselves into a
      corner by allowing quirk code to mangle the capsule header and memory
      layout.
      
      So we need to fix this without breaking Quark. Fortunately, Quark does
      not appear to care about the virtual mapping, and so we can simply
      do a partial revert of commit:
      
        2a457fb3 ("efi/capsule-loader: Use page addresses rather than struct page pointers")
      
      ... and create a vmap() mapping of the entire capsule (including header)
      based on the reinstated struct page array, unless running on Quark, in
      which case we pass the capsule header copy as before.
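Sketched shape of the reinstated mapping (the struct field names
below are assumptions for illustration; vmap() itself maps a struct
page array contiguously into kernel virtual address space):

  /* Map the entire capsule, header included, so firmware that
   * consumes it immediately sees the whole image; on Quark the
   * cached header copy is handed over instead. */
  cap_info->capsule = vmap(cap_info->pages, cap_info->index,
                           VM_MAP, PAGE_KERNEL);
  if (!cap_info->capsule)
          return -ENOMEM;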
Reported-by: Ge Song <ge.song@hxt-semitech.com>
Tested-by: Bryan O'Donoghue <pure.logic@nexus-software.ie>
Tested-by: Ge Song <ge.song@hxt-semitech.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: <stable@vger.kernel.org>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Fixes: 82c3768b ("efi/capsule-loader: Use a cached copy of the capsule header")
Link: http://lkml.kernel.org/r/20180102172110.17018-3-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
7. 02 January 2018, 1 commit
• fscache: Fix the default for fscache_maybe_release_page() · 98801506
  By David Howells
      Fix the default for fscache_maybe_release_page() for when the cookie isn't
      valid or the page isn't cached.  It mustn't return false as that indicates
      the page cannot yet be freed.
      
      The problem with the default is that if, say, there's no cache, but a
      network filesystem's pages are using up almost all the available memory, a
      system can OOM because the filesystem ->releasepage() op will not allow
      them to be released as fscache_maybe_release_page() incorrectly prevents
      it.
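A sketch of the corrected default (the wrapper's exact shape is
assumed here; the key point is returning true whenever fscache has no
stake in the page):

  static inline
  bool fscache_maybe_release_page(struct fscache_cookie *cookie,
                                  struct page *page, gfp_t gfp)
  {
          if (fscache_cookie_valid(cookie) && PageFsCache(page))
                  return __fscache_maybe_release_page(cookie, page, gfp);
          return true;    /* not cached: the page may be freed */
  }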
      
      This can be tested by writing a sequence of 512MiB files to an AFS mount.
      It does not affect NFS or CIFS because both of those wrap the call in a
      check of PG_fscache and it shouldn't bother Ceph as that only has
      PG_private set whilst writeback is in progress.  This might be an issue for
      9P, however.
      
      Note that the pages aren't entirely stuck.  Removing a file or unmounting
      will clear things because that uses ->invalidatepage() instead.
      
      Fixes: 201a1542 ("FS-Cache: Handle pages pending storage that get evicted under OOM conditions")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
      cc: stable@vger.kernel.org # 2.6.32+
8. 30 December 2017, 3 commits
• timers: Reinitialize per cpu bases on hotplug · 26456f87
  By Thomas Gleixner
The timer wheel bases are not (re)initialized on CPU hotplug. That leaves
them with a potentially stale clk and next_expiry value, which can cause
trouble when the CPU is plugged.
      
Add a prepare callback which forwards the clock, sets next_expiry far in
the future and resets the control flags to a known state.
      
      Set base->must_forward_clk so the first timer which is queued will try to
      forward the clock to current jiffies.
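A sketch of such a prepare callback (close in spirit to the patch but
not quoting it; NR_BASES, timer_bases and the flag names are the
kernel's, the loop body is paraphrased):

  int timers_prepare_cpu(unsigned int cpu)
  {
          struct timer_base *base;
          int b;

          for (b = 0; b < NR_BASES; b++) {
                  base = per_cpu_ptr(&timer_bases[b], cpu);
                  base->clk = jiffies;                /* forward the clock */
                  base->next_expiry = base->clk + NEXT_TIMER_MAX_DELTA;
                  base->is_idle = false;              /* known flag state */
                  base->must_forward_clk = true;      /* re-sync on first timer */
          }
          return 0;
  }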
      
      Fixes: 500462a9 ("timers: Switch to a non-cascading wheel")
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272152200.2431@nanos
• genirq/irqdomain: Rename early argument of irq_domain_activate_irq() · 702cb0a0
  By Thomas Gleixner
      The 'early' argument of irq_domain_activate_irq() is actually used to
      denote reservation mode. To avoid confusion, rename it before abuse
      happens.
      
      No functional change.
      
      Fixes: 72491643 ("genirq/irqdomain: Update irq_domain_ops.activate() signature")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandru Chirvasitu <achirvasub@gmail.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Mikael Pettersson <mikpelinux@gmail.com>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: Mihai Costache <v-micos@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Simon Xiao <sixiao@microsoft.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: devel@linuxdriverproject.org
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Alan Cox <alan@linux.intel.com>
Cc: Sakari Ailus <sakari.ailus@intel.com>
      Cc: linux-media@vger.kernel.org
• genirq: Introduce IRQD_CAN_RESERVE flag · 69790ba9
  By Thomas Gleixner
      Add a new flag to mark interrupts which can use reservation mode. This is
      going to be used in subsequent patches to disable reservation mode for a
      certain class of MSI devices.
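A sketch of the flag plumbing (the flag name comes from the subject;
the helper names follow the usual irqd pattern and the bit position is
an assumption):

  enum {
          /* ... existing IRQD_* flags ... */
          IRQD_CAN_RESERVE        = (1 << 26),  /* reservation mode usable */
  };

  static inline bool irqd_can_reserve(struct irq_data *d)
  {
          return __irqd_to_state(d) & IRQD_CAN_RESERVE;
  }

  static inline void irqd_set_can_reserve(struct irq_data *d)
  {
          __irqd_to_state(d) |= IRQD_CAN_RESERVE;
  }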
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Mikael Pettersson <mikpelinux@gmail.com>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: Mihai Costache <v-micos@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Simon Xiao <sixiao@microsoft.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: devel@linuxdriverproject.org
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Alan Cox <alan@linux.intel.com>
Cc: Sakari Ailus <sakari.ailus@intel.com>
      Cc: linux-media@vger.kernel.org
9. 29 December 2017, 1 commit
10. 28 December 2017, 2 commits
11. 27 December 2017, 1 commit
• tcp: Avoid preprocessor directives in tracepoint macro args · 6a6b0b99
  By Mat Martineau
      Using a preprocessor directive to check for CONFIG_IPV6 in the middle of
      a DECLARE_EVENT_CLASS macro's arg list causes sparse to report a series
      of errors:
      
      ./include/trace/events/tcp.h:68:1: error: directive in argument list
      ./include/trace/events/tcp.h:75:1: error: directive in argument list
      ./include/trace/events/tcp.h:144:1: error: directive in argument list
      ./include/trace/events/tcp.h:151:1: error: directive in argument list
      ./include/trace/events/tcp.h:216:1: error: directive in argument list
      ./include/trace/events/tcp.h:223:1: error: directive in argument list
      ./include/trace/events/tcp.h:274:1: error: directive in argument list
      ./include/trace/events/tcp.h:281:1: error: directive in argument list
      
      Once sparse finds an error, it stops printing warnings for the file it
      is checking. This masks any sparse warnings that would normally be
      reported for the core TCP code.
      
      Instead, handle the preprocessor conditionals in a couple of auxiliary
      macros. This also has the benefit of reducing duplicate code.
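A sketch of the auxiliary-macro technique (the macro name below is
illustrative, not necessarily the one the patch introduces):

  /* Hoist the conditional out of the DECLARE_EVENT_CLASS() argument
   * list: the directive now wraps a whole macro definition, which
   * sparse accepts, and both branches share one definition site. */
  #if IS_ENABLED(CONFIG_IPV6)
  #define TP_STORE_ADDRS_V6(entry, sk)                            \
          memcpy(entry->saddr_v6, &(sk)->sk_v6_rcv_saddr, 16)
  #else
  #define TP_STORE_ADDRS_V6(entry, sk)    do { } while (0)
  #endif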
      
      Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12. 24 December 2017, 1 commit
• x86/mm/pti: Add infrastructure for page table isolation · aa8c6248
  By Thomas Gleixner
      Add the initial files for kernel page table isolation, with a minimal init
      function and the boot time detection for this misfeature.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
13. 23 December 2017, 2 commits
• init: Invoke init_espfix_bsp() from mm_init() · 613e396b
  By Thomas Gleixner
      init_espfix_bsp() needs to be invoked before the page table isolation
      initialization. Move it into mm_init() which is the place where pti_init()
      will be added.
      
      While at it get rid of the #ifdeffery and provide proper stub functions.
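The stub pattern referred to, sketched (the config symbol follows
x86's espfix support):

  /* header: callers may invoke init_espfix_bsp() unconditionally */
  #ifdef CONFIG_X86_ESPFIX64
  extern void init_espfix_bsp(void);
  #else
  static inline void init_espfix_bsp(void) { }
  #endif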
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• arch, mm: Allow arch_dup_mmap() to fail · c10e83f5
  By Thomas Gleixner
In order to sanitize the LDT initialization on x86, arch_dup_mmap() must
be allowed to fail. Fix up all instances.
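The shape of the change on an architecture with nothing to do,
sketched (previously the hook returned void):

  static inline int arch_dup_mmap(struct mm_struct *oldmm,
                                  struct mm_struct *mm)
  {
          return 0;       /* may now report failure, e.g. -ENOMEM */
  }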
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: dan.j.williams@intel.com
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: kirill.shutemov@linux.intel.com
      Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
14. 22 December 2017, 3 commits
• crypto: af_alg - Fix race around ctx->rcvused by making it atomic_t · af955bf1
  By Jonathan Cameron
This variable was increased and decreased without any protection. The
result was an occasional miscount and negative wraparound, resulting
in false resource allocation failures.
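A sketch of the conversion (the field name comes from the subject;
surrounding code is elided):

  /* before: a plain int, so concurrent ctx->rcvused++ / ctx->rcvused--
   * from async callbacks could lose updates and go negative */
  atomic_t rcvused;

  atomic_set(&ctx->rcvused, 0);   /* init */
  atomic_inc(&ctx->rcvused);      /* account a receive */
  atomic_dec(&ctx->rcvused);      /* release, with no lost updates */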
      
      Fixes: 7d2c3f54 ("crypto: af_alg - remove locking in async callback")
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
• IB/mlx5: Fix congestion counters in LAG mode · 71a0ff65
  By Majd Dibbiny
      Congestion counters are counted and queried per physical function.
      When working in LAG mode, CNP packets can be sent or received on both
      of the functions, thus congestion counters should be aggregated from
      the two physical functions.
      
      Fixes: e1f24a79 ("IB/mlx5: Support congestion related counters")
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Reviewed-by: Aviv Heller <avivh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
• net: reevalulate autoflowlabel setting after sysctl setting · 513674b5
  By Shaohua Li
sysctl.ip6.auto_flowlabels defaults to 1. On our hosts, we set it to 2.
If the sockopt doesn't set autoflowlabel, outgoing packets from the hosts
are supposed to not include a flowlabel. This is true for normal packets,
but not for reset packets.

The reason is that ipv6_pinfo.autoflowlabel is set at sock creation. If
we later change sysctl.ip6.auto_flowlabels, ipv6_pinfo.autoflowlabel
isn't changed, so the sock keeps the old behavior in terms of auto
flowlabel. Reset packets suffer from this problem because they are sent
from a special control socket, which is created at boot time. Since
sysctl.ipv6.auto_flowlabels is 1 by default, the control socket will
always have its ipv6_pinfo.autoflowlabel set, even after the user sets
sysctl.ipv6.auto_flowlabels to 2, so reset packets will always have a
flowlabel. Normal socks created before the sysctl change suffer from
the same issue. We can't even turn off autoflowlabel unless we kill all
socks on the hosts.
      
To fix this, if the IPV6_AUTOFLOWLABEL sockopt is used, we use the
autoflowlabel setting from the user; otherwise we always call
ip6_default_np_autolabel(), which reflects the current sysctl setting.
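A sketch of the decision helper (close in spirit to the fix; the
autoflowlabel_set field name is an assumption):

  static bool ip6_autoflowlabel(struct net *net,
                                const struct ipv6_pinfo *np)
  {
          /* honor an explicit IPV6_AUTOFLOWLABEL sockopt... */
          if (np->autoflowlabel_set)
                  return np->autoflowlabel;
          /* ...otherwise re-evaluate the current sysctl every time */
          return ip6_default_np_autolabel(net);
  }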
      
Note, this changes behavior a little bit. Before commit 42240901
(ipv6: Implement different admin modes for automatic flow labels), the
autoflowlabel behavior of a sock wasn't sticky, e.g. if the sysctl
changed, existing connections would change autoflowlabel behavior. After
that commit, autoflowlabel behavior is sticky for the whole life of the
sock. With this patch, the behavior isn't sticky again.
      
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Tom Herbert <tom@quantonium.net>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15. 21 December 2017, 6 commits
• netfilter: uapi: correct UNTRACKED conntrack state bit number · 4c82fd0a
  By Florian Westphal
      nft_ct exposes this bit to userspace.  This used to be
      
        #define NF_CT_STATE_UNTRACKED_BIT              (1 << (IP_CT_NUMBER + 1))
        (IP_CT_NUMBER is 5, so this was 0x40)
      
... but this got changed to 8 (0x100) when the untracked object got
removed. Replace this with a literal 6 to prevent further incompatible
changes in case IP_CT_NUMBER ever increases.
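The resulting definition, per the message (bit 6 == 0x40, matching the
pre-change ABI):

  #define NF_CT_STATE_UNTRACKED_BIT      (1 << 6)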
      
      Fixes: cc41c84b ("netfilter: kill the fake untracked conntrack objects")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
• bpf: fix integer overflows · bb7f0f98
  By Alexei Starovoitov
      There were various issues related to the limited size of integers used in
      the verifier:
       - `off + size` overflow in __check_map_access()
       - `off + reg->off` overflow in check_mem_access()
       - `off + reg->var_off.value` overflow or 32-bit truncation of
         `reg->var_off.value` in check_mem_access()
       - 32-bit truncation in check_stack_boundary()
      
      Make sure that any integer math cannot overflow by not allowing
      pointer math with large values.
      
      Also reduce the scope of "scalar op scalar" tracking.
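An illustration of the failure class, not verifier code: with 32-bit
arithmetic a huge offset wraps and a naive bounds check passes.

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint32_t off = 0xfffffff8u, size = 16, limit = 64;

          if (off + size <= limit)    /* wraps to 8, so the check passes */
                  printf("bogus access at off=%u accepted\n", off);
          return 0;
  }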
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
• block: unalign call_single_data in struct request · 4ccafe03
  By Jens Axboe
      A previous change blindly added massive alignment to the
      call_single_data structure in struct request. This ballooned it in size
      from 296 to 320 bytes on my setup, for no valid reason at all.
      
      Use the unaligned struct __call_single_data variant instead.
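The swap, sketched (surrounding request fields elided; struct
__call_single_data is the layout-compatible variant without the forced
alignment):

  struct request {
          /* ... */
          struct __call_single_data csd;  /* was: call_single_data,
                                           * which is cacheline-aligned */
          /* ... */
  };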
      
      Fixes: 966a9671 ("smp: Avoid using two cache lines for struct call_single_data")
      Cc: stable@vger.kernel.org # v4.14
Signed-off-by: Jens Axboe <axboe@kernel.dk>
• xen/balloon: Mark unallocated host memory as UNUSABLE · b3cf8528
  By Boris Ostrovsky
      Commit f5775e0b ("x86/xen: discard RAM regions above the maximum
      reservation") left host memory not assigned to dom0 as available for
      memory hotplug.
      
      Unfortunately this also meant that those regions could be used by
      others. Specifically, commit fa564ad9 ("x86/PCI: Enable a 64bit BAR
      on AMD Family 15h (Models 00-1f, 30-3f, 60-7f)") may try to map those
      addresses as MMIO.
      
      To prevent this mark unallocated host memory as E820_TYPE_UNUSABLE (thus
      effectively reverting f5775e0b) and keep track of that region as
      a hostmem resource that can be used for the hotplug.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
• block-throttle: avoid double charge · 111be883
  By Shaohua Li
If a bio is throttled and split after throttling, the bio could be
resubmitted and enter throttling again. This will cause part of the
bio to be charged multiple times. If the cgroup has an IO limit, the
double charge will significantly harm performance. Bio splits have
become quite common since the arbitrary bio size change.

To fix this, we always set the BIO_THROTTLED flag if a bio is throttled.
If the bio is cloned/split, we copy the flag to the new bio too to avoid
a double charge. However, a cloned bio could be directed to a new disk,
in which case keeping the flag would be a problem. The observation is
that we always set a new disk for the bio in this case, so we can clear
the flag in bio_set_dev().
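A sketch of the clearing point (simplified; the real macro may do
more, e.g. partition handling):

  /* a cloned bio redirected to a different disk must go through
   * throttling again, so drop the flag on device change */
  #define bio_set_dev(bio, bdev)                          \
  do {                                                    \
          if ((bio)->bi_disk != (bdev)->bd_disk)          \
                  bio_clear_flag(bio, BIO_THROTTLED);     \
          (bio)->bi_disk = (bdev)->bd_disk;               \
          (bio)->bi_partno = (bdev)->bd_partno;           \
  } while (0)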
      
This issue has existed for a long time; the arbitrary bio size change
just makes it worse, so this should go into stable, at least back to v4.2.
      
      V1-> V2: Not add extra field in bio based on discussion with Tejun
      
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: stable@vger.kernel.org
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
• cls_bpf: fix offload assumptions after callback conversion · 102740bd
  By Jakub Kicinski
cls_bpf used to take care of tracking what offload state a filter
is in, i.e. it would track whether the offload request succeeded or not.
This information would then be used to issue correct requests to
the driver, e.g. requests for statistics only on offloaded filters,
removing only filters which were offloaded, or using add instead of
replace if the previous filter was not added, etc.
      
      This tracking of offload state no longer functions with the new
      callback infrastructure.  There could be multiple entities trying
      to offload the same filter.
      
Throw out all the tracking and corresponding commands and simply
pass both the old and the new bpf program to the drivers. Drivers will
have to deal with offload state tracking by themselves.
      
      Fixes: 3f7889c4 ("net: sched: cls_bpf: call block callbacks for offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16. 20 December 2017, 3 commits
• net/mlx5: Cleanup IRQs in case of unload failure · d6b2785c
  By Moshe Shemesh
When mlx5_stop_eqs fails to destroy any of the EQs, it returns with an
error. In such a failure flow the function will return without releasing
all EQ irqs, and pci_free_irq_vectors will then fail. Fix by only
warning on destroy EQ failure and continuing to release the other EQs
and their irqs.
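The shape of the fix, sketched (function names follow the mlx5 core of
that era; the loop details are an assumption):

  /* mlx5_stop_eqs(): don't bail out on the first destroy failure;
   * warn and keep going so every EQ's irq is released before
   * pci_free_irq_vectors() runs */
  list_for_each_entry_safe(eq, n, &table->comp_eqs_list, list) {
          err = mlx5_destroy_unmap_eq(dev, eq);
          if (err)
                  mlx5_core_warn(dev, "failed to destroy comp EQ 0x%x, err %d\n",
                                 eq->eqn, err);
          /* continue instead of returning err */
  }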
      
      It fixes the following kernel trace:
      kernel: kernel BUG at drivers/pci/msi.c:352!
      ...
      ...
      kernel: Call Trace:
      kernel: pci_disable_msix+0xd3/0x100
      kernel: pci_free_irq_vectors+0xe/0x20
      kernel: mlx5_load_one.isra.17+0x9f5/0xec0 [mlx5_core]
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
• net/mlx5: Fix rate limit packet pacing naming and struct · 37e92a9d
  By Eran Ben Elisha
In mlx5_ifc, the struct size was not complete, and thus the driver was
sending garbage after the last defined field. Fix it by adding a
reserved field to complete the struct size.
      
      In addition, rename all set_rate_limit to set_pp_rate_limit to be
      compliant with the Firmware <-> Driver definition.
      
      Fixes: 7486216b ("{net,IB}/mlx5: mlx5_ifc updates")
      Fixes: 1466cc5b ("net/mlx5: Rate limit tables support")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
• Revert "mlx5: move affinity hints assignments to generic code" · 231243c8
  By Saeed Mahameed
Before the offending commit, mlx5 core did the IRQ affinity itself,
and it seems that the new generic code has some drawbacks, one of
them being that the user loses the ability to modify irq affinity
after the initial affinity values are assigned.

The issue is still being discussed and a solution in the new generic
code is required; until then we need to revert this patch.

This fixes the following issue:
echo <new affinity> > /proc/irq/<x>/smp_affinity
fails with -EIO
      
      This reverts commit a435393a.
Note: mlx5_get_vector_affinity is kept in include/linux/mlx5/driver.h
since it is used in the mlx5_ib driver.
      
      Fixes: a435393a ("mlx5: move affinity hints assignments to generic code")
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jes Sorensen <jsorensen@fb.com>
Reported-by: Jes Sorensen <jsorensen@fb.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
17. 19 December 2017, 4 commits
18. 18 December 2017, 2 commits
19. 17 December 2017, 2 commits