1. 03 May, 2021 1 commit
    • bpf: Fix leakage of uninitialized bpf stack under speculation · 801c6058
      Daniel Borkmann committed
      The currently implemented mechanisms to mitigate data disclosure under
      speculation mainly address stack and map value out-of-bounds (oob) access
      from the speculative domain. However, Piotr discovered that uninitialized
      BPF stack is not protected yet, and thus old data from the kernel stack,
      potentially including addresses of kernel structures, could still be
      extracted from that 512-byte window. The BPF stack is special
      compared to map values since it's not zero initialized for every
      program invocation, whereas map values /are/ zero initialized upon
      their initial allocation and thus cannot leak any prior data in either
      domain. In the non-speculative domain, the verifier ensures that every
      stack slot read is preceded by a stack slot write by the BPF program,
      which avoids such data leaks.
      
      However, this is not enough: for example, when the pointer arithmetic
      operation moves the stack pointer from the last valid stack offset to
      the first valid offset, the sanitation logic allows for any intermediate
      offsets during speculative execution, which could then be used to
      extract any restricted stack content via side-channel.
      
      Given that the use of unknown but bounded scalars is generally
      forbidden for unprivileged stack pointer arithmetic, we can simply turn
      the register-based arithmetic operation into an immediate-based
      arithmetic operation without the need for masking. This also reduces
      the number of instructions needed for the operation. After the work in
      7fedb63a ("bpf: Tighten speculative pointer arithmetic mask"),
      aux->alu_limit already holds the final immediate value for the offset
      register with the known scalar. Thus, a simple mov of the immediate
      into the AX register, with AX then used as the source for the original
      instruction, is sufficient and now possible in this case.
      Reported-by: Piotr Krysiuk <piotras@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Piotr Krysiuk <piotras@gmail.com>
      Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
      Reviewed-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
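The rewrite described in the "bpf: Fix leakage of uninitialized bpf stack under speculation" commit can be pictured with a toy model. This is only a sketch under stated assumptions: `struct toy_insn`, the `TOY_*` names and `toy_rewrite_ptr_alu()` are hypothetical and are not the kernel's `struct bpf_insn` or verifier code. It shows the idea of loading the already known offset as an immediate into an auxiliary register and using that register as the source of the original instruction.

```c
#include <stdint.h>

/* Toy instruction model (assumption); NOT the kernel's struct bpf_insn. */
enum toy_op { TOY_MOV_IMM, TOY_ADD_REG, TOY_SUB_REG };

/* Hypothetical number for the auxiliary scratch register ("AX"). */
enum { TOY_REG_AX = 11 };

struct toy_insn {
    enum toy_op op;
    int dst;
    int src;      /* ignored for TOY_MOV_IMM */
    int32_t imm;  /* used only by TOY_MOV_IMM */
};

/*
 * Replace "dst op= src_reg" by the pair
 *     AX = known_off        (immediate move)
 *     dst op= AX            (original instruction, source swapped to AX)
 * so the offset applied under speculation is the constant the verifier
 * already knows, not whatever the source register happens to hold.
 * Returns the number of instructions written to out[].
 */
int toy_rewrite_ptr_alu(struct toy_insn orig, int32_t known_off,
                        struct toy_insn out[2])
{
    out[0] = (struct toy_insn){ .op = TOY_MOV_IMM, .dst = TOY_REG_AX,
                                .src = 0, .imm = known_off };
    out[1] = orig;
    out[1].src = TOY_REG_AX;
    return 2;
}
```

In this model no masking instructions are needed, because the only value that can reach the arithmetic operation is the immediate itself.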
  2. 30 April, 2021 1 commit
    • seg6: add counters support for SRv6 Behaviors · 94604548
      Andrea Mayer committed
      This patch provides counters for SRv6 Behaviors as defined in [1],
      section 6. For each SRv6 Behavior instance, counters defined in [1] are:
      
       - the total number of packets that have been correctly processed;
       - the total amount of traffic in bytes of all packets that have been
         correctly processed;
      
      In addition, this patch introduces a new counter that counts the number of
      packets that have NOT been properly processed (i.e. errors) by an SRv6
      Behavior instance.
      
      Counters are not only interesting for network monitoring purposes (e.g.
      counting the number of packets processed by a given behavior) but they also
      provide a simple tool for checking whether a behavior instance is working
      as we expect or not.
      Counters can be useful for troubleshooting misconfigured SRv6 networks.
      Indeed, an SRv6 Behavior can silently drop packets for very different
      reasons (e.g. wrong SID configuration, interfaces set with SID addresses,
      etc.) without any notification/message to the user.
      
      Due to the nature of SRv6 networks, diagnostic tools such as ping and
      traceroute may be ineffective: paths used for reaching a given router can
      be totally different from the ones followed by probe packets. In addition,
      paths are often asymmetrical and this makes it even more difficult to keep
      up with the journey of the packets and to understand which behaviors are
      actually processing our traffic.
      
      When counters are enabled on an SRv6 Behavior instance, it is possible to
      verify whether packets are actually processed by that behavior and what the
      outcome of the processing is. Therefore, the counters for SRv6 Behaviors offer
      a non-invasive observability point which can be leveraged for both traffic
      monitoring and troubleshooting purposes.
      
      [1] https://www.rfc-editor.org/rfc/rfc8986.html#name-counters
      
      Troubleshooting using SRv6 Behavior counters
      --------------------------------------------
      
      Let's walk through a brief example to see how helpful counters can be for
      SRv6 networks. Consider a node where an SRv6 End Behavior receives an SRv6
      packet whose Segments Left (SL) is equal to 0. In this case, the End
      Behavior (which accepts only packets with SL >= 1) discards the packet and
      increases the error counter.
      This information can be leveraged by the network operator for
      troubleshooting. Indeed, the error counter is telling the user that the
      packet:
      
        (i) arrived at the node;
       (ii) was taken into account by the SRv6 End behavior;
      (iii) but triggered an error during the processing.
      
      The error (iii) could have different causes, such as wrong route
      settings on the node or an invalid SID List carried by the SRv6
      packet. Either way, the error counter rules out two possibilities: that
      the packet never arrived at the node, and that the behavior never
      processed it at all.
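
The counting rule in this example can be sketched in a few lines of C. This is a simplified model, not the seg6local implementation: `struct toy_counters` and `toy_end_behavior()` are hypothetical names, and the only check modelled is the Segments Left test described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-behavior counters: processed packets/bytes and errors. */
struct toy_counters {
    uint64_t packets;
    uint64_t bytes;
    uint64_t errors;
};

/*
 * Model of an End behavior instance: packets with Segments Left == 0 are
 * dropped and counted as errors, all others count as correctly processed.
 * Returns 0 when the packet is processed, -1 when it is dropped.
 */
int toy_end_behavior(struct toy_counters *c, unsigned segments_left,
                     size_t pkt_len)
{
    if (segments_left == 0) {
        c->errors++;
        return -1;
    }
    c->packets++;
    c->bytes += pkt_len;
    return 0;
}
```

A non-zero `errors` value with `packets` stuck at zero is the situation described above: traffic reaches the behavior, but every packet fails processing.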
      
      Turning on/off counters for SRv6 Behaviors
      ------------------------------------------
      
      Each SRv6 Behavior instance can be configured, at the time of its creation,
      to make use of counters.
      This is done through iproute2 which allows the user to create an SRv6
      Behavior instance specifying the optional "count" attribute as shown in the
      following example:
      
       $ ip -6 route add 2001:db8::1 encap seg6local action End count dev eth0
      
      Per-behavior counters can be shown by adding "-s" to the iproute2 command
      line, e.g.:
      
       $ ip -s -6 route show 2001:db8::1
       2001:db8::1 encap seg6local action End packets 0 bytes 0 errors 0 dev eth0
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Impact of counters for SRv6 Behaviors on performance
      ====================================================
      
      To determine the performance impact due to the introduction of counters in
      the SRv6 Behavior subsystem, we have carried out extensive tests.
      
      We chose to test the throughput achieved by the SRv6 End.DX2 Behavior
      because, among all the other behaviors implemented so far, it reaches the
      highest throughput which is around 1.5 Mpps (per core at 2.4 GHz on a
      Xeon(R) CPU E5-2630 v3) on kernel 5.12-rc2 using packets of size ~ 100
      bytes.
      
      Three different tests were conducted in order to evaluate the overall
      throughput of the SRv6 End.DX2 Behavior in the following scenarios:
      
       1) vanilla kernel (without the SRv6 Behavior counters patch) and a single
          instance of an SRv6 End.DX2 Behavior;
       2) patched kernel with SRv6 Behavior counters and a single instance of
          an SRv6 End.DX2 Behavior with counters turned off;
       3) patched kernel with SRv6 Behavior counters and a single instance of
          SRv6 End.DX2 Behavior with counters turned on.
      
      All tests were performed on a testbed deployed on the CloudLab facilities
      [2], a flexible infrastructure dedicated to scientific research on the
      future of Cloud Computing.
      
      Results of tests are shown in the following table:
      
      Scenario (1): average 1504764.81 pps (~1504.76 kpps); std. dev 3956.82 pps
      Scenario (2): average 1501469.78 pps (~1501.47 kpps); std. dev 2979.85 pps
      Scenario (3): average 1501315.13 pps (~1501.32 kpps); std. dev 2956.00 pps
      
      As can be observed, throughputs achieved in scenarios (2),(3) did not
      suffer any observable degradation compared to scenario (1).
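
To put the three averages in proportion, the relative difference against the vanilla baseline can be computed directly. The helper below is only an illustrative calculation (the name `relative_drop_percent()` is made up here), using the averages reported for scenarios (1) to (3).

```c
/*
 * Relative throughput difference of a measurement against a baseline,
 * in percent. Positive values mean the measurement is slower.
 */
double relative_drop_percent(double baseline_pps, double measured_pps)
{
    return (baseline_pps - measured_pps) / baseline_pps * 100.0;
}
```

Calling it with the scenario (1) average as the baseline and the scenario (2) and (3) averages as measurements gives roughly 0.2% in both cases, and both absolute gaps are smaller than the standard deviation reported for scenario (1).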
      
      Thanks to Jakub Kicinski and David Ahern for their valuable suggestions
      and comments provided during the discussion of the proposed RFCs.
      
      [2] https://www.cloudlab.us

      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 29 April, 2021 2 commits
  4. 28 April, 2021 11 commits
    • Fix misc new gcc warnings · e7c6e405
      Linus Torvalds committed
      It seems like Fedora 34 ends up enabling a few new gcc warnings, notably
      "-Wstringop-overread" and "-Warray-parameter".
      
      Both of them cause what seem to be valid warnings in the kernel, where
      we have array size mismatches in function arguments (that are no longer
      just silently converted to a pointer to element, but actually checked).
      
      This fixes most of the trivial ones, by making the function declaration
      match the function definition, and in the case of intel_pm.c, removing
      the over-specified array size from the argument declaration.
      
      At least one 'stringop-overread' warning remains in the i915 driver, but
      that one doesn't have the same obvious trivial fix, and may or may not
      actually be indicative of a bug.
      
      [ It was a mistake to upgrade one of my machines to Fedora 34 while
        being busy with the merge window, but if this is the extent of the
        compiler upgrade problems, things are better than usual    - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
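The kind of trivial fix described in "Fix misc new gcc warnings" can be illustrated with a small standalone example. It is a sketch, not code from the kernel: `toy_sum_key()` and `TOY_KEY_WORDS` are hypothetical. The point is that the array bound written in the declaration matches the one in the definition, which is what "-Warray-parameter" checks.

```c
#include <stddef.h>

#define TOY_KEY_WORDS 4

/*
 * Declaration and definition use the same array bound. Declaring the
 * parameter as, say, key[8] here while the definition says key[4] is the
 * kind of mismatch that -Warray-parameter reports.
 */
unsigned toy_sum_key(const unsigned key[TOY_KEY_WORDS]);

unsigned toy_sum_key(const unsigned key[TOY_KEY_WORDS])
{
    unsigned sum = 0;

    for (size_t i = 0; i < TOY_KEY_WORDS; i++)
        sum += key[i];
    return sum;
}
```

The other option mentioned in the commit, dropping the over-specified size, would correspond to writing the parameter as `const unsigned *key` or `const unsigned key[]` in both places.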
    • bpf: Implement formatted output helpers with bstr_printf · 48cac3f4
      Florent Revest committed
      BPF has three formatted output helpers: bpf_trace_printk, bpf_seq_printf
      and bpf_snprintf. Their signatures specify that all arguments are
      provided from the BPF world as u64s (in an array or as registers). All
      of these helpers are currently implemented by calling functions such as
      snprintf() whose signatures take a variable number of arguments, then
      placed in a va_list by the compiler to call vsnprintf().
      
      "d9c9e4db bpf: Factorize bpf_trace_printk and bpf_seq_printf" introduced
      a bpf_printf_prepare function that fills an array of u64 sanitized
      arguments with an array of "modifiers" which indicate what the "real"
      size of each argument should be (given by the format specifier). The
      BPF_CAST_FMT_ARG macro consumes these arrays and casts each argument to
      its real size. However, the C promotion rules implicitly cast them all
      back to u64s. Therefore, the arguments given to snprintf are u64s and
      the va_list constructed by the compiler will use 64 bits for each
      argument. On 64 bit machines, this happens to work well because 32 bit
      arguments in va_lists need to occupy 64 bits anyway, but on 32 bit
      architectures this breaks the layout of the va_list expected by the
      called function and mangles values.
      
      In "88a5c690 bpf: fix bpf_trace_printk on 32 bit archs", this problem
      had been solved for bpf_trace_printk only with a "horrid workaround"
      that emitted multiple calls to trace_printk where each call had
      different argument types and generated different va_list layouts. One of
      the calls would be dynamically chosen at runtime. This was ok with the 3
      arguments that bpf_trace_printk takes but bpf_seq_printf and
      bpf_snprintf accept up to 12 arguments. Because the code size of this
      approach grows exponentially with the argument count, it is not a viable
      option anymore.
      
      Because the promotion rules are part of the language and because the
      construction of a va_list is an arch-specific ABI, it's best to just
      avoid variadic arguments and va_lists altogether. Thankfully the
      kernel's snprintf() has an alternative in the form of bstr_printf() that
      accepts arguments in a "binary buffer representation". These binary
      buffers are currently created by vbin_printf and used in the tracing
      subsystem to split the cost of printing into two parts: a fast one that
      only dereferences and remembers values, and a slower one, called later,
      that does the pretty-printing.
      
      This patch refactors bpf_printf_prepare to construct binary buffers of
      arguments consumable by bstr_printf() instead of arrays of arguments and
      modifiers. This gets rid of BPF_CAST_FMT_ARG and greatly simplifies the
      bpf_printf_prepare usage but there are a few gotchas that change how
      bpf_printf_prepare needs to do things.
      
      Currently, bpf_printf_prepare uses a per-cpu temporary buffer as
      generic storage for strings and IP addresses. With this refactoring, the
      temporary buffers now hold all the arguments in a structured binary
      format.
      
      To comply with the format expected by bstr_printf, certain format
      specifiers also need to be pre-formatted: %pB and %pi6/%pi4/%pI4/%pI6.
      Because vsnprintf subroutines for these specifiers are hard to expose,
      we pre-format these arguments with calls to snprintf().
      Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Florent Revest <revest@chromium.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210427174313.860948-3-revest@chromium.org
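The idea behind passing arguments in a "binary buffer representation" instead of a va_list, as described in "bpf: Implement formatted output helpers with bstr_printf", can be sketched in userspace C. This is an assumption-laden illustration: `toy_pack_arg()` and `toy_unpack_arg()` are hypothetical, and the layout is deliberately simplified (no alignment handling), so it is not the exact format produced by vbin_printf or consumed by bstr_printf.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Append one argument to a binary buffer using its "real" size (4 or 8
 * bytes) as implied by the format specifier, although it arrives as a u64.
 * Returns 0 on success, -1 on an unsupported size or a full buffer.
 */
int toy_pack_arg(uint8_t *buf, size_t cap, size_t *off, uint64_t raw,
                 size_t size)
{
    if (size != 4 && size != 8)
        return -1;
    if (*off + size > cap)
        return -1;
    if (size == 4) {
        uint32_t v = (uint32_t)raw;   /* narrow explicitly, no promotion */
        memcpy(buf + *off, &v, sizeof(v));
    } else {
        memcpy(buf + *off, &raw, sizeof(raw));
    }
    *off += size;
    return 0;
}

/* Read one argument of the given size (4 or 8) back from the buffer. */
uint64_t toy_unpack_arg(const uint8_t *buf, size_t *off, size_t size)
{
    uint64_t out = 0;

    if (size == 4) {
        uint32_t v;
        memcpy(&v, buf + *off, sizeof(v));
        out = v;
    } else {
        memcpy(&out, buf + *off, sizeof(out));
    }
    *off += size;
    return out;
}
```

Because the producer and the consumer agree on each argument's size explicitly, the result does not depend on how a particular architecture lays out variadic arguments.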
    • seq_file: Add a seq_bprintf function · 76d6a133
      Florent Revest committed
      Similarly to seq_buf_bprintf in lib/seq_buf.c, this function writes a
      printf formatted string with arguments provided in a "binary
      representation" built by functions such as vbin_printf.
      Signed-off-by: Florent Revest <revest@chromium.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210427174313.860948-2-revest@chromium.org
    • net: phy: Add support for microchip SMI0 MDIO bus · 800fcab8
      Andrew Lunn committed
      SMI0 is a mangled version of MDIO. The main low level difference is
      the MDIO C22 OP code is always 0, not 0x2 or 0x1 for Read/Write. The
      read/write information is instead encoded in the PHY address.
      
      Extend the bit-bang code to allow the op code to be overridden, but
      default to normal C22 values. Add an extra compatible to the mdio-gpio
      driver, and when this compatible is present, set the op codes to 0.
      
      A higher level driver, sitting on top of the basic MDIO bus driver, can
      then implement the rest of the Microchip SMI0 oddities.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
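The op code difference described in "net: phy: Add support for microchip SMI0 MDIO bus" can be made concrete by building the first 14 bits of a Clause 22 frame (start, op code, PHY address, register address). This is a minimal sketch: `toy_c22_header()` and the `TOY_*` constants are hypothetical, and the way SMI0 encodes read/write in the PHY address is left to the higher level driver and not modelled here.

```c
#include <stdint.h>

enum {
    TOY_C22_START    = 0x1, /* Clause 22 start-of-frame bits "01" */
    TOY_C22_OP_WRITE = 0x1, /* normal C22 write op code */
    TOY_C22_OP_READ  = 0x2, /* normal C22 read op code */
    TOY_SMI0_OP      = 0x0  /* SMI0: op code field is always 0 */
};

/*
 * Pack start (2 bits), op code (2 bits), PHY address (5 bits) and
 * register address (5 bits) into the low 14 bits of the return value.
 * The op code is a parameter so that a caller can override it.
 */
uint16_t toy_c22_header(unsigned op, unsigned phy, unsigned reg)
{
    return (uint16_t)((TOY_C22_START << 12) |
                      ((op & 0x3) << 10) |
                      ((phy & 0x1f) << 5) |
                      (reg & 0x1f));
}
```

With the op code as an argument, the same bit-banging path can serve both cases: normal C22 callers pass the read or write op code, and an SMI0 bus passes 0 for both.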
    • net: mscc: ocelot: support PTP Sync one-step timestamping · 39e5308b
      Yangbo Lu committed
      Although HWTSTAMP_TX_ONESTEP_SYNC existed in the ioctl for hardware
      timestamp configuration, PTP Sync one-step timestamping had never
      actually been supported.

      This patch adds real support for it.

      - ocelot_port_txtstamp_request()
        This function handles a tx timestamp request by storing
        ptp_cmd (the tx timestamp type) in OCELOT_SKB_CB(skb)->ptp_cmd,
        and, for two-step timestamping, additionally storing ts_id in
        OCELOT_SKB_CB(clone)->ptp_cmd.

      - ocelot_ptp_rew_op()
        During xmit, this function is called to get rew_op (the rewriter
        option) by checking skb->cb for a tx timestamp request, and to
        configure the transmission accordingly.

      A packet that is not a one-step Sync packet but carries a one-step
      timestamp request falls back to two-step timestamping.
      Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
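The fallback rule stated at the end of "net: mscc: ocelot: support PTP Sync one-step timestamping" amounts to a small decision function. The sketch below uses hypothetical names (`enum toy_ptp_cmd`, `toy_resolve_ptp_cmd()`); how a packet is recognised as a one-step Sync packet is outside its scope and is passed in as a flag.

```c
#include <stdbool.h>

/* Hypothetical timestamp request types for the sketch. */
enum toy_ptp_cmd {
    TOY_PTP_NONE,
    TOY_PTP_TWO_STEP,
    TOY_PTP_ONE_STEP_SYNC
};

/*
 * A one-step request is honoured only for packets that really are
 * one-step Sync packets; any other packet with a one-step request falls
 * back to two-step timestamping. Other requests pass through unchanged.
 */
enum toy_ptp_cmd toy_resolve_ptp_cmd(enum toy_ptp_cmd requested,
                                     bool is_onestep_sync_pkt)
{
    if (requested == TOY_PTP_ONE_STEP_SYNC && !is_onestep_sync_pkt)
        return TOY_PTP_TWO_STEP;
    return requested;
}
```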
    • net: mscc: ocelot: convert to ocelot_port_txtstamp_request() · 682eaad9
      Yangbo Lu committed
      Convert to a common ocelot_port_txtstamp_request() for TX timestamp
      request handling.
      Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
      Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: free skb->cb usage in core driver · c4b364ce
      Yangbo Lu committed
      Free skb->cb usage in the core driver and let device drivers decide
      whether to use it. The reason for having DSA_SKB_CB(skb)->clone was that
      dsa_skb_tx_timestamp(), which may set the clone pointer, was called
      before p->xmit(), which would use the clone if any, and the device
      driver had no way to initialize the clone pointer.

      This patch just puts memset(skb->cb, 0, sizeof(skb->cb)) at the beginning
      of dsa_slave_xmit(). Some future features, like one-step timestamping,
      may need more bytes of skb->cb for use in dsa_skb_tx_timestamp() and
      p->xmit().
      Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: no longer clone skb in core driver · 5c5416f5
      Yangbo Lu committed
      It was a waste to clone the skb directly in dsa_skb_tx_timestamp().
      For one-step timestamping, a clone was not needed. On any failure of
      port_txtstamp (which may happen often), the skb clone had to be freed.

      So this patch moves skb cloning for tx timestamps out of the dsa core and
      lets drivers clone the skb in port_txtstamp if they really need to.
      Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
      Tested-by: Kurt Kanzenbach <kurt@linutronix.de>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: no longer identify PTP packet in core driver · cf536ea3
      Yangbo Lu committed
      Move ptp_classify_raw out of the dsa core driver's handling of tx
      timestamp requests. Let device drivers do this if they want.
      Not all drivers want to limit tx timestamping to PTP packets
      only.
      Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
      Tested-by: Kurt Kanzenbach <kurt@linutronix.de>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: mcast: fix broken length + header check for MRDv6 Adv. · 99014088
      Linus Lüssing committed
      The IPv6 Multicast Router Advertisements parsing has the following two
      issues:
      
      For one thing, ICMPv6 MRD Advertisements are smaller than ICMPv6 MLD
      messages (ICMPv6 MRD Adv.: 8 bytes vs. ICMPv6 MLDv1/2: >= 24 bytes,
      assuming MLDv2 Reports with at least one multicast address entry).
      When ipv6_mc_check_mld_msg() tries to parse a Multicast Router
      Advertisement its MLD length check will fail - and it will wrongly
      return -EINVAL, even if we have a valid MRD Advertisement. With the
      returned -EINVAL the bridge code will assume a broken packet and will
      wrongly discard it, potentially leading to multicast packet loss towards
      multicast routers.
      
      The second issue is the MRD header parsing in
      br_ip6_multicast_mrd_rcv(): it wrongly checks for an ICMPv6 header
      immediately after the IPv6 header (IPv6 next header type). However,
      according to RFC4286, section 2, all MRD messages contain a Router Alert
      option (just like MLD). So instead there is an IPv6 Hop-by-Hop option
      for the Router Alert between the IPv6 and ICMPv6 header, again leading
      to the bridge wrongly discarding Multicast Router Advertisements.
      
      To fix these two issues, introduce a new return value -ENODATA to
      ipv6_mc_check_mld() to indicate a valid ICMPv6 packet with a hop-by-hop
      option which is not an MLD but potentially an MRD packet. This also
      simplifies further parsing in the bridge code, as ipv6_mc_check_mld()
      already fully checks the ICMPv6 header and hop-by-hop option.
      
      These issues were found and fixed with the help of the mrdisc tool
      (https://github.com/troglobit/mrdisc).
      
      Fixes: 4b3087c7 ("bridge: Snoop Multicast Router Advertisements")
      Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: David S. Miller <davem@davemloft.net>
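The three outcomes discussed in "net: bridge: mcast: fix broken length + header check for MRDv6 Adv." (valid MLD, broken MLD, valid ICMPv6 that is not MLD) can be modelled as follows. This is a simplified sketch with hypothetical names (`toy_check_mld()`, `TOY_*`); it assumes the IPv6 header, the hop-by-hop option and the ICMPv6 checksum were already validated, and it applies one minimum length to all MLD types, which is coarser than the real per-type checks.

```c
#include <errno.h>
#include <stddef.h>

/* ICMPv6 message types (MLD per RFC 2710/3810, MRD per RFC 4286). */
enum {
    TOY_MLD_QUERY     = 130,
    TOY_MLD_V1_REPORT = 131,
    TOY_MLD_V1_DONE   = 132,
    TOY_MLD_V2_REPORT = 143,
    TOY_MRD_ADVERT    = 151
};

/* Minimum MLD message size quoted in the commit message. */
enum { TOY_MLD_MIN_LEN = 24 };

/*
 * Returns 0 for a plausible MLD message, -EINVAL for an MLD message that
 * is too short, and -ENODATA for an ICMPv6 message that is not MLD at
 * all (for example an 8-byte MRD Advertisement), so that the caller can
 * continue with MRD parsing instead of discarding the packet.
 */
int toy_check_mld(unsigned icmp6_type, size_t icmp6_len)
{
    switch (icmp6_type) {
    case TOY_MLD_QUERY:
    case TOY_MLD_V1_REPORT:
    case TOY_MLD_V1_DONE:
    case TOY_MLD_V2_REPORT:
        return icmp6_len >= TOY_MLD_MIN_LEN ? 0 : -EINVAL;
    default:
        return -ENODATA;
    }
}
```

The key point of the sketch is the ordering: the MLD length requirement is applied only after the type has been identified as MLD, so a short but valid MRD Advertisement is no longer reported as -EINVAL.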
    • netfilter: nftables: add catch-all set element support · aaa31047
      Pablo Neira Ayuso committed
      This patch extends the set infrastructure to add a special catch-all set
      element. If the lookup fails to find an element (or range) in the set,
      then the catch-all element is selected. Users can specify a mapping,
      expression(s) and timeout to be attached to the catch-all element.
      
      This patch adds a catchall list to the set. This list might contain more
      than one catch-all element (e.g. when the catch-all element is removed
      and a new one is added in the same transaction). However, most of the
      time, there will be either one element or no elements at all in this list.

      The catch-all element is identified via the NFT_SET_ELEM_CATCHALL flag,
      and such a special element has no NFTA_SET_ELEM_KEY attribute. There is a
      new nft_set_elem_catchall object that stores a reference to the dummy
      catch-all element (catchall->elem), whose layout is the same as that of
      the set element type so that the existing set element codebase can be
      reused.

      The set size does not apply to the catch-all element: users can define a
      catch-all element even if the set is full.

      The check for valid set element flags has been updated to report
      EOPNOTSUPP in case userspace requests flags that are not supported when
      using new userspace nftables and an old kernel.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
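The lookup semantics described in "netfilter: nftables: add catch-all set element support" can be sketched with a toy set. Everything here is hypothetical (`struct toy_set`, `toy_set_lookup()`, a linear search over integer keys) and is not the nf_tables data structure. It only shows that the catch-all element is consulted when the normal lookup finds nothing, and that it lives outside the regular element storage.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_elem {
    uint32_t key;
    int verdict;   /* stand-in for the mapping/data attached to an element */
};

struct toy_set {
    const struct toy_elem *elems;     /* regular elements, counted by size */
    size_t n_elems;
    const struct toy_elem *catchall;  /* NULL when no catch-all is defined */
};

/*
 * Return the element matching key, or the catch-all element when the
 * lookup fails. Returns NULL only if there is no match and no catch-all.
 */
const struct toy_elem *toy_set_lookup(const struct toy_set *set, uint32_t key)
{
    for (size_t i = 0; i < set->n_elems; i++)
        if (set->elems[i].key == key)
            return &set->elems[i];
    return set->catchall;
}
```

Keeping the catch-all pointer separate from `elems` mirrors the statement that the set size does not apply to it: in this model a full `elems` array does not prevent a catch-all from being set.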
  5. 27 April, 2021 6 commits
  6. 26 April, 2021 19 commits