1. 08 Jul, 2018 9 commits
    • T
      xdp: XDP_REDIRECT should check IFF_UP and MTU · d8d7218a
      Toshiaki Makita authored
      Otherwise we end up attempting to send packets from down devices
      or to send oversized packets, which may cause unexpected driver/device
      behaviour. Generic XDP already performs this check, so reuse the logic
      in native XDP.
      
      Fixes: 814abfab ("xdp: add bpf_redirect helper function")
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
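The check described in this commit can be sketched in user-space C as follows. This is only an illustration: `struct mock_netdev`, `MOCK_IFF_UP` and `redirect_ok()` are hypothetical stand-ins rather than kernel definitions, and the size bound (MTU plus link-layer header length) is an assumption; the kernel's exact bound may differ.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag value; stands in for the kernel's IFF_UP. */
#define MOCK_IFF_UP 0x1u

/* Hypothetical stand-in for the few device fields the check needs. */
struct mock_netdev {
	unsigned int flags;           /* MOCK_IFF_UP set when the device is up */
	unsigned int mtu;             /* maximum payload size in bytes */
	unsigned int hard_header_len; /* link-layer header length in bytes */
};

/*
 * Return 0 if a frame of pktlen bytes may be redirected to fwd,
 * -ENETDOWN if the device is down, -EMSGSIZE if the frame is too large.
 */
static int redirect_ok(const struct mock_netdev *fwd, unsigned int pktlen)
{
	if (!(fwd->flags & MOCK_IFF_UP))
		return -ENETDOWN;
	if (pktlen > fwd->mtu + fwd->hard_header_len)
		return -EMSGSIZE;
	return 0;
}
```

Rejecting the frame before it reaches the driver keeps the driver from ever seeing a packet it was not prepared to handle.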
    • A
      Merge branch 'sockhash-fixes' · 4fb126cb
      Alexei Starovoitov authored
      John Fastabend says:
      
      ====================
      First three patches resolve issues found while testing sockhash and
      reviewing code. Syzbot also found them about the same time as I was
      working on fixes. The main issue is in the sockhash path we reduced
      the scope of sk_callback lock but this meant we could get update and
      close running in parallel so fix that here.
      
      Then testing sk_msg and sk_skb programs together found that skb->dev
      is not always assigned and some of the helpers were depending on this
      to lookup max mtu. Fix this by using SKB_MAX_ALLOC when no MTU is
      available.
      
      Finally, Martin spotted that the sockmap code was still using the
      qdisc skb cb structure. But I was sure we had fixed this long ago.
      Looks like we missed it in a merge conflict resolution and then by
      chance data_end offset was the same in both structures so everything
      sort of continued to work even though it could break at any moment
      if the structs ever change. So redo the conversion and this time
      also convert the helpers.
      
      v2: fix '0 files changed' issue in patches
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • J
      bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb · 0ea488ff
      John Fastabend authored
      In commit
      
        'bpf: bpf_compute_data uses incorrect cb structure' (8108a775)
      
      we added the routine bpf_compute_data_end_sk_skb() to compute the
      correct data_end values, but this has since been lost. In kernel
      v4.14 this was correct and the above patch was applied in its
      entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree
      we lost the piece that renamed bpf_compute_data_pointers to the
      new function bpf_compute_data_end_sk_skb. This was done here,
      
      e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      
      When it conflicted with the following rename patch,
      
      6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      
      Finally, after a refactor I thought even the function
      bpf_compute_data_end_sk_skb() was no longer needed and it was
      erroneously removed.
      
      However, we never reverted the sk_skb_convert_ctx_access() usage of
      tcp_skb_cb which had been committed and survived the merge conflict.
      Here we fix this by adding back the helper and *_data_end_sk_skb()
      usage. Using the bpf_skb_data_end mapping is not correct because it
      expects a qdisc_skb_cb object but at the sock layer this is not the
      case. Even though it happens to work here because we don't overwrite
      any data in-use at the socket layer and the cb structure is cleared
      later this has potential to create some subtle issues. But, even
      more concretely the filter.c access check uses tcp_skb_cb.
      
      And by some act of chance though,
      
      struct bpf_skb_data_end {
              struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */
      
              /* XXX 4 bytes hole, try to pack */
      
              void *                     data_meta;            /*    32     8 */
              void *                     data_end;             /*    40     8 */
      
              /* size: 48, cachelines: 1, members: 3 */
              /* sum members: 44, holes: 1, sum holes: 4 */
              /* last cacheline: 48 bytes */
      };
      
      and then tcp_skb_cb,
      
      struct tcp_skb_cb {
      	[...]
                      struct {
                              __u32      flags;                /*    24     4 */
                              struct sock * sk_redir;          /*    32     8 */
                              void *     data_end;             /*    40     8 */
                      } bpf;                                   /*          24 */
              };
      
      So when we use offsetof() to track down the byte offset we get 40 in
      either case and everything continues to work. Fix this mess and use
      the correct structures; it is unclear how long this would keep working
      before someone moves the structs around.
      
      Reported-by: Martin KaFai Lau <kafai@fb.com>
      Fixes: e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      Fixes: 6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
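The offset coincidence described in this commit can be reproduced with two mock structures. This is a user-space sketch: `mock_bpf_skb_data_end` and `mock_tcp_skb_cb` are hypothetical layouts that only mimic the sizes in the pahole output above (28 bytes of qdisc cb, and 24 opaque bytes ahead of the bpf sub-struct); they are not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock of the bpf_skb_data_end layout: 28 bytes of qdisc cb, then pointers. */
struct mock_bpf_skb_data_end {
	unsigned char qdisc_cb[28];
	void *data_meta;
	void *data_end;
};

/* Mock of the tail of tcp_skb_cb: 24 opaque bytes, then the bpf sub-struct. */
struct mock_tcp_skb_cb {
	unsigned char head[24];
	struct {
		uint32_t flags;
		void *sk_redir;
		void *data_end;
	} bpf;
};

/* Byte offset of data_end when the cb area is viewed as the qdisc layout. */
static size_t data_end_off_qdisc(void)
{
	return offsetof(struct mock_bpf_skb_data_end, data_end);
}

/* Byte offset of data_end when the cb area is viewed as the tcp layout. */
static size_t data_end_off_tcp(void)
{
	return offsetof(struct mock_tcp_skb_cb, bpf.data_end);
}
```

Both helpers return the same offset only because padding happens to line up; adding or reordering a single member in either structure would silently break code that mixes the two views.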
    • J
      bpf: sockmap, consume_skb in close path · 7ebc14d5
      John Fastabend authored
      Currently, when a sock is closed and the bpf_tcp_close() callback is
      used we remove memory but do not free the skb. Call consume_skb() if
      the skb is attached to the buffer.
      
      Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
      Fixes: 1aa12bdf ("bpf: sockmap, add sock close() hook to remove socks")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • J
      bpf: sockhash, disallow bpf_tcp_close and update in parallel · 99ba2b5a
      John Fastabend authored
      After the latest lock updates there is no longer anything preventing a
      close and recvmsg call running in parallel. Additionally, we can
      race update with close if we close a socket and simultaneously update
      it via the BPF userspace API (note the cgroup ops are already run
      with sock_lock held).
      
      To resolve this take sock_lock in close and update paths.
      
      Reported-by: syzbot+b680e42077a0d7c9a0c4@syzkaller.appspotmail.com
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • J
      bpf: fix sk_skb programs without skb->dev assigned · 0c6bc6e5
      John Fastabend authored
      Multiple BPF helpers in use by sk_skb programs calculate the max
      skb length using the __bpf_skb_max_len function. However, this
      calculates the max length using the skb->dev pointer which can be
      NULL when an sk_skb program is paired with an sk_msg program.
      
      To force this, an sk_msg program needs to redirect into the ingress
      path of a sock with an attached sk_skb program. Then the sk_skb
      program would need to call one of the helpers that adjust the skb
      size.
      
      To fix the null ptr dereference use SKB_MAX_ALLOC size if no dev
      is available.
      
      Fixes: 8934ce2f ("bpf: sockmap redirect ingress support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
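The fallback described in this commit can be sketched as below. The types `mock_dev` and `mock_skb` and the function `mock_skb_max_len()` are hypothetical, and the value given to `MOCK_SKB_MAX_ALLOC` is a placeholder assumption, not the kernel constant.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder for the kernel's SKB_MAX_ALLOC; the value is an assumption. */
#define MOCK_SKB_MAX_ALLOC (16u * 1024u)

/* Hypothetical stand-ins for the device and skb fields involved. */
struct mock_dev {
	unsigned int mtu;
	unsigned int hard_header_len;
};

struct mock_skb {
	const struct mock_dev *dev; /* may be NULL for sk_skb programs */
};

/*
 * Maximum skb length: derived from the device when one is attached,
 * otherwise a fixed upper bound so that a NULL dev is never dereferenced.
 */
static unsigned int mock_skb_max_len(const struct mock_skb *skb)
{
	if (skb->dev)
		return skb->dev->mtu + skb->dev->hard_header_len;
	return MOCK_SKB_MAX_ALLOC;
}
```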
    • A
      Merge branch 'sockmap-fixes' · 631da853
      Alexei Starovoitov authored
      John Fastabend says:
      
      ====================
      I missed fixing the error path in the sockhash code to align with
      supporting socks in multiple maps. Simply checking if the psock is
      present does not mean we can decrement the reference count because
      it could be part of another map. Fix this by cleaning up the error
      path so this situation does not happen.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • J
      bpf: sockmap, hash table is RCU so readers do not need locks · 1d1ef005
      John Fastabend authored
      This removes locking from readers of the RCU hash table. It's not
      necessary.
      
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • J
      bpf: sockmap, error path can not release psock in multi-map case · 547b3aa4
      John Fastabend authored
      The current code, in the error path of sock_hash_ctx_update_elem,
      checks if the sock has a psock in the user data and if so decrements
      the reference count of the psock. However, if the error happens early
      in the error path we may have never incremented the psock reference
      count and if the psock exists because the sock is in another map then
      we may inadvertently decrement the reference count.
      
      Fix this by making the error path only call smap_release_sock if the
      error happens after the increment.
      
      Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
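The rule in this commit (only release a reference that this call actually took) can be illustrated with a small sketch. Everything here is hypothetical: `mock_psock`, `mock_update_elem()` and its `fail_early`/`fail_late` switches exist only to simulate errors before and after the increment.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a psock that may be shared by several maps. */
struct mock_psock {
	int refcnt;
};

/*
 * Simulated update. fail_early triggers an error before the reference
 * is taken, fail_late triggers one after it. The error path drops the
 * reference only if this call incremented it.
 */
static int mock_update_elem(struct mock_psock *psock, int fail_early,
			    int fail_late)
{
	int took_ref = 0;
	int err;

	if (fail_early) {
		err = -EINVAL;
		goto out;
	}

	psock->refcnt++;
	took_ref = 1;

	if (fail_late) {
		err = -ENOMEM;
		goto out;
	}
	return 0;

out:
	if (took_ref)
		psock->refcnt--;
	return err;
}
```

With a psock that already holds one reference from another map, an early failure leaves the count at one instead of dropping it to zero.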
  2. 05 Jul, 2018 4 commits
  3. 04 Jul, 2018 1 commit
  4. 03 Jul, 2018 8 commits
    • A
      Merge branch 'af_xdp-fixes' · 39d393cf
      Alexei Starovoitov authored
      Magnus Karlsson says:
      
      ====================
      This patch set fixes three bugs in the SKB TX path of AF_XDP.
      Details in the individual commits.
      
      The structure of the patch set is as follows:
      
      Patch 1: Fix for lost completion message
      Patch 2-3: Fix for possible multiple completions of single packet
      Patch 4: Fix potential race during error
      
      Changes from v1:
      
      * Added explanation of race in commit message of patch 4.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • M
      xsk: fix potential race in SKB TX completion code · a9744f7c
      Magnus Karlsson authored
      There is a potential race in the TX completion code for the SKB
      case. One process enters the sendmsg code of an AF_XDP socket in order
      to send a frame. The execution eventually trickles down to the driver
      that is told to send the packet. However, it decides to drop the
      packet due to some error condition (e.g., rings full) and frees the
      SKB. This will trigger the SKB destructor and a completion will be
      sent to the AF_XDP user space through its
      single-producer/single-consumer queues.
      
      At the same time a TX interrupt has fired on another core and it
      dispatches the TX completion code in the driver. It does its HW
      specific things and ends up freeing the SKB associated with the
      transmitted packet. This will trigger the SKB destructor and a
      completion will be sent to the AF_XDP user space through its
      single-producer/single-consumer queues. With a pseudo call stack, it
      would look like this:
      
      Core 1:
      sendmsg() being called in the application
        netdev_start_xmit()
          Driver entered through ndo_start_xmit
            Driver decides to free the SKB for some reason (e.g., rings full)
              Destructor of SKB called
                xskq_produce_addr() is called to signal completion to user space
      
      Core 2:
      TX completion irq
        NAPI loop
          Driver irq handler for TX completions
            Frees the SKB
              Destructor of SKB called
                xskq_produce_addr() is called to signal completion to user space
      
      We now have a violation of the single-producer/single-consumer
      principle for our queues as there are two threads trying to produce at
      the same time on the same queue.
      
      Fixed by introducing a spin_lock in the destructor. In regards to the
      performance, I get around 1.74 Mpps for txonly before and after the
      introduction of the spinlock. There is of course some impact due to
      the spin lock but it is in the less significant digits that are too
      noisy for me to measure. But let us say that the version without the
      spin lock got 1.745 Mpps in the best case and the version with 1.735
      Mpps in the worst case, then that would mean a maximum drop in
      performance of 0.5%.
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
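The idea of serializing two would-be producers with a lock can be sketched in portable C11. This is not the kernel code: `mock_cq`, `mock_cq_init()` and `mock_cq_produce()` are hypothetical, and a C11 `atomic_flag` stands in for the kernel spinlock. The sketch leaves out the memory ordering a real ring needs towards its consumer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MOCK_RING_SIZE 8u /* must be a power of two */

/* Hypothetical completion ring with a lock guarding the producer side. */
struct mock_cq {
	atomic_flag lock;
	uint32_t prod; /* free-running producer index */
	uint32_t cons; /* free-running consumer index */
	uint64_t addrs[MOCK_RING_SIZE];
};

static void mock_cq_init(struct mock_cq *q)
{
	atomic_flag_clear(&q->lock);
	q->prod = 0;
	q->cons = 0;
}

/*
 * Append one completed address. The lock makes the check-and-store
 * sequence atomic with respect to other callers, so two contexts that
 * both free an SKB cannot corrupt the producer index.
 * Returns 0 on success, -1 if the ring is full.
 */
static int mock_cq_produce(struct mock_cq *q, uint64_t addr)
{
	int ret = -1;

	while (atomic_flag_test_and_set_explicit(&q->lock,
						 memory_order_acquire))
		; /* spin until the other producer is done */

	if (q->prod - q->cons < MOCK_RING_SIZE) {
		q->addrs[q->prod & (MOCK_RING_SIZE - 1)] = addr;
		q->prod++;
		ret = 0;
	}

	atomic_flag_clear_explicit(&q->lock, memory_order_release);
	return ret;
}
```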
    • M
      samples/bpf: deal with EBUSY return code from sendmsg in xdpsock sample · c03079c9
      Magnus Karlsson authored
      Sendmsg in the SKB path of AF_XDP can now return EBUSY when a packet
      was discarded and completed by the driver. Just ignore this message
      in the sample application.
      
      Fixes: b4b8faa1 ("samples/bpf: sample application and documentation for AF_XDP sockets")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Reported-by: Pavel Odintsov <pavel@fastnetmon.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
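One way an application might treat EBUSY as a non-fatal outcome of a TX kick is sketched below. `tx_kick_result_ok()` is a hypothetical helper, not code taken from the xdpsock sample, and treating EAGAIN as benign as well is an assumption.

```c
#include <assert.h>
#include <errno.h>

/*
 * Decide whether the result of a sendmsg()/sendto() TX kick can be
 * ignored. ret is the syscall return value and err is the errno value
 * observed when ret is negative.
 * Returns 1 if the caller may carry on, 0 if the error is unexpected.
 */
static int tx_kick_result_ok(int ret, int err)
{
	if (ret >= 0)
		return 1;
	/* EBUSY: the packet was dropped and its buffer already completed. */
	return err == EBUSY || err == EAGAIN;
}
```

A caller would pass `errno` as the second argument right after the syscall and abort only when the helper returns 0.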
    • M
      xsk: frame could be completed more than once in SKB path · fe588685
      Magnus Karlsson authored
      Fixed a bug in which a frame could be completed more than once
      when an error was returned from dev_direct_xmit(). The code
      erroneously retried sending the message leading to multiple
      calls to the SKB destructor and therefore multiple completions
      of the same buffer to user space.
      
      The error code in this case has been changed from EAGAIN to EBUSY
      in order to tell user space that the sending of the packet failed
      and the buffer has been returned to user space through the completion
      queue.
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Reported-by: Pavel Odintsov <pavel@fastnetmon.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • M
      xsk: fix potential lost completion message in SKB path · 20b52a75
      Magnus Karlsson authored
      The code in xskq_produce_addr erroneously checked if there
      was up to LAZY_UPDATE_THRESHOLD amount of space in the completion
      queue. It only needs to check if there is one slot left in the
      queue. This bug could under some circumstances lead to a WARN_ON_ONCE
      being triggered and the completion message to user space being lost.
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Reported-by: Pavel Odintsov <pavel@fastnetmon.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
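The difference between the two space checks can be shown with a toy ring. The names `mock_ring`, `can_produce_batch()` and `can_produce_one()` are hypothetical, and the ring size and `MOCK_LAZY_THRESHOLD` value are arbitrary assumptions chosen for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_NENTRIES 8u       /* ring capacity, assumed for illustration */
#define MOCK_LAZY_THRESHOLD 4u /* hypothetical batch threshold */

/* Hypothetical ring state: free-running producer and consumer indices. */
struct mock_ring {
	uint32_t prod;
	uint32_t cons;
};

static uint32_t mock_free_entries(const struct mock_ring *r)
{
	return MOCK_NENTRIES - (r->prod - r->cons);
}

/* Overly strict check: demands room for a whole batch. */
static int can_produce_batch(const struct mock_ring *r)
{
	return mock_free_entries(r) >= MOCK_LAZY_THRESHOLD;
}

/* Sufficient check for a single completion: one free slot is enough. */
static int can_produce_one(const struct mock_ring *r)
{
	return mock_free_entries(r) >= 1;
}
```

With two free slots left, the batch check refuses a completion that would in fact fit, which is the kind of spurious failure the commit message describes.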
    • L
      Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md · d0fbad0a
      Linus Torvalds authored
      Pull MD fixes from Shaohua Li:
       "Two small fixes for MD:
      
         - an error handling fix from me
      
         - a recover bug fix for raid10 from BingJing"
      
      * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
        md/raid10: fix that replacement cannot complete recovery after reassemble
        MD: cleanup resources in failure
    • L
      Merge tag 'for-linus' of git://github.com/stffrdhrn/linux · 8d2b6f6b
      Linus Torvalds authored
      Pull OpenRISC fixes from Stafford Horne:
       "Two fixes for issues which were breaking OpenRISC boot:
      
         - Fix bug in __pte_free_tlb() exposed in 4.18 by Matthew Wilcox's
           page table flag addition.
      
         - Fix issue booting on real hardware if delay slot detection
           emulation is disabled"
      
      * tag 'for-linus' of git://github.com/stffrdhrn/linux:
        openrisc: entry: Fix delay slot exception detection
        openrisc: Call destructor during __pte_free_tlb
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 4e33d7d4
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Verify netlink attributes properly in nf_queue, from Eric Dumazet.
      
       2) Need to bump memory lock rlimit for test_sockmap bpf test, from
          Yonghong Song.
      
       3) Fix VLAN handling in lan78xx driver, from Dave Stevenson.
      
       4) Fix uninitialized read in nf_log, from Jann Horn.
      
       5) Fix raw command length parsing in mlx5, from Alex Vesker.
      
       6) Cleanup loopback RDS connections upon netns deletion, from Sowmini
          Varadhan.
      
       7) Fix regressions in FIB rule matching during create, from Jason A.
          Donenfeld and Roopa Prabhu.
      
       8) Fix mpls ether type detection in nfp, from Pieter Jansen van Vuuren.
      
       9) More bpfilter build fixes/adjustments from Masahiro Yamada.
      
      10) Fix XDP_{TX,REDIRECT} flushing in various drivers, from Jesper
          Dangaard Brouer.
      
      11) fib_tests.sh file permissions were broken, from Shuah Khan.
      
      12) Make sure BH/preemption is disabled in data path of mac80211, from
          Denis Kenzior.
      
      13) Don't ignore nla_parse_nested() return values in nl80211, from
          Johannes Berg.
      
      14) Properly account sock objects to kmemcg, from Shakeel Butt.
      
      15) Adjustments to setting bpf program permissions to read-only, from
          Daniel Borkmann.
      
      16) TCP Fast Open key endianness was broken, it always took on the host
          endianness. Whoops. Explicitly make it little endian. From Yuching
          Cheng.
      
      17) Fix prefix route setting for link local addresses in ipv6, from
          David Ahern.
      
      18) Potential Spectre v1 in zatm driver, from Gustavo A. R. Silva.
      
      19) Various bpf sockmap fixes, from John Fastabend.
      
      20) Use after free for GRO with ESP, from Sabrina Dubroca.
      
      21) Passing bogus flags to crypto_alloc_shash() in ipv6 SR code, from
          Eric Biggers.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
        qede: Adverstise software timestamp caps when PHC is not available.
        qed: Fix use of incorrect size in memcpy call.
        qed: Fix setting of incorrect eswitch mode.
        qed: Limit msix vectors in kdump kernel to the minimum required count.
        ipvlan: call dev_change_flags when ipvlan mode is reset
        ipv6: sr: fix passing wrong flags to crypto_alloc_shash()
        net: fix use-after-free in GRO with ESP
        tcp: prevent bogus FRTO undos with non-SACK flows
        bpf: sockhash, add release routine
        bpf: sockhash fix omitted bucket lock in sock_close
        bpf: sockmap, fix smap_list_map_remove when psock is in many maps
        bpf: sockmap, fix crash when ipv6 sock is added
        net: fib_rules: bring back rule_exists to match rule during add
        hv_netvsc: split sub-channel setup into async and sync
        net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN
        atm: zatm: Fix potential Spectre v1
        s390/qeth: consistently re-enable device features
        s390/qeth: don't clobber buffer on async TX completion
        s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6]
        s390/qeth: fix race when setting MAC address
        ...
  5. 02 Jul, 2018 15 commits
  6. 01 Jul, 2018 3 commits
    • I
      tcp: prevent bogus FRTO undos with non-SACK flows · 1236f22f
      Ilpo Järvinen authored
      If SACK is not enabled and the first cumulative ACK after the RTO
      retransmission covers more than the retransmitted skb, a spurious
      FRTO undo will trigger (assuming FRTO is enabled for that RTO).
      The reason is that any non-retransmitted segment acknowledged will
      set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
      no indication that it would have been delivered for real (the
      scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
      case so the check for that bit won't help like it does with SACK).
      Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
      in tcp_process_loss.
      
      We need to use a stricter condition for the non-SACK case and check
      that none of the cumulatively ACKed segments were retransmitted
      to prove that progress is due to original transmissions. Only then
      keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
      the non-SACK case.
      
      (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
      to better indicate its purpose but to keep this change minimal, it
      will be done in another patch).
      
      Besides burstiness and congestion control violations, this problem
      can result in an RTO loop: when the loss recovery is prematurely
      undone, only new data will be transmitted (if available) and
      the next retransmission can occur only after a new RTO, which in the
      case of multiple losses (that are not for consecutive packets) requires
      one RTO per loss to recover.
      
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
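The stricter non-SACK condition can be sketched as a flag filter. The flag values and `mock_filter_ack_flags()` are hypothetical and only mirror the rule stated in the commit message: without SACK, evidence of original progress is kept only if no retransmitted data was cumulatively ACKed.

```c
#include <assert.h>

/* Hypothetical flag bits; the kernel's actual values are not assumed. */
#define MOCK_FLAG_RETRANS_DATA_ACKED 0x1u
#define MOCK_FLAG_ORIG_SACK_ACKED    0x2u

/*
 * For non-SACK flows, drop the "original data ACKed" evidence whenever
 * the same cumulative ACK also covered retransmitted data, so that it
 * cannot trigger an FRTO undo on its own. SACK flows are left untouched.
 */
static unsigned int mock_filter_ack_flags(unsigned int flag, int sack_enabled)
{
	if (!sack_enabled && (flag & MOCK_FLAG_RETRANS_DATA_ACKED))
		flag &= ~MOCK_FLAG_ORIG_SACK_ACKED;
	return flag;
}
```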
    • S
      openrisc: entry: Fix delay slot exception detection · ae15a41a
      Stafford Horne authored
      Originally in patch e6d20c55 ("openrisc: entry: Fix delay slot
      detection") I fixed delay slot detection, but only for QEMU.  We missed
      that hardware delay slot detection using delay slot exception flag (DSX)
      was still broken.  This was because QEMU set the DSX flag in both the
      pre-exception supervision register (ESR) and the supervision register
      (SR), but on real hardware the DSX flag is only set in the SR
      during exceptions.
      
      Fix this by carrying the DSX flag into the SR register during exception.
      We also update the DSX flag read locations to read the value from the SR
      register not the pt_regs SR register which represents ESR.  The ESR
      should never have the DSX flag set.
      
      In the process I updated/removed a few comments to match the current
      state.  Including removing a comment saying that the DSX detection logic
      was inefficient and needed to be rewritten.
      
      I have tested this on QEMU with a patch ensuring it matches the hardware
      specification.
      
      Link: https://lists.gnu.org/archive/html/qemu-devel/2018-07/msg00000.html
      Fixes: e6d20c55 ("openrisc: entry: Fix delay slot detection")
      Signed-off-by: Stafford Horne <shorne@gmail.com>
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 271b955e
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2018-07-01
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) A bpf_fib_lookup() helper fix to change the API before freeze to
         return an encoding of the FIB lookup result and return the nexthop
         device index in the params struct (instead of device index as return
         code that we had before), from David.
      
      2) Various BPF JIT fixes to address syzkaller fallout, that is, do not
         reject progs when set_memory_*() fails since it could still be RO.
         Also arm32 JIT was not using bpf_jit_binary_lock_ro() API which was
         an issue, and a memory leak in s390 JIT found during review, from
         Daniel.
      
      3) Multiple fixes for sockmap/hash to address most of the syzkaller
         triggered bugs. Usage with IPv6 was crashing, a GPF in bpf_tcp_close(),
         a missing sock_map_release() routine to hook up to callbacks, and a
         fix for an omitted bucket lock in sock_close(), from John.
      
      4) Two bpftool fixes to remove duplicated error message on program load,
         and another one to close the libbpf object after program load. One
         additional fix for nfp driver's BPF offload to avoid stopping offload
         completely if replace of program failed, from Jakub.
      
      5) Couple of BPF selftest fixes that bail out in some of the test
         scripts if the user does not have the right privileges, from Jeffrin.
      
      6) Fixes in test_bpf for s390 when CONFIG_BPF_JIT_ALWAYS_ON is set
         where we need to set the flag that some of the test cases are expected
         to fail, from Kleber.
      
      7) Fix to detangle BPF_LIRC_MODE2 dependency from CONFIG_CGROUP_BPF
         since it has no relation to it and lirc2 users often have configs
         without cgroups enabled and thus would not be able to use it, from Sean.
      
      8) Fix a selftest failure in sockmap by removing a useless setrlimit()
         call that would set a too low limit where at the same time we are
         already including bpf_rlimit.h that does the job, from Yonghong.
      
      9) Fix BPF selftest config with missing NET_SCHED, from Anders.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      271b955e