1. 16 1月, 2020 21 次提交
    • Y
      net: hns3: pad the short frame before sending to the hardware · 36c67349
      Yunsheng Lin 提交于
      The hardware can not handle short frames below or equal to 32
      bytes according to the hardware user manual, and it will trigger
      a RAS error when the frame's length is below 33 bytes.
      
      This patch pads the SKB when skb->len is below 33 bytes before
      sending it to hardware.
      
      Fixes: 76ad4f0e ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
      Signed-off-by: NYunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      36c67349
    • E
      macvlan: use skb_reset_mac_header() in macvlan_queue_xmit() · 1712b2ff
      Eric Dumazet 提交于
      I missed the fact that macvlan_broadcast() can be used both
      in RX and TX.
      
      skb_eth_hdr() makes only sense in TX paths, so we can not
      use it blindly in macvlan_broadcast()
      
      Fixes: 96cc4b69 ("macvlan: do not assume mac_header is set in macvlan_broadcast()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJurgen Van Ham <juvanham@gmail.com>
      Tested-by: NMatteo Croce <mcroce@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1712b2ff
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 3981f955
      David S. Miller 提交于
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2020-01-15
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 12 non-merge commits during the last 9 day(s) which contain
      a total of 13 files changed, 95 insertions(+), 43 deletions(-).
      
      The main changes are:
      
      1) Fix refcount leak for TCP time wait and request sockets for socket lookup
         related BPF helpers, from Lorenz Bauer.
      
      2) Fix wrong verification of ARSH instruction under ALU32, from Daniel Borkmann.
      
      3) Batch of several sockmap and related TLS fixes found while operating
         more complex BPF programs with Cilium and OpenSSL, from John Fastabend.
      
      4) Fix sockmap to read psock's ingress_msg queue before regular sk_receive_queue()
         to avoid purging data upon teardown, from Lingpeng Chen.
      
      5) Fix printing incorrect pointer in bpftool's btf_dump_ptr() in order to properly
         dump a BPF map's value with BTF, from Martin KaFai Lau.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3981f955
    • D
      Merge branch 'bpf-sockmap-tls-fixes' · 85ddd9c3
      Daniel Borkmann 提交于
      John Fastabend says:
      
      ====================
      To date our usage of sockmap/tls has been fairly simple, the BPF programs
      did only well-defined pop, push, pull and apply/cork operations.
      
      Now that we started to push more complex programs into sockmap we uncovered
      a series of issues addressed here. Further OpenSSL3.0 version should be
      released soon with kTLS support so its important to get any remaining
      issues on BPF and kTLS support resolved.
      
      Additionally, I have a patch under development to allow sockmap to be
      enabled/disabled at runtime for Cilium endpoints. This allows us to stress
      the map insert/delete with kTLS more than previously where Cilium only
      added the socket to the map when it entered ESTABLISHED state and never
      touched it from the control path side again relying on the sockets own
      close() hook to remove it.
      
      To test I have a set of test cases in test_sockmap.c that expose these
      issues. Once we get fixes here merged and in bpf-next I'll submit the
      tests to bpf-next tree to ensure we don't regress again. Also I've run
      these patches in the Cilium CI with OpenSSL (master branch) this will
      run tools such as netperf, ab, wrk2, curl, etc. to get a broad set of
      testing.
      
      I'm aware of two more issues that we are working to resolve in another
      couple (probably two) patches. First we see an auth tag corruption in
      kTLS when sending small 1byte chunks under stress. I've not pinned this
      down yet. But, guessing because its under 1B stress tests it must be
      some error path being triggered. And second we need to ensure BPF RX
      programs are not skipped when kTLS ULP is loaded. This breaks some of the
      sockmap selftests when running with kTLS. I'll send a follow up for this.
      
      v2: I dropped a patch that added !0 size check in tls_push_record
          this originated from a panic I caught awhile ago with a trace
          in the crypto stack. But I can not reproduce it anymore so will
          dig into that and send another patch later if needed. Anyways
          after a bit of thought it would be nicer if tls/crypto/bpf didn't
          require special case handling for the !0 size.
      ====================
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      85ddd9c3
    • J
      bpf: Sockmap/tls, fix pop data with SK_DROP return code · 7361d448
      John Fastabend 提交于
      When user returns SK_DROP we need to reset the number of copied bytes
      to indicate to the user the bytes were dropped and not sent. If we
      don't reset the copied arg sendmsg will return as if those bytes were
      copied giving the user a positive return value.
      
      This works as expected today except in the case where the user also
      pops bytes. In the pop case the sg.size is reduced but we don't correctly
      account for this when copied bytes is reset. The popped bytes are not
      accounted for and we return a small positive value potentially confusing
      the user.
      
      The reason this happens is due to a typo where we do the wrong comparison
      when accounting for pop bytes. In this fix notice the if/else is not
      needed and that we have a similar problem if we push data except its not
      visible to the user because if delta is larger the sg.size we return a
      negative value so it appears as an error regardless.
      
      Fixes: 7246d8ed ("bpf: helper to pop data from messages")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-9-john.fastabend@gmail.com
      7361d448
    • J
      bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining · 9aaaa568
      John Fastabend 提交于
      Its possible through a set of push, pop, apply helper calls to construct
      a skmsg, which is just a ring of scatterlist elements, with the start
      value larger than the end value. For example,
      
            end       start
        |_0_|_1_| ... |_n_|_n+1_|
      
      Where end points at 1 and start points and n so that valid elements is
      the set {n, n+1, 0, 1}.
      
      Currently, because we don't build the correct chain only {n, n+1} will
      be sent. This adds a check and sg_chain call to correctly submit the
      above to the crypto and tls send path.
      
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-8-john.fastabend@gmail.com
      9aaaa568
    • J
      bpf: Sockmap/tls, tls_sw can create a plaintext buf > encrypt buf · d468e477
      John Fastabend 提交于
      It is possible to build a plaintext buffer using push helper that is larger
      than the allocated encrypt buffer. When this record is pushed to crypto
      layers this can result in a NULL pointer dereference because the crypto
      API expects the encrypt buffer is large enough to fit the plaintext
      buffer. Kernel splat below.
      
      To resolve catch the cases this can happen and split the buffer into two
      records to send individually. Unfortunately, there is still one case to
      handle where the split creates a zero sized buffer. In this case we merge
      the buffers and unmark the split. This happens when apply is zero and user
      pushed data beyond encrypt buffer. This fixes the original case as well
      because the split allocated an encrypt buffer larger than the plaintext
      buffer and the merge simply moves the pointers around so we now have
      a reference to the new (larger) encrypt buffer.
      
      Perhaps its not ideal but it seems the best solution for a fixes branch
      and avoids handling these two cases, (a) apply that needs split and (b)
      non apply case. The are edge cases anyways so optimizing them seems not
      necessary unless someone wants later in next branches.
      
      [  306.719107] BUG: kernel NULL pointer dereference, address: 0000000000000008
      [...]
      [  306.747260] RIP: 0010:scatterwalk_copychunks+0x12f/0x1b0
      [...]
      [  306.770350] Call Trace:
      [  306.770956]  scatterwalk_map_and_copy+0x6c/0x80
      [  306.772026]  gcm_enc_copy_hash+0x4b/0x50
      [  306.772925]  gcm_hash_crypt_remain_continue+0xef/0x110
      [  306.774138]  gcm_hash_crypt_continue+0xa1/0xb0
      [  306.775103]  ? gcm_hash_crypt_continue+0xa1/0xb0
      [  306.776103]  gcm_hash_assoc_remain_continue+0x94/0xa0
      [  306.777170]  gcm_hash_assoc_continue+0x9d/0xb0
      [  306.778239]  gcm_hash_init_continue+0x8f/0xa0
      [  306.779121]  gcm_hash+0x73/0x80
      [  306.779762]  gcm_encrypt_continue+0x6d/0x80
      [  306.780582]  crypto_gcm_encrypt+0xcb/0xe0
      [  306.781474]  crypto_aead_encrypt+0x1f/0x30
      [  306.782353]  tls_push_record+0x3b9/0xb20 [tls]
      [  306.783314]  ? sk_psock_msg_verdict+0x199/0x300
      [  306.784287]  bpf_exec_tx_verdict+0x3f2/0x680 [tls]
      [  306.785357]  tls_sw_sendmsg+0x4a3/0x6a0 [tls]
      
      test_sockmap test signature to trigger bug,
      
      [TEST]: (1, 1, 1, sendmsg, pass,redir,start 1,end 2,pop (1,2),ktls,):
      
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-7-john.fastabend@gmail.com
      d468e477
    • J
      bpf: Sockmap/tls, msg_push_data may leave end mark in place · cf21e9ba
      John Fastabend 提交于
      Leaving an incorrect end mark in place when passing to crypto
      layer will cause crypto layer to stop processing data before
      all data is encrypted. To fix clear the end mark on push
      data instead of expecting users of the helper to clear the
      mark value after the fact.
      
      This happens when we push data into the middle of a skmsg and
      have room for it so we don't do a set of copies that already
      clear the end flag.
      
      Fixes: 6fff607e ("bpf: sk_msg program helper bpf_msg_push_data")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-6-john.fastabend@gmail.com
      cf21e9ba
    • J
      bpf: Sockmap, skmsg helper overestimates push, pull, and pop bounds · 6562e29c
      John Fastabend 提交于
      In the push, pull, and pop helpers operating on skmsg objects to make
      data writable or insert/remove data we use this bounds check to ensure
      specified data is valid,
      
       /* Bounds checks: start and pop must be inside message */
       if (start >= offset + l || last >= msg->sg.size)
           return -EINVAL;
      
      The problem here is offset has already included the length of the
      current element the 'l' above. So start could be past the end of
      the scatterlist element in the case where start also points into an
      offset on the last skmsg element.
      
      To fix do the accounting slightly different by adding the length of
      the previous entry to offset at the start of the iteration. And
      ensure its initialized to zero so that the first iteration does
      nothing.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Fixes: 6fff607e ("bpf: sk_msg program helper bpf_msg_push_data")
      Fixes: 7246d8ed ("bpf: helper to pop data from messages")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-5-john.fastabend@gmail.com
      6562e29c
    • J
      bpf: Sockmap/tls, push write_space updates through ulp updates · 33bfe20d
      John Fastabend 提交于
      When sockmap sock with TLS enabled is removed we cleanup bpf/psock state
      and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
      don't push the write_space callback up and instead simply overwrite the
      op with the psock stored previous op. This may or may not be correct so
      to ensure we don't overwrite the TLS write space hook pass this field to
      the ULP and have it fixup the ctx.
      
      This completes a previous fix that pushed the ops through to the ULP
      but at the time missed doing this for write_space, presumably because
      write_space TLS hook was added around the same time.
      
      Fixes: 95fa1454 ("bpf: sockmap/tls, close can race with map free")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com
      33bfe20d
    • J
      bpf: Sockmap, ensure sock lock held during tear down · 7e81a353
      John Fastabend 提交于
      The sock_map_free() and sock_hash_free() paths used to delete sockmap
      and sockhash maps walk the maps and destroy psock and bpf state associated
      with the socks in the map. When done the socks no longer have BPF programs
      attached and will function normally. This can happen while the socks in
      the map are still "live" meaning data may be sent/received during the walk.
      
      Currently, though we don't take the sock_lock when the psock and bpf state
      is removed through this path. Specifically, this means we can be writing
      into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
      while they are also being called from the networking side. This is not
      safe, we never used proper READ_ONCE/WRITE_ONCE semantics here if we
      believed it was safe. Further its not clear to me its even a good idea
      to try and do this on "live" sockets while networking side might also
      be using the socket. Instead of trying to reason about using the socks
      from both sides lets realize that every use case I'm aware of rarely
      deletes maps, in fact kubernetes/Cilium case builds map at init and
      never tears it down except on errors. So lets do the simple fix and
      grab sock lock.
      
      This patch wraps sock deletes from maps in sock lock and adds some
      annotations so we catch any other cases easier.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-3-john.fastabend@gmail.com
      7e81a353
    • J
      bpf: Sockmap/tls, during free we may call tcp_bpf_unhash() in loop · 4da6a196
      John Fastabend 提交于
      When a sockmap is free'd and a socket in the map is enabled with tls
      we tear down the bpf context on the socket, the psock struct and state,
      and then call tcp_update_ulp(). The tcp_update_ulp() call is to inform
      the tls stack it needs to update its saved sock ops so that when the tls
      socket is later destroyed it doesn't try to call the now destroyed psock
      hooks.
      
      This is about keeping stacked ULPs in good shape so they always have
      the right set of stacked ops.
      
      However, recently unhash() hook was removed from TLS side. But, the
      sockmap/bpf side is not doing any extra work to update the unhash op
      when is torn down instead expecting TLS side to manage it. So both
      TLS and sockmap believe the other side is managing the op and instead
      no one updates the hook so it continues to point at tcp_bpf_unhash().
      When unhash hook is called we call tcp_bpf_unhash() which detects the
      psock has already been destroyed and calls sk->sk_prot_unhash() which
      calls tcp_bpf_unhash() yet again and so on looping and hanging the core.
      
      To fix have sockmap tear down logic fixup the stale pointer.
      
      Fixes: 5d92e631 ("net/tls: partially revert fix transition through disconnect with close")
      Reported-by: syzbot+83979935eb6304f8cd46@syzkaller.appspotmail.com
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-2-john.fastabend@gmail.com
      4da6a196
    • D
      Merge branch 'stmmac-Fix-selftests-in-Synopsys-AXS101-board' · 567110f1
      David S. Miller 提交于
      Jose Abreu says:
      
      ====================
      net: stmmac: Fix selftests in Synopsys AXS101 board
      
      Set of fixes for sefltests so that they work in Synopsys AXS101 board.
      
      Final output:
      
      $ ethtool -t eth0
      The test result is PASS
      The test extra info:
       1. MAC Loopback                 0
       2. PHY Loopback                 -95
       3. MMC Counters                 0
       4. EEE                          -95
       5. Hash Filter MC               0
       6. Perfect Filter UC            0
       7. MC Filter                    0
       8. UC Filter                    0
       9. Flow Control                 -95
      10. RSS                          -95
      11. VLAN Filtering               -95
      12. VLAN Filtering (perf)        -95
      13. Double VLAN Filter           -95
      14. Double VLAN Filter (perf)    -95
      15. Flexible RX Parser           -95
      16. SA Insertion (desc)          -95
      17. SA Replacement (desc)        -95
      18. SA Insertion (reg)           -95
      19. SA Replacement (reg)         -95
      20. VLAN TX Insertion            -95
      21. SVLAN TX Insertion           -95
      22. L3 DA Filtering              -95
      23. L3 SA Filtering              -95
      24. L4 DA TCP Filtering          -95
      25. L4 SA TCP Filtering          -95
      26. L4 DA UDP Filtering          -95
      27. L4 SA UDP Filtering          -95
      28. ARP Offload                  -95
      29. Jumbo Frame                  0
      30. Multichannel Jumbo           -95
      31. Split Header                 -95
      
      Description:
      
      1) Fixes the unaligned accesses that caused CPU halt in Synopsys AXS101
      boards.
      
      2) Fixes the VLAN tests when filtering failed to work.
      
      3) Fixes the VLAN Perfect tests when filtering is not available in HW.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      567110f1
    • J
      net: stmmac: selftests: Guard VLAN Perfect test against non supported HW · 4eee13f1
      Jose Abreu 提交于
      When HW does not support perfect filtering the feature will not be
      enabled in the net_device. Add a check for this to prevent failures.
      
      Fixes: 1b2250a0 ("net: stmmac: selftests: Add tests for VLAN Perfect Filtering")
      Signed-off-by: NJose Abreu <Jose.Abreu@synopsys.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4eee13f1
    • J
      net: stmmac: selftests: Mark as fail when received VLAN ID != expected · d39b68e5
      Jose Abreu 提交于
      When the VLAN ID does not match the expected one it means filter failed
      in HW. Fix it.
      
      Fixes: 94e18382 ("net: stmmac: selftests: Add selftest for VLAN TX Offload")
      Signed-off-by: NJose Abreu <Jose.Abreu@synopsys.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d39b68e5
    • J
      net: stmmac: selftests: Make it work in Synopsys AXS101 boards · 0b9f932e
      Jose Abreu 提交于
      Synopsys AXS101 boards do not support unaligned memory loads or stores.
      Change the selftests mechanism to explicity:
      - Not add extra alignment in TX SKB
      - Use the unaligned version of ether_addr_equal()
      
      Fixes: 091810db ("net: stmmac: Introduce selftests support")
      Signed-off-by: NJose Abreu <Jose.Abreu@synopsys.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0b9f932e
    • C
      net/wan/fsl_ucc_hdlc: fix out of bounds write on array utdm_info · ddf42039
      Colin Ian King 提交于
      Array utdm_info is declared as an array of MAX_HDLC_NUM (4) elements
      however up to UCC_MAX_NUM (8) elements are potentially being written
      to it.  Currently we have an array out-of-bounds write error on the
      last 4 elements. Fix this by making utdm_info UCC_MAX_NUM elements in
      size.
      
      Addresses-Coverity: ("Out-of-bounds write")
      Fixes: c19b6d24 ("drivers/net: support hdlc function for QE-UCC")
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ddf42039
    • D
      Merge tag 'batadv-net-for-davem-20200114' of git://git.open-mesh.org/linux-merge · 5a40420e
      David S. Miller 提交于
      Simon Wunderlich says:
      
      ====================
      Here is a batman-adv bugfix:
      
       - Fix DAT candidate selection on little endian systems,
         by Sven Eckelmann
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a40420e
    • D
      bpf: Fix incorrect verifier simulation of ARSH under ALU32 · 0af2ffc9
      Daniel Borkmann 提交于
      Anatoly has been fuzzing with kBdysch harness and reported a hang in one
      of the outcomes:
      
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0
        0: (85) call bpf_get_socket_cookie#46
        1: R0_w=invP(id=0) R10=fp0
        1: (57) r0 &= 808464432
        2: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
        2: (14) w0 -= 810299440
        3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
        3: (c4) w0 s>>= 1
        4: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
        4: (76) if w0 s>= 0x30303030 goto pc+216
        221: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
        221: (95) exit
        processed 6 insns (limit 1000000) [...]
      
      Taking a closer look, the program was xlated as follows:
      
        # ./bpftool p d x i 12
        0: (85) call bpf_get_socket_cookie#7800896
        1: (bf) r6 = r0
        2: (57) r6 &= 808464432
        3: (14) w6 -= 810299440
        4: (c4) w6 s>>= 1
        5: (76) if w6 s>= 0x30303030 goto pc+216
        6: (05) goto pc-1
        7: (05) goto pc-1
        8: (05) goto pc-1
        [...]
        220: (05) goto pc-1
        221: (05) goto pc-1
        222: (95) exit
      
      Meaning, the visible effect is very similar to f54c7898 ("bpf: Fix
      precision tracking for unbounded scalars"), that is, the fall-through
      branch in the instruction 5 is considered to be never taken given the
      conclusion from the min/max bounds tracking in w6, and therefore the
      dead-code sanitation rewrites it as goto pc-1. However, real-life input
      disagrees with verification analysis since a soft-lockup was observed.
      
      The bug sits in the analysis of the ARSH. The definition is that we shift
      the target register value right by K bits through shifting in copies of
      its sign bit. In adjust_scalar_min_max_vals(), we do first coerce the
      register into 32 bit mode, same happens after simulating the operation.
      However, for the case of simulating the actual ARSH, we don't take the
      mode into account and act as if it's always 64 bit, but location of sign
      bit is different:
      
        dst_reg->smin_value >>= umin_val;
        dst_reg->smax_value >>= umin_val;
        dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val);
      
      Consider an unknown R0 where bpf_get_socket_cookie() (or others) would
      for example return 0xffff. With the above ARSH simulation, we'd see the
      following results:
      
        [...]
        1: R1=ctx(id=0,off=0,imm=0) R2_w=invP65535 R10=fp0
        1: (85) call bpf_get_socket_cookie#46
        2: R0_w=invP(id=0) R10=fp0
        2: (57) r0 &= 808464432
          -> R0_runtime = 0x3030
        3: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
        3: (14) w0 -= 810299440
          -> R0_runtime = 0xcfb40000
        4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
                                    (0xffffffff)
        4: (c4) w0 s>>= 1
          -> R0_runtime = 0xe7da0000
        5: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
                                    (0x67c00000)           (0x7ffbfff8)
        [...]
      
      In insn 3, we have a runtime value of 0xcfb40000, which is '1100 1111 1011
      0100 0000 0000 0000 0000', the result after the shift has 0xe7da0000 that
      is '1110 0111 1101 1010 0000 0000 0000 0000', where the sign bit is correctly
      retained in 32 bit mode. In insn4, the umax was 0xffffffff, and changed into
      0x7ffbfff8 after the shift, that is, '0111 1111 1111 1011 1111 1111 1111 1000'
      and means here that the simulation didn't retain the sign bit. With above
      logic, the updates happen on the 64 bit min/max bounds and given we coerced
      the register, the sign bits of the bounds are cleared as well, meaning, we
      need to force the simulation into s32 space for 32 bit alu mode.
      
      Verification after the fix below. We're first analyzing the fall-through branch
      on 32 bit signed >= test eventually leading to rejection of the program in this
      specific case:
      
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0
        0: (b7) r2 = 808464432
        1: R1=ctx(id=0,off=0,imm=0) R2_w=invP808464432 R10=fp0
        1: (85) call bpf_get_socket_cookie#46
        2: R0_w=invP(id=0) R10=fp0
        2: (bf) r6 = r0
        3: R0_w=invP(id=0) R6_w=invP(id=0) R10=fp0
        3: (57) r6 &= 808464432
        4: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
        4: (14) w6 -= 810299440
        5: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
        5: (c4) w6 s>>= 1
        6: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
                                                    (0x67c00000)          (0xfffbfff8)
        6: (76) if w6 s>= 0x30303030 goto pc+216
        7: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
        7: (30) r0 = *(u8 *)skb[808464432]
        BPF_LD_[ABS|IND] uses reserved fields
        processed 8 insns (limit 1000000) [...]
      
      Fixes: 9cbe1f5a ("bpf/verifier: improve register value range tracking with ARSH")
      Reported-by: NAnatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115204733.16648-1-daniel@iogearbox.net
      0af2ffc9
    • M
      hv_netvsc: Fix memory leak when removing rndis device · 536dc5df
      Mohammed Gamal 提交于
      kmemleak detects the following memory leak when hot removing
      a network device:
      
      unreferenced object 0xffff888083f63600 (size 256):
        comm "kworker/0:1", pid 12, jiffies 4294831717 (age 1113.676s)
        hex dump (first 32 bytes):
          00 40 c7 33 80 88 ff ff 00 00 00 00 10 00 00 00  .@.3............
          00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
        backtrace:
          [<00000000d4a8f5be>] rndis_filter_device_add+0x117/0x11c0 [hv_netvsc]
          [<000000009c02d75b>] netvsc_probe+0x5e7/0xbf0 [hv_netvsc]
          [<00000000ddafce23>] vmbus_probe+0x74/0x170 [hv_vmbus]
          [<00000000046e64f1>] really_probe+0x22f/0xb50
          [<000000005cc35eb7>] driver_probe_device+0x25e/0x370
          [<0000000043c642b2>] bus_for_each_drv+0x11f/0x1b0
          [<000000005e3d09f0>] __device_attach+0x1c6/0x2f0
          [<00000000a72c362f>] bus_probe_device+0x1a6/0x260
          [<0000000008478399>] device_add+0x10a3/0x18e0
          [<00000000cf07b48c>] vmbus_device_register+0xe7/0x1e0 [hv_vmbus]
          [<00000000d46cf032>] vmbus_add_channel_work+0x8ab/0x1770 [hv_vmbus]
          [<000000002c94bb64>] process_one_work+0x919/0x17d0
          [<0000000096de6781>] worker_thread+0x87/0xb40
          [<00000000fbe7397e>] kthread+0x333/0x3f0
          [<000000004f844269>] ret_from_fork+0x3a/0x50
      
      rndis_filter_device_add() allocates an instance of struct rndis_device
      which never gets deallocated as rndis_filter_device_remove() sets
      net_device->extension which points to the rndis_device struct to NULL,
      leaving the rndis_device dangling.
      
      Since net_device->extension is eventually freed in free_netvsc_device(),
      we refrain from setting it to NULL inside rndis_filter_device_remove()
      Signed-off-by: NMohammed Gamal <mgamal@redhat.com>
      Reviewed-by: NHaiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      536dc5df
    • P
      tcp: fix marked lost packets not being retransmitted · e176b1ba
      Pengcheng Yang 提交于
      When the packet pointed to by retransmit_skb_hint is unlinked by ACK,
      retransmit_skb_hint will be set to NULL in tcp_clean_rtx_queue().
      If packet loss is detected at this time, retransmit_skb_hint will be set
      to point to the current packet loss in tcp_verify_retransmit_hint(),
      then the packets that were previously marked lost but not retransmitted
      due to the restriction of cwnd will be skipped and cannot be
      retransmitted.
      
      To fix this, when retransmit_skb_hint is NULL, retransmit_skb_hint can
      be reset only after all marked lost packets are retransmitted
      (retrans_out >= lost_out), otherwise we need to traverse from
      tcp_rtx_queue_head in tcp_xmit_retransmit_queue().
      
      Packetdrill to demonstrate:
      
      // Disable RACK and set max_reordering to keep things simple
          0 `sysctl -q net.ipv4.tcp_recovery=0`
         +0 `sysctl -q net.ipv4.tcp_max_reordering=3`
      
      // Establish a connection
         +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
        +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
       +.01 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
      // Send 8 data segments
         +0 write(4, ..., 8000) = 8000
         +0 > P. 1:8001(8000) ack 1
      
      // Enter recovery and 1:3001 is marked lost
       +.01 < . 1:1(0) ack 1 win 257 <sack 3001:4001,nop,nop>
         +0 < . 1:1(0) ack 1 win 257 <sack 5001:6001 3001:4001,nop,nop>
         +0 < . 1:1(0) ack 1 win 257 <sack 5001:7001 3001:4001,nop,nop>
      
      // Retransmit 1:1001, now retransmit_skb_hint points to 1001:2001
         +0 > . 1:1001(1000) ack 1
      
      // 1001:2001 was ACKed causing retransmit_skb_hint to be set to NULL
       +.01 < . 1:1(0) ack 2001 win 257 <sack 5001:8001 3001:4001,nop,nop>
      // Now retransmit_skb_hint points to 4001:5001 which is now marked lost
      
      // BUG: 2001:3001 was not retransmitted
         +0 > . 2001:3001(1000) ack 1
      Signed-off-by: NPengcheng Yang <yangpc@wangsu.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e176b1ba
  2. 15 1月, 2020 19 次提交