1. 23 Jun 2022, 1 commit
  2. 12 Apr 2022, 1 commit
    • net: remove noblock parameter from recvmsg() entities · ec095263
      Committed by Oliver Hartkopp
      The internal recvmsg() functions have two parameters, 'flags' and 'noblock',
      that were merged inside skb_recv_datagram(). As a follow-up to commit
      f4b41f06 ("net: remove noblock parameter from skb_recv_datagram()"),
      this patch removes the separate 'noblock' parameter for recvmsg().
      
      Analogous to the referenced patch for skb_recv_datagram(), the 'flags' and
      'noblock' parameters are unnecessarily split up with e.g.
      
      err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
                                 flags & ~MSG_DONTWAIT, &addr_len);
      
      or in
      
      err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
                            sk, msg, size, flags & MSG_DONTWAIT,
                            flags & ~MSG_DONTWAIT, &addr_len);
      
      instead of simply using only 'flags' all the time and checking for
      MSG_DONTWAIT where needed (to preserve the formerly separate no(n)block
      condition).
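      
      With the patch applied, call sites pass 'flags' straight through and the
      blocking behavior is derived locally. A minimal sketch of the resulting
      pattern (the callee-side lines are illustrative of where MSG_DONTWAIT is
      checked, not the verbatim patch):
      
      err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
                            sk, msg, size, flags, &addr_len);
      
      /* ...and inside a recvmsg() implementation, where blocking matters: */
      timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
      skb = skb_recv_datagram(sk, flags, &err);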
      Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  3. 15 Mar 2022, 2 commits
    • bpf, sockmap: Fix double uncharge the mem of sk_msg · 2486ab43
      Committed by Wang Yufen
      If tcp_bpf_sendmsg is running during a tear down operation, psock may be
      freed.
      
      tcp_bpf_sendmsg()
       tcp_bpf_send_verdict()
        sk_msg_return()
        tcp_bpf_sendmsg_redir()
         unlikely(!psock)
           sk_msg_free()
      
      The memory of the msg has been uncharged in tcp_bpf_send_verdict() by
      sk_msg_return(), and would be uncharged again by sk_msg_free(). When psock
      is NULL, we can simply return an error code; this then triggers
      sk_msg_free_nocharge() in the error path of __SK_REDIRECT and has the side
      effect of throwing an error up to user space. This is a slight change in
      behavior from the user side, but it looks the same as an error returned
      when the redirect on the socket throws an error.
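      
      A minimal sketch of the shape of the fix in tcp_bpf_sendmsg_redir() (the
      exact error code is an assumption based on the description above):
      
      struct sk_psock *psock = sk_psock_get(sk);
      
      if (unlikely(!psock))
              return -EPIPE;  /* was: sk_msg_free(sk, msg); the double uncharge */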
      
      This issue can trigger the following warning:
      WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260
      Call Trace:
       <TASK>
       __sk_destruct+0x24/0x1f0
       sk_psock_destroy+0x19b/0x1c0
       process_one_work+0x1b3/0x3c0
       worker_thread+0x30/0x350
       ? process_one_work+0x3c0/0x3c0
       kthread+0xe6/0x110
       ? kthread_complete_and_exit+0x20/0x20
       ret_from_fork+0x22/0x30
       </TASK>
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: Wang Yufen <wangyufen@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20220304081145.2037182-5-wangyufen@huawei.com
    • bpf, sockmap: Fix more uncharged while msg has more_data · 84472b43
      Committed by Wang Yufen
      In tcp_bpf_send_verdict(), if msg has more data after
      tcp_bpf_sendmsg_redir():
      
      tcp_bpf_send_verdict()
       tosend = msg->sg.size  //msg->sg.size = 22220
       case __SK_REDIRECT:
        sk_msg_return()  //uncharged msg->sg.size (22220) to sk->sk_forward_alloc
        tcp_bpf_sendmsg_redir() //after tcp_bpf_sendmsg_redir, msg->sg.size=11000
       goto more_data;
       tosend = msg->sg.size  //msg->sg.size = 11000
       case __SK_REDIRECT:
        sk_msg_return()  //uncharged msg->sg.size(11000) to sk->sk_forward_alloc
      
      The msg->sg.size (11000) has been uncharged twice. To fix this, we can
      charge the remaining msg->sg.size back before the goto more_data.
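      
      A sketch of the fix in tcp_bpf_send_verdict(), assuming sk_mem_charge()
      is the primitive used to charge the remaining bytes back (surrounding
      context abbreviated):
      
      if (msg &&
          msg->sg.data[msg->sg.start].page_link &&
          msg->sg.data[msg->sg.start].length) {
              if (eval == __SK_REDIRECT)
                      sk_mem_charge(sk, msg->sg.size); /* balance the next sk_msg_return() */
              goto more_data;
      }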
      
      This issue can trigger the following warnings:
      WARNING: CPU: 0 PID: 9860 at net/core/stream.c:208 sk_stream_kill_queues+0xd4/0x1a0
      Call Trace:
       <TASK>
       inet_csk_destroy_sock+0x55/0x110
       __tcp_close+0x279/0x470
       tcp_close+0x1f/0x60
       inet_release+0x3f/0x80
       __sock_release+0x3d/0xb0
       sock_close+0x11/0x20
       __fput+0x92/0x250
       task_work_run+0x6a/0xa0
       do_exit+0x33b/0xb60
       do_group_exit+0x2f/0xa0
       get_signal+0xb6/0x950
       arch_do_signal_or_restart+0xac/0x2a0
       ? vfs_write+0x237/0x290
       exit_to_user_mode_prepare+0xa9/0x200
       syscall_exit_to_user_mode+0x12/0x30
       do_syscall_64+0x46/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
       </TASK>
      
      WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260
      Call Trace:
       <TASK>
       __sk_destruct+0x24/0x1f0
       sk_psock_destroy+0x19b/0x1c0
       process_one_work+0x1b3/0x3c0
       worker_thread+0x30/0x350
       ? process_one_work+0x3c0/0x3c0
       kthread+0xe6/0x110
       ? kthread_complete_and_exit+0x20/0x20
       ret_from_fork+0x22/0x30
       </TASK>
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: Wang Yufen <wangyufen@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20220304081145.2037182-4-wangyufen@huawei.com
  4. 06 Jan 2022, 1 commit
    • bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser() · 5b2c5540
      Committed by John Fastabend
      Applications can be slightly confused because we do not always return the
      error code they expect, i.e. what the TCP stack normally returns. For
      example, on a socket error (sk->sk_err) we return EAGAIN instead of the
      sock_error(). This usually means the application will 'try again' instead
      of aborting immediately. Another example: when a shutdown event is
      received, we should abort immediately instead of waiting for data when the
      user provides a timeout.
      
      These issues tend not to be fatal and applications usually recover, but
      they introduce bogus errors to the user or add unexpected latency. Before
      commit 'c5d2177a' we fell back to the TCP stack when no data was
      available, so we managed to catch many of these cases, although at the
      extra latency cost of calling tcp_msg_wait_data() first.
      
      To fix this, let's duplicate the error handling of the TCP stack into
      tcp_bpf so that we return the same error codes.
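      
      A sketch of the duplicated error handling on the !copied path of
      tcp_bpf_recvmsg_parser(), mirroring what tcp_recvmsg() does (a
      reconstruction from the description here, not the verbatim patch):
      
      if (!copied) {
              if (sock_flag(sk, SOCK_DONE))
                      goto out;
              if (sk->sk_err) {
                      copied = sock_error(sk);
                      goto out;
              }
              if (sk->sk_shutdown & RCV_SHUTDOWN)
                      goto out;
              if (sk->sk_state == TCP_CLOSE) {
                      copied = -ENOTCONN;
                      goto out;
              }
              if (!sock_rcvtimeo(sk, flags & MSG_DONTWAIT)) {
                      copied = -EAGAIN;
                      goto out;
              }
      }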
      
      These were found in our CI tests, which run applications against sockmap
      and do longer-lived testing, at least compared to test_sockmap's
      short-lived ping/pong tests, and in some of the test clusters we deploy.
      
      It's non-trivial to reproduce these in the shorter-form CI tests that
      would be appropriate for BPF selftests, but we are looking into it so we
      can ensure this keeps working going forward. As a preview, one idea is to
      pull in the packetdrill testing, which catches some of this.
      
      Fixes: c5d2177a ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220104205918.286416-1-john.fastabend@gmail.com
  5. 09 Nov 2021, 2 commits
    • bpf, sockmap: Fix race in ingress receive verdict with redirect to self · c5d2177a
      Committed by John Fastabend
      A socket in a sockmap may have different combinations of programs attached
      depending on configuration. There can be no programs, in which case the
      socket acts only as a sink. There can be a TX program only, in which case
      a BPF program is attached to the sending side but no RX program. There can
      be an RX program only, where sends have no BPF program attached but
      receives are hooked with BPF. And finally, both TX and RX programs may be
      attached, giving us the permutations:
      
       None, Tx, Rx, and TxRx
      
      To date, most of our use cases have been the TX case, used as a fast
      datapath to copy directly between a local application and a userspace
      proxy, or the Rx and TxRx cases operating an in-kernel proxy. The traffic
      in the first case, where we hook applications into a userspace proxy,
      looks like this:
      
        AppA  redirect   AppB
         Tx <-----------> Rx
         |                |
         +                +
         TCP <--> lo <--> TCP
      
      In this case all traffic from AppA (after the 3WHS) is copied into the
      AppB ingress queue and no traffic is ever on the TCP receive_queue.
      
      In the second case the application never receives traffic on the actual
      user space socket, except in some rare error cases; instead the send
      happens in the kernel.
      
                 AppProxy       socket pool
             sk0 ------------->{sk1,sk2, skn}
              ^                      |
              |                      |
              |                      v
             ingress              lb egress
             TCP                  TCP
      
      Here, because traffic is never read off the socket with the userspace
      recv() APIs, there is only ever one reader of the sk receive_queue,
      namely the BPF programs.
      
      However, we've started to introduce a third configuration where the BPF program
      on receive should process the data, but then the normal case is to push the
      data into the receive queue of AppB.
      
             AppB
             recv()                (userspace)
           -----------------------
             tcp_bpf_recvmsg()     (kernel)
               |             |
               |             |
               |             |
             ingress_msgQ    |
               |             |
             RX_BPF          |
               |             |
               v             v
             sk->receive_queue
      
      This is different from the App{A,B} redirect because traffic is first received
      on the sk->receive_queue.
      
      Now for the issue. The tcp_bpf_recvmsg() handler first checks the
      ingress_msg queue for any data handled by the BPF rx program and returned
      with a PASS code, so that it was enqueued on the ingress msg queue. If no
      data exists on that queue it then checks the socket receive queue.
      Unfortunately, this is the same receive_queue the BPF program is reading
      data off of, so we get a race: it's possible for the recvmsg() hook to
      pull data off the receive_queue before the BPF hook has a chance to read
      it. This typically happens when an application is banging on recv() and
      getting EAGAINs, until it manages to race with the RX BPF program.
      
      To fix this, note that before this patch, at attach time when the socket
      is loaded into the map, we check whether it needs a TX program or just
      the base set of proto bpf hooks, and then use the above general RX hook
      regardless of whether a BPF program is attached at rx or not. This patch
      extends that check to handle all cases enumerated above: TX, RX, TXRX,
      and none. To fix the above race when an RX program is attached, we use a
      new hook that is nearly identical to the old one, except that we no
      longer let the recv() call skip the RX BPF program. Now only the BPF
      program pulls data from sk->receive_queue, and recv() only pulls data
      from the ingress msgQ after BPF program handling.
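      
      A sketch of the extended attach-time selection in tcp_bpf_update_proto(),
      with the TCP_BPF_RX/TCP_BPF_TXRX naming assumed from the permutations
      above (a reconstruction, not the verbatim patch):
      
      int config = psock->progs.msg_parser ? TCP_BPF_TX : TCP_BPF_BASE;
      
      /* With an RX verdict program attached, install the proto variant whose
       * recvmsg() never reads sk->sk_receive_queue directly and instead only
       * drains the ingress msg queue.
       */
      if (psock->progs.stream_verdict || psock->progs.skb_verdict)
              config = (config == TCP_BPF_TX) ? TCP_BPF_TXRX : TCP_BPF_RX;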
      
      With this resolved, our AppB from above has been up and running for many
      hours without detecting any errors. We verify this by correlating
      counters in the RX BPF events and in AppB to ensure data never skips the
      BPF program. Selftests were not able to detect this because we only run
      them for a short period of time on well-ordered send/recvs, so we don't
      get any of the noise we see in real application environments.
      
      Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Jussi Maki <joamaki@gmail.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
    • bpf, sockmap: Remove unhash handler for BPF sockmap usage · b8b8315e
      Committed by John Fastabend
      We do not need to handle unhash from the BPF side; we can simply wait for
      the close to happen. The original concern was that a socket could
      transition from ESTABLISHED state to a new state while the BPF hook was
      still attached. But we convinced ourselves this is no longer possible,
      and we also improved BPF sockmap to handle listen sockets, so this is no
      longer a problem.
      
      More importantly, though, there are cases where unhash is called while
      data is in the receive queue. The BPF unhash logic will flush this data,
      which is wrong. To be correct it should keep the data in the receive
      queue and allow a receiving application to continue reading it. This may
      happen when tcp_abort() is received, for example. Instead of complicating
      the unhash logic, simply moving all of this to the tcp_close() hook
      solves it.
      
      Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Jussi Maki <joamaki@gmail.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20211103204736.248403-3-john.fastabend@gmail.com
  6. 27 Oct 2021, 3 commits
  7. 16 Jul 2021, 1 commit
  8. 21 Jun 2021, 1 commit
  9. 18 May 2021, 1 commit
  10. 12 Apr 2021, 1 commit
  11. 02 Apr 2021, 3 commits
  12. 27 Feb 2021, 1 commit
  13. 18 Nov 2020, 2 commits
  14. 22 Aug 2020, 1 commit
  15. 13 Jun 2020, 1 commit
  16. 10 Jun 2020, 1 commit
    • bpf/sockmap: Fix kernel panic at __tcp_bpf_recvmsg · 487082fb
      Committed by dihu
      When a user application calls read() with the MSG_PEEK flag to read data
      from a bpf sockmap socket, a kernel panic happens at
      __tcp_bpf_recvmsg+0x12c/0x350. The sk_msg is not removed from the
      ingress_msg queue after being read out while MSG_PEEK is set. Because the
      code does not check whether the sk_msg is the last msg of the ingress_msg
      queue, the next sk_msg may be the head of the ingress_msg queue, whose sg
      page memory address is invalid. So we need to add a check to prevent
      this.
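      
      A sketch of the added check on the MSG_PEEK path of __tcp_bpf_recvmsg()
      (a reconstruction from the description above):
      
      if (unlikely(peek)) {
              /* Stop at the tail; list_next_entry() would otherwise hand us
               * the list head, which is not a valid sk_msg.
               */
              if (msg_rx == list_last_entry(&psock->ingress_msg,
                                            struct sk_msg, list))
                      break;
              msg_rx = list_next_entry(msg_rx, list);
              continue;
      }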
      
      [20759.125457] BUG: kernel NULL pointer dereference, address: 0000000000000008
      [20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G            E     5.4.32 #1
      [20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.12 06/18/2017
      [20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
      [20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
      [20759.276099] tcp_bpf_recvmsg+0x113/0x370
      [20759.281137] inet_recvmsg+0x55/0xc0
      [20759.285734] __sys_recvfrom+0xc8/0x130
      [20759.290566] ? __audit_syscall_entry+0x103/0x130
      [20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
      [20759.301700] ? __audit_syscall_exit+0x1e4/0x290
      [20759.307235] __x64_sys_recvfrom+0x24/0x30
      [20759.312226] do_syscall_64+0x55/0x1b0
      [20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: dihu <anny.hu@linux.alibaba.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20200605084625.9783-1-anny.hu@linux.alibaba.com
  17. 06 May 2020, 1 commit
    • bpf, sockmap: bpf_tcp_ingress needs to subtract bytes from sg.size · 81aabbb9
      Committed by John Fastabend
      In bpf_tcp_ingress we used apply_bytes to subtract bytes from sg.size,
      which is used to track the total bytes in a message. But this is not
      correct, because apply_bytes is itself modified in the main loop doing
      the mem_charge.
      
      At the end of this we have sg.size incorrectly set and out of sync with
      the actual sk values. Then we can get a splat if we try to cork the data
      later and again try to redirect the msg to ingress. To fix this, instead
      of trying to track msg.size, do the easy thing and include it as part of
      the sk_msg_xfer logic, so that when the msg is moved the sg.size is
      always correct.
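      
      A sketch of folding the size accounting into sk_msg_xfer() so source and
      destination stay in sync as bytes move (a reconstruction based on the
      description; the field layout is assumed from the surrounding sk_msg
      usage):
      
      static inline void sk_msg_xfer(struct sk_msg *dst, struct sk_msg *src,
                                     int which, u32 size)
      {
              dst->sg.data[which] = src->sg.data[which];
              dst->sg.data[which].length = size;
              dst->sg.size += size;
              src->sg.size -= size; /* keep the source msg consistent */
      }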
      
      To reproduce the splat below, users will need ingress + cork and to hit
      an error path that then tries to 'free' the skmsg.
      
      [  173.699981] BUG: KASAN: null-ptr-deref in sk_msg_free_elem+0xdd/0x120
      [  173.699987] Read of size 8 at addr 0000000000000008 by task test_sockmap/5317
      
      [  173.700000] CPU: 2 PID: 5317 Comm: test_sockmap Tainted: G          I       5.7.0-rc1+ #43
      [  173.700005] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
      [  173.700009] Call Trace:
      [  173.700021]  dump_stack+0x8e/0xcb
      [  173.700029]  ? sk_msg_free_elem+0xdd/0x120
      [  173.700034]  ? sk_msg_free_elem+0xdd/0x120
      [  173.700042]  __kasan_report+0x102/0x15f
      [  173.700052]  ? sk_msg_free_elem+0xdd/0x120
      [  173.700060]  kasan_report+0x32/0x50
      [  173.700070]  sk_msg_free_elem+0xdd/0x120
      [  173.700080]  __sk_msg_free+0x87/0x150
      [  173.700094]  tcp_bpf_send_verdict+0x179/0x4f0
      [  173.700109]  tcp_bpf_sendpage+0x3ce/0x5d0
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/158861290407.14306.5327773422227552482.stgit@john-Precision-5820-Tower
  18. 28 Apr 2020, 1 commit
  19. 20 Mar 2020, 2 commits
  20. 10 Mar 2020, 3 commits
  21. 22 Feb 2020, 2 commits
  22. 16 Jan 2020, 1 commit
    • bpf: Sockmap/tls, fix pop data with SK_DROP return code · 7361d448
      Committed by John Fastabend
      When the user returns SK_DROP we need to reset the number of copied bytes
      to indicate to the user that the bytes were dropped and not sent. If we
      don't reset the copied arg, sendmsg will return as if those bytes were
      copied, giving the user a positive return value.
      
      This works as expected today except in the case where the user also pops
      bytes. In the pop case the sg.size is reduced, but we don't correctly
      account for this when the copied bytes are reset. The popped bytes are
      not accounted for and we return a small positive value, potentially
      confusing the user.
      
      This happens because of a typo: we do the wrong comparison when
      accounting for popped bytes. In this fix, notice the if/else is not
      needed. We have a similar problem if we push data, except it's not
      visible to the user, because if delta is larger than sg.size we return a
      negative value, so it appears as an error regardless.
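      
      A sketch of the corrected accounting in tcp_bpf_send_verdict(), with the
      if/else dropped and the popped delta folded into copied (a reconstruction
      from the description above, not the verbatim patch):
      
      if (psock->eval == __SK_NONE) {
              delta = msg->sg.size;
              psock->eval = sk_psock_msg_verdict(sk, psock, msg);
              delta -= msg->sg.size; /* bytes popped (or pushed) by the verdict prog */
      }
      ...
      case __SK_DROP:
      default:
              sk_msg_free_partial(sk, msg, tosend);
              sk_msg_apply_bytes(psock, tosend);
              *copied -= (tosend + delta); /* report popped bytes as dropped too */
              return -EACCES;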
      
      Fixes: 7246d8ed ("bpf: helper to pop data from messages")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-9-john.fastabend@gmail.com
  23. 10 Jan 2020, 1 commit
  24. 29 Nov 2019, 1 commit
    • net: skmsg: fix TLS 1.3 crash with full sk_msg · 031097d9
      Committed by Jakub Kicinski
      TLS 1.3 started using the entry at the end of the SG array
      for chaining-in the single byte content type entry. This mostly
      works:
      
      [ E E E E E E . . ]
        ^           ^
         start       end
      
                       E < content type
                     /
      [ E E E E E E C . ]
        ^           ^
         start       end
      
      (Where E denotes a populated SG entry; C denotes a chaining entry.)
      
      If the array is full, however, the end will point to the start:
      
      [ E E E E E E E E ]
        ^
         start
         end
      
      And we end up overwriting the start:
      
          E < content type
         /
      [ C E E E E E E E ]
        ^
         start
         end
      
      The sg array is supposed to be a circular buffer with start and end
      markers pointing anywhere. In the case where start > end (i.e. the
      circular buffer has "wrapped") there is an extra entry reserved at the
      end to chain the two halves together.
      
      [ E E E E E E . . l ]
      
      (Where l is the reserved entry for "looping" back to the front.)
      
      As suggested by John, let's reserve another entry for chaining
      SG entries after the main circular buffer. Note that this entry
      has to be pointed to by the end entry so its position is not fixed.
      
      Examples of full messages:
      
      [ E E E E E E E E . l ]
        ^               ^
         start           end
      
         <---------------.
      [ E E . E E E E E E l ]
            ^ ^
         end   start
      
      Now the end will always point to an unused entry, so TLS 1.3
      can always use it.
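      
      In terms of the data structure, the result is two entries past the main
      ring. A sketch of the layout (other sk_msg_sg fields elided; the +2 is
      the point):
      
      struct sk_msg_sg {
              u32                     start;
              u32                     end;
              /* ... accounting fields elided ... */
              /* +2: one entry to chain the two halves of a wrapped ring,
               * and one, pointed at by 'end', to chain entries (e.g. the
               * TLS 1.3 content type) after the message.
               */
              struct scatterlist      data[MAX_MSG_FRAGS + 2];
      };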
      
      Fixes: 130b392c ("net: tls: Add tls 1.3 support")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 09 Aug 2019, 1 commit
    • net/tls: prevent skb_orphan() from leaking TLS plain text with offload · 41477662
      Committed by Jakub Kicinski
      sk_validate_xmit_skb() and drivers depend on the sk member of
      struct sk_buff to identify segments requiring encryption.
      Any operation which removes or does not preserve the original TLS
      socket such as skb_orphan() or skb_clone() will cause clear text
      leaks.
      
      Make the TCP socket underlying an offloaded TLS connection
      mark all skbs as decrypted, if TLS TX is in offload mode.
      Then in sk_validate_xmit_skb() catch skbs which have no socket
      (or a socket with no validation) and decrypted flag set.
      
      Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
      sk->sk_validate_xmit_skb are slightly interchangeable right now,
      they all imply TLS offload. The new checks are guarded by
      CONFIG_TLS_DEVICE because that's the option guarding the
      sk_buff->decrypted member.
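      
      A sketch of the catch in sk_validate_xmit_skb() (a reconstruction from
      the description; the exact drop path is an assumption):
      
      static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
                                                         struct net_device *dev)
      {
              if (skb->sk && sk_fullsock(skb->sk) && skb->sk->sk_validate_xmit_skb)
                      skb = skb->sk->sk_validate_xmit_skb(skb->sk, dev, skb);
      #ifdef CONFIG_TLS_DEVICE
              else if (unlikely(skb->decrypted)) {
                      /* Plain text queued by TLS offload must never leave
                       * without its socket; drop instead of sending in the clear.
                       */
                      kfree_skb(skb);
                      skb = NULL;
              }
      #endif
              return skb;
      }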
      
      A second, smaller issue with orphaning is that it breaks the guarantee
      that packets will be delivered to device queues in order. All TLS offload
      drivers depend on that scheduling property. This means
      skb_orphan_partial()'s trick of preserving partial socket references will
      cause issues in the drivers. We need a full orphan, and as a result netem
      delay/throttling will cause all TLS offload skbs to be dropped.
      
      Reusing the sk_buff->decrypted flag also protects from leaking clear text
      when an incoming, decrypted skb is redirected (e.g. by TC).
      
      See commit 0608c69c ("bpf: sk_msg, sock{map|hash} redirect
      through ULP") for justification why the internal flag is safe.
      The only location which could leak the flag in is tcp_bpf_sendmsg(),
      which is taken care of by clearing the previously unused bit.
      
      v2:
       - remove superfluous decrypted mark copy (Willem);
       - remove the stale doc entry (Boris);
       - rely entirely on EOR marking to prevent coalescing (Boris);
       - use an internal sendpages flag instead of marking the socket
         (Boris).
      v3 (Willem):
       - reorganize the can_skb_orphan_partial() condition;
       - fix the flag leak-in through tcp_bpf_sendmsg.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 16 May 2019, 1 commit
  27. 14 May 2019, 1 commit
    • bpf: sockmap remove duplicate queue free · c42253cc
      Committed by John Fastabend
      In tcp bpf remove we free the cork list and purge the ingress msg list.
      However, we do this before the ref count reaches zero, so it could be
      possible that some other access is in progress. In this case (tcp close
      and/or tcp_unhash) we happen to also hold the sock lock, so no such path
      exists, but let's fix it anyway; otherwise it is extremely fragile and
      breaks the reference counting rules. Also, we already check the cork list
      and ingress msg queue and free them once the ref count reaches zero, so
      it's wasteful to check twice.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  28. 21 Dec 2018, 2 commits
    • bpf: sk_msg, sock{map|hash} redirect through ULP · 0608c69c
      Committed by John Fastabend
      A sockmap program that redirects through a kTLS ULP enabled socket
      will not work correctly because the ULP layer is skipped. This
      fixes the behavior to call through the ULP layer on redirect to
      ensure any operations required on the data stream at the ULP layer
      continue to be applied.
      
      To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
      calling the BPF layer on a redirected message. This is
      required to avoid calling the BPF layer multiple times (possibly
      recursively) which is not the current/expected behavior without
      ULPs. In the future we may add a redirect flag if users _do_
      want the policy applied again but this would need to work for both
      ULP and non-ULP sockets and be opt-in to avoid breaking existing
      programs.
      
      Also, to avoid polluting the flag space with an internal flag, we reuse
      the flag space by overlapping MSG_SENDPAGE_NOPOLICY with MSG_WAITFORONE.
      Here WAITFORONE is specific to the recv path and SENDPAGE_NOPOLICY is
      only used for sendpage hooks. The last thing to verify is that the user
      space API is masked correctly to ensure the flag cannot be set by the
      user. (Note this needs to be true regardless, because we already have
      internal flags in use that user space should not be able to set.) For
      completeness, we check both UAPI paths into sendpage: sendfile and
      splice.
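      
      Concretely, the overlap would look like this in include/linux/socket.h
      (a sketch; the 0x10000 bit value of MSG_WAITFORONE is an assumption):
      
      #define MSG_WAITFORONE        0x10000 /* recvmsg(): block until 1+ packets avail */
      #define MSG_SENDPAGE_NOPOLICY 0x10000 /* sendpage() internal: do not apply policy */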
      
      In the sendfile case the function do_sendfile() zeroes the flags,
      
      ./fs/read_write.c:
       static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
      		   	    size_t count, loff_t max)
       {
         ...
         fl = 0;
      #if 0
         /*
          * We need to debate whether we can enable this or not. The
          * man page documents EAGAIN return for the output at least,
          * and the application is arguably buggy if it doesn't expect
          * EAGAIN on a non-blocking file descriptor.
          */
          if (in.file->f_flags & O_NONBLOCK)
      	fl = SPLICE_F_NONBLOCK;
      #endif
          file_start_write(out.file);
          retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
       }
      
      In the splice case the pipe_to_sendpage "actor" is used which
      masks flags with SPLICE_F_MORE.
      
      ./fs/splice.c:
       static int pipe_to_sendpage(struct pipe_inode_info *pipe,
      			    struct pipe_buffer *buf, struct splice_desc *sd)
       {
         ...
         more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
         ...
       }
      
      This confirms what we expect: internal flags are in fact internal to the
      socket side.
      
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sk_msg, fix socket data_ready events · 552de910
      Committed by John Fastabend
      When an skb verdict program is in use and either another BPF program
      redirects to that socket or the new SK_PASS support is used, the
      data_ready callback does not wake up the application. Instead, because
      the stream parser/verdict is using the sk data_ready callback, we wake up
      the stream parser/verdict block.
      
      Fix this by adding a helper that checks whether the stream parser block
      is enabled on the sk and, if so, calls the saved pointer, which is the
      upper layer's wake-up function.
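      
      A sketch of the helper (a reconstruction; the parser field names follow
      the strparser naming used here and are an assumption):
      
      static inline void sk_psock_data_ready(struct sock *sk, struct sk_psock *psock)
      {
              if (psock->parser.enabled)
                      psock->parser.saved_data_ready(sk); /* upper layer wake up */
              else
                      sk->sk_data_ready(sk);
      }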
      
      This fixes application stalls observed when an application is waiting
      for data in a blocking read().
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>