1. 21 Dec 2021, 1 commit
  2. 19 Dec 2021, 1 commit
  3. 15 Dec 2021, 1 commit
  4. 14 Dec 2021, 1 commit
    • net_tstamp: add new flag HWTSTAMP_FLAG_BONDED_PHC_INDEX · 9c9211a3
      Hangbin Liu committed
      Since commit 94dd016a ("bond: pass get_ts_info and SIOC[SG]HWTSTAMP
      ioctl to active device") users can get the bond active interface's
      PHC index directly. But on a failover the active interface changes,
      and with it the PHC index. This may break a user's program if they
      do not update the PHC index in time.
      
      This patch adds a new hwtstamp_config flag, HWTSTAMP_FLAG_BONDED_PHC_INDEX.
      Users who want the bond active interface's PHC must set this flag,
      thereby acknowledging that the PHC index may change.
      
      With the new flag in place, all flag checks in current drivers are
      removed; only the check in net_hwtstamp_validate() is kept.
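      
      As a hedged illustration of the userspace side (the function and bond
      names below are ours, not from the patch), opting in looks roughly
      like:
      
       #include <string.h>
       #include <sys/ioctl.h>
       #include <net/if.h>
       #include <linux/net_tstamp.h>
       #include <linux/sockios.h>
      
       /* Enable HW timestamping on a bond, acknowledging that the
        * underlying PHC index may change on failover. */
       int enable_bond_hwts(int sock, const char *bond_name)
       {
               struct hwtstamp_config cfg = {
                       .flags     = HWTSTAMP_FLAG_BONDED_PHC_INDEX,
                       .tx_type   = HWTSTAMP_TX_ON,
                       .rx_filter = HWTSTAMP_FILTER_ALL,
               };
               struct ifreq ifr;
      
               memset(&ifr, 0, sizeof(ifr));
               strncpy(ifr.ifr_name, bond_name, IFNAMSIZ - 1);
               ifr.ifr_data = (char *)&cfg;
               return ioctl(sock, SIOCSHWTSTAMP, &ifr);
       }
      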
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9c9211a3
  5. 11 Dec 2021, 1 commit
    • Increase default MLOCK_LIMIT to 8 MiB · 9dcc38e2
      Drew DeVault committed
      This limit has not been updated since 2008, when it was increased to 64
      KiB at the request of GnuPG.  Until recently, the main use-cases for this
      feature were (1) preventing sensitive memory from being swapped, as in
      GnuPG's use-case; and (2) real-time use-cases.  In the first case, little
      memory is called for, and in the second case, the user is generally in a
      position to increase it if they need more.
      
      The introduction of IORING_REGISTER_BUFFERS adds a third use-case:
      preparing fixed buffers for high-performance I/O.  This use-case will take
      as much of this memory as it can get, but is still limited to 64 KiB by
      default, which is very little.  This patch increases the limit to 8 MiB, which
      was chosen fairly arbitrarily as a more generous, but still conservative,
      default value.
      
      It is also possible to raise this limit in userspace.  This is easily
      done, for example, in the use-case of a network daemon: systemd, for
      instance, provides for this via LimitMEMLOCK in the service file; OpenRC
      via the rc_ulimit variables.  However, there is no established userspace
      facility for configuring this outside of daemons: end-user applications do
      not presently have access to a convenient means of raising their limits.
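      
      Where no init-system facility applies, a process can also raise the
      limit itself with the standard getrlimit()/setrlimit() interface; a
      minimal, hedged sketch:
      
       #include <stdio.h>
       #include <sys/resource.h>
      
       int main(void)
       {
               struct rlimit rl;
      
               if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
                       return 1;
               printf("memlock soft limit: %llu bytes\n",
                      (unsigned long long)rl.rlim_cur);
               /* Raising the soft limit up to the hard limit needs no
                * privilege; exceeding rlim_max requires CAP_SYS_RESOURCE. */
               rl.rlim_cur = rl.rlim_max;
               return setrlimit(RLIMIT_MEMLOCK, &rl) != 0;
       }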
      
      The buck, as it were, stops with the kernel.  It's much easier to address
      it here than it is to bring it to hundreds of distributions, and it can
      only realistically be relied upon to be high enough by end-user software
      if it is more-or-less ubiquitous.  Most distros don't change this
      particular rlimit from the kernel-supplied default value, so a change here
      will easily provide that ubiquity.
      
      Link: https://lkml.kernel.org/r/20211028080813.15966-1-sir@cmpwn.com
      Signed-off-by: Drew DeVault <sir@cmpwn.com>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Acked-by: Cyril Hrubis <chrubis@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Andrew Dona-Couch <andrew@donacou.ch>
      Cc: Ammar Faizi <ammarfaizi2@gnuweeb.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9dcc38e2
  6. 10 Dec 2021, 1 commit
    • aio: fix use-after-free due to missing POLLFREE handling · 50252e4b
      Eric Biggers committed
      signalfd_poll() and binder_poll() are special in that they use a
      waitqueue whose lifetime is the current task, rather than the struct
      file as is normally the case.  This is okay for blocking polls, since a
      blocking poll occurs within one task; however, non-blocking polls
      require another solution.  This solution is for the queue to be cleared
      before it is freed, by sending a POLLFREE notification to all waiters.
      
      Unfortunately, only eventpoll handles POLLFREE.  A second type of
      non-blocking poll, aio poll, was added in kernel v4.18, and it doesn't
      handle POLLFREE.  This allows a use-after-free to occur if a signalfd or
      binder fd is polled with aio poll, and the waitqueue gets freed.
      
      Fix this by making aio poll handle POLLFREE.
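      
      For context, the triggering pattern is roughly the following userspace
      sequence (a hedged sketch, not a reproducer from the patch; raw
      syscalls are used since libaio is not assumed):
      
       #include <linux/aio_abi.h>
       #include <poll.h>
       #include <signal.h>
       #include <string.h>
       #include <sys/signalfd.h>
       #include <sys/syscall.h>
       #include <unistd.h>
      
       int main(void)
       {
               aio_context_t ctx = 0;
               struct iocb cb;
               struct iocb *cbs[1] = { &cb };
               sigset_t mask;
               int sfd;
      
               sigemptyset(&mask);
               sigaddset(&mask, SIGUSR1);
               sigprocmask(SIG_BLOCK, &mask, NULL);
               /* signalfd's waitqueue lives in the task, not the file */
               sfd = signalfd(-1, &mask, 0);
      
               syscall(SYS_io_setup, 128, &ctx);
               memset(&cb, 0, sizeof(cb));
               cb.aio_fildes = sfd;
               cb.aio_lio_opcode = IOCB_CMD_POLL;
               cb.aio_buf = POLLIN;  /* requested poll events */
               /* Queue a non-blocking aio poll on the signalfd; the
                * use-after-free window opened when the waitqueue was
                * freed with a request still attached to it. */
               syscall(SYS_io_submit, ctx, 1, cbs);
               return 0;
       }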
      
      A patch by Ramji Jiyani <ramjiyani@google.com>
      (https://lore.kernel.org/r/20211027011834.2497484-1-ramjiyani@google.com)
      tried to do this by making aio_poll_wake() always complete the request
      inline if POLLFREE is seen.  However, that solution had two bugs.
      First, it introduced a deadlock, as it unconditionally locked the aio
      context while holding the waitqueue lock, which inverts the normal
      locking order.  Second, it didn't consider that POLLFREE notifications
      are missed while the request has been temporarily de-queued.
      
      The second problem was solved by my previous patch.  This patch then
      properly fixes the use-after-free by handling POLLFREE in a
      deadlock-free way.  It does this by taking advantage of the fact that
      freeing of the waitqueue is RCU-delayed, similar to what eventpoll does.
      
      Fixes: 2c14fa83 ("aio: implement IOCB_CMD_POLL")
      Cc: <stable@vger.kernel.org> # v4.18+
      Link: https://lore.kernel.org/r/20211209010455.42744-6-ebiggers@kernel.org
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      50252e4b
  7. 03 Dec 2021, 2 commits
  8. 02 Dec 2021, 2 commits
  9. 01 Dec 2021, 1 commit
    • bpf: Add bpf_loop helper · e6f2dd0f
      Joanne Koong committed
      This patch adds the kernel-side and API changes for a new helper
      function, bpf_loop:
      
      long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx,
      u64 flags);
      
      where long (*callback_fn)(u32 index, void *ctx);
      
      bpf_loop invokes the "callback_fn" **nr_loops** times or until the
      callback_fn returns 1. The callback_fn can only return 0 or 1, and
      this is enforced by the verifier. The callback_fn index is zero-indexed.
      
      A few things to note:
      ~ The "u64 flags" parameter is currently unused but is included in
      case a future use case for it arises.
      ~ In the kernel-side implementation of bpf_loop (kernel/bpf/bpf_iter.c),
      bpf_callback_t is used as the callback function cast.
      ~ A program can have nested bpf_loop calls, but the program must
      still adhere to the verifier's stack depth constraint (the stack depth
      cannot exceed MAX_BPF_STACK).
      ~ Recursive callback_fns do not pass the verifier, due to the call stack
      for these being too deep.
      ~ The next patch will include the tests and benchmark
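      
      As a hedged usage sketch on the BPF side (assumes libbpf's
      bpf_helpers.h; the section and function names are illustrative):
      
       #include <linux/bpf.h>
       #include <bpf/bpf_helpers.h>
      
       static long sum_cb(__u32 index, void *ctx)
       {
               long *sum = ctx;
      
               *sum += index;
               return 0;       /* 0 = continue, 1 = break out early */
       }
      
       SEC("tracepoint/syscalls/sys_enter_getpid")
       int demo(void *ctx)
       {
               long sum = 0;
      
               bpf_loop(10, sum_cb, &sum, 0);  /* flags must be 0 */
               return 0;
       }
      
       char LICENSE[] SEC("license") = "GPL";
      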
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211130030622.4131246-2-joannekoong@fb.com
      e6f2dd0f
  10. 30 Nov 2021, 1 commit
  11. 29 Nov 2021, 2 commits
  12. 23 Nov 2021, 1 commit
    • mctp: Add MCTP-over-serial transport binding · a0c2ccd9
      Jeremy Kerr committed
      This change adds an MCTP serial transport binding, as defined by DMTF
      specification DSP0253 - "MCTP Serial Transport Binding". This is
      implemented as a new serial line discipline, and can be attached to
      arbitrary tty devices.
      
      From the Kconfig description:
      
        This driver provides an MCTP-over-serial interface, through a
        serial line-discipline, as defined by DMTF specification "DSP0253 -
        MCTP Serial Transport Binding". By attaching the ldisc to a serial
        device, we get a new net device to transport MCTP packets.
      
        This allows communication with external MCTP endpoints which use
        serial as their transport. It can also be used as an easy way to
        provide MCTP connectivity between virtual machines, by forwarding
        data between simple virtual serial devices.
      
        Say y here if you need to connect to MCTP endpoints over serial. To
        compile as a module, use m; the module will be called mctp-serial.
      
      Once the N_MCTP line discipline is set [using ioctl(TIOCSETD)], we get a
      new netdev suitable for MCTP communication.
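      
      A hedged sketch of attaching the ldisc by hand (the tty path is
      illustrative; N_MCTP comes from <linux/tty.h> on kernels with this
      binding):
      
       #include <fcntl.h>
       #include <sys/ioctl.h>
       #include <unistd.h>
       #include <linux/tty.h>
      
       int main(void)
       {
               int ldisc = N_MCTP;
               int fd = open("/dev/ttyS0", O_RDWR | O_NOCTTY);
      
               if (fd < 0)
                       return 1;
               /* Attach the MCTP line discipline; a mctpserial%d netdev
                * exists for as long as this fd stays open. */
               if (ioctl(fd, TIOCSETD, &ldisc) < 0)
                       return 1;
               pause();
               return 0;
       }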
      
      The 'mctp' utility[1] provides a simple wrapper for this ioctl, using
      'link serial <device>':
      
        # mctp link serial /dev/ttyS0 &
        # mctp link
        dev lo index 1 address 0x00:00:00:00:00:00 net 1 mtu 65536 up
        dev mctpserial0 index 5 address 0x(no-addr) net 1 mtu 68 down
      
      [1]: https://github.com/CodeConstruct/mctp
      Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a0c2ccd9
  13. 22 Nov 2021, 2 commits
  14. 16 Nov 2021, 1 commit
    • bpf: Change value of MAX_TAIL_CALL_CNT from 32 to 33 · ebf7f6f0
      Tiezhu Yang committed
      In the current code, the actual max tail call count is 33 which is greater
      than MAX_TAIL_CALL_CNT (defined as 32). The actual limit is not consistent
      with the meaning of MAX_TAIL_CALL_CNT and thus confusing at first glance.
      We can see the historical evolution from commit 04fd61ab ("bpf: allow
      bpf programs to tail-call other bpf programs") and commit f9dabe01
      ("bpf: Undo off-by-one in interpreter tail call count limit"). In order
      to avoid changing existing behavior, the actual limit is 33 now, this is
      reasonable.
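      
      For reference, the budget is consumed by chained bpf_tail_call()
      invocations; a hedged sketch of a self-tail-calling program that
      exercises the limit (the map and section names are ours; userspace
      must install the program's own fd into slot 0 of the prog array):
      
       #include <linux/bpf.h>
       #include <bpf/bpf_helpers.h>
      
       struct {
               __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
               __uint(max_entries, 1);
               __uint(key_size, sizeof(__u32));
               __uint(value_size, sizeof(__u32));
       } jmp_table SEC(".maps");
      
       SEC("tc")
       int self_call(struct __sk_buff *skb)
       {
               /* Each successful tail call consumes one unit of the
                * budget; after 33 chained calls the helper fails and
                * execution falls through to the return below. */
               bpf_tail_call(skb, &jmp_table, 0);
               return 0;
       }
      
       char LICENSE[] SEC("license") = "GPL";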
      
      After commit 874be05f ("bpf, tests: Add tail call test suite"), a
      failing test case can be seen.
      
      On all archs when CONFIG_BPF_JIT_ALWAYS_ON is not set:
       # echo 0 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf
       # dmesg | grep -w FAIL
       Tail call error path, max count reached jited:0 ret 34 != 33 FAIL
      
      On some archs:
       # echo 1 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf
       # dmesg | grep -w FAIL
       Tail call error path, max count reached jited:1 ret 34 != 33 FAIL
      
      Although the above failed testcase has been fixed in commit 18935a72
      ("bpf/tests: Fix error in tail call limit tests"), it would still be good
      to change the value of MAX_TAIL_CALL_CNT from 32 to 33 to make the code
      more readable.
      
      The 32-bit x86 JIT was using a limit of 32; fix its wrong comments and
      raise its limit to 33 tail calls to match the updated MAX_TAIL_CALL_CNT
      constant. For the mips64 JIT, use "ori" instead of "addiu" as suggested
      by Johan Almbladh. For the riscv JIT, use RV_REG_TCC directly to save one
      register move, as suggested by Björn Töpel. The other implementations
      need no functional change: the current limit of 33 is unchanged, the new
      value of MAX_TAIL_CALL_CNT now reflects the actual max tail call count,
      and the related tail call testcases in the test_bpf module and selftests
      work for both the interpreter and the JITs.
      
      Here are the test results on x86_64:
      
       # uname -m
       x86_64
       # echo 0 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf test_suite=test_tail_calls
       # dmesg | tail -1
       test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]
       # rmmod test_bpf
       # echo 1 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf test_suite=test_tail_calls
       # dmesg | tail -1
       test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [8/8 JIT'ed]
       # rmmod test_bpf
       # ./test_progs -t tailcalls
       #142 tailcalls:OK
       Summary: 1/11 PASSED, 0 SKIPPED, 0 FAILED
      Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Acked-by: Björn Töpel <bjorn@kernel.org>
      Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Link: https://lore.kernel.org/bpf/1636075800-3264-1-git-send-email-yangtiezhu@loongson.cn
      ebf7f6f0
  15. 15 Nov 2021, 3 commits
  16. 12 Nov 2021, 1 commit
  17. 11 Nov 2021, 2 commits
  18. 10 Nov 2021, 1 commit
    • virtio-mem: support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE · 61082ad6
      David Hildenbrand committed
      The initial virtio-mem spec states that while unplugged memory should not
      be read, the device still has to allow for reading unplugged memory inside
      the usable region. The primary motivation for this default handling was
      to simplify bringup of virtio-mem, because there were corner cases where
      Linux might have accidentally read unplugged memory inside added Linux
      memory blocks.
      
      In the meantime, we:
      1. Removed /dev/kmem in commit bbcd53c9 ("drivers/char: remove
         /dev/kmem for good")
      2. Disallowed access to virtio-mem device memory via /dev/mem in
         commit 2128f4e2 ("virtio-mem: disallow mapping virtio-mem memory via
         /dev/mem")
      3. Sanitized access to virtio-mem device memory via /proc/kcore in
         commit 0daa322b ("fs/proc/kcore: don't read offline sections,
         logically offline pages and hwpoisoned pages")
      4. Sanitized access to virtio-mem device memory via /proc/vmcore in
         commit ce281462 ("virtio-mem: kdump mode to sanitize /proc/vmcore
         access")
      
      "Accidential" access to unplugged memory is no longer possible; we can
      support the new VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE feature that will be
      required by some hypervisors implementing virtio-mem in the near future.
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      61082ad6
  19. 08 Nov 2021, 2 commits
  20. 04 Nov 2021, 3 commits
  21. 03 Nov 2021, 1 commit
  22. 02 Nov 2021, 3 commits
    • net: ndisc: introduce ndisc_evict_nocarrier sysctl parameter · 18ac597a
      James Prestwood committed
      In most situations the neighbor discovery cache should be cleared on a
      NOCARRIER event, which is currently done unconditionally. But for wireless
      roams the neighbor discovery cache can and should remain intact, since
      the underlying network has not changed.
      
      This patch introduces a sysctl option, ndisc_evict_nocarrier, which a
      wireless supplicant can disable during a roam. This allows packets to be
      sent immediately after a roam, without having to wait for neighbor
      discovery to complete.
      
      A user reported roughly a 1 second delay after a roam before packets
      could be sent out (note, on IPv4). This delay was due to the ARP
      cache being cleared. During testing of this same scenario using IPv6
      no delay was noticed, but regardless there is no reason to clear
      the ndisc cache for wireless roams.
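      
      A hedged sketch of how a supplicant might drive this (the interface
      name is illustrative; the IPv4 counterpart in the next entry,
      arp_evict_nocarrier, is toggled the same way under net/ipv4/conf/):
      
       #include <stdio.h>
      
       /* Write 0 or 1 to the per-interface ndisc_evict_nocarrier knob. */
       static int set_ndisc_evict_nocarrier(const char *ifname, int val)
       {
               char path[128];
               FILE *f;
      
               snprintf(path, sizeof(path),
                        "/proc/sys/net/ipv6/conf/%s/ndisc_evict_nocarrier",
                        ifname);
               f = fopen(path, "w");
               if (!f)
                       return -1;
               fprintf(f, "%d\n", val);
               fclose(f);
               return 0;
       }
      
       /* e.g. set_ndisc_evict_nocarrier("wlan0", 0) before a roam, then
        * restore it to 1 once the roam has completed. */
      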
      Signed-off-by: James Prestwood <prestwoj@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      18ac597a
    • net: arp: introduce arp_evict_nocarrier sysctl parameter · fcdb44d0
      James Prestwood committed
      This change introduces a new sysctl parameter, arp_evict_nocarrier.
      When set (the default) the ARP cache will be cleared on a NOCARRIER
      event. The new option defaults to '1', which maintains existing
      behavior.
      
      Clearing the ARP cache on NOCARRIER is relatively new, introduced by:
      
      commit 859bd2ef
      Author: David Ahern <dsahern@gmail.com>
      Date:   Thu Oct 11 20:33:49 2018 -0700
      
          net: Evict neighbor entries on carrier down
      
      The reason for this change is to prevent the ARP cache from being
      cleared when a wireless device roams. Specifically, for wireless roams
      the ARP cache should not be cleared because the underlying network has not
      changed. Clearing the ARP cache in this case can introduce significant
      delays in sending out packets after a roam.
      
      A user reported such a situation here:
      
      https://lore.kernel.org/linux-wireless/CACsRnHWa47zpx3D1oDq9JYnZWniS8yBwW1h0WAVZ6vrbwL_S0w@mail.gmail.com/
      
      After some investigation it was found that the kernel was holding onto
      packets until ARP finished, which resulted in this 1 second delay. It
      was also found that the first ARP who-has was never responded to,
      which is actually what causes the delay. This change is more or less
      working around this behavior, but again, there is no reason to clear
      the cache on a roam anyway.
      
      As for the unanswered who-has, we know the packet made it OTA since
      it was seen while monitoring. Why it never received a response is
      unknown. In any case, since this is a problem on the AP side of things
      all that can be done is to work around it until it is solved.
      
      Some background on testing/reproducing the packet delay:
      
      Hardware:
       - 2 access points configured for Fast BSS Transition (Though I don't
         see why regular reassociation wouldn't have the same behavior)
       - Wireless station running IWD as supplicant
       - A device on network able to respond to pings (I used one of the APs)
      
      Procedure:
       - Connect to first AP
       - Ping once to establish an ARP entry
       - Start a tcpdump
       - Roam to second AP
       - Wait for operstate UP event, and note the timestamp
       - Start pinging
      
      Results:
      
      Below is the tcpdump after UP. It was recorded that the interface went
      UP at 10:42:01.432875.
      
      10:42:01.461871 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.497976 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.507162 ARP, Reply 192.168.254.1 is-at ac:86:74:55:b0:20, length 46
      10:42:02.507185 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 1, length 64
      10:42:02.507205 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 2, length 64
      10:42:02.507212 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 3, length 64
      10:42:02.507219 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 4, length 64
      10:42:02.507225 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 5, length 64
      10:42:02.507232 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 6, length 64
      10:42:02.515373 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 1, length 64
      10:42:02.521399 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 2, length 64
      10:42:02.521612 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 3, length 64
      10:42:02.521941 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 4, length 64
      10:42:02.522419 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 5, length 64
      10:42:02.523085 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 6, length 64
      
      You can see the first ARP who-has went out very quickly after UP, but
      was never responded to. Nearly a second later the kernel retries and
      gets a response. Only then do the ping packets go out. If an ARP entry
      is manually added prior to UP (after the cache is cleared), the first
      ping is never responded to, so it's not only an issue with ARP but with
      data packets in general.
      
      As mentioned prior, the wireless interface was also monitored to verify
      the ping/ARP packet made it OTA which was observed to be true.
      Signed-off-by: James Prestwood <prestwoj@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fcdb44d0
    • bpf: Add alignment padding for "map_extra" + consolidate holes · 8845b468
      Joanne Koong committed
      This patch makes two changes regarding alignment padding
      for the "map_extra" field.
      
      1) In the kernel header, "map_extra" and "btf_value_type_id"
      are rearranged to consolidate the hole.
      
      Before:
      struct bpf_map {
      	...
              u32		max_entries;	/*    36     4	*/
              u32		map_flags;	/*    40     4	*/
      
              /* XXX 4 bytes hole, try to pack */
      
              u64		map_extra;	/*    48     8	*/
              int		spin_lock_off;	/*    56     4	*/
              int		timer_off;	/*    60     4	*/
              /* --- cacheline 1 boundary (64 bytes) --- */
              u32		id;		/*    64     4	*/
              int		numa_node;	/*    68     4	*/
      	...
              bool		frozen;		/*   117     1	*/
      
              /* XXX 10 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
      	...
              struct work_struct	work;	/*   144    72	*/
      
              /* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
      	struct mutex	freeze_mutex;	/*   216   144 	*/
      
              /* --- cacheline 5 boundary (320 bytes) was 40 bytes ago --- */
              u64		writecnt; 	/*   360     8	*/
      
          /* size: 384, cachelines: 6, members: 26 */
          /* sum members: 354, holes: 2, sum holes: 14 */
          /* padding: 16 */
          /* forced alignments: 2, forced holes: 1, sum forced holes: 10 */
      
      } __attribute__((__aligned__(64)));
      
      After:
      struct bpf_map {
      	...
              u32		max_entries;	/*    36     4	*/
              u64		map_extra;	/*    40     8 	*/
              u32		map_flags;	/*    48     4	*/
              int		spin_lock_off;	/*    52     4	*/
              int		timer_off;	/*    56     4	*/
              u32		id;		/*    60     4	*/
      
              /* --- cacheline 1 boundary (64 bytes) --- */
              int		numa_node;	/*    64     4	*/
      	...
      	bool		frozen;		/*   113     1  */
      
              /* XXX 14 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
      	...
              struct work_struct	work;	/*   144    72	*/
      
              /* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
              struct mutex	freeze_mutex;	/*   216   144	*/
      
              /* --- cacheline 5 boundary (320 bytes) was 40 bytes ago --- */
              u64		writecnt;       /*   360     8	*/
      
          /* size: 384, cachelines: 6, members: 26 */
          /* sum members: 354, holes: 1, sum holes: 14 */
          /* padding: 16 */
          /* forced alignments: 2, forced holes: 1, sum forced holes: 14 */
      
      } __attribute__((__aligned__(64)));
      
      2) Add alignment padding to the bpf_map_info struct. More details can
      be found in commit 36f9814a ("bpf: fix uapi hole for 32 bit compat
      applications").
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20211029224909.1721024-3-joannekoong@fb.com
      8845b468
  23. 01 Nov 2021, 6 commits