1. 30 Sep, 2020 1 commit
  2. 29 Sep, 2020 7 commits
    • ethtool: mark netlink family as __ro_after_init · 78b70155
      Committed by Jakub Kicinski
      Like all genl families, ethtool_genl_family cannot be
      a straight-up constant, because it's modified/initialized
      by genl_register_family(). After init, however, it's only
      passed to genlmsg_put() & co., so we can mark it
      as __ro_after_init.

      Since the genl_family structure contains function pointers,
      mark this as a fix.
      
      Fixes: 2b4a8990 ("ethtool: introduce ethtool netlink interface")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: qrtr: ns: Protect radix_tree_deref_slot() using rcu read locks · a7809ff9
      Committed by Manivannan Sadhasivam
      The RCU read locks are needed to avoid a potential race condition while
      dereferencing the radix tree from multiple threads. The issue was identified
      by syzbot. Below is the crash report:
      
      =============================
      WARNING: suspicious RCU usage
      5.7.0-syzkaller #0 Not tainted
      -----------------------------
      include/linux/radix-tree.h:176 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      2 locks held by kworker/u4:1/21:
       #0: ffff88821b097938 ((wq_completion)qrtr_ns_handler){+.+.}-{0:0}, at: spin_unlock_irq include/linux/spinlock.h:403 [inline]
       #0: ffff88821b097938 ((wq_completion)qrtr_ns_handler){+.+.}-{0:0}, at: process_one_work+0x6df/0xfd0 kernel/workqueue.c:2241
       #1: ffffc90000dd7d80 ((work_completion)(&qrtr_ns.work)){+.+.}-{0:0}, at: process_one_work+0x71e/0xfd0 kernel/workqueue.c:2243
      
      stack backtrace:
      CPU: 0 PID: 21 Comm: kworker/u4:1 Not tainted 5.7.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: qrtr_ns_handler qrtr_ns_worker
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1e9/0x30e lib/dump_stack.c:118
       radix_tree_deref_slot include/linux/radix-tree.h:176 [inline]
       ctrl_cmd_new_lookup net/qrtr/ns.c:558 [inline]
       qrtr_ns_worker+0x2aff/0x4500 net/qrtr/ns.c:674
       process_one_work+0x76e/0xfd0 kernel/workqueue.c:2268
       worker_thread+0xa7f/0x1450 kernel/workqueue.c:2414
       kthread+0x353/0x380 kernel/kthread.c:268
      
      Fixes: 0c2204a4 ("net: qrtr: Migrate nameservice to kernel from userspace")
      Reported-and-tested-by: syzbot+0f84f6eed90503da72fc@syzkaller.appspotmail.com
      Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: add nested_level variable in net_device · 1fc70edb
      Committed by Taehee Yoo
      This patch adds a new variable, 'nested_level', to the net_device
      structure.
      This variable will be used as the subclass parameter of
      spin_lock_nested() for dev->addr_list_lock.

      netif_addr_lock() can be called recursively, so spin_lock_nested() is
      used instead of spin_lock(), with dev->lower_level as its parameter.
      However, the dev->lower_level value can be updated while it is being
      used, so lockdep would warn about a possible deadlock scenario.

      When a stacked interface is deleted, netif_{uc | mc}_sync() is
      called recursively, so spin_lock_nested() is called recursively too,
      with dev->lower_level as its parameter.
      But dev->lower_level is updated immediately when interfaces are
      linked or unlinked.
      Thus, after unlinking, dev->lower_level shouldn't be used as the
      parameter of spin_lock_nested().
      
          A (macvlan)
          |
          B (vlan)
          |
          C (bridge)
          |
          D (macvlan)
          |
          E (vlan)
          |
          F (bridge)
      
          A->lower_level : 6
          B->lower_level : 5
          C->lower_level : 4
          D->lower_level : 3
          E->lower_level : 2
          F->lower_level : 1
      
      When interface 'A' is removed, it releases its resources.
      At this point, netif_addr_lock() is called.
      Then netdev_upper_dev_unlink() is called recursively and
      dev->lower_level is updated.
      There is no problem.

      But when the bridge module is removed, the 'C' and 'F' interfaces
      are removed at once.
      If 'F' is removed first, the lower_level values are as follows:
          A->lower_level : 5
          B->lower_level : 4
          C->lower_level : 3
          D->lower_level : 2
          E->lower_level : 1
          F->lower_level : 1
      
      Then 'C' is removed. At this point, netif_addr_lock() is called
      recursively.
      The ordering is like this:
      C(3)->D(2)->E(1)->F(1)
      Now the lower_level values of 'E' and 'F' are the same,
      so lockdep warns about a possible deadlock scenario.

      To avoid this problem, a new variable 'nested_level' is added.
      Its value is the same as dev->lower_level - 1,
      but it is updated in rtnl_unlock().
      So this variable can safely be used as the parameter of
      spin_lock_nested() in the rtnl context.
      
      Test commands:
         ip link add br0 type bridge vlan_filtering 1
         ip link add vlan1 link br0 type vlan id 10
         ip link add macvlan2 link vlan1 type macvlan
         ip link add br3 type bridge vlan_filtering 1
         ip link set macvlan2 master br3
         ip link add vlan4 link br3 type vlan id 10
         ip link add macvlan5 link vlan4 type macvlan
         ip link add br6 type bridge vlan_filtering 1
         ip link set macvlan5 master br6
         ip link add vlan7 link br6 type vlan id 10
         ip link add macvlan8 link vlan7 type macvlan
      
         ip link set br0 up
         ip link set vlan1 up
         ip link set macvlan2 up
         ip link set br3 up
         ip link set vlan4 up
         ip link set macvlan5 up
         ip link set br6 up
         ip link set vlan7 up
         ip link set macvlan8 up
         modprobe -rv bridge
      
      Splat looks like:
      [   36.057436][  T744] WARNING: possible recursive locking detected
      [   36.058848][  T744] 5.9.0-rc6+ #728 Not tainted
      [   36.059959][  T744] --------------------------------------------
      [   36.061391][  T744] ip/744 is trying to acquire lock:
      [   36.062590][  T744] ffff8c4767509280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_set_rx_mode+0x19/0x30
      [   36.064922][  T744]
      [   36.064922][  T744] but task is already holding lock:
      [   36.066626][  T744] ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60
      [   36.068851][  T744]
      [   36.068851][  T744] other info that might help us debug this:
      [   36.070731][  T744]  Possible unsafe locking scenario:
      [   36.070731][  T744]
      [   36.072497][  T744]        CPU0
      [   36.073238][  T744]        ----
      [   36.074007][  T744]   lock(&vlan_netdev_addr_lock_key);
      [   36.075290][  T744]   lock(&vlan_netdev_addr_lock_key);
      [   36.076590][  T744]
      [   36.076590][  T744]  *** DEADLOCK ***
      [   36.076590][  T744]
      [   36.078515][  T744]  May be due to missing lock nesting notation
      [   36.078515][  T744]
      [   36.080491][  T744] 3 locks held by ip/744:
      [   36.081471][  T744]  #0: ffffffff98571df0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x236/0x490
      [   36.083614][  T744]  #1: ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60
      [   36.085942][  T744]  #2: ffff8c476c8da280 (&bridge_netdev_addr_lock_key/4){+...}-{2:2}, at: dev_uc_sync+0x39/0x80
      [   36.088400][  T744]
      [   36.088400][  T744] stack backtrace:
      [   36.089772][  T744] CPU: 6 PID: 744 Comm: ip Not tainted 5.9.0-rc6+ #728
      [   36.091364][  T744] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [   36.093630][  T744] Call Trace:
      [   36.094416][  T744]  dump_stack+0x77/0x9b
      [   36.095385][  T744]  __lock_acquire+0xbc3/0x1f40
      [   36.096522][  T744]  lock_acquire+0xb4/0x3b0
      [   36.097540][  T744]  ? dev_set_rx_mode+0x19/0x30
      [   36.098657][  T744]  ? rtmsg_ifinfo+0x1f/0x30
      [   36.099711][  T744]  ? __dev_notify_flags+0xa5/0xf0
      [   36.100874][  T744]  ? rtnl_is_locked+0x11/0x20
      [   36.101967][  T744]  ? __dev_set_promiscuity+0x7b/0x1a0
      [   36.103230][  T744]  _raw_spin_lock_bh+0x38/0x70
      [   36.104348][  T744]  ? dev_set_rx_mode+0x19/0x30
      [   36.105461][  T744]  dev_set_rx_mode+0x19/0x30
      [   36.106532][  T744]  dev_set_promiscuity+0x36/0x50
      [   36.107692][  T744]  __dev_set_promiscuity+0x123/0x1a0
      [   36.108929][  T744]  dev_set_promiscuity+0x1e/0x50
      [   36.110093][  T744]  br_port_set_promisc+0x1f/0x40 [bridge]
      [   36.111415][  T744]  br_manage_promisc+0x8b/0xe0 [bridge]
      [   36.112728][  T744]  __dev_set_promiscuity+0x123/0x1a0
      [   36.113967][  T744]  ? __hw_addr_sync_one+0x23/0x50
      [   36.115135][  T744]  __dev_set_rx_mode+0x68/0x90
      [   36.116249][  T744]  dev_uc_sync+0x70/0x80
      [   36.117244][  T744]  dev_uc_add+0x50/0x60
      [   36.118223][  T744]  macvlan_open+0x18e/0x1f0 [macvlan]
      [   36.119470][  T744]  __dev_open+0xd6/0x170
      [   36.120470][  T744]  __dev_change_flags+0x181/0x1d0
      [   36.121644][  T744]  dev_change_flags+0x23/0x60
      [   36.122741][  T744]  do_setlink+0x30a/0x11e0
      [   36.123778][  T744]  ? __lock_acquire+0x92c/0x1f40
      [   36.124929][  T744]  ? __nla_validate_parse.part.6+0x45/0x8e0
      [   36.126309][  T744]  ? __lock_acquire+0x92c/0x1f40
      [   36.127457][  T744]  __rtnl_newlink+0x546/0x8e0
      [   36.128560][  T744]  ? lock_acquire+0xb4/0x3b0
      [   36.129623][  T744]  ? deactivate_slab.isra.85+0x6a1/0x850
      [   36.130946][  T744]  ? __lock_acquire+0x92c/0x1f40
      [   36.132102][  T744]  ? lock_acquire+0xb4/0x3b0
      [   36.133176][  T744]  ? is_bpf_text_address+0x5/0xe0
      [   36.134364][  T744]  ? rtnl_newlink+0x2e/0x70
      [   36.135445][  T744]  ? rcu_read_lock_sched_held+0x32/0x60
      [   36.136771][  T744]  ? kmem_cache_alloc_trace+0x2d8/0x380
      [   36.138070][  T744]  ? rtnl_newlink+0x2e/0x70
      [   36.139164][  T744]  rtnl_newlink+0x47/0x70
      [ ... ]
      
      Fixes: 845e0ebb ("net: change addr_list_lock back to static key")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
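The lower_level collision described in commit 1fc70edb can be reproduced with a toy userspace model. This is only a sketch under stated assumptions: `struct toy_dev` and every `toy_*` function are hypothetical names invented for illustration, not kernel code, and `toy_sync_nested_levels()` merely stands in for the deferred update that the commit message says happens in rtnl_unlock().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NDEV 6

/* Toy stand-in for struct net_device; only the two depth fields are modelled. */
struct toy_dev {
    struct toy_dev *lower;   /* device directly below in the stack, or NULL */
    int lower_level;         /* recomputed immediately on every unlink */
    int nested_level;        /* recomputed only by toy_sync_nested_levels() */
};

/* Recompute lower_level bottom-up. devs[0] is the top of the stack. */
static void toy_update_lower_levels(struct toy_dev *devs, size_t n)
{
    for (size_t i = n; i-- > 0; )
        devs[i].lower_level = devs[i].lower ? devs[i].lower->lower_level + 1 : 1;
}

/* Models the deferred update (assumed to run at rtnl_unlock() time). */
static void toy_sync_nested_levels(struct toy_dev *devs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        devs[i].nested_level = devs[i].lower_level - 1;
}

/* Build the A..F chain from the commit message: A on top, F at the bottom. */
static void toy_build_chain(struct toy_dev *devs, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        devs[i].lower = (i + 1 < n) ? &devs[i + 1] : NULL;
    toy_update_lower_levels(devs, n);
    toy_sync_nested_levels(devs, n);
}

/* Unlink whatever sits below devs[idx]; lower_level changes at once. */
static void toy_unlink_lower(struct toy_dev *devs, size_t n, size_t idx)
{
    assert(idx < n);
    devs[idx].lower = NULL;
    toy_update_lower_levels(devs, n);
}
```

Calling `toy_unlink_lower(devs, TOY_NDEV, 4)` on a freshly built chain detaches 'F' from 'E'. In this model 'E' and 'F' then share lower_level 1, while their nested_level values stay distinct until `toy_sync_nested_levels()` runs, which is the property the commit relies on.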
    • net: core: introduce struct netdev_nested_priv for nested interface infrastructure · eff74233
      Committed by Taehee Yoo
      Functions related to the nested interface infrastructure, such as
      netdev_walk_all_{ upper | lower }_dev(), pass both a private function
      and a "data" pointer to handle their own work.
      At this point, the data pointer type is void *.
      To make it easier to add common variables and functions,
      the new netdev_nested_priv structure is added.

      In the following patch, a new member variable will be added to this
      struct to fix the lockdep issue.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: add __netdev_upper_dev_unlink() · fe8300fd
      Committed by Taehee Yoo
      netdev_upper_dev_unlink() has to work differently depending on flags.
      This is the same idea as __netdev_upper_dev_link().
      
      In the following patches, new flags will be added.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: remove a redundant goto chain check · 1aad8049
      Committed by Cong Wang
      All TC actions call tcf_action_check_ctrlact() to validate
      goto chain, so this check in tcf_action_init_1() is actually
      redundant. Remove it to avoid the trouble of leaking memory.
      
      Fixes: e49d8c22 ("net_sched: defer tcf_idr_insert() in tcf_action_init_1()")
      Reported-by: Vlad Buslov <vladbu@mellanox.com>
      Suggested-by: Davide Caratti <dcaratti@redhat.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: fdb: don't flush ext_learn entries · f2f3729f
      Committed by Nikolay Aleksandrov
      When user-space software manages fdb entries externally, it should
      set the ext_learn flag, which marks the fdb entry as externally managed
      and avoids expiring it (such entries are treated as static fdbs).
      Unfortunately, on events where fdb entries are flushed (STP down,
      netlink fdb flush, etc.), these fdbs are also deleted automatically by
      the bridge. That in turn causes trouble for the managing user-space
      software (e.g. in MLAG setups we lose remote fdb entries on port flaps).
      These entries are completely externally managed, so we should avoid
      automatically deleting them. The only exception is offloaded entries
      (i.e. BR_FDB_ADDED_BY_EXT_LEARN + BR_FDB_OFFLOADED), which are flushed
      as before.
      
      Fixes: eb100e0e ("net: bridge: allow to add externally learned entries from user-space")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
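The flush rule described in commit f2f3729f boils down to a small predicate. The sketch below is a userspace illustration only: the `TOY_FDB_*` bit values and `toy_fdb_flush_deletes()` are hypothetical and do not match the kernel's flag layout or function names, and the treatment of static entries is an assumption added for completeness.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the kernel's definitions. */
enum {
    TOY_FDB_STATIC             = 1u << 0,
    TOY_FDB_ADDED_BY_EXT_LEARN = 1u << 1,
    TOY_FDB_OFFLOADED          = 1u << 2,
};

#define TOY_FDB_KNOWN \
    (TOY_FDB_STATIC | TOY_FDB_ADDED_BY_EXT_LEARN | TOY_FDB_OFFLOADED)

/* Return true if an automatic flush (STP down, netlink fdb flush, ...) should
 * delete an entry carrying these flags. */
static bool toy_fdb_flush_deletes(unsigned int flags)
{
    bool ext_learn = (flags & TOY_FDB_ADDED_BY_EXT_LEARN) != 0;
    bool offloaded = (flags & TOY_FDB_OFFLOADED) != 0;

    assert((flags & ~(unsigned int)TOY_FDB_KNOWN) == 0); /* only modelled bits */

    if (flags & TOY_FDB_STATIC)
        return false;   /* assumption: static entries survive automatic flushes */
    if (ext_learn && !offloaded)
        return false;   /* externally managed: left for user space to remove */
    return true;        /* dynamic entries, and ext_learn + offloaded ones */
}
```

With this predicate, an entry flagged only `TOY_FDB_ADDED_BY_EXT_LEARN` is kept across a port flap, while one flagged `TOY_FDB_ADDED_BY_EXT_LEARN | TOY_FDB_OFFLOADED` is still flushed, matching the exception the commit message names.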
  3. 25 Sep, 2020 5 commits
    • xfrm: Use correct address family in xfrm_state_find · e94ee171
      Committed by Herbert Xu
      The struct flowi must never be interpreted by itself as its size
      depends on the address family.  Therefore it must always be grouped
      with its original family value.
      
      In this particular instance, the original family value is lost in
      the function xfrm_state_find.  Therefore we get a bogus read when
      it's coupled with the wrong family which would occur with inter-
      family xfrm states.
      
      This patch fixes it by keeping the original family value.
      
      Note that the same bug could potentially occur in LSM through
      the xfrm_state_pol_flow_match hook.  I checked the current code
      there and it seems to be safe for now as only secid is used which
      is part of struct flowi_common.  But that API should be changed
      so that we don't get new bugs in the future.  We could
      do that by replacing fl with just secid or adding a family field.
      
      Reported-by: syzbot+577fbac3145a6eb2e7a5@syzkaller.appspotmail.com
      Fixes: 48b8d783 ("[XFRM]: State selection update to use inner...")
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
    • tcp: skip DSACKs with dubious sequence ranges · ad2b9b0f
      Committed by Priyaranjan Jha
      Currently, we use the length of the DSACKed range to compute the
      number of delivered packets. If the sequence range in a DSACK is
      corrupted, we can get a bogus dsacked/acked count and a bogus cwnd.

      This patch puts bounds on the DSACKed range and skips the update of
      data delivery and spurious retransmission information if the DSACK
      is unlikely to have been caused by the sender's action:
      - The DSACKed range shouldn't be greater than the maximum advertised rwnd.
      - The total no. of DSACKed segments shouldn't be greater than the total
        no. of retransmitted segs. Unlike spurious retransmits, network
        duplicates or corrupted DSACKs shouldn't be counted as delivery.
      Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
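The two bounds listed in commit ad2b9b0f can be sketched as a standalone check. This is a userspace illustration under assumptions, not the kernel function: `struct toy_tcp`, its field names, and `toy_dsack_segs()` are hypothetical, and the choice to keep counting duplicates even when a DSACK is skipped is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of per-connection state used by this sketch. */
struct toy_tcp {
    uint32_t max_window;     /* largest receive window advertised so far */
    uint32_t mss;            /* current segment size; must be non-zero */
    uint32_t total_retrans;  /* segments this sender has retransmitted */
    uint32_t dsack_dups;     /* running total of DSACKed segments */
};

/* Return the number of duplicate segments to credit for this DSACK block,
 * or 0 if the block is dubious and should be skipped. */
static uint32_t toy_dsack_segs(struct toy_tcp *tp, uint32_t start_seq,
                               uint32_t end_seq)
{
    uint32_t seq_len, dup_segs;

    assert(tp && tp->mss > 0);

    /* Wrap-safe "start is before end" test on 32-bit sequence numbers. */
    if ((int32_t)(start_seq - end_seq) >= 0)
        return 0;

    seq_len = end_seq - start_seq;
    if (seq_len > tp->max_window)
        return 0;   /* bound 1: range wider than the maximum advertised rwnd */

    dup_segs = (seq_len + tp->mss - 1) / tp->mss;   /* round up to segments */

    tp->dsack_dups += dup_segs;
    if (tp->dsack_dups > tp->total_retrans)
        return 0;   /* bound 2: more duplicates than segments retransmitted */

    return dup_segs;
}
```

A caller would credit delivery and spurious-retransmission accounting only when the return value is non-zero, so a corrupted range or a network duplicate leaves cwnd-related state untouched.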
    • net/tls: race causes kernel panic · 38f7e1c0
      Committed by Rohit Maheshwari
      BUG: kernel NULL pointer dereference, address: 00000000000000b8
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 80000008b6fef067 P4D 80000008b6fef067 PUD 8b6fe6067 PMD 0
       Oops: 0000 [#1] SMP PTI
       CPU: 12 PID: 23871 Comm: kworker/12:80 Kdump: loaded Tainted: G S
       5.9.0-rc3+ #1
       Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.1 03/29/2018
       Workqueue: events tx_work_handler [tls]
       RIP: 0010:tx_work_handler+0x1b/0x70 [tls]
       Code: dc fe ff ff e8 16 d4 a3 f6 66 0f 1f 44 00 00 0f 1f 44 00 00 55 53 48 8b
       6f 58 48 8b bd a0 04 00 00 48 85 ff 74 1c 48 8b 47 28 <48> 8b 90 b8 00 00 00 83
       e2 02 75 0c f0 48 0f ba b0 b8 00 00 00 00
       RSP: 0018:ffffa44ace61fe88 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: ffff91da9e45cc30 RCX: dead000000000122
       RDX: 0000000000000001 RSI: ffff91da9e45cc38 RDI: ffff91d95efac200
       RBP: ffff91da133fd780 R08: 0000000000000000 R09: 000073746e657665
       R10: 8080808080808080 R11: 0000000000000000 R12: ffff91dad7d30700
       R13: ffff91dab6561080 R14: 0ffff91dad7d3070 R15: ffff91da9e45cc38
       FS:  0000000000000000(0000) GS:ffff91dad7d00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00000000000000b8 CR3: 0000000906478003 CR4: 00000000003706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        process_one_work+0x1a7/0x370
        worker_thread+0x30/0x370
        ? process_one_work+0x370/0x370
        kthread+0x114/0x130
        ? kthread_park+0x80/0x80
        ret_from_fork+0x22/0x30
      
      tls_sw_release_resources_tx() waits for encrypt_pending, which
      can race, so we need changes similar to those in commit
      0cada332 here as well.
      
      Fixes: a42055e8 ("net/tls: Add support for async encryption of records for performance")
      Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: commit action insertions together · 0fedc63f
      Committed by Cong Wang
      syzbot is able to trigger a failure case inside the loop in
      tcf_action_init(), and when this happens we clean up with
      tcf_action_destroy(). But, as these actions are already inserted
      into the global IDR, another parallel process could free them
      before tcf_action_destroy(), and then we would trigger a use-after-free.

      Fix this by deferring the insertions even later, after the loop,
      and committing all the insertions in a separate loop, so we will
      never fail in the middle of the insertions any more.

      One side effect is that the window between allocation and final
      insertion becomes larger, so it is now more likely that the loop in
      tcf_del_walker() sees the placeholder -EBUSY pointer. So we have
      to check for an error pointer in tcf_del_walker().
      
      Reported-and-tested-by: syzbot+2287853d392e4b42374a@syzkaller.appspotmail.com
      Fixes: 0190c1d4 ("net: sched: atomically check-allocate action")
      Cc: Vlad Buslov <vladbu@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
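The "prepare everything first, then publish in a loop that cannot fail" pattern from commit 0fedc63f can be illustrated in userspace. This is a deliberately simplified sketch: `toy_table` stands in for the global IDR, the failure condition in `toy_action_prepare()` is invented, and none of the `toy_*` names exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_TABLE_SIZE 8

static int toy_table[TOY_TABLE_SIZE];   /* stands in for the global IDR */
static size_t toy_table_used;

/* Phase 1 helper: validate one request without touching global state. */
static bool toy_action_prepare(int req, int *out)
{
    if (req <= 0)
        return false;   /* hypothetical failure condition */
    *out = req;
    return true;
}

/* Prepare every action first; publish them only if all of them succeeded. */
static bool toy_actions_init(const int *reqs, size_t n)
{
    int staged[TOY_TABLE_SIZE];

    assert(n <= TOY_TABLE_SIZE);
    if (toy_table_used + n > TOY_TABLE_SIZE)
        return false;   /* capacity is checked before anything is published */

    for (size_t i = 0; i < n; i++)
        if (!toy_action_prepare(reqs[i], &staged[i]))
            return false;   /* nothing published yet, so nothing to undo */

    for (size_t i = 0; i < n; i++)   /* phase 2: plain stores, cannot fail */
        toy_table[toy_table_used++] = staged[i];

    return true;
}
```

Because a failed batch never reaches `toy_table`, no other reader of the table can observe, or free, a half-initialised batch, which is the hazard the commit message describes for the global IDR.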
    • net_sched: defer tcf_idr_insert() in tcf_action_init_1() · e49d8c22
      Committed by Cong Wang
      All TC actions call tcf_idr_insert() for a new action at the end
      of their ->init(), so we can actually move it to a central place
      in tcf_action_init_1().

      Once the action is inserted into the global IDR, another parallel
      process could free it immediately, as its refcnt is still 1, so we
      cannot fail after this point. We need to move the insertion after the
      goto action validation to avoid handling a failure case after insertion.

      This was found during code review and is not directly triggered by syzbot.
      It also prepares for the next patch.
      
      Cc: Vlad Buslov <vladbu@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 24 Sep, 2020 2 commits
  5. 22 Sep, 2020 3 commits
    • inet_diag: validate INET_DIAG_REQ_PROTOCOL attribute · d5e4d0a5
      Committed by Eric Dumazet
      User space could send an invalid INET_DIAG_REQ_PROTOCOL attribute
      as caught by syzbot.
      
      BUG: KMSAN: uninit-value in inet_diag_lock_handler net/ipv4/inet_diag.c:55 [inline]
      BUG: KMSAN: uninit-value in __inet_diag_dump+0x58c/0x720 net/ipv4/inet_diag.c:1147
      CPU: 0 PID: 8505 Comm: syz-executor174 Not tainted 5.9.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x21c/0x280 lib/dump_stack.c:118
       kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:122
       __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:219
       inet_diag_lock_handler net/ipv4/inet_diag.c:55 [inline]
       __inet_diag_dump+0x58c/0x720 net/ipv4/inet_diag.c:1147
       inet_diag_dump_compat+0x2a5/0x380 net/ipv4/inet_diag.c:1254
       netlink_dump+0xb73/0x1cb0 net/netlink/af_netlink.c:2246
       __netlink_dump_start+0xcf2/0xea0 net/netlink/af_netlink.c:2354
       netlink_dump_start include/linux/netlink.h:246 [inline]
       inet_diag_rcv_msg_compat+0x5da/0x6c0 net/ipv4/inet_diag.c:1288
       sock_diag_rcv_msg+0x24f/0x620 net/core/sock_diag.c:256
       netlink_rcv_skb+0x6d7/0x7e0 net/netlink/af_netlink.c:2470
       sock_diag_rcv+0x63/0x80 net/core/sock_diag.c:275
       netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
       netlink_unicast+0x11c8/0x1490 net/netlink/af_netlink.c:1330
       netlink_sendmsg+0x173a/0x1840 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:651 [inline]
       sock_sendmsg net/socket.c:671 [inline]
       ____sys_sendmsg+0xc82/0x1240 net/socket.c:2353
       ___sys_sendmsg net/socket.c:2407 [inline]
       __sys_sendmsg+0x6d1/0x820 net/socket.c:2440
       __do_sys_sendmsg net/socket.c:2449 [inline]
       __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
       __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
       do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x441389
      Code: e8 fc ab 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 09 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fff3b02ce98 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441389
      RDX: 0000000000000000 RSI: 0000000020001500 RDI: 0000000000000003
      RBP: 00000000006cb018 R08: 00000000004002c8 R09: 00000000004002c8
      R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000402130
      R13: 00000000004021c0 R14: 0000000000000000 R15: 0000000000000000
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:143 [inline]
       kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:126
       kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:80
       slab_alloc_node mm/slub.c:2907 [inline]
       __kmalloc_node_track_caller+0x9aa/0x12f0 mm/slub.c:4511
       __kmalloc_reserve net/core/skbuff.c:142 [inline]
       __alloc_skb+0x35f/0xb30 net/core/skbuff.c:210
       alloc_skb include/linux/skbuff.h:1094 [inline]
       netlink_alloc_large_skb net/netlink/af_netlink.c:1176 [inline]
       netlink_sendmsg+0xdb9/0x1840 net/netlink/af_netlink.c:1894
       sock_sendmsg_nosec net/socket.c:651 [inline]
       sock_sendmsg net/socket.c:671 [inline]
       ____sys_sendmsg+0xc82/0x1240 net/socket.c:2353
       ___sys_sendmsg net/socket.c:2407 [inline]
       __sys_sendmsg+0x6d1/0x820 net/socket.c:2440
       __do_sys_sendmsg net/socket.c:2449 [inline]
       __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
       __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
       do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 3f935c75 ("inet_diag: support for wider protocol numbers")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Christoph Paasch <cpaasch@apple.com>
      Cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: br_vlan_get_pvid_rcu() should dereference the VLAN group under RCU · 99f62a74
      Committed by Vladimir Oltean
      When calling the RCU brother of br_vlan_get_pvid(), lockdep warns:
      
      =============================
      WARNING: suspicious RCU usage
      5.9.0-rc3-01631-g13c17acb8e38-dirty #814 Not tainted
      -----------------------------
      net/bridge/br_private.h:1054 suspicious rcu_dereference_protected() usage!
      
      Call trace:
       lockdep_rcu_suspicious+0xd4/0xf8
       __br_vlan_get_pvid+0xc0/0x100
       br_vlan_get_pvid_rcu+0x78/0x108
      
      The warning is because br_vlan_get_pvid_rcu() calls nbp_vlan_group()
      which calls rtnl_dereference() instead of rcu_dereference(). In turn,
      rtnl_dereference() calls rcu_dereference_protected() which assumes
      operation under an RCU write-side critical section, which obviously is
      not the case here. So, when the incorrect primitive is used to access
      the RCU-protected VLAN group pointer, READ_ONCE() is not used, which may
      cause various unexpected problems.
      
      I'm sad to say that br_vlan_get_pvid() and br_vlan_get_pvid_rcu() cannot
      share the same implementation. So fix the bug by splitting the 2
      functions, and making br_vlan_get_pvid_rcu() retrieve the VLAN groups
      under proper locking annotations.
      
      Fixes: 7582f5b7 ("bridge: add br_vlan_get_pvid_rcu()")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: route: convert comma to semicolon · 91b2c9a0
      Committed by Xu Wang
      Replace a comma between expression statements with a semicolon.
      Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 21 Sep, 2020 1 commit
  7. 19 Sep, 2020 2 commits
    • net: ipv6: fix kconfig dependency warning for IPV6_SEG6_HMAC · db7cd91a
      Committed by Necip Fazil Yildiran
      When IPV6_SEG6_HMAC is enabled and CRYPTO is disabled, it results in the
      following Kbuild warning:
      
      WARNING: unmet direct dependencies detected for CRYPTO_HMAC
        Depends on [n]: CRYPTO [=n]
        Selected by [y]:
        - IPV6_SEG6_HMAC [=y] && NET [=y] && INET [=y] && IPV6 [=y]
      
      WARNING: unmet direct dependencies detected for CRYPTO_SHA1
        Depends on [n]: CRYPTO [=n]
        Selected by [y]:
        - IPV6_SEG6_HMAC [=y] && NET [=y] && INET [=y] && IPV6 [=y]
      
      WARNING: unmet direct dependencies detected for CRYPTO_SHA256
        Depends on [n]: CRYPTO [=n]
        Selected by [y]:
        - IPV6_SEG6_HMAC [=y] && NET [=y] && INET [=y] && IPV6 [=y]
      
      The reason is that IPV6_SEG6_HMAC selects CRYPTO_HMAC, CRYPTO_SHA1, and
      CRYPTO_SHA256 without depending on or selecting CRYPTO while those configs
      are subordinate to CRYPTO.
      
      Honor the kconfig menu hierarchy to remove kconfig dependency warnings.
      
      Fixes: bf355b8d ("ipv6: sr: add core files for SR HMAC support")
      Signed-off-by: Necip Fazil Yildiran <fazilyildiran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: add locking for the port TX timestamp ID · 6565243c
      Committed by Vladimir Oltean
      The ocelot_port->ts_id is used to:
      (a) populate skb->cb[0] for matching the TX timestamp in the PTP IRQ
          with an skb.
      (b) populate the REW_OP from the injection header of the ongoing skb.
      Only then is ocelot_port->ts_id incremented.
      
      This is a problem because, at least theoretically, another timestampable
      skb might use the same ocelot_port->ts_id before that is incremented.
      Normally all transmit calls are serialized by the netdev transmit
      spinlock, but in this case, ocelot_port_add_txtstamp_skb() is also
      called by DSA, which has started declaring the NETIF_F_LLTX feature
      since commit 2b86cb82 ("net: dsa: declare lockless TX feature for
      slave ports").  So the logic of using and incrementing the timestamp id
      should be atomic per port.
      
      The solution is to use the global ocelot_port->ts_id only while
      protected by the associated ocelot_port->ts_id_lock. That's where we
      populate skb->cb[0]. Note that for ocelot, ocelot_port_add_txtstamp_skb
      is called for the actual skb, but for felix, it is called for the skb's
      clone. That is something which will also be changed in the future.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Tested-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
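The "use and increment the ID atomically per port" idea from commit 6565243c can be sketched in userspace. Assumptions are stated loudly here: a pthread mutex is swapped in for the kernel spinlock, `struct toy_port`, `struct toy_skb` and `toy_port_claim_ts_id()` are hypothetical names, and the wrap-around value `TOY_TS_ID_COUNT` is invented for the sketch rather than taken from the commit message.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define TOY_TS_ID_COUNT 4u   /* assumed wrap-around; not from the commit text */

struct toy_port {
    pthread_mutex_t ts_id_lock;  /* stands in for ocelot_port->ts_id_lock */
    uint8_t ts_id;               /* next timestamp ID to hand out */
};

struct toy_skb {
    uint8_t cb0;                 /* stands in for skb->cb[0] */
};

/* Copy the port's current ID into the skb and advance it, all under the lock,
 * so that two concurrent transmitters can never receive the same ID. */
static uint8_t toy_port_claim_ts_id(struct toy_port *port, struct toy_skb *skb)
{
    uint8_t id;

    assert(port && skb);

    pthread_mutex_lock(&port->ts_id_lock);
    id = port->ts_id;
    skb->cb0 = id;
    port->ts_id = (uint8_t)((id + 1u) % TOY_TS_ID_COUNT);
    pthread_mutex_unlock(&port->ts_id_lock);

    return id;
}
```

A caller would initialise the port with `PTHREAD_MUTEX_INITIALIZER` and use the returned value, not a later read of `port->ts_id`, when filling in any header field that must match `skb->cb[0]`.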
  8. 18 Sep, 2020 9 commits
  9. 16 Sep, 2020 3 commits
    • bpf: Bpf_skc_to_* casting helpers require a NULL check on sk · 8c33dadc
      Committed by Martin KaFai Lau
      The bpf_skc_to_* type casting helpers are available to
      BPF_PROG_TYPE_TRACING.  The traced PTR_TO_BTF_ID may be NULL.
      For example, the skb->sk may be NULL.  Thus, these casting helpers
      need to check "!sk" also and this patch fixes them.
      
      Fixes: 0d4fad3e ("bpf: Add bpf_skc_to_udp6_sock() helper")
      Fixes: 478cfbdf ("bpf: Add bpf_skc_to_{tcp, tcp_timewait, tcp_request}_sock() helpers")
      Fixes: af7ec138 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200915182959.241101-1-kafai@fb.com
    • ipv4: Update exception handling for multipath routes via same device · 2fbc6e89
      Committed by David Ahern
      Kfir reported that pmtu exceptions are not created properly for
      deployments where multipath routes use the same device.
      
      After some digging I see 2 compounding problems:
      1. ip_route_output_key_hash_rcu is updating the flowi4_oif *after*
         the route lookup. This is the second use case where this has
         been a problem (the first is related to use of vti devices with
         VRF). I can not find any reason for the oif to be changed after the
         lookup; the code goes back to the start of git. It does not seem
         logical so remove it.
      
      2. fib_lookups for exceptions do not call fib_select_path to handle
         multipath route selection based on the hash.
      
      The end result is that the fib_lookup used to add the exception
      always creates it using the first leg of the route.
      
      An example topology showing the problem:
      
                       |  host1
                   +------+
                   | eth0 |  .209
                   +------+
                       |
                   +------+
           switch  | br0  |
                   +------+
                       |
             +---------+---------+
             | host2             |  host3
         +------+             +------+
         | eth0 | .250        | eth0 | 192.168.252.252
         +------+             +------+
      
         +-----+             +-----+
         | vti | .2          | vti | 192.168.247.3
         +-----+             +-----+
             \                  /
       =================================
       tunnels
               192.168.247.1/24
      
      for h in host1 host2 host3; do
              ip netns add ${h}
              ip -netns ${h} link set lo up
              ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1
      done
      
      ip netns add switch
      ip -netns switch li set lo up
      ip -netns switch link add br0 type bridge stp 0
      ip -netns switch link set br0 up
      
      for n in 1 2 3; do
              ip -netns switch link add eth-sw type veth peer name eth-h${n}
              ip -netns switch li set eth-h${n} master br0 up
              ip -netns switch li set eth-sw netns host${n} name eth0
      done
      
      ip -netns host1 addr add 192.168.252.209/24 dev eth0
      ip -netns host1 link set dev eth0 up
      ip -netns host1 route add 192.168.247.0/24 \
              nexthop via 192.168.252.250 dev eth0 nexthop via 192.168.252.252 dev eth0
      
      ip -netns host2 addr add 192.168.252.250/24 dev eth0
      ip -netns host2 link set dev eth0 up
      
      ip -netns host3 addr add 192.168.252.252/24 dev eth0
      ip -netns host3 link set dev eth0 up
      
      ip netns add tunnel
      ip -netns tunnel li set lo up
      ip -netns tunnel li add br0 type bridge
      ip -netns tunnel li set br0 up
      for n in $(seq 11 20); do
              ip -netns tunnel addr add dev br0 192.168.247.${n}/24
      done
      
      for n in 2 3
      do
              ip -netns tunnel link add vti${n} type veth peer name eth${n}
              ip -netns tunnel link set eth${n} mtu 1360 master br0 up
              ip -netns tunnel link set vti${n} netns host${n} mtu 1360 up
              ip -netns host${n} addr add dev vti${n} 192.168.247.${n}/24
      done
      ip -netns tunnel ro add default nexthop via 192.168.247.2 nexthop via 192.168.247.3
      
      ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.11
      ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.15
      ip -netns host1 ro ls cache
      
      Before this patch the cache always shows exceptions against the first
      leg in the multipath route; 192.168.252.250 per this example. Since the
      hash has an initial random seed, you may need to vary the final octet
      more than what is listed. In my tests, using addresses between 11 and 19
      usually found 1 that used both legs.
      
      With this patch, the cache will have exceptions for both legs.
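
The before/after behaviour can be sketched as a toy model. This is not the kernel's selection logic: the modulo rule and all function names are assumptions made for illustration (the kernel selects a leg from the flow hash, but not by a plain modulo). The two addresses are the legs from the example above.

```python
LEGS = ["192.168.252.250", "192.168.252.252"]  # the two nexthops on eth0


def select_leg(flow_hash, legs=LEGS):
    """Hypothetical hash-based selection; modulo is a simplification."""
    return legs[flow_hash % len(legs)]


def exception_leg_before(flow_hash, legs=LEGS):
    """Before the patch: the lookup for the exception ignores the hash."""
    return legs[0]


def exception_leg_after(flow_hash, legs=LEGS):
    """After the patch: the exception lookup selects the leg by hash too."""
    return select_leg(flow_hash, legs)
```

For a flow whose hash selects the second leg, the old lookup still records the exception against 192.168.252.250, so the leg that actually carries the traffic never sees the reduced MTU.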
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions")
      Reported-by: Kfir Itzhak <mastertheknife@gmail.com>
      Signed-off-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2fbc6e89
    • L
      net: tipc: kerneldoc fixes · 2e5117ba
      Lu Wei committed
      Fix parameter description of tipc_link_bc_create()
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Fixes: 16ad3f40 ("tipc: introduce variable window congestion control")
      Signed-off-by: Lu Wei <luwei32@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2e5117ba
  10. 15 Sep, 2020 7 commits
    • L
      batman-adv: mcast: fix duplicate mcast packets from BLA backbone to mesh · 2369e827
      Linus Lüssing committed
      Scenario:
      * Multicast frame sent from BLA backbone gateways (multiple nodes
        with their bat0 bridged together, with BLA enabled) sharing the same
        LAN to nodes in the mesh
      
      Issue:
      * Nodes receive the frame multiple times on bat0 from the mesh,
        once from each foreign BLA backbone gateway which shares the same LAN
        with another
      
      For multicast frames via batman-adv broadcast packets coming from the
      same BLA backbone but from different backbone gateways duplicates are
      currently detected via a CRC history of previously received packets.
      
      However this CRC check so far was not performed for multicast frames received
      via batman-adv unicast packets. Fix this by applying the same check
      for such packets, too.
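
A CRC history of this kind can be sketched as follows. This is a simplified userspace model, not the batman-adv code: the class name, the history size and the use of `zlib.crc32` are assumptions, and entry timeouts are left out.

```python
import zlib
from collections import deque


class DuplicateList:
    """Toy CRC history of recently received payloads."""

    def __init__(self, size=16):
        # Oldest entries fall out automatically once the history is full.
        self.entries = deque(maxlen=size)

    def is_duplicate(self, payload: bytes, gateway: str) -> bool:
        crc = zlib.crc32(payload)
        for old_crc, old_gateway in self.entries:
            # Same payload seen earlier via a different backbone gateway.
            if old_crc == crc and old_gateway != gateway:
                return True
        self.entries.append((crc, gateway))
        return False
```

The same check then applies regardless of whether the frame arrived in a broadcast or a unicast packet; only the payload CRC and the forwarding gateway matter in this model.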
      
      Room for improvements in the future: Ideally we would introduce the
      possibility to not only claim a client, but a complete originator, too.
      This would allow us to only send a multicast-in-unicast packet from a BLA
      backbone gateway claiming the node and by that avoid potential redundant
      transmissions in the first place.
      
      Fixes: 279e89b2 ("batman-adv: add broadcast duplicate check")
      Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
      2369e827
    • L
      batman-adv: mcast: fix duplicate mcast packets in BLA backbone from mesh · 74c09b72
      Linus Lüssing committed
      Scenario:
      * Multicast frame sent from mesh to a BLA backbone (multiple nodes
        with their bat0 bridged together, with BLA enabled)
      
      Issue:
      * BLA backbone nodes receive the frame multiple times on bat0,
        once from mesh->bat0 and once from each backbone_gw from LAN
      
      For unicast, a node will send only to the best backbone gateway
      according to the TQ. However for multicast we currently cannot determine
      if multiple destination nodes share the same backbone if they don't share
      the same backbone with us. So we need to keep sending the unicasts to
      all backbone gateways and let the backbone gateways decide which one
      will forward the frame. We can use the CLAIM mechanism to make this
      decision.
      
      One catch: The batman-adv gateway feature for DHCP packets potentially
      sends multicast packets in the same batman-adv unicast header as the
      multicast optimizations code. And we are not allowed to drop those even
      if we did not claim the source address of the sender, as for such
      packets there is only this one multicast-in-unicast packet.
      
      How can we distinguish the two cases?
      
      The gateway feature uses a batman-adv unicast 4 address header. While
      the multicast-to-unicasts feature uses a simple, 3 address batman-adv
      unicast header. So let's use this to distinguish.
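
The decision described above can be sketched as a small predicate. This is an illustrative model only: the constants and the function name are hypothetical, and the real receive path has more cases than shown here.

```python
UNICAST = "unicast"              # simple 3 address header (multicast-to-unicast)
UNICAST_4ADDR = "unicast_4addr"  # 4 address header (gateway feature, e.g. DHCP)


def forward_mcast_to_lan(packet_type: str, source_claimed_by_us: bool) -> bool:
    """Should this backbone gateway forward a multicast payload into the LAN?"""
    if packet_type == UNICAST_4ADDR:
        # Only one such packet exists for the frame, so it must not be dropped.
        return True
    # Multicast-to-unicast copies reach every backbone gateway; the CLAIM
    # mechanism decides which single gateway forwards.
    return source_claimed_by_us
```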
      
      Fixes: fe2da6ff ("batman-adv: check incoming packet type for bla")
      Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
      74c09b72
    • L
      batman-adv: mcast: fix duplicate mcast packets in BLA backbone from LAN · 3236d215
      Linus Lüssing committed
      Scenario:
      * Multicast frame sent from a BLA backbone (multiple nodes with
        their bat0 bridged together, with BLA enabled)
      
      Issue:
      * BLA backbone nodes receive the frame multiple times on bat0
      
      For multicast frames received via batman-adv broadcast packets the
      originator of the broadcast packet is checked before decapsulating and
      forwarding the frame to bat0 (batadv_bla_is_backbone_gw()->
      batadv_recv_bcast_packet()). If it came from a node which shares the
      same BLA backbone with us then it is not forwarded to bat0 to avoid a
      loop.
      
      When sending a multicast frame in a non-4-address batman-adv unicast
      packet we are currently missing this check - and cannot do so because
      the batman-adv unicast packet has no originator address field.
      
      However, we can simply fix this on the sender side by only sending the
      multicast frame via unicasts to interested nodes which do not share the
      same BLA backbone with us. This also nicely avoids some unnecessary
      transmissions on mesh side.
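
The sender-side fix amounts to a filter over the destination list. A minimal sketch, with hypothetical names and plain strings standing in for originator addresses:

```python
def unicast_targets(interested_nodes, same_backbone_nodes):
    """Keep only interested nodes that do not share our BLA backbone."""
    shared = set(same_backbone_nodes)
    return [node for node in interested_nodes if node not in shared]
```

Nodes on the same backbone already receive the frame over the LAN, so skipping them removes both the duplicate and the extra mesh transmission.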
      
      Note that no infinite loop was observed, probably because of dropping
      via batadv_interface_tx()->batadv_bla_tx(). However the duplicates still
      utterly confuse switches/bridges, ICMPv6 duplicate address detection and
      neighbor discovery and therefore lead to long delays before being able
      to establish TCP connections, for instance. And it also leads to the Linux
      bridge printing messages like:
      "br-lan: received packet on eth1 with own address as source address ..."
      
      Fixes: 2d3f6ccc ("batman-adv: Modified forwarding behaviour for multicast packets")
      Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
      3236d215
    • B
      xsk: Fix number of pinned pages/umem size discrepancy · 2b1667e5
      Björn Töpel committed
      For AF_XDP sockets, there was a discrepancy between the number of
      pinned pages and the size of the umem region.
      
      The size of the umem region is used to validate the AF_XDP descriptor
      addresses. The logic that pinned the pages covered by the region only
      took whole pages into consideration, creating a mismatch between the
      size and pinned pages. A user could then pass AF_XDP addresses outside
      the range of pinned pages, but still within the size of the region,
      crashing the kernel.
      
      This change correctly calculates the number of pages to be
      pinned. Further, the size check for the aligned mode is
      simplified. Now the code simply checks if the size is divisible by the
      chunk size.
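
The arithmetic behind the mismatch can be worked through in a few lines. This is a sketch under assumptions: a 4096-byte page size, hypothetical function names, and a "before" calculation inferred from the description above (whole pages only) rather than copied from the kernel.

```python
PAGE_SIZE = 4096  # assumed page size


def pinned_pages_before(size: int) -> int:
    """Whole pages only: a trailing partial page is not pinned."""
    return size // PAGE_SIZE


def pinned_pages_after(size: int) -> int:
    """Round up so the partial page at the end is pinned as well."""
    pages, remainder = divmod(size, PAGE_SIZE)
    return pages + 1 if remainder else pages


def aligned_size_ok(size: int, chunk_size: int) -> bool:
    """Aligned mode: the region size must be divisible by the chunk size."""
    return size % chunk_size == 0
```

For a region of 2.5 pages, an address in the last half page passes the size check, yet its page index (2) lies outside the 2 pages pinned by the old calculation.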
      
      Fixes: bbff2f32 ("xsk: new descriptor addressing scheme")
      Reported-by: Ciara Loftus <ciara.loftus@intel.com>
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Tested-by: Ciara Loftus <ciara.loftus@intel.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200910075609.7904-1-bjorn.topel@gmail.com
      2b1667e5
    • X
      net: sched: initialize with 0 before setting erspan md->u · 8e1b3ac4
      Xin Long committed
      In fl_set_erspan_opt(), all bits of the erspan md were set to 1, as this
      function is also used to set the opt MASK. However, when setting
      md->u.index for the opt VALUE, the remaining bits of the union md->u
      were left as 1. This caused the match of the whole md to fail when
      the version is 1 and only the index is set.
      
      Fix this by initializing md->u with 0 before setting the erspan
      md->u fields.
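
The effect of the stale 1 bits can be reproduced with a byte-level model. This is a sketch, not the flower code: the 8-byte union size, the big-endian index in the first 4 bytes, the all-zero tail of the packet metadata, and the function names are assumptions made for illustration.

```python
import struct

U_SIZE = 8  # assumed size in bytes of the md->u union


def build_u(index: int, zero_first: bool) -> bytes:
    """Build the union bytes the way the option setter would."""
    u = bytearray(b"\xff" * U_SIZE)   # all ones, as for a full mask
    if zero_first:
        u = bytearray(U_SIZE)         # the fix: clear the union first
    struct.pack_into(">I", u, 0, index)
    return bytes(u)


def matches(pkt: bytes, key: bytes, mask: bytes) -> bool:
    """Masked comparison over every byte of the union."""
    return all((p & m) == (k & m) for p, k, m in zip(pkt, key, mask))
```

With index 5 and a full index mask, the packet carries zeros after the index, so a key and mask whose trailing bytes are still 0xff cannot match; clearing the union first makes those bytes "don't care".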
      Reported-by: Shuang Li <shuali@redhat.com>
      Fixes: 79b1011c ("net: sched: allow flower to match erspan options")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8e1b3ac4
    • X
      lwtunnel: only keep the available bits when setting vxlan md->gbp · 681d2cfb
      Xin Long committed
      As we can see from vxlan_build/parse_gbp_hdr(), when processing metadata
      on the vxlan rx/tx path, only the dont_learn/policy_applied/policy_id fields
      can be set in or parsed from the packet for the vxlan gbp option.
      
      So apply the mask when setting it in lwtunnel, as is already done in
      act_tunnel_key and cls_flower.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      681d2cfb
    • X
      net: sched: only keep the available bits when setting vxlan md->gbp · 13e6ce98
      Xin Long committed
      As we can see from vxlan_build/parse_gbp_hdr(), when processing metadata
      on the vxlan rx/tx path, only the dont_learn/policy_applied/policy_id fields
      can be set in or parsed from the packet for the vxlan gbp option.
      
      So we had better apply the mask when setting it in act_tunnel_key and
      cls_flower. Otherwise, users who are unaware of these bits may configure
      a value which can never be matched.
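
The masking step can be sketched as below. The bit positions are assumptions based on how the VXLAN GBP flags are commonly laid out in include/net/vxlan.h (don't-learn and policy-applied flags above a 16-bit policy ID); treat them as illustrative rather than authoritative, and `set_gbp` as a hypothetical helper.

```python
# Assumed bit layout of the gbp value.
VXLAN_GBP_DONT_LEARN = (1 << 6) << 16
VXLAN_GBP_POLICY_APPLIED = (1 << 3) << 16
VXLAN_GBP_ID_MASK = 0xFFFF
VXLAN_GBP_MASK = (VXLAN_GBP_DONT_LEARN
                  | VXLAN_GBP_POLICY_APPLIED
                  | VXLAN_GBP_ID_MASK)


def set_gbp(user_value: int) -> int:
    """Keep only the bits that can actually appear in a packet."""
    return user_value & VXLAN_GBP_MASK
```

Any bit outside the mask would never be produced when parsing a received packet, so a configured value that keeps such a bit could not match.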
      Reported-by: Shuang Li <shuali@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      13e6ce98