1. 01 12月, 2018 4 次提交
    • E
      tcp: md5: add tcp_md5_needed jump label · 6015c71e
      Eric Dumazet 提交于
      Most linux hosts never setup TCP MD5 keys. We can avoid a
      cache line miss (accessing tp->md5ig_info) on RX and TX
      using a jump label.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6015c71e
    • E
      tcp: implement coalescing on backlog queue · 4f693b55
      Eric Dumazet 提交于
      In case GRO is not as efficient as it should be or disabled,
      we might have a user thread trapped in __release_sock() while
      softirq handler flood packets up to the point we have to drop.
      
      This patch balances work done from user thread and softirq,
      to give more chances to __release_sock() to complete its work
      before new packets are added the the backlog.
      
      This also helps if we receive many ACK packets, since GRO
      does not aggregate them.
      
      This patch brings ~60% throughput increase on a receiver
      without GRO, but the spectacular gain is really on
      1000x release_sock() latency reduction I have measured.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f693b55
    • E
      tcp: take care of compressed acks in tcp_add_reno_sack() · 19119f29
      Eric Dumazet 提交于
      Neal pointed out that non sack flows might suffer from ACK compression
      added in the following patch ("tcp: implement coalescing on backlog queue")
      
      Instead of tweaking tcp_add_backlog() we can take into
      account how many ACK were coalesced, this information
      will be available in skb_shinfo(skb)->gso_segs
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19119f29
    • G
      net: Add trace events for all receive exit points · b0e3f1bd
      Geneviève Bastien 提交于
      Trace events are already present for the receive entry points, to indicate
      how the reception entered the stack.
      
      This patch adds the corresponding exit trace events that will bound the
      reception such that all events occurring between the entry and the exit
      can be considered as part of the reception context. This greatly helps
      for dependency and root cause analyses.
      
      Without this, it is not possible with tracepoint instrumentation to
      determine whether a sched_wakeup event following a netif_receive_skb
      event is the result of the packet reception or a simple coincidence after
      further processing by the thread. It is possible using other mechanisms
      like kretprobes, but considering the "entry" points are already present,
      it would be good to add the matching exit events.
      
      In addition to linking packets with wakeups, the entry/exit event pair
      can also be used to perform network stack latency analyses.
      Signed-off-by: NGeneviève Bastien <gbastien@versatic.net>
      CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      CC: Steven Rostedt <rostedt@goodmis.org>
      CC: Ingo Molnar <mingo@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> (tracing side)
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b0e3f1bd
  2. 30 11月, 2018 2 次提交
  3. 29 11月, 2018 1 次提交
  4. 28 11月, 2018 14 次提交
    • T
      netfilter: nf_tables: deactivate expressions in rule replecement routine · ca089878
      Taehee Yoo 提交于
      There is no expression deactivation call from the rule replacement path,
      hence, chain counter is not decremented. A few steps to reproduce the
      problem:
      
         %nft add table ip filter
         %nft add chain ip filter c1
         %nft add chain ip filter c1
         %nft add rule ip filter c1 jump c2
         %nft replace rule ip filter c1 handle 3 accept
         %nft flush ruleset
      
      <jump c2> expression means immediate NFT_JUMP to chain c2.
      Reference count of chain c2 is increased when the rule is added.
      
      When rule is deleted or replaced, the reference counter of c2 should be
      decreased via nft_rule_expr_deactivate() which calls
      nft_immediate_deactivate().
      
      Splat looks like:
      [  214.396453] WARNING: CPU: 1 PID: 21 at net/netfilter/nf_tables_api.c:1432 nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables]
      [  214.398983] Modules linked in: nf_tables nfnetlink
      [  214.398983] CPU: 1 PID: 21 Comm: kworker/1:1 Not tainted 4.20.0-rc2+ #44
      [  214.398983] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
      [  214.398983] RIP: 0010:nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables]
      [  214.398983] Code: 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 8e 00 00 00 48 8b 7b 58 e8 e1 2c 4e c6 48 89 df e8 d9 2c 4e c6 eb 9a <0f> 0b eb 96 0f 0b e9 7e fe ff ff e8 a7 7e 4e c6 e9 a4 fe ff ff e8
      [  214.398983] RSP: 0018:ffff8881152874e8 EFLAGS: 00010202
      [  214.398983] RAX: 0000000000000001 RBX: ffff88810ef9fc28 RCX: ffff8881152876f0
      [  214.398983] RDX: dffffc0000000000 RSI: 1ffff11022a50ede RDI: ffff88810ef9fc78
      [  214.398983] RBP: 1ffff11022a50e9d R08: 0000000080000000 R09: 0000000000000000
      [  214.398983] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff11022a50eba
      [  214.398983] R13: ffff888114446e08 R14: ffff8881152876f0 R15: ffffed1022a50ed6
      [  214.398983] FS:  0000000000000000(0000) GS:ffff888116400000(0000) knlGS:0000000000000000
      [  214.398983] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  214.398983] CR2: 00007fab9bb5f868 CR3: 000000012aa16000 CR4: 00000000001006e0
      [  214.398983] Call Trace:
      [  214.398983]  ? nf_tables_table_destroy.isra.37+0x100/0x100 [nf_tables]
      [  214.398983]  ? __kasan_slab_free+0x145/0x180
      [  214.398983]  ? nf_tables_trans_destroy_work+0x439/0x830 [nf_tables]
      [  214.398983]  ? kfree+0xdb/0x280
      [  214.398983]  nf_tables_trans_destroy_work+0x5f5/0x830 [nf_tables]
      [ ... ]
      
      Fixes: bb7b40ae ("netfilter: nf_tables: bogus EBUSY in chain deletions")
      Reported by: Christoph Anton Mitterer <calestyo@scientia.net>
      Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=914505
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=201791Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      ca089878
    • D
      net/ipv4: Fix missing raw_init when CONFIG_PROC_FS is disabled · 86d1d8b7
      David Ahern 提交于
      Randy reported when CONFIG_PROC_FS is not enabled:
          ld: net/ipv4/af_inet.o: in function `inet_init':
          af_inet.c:(.init.text+0x42d): undefined reference to `raw_init'
      
      Fix by moving the endif up to the end of the proc entries
      
      Fixes: 6897445f ("net: provide a sysctl raw_l3mdev_accept for raw socket lookup with VRFs")
      Reported-by: NRandy Dunlap <rdunlap@infradead.org>
      Cc: Mike Manning <mmanning@vyatta.att-mail.com>
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      86d1d8b7
    • E
      tcp: remove hdrlen argument from tcp_queue_rcv() · e7395f1f
      Eric Dumazet 提交于
      Only one caller needs to pull TCP headers, so lets
      move __skb_pull() to the caller side.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7395f1f
    • V
      net/ncsi: Add NCSI Mellanox OEM command · 16e8c4ca
      Vijay Khemka 提交于
      This patch adds OEM Mellanox commands and response handling. It also
      defines OEM Get MAC Address handler to get and configure the device.
      
      ncsi_oem_gma_handler_mlx: This handler send NCSI mellanox command for
      getting mac address.
      ncsi_rsp_handler_oem_mlx: This handles response received for all
      mellanox OEM commands.
      ncsi_rsp_handler_oem_mlx_gma: This handles get mac address response and
      set it to device.
      Signed-off-by: NVijay Khemka <vijaykhemka@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      16e8c4ca
    • J
      tipc: fix lockdep warning during node delete · ec835f89
      Jon Maloy 提交于
      We see the following lockdep warning:
      
      [ 2284.078521] ======================================================
      [ 2284.078604] WARNING: possible circular locking dependency detected
      [ 2284.078604] 4.19.0+ #42 Tainted: G            E
      [ 2284.078604] ------------------------------------------------------
      [ 2284.078604] rmmod/254 is trying to acquire lock:
      [ 2284.078604] 00000000acd94e28 ((&n->timer)#2){+.-.}, at: del_timer_sync+0x5/0xa0
      [ 2284.078604]
      [ 2284.078604] but task is already holding lock:
      [ 2284.078604] 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x190 [tipc]
      [ 2284.078604]
      [ 2284.078604] which lock already depends on the new lock.
      [ 2284.078604]
      [ 2284.078604]
      [ 2284.078604] the existing dependency chain (in reverse order) is:
      [ 2284.078604]
      [ 2284.078604] -> #1 (&(&tn->node_list_lock)->rlock){+.-.}:
      [ 2284.078604]        tipc_node_timeout+0x20a/0x330 [tipc]
      [ 2284.078604]        call_timer_fn+0xa1/0x280
      [ 2284.078604]        run_timer_softirq+0x1f2/0x4d0
      [ 2284.078604]        __do_softirq+0xfc/0x413
      [ 2284.078604]        irq_exit+0xb5/0xc0
      [ 2284.078604]        smp_apic_timer_interrupt+0xac/0x210
      [ 2284.078604]        apic_timer_interrupt+0xf/0x20
      [ 2284.078604]        default_idle+0x1c/0x140
      [ 2284.078604]        do_idle+0x1bc/0x280
      [ 2284.078604]        cpu_startup_entry+0x19/0x20
      [ 2284.078604]        start_secondary+0x187/0x1c0
      [ 2284.078604]        secondary_startup_64+0xa4/0xb0
      [ 2284.078604]
      [ 2284.078604] -> #0 ((&n->timer)#2){+.-.}:
      [ 2284.078604]        del_timer_sync+0x34/0xa0
      [ 2284.078604]        tipc_node_delete+0x1a/0x40 [tipc]
      [ 2284.078604]        tipc_node_stop+0xcb/0x190 [tipc]
      [ 2284.078604]        tipc_net_stop+0x154/0x170 [tipc]
      [ 2284.078604]        tipc_exit_net+0x16/0x30 [tipc]
      [ 2284.078604]        ops_exit_list.isra.8+0x36/0x70
      [ 2284.078604]        unregister_pernet_operations+0x87/0xd0
      [ 2284.078604]        unregister_pernet_subsys+0x1d/0x30
      [ 2284.078604]        tipc_exit+0x11/0x6f2 [tipc]
      [ 2284.078604]        __x64_sys_delete_module+0x1df/0x240
      [ 2284.078604]        do_syscall_64+0x66/0x460
      [ 2284.078604]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 2284.078604]
      [ 2284.078604] other info that might help us debug this:
      [ 2284.078604]
      [ 2284.078604]  Possible unsafe locking scenario:
      [ 2284.078604]
      [ 2284.078604]        CPU0                    CPU1
      [ 2284.078604]        ----                    ----
      [ 2284.078604]   lock(&(&tn->node_list_lock)->rlock);
      [ 2284.078604]                                lock((&n->timer)#2);
      [ 2284.078604]                                lock(&(&tn->node_list_lock)->rlock);
      [ 2284.078604]   lock((&n->timer)#2);
      [ 2284.078604]
      [ 2284.078604]  *** DEADLOCK ***
      [ 2284.078604]
      [ 2284.078604] 3 locks held by rmmod/254:
      [ 2284.078604]  #0: 000000003368be9b (pernet_ops_rwsem){+.+.}, at: unregister_pernet_subsys+0x15/0x30
      [ 2284.078604]  #1: 0000000046ed9c86 (rtnl_mutex){+.+.}, at: tipc_net_stop+0x144/0x170 [tipc]
      [ 2284.078604]  #2: 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x19
      [...}
      
      The reason is that the node timer handler sometimes needs to delete a
      node which has been disconnected for too long. To do this, it grabs
      the lock 'node_list_lock', which may at the same time be held by the
      generic node cleanup function, tipc_node_stop(), during module removal.
      Since the latter is calling del_timer_sync() inside the same lock, we
      have a potential deadlock.
      
      We fix this letting the timer cleanup function use spin_trylock()
      instead of just spin_lock(), and when it fails to grab the lock it
      just returns so that the timer handler can terminate its execution.
      This is safe to do, since tipc_node_stop() anyway is about to
      delete both the timer and the node instance.
      
      Fixes: 6a939f36 ("tipc: Auto removal of peer down node instance")
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ec835f89
    • N
      netns: enable to dump full nsid translation table · 288f06a0
      Nicolas Dichtel 提交于
      Like the previous patch, the goal is to ease to convert nsids from one
      netns to another netns.
      A new attribute (NETNSA_CURRENT_NSID) is added to the kernel answer when
      NETNSA_TARGET_NSID is provided, thus the user can easily convert nsids.
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      288f06a0
    • N
      netns: enable to specify a nsid for a get request · 3a4f68bf
      Nicolas Dichtel 提交于
      Combined with NETNSA_TARGET_NSID, it enables to "translate" a nsid from one
      netns to a nsid of another netns.
      This is useful when using NETLINK_F_LISTEN_ALL_NSID because it helps the
      user to interpret a nsid received from an other netns.
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a4f68bf
    • N
      netns: add support of NETNSA_TARGET_NSID · cff478b9
      Nicolas Dichtel 提交于
      Like it was done for link and address, add the ability to perform get/dump
      in another netns by specifying a target nsid attribute.
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cff478b9
    • N
      netns: introduce 'struct net_fill_args' · a0732ad1
      Nicolas Dichtel 提交于
      This is a preparatory work. To avoid having to much arguments for the
      function rtnl_net_fill(), a new structure is defined.
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a0732ad1
    • N
      netns: remove net arg from rtnl_net_fill() · 74be39eb
      Nicolas Dichtel 提交于
      This argument is not used anymore.
      
      Fixes: cab3c8ec ("netns: always provide the id to rtnl_net_fill()")
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74be39eb
    • X
      sctp: increase sk_wmem_alloc when head->truesize is increased · 0d32f177
      Xin Long 提交于
      I changed to count sk_wmem_alloc by skb truesize instead of 1 to
      fix the sk_wmem_alloc leak caused by later truesize's change in
      xfrm in Commit 02968ccf ("sctp: count sk_wmem_alloc by skb
      truesize in sctp_packet_transmit").
      
      But I should have also increased sk_wmem_alloc when head->truesize
      is increased in sctp_packet_gso_append() as xfrm does. Otherwise,
      sctp gso packet will cause sk_wmem_alloc underflow.
      
      Fixes: 02968ccf ("sctp: count sk_wmem_alloc by skb truesize in sctp_packet_transmit")
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d32f177
    • N
      net: bridge: export supported boolopts · 1ed1ccb9
      Nikolay Aleksandrov 提交于
      Now that we have at least one bool option, we can export all of the
      supported bool options via optmask when dumping them.
      
      v2: new patch
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ed1ccb9
    • N
      net: bridge: add no_linklocal_learn bool option · 70e4272b
      Nikolay Aleksandrov 提交于
      Use the new boolopt API to add an option which disables learning from
      link-local packets. The default is kept as before and learning is
      enabled. This is a simple map from a boolopt bit to a bridge private
      flag that is tested before learning.
      
      v2: pass NULL for extack via sysfs
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70e4272b
    • N
      net: bridge: add support for user-controlled bool options · a428afe8
      Nikolay Aleksandrov 提交于
      We have been adding many new bridge options, a big number of which are
      boolean but still take up netlink attribute ids and waste space in the skb.
      Recently we discussed learning from link-local packets[1] and decided
      yet another new boolean option will be needed, thus introducing this API
      to save some bridge nl space.
      The API supports changing the value of multiple boolean options at once
      via the br_boolopt_multi struct which has an optmask (which options to
      set, bit per opt) and optval (options' new values). Future boolean
      options will only be added to the br_boolopt_id enum and then will have
      to be handled in br_boolopt_toggle/get. The API will automatically
      add the ability to change and export them via netlink, sysfs can use the
      single boolopt function versions to do the same. The behaviour with
      failing/succeeding is the same as with normal netlink option changing.
      
      If an option requires mapping to internal kernel flag or needs special
      configuration to be enabled then it should be handled in
      br_boolopt_toggle. It should also be able to retrieve an option's current
      state via br_boolopt_get.
      
      v2: WARN_ON() on unsupported option as that shouldn't be possible and
          also will help catch people who add new options without handling
          them for both set and get. Pass down extack so if an option desires
          it could set it on error and be more user-friendly.
      
      [1] https://www.spinics.net/lists/netdev/msg532698.htmlSigned-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a428afe8
  5. 27 11月, 2018 5 次提交
    • T
      netfilter: nf_conncount: remove wrong condition check routine · 53ca0f2f
      Taehee Yoo 提交于
      All lists that reach the tree_nodes_free() function have both zero
      counter and true dead flag. The reason for this is that lists to be
      release are selected by nf_conncount_gc_list() which already decrements
      the list counter and sets on the dead flag. Therefore, this if statement
      in tree_nodes_free() is unnecessary and wrong.
      
      Fixes: 31568ec0 ("netfilter: nf_conncount: fix list_del corruption in conn_free")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      53ca0f2f
    • T
      netfilter: nat: fix double register in masquerade modules · 095faf45
      Taehee Yoo 提交于
      There is a reference counter to ensure that masquerade modules register
      notifiers only once. However, the existing reference counter approach is
      not safe, test commands are:
      
         while :
         do
         	   modprobe ip6t_MASQUERADE &
      	   modprobe nft_masq_ipv6 &
      	   modprobe -rv ip6t_MASQUERADE &
      	   modprobe -rv nft_masq_ipv6 &
         done
      
      numbers below represent the reference counter.
      --------------------------------------------------------
      CPU0        CPU1        CPU2        CPU3        CPU4
      [insmod]    [insmod]    [rmmod]     [rmmod]     [insmod]
      --------------------------------------------------------
      0->1
      register    1->2
                  returns     2->1
      			returns     1->0
                                                      0->1
                                                      register <--
                                          unregister
      --------------------------------------------------------
      
      The unregistation of CPU3 should be processed before the
      registration of CPU4.
      
      In order to fix this, use a mutex instead of reference counter.
      
      splat looks like:
      [  323.869557] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:1381]
      [  323.869574] Modules linked in: nf_tables(+) nf_nat_ipv6(-) nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 n]
      [  323.869574] irq event stamp: 194074
      [  323.898930] hardirqs last  enabled at (194073): [<ffffffff90004a0d>] trace_hardirqs_on_thunk+0x1a/0x1c
      [  323.898930] hardirqs last disabled at (194074): [<ffffffff90004a29>] trace_hardirqs_off_thunk+0x1a/0x1c
      [  323.898930] softirqs last  enabled at (182132): [<ffffffff922006ec>] __do_softirq+0x6ec/0xa3b
      [  323.898930] softirqs last disabled at (182109): [<ffffffff90193426>] irq_exit+0x1a6/0x1e0
      [  323.898930] CPU: 0 PID: 1381 Comm: modprobe Not tainted 4.20.0-rc2+ #27
      [  323.898930] RIP: 0010:raw_notifier_chain_register+0xea/0x240
      [  323.898930] Code: 3c 03 0f 8e f2 00 00 00 44 3b 6b 10 7f 4d 49 bc 00 00 00 00 00 fc ff df eb 22 48 8d 7b 10 488
      [  323.898930] RSP: 0018:ffff888101597218 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
      [  323.898930] RAX: 0000000000000000 RBX: ffffffffc04361c0 RCX: 0000000000000000
      [  323.898930] RDX: 1ffffffff26132ae RSI: ffffffffc04aa3c0 RDI: ffffffffc04361d0
      [  323.898930] RBP: ffffffffc04361c8 R08: 0000000000000000 R09: 0000000000000001
      [  323.898930] R10: ffff8881015972b0 R11: fffffbfff26132c4 R12: dffffc0000000000
      [  323.898930] R13: 0000000000000000 R14: 1ffff110202b2e44 R15: ffffffffc04aa3c0
      [  323.898930] FS:  00007f813ed41540(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
      [  323.898930] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  323.898930] CR2: 0000559bf2c9f120 CR3: 000000010bc80000 CR4: 00000000001006f0
      [  323.898930] Call Trace:
      [  323.898930]  ? atomic_notifier_chain_register+0x2d0/0x2d0
      [  323.898930]  ? down_read+0x150/0x150
      [  323.898930]  ? sched_clock_cpu+0x126/0x170
      [  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  323.898930]  register_netdevice_notifier+0xbb/0x790
      [  323.898930]  ? __dev_close_many+0x2d0/0x2d0
      [  323.898930]  ? __mutex_unlock_slowpath+0x17f/0x740
      [  323.898930]  ? wait_for_completion+0x710/0x710
      [  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  323.898930]  ? up_write+0x6c/0x210
      [  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  324.127073]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  324.127073]  nft_chain_filter_init+0x1e/0xe8a [nf_tables]
      [  324.127073]  nf_tables_module_init+0x37/0x92 [nf_tables]
      [ ... ]
      
      Fixes: 8dd33cc9 ("netfilter: nf_nat: generalize IPv4 masquerading support for nf_tables")
      Fixes: be6b635c ("netfilter: nf_nat: generalize IPv6 masquerading support for nf_tables")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      095faf45
    • T
      netfilter: add missing error handling code for register functions · 584eab29
      Taehee Yoo 提交于
      register_{netdevice/inetaddr/inet6addr}_notifier may return an error
      value, this patch adds the code to handle these error paths.
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      584eab29
    • A
      netfilter: ipv6: Preserve link scope traffic original oif · 508b0904
      Alin Nastac 提交于
      When ip6_route_me_harder is invoked, it resets outgoing interface of:
        - link-local scoped packets sent by neighbor discovery
        - multicast packets sent by MLD host
        - multicast packets send by MLD proxy daemon that sets outgoing
          interface through IPV6_PKTINFO ipi6_ifindex
      
      Link-local and multicast packets must keep their original oif after
      ip6_route_me_harder is called.
      Signed-off-by: NAlin Nastac <alin.nastac@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      508b0904
    • D
      bpf: Avoid unnecessary instruction in convert_bpf_ld_abs() · d8f3e978
      David Miller 提交于
      'offset' is constant and if it is zero, no need to subtract it
      from BPF_REG_TMP.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d8f3e978
  6. 26 11月, 2018 4 次提交
    • F
      netfilter: nfnetlink_cttimeout: fetch timeouts for udplite and gre, too · 89259088
      Florian Westphal 提交于
      syzbot was able to trigger the WARN in cttimeout_default_get() by
      passing UDPLITE as l4protocol.  Alias UDPLITE to UDP, both use
      same timeout values.
      
      Furthermore, also fetch GRE timeouts.  GRE is a bit more complicated,
      as it still can be a module and its netns_proto_gre struct layout isn't
      visible outside of the gre module. Can't move timeouts around, it
      appears conntrack sysctl unregister assumes net_generic() returns
      nf_proto_net, so we get crash. Expose layout of netns_proto_gre instead.
      
      A followup nf-next patch could make gre tracker be built-in as well
      if needed, its not that large.
      
      Last, make the WARN() mention the missing protocol value in case
      anything else is missing.
      
      Reported-by: syzbot+2fae8fa157dd92618cae@syzkaller.appspotmail.com
      Fixes: 8866df92 ("netfilter: nfnetlink_cttimeout: pass default timeout policy to obj_to_nlattr")
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      89259088
    • X
      ipvs: call ip_vs_dst_notifier earlier than ipv6_dev_notf · 2a31e4bd
      Xin Long 提交于
      ip_vs_dst_event is supposed to clean up all dst used in ipvs'
      destinations when a net dev is going down. But it works only
      when the dst's dev is the same as the dev from the event.
      
      Now with the same priority but late registration,
      ip_vs_dst_notifier is always called later than ipv6_dev_notf
      where the dst's dev is set to lo for NETDEV_DOWN event.
      
      As the dst's dev lo is not the same as the dev from the event
      in ip_vs_dst_event, ip_vs_dst_notifier doesn't actually work.
      Also as these dst have to wait for dest_trash_timer to clean
      them up. It would cause some non-permanent kernel warnings:
      
        unregister_netdevice: waiting for br0 to become free. Usage count = 3
      
      To fix it, call ip_vs_dst_notifier earlier than ipv6_dev_notf
      by increasing its priority to ADDRCONF_NOTIFY_PRIORITY + 5.
      
      Note that for ipv4 route fib_netdev_notifier doesn't set dst's
      dev to lo in NETDEV_DOWN event, so this fix is only needed when
      IP_VS_IPV6 is defined.
      
      Fixes: 7a4f0761 ("IPVS: init and cleanup restructuring")
      Reported-by: NLi Shuang <shuali@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NJulian Anastasov <ja@ssi.bg>
      Acked-by: NSimon Horman <horms@verge.net.au>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      2a31e4bd
    • E
      net: remove unsafe skb_insert() · 4bffc669
      Eric Dumazet 提交于
      I do not see how one can effectively use skb_insert() without holding
      some kind of lock. Otherwise other cpus could have changed the list
      right before we have a chance of acquiring list->lock.
      
      Only existing user is in drivers/infiniband/hw/nes/nes_mgt.c and this
      one probably meant to use __skb_insert() since it appears nesqp->pau_list
      is protected by nesqp->pau_lock. This looks like nesqp->pau_lock
      could be removed, since nesqp->pau_list.lock could be used instead.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Faisal Latif <faisal.latif@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: linux-rdma <linux-rdma@vger.kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4bffc669
    • C
      net: bridge: remove redundant checks for null p->dev and p->br · 40b1c813
      Colin Ian King 提交于
      A recent change added a null check on p->dev after p->dev was being
      dereferenced by the ns_capable check on p->dev. It turns out that
      neither the p->dev and p->br null checks are necessary, and can be
      removed, which cleans up a static analyis warning.
      
      As Nikolay Aleksandrov noted, these checks can be removed because:
      
      "My reasoning of why it shouldn't be possible:
      - On port add new_nbp() sets both p->dev and p->br before creating
        kobj/sysfs
      
      - On port del (trickier) del_nbp() calls kobject_del() before call_rcu()
        to destroy the port which in turn calls sysfs_remove_dir() which uses
        kernfs_remove() which deactivates (shouldn't be able to open new
        files) and calls kernfs_drain() to drain current open/mmaped files in
        the respective dir before continuing, thus making it impossible to
        open a bridge port sysfs file with p->dev and p->br equal to NULL.
      
      So I think it's safe to remove those checks altogether. It'd be nice to
      get a second look over my reasoning as I might be missing something in
      sysfs/kernfs call path."
      
      Thanks to Nikolay Aleksandrov's suggestion to remove the check and
      David Miller for sanity checking this.
      
      Detected by CoverityScan, CID#751490 ("Dereference before null check")
      
      Fixes: a5f3ea54 ("net: bridge: add support for raw sysfs port options")
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Acked-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      40b1c813
  7. 25 11月, 2018 2 次提交
    • W
      net: always initialize pagedlen · aba36930
      Willem de Bruijn 提交于
      In ip packet generation, pagedlen is initialized for each skb at the
      start of the loop in __ip(6)_append_data, before label alloc_new_skb.
      
      Depending on compiler options, code can be generated that jumps to
      this label, triggering use of an an uninitialized variable.
      
      In practice, at -O2, the generated code moves the initialization below
      the label. But the code should not rely on that for correctness.
      
      Fixes: 15e36f5b ("udp: paged allocation with gso")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aba36930
    • E
      tcp: address problems caused by EDT misshaps · 9efdda4e
      Eric Dumazet 提交于
      When a qdisc setup including pacing FQ is dismantled and recreated,
      some TCP packets are sent earlier than instructed by TCP stack.
      
      TCP can be fooled when ACK comes back, because the following
      operation can return a negative value.
      
          tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;
      
      Some paths in TCP stack were not dealing properly with this,
      this patch addresses four of them.
      
      Fixes: ab408b6d ("tcp: switch tcp and sch_fq to new earliest departure time model")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9efdda4e
  8. 24 11月, 2018 8 次提交
    • P
      rocker, dsa, ethsw: Don't filter VLAN events on bridge itself · ab4a1686
      Petr Machata 提交于
      Due to an explicit check in rocker_world_port_obj_vlan_add(),
      dsa_slave_switchdev_event() resp. port_switchdev_event(), VLAN objects
      that are added to a device that is not a front-panel port device are
      ignored. Therefore this check is immaterial.
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab4a1686
    • P
      switchdev: Replace port obj add/del SDO with a notification · d17d9f5e
      Petr Machata 提交于
      Drop switchdev_ops.switchdev_port_obj_add and _del. Drop the uses of
      this field from all clients, which were migrated to use switchdev
      notification in the previous patches.
      
      Add a new function switchdev_port_obj_notify() that sends the switchdev
      notifications SWITCHDEV_PORT_OBJ_ADD and _DEL.
      
      Update switchdev_port_obj_del_now() to dispatch to this new function.
      Drop __switchdev_port_obj_add() and update switchdev_port_obj_add()
      likewise.
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d17d9f5e
    • P
      switchdev: Add helpers to aid traversal through lower devices · f30f0601
      Petr Machata 提交于
      After the transition from switchdev operations to notifier chain (which
      will take place in following patches), the onus is on the driver to find
      its own devices below possible layer of LAG or other uppers.
      
      The logic to do so is fairly repetitive: each driver is looking for its
      own devices among the lowers of the notified device. For those that it
      finds, it calls a handler. To indicate that the event was handled,
      struct switchdev_notifier_port_obj_info.handled is set. The differences
      lie only in what constitutes an "own" device and what handler to call.
      
      Therefore abstract this logic into two helpers,
      switchdev_handle_port_obj_add() and switchdev_handle_port_obj_del(). If
      a driver only supports physical ports under a bridge device, it will
      simply avoid this layer of indirection.
      
      One area where this helper diverges from the current switchdev behavior
      is the case of mixed lowers, some of which are switchdev ports and some
      of which are not. Previously, such scenario would fail with -EOPNOTSUPP.
      The helper could do that for lowers for which the passed-in predicate
      doesn't hold. That would however break the case that switchdev ports
      from several different drivers are stashed under one master, a scenario
      that switchdev currently happily supports. Therefore tolerate any and
      all unknown netdevices, whether they are backed by a switchdev driver
      or not.
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f30f0601
    • P
      net: dsa: slave: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL · 2b239f67
      Petr Machata 提交于
      Following patches will change the way of distributing port object
      changes from a switchdev operation to a switchdev notifier. The
      switchdev code currently recursively descends through layers of lower
      devices, eventually calling the op on a front-panel port device. The
      notifier will instead be sent referencing the bridge port device, which
      may be a stacking device that's one of front-panel ports uppers, or a
      completely unrelated device.
      
      DSA currently doesn't support any other uppers than bridge.
      SWITCHDEV_OBJ_ID_HOST_MDB and _PORT_MDB objects are always notified on
      the bridge port device. Thus the only case that a stacked device could
      be validly referenced by port object notifications are bridge
      notifications for VLAN objects added to the bridge itself. But the
      driver explicitly rejects such notifications in dsa_port_vlan_add(). It
      is therefore safe to assume that the only interesting case is that the
      notification is on a front-panel port netdevice. Therefore keep the
      filtering by dsa_slave_dev_check() in place.
      
      To handle SWITCHDEV_PORT_OBJ_ADD and _DEL, subscribe to the blocking
      notifier chain. Dispatch to rocker_port_obj_add() resp. _del() to
      maintain the behavior that the switchdev operation based code currently
      has.
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b239f67
    • P
      switchdev: Add a blocking notifier chain · a93e3b17
      Petr Machata 提交于
      In general one can't assume that a switchdev notifier is called in a
      non-atomic context, and correspondingly, the switchdev notifier chain is
      an atomic one.
      
      However, port object addition and deletion messages are delivered from a
      process context. Even the MDB addition messages, whose delivery is
      scheduled from atomic context, are queued and the delivery itself takes
      place in blocking context. For VLAN messages in particular, keeping the
      blocking nature is important for error reporting.
      
      Therefore introduce a blocking notifier chain and related service
      functions to distribute the notifications for which a blocking context
      can be assumed.
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a93e3b17
    • K
      net/smc: unregister rkeys of unused buffer · c7674c00
      Karsten Graul 提交于
      When an rmb is no longer in use by a connection, unregister its rkey at
      the remote peer with an LLC DELETE RKEY message. With this change,
      unused buffers held in the buffer pool are no longer registered at the
      remote peer. They are registered before the buffer is actually used and
      unregistered when they are no longer used by a connection.
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7674c00
    • K
      net/smc: add infrastructure to send delete rkey messages · 60e03c62
      Karsten Graul 提交于
      Add the infrastructure to send LLC messages of type DELETE RKEY to
      unregister a shared memory region at the peer.
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      60e03c62
    • K
      net/smc: avoid a delay by waiting for nothing · 4600cfc3
      Karsten Graul 提交于
      When a send failed then don't start to wait for a response in
      smc_llc_do_confirm_rkey.
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4600cfc3