1. 16 4月, 2019 1 次提交
    • G
      net: kernel hookers service for toa module · aff365b6
      George Zhang 提交于
      LVS fullnat will replace network traffic's source ip with its local ip,
      and thus the backend servers cannot obtain the real client ip.
      
      To solve this, LVS has introduced the tcp option address (TOA) to store
      the essential ip address information in the last tcp ack packet of the
      3-way handshake, and the backend servers need to retrieve it from the
      packet header.
      
      In this patch, we have introduced the sk_toa_data member in the sock
      structure to hold the TOA information. There used to be an in-tree
      module for TOA managing, whereas it has now been maintained as an
      standalone module.
      
      In this case, the toa module should register its hook function(s) using
      the provided interfaces in the hookers module.
      
      TOA in sock structure:
      
      	__be32 sk_toa_data[16];
      
      The hookers module only provides the sk_toa_data placeholder, and the
      toa module can use this variable through the layout it needs.
      
      Hook interfaces:
      
      The hookers module replaces the kernel's syn_recv_sock and getname
      handler with a stub that chains the toa module's hook function(s) to the
      original handling function. The hookers module allows hook functions to
      be installed and uninstalled in any order.
      
      toa module:
      
      The external toa module will be provided in separate RPM package.
      
      [xuyu@linux.alibaba.com: amend commit log]
      Signed-off-by: NGeorge Zhang <georgezhang@linux.alibaba.com>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
      aff365b6
  2. 06 4月, 2019 4 次提交
    • F
      netfilter: physdev: relax br_netfilter dependency · 2e6bcc32
      Florian Westphal 提交于
      [ Upstream commit 8e2f311a68494a6677c1724bdcb10bada21af37c ]
      
      Following command:
        iptables -D FORWARD -m physdev ...
      causes connectivity loss in some setups.
      
      Reason is that iptables userspace will probe kernel for the module revision
      of the physdev patch, and physdev has an artificial dependency on
      br_netfilter (xt_physdev use makes no sense unless a br_netfilter module
      is loaded).
      
      This causes the "phydev" module to be loaded, which in turn enables the
      "call-iptables" infrastructure.
      
      bridged packets might then get dropped by the iptables ruleset.
      
      The better fix would be to change the "call-iptables" defaults to 0 and
      enforce explicit setting to 1, but that breaks backwards compatibility.
      
      This does the next best thing: add a request_module call to checkentry.
      This was a stray '-D ... -m physdev' won't activate br_netfilter
      anymore.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      2e6bcc32
    • C
      netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm · 3a1ce979
      Chieh-Min Wang 提交于
      [ Upstream commit 13f5251fd17088170c18844534682d9cab5ff5aa ]
      
      For bridge(br_flood) or broadcast/multicast packets, they could clone
      skb with unconfirmed conntrack which break the rule that unconfirmed
      skb->_nfct is never shared.  With nfqueue running on my system, the race
      can be easily reproduced with following warning calltrace:
      
      [13257.707525] CPU: 0 PID: 12132 Comm: main Tainted: P        W       4.4.60 #7744
      [13257.707568] Hardware name: Qualcomm (Flattened Device Tree)
      [13257.714700] [<c021f6dc>] (unwind_backtrace) from [<c021bce8>] (show_stack+0x10/0x14)
      [13257.720253] [<c021bce8>] (show_stack) from [<c0449e10>] (dump_stack+0x94/0xa8)
      [13257.728240] [<c0449e10>] (dump_stack) from [<c022a7e0>] (warn_slowpath_common+0x94/0xb0)
      [13257.735268] [<c022a7e0>] (warn_slowpath_common) from [<c022a898>] (warn_slowpath_null+0x1c/0x24)
      [13257.743519] [<c022a898>] (warn_slowpath_null) from [<c06ee450>] (__nf_conntrack_confirm+0xa8/0x618)
      [13257.752284] [<c06ee450>] (__nf_conntrack_confirm) from [<c0772670>] (ipv4_confirm+0xb8/0xfc)
      [13257.761049] [<c0772670>] (ipv4_confirm) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
      [13257.769725] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
      [13257.777108] [<c06e7af0>] (nf_hook_slow) from [<c07f20b4>] (br_nf_post_routing+0x274/0x31c)
      [13257.784486] [<c07f20b4>] (br_nf_post_routing) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
      [13257.792556] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
      [13257.800458] [<c06e7af0>] (nf_hook_slow) from [<c07e5580>] (br_forward_finish+0x94/0xa4)
      [13257.808010] [<c07e5580>] (br_forward_finish) from [<c07f22ac>] (br_nf_forward_finish+0x150/0x1ac)
      [13257.815736] [<c07f22ac>] (br_nf_forward_finish) from [<c06e8df0>] (nf_reinject+0x108/0x170)
      [13257.824762] [<c06e8df0>] (nf_reinject) from [<c06ea854>] (nfqnl_recv_verdict+0x3d8/0x420)
      [13257.832924] [<c06ea854>] (nfqnl_recv_verdict) from [<c06e940c>] (nfnetlink_rcv_msg+0x158/0x248)
      [13257.841256] [<c06e940c>] (nfnetlink_rcv_msg) from [<c06e5564>] (netlink_rcv_skb+0x54/0xb0)
      [13257.849762] [<c06e5564>] (netlink_rcv_skb) from [<c06e4ec8>] (netlink_unicast+0x148/0x23c)
      [13257.858093] [<c06e4ec8>] (netlink_unicast) from [<c06e5364>] (netlink_sendmsg+0x2ec/0x368)
      [13257.866348] [<c06e5364>] (netlink_sendmsg) from [<c069fb8c>] (sock_sendmsg+0x34/0x44)
      [13257.874590] [<c069fb8c>] (sock_sendmsg) from [<c06a03dc>] (___sys_sendmsg+0x1ec/0x200)
      [13257.882489] [<c06a03dc>] (___sys_sendmsg) from [<c06a11c8>] (__sys_sendmsg+0x3c/0x64)
      [13257.890300] [<c06a11c8>] (__sys_sendmsg) from [<c0209b40>] (ret_fast_syscall+0x0/0x34)
      
      The original code just triggered the warning but do nothing. It will
      caused the shared conntrack moves to the dying list and the packet be
      droppped (nf_ct_resolve_clash returns NF_DROP for dying conntrack).
      
      - Reproduce steps:
      
      +----------------------------+
      |          br0(bridge)       |
      |                            |
      +-+---------+---------+------+
        | eth0|   | eth1|   | eth2|
        |     |   |     |   |     |
        +--+--+   +--+--+   +---+-+
           |         |          |
           |         |          |
        +--+-+     +-+--+    +--+-+
        | PC1|     | PC2|    | PC3|
        +----+     +----+    +----+
      
      iptables -A FORWARD -m mark --mark 0x1000000/0x1000000 -j NFQUEUE --queue-num 100 --queue-bypass
      
      ps: Our nfq userspace program will set mark on packets whose connection
      has already been processed.
      
      PC1 sends broadcast packets simulated by hping3:
      
      hping3 --rand-source --udp 192.168.1.255 -i u100
      
      - Broadcast racing flow chart is as follow:
      
      br_handle_frame
        BR_HOOK(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, br_handle_frame_finish)
        // skb->_nfct (unconfirmed conntrack) is constructed at PRE_ROUTING stage
        br_handle_frame_finish
          // check if this packet is broadcast
          br_flood_forward
            br_flood
              list_for_each_entry_rcu(p, &br->port_list, list) // iterate through each port
                maybe_deliver
                  deliver_clone
                    skb = skb_clone(skb)
                    __br_forward
                      BR_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD,...)
                      // queue in our nfq and received by our userspace program
                      // goto __nf_conntrack_confirm with process context on CPU 1
          br_pass_frame_up
            BR_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN,...)
            // goto __nf_conntrack_confirm with softirq context on CPU 0
      
      Because conntrack confirm can happen at both INPUT and POSTROUTING
      stage.  So with NFQUEUE running, skb->_nfct with the same unconfirmed
      conntrack could race on different core.
      
      This patch fixes a repeating kernel splat, now it is only displayed
      once.
      Signed-off-by: NChieh-Min Wang <chiehminw@synology.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      3a1ce979
    • F
      netfilter: conntrack: tcp: only close if RST matches exact sequence · ca66f667
      Florian Westphal 提交于
      [ Upstream commit be0502a3f2e94211a8809a09ecbc3a017189b8fb ]
      
      TCP resets cause instant transition from established to closed state
      provided the reset is in-window.  Endpoints that implement RFC 5961
      require resets to match the next expected sequence number.
      RST segments that are in-window (but that do not match RCV.NXT) are
      ignored, and a "challenge ACK" is sent back.
      
      Main problem for conntrack is that its a middlebox, i.e.  whereas an end
      host might have ACK'd SEQ (and would thus accept an RST with this
      sequence number), conntrack might not have seen this ACK (yet).
      
      Therefore we can't simply flag RSTs with non-exact match as invalid.
      
      This updates RST processing as follows:
      
      1. If the connection is in a state other than ESTABLISHED, nothing is
         changed, RST is subject to normal in-window check.
      
      2. If the RSTs sequence number either matches exactly RCV.NXT,
         connection state moves to CLOSE.
      
      3. The same applies if the RST sequence number aligns with a previous
         packet in the same direction.
      
      In all other cases, the connection remains in ESTABLISHED state.
      If the normal-in-window check passes, the timeout will be lowered
      to that of CLOSE.
      
      If the peer sends a challenge ack, connection timeout will be reset.
      
      If the challenge ACK triggers another RST (RST was valid after all),
      this 2nd RST will match expected sequence and conntrack state changes to
      CLOSE.
      
      If no challenge ACK is received, the connection will time out after
      CLOSE seconds (10 seconds by default), just like without this patch.
      
      Packetdrill test case:
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 win 64240 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      
      // Receive a segment.
      0.210 < P. 1:1001(1000) ack 1 win 46
      0.210 > . 1:1(0) ack 1001
      
      // Application writes 1000 bytes.
      0.250 write(4, ..., 1000) = 1000
      0.250 > P. 1:1001(1000) ack 1001
      
      // First reset, old sequence. Conntrack (correctly) considers this
      // invalid due to failed window validation (regardless of this patch).
      0.260 < R  2:2(0) ack 1001 win 260
      
      // 2nd reset, but too far ahead sequence.  Same: correctly handled
      // as invalid.
      0.270 < R 99990001:99990001(0) ack 1001 win 260
      
      // in-window, but not exact sequence.
      // Current Linux kernels might reply with a challenge ack, and do not
      // remove connection.
      // Without this patch, conntrack state moves to CLOSE.
      // With patch, timeout is lowered like CLOSE, but connection stays
      // in ESTABLISHED state.
      0.280 < R 1010:1010(0) ack 1001 win 260
      
      // Expect challenge ACK
      0.281 > . 1001:1001(0) ack 1001 win 501
      
      // With or without this patch, RST will cause connection
      // to move to CLOSE (sequence number matches)
      // 0.282 < R 1001:1001(0) ack 1001 win 260
      
      // ACK
      0.300 < . 1001:1001(0) ack 1001 win 257
      
      // more data could be exchanged here, connection
      // is still established
      
      // Client closes the connection.
      0.610 < F. 1001:1001(0) ack 1001 win 260
      0.650 > . 1001:1001(0) ack 1002
      
      // Close the connection without reading outstanding data
      0.700 close(4) = 0
      
      // so one more reset.  Will be deemed acceptable with patch as well:
      // connection is already closing.
      0.701 > R. 1001:1001(0) ack 1002 win 501
      // End packetdrill test case.
      
      With patch, this generates following conntrack events:
         [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [UNREPLIED]
      [UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80
      [UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
      [UPDATE] 120 FIN_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
      [UPDATE] 60 CLOSE_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
      [UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
      
      Without patch, first RST moves connection to close, whereas socket state
      does not change until FIN is received.
         [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [UNREPLIED]
      [UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80
      [UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]
      [UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]
      
      Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ca66f667
    • L
      netfilter: nf_tables: check the result of dereferencing base_chain->stats · 709aaa09
      Li RongQing 提交于
      [ Upstream commit a9f5e78c403d2d62ade4f4c85040efc85f4049b8 ]
      
      Check the result of dereferencing base_chain->stats, instead of result
      of this_cpu_ptr with NULL.
      
      base_chain->stats maybe be changed to NULL when a chain is updated and a
      new NULL counter can be attached.
      
      And we do not need to check returning of this_cpu_ptr since
      base_chain->stats is from percpu allocator if it is non-NULL,
      this_cpu_ptr returns a valid value.
      
      And fix two sparse error by replacing rcu_access_pointer and
      rcu_dereference with READ_ONCE under rcu_read_lock.
      
      Thanks for Eric's help to finish this patch.
      
      Fixes: 00924094 ("netfilter: nf_tables: don't assume chain stats are set when jumplabel is set")
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NZhang Yu <zhangyu31@baidu.com>
      Signed-off-by: NLi RongQing <lirongqing@baidu.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      709aaa09
  3. 03 4月, 2019 17 次提交
    • J
      net: sched: fix cleanup NULL pointer exception in act_mirr · a491de90
      John Hurley 提交于
      [ Upstream commit 064c5d6881e897077639e04973de26440ee205e6 ]
      
      A new mirred action is created by the tcf_mirred_init function. This
      contains a list head struct which is inserted into a global list on
      successful creation of a new action. However, after a creation, it is
      still possible to error out and call the tcf_idr_release function. This,
      in turn, calls the act_mirr cleanup function via __tcf_idr_release and
      __tcf_action_put. This cleanup function tries to delete the list entry
      which is as yet uninitialised, leading to a NULL pointer exception.
      
      Fix this by initialising the list entry on creation of a new action.
      
      Bug report:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      PGD 8000000840c73067 P4D 8000000840c73067 PUD 858dcc067 PMD 0
      Oops: 0002 [#1] SMP PTI
      CPU: 32 PID: 5636 Comm: handler194 Tainted: G           OE     5.0.0+ #186
      Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.3.6 06/03/2015
      RIP: 0010:tcf_mirred_release+0x42/0xa7 [act_mirred]
      Code: f0 90 39 c0 e8 52 04 57 c8 48 c7 c7 b8 80 39 c0 e8 94 fa d4 c7 48 8b 93 d0 00 00 00 48 8b 83 d8 00 00 00 48 c7 c7 f0 90 39 c0 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 83 d0 00
      RSP: 0018:ffffac4aa059f688 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffff9dcd1b214d00 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff9dcd1fa165f8 RDI: ffffffffc03990f0
      RBP: ffff9dccf9c7af80 R08: 0000000000000a3b R09: 0000000000000000
      R10: ffff9dccfa11f420 R11: 0000000000000000 R12: 0000000000000001
      R13: ffff9dcd16b433c0 R14: ffff9dcd1b214d80 R15: 0000000000000000
      FS:  00007f441bfff700(0000) GS:ffff9dcd1fa00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 0000000839e64004 CR4: 00000000001606e0
      Call Trace:
      tcf_action_cleanup+0x59/0xca
      __tcf_action_put+0x54/0x6b
      __tcf_idr_release.cold.33+0x9/0x12
      tcf_mirred_init.cold.20+0x22e/0x3b0 [act_mirred]
      tcf_action_init_1+0x3d0/0x4c0
      tcf_action_init+0x9c/0x130
      tcf_exts_validate+0xab/0xc0
      fl_change+0x1ca/0x982 [cls_flower]
      tc_new_tfilter+0x647/0x8d0
      ? load_balance+0x14b/0x9e0
      rtnetlink_rcv_msg+0xe3/0x370
      ? __switch_to_asm+0x40/0x70
      ? __switch_to_asm+0x34/0x70
      ? _cond_resched+0x15/0x30
      ? __kmalloc_node_track_caller+0x1d4/0x2b0
      ? rtnl_calcit.isra.31+0xf0/0xf0
      netlink_rcv_skb+0x49/0x110
      netlink_unicast+0x16f/0x210
      netlink_sendmsg+0x1df/0x390
      sock_sendmsg+0x36/0x40
      ___sys_sendmsg+0x27b/0x2c0
      ? futex_wake+0x80/0x140
      ? do_futex+0x2b9/0xac0
      ? ep_scan_ready_list.constprop.22+0x1f2/0x210
      ? ep_poll+0x7a/0x430
      __sys_sendmsg+0x47/0x80
      do_syscall_64+0x55/0x100
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 4e232818 ("net: sched: act_mirred: remove dependency on rtnl lock")
      Signed-off-by: NJohn Hurley <john.hurley@netronome.com>
      Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a491de90
    • H
      ila: Fix rhashtable walker list corruption · 7254ad09
      Herbert Xu 提交于
      [ Upstream commit b5f9bd15b88563b55a99ed588416881367a0ce5f ]
      
      ila_xlat_nl_cmd_flush uses rhashtable walkers allocated from the
      stack but it never frees them.  This corrupts the walker list of
      the hash table.
      
      This patch fixes it.
      
      Reported-by: syzbot+dae72a112334aa65a159@syzkaller.appspotmail.com
      Fixes: b6e71bde ("ila: Flush netlink command to clear xlat...")
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7254ad09
    • E
      tipc: fix cancellation of topology subscriptions · 52a7505c
      Erik Hugne 提交于
      [ Upstream commit 33872d79f5d1cbedaaab79669cc38f16097a9450 ]
      
      When cancelling a subscription, we have to clear the cancel bit in the
      request before iterating over any established subscriptions with memcmp.
      Otherwise no subscription will ever be found, and it will not be
      possible to explicitly unsubscribe individual subscriptions.
      
      Fixes: 8985ecc7 ("tipc: simplify endianness handling in topology subscriber")
      Signed-off-by: NErik Hugne <erik.hugne@gmail.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      52a7505c
    • X
      tipc: change to check tipc_own_id to return in tipc_net_stop · 1be6c0c7
      Xin Long 提交于
      [ Upstream commit 9926cb5f8b0f0aea535735185600d74db7608550 ]
      
      When running a syz script, a panic occurred:
      
      [  156.088228] BUG: KASAN: use-after-free in tipc_disc_timeout+0x9c9/0xb20 [tipc]
      [  156.094315] Call Trace:
      [  156.094844]  <IRQ>
      [  156.095306]  dump_stack+0x7c/0xc0
      [  156.097346]  print_address_description+0x65/0x22e
      [  156.100445]  kasan_report.cold.3+0x37/0x7a
      [  156.102402]  tipc_disc_timeout+0x9c9/0xb20 [tipc]
      [  156.106517]  call_timer_fn+0x19a/0x610
      [  156.112749]  run_timer_softirq+0xb51/0x1090
      
      It was caused by the netns freed without deleting the discoverer timer,
      while later on the netns would be accessed in the timer handler.
      
      The timer should have been deleted by tipc_net_stop() when cleaning up a
      netns. However, tipc has been able to enable a bearer and start d->timer
      without the local node_addr set since Commit 52dfae5c ("tipc: obtain
      node identity from interface by default"), which caused the timer not to
      be deleted in tipc_net_stop() then.
      
      So fix it in tipc_net_stop() by changing to check local node_id instead
      of local node_addr, as Jon suggested.
      
      While at it, remove the calling of tipc_nametbl_withdraw() there, since
      tipc_nametbl_stop() will take of the nametbl's freeing after.
      
      Fixes: 52dfae5c ("tipc: obtain node identity from interface by default")
      Reported-by: syzbot+a25307ad099309f1c2b9@syzkaller.appspotmail.com
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1be6c0c7
    • E
      tipc: allow service ranges to be connect()'ed on RDM/DGRAM · 24d1a625
      Erik Hugne 提交于
      [ Upstream commit ea239314fe42ace880bdd834256834679346c80e ]
      
      We move the check that prevents connecting service ranges to after
      the RDM/DGRAM check, and move address sanity control to a separate
      function that also validates the service range.
      
      Fixes: 23998835 ("tipc: improve address sanity check in tipc_connect()")
      Signed-off-by: NErik Hugne <erik.hugne@gmail.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      24d1a625
    • E
      tcp: do not use ipv6 header for ipv4 flow · 7115df61
      Eric Dumazet 提交于
      [ Upstream commit 89e4130939a20304f4059ab72179da81f5347528 ]
      
      When a dual stack tcp listener accepts an ipv4 flow,
      it should not attempt to use an ipv6 header or tcp_v6_iif() helper.
      
      Fixes: 1397ed35 ("ipv6: add flowinfo for tcp6 pkt_options for all cases")
      Fixes: df3687ff ("ipv6: add the IPV6_FL_F_REFLECT flag to IPV6_FL_A_GET")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7115df61
    • X
      sctp: use memdup_user instead of vmemdup_user · cab576f1
      Xin Long 提交于
      [ Upstream commit ef82bcfa671b9a635bab5fa669005663d8b177c5 ]
      
      In sctp_setsockopt_bindx()/__sctp_setsockopt_connectx(), it allocates
      memory with addrs_size which is passed from userspace. We used flag
      GFP_USER to put some more restrictions on it in Commit cacc0621
      ("sctp: use GFP_USER for user-controlled kmalloc").
      
      However, since Commit c981f254 ("sctp: use vmemdup_user() rather
      than badly open-coding memdup_user()"), vmemdup_user() has been used,
      which doesn't check GFP_USER flag when goes to vmalloc_*(). So when
      addrs_size is a huge value, it could exhaust memory and even trigger
      oom killer.
      
      This patch is to use memdup_user() instead, in which GFP_USER would
      work to limit the memory allocation with a huge addrs_size.
      
      Note we can't fix it by limiting 'addrs_size', as there's no demand
      for it from RFC.
      
      Reported-by: syzbot+ec1b7575afef85a0e5ca@syzkaller.appspotmail.com
      Fixes: c981f254 ("sctp: use vmemdup_user() rather than badly open-coding memdup_user()")
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cab576f1
    • M
      packets: Always register packet sk in the same order · 69cea7cf
      Maxime Chevallier 提交于
      [ Upstream commit a4dc6a49156b1f8d6e17251ffda17c9e6a5db78a ]
      
      When using fanouts with AF_PACKET, the demux functions such as
      fanout_demux_cpu will return an index in the fanout socket array, which
      corresponds to the selected socket.
      
      The ordering of this array depends on the order the sockets were added
      to a given fanout group, so for FANOUT_CPU this means sockets are bound
      to cpus in the order they are configured, which is OK.
      
      However, when stopping then restarting the interface these sockets are
      bound to, the sockets are reassigned to the fanout group in the reverse
      order, due to the fact that they were inserted at the head of the
      interface's AF_PACKET socket list.
      
      This means that traffic that was directed to the first socket in the
      fanout group is now directed to the last one after an interface restart.
      
      In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
      socket that used to receive traffic from the last CPU after an interface
      restart.
      
      This commit introduces a helper to add a socket at the tail of a list,
      then uses it to register AF_PACKET sockets.
      
      Note that this changes the order in which sockets are listed in /proc and
      with sock_diag.
      
      Fixes: dc99f600 ("packet: Add fanout support")
      Signed-off-by: NMaxime Chevallier <maxime.chevallier@bootlin.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      69cea7cf
    • Y
      net-sysfs: call dev_hold if kobject_init_and_add success · d9d215be
      YueHaibing 提交于
      [ Upstream commit a3e23f719f5c4a38ffb3d30c8d7632a4ed8ccd9e ]
      
      In netdev_queue_add_kobject and rx_queue_add_kobject,
      if sysfs_create_group failed, kobject_put will call
      netdev_queue_release to decrease dev refcont, however
      dev_hold has not be called. So we will see this while
      unregistering dev:
      
      unregister_netdevice: waiting for bcsh0 to become free. Usage count = -1
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Fixes: d0d66837 ("net: don't decrement kobj reference count on init failure")
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d9d215be
    • E
      net: rose: fix a possible stack overflow · 7eeb12ed
      Eric Dumazet 提交于
      [ Upstream commit e5dcc0c3223c45c94100f05f28d8ef814db3d82c ]
      
      rose_write_internal() uses a temp buffer of 100 bytes, but a manual
      inspection showed that given arbitrary input, rose_create_facilities()
      can fill up to 110 bytes.
      
      Lets use a tailroom of 256 bytes for peace of mind, and remove
      the bounce buffer : we can simply allocate a big enough skb
      and adjust its length as needed.
      
      syzbot report :
      
      BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:352 [inline]
      BUG: KASAN: stack-out-of-bounds in rose_create_facilities net/rose/rose_subr.c:521 [inline]
      BUG: KASAN: stack-out-of-bounds in rose_write_internal+0x597/0x15d0 net/rose/rose_subr.c:116
      Write of size 7 at addr ffff88808b1ffbef by task syz-executor.0/24854
      
      CPU: 0 PID: 24854 Comm: syz-executor.0 Not tainted 5.0.0+ #97
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
       kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       check_memory_region_inline mm/kasan/generic.c:185 [inline]
       check_memory_region+0x123/0x190 mm/kasan/generic.c:191
       memcpy+0x38/0x50 mm/kasan/common.c:131
       memcpy include/linux/string.h:352 [inline]
       rose_create_facilities net/rose/rose_subr.c:521 [inline]
       rose_write_internal+0x597/0x15d0 net/rose/rose_subr.c:116
       rose_connect+0x7cb/0x1510 net/rose/af_rose.c:826
       __sys_connect+0x266/0x330 net/socket.c:1685
       __do_sys_connect net/socket.c:1696 [inline]
       __se_sys_connect net/socket.c:1693 [inline]
       __x64_sys_connect+0x73/0xb0 net/socket.c:1693
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x458079
      Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f47b8d9dc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458079
      RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000004
      RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f47b8d9e6d4
      R13: 00000000004be4a4 R14: 00000000004ceca8 R15: 00000000ffffffff
      
      The buggy address belongs to the page:
      page:ffffea00022c7fc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
      flags: 0x1fffc0000000000()
      raw: 01fffc0000000000 0000000000000000 ffffffff022c0101 0000000000000000
      raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88808b1ffa80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff88808b1ffb00: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 00 03
      >ffff88808b1ffb80: f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 04 f3
                                                                   ^
       ffff88808b1ffc00: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
       ffff88808b1ffc80: 00 00 00 00 00 00 00 f1 f1 f1 f1 f1 f1 01 f2 01
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7eeb12ed
    • C
      net/packet: Set __GFP_NOWARN upon allocation in alloc_pg_vec · 85ef72d8
      Christoph Paasch 提交于
      [ Upstream commit 398f0132c14754fcd03c1c4f8e7176d001ce8ea1 ]
      
      Since commit fc62814d690c ("net/packet: fix 4gb buffer limit due to overflow check")
      one can now allocate packet ring buffers >= UINT_MAX. However, syzkaller
      found that that triggers a warning:
      
      [   21.100000] WARNING: CPU: 2 PID: 2075 at mm/page_alloc.c:4584 __alloc_pages_nod0
      [   21.101490] Modules linked in:
      [   21.101921] CPU: 2 PID: 2075 Comm: syz-executor.0 Not tainted 5.0.0 #146
      [   21.102784] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
      [   21.103887] RIP: 0010:__alloc_pages_nodemask+0x2a0/0x630
      [   21.104640] Code: fe ff ff 65 48 8b 04 25 c0 de 01 00 48 05 90 0f 00 00 41 bd 01 00 00 00 48 89 44 24 48 e9 9c fe 3
      [   21.107121] RSP: 0018:ffff88805e1cf920 EFLAGS: 00010246
      [   21.107819] RAX: 0000000000000000 RBX: ffffffff85a488a0 RCX: 0000000000000000
      [   21.108753] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000000
      [   21.109699] RBP: 1ffff1100bc39f28 R08: ffffed100bcefb67 R09: ffffed100bcefb67
      [   21.110646] R10: 0000000000000001 R11: ffffed100bcefb66 R12: 000000000000000d
      [   21.111623] R13: 0000000000000000 R14: ffff88805e77d888 R15: 000000000000000d
      [   21.112552] FS:  00007f7c7de05700(0000) GS:ffff88806d100000(0000) knlGS:0000000000000000
      [   21.113612] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   21.114405] CR2: 000000000065c000 CR3: 000000005e58e006 CR4: 00000000001606e0
      [   21.115367] Call Trace:
      [   21.115705]  ? __alloc_pages_slowpath+0x21c0/0x21c0
      [   21.116362]  alloc_pages_current+0xac/0x1e0
      [   21.116923]  kmalloc_order+0x18/0x70
      [   21.117393]  kmalloc_order_trace+0x18/0x110
      [   21.117949]  packet_set_ring+0x9d5/0x1770
      [   21.118524]  ? packet_rcv_spkt+0x440/0x440
      [   21.119094]  ? lock_downgrade+0x620/0x620
      [   21.119646]  ? __might_fault+0x177/0x1b0
      [   21.120177]  packet_setsockopt+0x981/0x2940
      [   21.120753]  ? __fget+0x2fb/0x4b0
      [   21.121209]  ? packet_release+0xab0/0xab0
      [   21.121740]  ? sock_has_perm+0x1cd/0x260
      [   21.122297]  ? selinux_secmark_relabel_packet+0xd0/0xd0
      [   21.123013]  ? __fget+0x324/0x4b0
      [   21.123451]  ? selinux_netlbl_socket_setsockopt+0x101/0x320
      [   21.124186]  ? selinux_netlbl_sock_rcv_skb+0x3a0/0x3a0
      [   21.124908]  ? __lock_acquire+0x529/0x3200
      [   21.125453]  ? selinux_socket_setsockopt+0x5d/0x70
      [   21.126075]  ? __sys_setsockopt+0x131/0x210
      [   21.126533]  ? packet_release+0xab0/0xab0
      [   21.127004]  __sys_setsockopt+0x131/0x210
      [   21.127449]  ? kernel_accept+0x2f0/0x2f0
      [   21.127911]  ? ret_from_fork+0x8/0x50
      [   21.128313]  ? do_raw_spin_lock+0x11b/0x280
      [   21.128800]  __x64_sys_setsockopt+0xba/0x150
      [   21.129271]  ? lockdep_hardirqs_on+0x37f/0x560
      [   21.129769]  do_syscall_64+0x9f/0x450
      [   21.130182]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      We should allocate with __GFP_NOWARN to handle this.
      
      Cc: Kal Conley <kal.conley@dectris.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Fixes: fc62814d690c ("net/packet: fix 4gb buffer limit due to overflow check")
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      85ef72d8
    • P
      net: datagram: fix unbounded loop in __skb_try_recv_datagram() · 88c64f9c
      Paolo Abeni 提交于
      [ Upstream commit 0b91bce1ebfc797ff3de60c8f4a1e6219a8a3187 ]
      
      Christoph reported a stall while peeking datagram with an offset when
      busy polling is enabled. __skb_try_recv_datagram() uses as the loop
      termination condition 'queue empty'. When peeking, the socket
      queue can be not empty, even when no additional packets are received.
      
      Address the issue explicitly checking for receive queue changes,
      as currently done by __skb_wait_for_more_packets().
      
      Fixes: 2b5cd0df ("net: Change return type of sk_busy_loop from bool to void")
      Reported-and-tested-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      88c64f9c
    • X
      ipv6: make ip6_create_rt_rcu return ip6_null_entry instead of NULL · be092113
      Xin Long 提交于
      [ Upstream commit 1c87e79a002f6a159396138cd3f3ab554a2a8887 ]
      
      Jianlin reported a crash:
      
        [  381.484332] BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
        [  381.619802] RIP: 0010:fib6_rule_lookup+0xa3/0x160
        [  382.009615] Call Trace:
        [  382.020762]  <IRQ>
        [  382.030174]  ip6_route_redirect.isra.52+0xc9/0xf0
        [  382.050984]  ip6_redirect+0xb6/0xf0
        [  382.066731]  icmpv6_notify+0xca/0x190
        [  382.083185]  ndisc_redirect_rcv+0x10f/0x160
        [  382.102569]  ndisc_rcv+0xfb/0x100
        [  382.117725]  icmpv6_rcv+0x3f2/0x520
        [  382.133637]  ip6_input_finish+0xbf/0x460
        [  382.151634]  ip6_input+0x3b/0xb0
        [  382.166097]  ipv6_rcv+0x378/0x4e0
      
      It was caused by the lookup function __ip6_route_redirect() returns NULL in
      fib6_rule_lookup() when ip6_create_rt_rcu() returns NULL.
      
      So we fix it by simply making ip6_create_rt_rcu() return ip6_null_entry
      instead of NULL.
      
      v1->v2:
        - move down 'fallback:' to make it more readable.
      
      Fixes: e873e4b9 ("ipv6: use fib6_info_hold_safe() when necessary")
      Reported-by: NJianlin Shi <jishi@redhat.com>
      Suggested-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Acked-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      be092113
    • Y
      genetlink: Fix a memory leak on error path · 9b8ef421
      YueHaibing 提交于
      [ Upstream commit ceabee6c59943bdd5e1da1a6a20dc7ee5f8113a2 ]
      
      In genl_register_family(), when idr_alloc() fails,
      we forget to free the memory we possibly allocate for
      family->attrbuf.
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Fixes: 2ae0f17d ("genetlink: use idr to track families")
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9b8ef421
    • E
      dccp: do not use ipv6 header for ipv4 flow · 321461f2
      Eric Dumazet 提交于
      [ Upstream commit e0aa67709f89d08c8d8e5bdd9e0b649df61d0090 ]
      
      When a dual stack dccp listener accepts an ipv4 flow,
      it should not attempt to use an ipv6 header or
      inet6_iif() helper.
      
      Fixes: 3df80d93 ("[DCCP]: Introduce DCCPv6")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      321461f2
    • M
      Bluetooth: Verify that l2cap_get_conf_opt provides large enough buffer · 15d6538a
      Marcel Holtmann 提交于
      commit 7c9cbd0b5e38a1672fcd137894ace3b042dfbf69 upstream.
      
      The function l2cap_get_conf_opt will return L2CAP_CONF_OPT_SIZE + opt->len
      as length value. The opt->len however is in control over the remote user
      and can be used by an attacker to gain access beyond the bounds of the
      actual packet.
      
      To prevent any potential leak of heap memory, it is enough to check that
      the resulting len calculation after calling l2cap_get_conf_opt is not
      below zero. A well formed packet will always return >= 0 here and will
      end with the length value being zero after the last option has been
      parsed. In case of malformed packets messing with the opt->len field the
      length value will become negative. If that is the case, then just abort
      and ignore the option.
      
      In case an attacker uses a too short opt->len value, then garbage will
      be parsed, but that is protected by the unknown option handling and also
      the option parameter size checks.
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      15d6538a
    • M
      Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt · 2318c0e4
      Marcel Holtmann 提交于
      commit af3d5d1c87664a4f150fcf3534c6567cb19909b0 upstream.
      
      When doing option parsing for standard type values of 1, 2 or 4 octets,
      the value is converted directly into a variable instead of a pointer. To
      avoid being tricked into being a pointer, check that for these option
      types that sizes actually match. In L2CAP every option is fixed size and
      thus it is prudent anyway to ensure that the remote side sends us the
      right option size along with option paramters.
      
      If the option size is not matching the option type, then that option is
      silently ignored. It is a protocol violation and instead of trying to
      give the remote attacker any further hints just pretend that option is
      not present and proceed with the default values. Implementation
      following the specification and its qualification procedures will always
      use the correct size and thus not being impacted here.
      
      To keep the code readable and consistent accross all options, a few
      cosmetic changes were also required.
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2318c0e4
  4. 27 3月, 2019 3 次提交
  5. 24 3月, 2019 14 次提交
    • J
      svcrpc: fix UDP on servers with lots of threads · 43bceddc
      J. Bruce Fields 提交于
      commit b7e5034cbecf5a65b7bfdc2b20a8378039577706 upstream.
      
      James Pearson found that an NFS server stopped responding to UDP
      requests if started with more than 1017 threads.
      
      sv_max_mesg is about 2^20, so that is probably where the calculation
      performed by
      
      	svc_sock_setbufsize(svsk->sk_sock,
                                  (serv->sv_nrthreads+3) * serv->sv_max_mesg,
                                  (serv->sv_nrthreads+3) * serv->sv_max_mesg);
      
      starts to overflow an int.
      Reported-by: NJames Pearson <jcpearson@gmail.com>
      Tested-by: NJames Pearson <jcpearson@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      43bceddc
    • W
      bpf: only test gso type on gso packets · 9920eb40
      Willem de Bruijn 提交于
      [ Upstream commit 4c3024debf62de4c6ac6d3cb4c0063be21d4f652 ]
      
      BPF can adjust gso only for tcp bytestreams. Fail on other gso types.
      
      But only on gso packets. It does not touch this field if !gso_size.
      
      Fixes: b90efd225874 ("bpf: only adjust gso_size on bytestream protocols")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      9920eb40
    • A
      netfilter: ipt_CLUSTERIP: fix warning unused variable cn · 0d98ecb1
      Anders Roxell 提交于
      commit 206b8cc514d7ff2b79dd2d5ad939adc7c493f07a upstream.
      
      When CONFIG_PROC_FS isn't set the variable cn isn't used.
      
      net/ipv4/netfilter/ipt_CLUSTERIP.c: In function ‘clusterip_net_exit’:
      net/ipv4/netfilter/ipt_CLUSTERIP.c:849:24: warning: unused variable ‘cn’ [-Wunused-variable]
        struct clusterip_net *cn = clusterip_pernet(net);
                              ^~
      
      Rework so the variable 'cn' is declared inside "#ifdef CONFIG_PROC_FS".
      
      Fixes: b12f7bad5ad3 ("netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine")
      Signed-off-by: NAnders Roxell <anders.roxell@linaro.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0d98ecb1
    • A
      phonet: fix building with clang · ee01ac61
      Arnd Bergmann 提交于
      [ Upstream commit 6321aa197547da397753757bd84c6ce64b3e3d89 ]
      
      clang warns about overflowing the data[] member in the struct pnpipehdr:
      
      net/phonet/pep.c:295:8: warning: array index 4 is past the end of the array (which contains 1 element) [-Warray-bounds]
                              if (hdr->data[4] == PEP_IND_READY)
                                  ^         ~
      include/net/phonet/pep.h:66:3: note: array 'data' declared here
                      u8              data[1];
      
      Using a flexible array member at the end of the struct avoids the
      warning, but since we cannot have a flexible array member inside
      of the union, each index now has to be moved back by one, which
      makes it a little uglier.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NRémi Denis-Courmont <remi@remlab.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ee01ac61
    • T
      xfrm: Fix inbound traffic via XFRM interfaces across network namespaces · 6ac400b7
      Tobias Brunner 提交于
      [ Upstream commit 660899ddf06ae8bb5bbbd0a19418b739375430c5 ]
      
      After moving an XFRM interface to another namespace it stays associated
      with the original namespace (net in `struct xfrm_if` and the list keyed
      with `xfrmi_net_id`), allowing processes in the new namespace to use
      SAs/policies that were created in the original namespace.  For instance,
      this allows a keying daemon in one namespace to establish IPsec SAs for
      other namespaces without processes there having access to the keys or IKE
      credentials.
      
      This worked fine for outbound traffic, however, for inbound traffic the
      lookup for the interfaces and the policies used the incorrect namespace
      (the one the XFRM interface was moved to).
      
      Fixes: f203b76d ("xfrm: Add virtual xfrm interfaces")
      Signed-off-by: NTobias Brunner <tobias@strongswan.org>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      6ac400b7
    • S
      af_key: unconditionally clone on broadcast · 978e0388
      Sean Tranchetti 提交于
      [ Upstream commit fc2d5cfdcfe2ab76b263d91429caa22451123085 ]
      
      Attempting to avoid cloning the skb when broadcasting by inflating
      the refcount with sock_hold/sock_put while under RCU lock is dangerous
      and violates RCU principles. It leads to subtle race conditions when
      attempting to free the SKB, as we may reference sockets that have
      already been freed by the stack.
      
      Unable to handle kernel paging request at virtual address 6b6b6b6b6b6c4b
      [006b6b6b6b6b6c4b] address between user and kernel address ranges
      Internal error: Oops: 96000004 [#1] PREEMPT SMP
      task: fffffff78f65b380 task.stack: ffffff8049a88000
      pc : sock_rfree+0x38/0x6c
      lr : skb_release_head_state+0x6c/0xcc
      Process repro (pid: 7117, stack limit = 0xffffff8049a88000)
      Call trace:
      	sock_rfree+0x38/0x6c
      	skb_release_head_state+0x6c/0xcc
      	skb_release_all+0x1c/0x38
      	__kfree_skb+0x1c/0x30
      	kfree_skb+0xd0/0xf4
      	pfkey_broadcast+0x14c/0x18c
      	pfkey_sendmsg+0x1d8/0x408
      	sock_sendmsg+0x44/0x60
      	___sys_sendmsg+0x1d0/0x2a8
      	__sys_sendmsg+0x64/0xb4
      	SyS_sendmsg+0x34/0x4c
      	el0_svc_naked+0x34/0x38
      Kernel panic - not syncing: Fatal exception
      Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NSean Tranchetti <stranche@codeaurora.org>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      978e0388
    • W
      bpf: only adjust gso_size on bytestream protocols · 413e3985
      Willem de Bruijn 提交于
      [ Upstream commit b90efd2258749e04e1b3f71ef0d716f2ac2337e0 ]
      
      bpf_skb_change_proto and bpf_skb_adjust_room change skb header length.
      For GSO packets they adjust gso_size to maintain the same MTU.
      
      The gso size can only be safely adjusted on bytestream protocols.
      Commit d02f51cb ("bpf: fix bpf_skb_adjust_net/bpf_skb_proto_xlat
      to deal with gso sctp skbs") excluded SKB_GSO_SCTP.
      
      Since then type SKB_GSO_UDP_L4 has been added, whose contents are one
      gso_size unit per datagram. Also exclude these.
      
      Move from a blacklist to a whitelist check to future proof against
      additional such new GSO types, e.g., for fraglist based GRO.
      
      Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      413e3985
    • M
      esp: Skip TX bytes accounting when sending from a request socket · b92eaed3
      Martin Willi 提交于
      [ Upstream commit 09db51241118aeb06e1c8cd393b45879ce099b36 ]
      
      On ESP output, sk_wmem_alloc is incremented for the added padding if a
      socket is associated to the skb. When replying with TCP SYNACKs over
      IPsec, the associated sk is a casted request socket, only. Increasing
      sk_wmem_alloc on a request socket results in a write at an arbitrary
      struct offset. In the best case, this produces the following WARNING:
      
      WARNING: CPU: 1 PID: 0 at lib/refcount.c:102 esp_output_head+0x2e4/0x308 [esp4]
      refcount_t: addition on 0; use-after-free.
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-rc3 #2
      Hardware name: Marvell Armada 380/385 (Device Tree)
      [...]
      [<bf0ff354>] (esp_output_head [esp4]) from [<bf1006a4>] (esp_output+0xb8/0x180 [esp4])
      [<bf1006a4>] (esp_output [esp4]) from [<c05dee64>] (xfrm_output_resume+0x558/0x664)
      [<c05dee64>] (xfrm_output_resume) from [<c05d07b0>] (xfrm4_output+0x44/0xc4)
      [<c05d07b0>] (xfrm4_output) from [<c05956bc>] (tcp_v4_send_synack+0xa8/0xe8)
      [<c05956bc>] (tcp_v4_send_synack) from [<c0586ad8>] (tcp_conn_request+0x7f4/0x948)
      [<c0586ad8>] (tcp_conn_request) from [<c058c404>] (tcp_rcv_state_process+0x2a0/0xe64)
      [<c058c404>] (tcp_rcv_state_process) from [<c05958ac>] (tcp_v4_do_rcv+0xf0/0x1f4)
      [<c05958ac>] (tcp_v4_do_rcv) from [<c0598a4c>] (tcp_v4_rcv+0xdb8/0xe20)
      [<c0598a4c>] (tcp_v4_rcv) from [<c056eb74>] (ip_protocol_deliver_rcu+0x2c/0x2dc)
      [<c056eb74>] (ip_protocol_deliver_rcu) from [<c056ee6c>] (ip_local_deliver_finish+0x48/0x54)
      [<c056ee6c>] (ip_local_deliver_finish) from [<c056eecc>] (ip_local_deliver+0x54/0xec)
      [<c056eecc>] (ip_local_deliver) from [<c056efac>] (ip_rcv+0x48/0xb8)
      [<c056efac>] (ip_rcv) from [<c0519c2c>] (__netif_receive_skb_one_core+0x50/0x6c)
      [...]
      
      The issue triggers only when not using TCP syncookies, as for syncookies
      no socket is associated.
      
      Fixes: cac2661c ("esp4: Avoid skb_cow_data whenever possible")
      Fixes: 03e2a30f ("esp6: Avoid skb_cow_data whenever possible")
      Signed-off-by: NMartin Willi <martin@strongswan.org>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b92eaed3
    • N
      xprtrdma: Make sure Send CQ is allocated on an existing compvec · d84bc704
      Nicolas Morey-Chaisemartin 提交于
      [ Upstream commit a4cb5bdb754afe21f3e9e7164213e8600cf69427 ]
      
      Make sure the device has at least 2 completion vectors
      before allocating to compvec#1
      
      Fixes: a4699f56 (xprtrdma: Put Send CQ in IB_POLL_WORKQUEUE mode)
      Signed-off-by: NNicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>
      Reviewed-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d84bc704
    • A
      ipvs: fix dependency on nf_defrag_ipv6 · 5ca2ef67
      Andrea Claudi 提交于
      [ Upstream commit 098e13f5b21d3398065fce8780f07a3ef62f4812 ]
      
      ipvs relies on nf_defrag_ipv6 module to manage IPv6 fragmentation,
      but lacks proper Kconfig dependencies and does not explicitly
      request defrag features.
      
      As a result, if netfilter hooks are not loaded, when IPv6 fragmented
      packet are handled by ipvs only the first fragment makes through.
      
      Fix it properly declaring the dependency on Kconfig and registering
      netfilter hooks on ip_vs_add_service() and ip_vs_new_dest().
      Reported-by: NLi Shuang <shuali@redhat.com>
      Signed-off-by: NAndrea Claudi <aclaudi@redhat.com>
      Acked-by: NJulian Anastasov <ja@ssi.bg>
      Acked-by: NSimon Horman <horms@verge.net.au>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      5ca2ef67
    • F
      netfilter: compat: initialize all fields in xt_init · e0e6b0d7
      Francesco Ruggeri 提交于
      [ Upstream commit 8d29d16d21342a0c86405d46de0c4ac5daf1760f ]
      
      If a non zero value happens to be in xt[NFPROTO_BRIDGE].cur at init
      time, the following panic can be caused by running
      
      % ebtables -t broute -F BROUTING
      
      from a 32-bit user level on a 64-bit kernel. This patch replaces
      kmalloc_array with kcalloc when allocating xt.
      
      [  474.680846] BUG: unable to handle kernel paging request at 0000000009600920
      [  474.687869] PGD 2037006067 P4D 2037006067 PUD 2038938067 PMD 0
      [  474.693838] Oops: 0000 [#1] SMP
      [  474.697055] CPU: 9 PID: 4662 Comm: ebtables Kdump: loaded Not tainted 4.19.17-11302235.AroraKernelnext.fc18.x86_64 #1
      [  474.707721] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.0 06/28/2013
      [  474.714313] RIP: 0010:xt_compat_calc_jump+0x2f/0x63 [x_tables]
      [  474.720201] Code: 40 0f b6 ff 55 31 c0 48 6b ff 70 48 03 3d dc 45 00 00 48 89 e5 8b 4f 6c 4c 8b 47 60 ff c9 39 c8 7f 2f 8d 14 08 d1 fa 48 63 fa <41> 39 34 f8 4c 8d 0c fd 00 00 00 00 73 05 8d 42 01 eb e1 76 05 8d
      [  474.739023] RSP: 0018:ffffc9000943fc58 EFLAGS: 00010207
      [  474.744296] RAX: 0000000000000000 RBX: ffffc90006465000 RCX: 0000000002580249
      [  474.751485] RDX: 00000000012c0124 RSI: fffffffff7be17e9 RDI: 00000000012c0124
      [  474.758670] RBP: ffffc9000943fc58 R08: 0000000000000000 R09: ffffffff8117cf8f
      [  474.765855] R10: ffffc90006477000 R11: 0000000000000000 R12: 0000000000000001
      [  474.773048] R13: 0000000000000000 R14: ffffc9000943fcb8 R15: ffffc9000943fcb8
      [  474.780234] FS:  0000000000000000(0000) GS:ffff88a03f840000(0063) knlGS:00000000f7ac7700
      [  474.788612] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
      [  474.794632] CR2: 0000000009600920 CR3: 0000002037422006 CR4: 00000000000606e0
      [  474.802052] Call Trace:
      [  474.804789]  compat_do_replace+0x1fb/0x2a3 [ebtables]
      [  474.810105]  compat_do_ebt_set_ctl+0x69/0xe6 [ebtables]
      [  474.815605]  ? try_module_get+0x37/0x42
      [  474.819716]  compat_nf_setsockopt+0x4f/0x6d
      [  474.824172]  compat_ip_setsockopt+0x7e/0x8c
      [  474.828641]  compat_raw_setsockopt+0x16/0x3a
      [  474.833220]  compat_sock_common_setsockopt+0x1d/0x24
      [  474.838458]  __compat_sys_setsockopt+0x17e/0x1b1
      [  474.843343]  ? __check_object_size+0x76/0x19a
      [  474.847960]  __ia32_compat_sys_socketcall+0x1cb/0x25b
      [  474.853276]  do_fast_syscall_32+0xaf/0xf6
      [  474.857548]  entry_SYSENTER_compat+0x6b/0x7a
      Signed-off-by: NFrancesco Ruggeri <fruggeri@arista.com>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      e0e6b0d7
    • I
      mac80211: Fix Tx aggregation session tear down with ITXQs · a5a24445
      Ilan Peer 提交于
      [ Upstream commit 6157ca0d6bfe437691b1e98a62e2efe12b6714da ]
      
      When mac80211 requests the low level driver to stop an ongoing
      Tx aggregation, the low level driver is expected to call
      ieee80211_stop_tx_ba_cb_irqsafe() to indicate that it is ready
      to stop the session. The callback in turn schedules a worker
      to complete the session tear down, which in turn also handles
      the relevant state for the intermediate Tx queue.
      
      However, as this flow in asynchronous, the intermediate queue
      should be stopped and not continue servicing frames, as in
      such a case frames that are dequeued would be marked as part
      of an aggregation, although the aggregation is already been
      stopped.
      
      Fix this by stopping the intermediate Tx queue, before
      calling the low level driver to stop the Tx aggregation.
      Signed-off-by: NIlan Peer <ilan.peer@intel.com>
      Signed-off-by: NLuca Coelho <luciano.coelho@intel.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      a5a24445
    • J
      mac80211: call drv_ibss_join() on restart · bff33ba4
      Johannes Berg 提交于
      [ Upstream commit 4926b51bfaa6d36bd6f398fb7698679d3962e19d ]
      
      If a driver does any significant activity in its ibss_join method,
      then it will very well expect that to be called during restart,
      before any stations are added. Do that.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NLuca Coelho <luciano.coelho@intel.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      bff33ba4
    • Z
      9p/net: fix memory leak in p9_client_create · 85bdc9da
      zhengbin 提交于
      commit bb06c388fa20ae24cfe80c52488de718a7e3a53f upstream.
      
      If msize is less than 4096, we should close and put trans, destroy
      tagpool, not just free client. This patch fixes that.
      
      Link: http://lkml.kernel.org/m/1552464097-142659-1-git-send-email-zhengbin13@huawei.com
      Cc: stable@vger.kernel.org
      Fixes: 574d356b7a02 ("9p/net: put a lower bound on msize")
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Signed-off-by: Nzhengbin <zhengbin13@huawei.com>
      Signed-off-by: NDominique Martinet <dominique.martinet@cea.fr>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      85bdc9da
  6. 19 3月, 2019 1 次提交
    • V
      net: sched: flower: insert new filter to idr after setting its mask · 275a2c08
      Vlad Buslov 提交于
      [ Upstream commit ecb3dea400d3beaf611ce76ac7a51d4230492cf2 ]
      
      When adding new filter to flower classifier, fl_change() inserts it to
      handle_idr before initializing filter extensions and assigning it a mask.
      Normally this ordering doesn't matter because all flower classifier ops
      callbacks assume rtnl lock protection. However, when filter has an action
      that doesn't have its kernel module loaded, rtnl lock is released before
      call to request_module(). During this time the filter can be accessed bu
      concurrent task before its initialization is completed, which can lead to a
      crash.
      
      Example case of NULL pointer dereference in concurrent dump:
      
      Task 1                           Task 2
      
      tc_new_tfilter()
       fl_change()
        idr_alloc_u32(fnew)
        fl_set_parms()
         tcf_exts_validate()
          tcf_action_init()
           tcf_action_init_1()
            rtnl_unlock()
            request_module()
            ...                        rtnl_lock()
            				 tc_dump_tfilter()
            				  tcf_chain_dump()
      				   fl_walk()
      				    idr_get_next_ul()
      				    tcf_node_dump()
      				     tcf_fill_node()
      				      fl_dump()
      				       mask = &f->mask->key; <- NULL ptr
            rtnl_lock()
      
      Extension initialization and mask assignment don't depend on fnew->handle
      that is allocated by idr_alloc_u32(). Move idr allocation code after action
      creation and mask assignment in fl_change() to prevent concurrent access
      to not fully initialized filter when rtnl lock is released to load action
      module.
      
      Fixes: 01683a14 ("net: sched: refactor flower walk to iterate over idr")
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Reviewed-by: NRoi Dayan <roid@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      275a2c08