1. 24 8月, 2016 4 次提交
  2. 23 8月, 2016 1 次提交
    • S
      net: ip_finish_output_gso: Allow fragmenting segments of tunneled skbs if their DF is unset · c0451fe1
      Shmulik Ladkani 提交于
      In b8247f09,
      
         "net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs"
      
      gso skbs arriving from an ingress interface that go through UDP
      tunneling, are allowed to be fragmented if the resulting encapulated
      segments exceed the dst mtu of the egress interface.
      
      This aligned the behavior of gso skbs to non-gso skbs going through udp
      encapsulation path.
      
      However the non-gso vs gso anomaly is present also in the following
      cases of a GRE tunnel:
       - ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set
         (e.g. OvS vport-gre with df_default=false)
       - ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set
      
      In both of the above cases, the non-gso skbs get fragmented, whereas the
      gso skbs (having skb_gso_network_seglen that exceeds dst mtu) get dropped,
      as they don't go through the segment+fragment code path.
      
      Fix: Setting IPSKB_FRAG_SEGS if the tunnel specified IP_DF bit is NOT set.
      
      Tunnels that do set IP_DF, will not go to fragmentation of segments.
      This preserves behavior of ip_gre in (the default) pmtudisc mode.
      
      Fixes: b8247f09 ("net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs")
      Reported-by: Nwenxu <wenxu@ucloud.cn>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
      Tested-by: Nwenxu <wenxu@ucloud.cn>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0451fe1
  3. 19 8月, 2016 1 次提交
  4. 16 8月, 2016 1 次提交
  5. 10 8月, 2016 1 次提交
    • L
      vti: flush x-netns xfrm cache when vti interface is removed · a5d0dc81
      Lance Richardson 提交于
      When executing the script included below, the netns delete operation
      hangs with the following message (repeated at 10 second intervals):
      
        kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1
      
      This occurs because a reference to the lo interface in the "secure" netns
      is still held by a dst entry in the xfrm bundle cache in the init netns.
      
      Address this problem by garbage collecting the tunnel netns flow cache
      when a cross-namespace vti interface receives a NETDEV_DOWN notification.
      
      A more detailed description of the problem scenario (referencing commands
      in the script below):
      
      (1) ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1
      
        The vti_test interface is created in the init namespace. vti_tunnel_init()
        attaches a struct ip_tunnel to the vti interface's netdev_priv(dev),
        setting the tunnel net to &init_net.
      
      (2) ip link set vti_test netns secure
      
        The vti_test interface is moved to the "secure" netns. Note that
        the associated struct ip_tunnel still has tunnel->net set to &init_net.
      
      (3) ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1
      
        The first packet sent using the vti device causes xfrm_lookup() to be
        called as follows:
      
            dst = xfrm_lookup(tunnel->net, skb_dst(skb), fl, NULL, 0);
      
        Note that tunnel->net is the init namespace, while skb_dst(skb) references
        the vti_test interface in the "secure" namespace. The returned dst
        references an interface in the init namespace.
      
        Also note that the first parameter to xfrm_lookup() determines which flow
        cache is used to store the computed xfrm bundle, so after xfrm_lookup()
        returns there will be a cached bundle in the init namespace flow cache
        with a dst referencing a device in the "secure" namespace.
      
      (4) ip netns del secure
      
        Kernel begins to delete the "secure" namespace.  At some point the
        vti_test interface is deleted, at which point dst_ifdown() changes
        the dst->dev in the cached xfrm bundle flow from vti_test to lo (still
        in the "secure" namespace however).
        Since nothing has happened to cause the init namespace's flow cache
        to be garbage collected, this dst remains attached to the flow cache,
        so the kernel loops waiting for the last reference to lo to go away.
      
      <Begin script>
      ip link add br1 type bridge
      ip link set dev br1 up
      ip addr add dev br1 1.1.1.1/8
      
      ip netns add secure
      ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1
      ip link set vti_test netns secure
      ip netns exec secure ip link set vti_test up
      ip netns exec secure ip link s lo up
      ip netns exec secure ip addr add dev lo 192.168.100.1/24
      ip netns exec secure ip route add 192.168.200.0/24 dev vti_test
      ip xfrm policy flush
      ip xfrm state flush
      ip xfrm policy add dir out tmpl src 1.1.1.1 dst 1.1.1.2 \
         proto esp mode tunnel mark 1
      ip xfrm policy add dir in tmpl src 1.1.1.2 dst 1.1.1.1 \
         proto esp mode tunnel mark 1
      ip xfrm state add src 1.1.1.1 dst 1.1.1.2 proto esp spi 1 \
         mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788
      ip xfrm state add src 1.1.1.2 dst 1.1.1.1 proto esp spi 1 \
         mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788
      
      ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1
      
      ip netns del secure
      <End script>
      Reported-by: NHangbin Liu <haliu@redhat.com>
      Reported-by: NJan Tluka <jtluka@redhat.com>
      Signed-off-by: NLance Richardson <lrichard@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a5d0dc81
  6. 06 8月, 2016 1 次提交
    • D
      ipv4: panic in leaf_walk_rcu due to stale node pointer · 94d9f1c5
      David Forster 提交于
      Panic occurs when issuing "cat /proc/net/route" whilst
      populating FIB with > 1M routes.
      
      Use of cached node pointer in fib_route_get_idx is unsafe.
      
       BUG: unable to handle kernel paging request at ffffc90001630024
       IP: [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
       PGD 11b08d067 PUD 11b08e067 PMD dac4b067 PTE 0
       Oops: 0000 [#1] SMP
       Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscac
       snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep virti
       acpi_cpufreq button parport_pc ppdev lp parport autofs4 ext4 crc16 mbcache jbd
      tio_ring virtio floppy uhci_hcd ehci_hcd usbcore usb_common libata scsi_mod
       CPU: 1 PID: 785 Comm: cat Not tainted 4.2.0-rc8+ #4
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
       task: ffff8800da1c0bc0 ti: ffff88011a05c000 task.ti: ffff88011a05c000
       RIP: 0010:[<ffffffff814cf6a0>]  [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
       RSP: 0018:ffff88011a05fda0  EFLAGS: 00010202
       RAX: ffff8800d8a40c00 RBX: ffff8800da4af940 RCX: ffff88011a05ff20
       RDX: ffffc90001630020 RSI: 0000000001013531 RDI: ffff8800da4af950
       RBP: 0000000000000000 R08: ffff8800da1f9a00 R09: 0000000000000000
       R10: ffff8800db45b7e4 R11: 0000000000000246 R12: ffff8800da4af950
       R13: ffff8800d97a74c0 R14: 0000000000000000 R15: ffff8800d97a7480
       FS:  00007fd3970e0700(0000) GS:ffff88011fd00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: ffffc90001630024 CR3: 000000011a7e4000 CR4: 00000000000006e0
       Stack:
        ffffffff814d00d3 0000000000000000 ffff88011a05ff20 ffff8800da1f9a00
        ffffffff811dd8b9 0000000000000800 0000000000020000 00007fd396f35000
        ffffffff811f8714 0000000000003431 ffffffff8138dce0 0000000000000f80
       Call Trace:
        [<ffffffff814d00d3>] ? fib_route_seq_start+0x93/0xc0
        [<ffffffff811dd8b9>] ? seq_read+0x149/0x380
        [<ffffffff811f8714>] ? fsnotify+0x3b4/0x500
        [<ffffffff8138dce0>] ? process_echoes+0x70/0x70
        [<ffffffff8121cfa7>] ? proc_reg_read+0x47/0x70
        [<ffffffff811bb823>] ? __vfs_read+0x23/0xd0
        [<ffffffff811bbd42>] ? rw_verify_area+0x52/0xf0
        [<ffffffff811bbe61>] ? vfs_read+0x81/0x120
        [<ffffffff811bcbc2>] ? SyS_read+0x42/0xa0
        [<ffffffff81549ab2>] ? entry_SYSCALL_64_fastpath+0x16/0x75
       Code: 48 85 c0 75 d8 f3 c3 31 c0 c3 f3 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00
      a 04 89 f0 33 02 44 89 c9 48 d3 e8 0f b6 4a 05 49 89
       RIP  [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
        RSP <ffff88011a05fda0>
       CR2: ffffc90001630024
      Signed-off-by: NDave Forster <dforster@brocade.com>
      Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      94d9f1c5
  7. 31 7月, 2016 1 次提交
  8. 27 7月, 2016 1 次提交
  9. 26 7月, 2016 1 次提交
  10. 20 7月, 2016 2 次提交
    • S
      net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow... · b8247f09
      Shmulik Ladkani 提交于
      net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs
      
      Given:
       - tap0 and vxlan0 are bridged
       - vxlan0 stacked on eth0, eth0 having small mtu (e.g. 1400)
      
      Assume GSO skbs arriving from tap0 having a gso_size as determined by
      user-provided virtio_net_hdr (e.g. 1460 corresponding to VM mtu of 1500).
      
      After encapsulation these skbs have skb_gso_network_seglen that exceed
      eth0's ip_skb_dst_mtu.
      
      These skbs are accidentally passed to ip_finish_output2 AS IS.
      Alas, each final segment (segmented either by validate_xmit_skb or by
      hardware UFO) would be larger than eth0 mtu.
      As a result, those above-mtu segments get dropped on certain networks.
      
      This behavior is not aligned with the NON-GSO case:
      Assume a non-gso 1500-sized IP packet arrives from tap0. After
      encapsulation, the vxlan datagram is fragmented normally at the
      ip_finish_output-->ip_fragment code path.
      
      The expected behavior for the GSO case would be segmenting the
      "gso-oversized" skb first, then fragmenting each segment according to
      dst mtu, and finally passing the resulting fragments to ip_finish_output2.
      
      'ip_finish_output_gso' already supports this "Slowpath" behavior,
      according to the IPSKB_FRAG_SEGS flag, which is only set during ipv4
      forwarding (not set in the bridged case).
      
      In order to support the bridged case, we'll mark skbs arriving from an
      ingress interface that get udp-encaspulated as "allowed to be fragmented",
      causing their network_seglen to be validated by 'ip_finish_output_gso'
      (and fragment if needed).
      
      Note the TUNNEL_DONT_FRAGMENT tun_flag is still honoured (both in the
      gso and non-gso cases), which serves users wishing to forbid
      fragmentation at the udp tunnel endpoint.
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b8247f09
    • S
      net/ipv4: Introduce IPSKB_FRAG_SEGS bit to inet_skb_parm.flags · 359ebda2
      Shmulik Ladkani 提交于
      This flag indicates whether fragmentation of segments is allowed.
      
      Formerly this policy was hardcoded according to IPSKB_FORWARDED (set by
      either ip_forward or ipmr_forward).
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      359ebda2
  11. 19 7月, 2016 1 次提交
    • F
      netfilter: x_tables: speed up jump target validation · f4dc7771
      Florian Westphal 提交于
      The dummy ruleset I used to test the original validation change was broken,
      most rules were unreachable and were not tested by mark_source_chains().
      
      In some cases rulesets that used to load in a few seconds now require
      several minutes.
      
      sample ruleset that shows the behaviour:
      
      echo "*filter"
      for i in $(seq 0 100000);do
              printf ":chain_%06x - [0:0]\n" $i
      done
      for i in $(seq 0 100000);do
         printf -- "-A INPUT -j chain_%06x\n" $i
         printf -- "-A INPUT -j chain_%06x\n" $i
         printf -- "-A INPUT -j chain_%06x\n" $i
      done
      echo COMMIT
      
      [ pipe result into iptables-restore ]
      
      This ruleset will be about 74mbyte in size, with ~500k searches
      though all 500k[1] rule entries. iptables-restore will take forever
      (gave up after 10 minutes)
      
      Instead of always searching the entire blob for a match, fill an
      array with the start offsets of every single ipt_entry struct,
      then do a binary search to check if the jump target is present or not.
      
      After this change ruleset restore times get again close to what one
      gets when reverting 36472341 (~3 seconds on my workstation).
      
      [1] every user-defined rule gets an implicit RETURN, so we get
      300k jumps + 100k userchains + 100k returns -> 500k rule entries
      
      Fixes: 36472341 ("netfilter: x_tables: validate targets of jumps")
      Reported-by: NJeff Wu <wujiafu@gmail.com>
      Tested-by: NJeff Wu <wujiafu@gmail.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      f4dc7771
  12. 17 7月, 2016 1 次提交
    • N
      net: ipmr/ip6mr: add support for keeping an entry age · 43b9e127
      Nikolay Aleksandrov 提交于
      In preparation for hardware offloading of ipmr/ip6mr we need an
      interface that allows to check (and later update) the age of entries.
      Relying on stats alone can show activity but not actual age of the entry,
      furthermore when there're tens of thousands of entries a lot of the
      hardware implementations only support "hit" bits which are cleared on
      read to denote that the entry was active and shouldn't be aged out,
      these can then be naturally translated into age timestamp and will be
      compatible with the software forwarding age. Using a lastuse entry doesn't
      affect performance because the members in that cache line are written to
      along with the age.
      Since all new users are encouraged to use ipmr via netlink, this is
      exported via the RTA_EXPIRES attribute.
      Also do a minor local variable declaration style adjustment - arrange them
      longest to shortest.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Shrijeet Mukherjee <shm@cumulusnetworks.com>
      CC: Satish Ashok <sashok@cumulusnetworks.com>
      CC: Donald Sharp <sharpd@cumulusnetworks.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43b9e127
  13. 16 7月, 2016 2 次提交
    • R
      tcp_timer.c: Add kernel-doc function descriptions · c380d37e
      Richard Sailer 提交于
      This adds kernel-doc style descriptions for 6 functions and
      fixes 1 typo.
      Signed-off-by: NRichard Sailer <richard@weltraumpflege.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c380d37e
    • J
      tcp: enable per-socket rate limiting of all 'challenge acks' · 083ae308
      Jason Baron 提交于
      The per-socket rate limit for 'challenge acks' was introduced in the
      context of limiting ack loops:
      
      commit f2b2c582 ("tcp: mitigate ACK loops for connections as tcp_sock")
      
      And I think it can be extended to rate limit all 'challenge acks' on a
      per-socket basis.
      
      Since we have the global tcp_challenge_ack_limit, this patch allows for
      tcp_challenge_ack_limit to be set to a large value and effectively rely on
      the per-socket limit, or set tcp_challenge_ack_limit to a lower value and
      still prevents a single connections from consuming the entire challenge ack
      quota.
      
      It further moves in the direction of eliminating the global limit at some
      point, as Eric Dumazet has suggested. This a follow-up to:
      Subject: tcp: make challenge acks less predictable
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Yue Cao <ycao009@ucr.edu>
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      083ae308
  14. 12 7月, 2016 5 次提交
    • P
      ipv4: af_inet: make it explicitly non-modular · d3fc0353
      Paul Gortmaker 提交于
      The Makefile controlling compilation of this file is obj-y,
      meaning that it currently is never being built as a module.
      
      Since MODULE_ALIAS is a no-op for non-modular code, we can simply
      remove the MODULE_ALIAS_NETPROTO variant used here.
      
      We replace module.h with kmod.h since the file does make use of
      request_module() in order to load other modules from here.
      
      We don't have to worry about init.h coming in via the removed
      module.h since the file explicitly includes init.h already.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: netdev@vger.kernel.org
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3fc0353
    • J
      ipv4: reject RTNH_F_DEAD and RTNH_F_LINKDOWN from user space · 80610229
      Julian Anastasov 提交于
      Vegard Nossum is reporting for a crash in fib_dump_info
      when nh_dev = NULL and fib_nhs == 1:
      
      Pid: 50, comm: netlink.exe Not tainted 4.7.0-rc5+
      RIP: 0033:[<00000000602b3d18>]
      RSP: 0000000062623890  EFLAGS: 00010202
      RAX: 0000000000000000 RBX: 000000006261b800 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000024 RDI: 000000006245ba00
      RBP: 00000000626238f0 R08: 000000000000029c R09: 0000000000000000
      R10: 0000000062468038 R11: 000000006245ba00 R12: 000000006245ba00
      R13: 00000000625f96c0 R14: 00000000601e16f0 R15: 0000000000000000
      Kernel panic - not syncing: Kernel mode fault at addr 0x2e0, ip 0x602b3d18
      CPU: 0 PID: 50 Comm: netlink.exe Not tainted 4.7.0-rc5+ #581
      Stack:
       626238f0 960226a02 00000400 000000fe
       62623910 600afca7 62623970 62623a48
       62468038 00000018 00000000 00000000
      Call Trace:
       [<602b3e93>] rtmsg_fib+0xd3/0x190
       [<602b6680>] fib_table_insert+0x260/0x500
       [<602b0e5d>] inet_rtm_newroute+0x4d/0x60
       [<60250def>] rtnetlink_rcv_msg+0x8f/0x270
       [<60267079>] netlink_rcv_skb+0xc9/0xe0
       [<60250d4b>] rtnetlink_rcv+0x3b/0x50
       [<60265400>] netlink_unicast+0x1a0/0x2c0
       [<60265e47>] netlink_sendmsg+0x3f7/0x470
       [<6021dc9a>] sock_sendmsg+0x3a/0x90
       [<6021e0d0>] ___sys_sendmsg+0x300/0x360
       [<6021fa64>] __sys_sendmsg+0x54/0xa0
       [<6021fac0>] SyS_sendmsg+0x10/0x20
       [<6001ea68>] handle_syscall+0x88/0x90
       [<600295fd>] userspace+0x3fd/0x500
       [<6001ac55>] fork_handler+0x85/0x90
      
      $ addr2line -e vmlinux -i 0x602b3d18
      include/linux/inetdevice.h:222
      net/ipv4/fib_semantics.c:1264
      
      Problem happens when RTNH_F_LINKDOWN is provided from user space
      when creating routes that do not use the flag, catched with
      netlink fuzzer.
      
      Currently, the kernel allows user space to set both flags
      to nh_flags and fib_flags but this is not intentional, the
      assumption was that they are not set. Fix this by rejecting
      both flags with EINVAL.
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      Fixes: 0eeb075f ("net: ipv4 sysctl option to ignore routes when nexthop link is down")
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
      Cc: Dinesh Dutt <ddutt@cumulusnetworks.com>
      Cc: Scott Feldman <sfeldma@gmail.com>
      Reviewed-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80610229
    • E
      tcp: make challenge acks less predictable · 75ff39cc
      Eric Dumazet 提交于
      Yue Cao claims that current host rate limiting of challenge ACKS
      (RFC 5961) could leak enough information to allow a patient attacker
      to hijack TCP sessions. He will soon provide details in an academic
      paper.
      
      This patch increases the default limit from 100 to 1000, and adds
      some randomization so that the attacker can no longer hijack
      sessions without spending a considerable amount of probes.
      
      Based on initial analysis and patch from Linus.
      
      Note that we also have per socket rate limiting, so it is tempting
      to remove the host limit in the future.
      
      v2: randomize the count of challenge acks per second, not the period.
      
      Fixes: 282f23c6 ("tcp: implement RFC 5961 3.2")
      Reported-by: NYue Cao <ycao009@ucr.edu>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      75ff39cc
    • S
      tunnels: correct conditional build of MPLS and IPv6 · aa9667e7
      Simon Horman 提交于
      Using a combination if #if conditionals and goto labels to unwind
      tunnel4_init seems unwieldy. This patch takes a simpler approach of
      directly unregistering previously registered protocols when an error
      occurs.
      
      This fixes a number of problems with the current implementation
      including the potential presence of labels when they are unused
      and the potential absence of unregister code when it is needed.
      
      Fixes: 8afe97e5 ("tunnels: support MPLS over IPv4 tunnels")
      Signed-off-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aa9667e7
    • M
      udp: prevent bugcheck if filter truncates packet too much · a6127697
      Michal Kubeček 提交于
      If socket filter truncates an udp packet below the length of UDP header
      in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a
      BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if
      kernel is configured that way) can be easily enforced by an unprivileged
      user which was reported as CVE-2016-6162. For a reproducer, see
      http://seclists.org/oss-sec/2016/q3/8
      
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Reported-by: NMarco Grassi <marco.gra@gmail.com>
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6127697
  15. 11 7月, 2016 1 次提交
    • L
      netfilter: conntrack: fix race between nf_conntrack proc read and hash resize · 64b87639
      Liping Zhang 提交于
      When we do "cat /proc/net/nf_conntrack", and meanwhile resize the conntrack
      hash table via /sys/module/nf_conntrack/parameters/hashsize, race will
      happen, because reader can observe a newly allocated hash but the old size
      (or vice versa). So oops will happen like follows:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000017
        IP: [<ffffffffa0418e21>] seq_print_acct+0x11/0x50 [nf_conntrack]
        Call Trace:
        [<ffffffffa0412f4e>] ? ct_seq_show+0x14e/0x340 [nf_conntrack]
        [<ffffffff81261a1c>] seq_read+0x2cc/0x390
        [<ffffffff812a8d62>] proc_reg_read+0x42/0x70
        [<ffffffff8123bee7>] __vfs_read+0x37/0x130
        [<ffffffff81347980>] ? security_file_permission+0xa0/0xc0
        [<ffffffff8123cf75>] vfs_read+0x95/0x140
        [<ffffffff8123e475>] SyS_read+0x55/0xc0
        [<ffffffff817c2572>] entry_SYSCALL_64_fastpath+0x1a/0xa4
      
      It is very easy to reproduce this kernel crash.
      1. open one shell and input the following cmds:
        while : ; do
          echo $RANDOM > /sys/module/nf_conntrack/parameters/hashsize
        done
      2. open more shells and input the following cmds:
        while : ; do
          cat /proc/net/nf_conntrack
        done
      3. just wait a monent, oops will happen soon.
      
      The solution in this patch is based on Florian's Commit 5e3c61f9
      ("netfilter: conntrack: fix lookup race during hash resize"). And
      add a wrapper function nf_conntrack_get_ht to get hash and hsize
      suggested by Florian Westphal.
      Signed-off-by: NLiping Zhang <liping.zhang@spreadtrum.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      64b87639
  16. 10 7月, 2016 3 次提交
  17. 07 7月, 2016 1 次提交
  18. 03 7月, 2016 1 次提交
    • J
      netfilter: Convert FWINV<[foo]> macros and uses to NF_INVF · c37a2dfa
      Joe Perches 提交于
      netfilter uses multiple FWINV #defines with identical form that hide a
      specific structure variable and dereference it with a invflags member.
      
      $ git grep "#define FWINV"
      include/linux/netfilter_bridge/ebtables.h:#define FWINV(bool,invflg) ((bool) ^ !!(info->invflags & invflg))
      net/bridge/netfilter/ebtables.c:#define FWINV2(bool, invflg) ((bool) ^ !!(e->invflags & invflg))
      net/ipv4/netfilter/arp_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg)))
      net/ipv4/netfilter/ip_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg)))
      net/ipv6/netfilter/ip6_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ip6info->invflags & (invflg)))
      net/netfilter/xt_tcpudp.c:#define FWINVTCP(bool, invflg) ((bool) ^ !!(tcpinfo->invflags & (invflg)))
      
      Consolidate these macros into a single NF_INVF macro.
      
      Miscellanea:
      
      o Neaten the alignment around these uses
      o A few lines are > 80 columns for intelligibility
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      c37a2dfa
  19. 01 7月, 2016 2 次提交
  20. 30 6月, 2016 2 次提交
    • S
      ipv4: Fix ip_skb_dst_mtu to use the sk passed by ip_finish_output · fedbb6b4
      Shmulik Ladkani 提交于
      ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
      calls ip_sk_use_pmtu which casts sk as an inet_sk).
      
      However, in the case of UDP tunneling, the skb->sk is not necessarily an
      inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
      tun/tap).
      
      OTOH, the sk passed as an argument throughout IP stack's output path is
      the one which is of PMTU interest:
       - In case of local sockets, sk is same as skb->sk;
       - In case of a udp tunnel, sk is the tunneling socket.
      
      Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
      This augments 7026b1dd 'netfilter: Pass socket pointer down through okfn().'
      Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
      Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fedbb6b4
    • A
      tcp: add an ability to dump and restore window parameters · b1ed4c4f
      Andrey Vagin 提交于
      We found that sometimes a restored tcp socket doesn't work.
      
      A reason of this bug is incorrect window parameters and in this case
      tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
      other side drops packets with this seq, because seq is less than
      tp->rcv_nxt ( tcp_sequence() ).
      
      Data from a send queue is sent only if there is enough space in a
      window, so when we restore unacked data, we need to expand a window to
      fit this data.
      
      This was in a first version of this patch:
      "tcp: extend window to fit all restored unacked data in a send queue"
      
      Then Alexey recommended me to restore window parameters instead of
      adjusted them according with data in a sent queue. This sounds resonable.
      
      rcv_wnd has to be restored, because it was reported to another side
      and the offered window is never shrunk.
      One of reasons why we need to restore snd_wnd was described above.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1ed4c4f
  21. 29 6月, 2016 1 次提交
  22. 28 6月, 2016 4 次提交
  23. 24 6月, 2016 1 次提交
  24. 23 6月, 2016 1 次提交