1. 05 Jun, 2017 4 commits
    • skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow · 48a1df65
      Jason A. Donenfeld committed
      This is a defense-in-depth measure in response to bugs like
      4d6fa57b ("macsec: avoid heap overflow in skb_to_sgvec"). There's
      not only a potential overflow of sglist items, but also a stack overflow
      potential, so we fix this by limiting the amount of recursion this function
      is allowed to do. Not actually providing a bounded base case is a future
      disaster that we can easily avoid here.
      
      As a small matter of housekeeping, we take this opportunity to move the
      documentation comment over the actual function the documentation is for.
      
      While this could be implemented by using an explicit stack of skbuffs,
      when implementing this, the function complexity increased considerably,
      and I don't think such complexity and bloat is actually worth it. So,
      instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS,
      and measured the stack usage there. I also reverted the recent MIPS
      changes that give it a separate IRQ stack, so that I could experience
      some worst-case situations. I found that limiting it to 24 layers deep
      yielded a good stack usage with room for safety, as well as being much
      deeper than any driver actually ever creates.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
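The bounded-recursion idea in the skb_to_sgvec commit can be illustrated with a minimal user-space sketch. `struct buf`, `count_entries()` and `MAX_DEPTH` are hypothetical names chosen for illustration and are not the kernel's code; only the depth limit of 24 and the -EMSGSIZE return value come from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer with a nested child buffer.
 * This is not the kernel's struct sk_buff. */
struct buf {
    int nfrags;           /* scatterlist entries this buffer contributes */
    struct buf *child;    /* nested buffer, or NULL */
};

#define MAX_DEPTH 24      /* depth limit quoted in the commit message */

/* Count the entries needed for b and everything nested under it.
 * Returns the count, or -EMSGSIZE if the nesting is deeper than
 * MAX_DEPTH or the caller's table of max_entries slots would overflow. */
static int count_entries(const struct buf *b, int max_entries,
                         unsigned int level)
{
    int used, ret;

    assert(b != NULL);
    if (level >= MAX_DEPTH)
        return -EMSGSIZE;         /* bounded base case for the recursion */
    if (b->nfrags > max_entries)
        return -EMSGSIZE;         /* table overflow */

    used = b->nfrags;
    if (b->child) {
        ret = count_entries(b->child, max_entries - used, level + 1);
        if (ret < 0)
            return ret;           /* propagate the error upward */
        used += ret;
    }
    return used;
}
```

A caller starts at level 0 and treats any negative return value as a failure instead of writing past the end of its table.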
    • neigh: Really delete an arp/neigh entry on "ip neigh delete" or "arp -d" · 5071034e
      Sowmini Varadhan committed
      The command
        # arp -s 62.2.0.1 a:b:c:d:e:f dev eth2
      adds an entry like the following (listed by "arp -an")
        ? (62.2.0.1) at 0a:0b:0c:0d:0e:0f [ether] PERM on eth2
      but the symmetric deletion command
        # arp -i eth2 -d 62.2.0.1
      does not remove the PERM entry from the table, and instead leaves behind
        ? (62.2.0.1) at <incomplete> on eth2
      
      The reason is that there is a refcnt of 1 for the arp_tbl itself
      (neigh_alloc starts off the entry with a refcnt of 1), thus
      the neigh_release() call from arp_invalidate() will (at best) just
      decrement the ref to 1, but will never actually free it from the
      table.
      
      To fix this, we need to do something like neigh_forced_gc: if
      the refcnt is 1 (i.e., only the table's ref), remove the entry from
      the table and free it. This patch refactors and shares common code
      between neigh_forced_gc and the newly added neigh_remove_one.
      
      A similar issue exists for IPv6 Neighbor Cache entries, and is fixed
      in a similar manner by this patch.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Reviewed-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
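The removal rule described in the neigh commit (unlink and free an entry only when the table holds the last reference) might look roughly like this in user space. `struct entry`, `struct table`, `table_add()` and `table_remove_one()` are hypothetical names for this sketch; the kernel's neigh code uses its own types and locking, which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct entry {
    int refcnt;             /* 1 means only the table holds a reference */
    int key;
    struct entry *next;
};

struct table {
    struct entry *head;
};

/* Insert a new entry; it starts with the table's own reference. */
static struct entry *table_add(struct table *t, int key)
{
    struct entry *e = calloc(1, sizeof(*e));

    assert(t != NULL);
    if (!e)
        return NULL;
    e->refcnt = 1;
    e->key = key;
    e->next = t->head;
    t->head = e;
    return e;
}

/* Unlink and free the entry with this key, but only if nobody besides
 * the table still references it. Returns true if it was removed. */
static bool table_remove_one(struct table *t, int key)
{
    struct entry **pp;

    assert(t != NULL);
    for (pp = &t->head; *pp; pp = &(*pp)->next) {
        struct entry *e = *pp;

        if (e->key != key)
            continue;
        if (e->refcnt != 1)
            return false;   /* still in use elsewhere: leave it alone */
        *pp = e->next;      /* unlink from the table */
        free(e);
        return true;
    }
    return false;           /* no such entry */
}
```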
    • net-procfs: Use vsnprintf extension %phN · fbd0ac60
      Joe Perches committed
      Save a bit of code by using the kernel extension.
      
      $ size net/core/net-procfs.o*
         text	   data	    bss	    dec	    hex	filename
         3701	    120	      0	   3821	    eed	net/core/net-procfs.o.new
         3764	    120	      0	   3884	    f2c	net/core/net-procfs.o.old
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
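The kernel's `%*phN` format prints a small buffer as hex bytes with no separator. Since that extension does not exist in user-space printf, the following is only an emulation of its output; `hex_no_sep()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write len bytes of buf into out as lowercase hex with no separator,
 * the way the kernel's "%*phN" renders a buffer.
 * Returns the number of characters written, or -1 if out is too small. */
static int hex_no_sep(char *out, size_t outlen,
                      const unsigned char *buf, size_t len)
{
    size_t i, pos = 0;

    assert(out != NULL && buf != NULL);
    if (outlen == 0)
        return -1;
    out[0] = '\0';
    for (i = 0; i < len; i++) {
        if (pos + 3 > outlen)   /* two hex digits plus the trailing NUL */
            return -1;
        snprintf(out + pos, outlen - pos, "%02x", buf[i]);
        pos += 2;
    }
    return (int)pos;
}
```

Inside the kernel the same result is a single format specifier, which is where the code size saving shown above comes from.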
    • net/flow_dissector: add support for dissection of misc ip header fields · 518d8a2e
      Or Gerlitz committed
      Add support for dissection of ip tos and ttl and ipv6 traffic-class
      and hoplimit. Both are dissected into the same struct.
      
      Uses a call to the ip dissection function similar to those used for tcp, arp and others.
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
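As an illustration of dissecting both IP versions into the same struct, here is a minimal sketch that reads the fields straight from raw header bytes. `struct ip_misc`, `dissect_ipv4()` and `dissect_ipv6()` are hypothetical names; the byte offsets follow the standard IPv4 and IPv6 header layouts, not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One struct for both families: tos also holds the IPv6 traffic class,
 * ttl also holds the IPv6 hop limit. */
struct ip_misc {
    uint8_t tos;
    uint8_t ttl;
};

/* IPv4: TOS is byte 1 and TTL is byte 8 of the header. */
static int dissect_ipv4(const uint8_t *hdr, size_t len, struct ip_misc *out)
{
    assert(hdr != NULL && out != NULL);
    if (len < 20 || (hdr[0] >> 4) != 4)
        return -1;
    out->tos = hdr[1];
    out->ttl = hdr[8];
    return 0;
}

/* IPv6: the traffic class straddles bytes 0 and 1 (the 8 bits after the
 * 4-bit version field); the hop limit is byte 7. */
static int dissect_ipv6(const uint8_t *hdr, size_t len, struct ip_misc *out)
{
    assert(hdr != NULL && out != NULL);
    if (len < 40 || (hdr[0] >> 4) != 6)
        return -1;
    out->tos = (uint8_t)((hdr[0] << 4) | (hdr[1] >> 4));
    out->ttl = hdr[7];
    return 0;
}
```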
  2. 01 Jun, 2017 2 commits
  3. 30 May, 2017 2 commits
  4. 28 May, 2017 1 commit
  5. 27 May, 2017 1 commit
    • ipv4: add reference counting to metrics · 3fb07daf
      Eric Dumazet committed
      Andrey Konovalov reported crashes in ipv4_mtu().
      
      I could reproduce the issue with KASAN kernels, between
      10.246.7.151 and 10.246.7.152 :
      
      1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &
      
      2) At the same time, run the following loop:
      while :
      do
       ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
       ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
      done
      
      Cong Wang attempted to add back rt->fi in commit
      82486aa6 ("ipv4: restore rt->fi for reference counting")
      but this proved to add some issues that were complex to solve.
      
      Instead, I suggested adding a refcount to the metrics themselves,
      which are a standalone object (in particular, no reference to other objects).
      
      I tried to make this patch as small as possible to ease its backport,
      instead of being super clean. Note that we believe that only ipv4 dst
      needs to take care of the metric refcount. But if this is wrong,
      this patch adds the basic infrastructure to extend this to other
      families.
      
      Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
      for his efforts on this problem.
      
      Fixes: 2860583f ("ipv4: Kill rt->fi")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 26 May, 2017 2 commits
  7. 25 May, 2017 2 commits
    • net: flow_dissector: add support for dissection of tcp flags · ac4bb5de
      Jiri Pirko committed
      Add support for dissection of tcp flags. Uses a function call to the
      tcp dissection function similar to those used for arp, mpls and others.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
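To make "dissection of tcp flags" concrete, the sketch below pulls the flag bits out of a raw TCP header. `dissect_tcp_flags()` and the `TCPF_*` constants are hypothetical names; the layout assumed is the standard one, where the flags occupy the low 12 bits of the 16-bit word at byte offset 12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag masks in host byte order. */
#define TCPF_FIN 0x001
#define TCPF_SYN 0x002
#define TCPF_RST 0x004
#define TCPF_ACK 0x010

/* Extract the 12 flag bits that follow the 4-bit data offset field.
 * Returns 0 on success, -1 if the buffer is shorter than a TCP header. */
static int dissect_tcp_flags(const uint8_t *hdr, size_t len, uint16_t *flags)
{
    assert(hdr != NULL && flags != NULL);
    if (len < 20)
        return -1;
    *flags = (uint16_t)(((hdr[12] & 0x0f) << 8) | hdr[13]);
    return 0;
}
```

A classifier can then match on a value and mask pair, for example requiring SYN set and ACK clear.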
    • net: rtnetlink: bail out from rtnl_fdb_dump() on parse error · 0ff50e83
      Alexander Potapenko committed
      rtnl_fdb_dump() failed to check the result of nlmsg_parse(), which led
      to contents of |ifm| being uninitialized because nlh->nlmsg_len was too
      small to accommodate |ifm|. The uninitialized data may affect some
      branches and result in unwanted effects, although kernel data doesn't
      seem to leak to the userspace directly.
      
      The bug has been detected with KMSAN and syzkaller.
      
      For the record, here is the KMSAN report:
      
      ==================================================================
      BUG: KMSAN: use of unitialized memory in rtnl_fdb_dump+0x5dc/0x1000
      CPU: 0 PID: 1039 Comm: probe Not tainted 4.11.0-rc5+ #2727
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16
       dump_stack+0x143/0x1b0 lib/dump_stack.c:52
       kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:1007
       __kmsan_warning_32+0x66/0xb0 mm/kmsan/kmsan_instr.c:491
       rtnl_fdb_dump+0x5dc/0x1000 net/core/rtnetlink.c:3230
       netlink_dump+0x84f/0x1190 net/netlink/af_netlink.c:2168
       __netlink_dump_start+0xc97/0xe50 net/netlink/af_netlink.c:2258
       netlink_dump_start ./include/linux/netlink.h:165
       rtnetlink_rcv_msg+0xae9/0xb40 net/core/rtnetlink.c:4094
       netlink_rcv_skb+0x339/0x5a0 net/netlink/af_netlink.c:2339
       rtnetlink_rcv+0x83/0xa0 net/core/rtnetlink.c:4110
       netlink_unicast_kernel net/netlink/af_netlink.c:1272
       netlink_unicast+0x13b7/0x1480 net/netlink/af_netlink.c:1298
       netlink_sendmsg+0x10b8/0x10f0 net/netlink/af_netlink.c:1844
       sock_sendmsg_nosec net/socket.c:633
       sock_sendmsg net/socket.c:643
       ___sys_sendmsg+0xd4b/0x10f0 net/socket.c:1997
       __sys_sendmsg net/socket.c:2031
       SYSC_sendmsg+0x2c6/0x3f0 net/socket.c:2042
       SyS_sendmsg+0x87/0xb0 net/socket.c:2038
       do_syscall_64+0x102/0x150 arch/x86/entry/common.c:285
       entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:246
      RIP: 0033:0x401300
      RSP: 002b:00007ffc3b0e6d58 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00000000004002b0 RCX: 0000000000401300
      RDX: 0000000000000000 RSI: 00007ffc3b0e6d80 RDI: 0000000000000003
      RBP: 00007ffc3b0e6e00 R08: 000000000000000b R09: 0000000000000004
      R10: 000000000000000d R11: 0000000000000246 R12: 0000000000000000
      R13: 00000000004065a0 R14: 0000000000406630 R15: 0000000000000000
      origin: 000000008fe00056
       save_stack_trace+0x59/0x60 arch/x86/kernel/stacktrace.c:59
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:352
       kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:247
       kmsan_poison_shadow+0x6d/0xc0 mm/kmsan/kmsan.c:260
       slab_alloc_node mm/slub.c:2743
       __kmalloc_node_track_caller+0x1f4/0x390 mm/slub.c:4349
       __kmalloc_reserve net/core/skbuff.c:138
       __alloc_skb+0x2cd/0x740 net/core/skbuff.c:231
       alloc_skb ./include/linux/skbuff.h:933
       netlink_alloc_large_skb net/netlink/af_netlink.c:1144
       netlink_sendmsg+0x934/0x10f0 net/netlink/af_netlink.c:1819
       sock_sendmsg_nosec net/socket.c:633
       sock_sendmsg net/socket.c:643
       ___sys_sendmsg+0xd4b/0x10f0 net/socket.c:1997
       __sys_sendmsg net/socket.c:2031
       SYSC_sendmsg+0x2c6/0x3f0 net/socket.c:2042
       SyS_sendmsg+0x87/0xb0 net/socket.c:2038
       do_syscall_64+0x102/0x150 arch/x86/entry/common.c:285
       return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:246
      ==================================================================
      
      and the reproducer:
      
      ==================================================================
        #include <string.h>
        #include <sys/socket.h>
        #include <sys/uio.h>
        #include <net/if_arp.h>
        #include <linux/netlink.h>
        #include <stdint.h>
      
        int main(void)
        {
          // Protocol 0 is NETLINK_ROUTE.
          int sock = socket(PF_NETLINK, SOCK_DGRAM | SOCK_NONBLOCK, 0);
          struct msghdr msg;
          memset(&msg, 0, sizeof(msg));
          char nlmsg_buf[32];
          memset(nlmsg_buf, 0, sizeof(nlmsg_buf));
          struct nlmsghdr *nlmsg = (struct nlmsghdr *)nlmsg_buf;
          nlmsg->nlmsg_len = 0x11; // 17: the 16-byte header plus one byte
          nlmsg->nlmsg_type = 0x1e; // RTM_GETNEIGH = RTM_BASE + 0x0e
          // type = 0x0e = 1110b
          // kind = 2
          nlmsg->nlmsg_flags = 0x101; // NLM_F_ROOT | NLM_F_REQUEST
          nlmsg->nlmsg_seq = 0;
          nlmsg->nlmsg_pid = 0;
          nlmsg_buf[16] = (char)7; // family byte: AF_BRIDGE
          struct iovec iov;
          iov.iov_base = nlmsg_buf;
          iov.iov_len = 17;
          msg.msg_iov = &iov;
          msg.msg_iovlen = 1;
          sendmsg(sock, &msg, 0);
          return 0;
        }
      ==================================================================
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
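The class of bug fixed in rtnl_fdb_dump() is reading a fixed-size payload without checking that the message is long enough to contain it. Below is a minimal user-space sketch of such a check. `struct msg_hdr`, `struct if_payload` and `parse_request()` are hypothetical stand-ins sized like `struct nlmsghdr` and `struct ifinfomsg`; the kernel fix itself checks the return value of nlmsg_parse() instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16-byte message header, laid out like struct nlmsghdr. */
struct msg_hdr {
    uint32_t len;
    uint16_t type;
    uint16_t flags;
    uint32_t seq;
    uint32_t pid;
};

/* 16-byte payload, laid out like struct ifinfomsg. */
struct if_payload {
    uint8_t  family;
    uint8_t  pad;
    uint16_t type;
    int32_t  index;
    uint32_t flags;
    uint32_t change;
};

/* Copy the payload out only if the declared length covers header plus
 * payload and fits in the buffer. Returns 0, or -EINVAL with *out
 * left untouched. */
static int parse_request(const uint8_t *buf, size_t buflen,
                         struct if_payload *out)
{
    struct msg_hdr h;

    assert(buf != NULL && out != NULL);
    if (buflen < sizeof(h))
        return -EINVAL;
    memcpy(&h, buf, sizeof(h));
    if (h.len > buflen || h.len < sizeof(h) + sizeof(*out))
        return -EINVAL;   /* too short: never read the payload */
    memcpy(out, buf + sizeof(h), sizeof(*out));
    return 0;
}
```

With the reproducer's declared length of 0x11 this returns -EINVAL rather than handing back a partly uninitialized payload.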
  8. 22 May, 2017 4 commits
  9. 20 May, 2017 5 commits
  10. 18 May, 2017 4 commits
  11. 17 May, 2017 4 commits
    • neighbour: update neigh timestamps iff update is effective · 77d71233
      Ihar Hrachyshka committed
      It's a common practice to send gratuitous ARPs after moving an
      IP address to another device to speed up healing of a service. To
      fulfill service availability constraints, the timing of network peers
      updating their caches to point to a new location of an IP address can be
      particularly important.
      
      Sometimes neigh_update calls will touch neither lladdr nor state, for
      example if an update arrives in the locktime interval. The neigh->updated
      value is tested by the protocol specific neigh code, which in turn
      will influence whether NEIGH_UPDATE_F_OVERRIDE gets set in the
      call to neigh_update() or not. As a result, we may effectively ignore
      the update request, bailing out of touching the neigh entry, except that
      we still bump its timestamps inside neigh_update.
      
      This may be a problem for updates arriving in quick succession. For
      example, consider the following scenario:
      
      A service is moved to another device with its IP address. The new device
      sends three gratuitous ARP requests into the network with a ~1 second
      interval between them. Just before the first request arrives at one of
      the network peer nodes, its neigh entry for the IP address transitions from
      STALE to DELAY. This transition, among other things, updates
      neigh->updated. Once the kernel receives the first gratuitous ARP, it
      ignores it because its arrival time is inside the locktime interval. The
      kernel still bumps neigh->updated. Then the second gratuitous ARP
      request arrives, and it's also ignored because it's still in the (new)
      locktime interval. The same happens for the third request. The node
      eventually heals itself (after delay_first_probe_time seconds since the
      initial transition to DELAY state), but it has wasted some time and
      requires a new ARP request/reply round trip. This unfortunate behaviour
      both puts more load on the network and reduces service
      availability.
      
      This patch changes neigh_update so that it bumps neigh->updated (as well
      as neigh->confirmed) only once we are sure that either lladdr or entry
      state will change. In the scenario described above, it means that the
      second gratuitous ARP request will actually update the entry lladdr.
      
      Ideally, we would update the neigh entry on the very first gratuitous
      ARP request. The locktime mechanism is designed to ignore ARP updates in
      a short timeframe after a previous ARP update was honoured by the kernel
      layer. This would require tracking timestamps for state transitions
      separately from timestamps when actual updates are received. This would
      probably involve changes in the neighbour struct. Therefore, the patch
      doesn't tackle the issue of the first gratuitous ARP being ignored, leaving
      it for a follow-up.
      Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
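The rule "bump the timestamp only if the update is effective" can be sketched as follows. `struct nentry`, `entry_update()`, `on_gratuitous_arp()` and the state constants are hypothetical and heavily simplified; in particular, the locktime test stands in for the protocol-specific decision about whether the override flag is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { ST_STALE, ST_DELAY };   /* hypothetical subset of entry states */

struct nentry {
    unsigned char lladdr[6];
    int state;
    unsigned long updated;     /* time of the last effective update */
};

/* Apply an update. The timestamp moves only when lladdr or state
 * really changes; an ignored request leaves it alone. */
static bool entry_update(struct nentry *n, const unsigned char *lladdr,
                         int new_state, bool override, unsigned long now)
{
    bool addr_changes, state_changes;

    assert(n != NULL && lladdr != NULL);
    addr_changes = memcmp(n->lladdr, lladdr, 6) != 0;
    state_changes = n->state != new_state;

    if (addr_changes && !override)
        return false;          /* ignored: do not touch the timestamp */
    if (!addr_changes && !state_changes)
        return false;          /* nothing would change */

    memcpy(n->lladdr, lladdr, 6);
    n->state = new_state;
    n->updated = now;
    return true;
}

/* Caller side: honour a new address only outside the locktime window. */
static bool on_gratuitous_arp(struct nentry *n, const unsigned char *lladdr,
                              unsigned long now, unsigned long locktime)
{
    bool override = (now - n->updated) >= locktime;

    return entry_update(n, lladdr, ST_STALE, override, now);
}
```

Because the first, ignored request no longer moves `updated`, the second request falls outside the original locktime window and is honoured.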
    • tcp: internal implementation for pacing · 218af599
      Eric Dumazet committed
      BBR congestion control depends on pacing, and pacing is
      currently handled by sch_fq packet scheduler for performance reasons,
      and also because implementing pacing with FQ was convenient to truly
      avoid bursts.
      
      However there are many cases where this packet scheduler constraint
      is not practical.
      - Many linux hosts are not focusing on handling thousands of TCP
        flows in the most efficient way.
      - Some routers use fq_codel or other AQM, but still would like
        to use BBR for the few TCP flows they initiate/terminate.
      
      This patch implements an automatic fallback to internal pacing.
      
      Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
      
      If sch_fq happens to be in the egress path, pacing is delegated to
      the qdisc, otherwise pacing is done by TCP itself.
      
      One advantage of pacing from TCP stack is to get more precise rtt
      estimations, and less work done from TX completion, since TCP Small
      queue limits are not generally hit. Setups with single TX queue but
      many cpus might even benefit from this.
      
      Note that unlike sch_fq, we do not take into account header sizes.
      Taking care of these headers would add additional complexity for
      no practical differences in behavior.
      
      Some performance numbers using 800 TCP_STREAM flows rate limited to
      ~48 Mbit per second on 40Gbit NIC.
      
      If MQ+pfifo_fast is used on the NIC :
      
      $ sar -n DEV 1 5 | grep eth
      14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
      14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
      14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
      14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
      14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
      Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
       4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
       0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
       1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
       1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0
      
      Now use MQ+FQ :
      
      lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
      lpaa23:~# tc qdisc replace dev eth0 root mq
      
      $ sar -n DEV 1 5 | grep eth
      14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
      14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
      14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
      14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
      14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
      Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
       1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
       0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
       4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
       3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0
      
      As expected, number of interrupts per second is very different.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
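The arithmetic behind pacing is simple: after sending a packet, wait as long as that packet would take at the target rate. The sketch below shows the calculation only; `pacing_delay_ns()` is a hypothetical helper, treating a rate of 0 as "unpaced" is an assumption of this sketch, and header sizes are left out, as the commit message says they are.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Delay to insert after sending len_bytes of payload so that the flow
 * averages rate_bytes_per_sec. A rate of 0 means no pacing here. */
static uint64_t pacing_delay_ns(uint32_t len_bytes,
                                uint64_t rate_bytes_per_sec)
{
    if (rate_bytes_per_sec == 0)
        return 0;
    return (uint64_t)len_bytes * NSEC_PER_SEC / rate_bytes_per_sec;
}
```

For the ~48 Mbit per second flows used in the measurements above, that is 6,000,000 bytes per second, so a 1500-byte packet is followed by a 250,000 ns gap.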
    • net/sock: factor out dequeue/peek with offset code · 65101aec
      Paolo Abeni committed
      And update __sk_queue_drop_skb() to work on the specified queue.
      This will help the udp protocol to use an additional private
      rx queue in a later patch.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Improve handling of failures on link and route dumps · f6c5775f
      David Ahern committed
      In general, rtnetlink dumps do not anticipate failure to dump a single
      object (e.g., link or route) on a single pass. As both route and link
      objects have grown via more attributes, that is no longer a given.
      
      netlink dumps can handle a failure if the dump function returns an
      error; specifically, netlink_dump adds the return code to the response
      if it is <= 0 so userspace is notified of the failure. The missing
      piece is the rtnetlink dump functions returning the error.
      
      Fix route and link dump functions to return the errors if no object is
      added to an skb (detected by skb->len != 0). IPv6 route dumps
      (rt6_dump_route) already return the error; this patch updates IPv4 and
      link dumps. Other dump functions may need to be adjusted as well.
      Reported-by: Jan Moskyto Matejka <mq@ucw.cz>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 12 May, 2017 3 commits
    • netem: fix skb_orphan_partial() · f6ba8d33
      Eric Dumazet committed
      I should have known that lowering skb->truesize was dangerous :/
      
      In case packets are not leaving the host via a standard Ethernet device,
      but looped back to local sockets, bad things can happen, as reported
      by Michael Madsen ( https://bugzilla.kernel.org/show_bug.cgi?id=195713 )
      
      So instead of tweaking skb->truesize, let's change skb->destructor
      and keep a reference on the owner socket via its sk_refcnt.
      
      Fixes: f2f872f9 ("netem: Introduce skb_orphan_partial() helper")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Michael Madsen <mkm@nabto.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: refine xdp api with regards to generic xdp · d67b9cd2
      Daniel Borkmann committed
      While working on the iproute2 generic XDP frontend, I noticed that
      as of right now it's possible to have native *and* generic XDP
      programs loaded both at the same time for the case when a driver
      supports native XDP.
      
      The intended model for generic XDP from b5cdae32 ("net: Generic
      XDP") is, however, that only one out of the two can be present at
      once which is also indicated as such in the XDP netlink dump part.
      The main rationale for generic XDP is to ease accessibility (in
      case a driver does not yet have XDP support) and to generically
      provide a semantical model as an example for driver developers
      wanting to add XDP support. The generic XDP option for an XDP
      aware driver can still be useful for comparing and testing both
      implementations.
      
      However, it is not intended to have a second XDP processing stage
      or layer with exactly the same functionality as the first native
      stage. The only reason could be to have a partial fallback for future
      XDP features that are not supported yet in the native implementation,
      and we probably also shouldn't strive for such a fallback and instead
      encourage native feature support in the first place. Given there's
      currently no such fallback issue or use case, let's not go there yet
      if we don't need to.
      
      Therefore, change semantics for loading XDP and bail out if the
      user tries to load a generic XDP program when a native one is
      present and vice versa. An alternative to bailing out would
      be to handle the transition from one flavor to another gracefully,
      but that would require bringing the device down, exchanging both
      types of programs, and bringing it up again in order to avoid a tiny
      window where a packet could hit both hooks. Given this complicates
      the logic for just a debugging feature in the native case, I went
      with the simpler variant.
      
      For the dump, remove IFLA_XDP_FLAGS that was added with b5cdae32
      and reuse IFLA_XDP_ATTACHED for indicating the mode. Dumping all
      or just a subset of flags that were used for loading the XDP prog
      is suboptimal in the long run since not all flags are useful for
      dumping and if we start to reuse the same flag definitions for
      load and dump, then we'll waste bit space. What we really just
      want is to dump the mode for now.
      
      Current IFLA_XDP_ATTACHED semantics are: nothing was installed (0),
      a program is running at the native driver layer (1). Thus, add a
      mode that says that a program is running at generic XDP layer (2).
      Applications will handle this fine in that older binaries will
      just indicate that something is attached at XDP layer, effectively
      this is similar to IFLA_XDP_FLAGS attr that we would have had
      modulo the redundancy.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: add flag to enforce driver mode · 0489df9a
      Daniel Borkmann committed
      After commit b5cdae32 ("net: Generic XDP") we automatically fall
      back to a generic XDP variant if the driver does not support native
      XDP. Allow for an option where the user can specify that the native
      XDP variant should always be selected and, in case it's not supported
      by a driver, just bail out.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 09 May, 2017 2 commits
  14. 06 May, 2017 1 commit
    • tcp: randomize timestamps on syncookies · 84b114b9
      Eric Dumazet committed
      The whole point of randomization was to hide server uptime, but an attacker
      can simply start a syn flood and TCP generates 'old style' timestamps,
      directly revealing the server jiffies value.
      
      Also, the TSval sent by the server to a particular remote address varies
      depending on syncookies being sent or not, potentially triggering PAWS
      drops for innocent clients.
      
      Let's implement proper randomization, including for SYNcookies.
      
      Also we do not need to export sysctl_tcp_timestamps, since it is not
      used from a module.
      
      In v2, I added Florian's feedback and contribution, adding tsoff to
      tcp_get_cookie_sock().
      
      v3 removed one unused variable in tcp_v4_connect() as Florian spotted.
      
      Fixes: 95a22cae ("tcp: randomize tcp timestamp offsets for each connection")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Tested-by: Florian Westphal <fw@strlen.de>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 04 May, 2017 1 commit
  16. 02 May, 2017 2 commits