1. 04 11月, 2013 3 次提交
    • E
      net: extend net_device allocation to vmalloc() · 74d332c1
      Eric Dumazet 提交于
      Joby Poriyath provided a xen-netback patch to reduce the size of
      xenvif structure as some netdev allocation could fail under
      memory pressure/fragmentation.
      
      This patch is handling the problem at the core level, allowing
      any netdev structures to use vmalloc() if kmalloc() failed.
      
      As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT
      to kzalloc() flags to do this fallback only when really needed.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJoby Poriyath <joby.poriyath@citrix.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74d332c1
    • D
      net: sctp: fix and consolidate SCTP checksumming code · e6d8b64b
      Daniel Borkmann 提交于
      This fixes an outstanding bug found through IPVS, where SCTP packets
      with skb->data_len > 0 (non-linearized) and empty frag_list, but data
      accumulated in frags[] member, are forwarded with incorrect checksum
      letting SCTP initial handshake fail on some systems. Linearizing each
      SCTP skb in IPVS to prevent that would not be a good solution as
      this leads to an additional and unnecessary performance penalty on
      the load-balancer itself for no good reason (as we actually only want
      to update the checksum, and can do that in a different/better way
      presented here).
      
      The actual problem is elsewhere, namely, that SCTP's checksumming
      in sctp_compute_cksum() does not take frags[] into account like
      skb_checksum() does. So while we are fixing this up, we better reuse
      the existing code that we have anyway in __skb_checksum() and use it
      for walking through the data doing checksumming. This will not only
      fix this issue, but also consolidates some SCTP code with core
      sk_buff code, bringing it closer together and removing respectively
      avoiding reimplementation of skb_checksum() for no good reason.
      
      As crc32c() can use hardware implementation within the crypto layer,
      we leave that intact (it wraps around / falls back to e.g. slice-by-8
      algorithm in __crc32c_le() otherwise); plus use the __crc32c_le_combine()
      combinator for crc32c blocks.
      
      Also, we remove all other SCTP checksumming code, so that we only
      have to use sctp_compute_cksum() from now on; for doing that, we need
      to transform SCTP checkumming in output path slightly, and can leave
      the rest intact.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e6d8b64b
    • D
      net: skb_checksum: allow custom update/combine for walking skb · 2817a336
      Daniel Borkmann 提交于
      Currently, skb_checksum walks over 1) linearized, 2) frags[], and
      3) frag_list data and calculats the one's complement, a 32 bit
      result suitable for feeding into itself or csum_tcpudp_magic(),
      but unsuitable for SCTP as we're calculating CRC32c there.
      
      Hence, in order to not re-implement the very same function in
      SCTP (and maybe other protocols) over and over again, use an
      update() + combine() callback internally to allow for walking
      over the skb with different algorithms.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2817a336
  2. 31 10月, 2013 7 次提交
  3. 30 10月, 2013 2 次提交
    • Y
      tcp: temporarily disable Fast Open on SYN timeout · c968601d
      Yuchung Cheng 提交于
      Fast Open currently has a fall back feature to address SYN-data being
      dropped but it requires the middle-box to pass on regular SYN retry
      after SYN-data. This is implemented in commit aab48743 ("net-tcp:
      Fast Open client - detecting SYN-data drops")
      
      However some NAT boxes will drop all subsequent packets after first
      SYN-data and blackholes the entire connections.  An example is in
      commit 356d7d88 "netfilter: nf_conntrack: fix tcp_in_window for Fast
      Open".
      
      The sender should note such incidents and fall back to use the regular
      TCP handshake on subsequent attempts temporarily as well: after the
      second SYN timeouts the original Fast Open SYN is most likely lost.
      When such an event recurs Fast Open is disabled based on the number of
      recurrences exponentially.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c968601d
    • D
      net: sched: cls_bpf: add BPF-based classifier · 7d1d65cb
      Daniel Borkmann 提交于
      This work contains a lightweight BPF-based traffic classifier that can
      serve as a flexible alternative to ematch-based tree classification, i.e.
      now that BPF filter engine can also be JITed in the kernel. Naturally, tc
      actions and policies are supported as well with cls_bpf. Multiple BPF
      programs/filter can be attached for a class, or they can just as well be
      written within a single BPF program, that's really up to the user how he
      wishes to run/optimize the code, e.g. also for inversion of verdicts etc.
      The notion of a BPF program's return/exit codes is being kept as follows:
      
           0: No match
          -1: Select classid given in "tc filter ..." command
        else: flowid, overwrite the default one
      
      As a minimal usage example with iproute2, we use a 3 band prio root qdisc
      on a router with sfq each as leave, and assign ssh and icmp bpf-based
      filters to band 1, http traffic to band 2 and the rest to band 3. For the
      first two bands we load the bytecode from a file, in the 2nd we load it
      inline as an example:
      
      echo 1 > /proc/sys/net/core/bpf_jit_enable
      
      tc qdisc del dev em1 root
      tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      
      tc qdisc add dev em1 parent 1:1 sfq perturb 16
      tc qdisc add dev em1 parent 1:2 sfq perturb 16
      tc qdisc add dev em1 parent 1:3 sfq perturb 16
      
      tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
      tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
      tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
      tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3
      
      BPF programs can be easily created and passed to tc, either as inline
      'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
      compile opcodes, for example:
      
      1) People familiar with tcpdump-like filters:
      
         tcpdump -iem1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf
      
      2) People that want to low-level program their filters or use BPF
         extensions that lack support by libpcap's compiler:
      
         bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf
      
         ssh.ops example code:
         ldh [12]
         jne #0x800, drop
         ldb [23]
         jneq #6, drop
         ldh [20]
         jset #0x1fff, drop
         ldxb 4 * ([14] & 0xf)
         ldh [%x + 14]
         jeq #0x16, pass
         ldh [%x + 16]
         jne #0x16, drop
         pass: ret #-1
         drop: ret #0
      
      It was chosen to load bytecode into tc, since the reverse operation,
      tc filter list dev em1, is then able to show the exact commands again.
      Possible follow-up work could also include a small expression compiler
      for iproute2. Tested with the help of bmon. This idea came up during
      the Netfilter Workshop 2013 in Copenhagen. Also thanks to feedback from
      Eric Dumazet!
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d1d65cb
  4. 29 10月, 2013 12 次提交
  5. 28 10月, 2013 4 次提交
  6. 26 10月, 2013 5 次提交
    • A
      net: fix rtnl notification in atomic context · 7f294054
      Alexei Starovoitov 提交于
      commit 991fb3f7 "dev: always advertise rx_flags changes via netlink"
      introduced rtnl notification from __dev_set_promiscuity(),
      which can be called in atomic context.
      
      Steps to reproduce:
      ip tuntap add dev tap1 mode tap
      ifconfig tap1 up
      tcpdump -nei tap1 &
      ip tuntap del dev tap1 mode tap
      
      [  271.627994] device tap1 left promiscuous mode
      [  271.639897] BUG: sleeping function called from invalid context at mm/slub.c:940
      [  271.664491] in_atomic(): 1, irqs_disabled(): 0, pid: 3394, name: ip
      [  271.677525] INFO: lockdep is turned off.
      [  271.690503] CPU: 0 PID: 3394 Comm: ip Tainted: G        W    3.12.0-rc3+ #73
      [  271.703996] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
      [  271.731254]  ffffffff81a58506 ffff8807f0d57a58 ffffffff817544e5 ffff88082fa0f428
      [  271.760261]  ffff8808071f5f40 ffff8807f0d57a88 ffffffff8108bad1 ffffffff81110ff8
      [  271.790683]  0000000000000010 00000000000000d0 00000000000000d0 ffff8807f0d57af8
      [  271.822332] Call Trace:
      [  271.838234]  [<ffffffff817544e5>] dump_stack+0x55/0x76
      [  271.854446]  [<ffffffff8108bad1>] __might_sleep+0x181/0x240
      [  271.870836]  [<ffffffff81110ff8>] ? rcu_irq_exit+0x68/0xb0
      [  271.887076]  [<ffffffff811a80be>] kmem_cache_alloc_node+0x4e/0x2a0
      [  271.903368]  [<ffffffff810b4ddc>] ? vprintk_emit+0x1dc/0x5a0
      [  271.919716]  [<ffffffff81614d67>] ? __alloc_skb+0x57/0x2a0
      [  271.936088]  [<ffffffff810b4de0>] ? vprintk_emit+0x1e0/0x5a0
      [  271.952504]  [<ffffffff81614d67>] __alloc_skb+0x57/0x2a0
      [  271.968902]  [<ffffffff8163a0b2>] rtmsg_ifinfo+0x52/0x100
      [  271.985302]  [<ffffffff8162ac6d>] __dev_notify_flags+0xad/0xc0
      [  272.001642]  [<ffffffff8162ad0c>] __dev_set_promiscuity+0x8c/0x1c0
      [  272.017917]  [<ffffffff81731ea5>] ? packet_notifier+0x5/0x380
      [  272.033961]  [<ffffffff8162b109>] dev_set_promiscuity+0x29/0x50
      [  272.049855]  [<ffffffff8172e937>] packet_dev_mc+0x87/0xc0
      [  272.065494]  [<ffffffff81732052>] packet_notifier+0x1b2/0x380
      [  272.080915]  [<ffffffff81731ea5>] ? packet_notifier+0x5/0x380
      [  272.096009]  [<ffffffff81761c66>] notifier_call_chain+0x66/0x150
      [  272.110803]  [<ffffffff8108503e>] __raw_notifier_call_chain+0xe/0x10
      [  272.125468]  [<ffffffff81085056>] raw_notifier_call_chain+0x16/0x20
      [  272.139984]  [<ffffffff81620190>] call_netdevice_notifiers_info+0x40/0x70
      [  272.154523]  [<ffffffff816201d6>] call_netdevice_notifiers+0x16/0x20
      [  272.168552]  [<ffffffff816224c5>] rollback_registered_many+0x145/0x240
      [  272.182263]  [<ffffffff81622641>] rollback_registered+0x31/0x40
      [  272.195369]  [<ffffffff816229c8>] unregister_netdevice_queue+0x58/0x90
      [  272.208230]  [<ffffffff81547ca0>] __tun_detach+0x140/0x340
      [  272.220686]  [<ffffffff81547ed6>] tun_chr_close+0x36/0x60
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f294054
    • H
      net: initialize hashrnd in flow_dissector with net_get_random_once · 66415cf8
      Hannes Frederic Sowa 提交于
      We also can defer the initialization of hashrnd in flow_dissector
      to its first use. Since net_get_random_once is irq safe now we don't
      have to audit the call paths if one of this functions get called by an
      interrupt handler.
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      66415cf8
    • H
      net: make net_get_random_once irq safe · f84be2bd
      Hannes Frederic Sowa 提交于
      I initial build non irq safe version of net_get_random_once because I
      would liked to have the freedom to defer even the extraction process of
      get_random_bytes until the nonblocking pool is fully seeded.
      
      I don't think this is a good idea anymore and thus this patch makes
      net_get_random_once irq safe. Now someone using net_get_random_once does
      not need to care from where it is called.
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f84be2bd
    • N
      net: add missing dev_put() in __netdev_adjacent_dev_insert · 974daef7
      Nikolay Aleksandrov 提交于
      I think that a dev_put() is needed in the error path to preserve the
      proper dev refcount.
      
      CC: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
      Acked-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      974daef7
    • H
      netem: markov loss model transition fix · 4a3ad7b3
      Hagen Paul Pfeifer 提交于
      The transition from markov state "3 => lost packets within a burst
      period" to "1 => successfully transmitted packets within a gap period"
      has no *additional* loss event. The loss already happen for transition
      from 1 -> 3, this additional loss will make things go wild.
      
      E.g. transition probabilities:
      
      p13:   10%
      p31:  100%
      
      Expected:
      
      Ploss = p13 / (p13 + p31)
      Ploss = ~9.09%
      
      ... but it isn't. Even worse: we get a double loss - each time.
      So simple don't return true to indicate loss, rather break and return
      false.
      Signed-off-by: NHagen Paul Pfeifer <hagen@jauu.net>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Stefano Salsano <stefano.salsano@uniroma2.it>
      Cc: Fabio Ludovici <fabio.ludovici@yahoo.it>
      Signed-off-by: NHagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a3ad7b3
  7. 24 10月, 2013 4 次提交
  8. 23 10月, 2013 3 次提交