1. 27 3月, 2013 1 次提交
  2. 20 3月, 2013 1 次提交
    • W
      packet: packet fanout rollover during socket overload · 77f65ebd
      Willem de Bruijn 提交于
      Changes:
        v3->v2: rebase (no other changes)
                passes selftest
        v2->v1: read f->num_members only once
                fix bug: test rollover mode + flag
      
      Minimize packet drop in a fanout group. If one socket is full,
      roll over packets to another from the group. Maintain flow
      affinity during normal load using an rxhash fanout policy, while
      dispersing unexpected traffic storms that hit a single cpu, such
      as spoofed-source DoS flows. Rollover breaks affinity for flows
      arriving at saturated sockets during those conditions.
      
      The patch adds a fanout policy ROLLOVER that rotates between sockets,
      filling each socket before moving to the next. It also adds a fanout
      flag ROLLOVER. If passed along with any other fanout policy, the
      primary policy is applied until the chosen socket is full. Then,
      rollover selects another socket, to delay packet drop until the
      entire system is saturated.
      
      Probing sockets is not free. Selecting the last used socket, as
      rollover does, is a greedy approach that maximizes chance of
      success, at the cost of extreme load imbalance. In practice, with
      sufficiently long queues to absorb bursts, sockets are drained in
      parallel and load balance looks uniform in `top`.
      
      To avoid contention, scales counters with number of sockets and
      accesses them lockfree. Values are bounds checked to ensure
      correctness.
      
      Tested using an application with 9 threads pinned to CPUs, one socket
      per thread and sufficient busywork per packet operation to limits each
      thread to handling 32 Kpps. When sent 500 Kpps single UDP stream
      packets, a FANOUT_CPU setup processes 32 Kpps in total without this
      patch, 270 Kpps with the patch. Tested with read() and with a packet
      ring (V1).
      
      Also, passes psock_fanout.c unit test added to selftests.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77f65ebd
  3. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  4. 19 2月, 2013 2 次提交
  5. 04 2月, 2013 1 次提交
    • P
      packet: fix leakage of tx_ring memory · 9665d5d6
      Phil Sutter 提交于
      When releasing a packet socket, the routine packet_set_ring() is reused
      to free rings instead of allocating them. But when calling it for the
      first time, it fills req->tp_block_nr with the value of rb->pg_vec_len
      which in the second invocation makes it bail out since req->tp_block_nr
      is greater zero but req->tp_block_size is zero.
      
      This patch solves the problem by passing a zeroed auto-variable to
      packet_set_ring() upon each invocation from packet_release().
      
      As far as I can tell, this issue exists even since 69e3c75f (net: TX_RING
      and packet mmap), i.e. the original inclusion of TX ring support into
      af_packet, but applies only to sockets with both RX and TX ring
      allocated, which is probably why this was unnoticed all the time.
      Signed-off-by: NPhil Sutter <phil.sutter@viprinet.com>
      Cc: Johann Baudy <johann.baudy@gnu-log.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9665d5d6
  6. 19 11月, 2012 1 次提交
    • E
      net: Allow userns root to control llc, netfilter, netlink, packet, and xfrm · df008c91
      Eric W. Biederman 提交于
      Allow an unpriviled user who has created a user namespace, and then
      created a network namespace to effectively use the new network
      namespace, by reducing capable(CAP_NET_ADMIN) and
      capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
      CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.
      
      Allow creation of af_key sockets.
      Allow creation of llc sockets.
      Allow creation of af_packet sockets.
      
      Allow sending xfrm netlink control messages.
      
      Allow binding to netlink multicast groups.
      Allow sending to netlink multicast groups.
      Allow adding and dropping netlink multicast groups.
      Allow sending to all netlink multicast groups and port ids.
      
      Allow reading the netfilter SO_IP_SET socket option.
      Allow sending netfilter netlink messages.
      Allow setting and getting ip_vs netfilter socket options.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df008c91
  7. 08 11月, 2012 1 次提交
  8. 26 10月, 2012 1 次提交
  9. 11 9月, 2012 1 次提交
  10. 24 8月, 2012 1 次提交
  11. 23 8月, 2012 2 次提交
    • P
      packet: Protect packet sk list with mutex (v2) · 0fa7fa98
      Pavel Emelyanov 提交于
      Change since v1:
      
      * Fixed inuse counters access spotted by Eric
      
      In patch eea68e2f (packet: Report socket mclist info via diag module) I've
      introduced a "scheduling in atomic" problem in packet diag module -- the
      socket list is traversed under rcu_read_lock() while performed under it sk
      mclist access requires rtnl lock (i.e. -- mutex) to be taken.
      
      [152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
      [152363.820573] 4 locks held by crtools/12517:
      [152363.820581]  #0:  (sock_diag_mutex){+.+.+.}, at: [<ffffffff81a2dcb5>] sock_diag_rcv+0x1f/0x3e
      [152363.820613]  #1:  (sock_diag_table_mutex){+.+.+.}, at: [<ffffffff81a2de70>] sock_diag_rcv_msg+0xdb/0x11a
      [152363.820644]  #2:  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff81a67d01>] netlink_dump+0x23/0x1ab
      [152363.820693]  #3:  (rcu_read_lock){.+.+..}, at: [<ffffffff81b6a049>] packet_diag_dump+0x0/0x1af
      
      Similar thing was then re-introduced by further packet diag patches (fanount
      mutex and pgvec mutex for rings) :(
      
      Apart from being terribly sorry for the above, I propose to change the packet
      sk list protection from spinlock to mutex. This lock currently protects two
      modifications:
      
      * sklist
      * prot inuse counters
      
      The sklist modifications can be just reprotected with mutex since they already
      occur in a sleeping context. The inuse counters modifications are trickier -- the
      __this_cpu_-s are used inside, thus requiring the caller to handle the potential
      issues with contexts himself. Since packet sockets' counters are modified in two
      places only (packet_create and packet_release) we only need to protect the context
      from being preempted. BH disabling is not required in this case.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0fa7fa98
    • D
      af_packet: use define instead of constant · 9e67030a
      danborkmann@iogearbox.net 提交于
      Instead of using a hard-coded value for the status variable, it would make
      the code more readable to use its destined define from linux/if_packet.h.
      
      Signed-off-by: daniel.borkmann@tik.ee.ethz.ch
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9e67030a
  12. 20 8月, 2012 3 次提交
  13. 15 8月, 2012 5 次提交
  14. 13 8月, 2012 1 次提交
    • D
      af_packet: remove BUG statement in tpacket_destruct_skb · 7f5c3e3a
      danborkmann@iogearbox.net 提交于
      Here's a quote of the comment about the BUG macro from asm-generic/bug.h:
      
       Don't use BUG() or BUG_ON() unless there's really no way out; one
       example might be detecting data structure corruption in the middle
       of an operation that can't be backed out of.  If the (sub)system
       can somehow continue operating, perhaps with reduced functionality,
       it's probably not BUG-worthy.
      
       If you're tempted to BUG(), think again:  is completely giving up
       really the *only* solution?  There are usually better options, where
       users don't need to reboot ASAP and can mostly shut down cleanly.
      
      In our case, the status flag of a ring buffer slot is managed from both sides,
      the kernel space and the user space. This means that even though the kernel
      side might work as expected, the user space screws up and changes this flag
      right between the send(2) is triggered when the flag is changed to
      TP_STATUS_SENDING and a given skb is destructed after some time. Then, this
      will hit the BUG macro. As David suggested, the best solution is to simply
      remove this statement since it cannot be used for kernel side internal
      consistency checks. I've tested it and the system still behaves /stable/ in
      this case, so in accordance with the above comment, we should rather remove it.
      Signed-off-by: NDaniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f5c3e3a
  15. 09 8月, 2012 1 次提交
  16. 28 6月, 2012 1 次提交
  17. 12 6月, 2012 1 次提交
  18. 04 6月, 2012 1 次提交
    • J
      net: Remove casts to same type · e3192690
      Joe Perches 提交于
      Adding casts of objects to the same type is unnecessary
      and confusing for a human reader.
      
      For example, this cast:
      
      	int y;
      	int *p = (int *)&y;
      
      I used the coccinelle script below to find and remove these
      unnecessary casts.  I manually removed the conversions this
      script produces of casts with __force and __user.
      
      @@
      type T;
      T *p;
      @@
      
      -	(T *)p
      +	p
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e3192690
  19. 22 4月, 2012 1 次提交
  20. 20 4月, 2012 1 次提交
  21. 16 4月, 2012 1 次提交
  22. 29 3月, 2012 1 次提交
  23. 24 2月, 2012 1 次提交
  24. 28 12月, 2011 1 次提交
  25. 23 12月, 2011 1 次提交
  26. 19 11月, 2011 2 次提交
  27. 15 11月, 2011 1 次提交
  28. 14 11月, 2011 1 次提交
  29. 04 11月, 2011 1 次提交
    • O
      af_packet: de-inline some helper functions · eea49cc9
      Olof Johansson 提交于
      This popped some compiler errors due to mismatched prototypes. Just
      remove most manual inlines, the compiler should be able to figure out
      what makes sense to inline and not.
      
      net/packet/af_packet.c:252: warning: 'prb_curr_blk_in_use' declared inline after being called
      net/packet/af_packet.c:252: warning: previous declaration of 'prb_curr_blk_in_use' was here
      net/packet/af_packet.c:258: warning: 'prb_queue_frozen' declared inline after being called
      net/packet/af_packet.c:258: warning: previous declaration of 'prb_queue_frozen' was here
      net/packet/af_packet.c:248: warning: 'packet_previous_frame' declared inline after being called
      net/packet/af_packet.c:248: warning: previous declaration of 'packet_previous_frame' was here
      net/packet/af_packet.c:251: warning: 'packet_increment_head' declared inline after being called
      net/packet/af_packet.c:251: warning: previous declaration of 'packet_increment_head' was here
      Signed-off-by: NOlof Johansson <olof@lixom.net>
      Cc: Chetan Loke <loke.chetan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eea49cc9
  30. 19 10月, 2011 1 次提交
  31. 11 10月, 2011 1 次提交