1. 30 8月, 2013 2 次提交
  2. 21 8月, 2013 1 次提交
  3. 10 8月, 2013 1 次提交
    • E
      net: attempt high order allocations in sock_alloc_send_pskb() · 28d64271
      Eric Dumazet 提交于
      Adding paged frags skbs to af_unix sockets introduced a performance
      regression on large sends because of additional page allocations, even
      if each skb could carry at least 100% more payload than before.
      
      We can instruct sock_alloc_send_pskb() to attempt high order
      allocations.
      
      Most of the time, it does a single page allocation instead of 8.
      
      I added an additional parameter to sock_alloc_send_pskb() to
      let other users to opt-in for this new feature on followup patches.
      
      Tested:
      
      Before patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    46861.15
      
      After patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    57981.11
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28d64271
  4. 08 8月, 2013 1 次提交
  5. 03 8月, 2013 3 次提交
  6. 23 7月, 2013 1 次提交
  7. 13 6月, 2013 1 次提交
  8. 29 5月, 2013 1 次提交
  9. 04 5月, 2013 1 次提交
    • D
      packet: tpacket_v3: do not trigger bug() on wrong header status · 8da3056c
      Daniel Borkmann 提交于
      Jakub reported that it is fairly easy to trigger the BUG() macro
      from user space with TPACKET_V3's RX_RING by just giving a wrong
      header status flag. We already had a similar situation in commit
      7f5c3e3a (``af_packet: remove BUG statement in
      tpacket_destruct_skb'') where this was the case in the TX_RING
      side that could be triggered from user space. So really, don't use
      BUG() or BUG_ON() unless there's really no way out, and i.e.
      don't use it for consistency checking when there's user space
      involved, no excuses, especially not if you're slapping the user
      with WARN + dump_stack + BUG all at once. The two functions are
      of concern:
      
        prb_retire_current_block() [when block status != TP_STATUS_KERNEL]
        prb_open_block() [when block_status != TP_STATUS_KERNEL]
      
      Calls to prb_open_block() are guarded by ealier checks if block_status
      is really TP_STATUS_KERNEL (racy!), but the first one BUG() is easily
      triggable from user space. System behaves still stable after they are
      removed. Also remove that yoda condition entirely, since it's already
      guarded.
      Reported-by: NJakub Zawadzki <darkjames-ws@darkjames.pl>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8da3056c
  10. 25 4月, 2013 4 次提交
    • D
      packet: account statistics only in tpacket_stats_u · ee80fbf3
      Daniel Borkmann 提交于
      Currently, packet_sock has a struct tpacket_stats stats member for
      TPACKET_V1 and TPACKET_V2 statistic accounting, and with TPACKET_V3
      ``union tpacket_stats_u stats_u'' was introduced, where however only
      statistics for TPACKET_V3 are held, and when copied to user space,
      TPACKET_V3 does some hackery and access also tpacket_stats' stats,
      although everything could have been done within the union itself.
      
      Unify accounting within the tpacket_stats_u union so that we can
      remove 8 bytes from packet_sock that are there unnecessary. Note that
      even if we switch to TPACKET_V3 and would use non mmap(2)ed option,
      this still works due to the union with same types + offsets, that are
      exposed to the user space.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee80fbf3
    • D
      packet: if hw/sw ts enabled in rx/tx ring, report which ts we got · b9c32fb2
      Daniel Borkmann 提交于
      Currently, there is no way to find out which timestamp is reported in
      tpacket{,2,3}_hdr's tp_sec, tp_{n,u}sec members. It can be one of
      SOF_TIMESTAMPING_SYS_HARDWARE, SOF_TIMESTAMPING_RAW_HARDWARE,
      SOF_TIMESTAMPING_SOFTWARE, or a fallback variant late call from the
      PF_PACKET code in software.
      
      Therefore, report in the tp_status member of the ring buffer which
      timestamp has been reported for RX and TX path. This should not break
      anything for the following reasons: i) in RX ring path, the user needs
      to test for tp_status & TP_STATUS_USER, and later for other flags as
      well such as TP_STATUS_VLAN_VALID et al, so adding other flags will
      do no harm; ii) in TX ring path, time stamps with PACKET_TIMESTAMP
      socketoption are not available resp. had no effect except that the
      application setting this is buggy. Next to TP_STATUS_AVAILABLE, the
      user also should check for other flags such as TP_STATUS_WRONG_FORMAT
      to reclaim frames to the application. Thus, in case TX ts are turned
      off (default case), nothing happens to the application logic, and in
      case we want to use this new feature, we now can also check which of
      the ts source is reported in the status field as provided in the docs.
      Reported-by: NRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b9c32fb2
    • D
      packet: enable hardware tx timestamping on tpacket ring · 7a51384c
      Daniel Borkmann 提交于
      Currently, we only have software timestamping for the TX ring buffer
      path, but this limitation stems rather from the implementation. By
      just reusing tpacket_get_timestamp(), we can also allow hardware
      timestamping just as in the RX path.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7a51384c
    • W
      packet: tx timestamping on tpacket ring · 2e31396f
      Willem de Bruijn 提交于
      When transmit timestamping is enabled at the socket level, record a
      timestamp on packets written to a PACKET_TX_RING. Tx timestamps are
      always looped to the application over the socket error queue. Software
      timestamps are also written back into the packet frame header in the
      packet ring.
      Reported-by: NPaul Chavent <paul.chavent@onera.fr>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2e31396f
  11. 20 4月, 2013 1 次提交
  12. 17 4月, 2013 1 次提交
  13. 15 4月, 2013 1 次提交
  14. 28 3月, 2013 1 次提交
  15. 27 3月, 2013 1 次提交
  16. 20 3月, 2013 1 次提交
    • W
      packet: packet fanout rollover during socket overload · 77f65ebd
      Willem de Bruijn 提交于
      Changes:
        v3->v2: rebase (no other changes)
                passes selftest
        v2->v1: read f->num_members only once
                fix bug: test rollover mode + flag
      
      Minimize packet drop in a fanout group. If one socket is full,
      roll over packets to another from the group. Maintain flow
      affinity during normal load using an rxhash fanout policy, while
      dispersing unexpected traffic storms that hit a single cpu, such
      as spoofed-source DoS flows. Rollover breaks affinity for flows
      arriving at saturated sockets during those conditions.
      
      The patch adds a fanout policy ROLLOVER that rotates between sockets,
      filling each socket before moving to the next. It also adds a fanout
      flag ROLLOVER. If passed along with any other fanout policy, the
      primary policy is applied until the chosen socket is full. Then,
      rollover selects another socket, to delay packet drop until the
      entire system is saturated.
      
      Probing sockets is not free. Selecting the last used socket, as
      rollover does, is a greedy approach that maximizes chance of
      success, at the cost of extreme load imbalance. In practice, with
      sufficiently long queues to absorb bursts, sockets are drained in
      parallel and load balance looks uniform in `top`.
      
      To avoid contention, scales counters with number of sockets and
      accesses them lockfree. Values are bounds checked to ensure
      correctness.
      
      Tested using an application with 9 threads pinned to CPUs, one socket
      per thread and sufficient busywork per packet operation to limits each
      thread to handling 32 Kpps. When sent 500 Kpps single UDP stream
      packets, a FANOUT_CPU setup processes 32 Kpps in total without this
      patch, 270 Kpps with the patch. Tested with read() and with a packet
      ring (V1).
      
      Also, passes psock_fanout.c unit test added to selftests.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77f65ebd
  17. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  18. 19 2月, 2013 2 次提交
  19. 04 2月, 2013 1 次提交
    • P
      packet: fix leakage of tx_ring memory · 9665d5d6
      Phil Sutter 提交于
      When releasing a packet socket, the routine packet_set_ring() is reused
      to free rings instead of allocating them. But when calling it for the
      first time, it fills req->tp_block_nr with the value of rb->pg_vec_len
      which in the second invocation makes it bail out since req->tp_block_nr
      is greater zero but req->tp_block_size is zero.
      
      This patch solves the problem by passing a zeroed auto-variable to
      packet_set_ring() upon each invocation from packet_release().
      
      As far as I can tell, this issue exists even since 69e3c75f (net: TX_RING
      and packet mmap), i.e. the original inclusion of TX ring support into
      af_packet, but applies only to sockets with both RX and TX ring
      allocated, which is probably why this was unnoticed all the time.
      Signed-off-by: NPhil Sutter <phil.sutter@viprinet.com>
      Cc: Johann Baudy <johann.baudy@gnu-log.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9665d5d6
  20. 19 11月, 2012 1 次提交
    • E
      net: Allow userns root to control llc, netfilter, netlink, packet, and xfrm · df008c91
      Eric W. Biederman 提交于
      Allow an unpriviled user who has created a user namespace, and then
      created a network namespace to effectively use the new network
      namespace, by reducing capable(CAP_NET_ADMIN) and
      capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
      CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.
      
      Allow creation of af_key sockets.
      Allow creation of llc sockets.
      Allow creation of af_packet sockets.
      
      Allow sending xfrm netlink control messages.
      
      Allow binding to netlink multicast groups.
      Allow sending to netlink multicast groups.
      Allow adding and dropping netlink multicast groups.
      Allow sending to all netlink multicast groups and port ids.
      
      Allow reading the netfilter SO_IP_SET socket option.
      Allow sending netfilter netlink messages.
      Allow setting and getting ip_vs netfilter socket options.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df008c91
  21. 08 11月, 2012 1 次提交
  22. 26 10月, 2012 1 次提交
  23. 24 8月, 2012 1 次提交
  24. 23 8月, 2012 2 次提交
    • P
      packet: Protect packet sk list with mutex (v2) · 0fa7fa98
      Pavel Emelyanov 提交于
      Change since v1:
      
      * Fixed inuse counters access spotted by Eric
      
      In patch eea68e2f (packet: Report socket mclist info via diag module) I've
      introduced a "scheduling in atomic" problem in packet diag module -- the
      socket list is traversed under rcu_read_lock() while performed under it sk
      mclist access requires rtnl lock (i.e. -- mutex) to be taken.
      
      [152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
      [152363.820573] 4 locks held by crtools/12517:
      [152363.820581]  #0:  (sock_diag_mutex){+.+.+.}, at: [<ffffffff81a2dcb5>] sock_diag_rcv+0x1f/0x3e
      [152363.820613]  #1:  (sock_diag_table_mutex){+.+.+.}, at: [<ffffffff81a2de70>] sock_diag_rcv_msg+0xdb/0x11a
      [152363.820644]  #2:  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff81a67d01>] netlink_dump+0x23/0x1ab
      [152363.820693]  #3:  (rcu_read_lock){.+.+..}, at: [<ffffffff81b6a049>] packet_diag_dump+0x0/0x1af
      
      Similar thing was then re-introduced by further packet diag patches (fanount
      mutex and pgvec mutex for rings) :(
      
      Apart from being terribly sorry for the above, I propose to change the packet
      sk list protection from spinlock to mutex. This lock currently protects two
      modifications:
      
      * sklist
      * prot inuse counters
      
      The sklist modifications can be just reprotected with mutex since they already
      occur in a sleeping context. The inuse counters modifications are trickier -- the
      __this_cpu_-s are used inside, thus requiring the caller to handle the potential
      issues with contexts himself. Since packet sockets' counters are modified in two
      places only (packet_create and packet_release) we only need to protect the context
      from being preempted. BH disabling is not required in this case.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0fa7fa98
    • D
      af_packet: use define instead of constant · 9e67030a
      danborkmann@iogearbox.net 提交于
      Instead of using a hard-coded value for the status variable, it would make
      the code more readable to use its destined define from linux/if_packet.h.
      
      Signed-off-by: daniel.borkmann@tik.ee.ethz.ch
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9e67030a
  25. 20 8月, 2012 2 次提交
  26. 15 8月, 2012 2 次提交
  27. 13 8月, 2012 1 次提交
    • D
      af_packet: remove BUG statement in tpacket_destruct_skb · 7f5c3e3a
      danborkmann@iogearbox.net 提交于
      Here's a quote of the comment about the BUG macro from asm-generic/bug.h:
      
       Don't use BUG() or BUG_ON() unless there's really no way out; one
       example might be detecting data structure corruption in the middle
       of an operation that can't be backed out of.  If the (sub)system
       can somehow continue operating, perhaps with reduced functionality,
       it's probably not BUG-worthy.
      
       If you're tempted to BUG(), think again:  is completely giving up
       really the *only* solution?  There are usually better options, where
       users don't need to reboot ASAP and can mostly shut down cleanly.
      
      In our case, the status flag of a ring buffer slot is managed from both sides,
      the kernel space and the user space. This means that even though the kernel
      side might work as expected, the user space screws up and changes this flag
      right between the send(2) is triggered when the flag is changed to
      TP_STATUS_SENDING and a given skb is destructed after some time. Then, this
      will hit the BUG macro. As David suggested, the best solution is to simply
      remove this statement since it cannot be used for kernel side internal
      consistency checks. I've tested it and the system still behaves /stable/ in
      this case, so in accordance with the above comment, we should rather remove it.
      Signed-off-by: NDaniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f5c3e3a
  28. 09 8月, 2012 1 次提交
  29. 28 6月, 2012 1 次提交
  30. 12 6月, 2012 1 次提交