1. Jan 27, 2011 (3 commits)
    • net: Implement read-only protection and COW'ing of metrics. · 62fa8a84
      Committed by David S. Miller
      Routing metrics are now copy-on-write.
      
      Initially a route entry points its metrics at a read-only location.
      If a routing table entry exists, it will point there.  Else it will
      point at the all zero metric place-holder called 'dst_default_metrics'.
      
      The writability state of the metrics is stored in the low bits
      of the metrics pointer; we have two bits to spare if we want to
      store more states.
      
      For the initial implementation, COW is implemented simply via kmalloc.
      However future enhancements will change this to place the writable
      metrics somewhere else, in order to increase sharing.  Very likely
      this "somewhere else" will be the inetpeer cache.
      
      Note also that this means that metrics updates may transiently fail
      if we cannot COW the metrics successfully.
      
      But even by itself, this patch should decrease memory usage and
      increase cache locality especially for routing workloads.  In those
      cases the read-only metric copies stay in place and never get written
      to.
      
      TCP workloads where metrics get updated, and those rare cases where
      PMTU triggers occur, will take a very slight performance hit.  But
      that hit will be alleviated when the long-term writable metrics
      move to a more sharable location.
      
      Since the metrics storage went from a u32 array of RTAX_MAX entries to
      what is essentially a pointer, some retooling of the dst_entry layout
      was necessary.
      
      Most importantly, we need to preserve the alignment of the reference
      count so that it doesn't share cache lines with the read-mostly state,
      as per Eric Dumazet's alignment assertion checks.
      
      The only non-trivial bit here is the move of the 'flags' member into
      the writable cacheline.  This is OK since we always access the
      flags around the same moment we modify the reference count.
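      
      A user-space sketch of the tagged-pointer scheme described
      above; the names (METRICS_READ_ONLY, metrics_write(), ...) and
      malloc standing in for kmalloc are illustrative assumptions,
      not the kernel's actual dst API:
      
      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdlib.h>
      #include <string.h>
      
      #define RTAX_MAX          16      /* number of metric slots (illustrative) */
      #define METRICS_READ_ONLY 0x1UL   /* low tag bit: must COW before writing */
      #define METRICS_FLAGS     0x3UL   /* two low bits reserved for state */
      
      /* The all-zero, read-only placeholder shared by routes that have
       * no table entry of their own. */
      static uint32_t default_metrics[RTAX_MAX] __attribute__((aligned(4)));
      
      struct dst_entry { unsigned long metrics; };  /* tagged pointer */
      
      static uint32_t *metrics_ptr(struct dst_entry *d)
      {
          return (uint32_t *)(d->metrics & ~METRICS_FLAGS);
      }
      
      static int metrics_read_only(struct dst_entry *d)
      {
          return d->metrics & METRICS_READ_ONLY;
      }
      
      /* COW: replace the shared read-only metrics with a private
       * writable copy.  May fail transiently if allocation fails. */
      static uint32_t *metrics_write(struct dst_entry *d)
      {
          if (!metrics_read_only(d))
              return metrics_ptr(d);
          uint32_t *copy = malloc(RTAX_MAX * sizeof(uint32_t));
          if (!copy)
              return NULL;  /* metric update transiently fails */
          memcpy(copy, metrics_ptr(d), RTAX_MAX * sizeof(uint32_t));
          d->metrics = (unsigned long)copy;  /* writable: tag bit clear */
          return copy;
      }
      
      int main(void)
      {
          struct dst_entry d = { (unsigned long)default_metrics | METRICS_READ_ONLY };
          assert(metrics_read_only(&d));
          assert(metrics_ptr(&d)[0] == 0);
      
          uint32_t *w = metrics_write(&d);
          assert(w && !metrics_read_only(&d));
          w[0] = 1500;                      /* e.g. an MTU metric */
          assert(default_metrics[0] == 0);  /* shared copy untouched */
          free(w);
          return 0;
      }
      ```
      
      Because malloc (like kmalloc) returns pointers aligned to at
      least 4 bytes, the two low bits are always free to carry state.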
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xfrm6: Don't forget to propagate peer into ipsec route. · 7cc2edb8
      Committed by David S. Miller
      Like ipv4, we have to propagate the ipv6 route peer into
      the ipsec top-level route during instantiation.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_mqprio: dont leak kernel memory · 144ce879
      Committed by Eric Dumazet
      mqprio_dump() should make sure all fields of struct tc_mqprio_qopt are
      initialized.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. Jan 26, 2011 (4 commits)
    • TCP: fix a bug that triggers large number of TCP RST by mistake · 44f5324b
      Committed by Jerry Chu
      This patch fixes a bug that causes TCP RST packets to be generated
      on otherwise correctly behaved applications, e.g., no unread data
      on close,..., etc. To trigger the bug, at least two conditions must
      be met:
      
      1. The FIN flag is set on the last data packet, i.e., it's not on a
      separate, FIN only packet.
      2. The size of the last data chunk on the receive side matches
      exactly with the size of buffer posted by the receiver, and the
      receiver closes the socket without any further read attempt.
      
      This bug was first noticed on our netperf based testbed for our IW10
      proposal to IETF where a large number of RST packets were observed.
      netperf's read side code meets the condition 2 above 100%.
      
      Before the fix, tcp_data_queue() would queue the last skb that meets
      condition 1 to sk_receive_queue even though it has fully copied out
      (skb_copy_datagram_iovec()) the data. Then if condition 2 is also met,
      tcp_recvmsg() often returns all the copied out data successfully
      without actually consuming the skb, due to a check
      "if ((chunk = len - tp->ucopy.len) != 0) {"
      and
      "len -= chunk;"
      after tcp_prequeue_process() that causes "len" to become 0 and an
      early exit from the big while loop.
      
      I don't see any reason not to free the skb whose data have been fully
      consumed in tcp_data_queue(), regardless of the FIN flag.  We won't
      get there if MSG_PEEK is on. Am I missing some arcane cases related
      to urgent data?
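      
      The early-exit arithmetic quoted above can be modeled in a few
      lines of user-space C; remaining_after_prequeue() is a
      hypothetical helper that only mirrors the two quoted statements:
      
      ```c
      #include <assert.h>
      #include <stddef.h>
      
      /* Model of the tcp_recvmsg() check quoted above: if the prequeue
       * path already copied out everything the receiver asked for,
       * 'len' drops to 0 and the big while loop exits without ever
       * consuming the queued skb. */
      static size_t remaining_after_prequeue(size_t len, size_t ucopy_len)
      {
          size_t chunk = len - ucopy_len;  /* bytes copied by prequeue */
          if (chunk != 0)
              len -= chunk;
          return len;  /* 0 => early exit, skb left on sk_receive_queue */
      }
      
      int main(void)
      {
          /* Condition 2: the last chunk exactly fills the posted buffer. */
          assert(remaining_after_prequeue(100, 0) == 0);
          /* Partial copy: the loop continues and would consume the skb. */
          assert(remaining_after_prequeue(100, 40) == 60);
          return 0;
      }
      ```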
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mac80211: fix a crash in ieee80211_beacon_get_tim on change_interface · eb3e554b
      Committed by Felix Fietkau
      Some drivers (e.g. ath9k) do not always disable beacons when they're
      supposed to. When an interface is changed using the change_interface op,
      the mode specific sdata part is in an undefined state and trying to
      get a beacon at this point can produce weird crashes.
      
      To fix this, add a check for ieee80211_sdata_running before using
      anything from the sdata.
      Signed-off-by: Felix Fietkau <nbd@openwrt.org>
      Cc: stable@kernel.org
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
    • pktgen: speedup fragmented skbs · 26ad7879
      Committed by Eric Dumazet
      We spend a lot of time clearing pages in pktgen
      (or not clearing them on ipv6 and leaking kernel memory).
      
      Since we don't modify them, we can use one zeroed page and take
      references on it.  This page can use NUMA affinity as well.
      
      Define a pktgen_finalize_skb() helper, used in both ipv4 and ipv6.
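      
      A minimal user-space sketch of the shared-zero-page idea,
      assuming a simple reference count; get_zero_page() and
      put_zero_page() are illustrative names, not the pktgen or
      page-allocator API:
      
      ```c
      #include <assert.h>
      #include <stdlib.h>
      
      #define PAGE_SIZE 4096
      
      /* One shared zeroed page with a refcount, standing in for the
       * per-skb cleared pages pktgen used before this change. */
      struct zero_page {
          unsigned refcount;
          unsigned char data[PAGE_SIZE];
      };
      
      static struct zero_page *shared;
      
      static struct zero_page *get_zero_page(void)
      {
          if (!shared) {
              shared = calloc(1, sizeof(*shared));  /* cleared exactly once */
              if (!shared)
                  return NULL;
          }
          shared->refcount++;  /* each frag just takes a reference */
          return shared;
      }
      
      static void put_zero_page(struct zero_page *p)
      {
          if (p && --p->refcount == 0) {
              free(p);
              shared = NULL;
          }
      }
      
      int main(void)
      {
          struct zero_page *a = get_zero_page();
          struct zero_page *b = get_zero_page();
          assert(a == b && a->refcount == 2);  /* one page, two references */
          assert(a->data[0] == 0);             /* no per-use clearing needed */
          put_zero_page(b);
          put_zero_page(a);
          return 0;
      }
      ```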
      
      Results using skbs with one frag:
      
      Before patch:
      
      Result: OK: 608980458(c608978520+d1938) nsec, 1000000000
      (100byte,1frags)
        1642088pps 1313Mb/sec (1313670400bps) errors: 0
      
      After patch:
      
      Result: OK: 345285014(c345283891+d1123) nsec, 1000000000
      (100byte,1frags)
        2896158pps 2316Mb/sec (2316926400bps) errors: 0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Revert 'administrative down' address handling changes. · 73a8bd74
      Committed by David S. Miller
      This reverts the following set of commits:
      
      d1ed113f ("ipv6: remove duplicate neigh_ifdown")
      29ba5fed ("ipv6: don't flush routes when setting loopback down")
      9d82ca98 ("ipv6: fix missing in6_ifa_put in addrconf")
      2de79570 ("ipv6: addrconf: don't remove address state on ifdown if the address is being kept")
      8595805a ("IPv6: only notify protocols if address is compeletely gone")
      27bdb2ab ("IPv6: keep tentative addresses in hash table")
      93fa159a ("IPv6: keep route for tentative address")
      8f37ada5 ("IPv6: fix race between cleanup and add/delete address")
      84e8b803 ("IPv6: addrconf notify when address is unavailable")
      dc2b99f7 ("IPv6: keep permanent addresses on admin down")
      
      because the core semantic change to ipv6 address handling on ifdown
      has broken some things, in particular "disable_ipv6" sysctl handling.
      
      Stephen has made several attempts to get things back in working order,
      but nothing has restored disable_ipv6 fully yet.
      Reported-by: Eric W. Biederman <ebiederm@xmission.com>
      Tested-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. Jan 25, 2011 (12 commits)
  4. Jan 24, 2011 (1 commit)
  5. Jan 22, 2011 (2 commits)
  6. Jan 21, 2011 (11 commits)
  7. Jan 20, 2011 (7 commits)
    • netfilter: nf_nat: place conntrack in source hash after SNAT is done · 41a7cab6
      Committed by Changli Gao
      If SNAT isn't performed, other conntracks may pick up the
      wrong info.
      
      Since the filter table comes after the DNAT table, packets
      dropped in the filter table also pollute the bysource hash
      table.
      Signed-off-by: Changli Gao <xiaosuo@gmail.com>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
    • netfilter: do not omit re-route check on NF_QUEUE verdict · 28a51ba5
      Committed by Florian Westphal
      ret != NF_QUEUE only works in the "--queue-num 0" case; for
      queues > 0 the test should be '(ret & NF_VERDICT_MASK) != NF_QUEUE'.
      
      However, NF_QUEUE no longer DROPs the skb unconditionally if queueing
      fails (due to NF_VERDICT_FLAG_QUEUE_BYPASS verdict flag), so the
      re-route test should also be performed if this flag is set in the
      verdict.
      
      The full test would then look something like
      
      && ((ret & NF_VERDICT_MASK) == NF_QUEUE && (ret & NF_VERDICT_FLAG_QUEUE_BYPASS))
      
      This is rather ugly, so just remove the NF_QUEUE test altogether.
      
      The only effect is that we might perform an unnecessary route lookup
      in the NF_QUEUE case.
      
      ip6table_mangle did not have such a check.
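      
      The difference between the buggy and the masked test can be
      demonstrated with a uapi-style verdict encoding; the constants
      below mirror the scheme (verdict code in the low bits, queue
      number in the high bits) but are assumptions of this sketch:
      
      ```c
      #include <assert.h>
      
      #define NF_QUEUE        3
      #define NF_VERDICT_MASK 0x0000ffff
      #define NF_QUEUE_NR(n)  (((n) << 16) | NF_QUEUE)
      
      int main(void)
      {
          unsigned int v0 = NF_QUEUE_NR(0);  /* --queue-num 0 */
          unsigned int v5 = NF_QUEUE_NR(5);  /* --queue-num 5 */
      
          /* The buggy test: only queue 0 compares equal to NF_QUEUE. */
          assert(v0 == NF_QUEUE);
          assert(v5 != NF_QUEUE);
      
          /* The masked test matches every queue number. */
          assert((v0 & NF_VERDICT_MASK) == NF_QUEUE);
          assert((v5 & NF_VERDICT_MASK) == NF_QUEUE);
          return 0;
      }
      ```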
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
    • net_sched: cleanups · cc7ec456
      Committed by Eric Dumazet
      Cleanup net/sched code to current CodingStyle and practices.
      
      Reduce inline abuse.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: implement a root container qdisc sch_mqprio · b8970f0b
      Committed by John Fastabend
      This implements a mqprio queueing discipline that by default creates
      a pfifo_fast qdisc per tx queue and provides the needed configuration
      interface.
      
      Using the mqprio qdisc, the number of traffic classes currently
      in use, along with the range of queues allotted to each class,
      can be configured. By default, skbs are mapped to traffic
      classes using the skb priority.
      This mapping is configurable.
      
      Configurable parameters:
      
      struct tc_mqprio_qopt {
      	__u8    num_tc;
      	__u8    prio_tc_map[TC_BITMASK + 1];
      	__u8    hw;
      	__u16   count[TC_MAX_QUEUE];
      	__u16   offset[TC_MAX_QUEUE];
      };
      
      Here the count/offset pairing give the queue alignment and the
      prio_tc_map gives the mapping from skb->priority to tc.
      
      The hw bit determines if the hardware should configure the count
      and offset values. If the hardware bit is set then the operation
      will fail if the hardware does not implement the ndo_setup_tc
      operation. This is to avoid undetermined states where the hardware
      may or may not control the queue mapping. Also minimal bounds
      checking is done on the count/offset to verify a queue does not
      exceed num_tx_queues and that queue ranges do not overlap. Otherwise
      it is left to user policy or hardware configuration to create
      useful mappings.
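      
      The count/offset validation described above might look like the
      following sketch; mqprio_ranges_valid() is a hypothetical
      helper, and the ascending-layout overlap check is an assumption
      of this sketch, not the kernel's actual validation code:
      
      ```c
      #include <assert.h>
      #include <stdint.h>
      
      #define TC_MAX_QUEUE 16
      #define TC_BITMASK   15
      
      struct tc_mqprio_qopt {
          uint8_t  num_tc;
          uint8_t  prio_tc_map[TC_BITMASK + 1];
          uint8_t  hw;
          uint16_t count[TC_MAX_QUEUE];
          uint16_t offset[TC_MAX_QUEUE];
      };
      
      /* Minimal bounds check: every class's queue range must fit
       * within num_tx_queues and ranges must not overlap (assumes
       * classes are laid out in ascending queue order). */
      static int mqprio_ranges_valid(const struct tc_mqprio_qopt *q,
                                     unsigned num_tx_queues)
      {
          unsigned last = 0;
          for (int i = 0; i < q->num_tc; i++) {
              if (q->offset[i] < last)                        /* overlap */
                  return 0;
              if (q->offset[i] + q->count[i] > num_tx_queues) /* out of range */
                  return 0;
              last = q->offset[i] + q->count[i];
          }
          return 1;
      }
      
      int main(void)
      {
          struct tc_mqprio_qopt q = { .num_tc = 2 };
          q.offset[0] = 0; q.count[0] = 4;  /* tc0 -> queues 0-3 */
          q.offset[1] = 4; q.count[1] = 4;  /* tc1 -> queues 4-7 */
          assert(mqprio_ranges_valid(&q, 8));
          assert(!mqprio_ranges_valid(&q, 6));  /* tc1 exceeds num_tx_queues */
          q.offset[1] = 2;
          assert(!mqprio_ranges_valid(&q, 8));  /* ranges overlap */
          return 0;
      }
      ```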
      
      It is expected that hardware QOS schemes can be implemented by
      creating appropriate mappings of queues in ndo_setup_tc().
      
      One expected use case is drivers will use the ndo_setup_tc to map
      queue ranges onto 802.1Q traffic classes. This provides a generic
      mechanism to map network traffic onto these traffic classes and
      removes the need for lower layer drivers to know specifics about
      traffic types.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: implement mechanism for HW based QOS · 4f57c087
      Committed by John Fastabend
      This patch provides a mechanism for lower layer devices to
      steer traffic to tx queues using skb->priority. This allows
      hardware based QOS schemes to use the default qdisc without
      incurring the penalties related to global state and the qdisc
      lock, while reliably receiving skbs on the correct tx ring to
      avoid head-of-line blocking resulting from shuffling in the
      LLD. Finally, all the goodness from txq caching and xps/rps
      can still be leveraged.
      
      Many drivers and hardware exist with the ability to implement
      QOS schemes in the hardware but currently these drivers tend
      to rely on firmware to reroute specific traffic, a driver
      specific select_queue or the queue_mapping action in the
      qdisc.
      
      With select_queue, drivers need to be updated for each and
      every traffic type, and we lose the goodness of much of the
      upstream work. Firmware solutions are inherently inflexible.
      And finally, if admins are expected to build a qdisc and
      filter rules to steer traffic, this requires knowledge of how
      the hardware is currently configured. The number of tx queues
      and the queue offsets may change depending on resources. This
      approach also incurs all the overhead of a qdisc with filters.
      
      With the mechanism in this patch, users can set skb priority
      using expected methods, e.g. setsockopt(), or the stack can
      set the priority directly. The skb will then be steered to the
      correct tx queues aligned with hardware QOS traffic classes.
      In the normal case, with a single traffic class and all queues
      in this class, everything works as-is until the LLD enables
      multiple traffic classes.
      
      To steer the skb we mask the priority down to its lower 4 bits
      and allow the hardware to configure up to 15 distinct classes
      of traffic. This is expected to be sufficient for most
      applications; at any rate it is more than the 802.1Q spec
      designates and is equal to the number of prio bands currently
      implemented in the default qdisc.
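      
      The steering step described above can be sketched as a 4-bit
      mask plus a table lookup; prio_to_tc() is an illustrative
      helper, with TC_BITMASK = 15 matching the 4-bit mask:
      
      ```c
      #include <assert.h>
      #include <stdint.h>
      
      #define TC_BITMASK 15  /* keep the low 4 bits of skb->priority */
      
      /* Mask the priority down to 4 bits, then look up the traffic
       * class the device was configured with. */
      static uint8_t prio_to_tc(uint32_t skb_priority,
                                const uint8_t map[TC_BITMASK + 1])
      {
          return map[skb_priority & TC_BITMASK];
      }
      
      int main(void)
      {
          uint8_t map[TC_BITMASK + 1] = {0};
          map[1] = 1;
          map[7] = 3;
      
          assert(prio_to_tc(1, map) == 1);
          assert(prio_to_tc(7, map) == 3);
          assert(prio_to_tc(16 + 7, map) == 3);  /* only low 4 bits matter */
          return 0;
      }
      ```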
      
      This, in conjunction with a userspace application such as
      lldpad, can be used to implement 802.1Q transmission selection
      algorithms, one of which is the extended transmission
      selection algorithm currently used for DCB.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: support setting devgroup parameters · e7ed828f
      Committed by Vlad Dogaru
      If a rtnetlink request specifies a negative or zero ifindex and
      has no interface name attribute, but has a group attribute,
      then the changes are made to all the interfaces belonging to
      the specified group.
      Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
      Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
      Signed-off-by: David S. Miller <davem@davemloft.net>