1. 26 Apr, 2007 11 commits
    • E
      [NET]: Introduce SIOCGSTAMPNS ioctl to get timestamps with nanosec resolution · ae40eb1e
      Eric Dumazet committed
      Now that network timestamps use the ktime_t infrastructure, we can add a new
      ioctl() SIOCGSTAMPNS command to get timestamps in 'struct timespec'.
      User programs can thus access timestamps with nanosecond resolution.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      CC: Stephen Hemminger <shemminger@linux-foundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ae40eb1e
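A minimal user-space sketch of how a program might query the new ioctl, assuming a 64-bit Linux system where `SIOCGSTAMPNS` is exposed through `<linux/sockios.h>`. The helper name `last_rx_timestamp` is hypothetical and not part of the patch.

```c
#include <linux/sockios.h>  /* SIOCGSTAMPNS */
#include <sys/ioctl.h>
#include <time.h>

/*
 * Hypothetical helper (not part of the patch): fetch the receive timestamp
 * of the last packet delivered on socket `fd` as a struct timespec.
 * Returns 0 on success, or -1 with errno set (for example EBADF when `fd`
 * is not a valid descriptor).
 */
static int last_rx_timestamp(int fd, struct timespec *ts)
{
    return ioctl(fd, SIOCGSTAMPNS, ts);
}
```

A caller would typically invoke this right after a successful `recv()` on the same socket and read `ts.tv_nsec` for the sub-microsecond part.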
    • D
      [TCP]: Abstract out all write queue operations. · fe067e8a
      David S. Miller committed
      This allows the write queue implementation to be changed,
      for example, to one which allows fast interval searching.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fe067e8a
    • H
      [UDP]: Clean up UDP-Lite receive checksum · 759e5d00
      Herbert Xu committed
      This patch eliminates some duplicate code for the verification of
      receive checksums between UDP-Lite and UDP.  It does this by
      introducing __skb_checksum_complete_head which is identical to
      __skb_checksum_complete apart from the fact that it takes
      a length parameter rather than always covering the first skb->len bytes.

      As a result UDP-Lite will be able to use hardware checksum offload
      for packets which do not use partial coverage checksums.  It also
      means that UDP-Lite loopback no longer does unnecessary checksum
      verification.

      If any NICs start supporting UDP-Lite this would also start working
      automatically.

      This patch removes the assumption that msg_flags has MSG_TRUNC clear
      upon entry in recvmsg.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      759e5d00
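The relationship between a length-taking checksum routine and its full-coverage counterpart can be sketched in user space as follows. This is not the kernel code: the names `checksum_head` and `checksum_complete` are hypothetical, and the sketch computes a ones'-complement Internet checksum over a flat byte buffer rather than verifying one on an skb.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Ones'-complement Internet checksum over the first `len` bytes of `buf`.
 * Assumes `len` is small enough (any IP datagram is) that the 32-bit
 * accumulator cannot overflow.
 */
static uint16_t checksum_head(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)                        /* odd trailing byte, zero padded */
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)                   /* fold carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Full-coverage variant: the head variant applied to the whole packet. */
static uint16_t checksum_complete(const uint8_t *buf, size_t total_len)
{
    return checksum_head(buf, total_len);
}
```

With this split, a partial-coverage protocol passes its coverage length to the head variant, while a full-coverage protocol keeps calling the wrapper.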
    • N
      [IPV6] ADDRCONF: Optimistic Duplicate Address Detection (RFC 4429) Support. · 95c385b4
      Neil Horman committed
      Nominally an autoconfigured IPv6 address is added to an interface in the
      Tentative state (as per RFC 2462).  Addresses in this state remain in this
      state while the Duplicate Address Detection process operates on them to
      determine their uniqueness on the network.  During this period, these
      tentative addresses may not be used for communication, increasing the time
      before a node may be able to communicate on a network.  Using Optimistic
      Duplicate Address Detection, autoconfigured addresses may be used
      immediately for communication on the network, as long as certain rules are
      followed to avoid conflicts with other nodes during the Duplicate Address
      Detection process.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95c385b4
    • E
      [NET]: convert network timestamps to ktime_t · b7aa0bf7
      Eric Dumazet committed
      We currently use a special structure (struct skb_timeval) and plain
      'struct timeval' to store packet timestamps in sk_buffs and struct
      sock.

      This has some drawbacks:
      - Fixed microsecond resolution.
      - Waste of space on 64bit platforms where sizeof(struct timeval)=16

      I suggest using ktime_t, which is a nice abstraction of high resolution
      time services, currently capable of nanosecond resolution.

      As sizeof(ktime_t) is 8 bytes, using ktime_t in 'struct sock' permits
      an 8 byte shrink of this structure on 64bit architectures. Some other
      structures also benefit from this size reduction (struct ipq in
      ipv4/ip_fragment.c, struct frag_queue in ipv6/reassembly.c, ...)

      Once this ktime infrastructure is adopted, we can more easily provide
      nanosecond resolution on top of it. (ioctl SIOCGSTAMPNS and/or
      SO_TIMESTAMPNS/SCM_TIMESTAMPNS)

      Note: this patch includes a bug correction in
      compat_sock_get_timestamp() where a "err = 0;" was missing (so this
      syscall returned -ENOENT instead of 0)
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      CC: Stephen Hemminger <shemminger@linux-foundation.org>
      CC: John find <linux.kernel@free.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b7aa0bf7
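To illustrate why an 8-byte timestamp can still carry nanosecond resolution, here is a user-space sketch that stores time as a signed 64-bit nanosecond count. The type `ns_time_t` and both helpers are hypothetical stand-ins; a plain 64-bit count is an assumption made for this example, not a statement about how ktime_t is laid out on every architecture.

```c
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical stand-in for an 8-byte timestamp: nanoseconds since epoch. */
typedef int64_t ns_time_t;

/* Pack a (seconds, nanoseconds) pair into a single 64-bit count. */
static ns_time_t ns_from_timespec(struct timespec ts)
{
    return (ns_time_t)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}

/* Unpack again; valid for non-negative timestamps. */
static struct timespec ns_to_timespec(ns_time_t t)
{
    struct timespec ts;

    ts.tv_sec  = (time_t)(t / NSEC_PER_SEC);
    ts.tv_nsec = (long)(t % NSEC_PER_SEC);
    return ts;
}
```

The packed form occupies 8 bytes regardless of platform, whereas a seconds-plus-microseconds pair of two 64-bit fields takes 16 bytes and stops at microsecond resolution.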
    • J
      [NET]: Convert xtime.tv_sec to get_seconds() · 9d729f72
      James Morris committed
      Where appropriate, convert references to xtime.tv_sec to the
      get_seconds() helper function.
      Signed-off-by: James Morris <jmorris@namei.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9d729f72
    • E
      [NET]: Keep sk_backlog near sk_lock · fa438ccf
      Eric Dumazet committed
      sk_backlog is a critical field of struct sock. (known famous words)

      It is (ab)used in hot paths, in particular in release_sock(), tcp_recvmsg(),
      tcp_v4_rcv(), sk_receive_skb().

      It really makes sense to place it next to sk_lock, because sk_backlog is only
      used after sk_lock is locked (so its cache line is already in the L1 cache).
      This should reduce cache misses and sk_lock acquisition time.

      (In theory, we could move only the head pointer near sk_lock and leave the
      tail far away, because 'tail' is normally not so hot, but keep it simple :) )
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fa438ccf
    • I
      [TCP]: Add two new spurious RTO responses to FRTO · 3cfe3baa
      Ilpo Järvinen committed
      New sysctl tcp_frto_response is added to select amongst these
      responses:
      	- Rate halving based; reuses CA_CWR state (default)
      	- Very conservative; used to be the only one available (=1)
      	- Undo cwr; undoes ssthresh and cwnd reductions (=2)

      The response with rate halving requires a new parameter to
      tcp_enter_cwr because FRTO has already reduced ssthresh and
      doing a second reduction there has to be prevented. In addition,
      to keep things nice on an 80-column screen, a local variable was
      added.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3cfe3baa
    • J
    • I
      [TCP] FRTO: Entry is allowed only during (New)Reno like recovery · 46d0de4e
      Ilpo Järvinen committed
      This interpretation comes from RFC4138:
          "If the sender implements some loss recovery algorithm other
           than Reno or NewReno [FHG04], the F-RTO algorithm SHOULD
           NOT be entered when earlier fast recovery is underway."
      
      I think the RFC means to say (especially in the light of
      Appendix B) that ...recovery is underway (not just fast recovery)
      or was underway when it was interrupted by an earlier (F-)RTO
      that hasn't yet been resolved (snd_una has not advanced enough).
      Thus, my interpretation is that whenever TCP has ever
      retransmitted other than head, basic version cannot be used
      because then the order assumptions which are used as FRTO basis
      do not hold.
      
      NewReno has only the head segment retransmitted at a time.
      Therefore, walk up to the segment that has not been SACKed, if
      that segment is not retransmitted nor anything before it, we know
      for sure, that nothing after the non-SACKed segment should be
      either. This assumption is valid because TCPCB_EVER_RETRANS does
      not leave holes but each non-SACKed segment is rexmitted
      in-order.
      
      Check for retrans_out > 1 avoids more expensive walk through the
      skb list, as we can know the result beforehand: F-RTO will not be
      allowed.
      
      A SACKed skb can turn into a non-SACKed one only in the extremely rare
      case of SACK reneging; in this case we might fail to detect
      retransmissions if any were made for segments other than head. To
      get rid of that feature, the whole rexmit queue would have to be
      walked (always) or FRTO should be prevented when SACK reneging
      happens. Of course RTO should still trigger after reneging, which
      makes this issue even less likely to show up. And as long as the
      response is as conservative as it is now, nothing bad happens even
      then.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      46d0de4e
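The entry check described in this commit message can be sketched over a simple array of segments. Everything here is a hypothetical model (`struct seg`, `frto_allowed`), and skipping the head segment during the walk is an interpretation of "retransmitted other than head", not something the log spells out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-segment state; index 0 is the head of the queue. */
struct seg {
    bool sacked;         /* segment has been SACKed */
    bool ever_retrans;   /* segment was retransmitted at some point */
};

/* Returns true if the basic F-RTO variant may be entered. */
static bool frto_allowed(const struct seg *q, size_t n, unsigned retrans_out)
{
    size_t i;

    if (retrans_out > 1)
        return false;               /* cheap early exit, no walk needed */

    for (i = 1; i < n; i++) {       /* skip the head: it may be retransmitted */
        if (q[i].ever_retrans)
            return false;           /* something other than head was rexmitted */
        if (!q[i].sacked)
            break;                  /* first non-SACKed segment checked: stop */
    }
    return true;
}
```

The walk stops at the first non-SACKed segment because, under the in-order retransmission assumption stated in the message, nothing after it can have been retransmitted if it was not.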
    • I
      [TCP] FRTO: Moved tcp_use_frto from tcp.h to tcp_input.c · bdaae17d
      Ilpo Järvinen committed
      In addition, removed inline.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bdaae17d
  2. 30 Mar, 2007 1 commit
  3. 28 Mar, 2007 1 commit
  4. 26 Mar, 2007 3 commits
    • D
      [IPV6]: Fix routing round-robin locking. · f11e6659
      David S. Miller committed
      As per RFC2461, section 6.3.6, item #2, when no routers on the
      matching list are known to be reachable or probably reachable we
      do round robin on those available routes so that we make sure
      to probe as many of them as possible to detect when one becomes
      reachable faster.
      
      Each routing table has a rwlock protecting the tree and the linked
      list of routes at each leaf.  The round robin code executes during
      lookup and thus with the rwlock taken as a reader.  A small local
      spinlock tries to provide protection but this does not work at all
      for two reasons:
      
      1) The round-robin list manipulation, as coded, goes like this (with
         read lock held):
      
      	walk routes finding head and tail
      
      	spin_lock();
      	rotate list using head and tail
      	spin_unlock();
      
         While one thread is rotating the list, another thread can
         end up with stale values of head and tail and then proceed
         to corrupt the list when it gets the lock.  This ends up causing
         the OOPS in fib6_add() later on that many people have been hitting.
      
      2) All the other code paths that run with the rwlock held as
         a reader do not expect the list to change on them, they
         expect it to remain completely fixed while they hold the
         lock in that way.
      
      So, simply stated, it is impossible to implement this correctly using
      a manipulation of the list without violating the rwlock locking
      semantics.
      
      Reimplement using a per-fib6_node round-robin pointer.  This way we
      don't need to manipulate the list at all, and since the round-robin
      pointer can only ever point to real existing entries we don't need
      to perform any locking on the changing of the round-robin pointer
      itself.  We only need to reset the round-robin pointer to NULL when
      the entry it is pointing to is removed.
      
      The idea is from Thomas Graf and it is very similar to how this
      was implemented before the advanced router selection code went in.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f11e6659
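The per-node round-robin pointer idea can be sketched in plain C as below. The types and function names (`struct route`, `struct node`, `rr_select`, `route_remove`) are hypothetical, and the sketch is single-threaded: it shows only that selection never reorders the list and that removal resets the pointer, not the kernel's locking.

```c
#include <stddef.h>

struct route {
    int id;
    struct route *next;      /* list order is never changed by lookups */
};

struct node {
    struct route *routes;    /* head of the route list at this leaf */
    struct route *rr_ptr;    /* next route to hand out, or NULL */
};

/* Pick the next route in round-robin order without touching the list. */
static struct route *rr_select(struct node *n)
{
    struct route *r = n->rr_ptr ? n->rr_ptr : n->routes;

    if (r)
        n->rr_ptr = r->next ? r->next : n->routes;   /* wrap around */
    return r;
}

/* Unlink a route (writer side); reset rr_ptr if it pointed at the victim. */
static void route_remove(struct node *n, struct route *victim)
{
    struct route **pp;

    for (pp = &n->routes; *pp; pp = &(*pp)->next) {
        if (*pp == victim) {
            *pp = victim->next;
            if (n->rr_ptr == victim)
                n->rr_ptr = NULL;
            return;
        }
    }
}
```

Because lookups only move `rr_ptr` and never relink entries, code that walks the list under a read lock sees a stable list, which is the property the old rotate-in-place approach violated.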
    • A
      [NET]: Fix neighbour destructor handling. · ecbb4169
      Alexey Kuznetsov committed
      ->neigh_destructor() is killed (not used), replaced with
      ->neigh_cleanup(), which is called when a neighbor entry goes to dead
      state. At this point everything is still valid: neigh->dev,
      neigh->parms etc.

      The device should guarantee that dead neighbor entries (neigh->dead !=
      0) do not get their private part initialized, otherwise nobody will
      clean it up.

      I think this is enough for ipoib which is the only user of this thing.
      Initialization of the private part of neighbor entries happens in the ipoib
      start_xmit routine, which is not reached when the device is down.  But it
      would be better to add an explicit test for neigh->dead in any case.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ecbb4169
    • T
      [NET]: Fix fib_rules compatibility breakage · e1701c68
      Thomas Graf committed
      Based upon a patch from Patrick McHardy.

      The fib_rules netlink attribute policy introduced in 2.6.19 broke
      userspace compatibility. When specifying a rule with "from all"
      or "to all", iproute adds a zero byte long netlink attribute,
      but the policy requires all addresses to have a size equal to
      sizeof(struct in_addr)/sizeof(struct in6_addr), resulting in a
      validation error.

      Check attribute length of FRA_SRC/FRA_DST in the generic framework
      by letting the family specific rules implementation provide the
      length of an address. Report an error if address length is non-zero
      but no address attribute is provided. Fix actual bug by
      checking address length for non-zero instead of relying on
      availability of attribute.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e1701c68
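The validation rule described above can be sketched as a small stand-alone check. The struct `addr_attr` and the function `validate_rule_addr` are hypothetical simplifications: the real code works on netlink attributes, and reading "address length" as the prefix length carried in the rule header is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one address attribute in a rule request. */
struct addr_attr {
    bool   present;     /* was the attribute supplied at all? */
    size_t len;         /* payload length in bytes (0 for "from all") */
};

/*
 * prefix_len: prefix length from the rule header, in bits (0 means "all").
 * addr_size:  address size supplied by the address family (4 or 16 bytes).
 * Returns 0 if the combination is acceptable, -1 otherwise.
 */
static int validate_rule_addr(unsigned prefix_len, struct addr_attr attr,
                              size_t addr_size)
{
    if (prefix_len == 0)
        return 0;               /* "from all": attribute may be empty */
    if (!attr.present)
        return -1;              /* prefix given but no address supplied */
    if (attr.len != addr_size)
        return -1;              /* address payload has the wrong size */
    return 0;
}
```

Keying the decision on the prefix length rather than on the mere presence of the attribute is what lets a zero-length "from all" attribute pass while still rejecting malformed addresses.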
  5. 20 Mar, 2007 2 commits
  6. 08 Mar, 2007 1 commit
    • E
      [IPSEC]: xfrm_policy delete security check misplaced · ef41aaa0
      Eric Paris committed
      The security hooks to check permissions to remove an xfrm_policy were
      actually done after the policy was removed.  Since the unlinking and
      deletion are done in xfrm_policy_by* functions this moves the hooks
      inside those 2 functions.  There we have all the information needed to
      do the security check and it can be done before the deletion.  Since
      auditing requires the result of that security check err has to be passed
      back and forth from the xfrm_policy_by* functions.
      
      This patch also fixes a bug where a deletion that failed the security
      check could cause improper accounting on the xfrm_policy
      (xfrm_get_policy didn't have a put on the exit path for the hold taken
      by xfrm_policy_by*)
      
      It also fixes the return code when no policy is found in
      xfrm_add_pol_expire.  In old code (at least back in the 2.6.18 days) err
      wasn't used before the return when no policy is found and so the
      initialization would cause err to be ENOENT.  But since err has since
      been used above when we don't get a policy back from the xfrm_policy_by*
      function we would always return 0 instead of the intended ENOENT.  Also
      fixed some white space damage in the same area.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Venkat Yekkirala <vyekkirala@trustedcs.com>
      Acked-by: James Morris <jmorris@namei.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef41aaa0
  7. 07 Mar, 2007 1 commit
  8. 06 Mar, 2007 2 commits
    • E
      187f5f84
    • P
      [NETFILTER]: conntrack: fix {nf,ip}_ct_iterate_cleanup endless loops · ec68e97d
      Patrick McHardy committed
      Fix {nf,ip}_ct_iterate_cleanup unconfirmed list handling:
      
      - unconfirmed entries can not be killed manually, they are removed on
        confirmation or final destruction of the conntrack entry, which means
        we might iterate forever without making forward progress.
      
        This can happen in combination with the conntrack event cache, which
        holds a reference to the conntrack entry, which is only released when
        the packet makes it all the way through the stack or a different
        packet is handled.
      
      - taking references to an unconfirmed entry and using it outside the
        locked section doesn't work, the list entries are not refcounted and
        another CPU might already be waiting to destroy the entry
      
      What the code really wants to do is make sure the references of the hash
      table to the selected conntrack entries are released, so they will be
      destroyed once all references from skbs and the event cache are dropped.
      
      Since unconfirmed entries haven't even entered the hash yet, simply mark
      them as dying and skip confirmation based on that.
      
      Reported and tested by Chuck Ebbert <cebbert@redhat.com>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec68e97d
  9. 03 Mar, 2007 1 commit
    • W
      [NET]: Fix bugs in "Whether sock accept queue is full" checking · 8488df89
      Wei Dong committed
      While using Linux TCP sockets, I found a bug in the function
      sk_acceptq_is_full().

      When a new SYN arrives, the TCP module first checks its validity. If it is
      valid, the server sends a SYN,ACK to the client and adds the sock to the syn
      hash table. When the server later receives a valid ACK for that SYN,ACK
      from the client, it accepts the connection and increases
      sk->sk_ack_backlog, which is done in function tcp_check_req(). We check
      whether the accept queue is full in function tcp_v4_syn_recv_sock().

      Consider an example:

      After the listen(sockfd, 1) system call, sk->sk_max_ack_backlog is set to
      1. As we know, sk->sk_ack_backlog is initialized to 0. Assume the accept()
      system call is not invoked for now.

      1. 1st connection comes. Invoke sk_acceptq_is_full():
         sk->sk_ack_backlog=0, sk->sk_max_ack_backlog=1, so the function returns 0.
         Accept this connection and increase sk->sk_ack_backlog.
      2. 2nd connection comes. Invoke sk_acceptq_is_full():
         sk->sk_ack_backlog=1, sk->sk_max_ack_backlog=1, so the function returns 0.
         Accept this connection and increase sk->sk_ack_backlog.
      3. 3rd connection comes. Invoke sk_acceptq_is_full():
         sk->sk_ack_backlog=2, sk->sk_max_ack_backlog=1, so the function returns 1.
         Refuse this connection.

      I think this is a bug: after the listen system call, sk->sk_max_ack_backlog=1,
      but the socket can now accept 2 connections.
      Signed-off-by: Wei Dong <weid@np.css.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8488df89
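The off-by-one in the walk-through can be reproduced with a small user-space model. The struct `fake_sock` and all three functions are hypothetical; the `>` comparison is inferred from the return values in the example above, and the `>=` variant is shown only for contrast, since the log does not include the actual patch.

```c
#include <stdbool.h>

struct fake_sock {
    unsigned ack_backlog;       /* connections waiting for accept() */
    unsigned max_ack_backlog;   /* value passed to listen() */
};

/* The check as the walk-through implies: full only when backlog > max. */
static bool acceptq_is_full_gt(const struct fake_sock *sk)
{
    return sk->ack_backlog > sk->max_ack_backlog;
}

/* A stricter comparison, for contrast: full once backlog reaches max. */
static bool acceptq_is_full_ge(const struct fake_sock *sk)
{
    return sk->ack_backlog >= sk->max_ack_backlog;
}

/* Count how many of `attempts` connections get queued with no accept(). */
static unsigned count_queued(unsigned max, unsigned attempts,
                             bool (*is_full)(const struct fake_sock *))
{
    struct fake_sock sk = { 0, max };
    unsigned i;

    for (i = 0; i < attempts; i++)
        if (!is_full(&sk))
            sk.ack_backlog++;
    return sk.ack_backlog;
}
```

Calling `count_queued(1, 3, acceptq_is_full_gt)` models the three connection attempts from the example with a backlog of 1; swapping in `acceptq_is_full_ge` shows how the stricter comparison changes the count.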
  10. 01 Mar, 2007 1 commit
    • P
      [NET]: Handle disabled preemption in gfp_any() · 4498121c
      Patrick McHardy committed
      ctnetlink uses netlink_unicast from an atomic_notifier_chain
      (which is called within an RCU read side critical section)
      without holding further locks. netlink_unicast calls netlink_trim
      with the result of gfp_any() for the gfp flags, which are passed
      down to pskb_expand_head. gfp_any() only checks for softirq
      context and returns GFP_KERNEL, resulting in this warning:
      
      BUG: sleeping function called from invalid context at mm/slab.c:3032
      in_atomic():1, irqs_disabled():0
      no locks held by rmmod/7010.
      
      Call Trace:
       [<ffffffff8109467f>] debug_show_held_locks+0x9/0xb
       [<ffffffff8100b0b4>] __might_sleep+0xd9/0xdb
       [<ffffffff810b5082>] __kmalloc+0x68/0x110
       [<ffffffff811ba8f2>] pskb_expand_head+0x4d/0x13b
       [<ffffffff81053147>] netlink_broadcast+0xa5/0x2e0
       [<ffffffff881cd1d7>] :nfnetlink:nfnetlink_send+0x83/0x8a
       [<ffffffff8834f6a6>] :nf_conntrack_netlink:ctnetlink_conntrack_event+0x94c/0x96a
       [<ffffffff810624d6>] notifier_call_chain+0x29/0x3e
       [<ffffffff8106251d>] atomic_notifier_call_chain+0x32/0x60
       [<ffffffff881d266d>] :nf_conntrack:destroy_conntrack+0xa5/0x1d3
       [<ffffffff881d194e>] :nf_conntrack:nf_ct_cleanup+0x8c/0x12c
       [<ffffffff881d4614>] :nf_conntrack:kill_l3proto+0x0/0x13
       [<ffffffff881d482a>] :nf_conntrack:nf_conntrack_l3proto_unregister+0x90/0x94
       [<ffffffff883551b3>] :nf_conntrack_ipv4:nf_conntrack_l3proto_ipv4_fini+0x2b/0x5d
       [<ffffffff8109d44f>] sys_delete_module+0x1b5/0x1e6
       [<ffffffff8105f245>] trace_hardirqs_on_thunk+0x35/0x37
       [<ffffffff8105911e>] system_call+0x7e/0x83
      
      Since netlink_unicast is supposed to be callable from within RCU
      read side critical sections, make gfp_any() check for in_atomic()
      instead of in_softirq().
      
      Additionally, nfnetlink_send needs to use gfp_any() as well for the
      call to netlink_broadcast.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4498121c
  11. 27 Feb, 2007 1 commit
  12. 14 Feb, 2007 2 commits
  13. 13 Feb, 2007 4 commits
  14. 11 Feb, 2007 5 commits
  15. 09 Feb, 2007 4 commits