1. 05 12月, 2012 2 次提交
  2. 03 12月, 2012 2 次提交
    • J
      netfilter: nf_nat: Handle routing changes in MASQUERADE target · a0ecb85a
      Jozsef Kadlecsik 提交于
      When the route changes (backup default route, VPNs) which affect a
      masqueraded target, the packets were sent out with the outdated source
      address. The patch addresses the issue by comparing the outgoing interface
      directly with the masqueraded interface in the nat table.
      
      Events are inefficient in this case, because it'd require adding route
      events to the network core and then scanning the whole conntrack table
      and re-checking the route for all entry.
      Signed-off-by: NJozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      a0ecb85a
    • W
      tcp: don't abort splice() after small transfers · 02275a2e
      Willy Tarreau 提交于
      TCP coalescing added a regression in splice(socket->pipe) performance,
      for some workloads because of the way tcp_read_sock() is implemented.
      
      The reason for this is the break when (offset + 1 != skb->len).
      
      As we released the socket lock, this condition is possible if TCP stack
      added a fragment to the skb, which can happen with TCP coalescing.
      
      So let's go back to the beginning of the loop when this happens,
      to give a chance to splice more frags per system call.
      
      Doing so fixes the issue and makes GRO 10% faster than LRO
      on CPU-bound splice() workloads instead of the opposite.
      Signed-off-by: NWilly Tarreau <w@1wt.eu>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02275a2e
  3. 02 12月, 2012 1 次提交
    • E
      tcp: change default tcp hash size · fd90b29d
      Eric Dumazet 提交于
      As time passed, available memory increased faster than number of
      concurrent tcp sockets.
      
      As a result, a machine with 4GB of ram gets a hash table
      with 524288 slots, using 8388608 bytes of memory.
      
      Lets change that by a 16x factor (one slot for 128 KB of ram)
      
      Even if a small machine needs a _lot_ of sockets, tcp lookups are now
      very efficient, using one cache line per socket.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd90b29d
  4. 01 12月, 2012 1 次提交
    • E
      net: move inet_dport/inet_num in sock_common · ce43b03e
      Eric Dumazet 提交于
      commit 68835aba (net: optimize INET input path further)
      moved some fields used for tcp/udp sockets lookup in the first cache
      line of struct sock_common.
      
      This patch moves inet_dport/inet_num as well, filling a 32bit hole
      on 64 bit arches and reducing number of cache line misses in lookups.
      
      Also change INET_MATCH()/INET_TW_MATCH() to perform the ports match
      before addresses match, as this check is more discriminant.
      
      Remove the hash check from MATCH() macros because we dont need to
      re validate the hash value after taking a refcount on socket, and
      use likely/unlikely compiler hints, as the sk_hash/hash check
      makes the following conditional tests 100% predicted by cpu.
      
      Introduce skc_addrpair/skc_portpair pair values to better
      document the alignment requirements of the port/addr pairs
      used in the various MATCH() macros, and remove some casts.
      
      The namespace check can also be done at last.
      
      This slightly improves TCP/UDP lookup times.
      
      IP/TCP early demux needs inet->rx_dst_ifindex and
      TCP needs inet->min_ttl, lets group them together in same cache line.
      
      With help from Ben Hutchings & Joe Perches.
      
      Idea of this patch came after Ling Ma proposal to move skc_hash
      to the beginning of struct sock_common, and should allow him
      to submit a final version of his patch. My tests show an improvement
      doing so.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Ling Ma <ling.ma.program@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ce43b03e
  5. 27 11月, 2012 2 次提交
  6. 26 11月, 2012 2 次提交
  7. 23 11月, 2012 2 次提交
    • J
      ipv4: do not cache looped multicasts · 63617421
      Julian Anastasov 提交于
      	Starting from 3.6 we cache output routes for
      multicasts only when using route to 224/4. For local receivers
      we can set RTCF_LOCAL flag depending on the membership but
      in such case we use maddr and saddr which are not caching
      keys as before. Additionally, we can not use same place to
      cache routes that differ in RTCF_LOCAL flag value.
      
      	Fix it by caching only RTCF_MULTICAST entries
      without RTCF_LOCAL (send-only, no loopback). As a side effect,
      we avoid unneeded lookup for fnhe when not caching because
      multicasts are not redirected and they do not learn PMTU.
      
      	Thanks to Maxime Bizon for showing the caching
      problems in __mkroute_output for 3.6 kernels: different
      RTCF_LOCAL flag in cache can lead to wrong ip_mc_output or
      ip_output call and the visible problem is that traffic can
      not reach local receivers via loopback.
      Reported-by: NMaxime Bizon <mbizon@freebox.fr>
      Tested-by: NMaxime Bizon <mbizon@freebox.fr>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63617421
    • A
      ipv6: adapt connect for repair move · 2b916477
      Andrey Vagin 提交于
      This is work the same as for ipv4.
      
      All other hacks about tcp repair are in common code for ipv4 and ipv6,
      so this patch is enough for repairing ipv6 connections.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b916477
  8. 19 11月, 2012 6 次提交
    • E
      net: Make CAP_NET_BIND_SERVICE per user namespace · 3594698a
      Eric W. Biederman 提交于
      Allow privileged users in any user namespace to bind to
      privileged sockets in network namespaces they control.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3594698a
    • E
      net: Enable a userns root rtnl calls that are safe for unprivilged users · b51642f6
      Eric W. Biederman 提交于
      - Only allow moving network devices to network namespaces you have
        CAP_NET_ADMIN privileges over.
      
      - Enable creating/deleting/modifying interfaces
      - Enable adding/deleting addresses
      - Enable adding/setting/deleting neighbour entries
      - Enable adding/removing routes
      - Enable adding/removing fib rules
      - Enable setting the forwarding state
      - Enable adding/removing ipv6 address labels
      - Enable setting bridge parameter
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b51642f6
    • E
      net: Enable some sysctls that are safe for the userns root · c027aab4
      Eric W. Biederman 提交于
      - Enable the per device ipv4 sysctls:
         net/ipv4/conf/<if>/forwarding
         net/ipv4/conf/<if>/mc_forwarding
         net/ipv4/conf/<if>/accept_redirects
         net/ipv4/conf/<if>/secure_redirects
         net/ipv4/conf/<if>/shared_media
         net/ipv4/conf/<if>/rp_filter
         net/ipv4/conf/<if>/send_redirects
         net/ipv4/conf/<if>/accept_source_route
         net/ipv4/conf/<if>/accept_local
         net/ipv4/conf/<if>/src_valid_mark
         net/ipv4/conf/<if>/proxy_arp
         net/ipv4/conf/<if>/medium_id
         net/ipv4/conf/<if>/bootp_relay
         net/ipv4/conf/<if>/log_martians
         net/ipv4/conf/<if>/tag
         net/ipv4/conf/<if>/arp_filter
         net/ipv4/conf/<if>/arp_announce
         net/ipv4/conf/<if>/arp_ignore
         net/ipv4/conf/<if>/arp_accept
         net/ipv4/conf/<if>/arp_notify
         net/ipv4/conf/<if>/proxy_arp_pvlan
         net/ipv4/conf/<if>/disable_xfrm
         net/ipv4/conf/<if>/disable_policy
         net/ipv4/conf/<if>/force_igmp_version
         net/ipv4/conf/<if>/promote_secondaries
         net/ipv4/conf/<if>/route_localnet
      
      - Enable the global ipv4 sysctl:
         net/ipv4/ip_forward
      
      - Enable the per device ipv6 sysctls:
         net/ipv6/conf/<if>/forwarding
         net/ipv6/conf/<if>/hop_limit
         net/ipv6/conf/<if>/mtu
         net/ipv6/conf/<if>/accept_ra
         net/ipv6/conf/<if>/accept_redirects
         net/ipv6/conf/<if>/autoconf
         net/ipv6/conf/<if>/dad_transmits
         net/ipv6/conf/<if>/router_solicitations
         net/ipv6/conf/<if>/router_solicitation_interval
         net/ipv6/conf/<if>/router_solicitation_delay
         net/ipv6/conf/<if>/force_mld_version
         net/ipv6/conf/<if>/use_tempaddr
         net/ipv6/conf/<if>/temp_valid_lft
         net/ipv6/conf/<if>/temp_prefered_lft
         net/ipv6/conf/<if>/regen_max_retry
         net/ipv6/conf/<if>/max_desync_factor
         net/ipv6/conf/<if>/max_addresses
         net/ipv6/conf/<if>/accept_ra_defrtr
         net/ipv6/conf/<if>/accept_ra_pinfo
         net/ipv6/conf/<if>/accept_ra_rtr_pref
         net/ipv6/conf/<if>/router_probe_interval
         net/ipv6/conf/<if>/accept_ra_rt_info_max_plen
         net/ipv6/conf/<if>/proxy_ndp
         net/ipv6/conf/<if>/accept_source_route
         net/ipv6/conf/<if>/optimistic_dad
         net/ipv6/conf/<if>/mc_forwarding
         net/ipv6/conf/<if>/disable_ipv6
         net/ipv6/conf/<if>/accept_dad
         net/ipv6/conf/<if>/force_tllao
      
      - Enable the global ipv6 sysctls:
         net/ipv6/bindv6only
         net/ipv6/icmp/ratelimit
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c027aab4
    • E
      net: Allow userns root to control ipv4 · 52e804c6
      Eric W. Biederman 提交于
      Allow an unpriviled user who has created a user namespace, and then
      created a network namespace to effectively use the new network
      namespace, by reducing capable(CAP_NET_ADMIN) and
      capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
      CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.
      
      Settings that merely control a single network device are allowed.
      Either the network device is a logical network device where
      restrictions make no difference or the network device is hardware NIC
      that has been explicity moved from the initial network namespace.
      
      In general policy and network stack state changes are allowed
      while resource control is left unchanged.
      
      Allow creating raw sockets.
      Allow the SIOCSARP ioctl to control the arp cache.
      Allow the SIOCSIFFLAG ioctl to allow setting network device flags.
      Allow the SIOCSIFADDR ioctl to allow setting a netdevice ipv4 address.
      Allow the SIOCSIFBRDADDR ioctl to allow setting a netdevice ipv4 broadcast address.
      Allow the SIOCSIFDSTADDR ioctl to allow setting a netdevice ipv4 destination address.
      Allow the SIOCSIFNETMASK ioctl to allow setting a netdevice ipv4 netmask.
      Allow the SIOCADDRT and SIOCDELRT ioctls to allow adding and deleting ipv4 routes.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
      adding, changing and deleting gre tunnels.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
      adding, changing and deleting ipip tunnels.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
      adding, changing and deleting ipsec virtual tunnel interfaces.
      
      Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC,
      MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing
      sockets.
      
      Allow setting and receiving IPOPT_CIPSO, IP_OPT_SEC, IP_OPT_SID and
      arbitrary ip options.
      
      Allow setting IP_SEC_POLICY/IP_XFRM_POLICY ipv4 socket option.
      Allow setting the IP_TRANSPARENT ipv4 socket option.
      Allow setting the TCP_REPAIR socket option.
      Allow setting the TCP_CONGESTION socket option.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52e804c6
    • E
      net: Push capable(CAP_NET_ADMIN) into the rtnl methods · dfc47ef8
      Eric W. Biederman 提交于
      - In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check
        to ns_capable(net->user-ns, CAP_NET_ADMIN).  Allowing unprivileged
        users to make netlink calls to modify their local network
        namespace.
      
      - In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so
        that calls that are not safe for unprivileged users are still
        protected.
      
      Later patches will remove the extra capable calls from methods
      that are safe for unprivilged users.
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dfc47ef8
    • E
      net: Don't export sysctls to unprivileged users · 464dc801
      Eric W. Biederman 提交于
      In preparation for supporting the creation of network namespaces
      by unprivileged users, modify all of the per net sysctl exports
      and refuse to allow them to unprivileged users.
      
      This makes it safe for unprivileged users in general to access
      per net sysctls, and allows sysctls to be exported to unprivileged
      users on an individual basis as they are deemed safe.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      464dc801
  9. 17 11月, 2012 1 次提交
  10. 16 11月, 2012 7 次提交
  11. 15 11月, 2012 6 次提交
  12. 14 11月, 2012 1 次提交
    • E
      tcp: tcp_replace_ts_recent() should not be called from tcp_validate_incoming() · bd090dfc
      Eric Dumazet 提交于
      We added support for RFC 5961 in latest kernels but TCP fails
      to perform exhaustive check of ACK sequence.
      
      We can update our view of peer tsval from a frame that is
      later discarded by tcp_ack()
      
      This makes timestamps enabled sessions vulnerable to injection of
      a high tsval : peers start an ACK storm, since the victim
      sends a dupack each time it receives an ACK from the other peer.
      
      As tcp_validate_incoming() is called before tcp_ack(), we should
      not peform tcp_replace_ts_recent() from it, and let callers do it
      at the right time.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: H.K. Jerry Chu <hkchu@google.com>
      Cc: Romain Francoise <romain@orebokech.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd090dfc
  13. 13 11月, 2012 1 次提交
  14. 12 11月, 2012 1 次提交
  15. 10 11月, 2012 2 次提交
  16. 04 11月, 2012 2 次提交
    • C
      net: inet_diag -- Return error code if protocol handler is missed · cacb6ba0
      Cyrill Gorcunov 提交于
      We've observed that in case if UDP diag module is not
      supported in kernel the netlink returns NLMSG_DONE without
      notifying a caller that handler is missed.
      
      This patch makes __inet_diag_dump to return error code instead.
      
      So as example it become possible to detect such situation
      and handle it gracefully on userspace level.
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      CC: David Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cacb6ba0
    • E
      tcp: better retrans tracking for defer-accept · e6c022a4
      Eric Dumazet 提交于
      For passive TCP connections using TCP_DEFER_ACCEPT facility,
      we incorrectly increment req->retrans each time timeout triggers
      while no SYNACK is sent.
      
      SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
      which we received the ACK from client). Only the last SYNACK is sent
      so that we can receive again an ACK from client, to move the req into
      accept queue. We plan to change this later to avoid the useless
      retransmit (and potential problem as this SYNACK could be lost)
      
      TCP_INFO later gives wrong information to user, claiming imaginary
      retransmits.
      
      Decouple req->retrans field into two independent fields :
      
      num_retrans : number of retransmit
      num_timeout : number of timeouts
      
      num_timeout is the counter that is incremented at each timeout,
      regardless of actual SYNACK being sent or not, and used to
      compute the exponential timeout.
      
      Introduce inet_rtx_syn_ack() helper to increment num_retrans
      only if ->rtx_syn_ack() succeeded.
      
      Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
      when we re-send a SYNACK in answer to a (retransmitted) SYN.
      Prior to this patch, we were not counting these retransmits.
      
      Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
      only if a synack packet was successfully queued.
      Reported-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Elliott Hughes <enh@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e6c022a4
  17. 03 11月, 2012 1 次提交