1. 30 Sep, 2016 4 commits
    • rxrpc: Reduce the rxrpc_local::services list to a pointer · 1e9e5c95
      David Howells committed
      Reduce the rxrpc_local::services list to just a pointer as we don't permit
      multiple service endpoints to bind to a single transport endpoint (this is
      excluded by rxrpc_lookup_local()).
      
      The reason we don't allow this is that if you send a request to an AFS
      filesystem service, it will try to talk back to your cache manager on the
      port you sent from (this is how file change notifications are handled).  To
      prevent someone from stealing your CM callbacks, we don't let AF_RXRPC
      sockets share a UDP socket if at least one of them has a service bound.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: When activating client conn channels, do state check inside lock · 2629c7fa
      David Howells committed
      In rxrpc_activate_channels(), the connection cache state is checked outside
      of the lock, which means it can change whilst we're waking calls up,
      thereby changing whether or not we're allowed to wake calls up.
      
      Fix this by moving the check inside the locked region.  The check to see if
      all the channels are currently busy can stay outside of the locked region.
      
      Whilst we're at it:
      
       (1) Split the locked section out into its own function so that we can call
           it from other places in a later patch.
      
       (2) Determine the mask of channels dependent on the state as we're going
           to add another state in a later patch that will restrict the number of
           simultaneous calls to 1 on a connection.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Make Tx loss-injection go through normal return and adjust tracing · a1767077
      David Howells committed
      In rxrpc_send_data_packet() make the loss-injection path return through the
      same code as the transmission path so that the RTT determination is
      initiated and any future timer shuffling will be done, despite the packet
      having been binned.
      
      Whilst we're at it:
      
       (1) Add to the tx_data tracepoint an indication of whether or not we're
           retransmitting a data packet.
      
       (2) When we're deciding whether or not to request an ACK, rather than
           checking if we're in fast-retransmit mode check instead if we're
           retransmitting.
      
       (3) Don't invoke the lose_skb tracepoint when losing a Tx packet as we're
           not altering the sk_buff refcount nor are we just seeing it after
           getting it off the Tx list.
      
       (4) The rxrpc_skb_tx_lost note is then no longer used so remove it.
      
       (5) rxrpc_lose_skb() no longer needs to deal with rxrpc_skb_tx_lost.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Fix exclusive client connections · 8732db67
      David Howells committed
      Exclusive connections are currently reusable (which they shouldn't be)
      because rxrpc_alloc_client_connection() checks the exclusive flag in the
      rxrpc_connection struct before it's initialised from the function
      parameters.  This means that the DONT_REUSE flag doesn't get set.
      
      Fix this by checking the function parameters for the exclusive flag.
      Signed-off-by: David Howells <dhowells@redhat.com>
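
The ordering problem in the exclusive-connection fix can be sketched in isolation. This is a minimal illustration, assuming simplified stand-in structures; every `sketch_` name is hypothetical and none of this is the kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the connection and its parameters. */
struct sketch_conn_params {
    bool exclusive;
};

struct sketch_conn {
    struct sketch_conn_params params;
    bool dont_reuse;
};

/* Buggy ordering: the flag is read from the freshly zeroed connection
 * before the caller's parameters have been copied into it. */
static void sketch_alloc_buggy(struct sketch_conn *conn,
                               const struct sketch_conn_params *cp)
{
    memset(conn, 0, sizeof(*conn));
    if (conn->params.exclusive)     /* always false at this point */
        conn->dont_reuse = true;
    conn->params = *cp;
}

/* Fixed ordering: test the function parameters instead. */
static void sketch_alloc_fixed(struct sketch_conn *conn,
                               const struct sketch_conn_params *cp)
{
    memset(conn, 0, sizeof(*conn));
    if (cp->exclusive)
        conn->dont_reuse = true;
    conn->params = *cp;
}
```

With an exclusive request, the buggy variant leaves `dont_reuse` clear while the fixed variant sets it.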
  2. 29 Sep, 2016 1 commit
  3. 28 Sep, 2016 5 commits
  4. 26 Sep, 2016 4 commits
    • netfilter: nf_log: get rid of XT_LOG_* macros · 8cb2a7d5
      Liping Zhang committed
      nf_log is used by both nftables and iptables, so using XT_LOG_XXX macros
      here is not appropriate. Replace them with NF_LOG_XXX.
      Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nft_log: complete NFTA_LOG_FLAGS attr support · ff107d27
      Liping Zhang committed
      The NFTA_LOG_FLAGS attribute is already supported, but the related
      NF_LOG_XXX flags are not exposed to userspace. So we cannot
      explicitly enable log flags to log the uid, TCP sequence, IP options
      and so on; e.g. a rule such as "nft add rule filter output log uid"
      is not supported yet.

      So move the NF_LOG_XXX macro definitions to uapi/../nf_log.h. To
      stay consistent with other modules, change NF_LOG_MASK to
      refer to all supported log flags. On the other hand, add a new
      NF_LOG_DEFAULT_MASK to refer to the original default log flags.

      Finally, if the user specifies unsupported log flags, or sets
      NFTA_LOG_GROUP and NFTA_LOG_FLAGS at the same time, report EINVAL to
      userspace.
      Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
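
The validation rule in the last paragraph of the nft_log message can be sketched as follows. The flag values and `SKETCH_`/`sketch_` names are illustrative assumptions; the authoritative definitions are the ones the commit moves into the uapi header.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits only; not copied from the kernel header. */
#define SKETCH_LOG_TCPSEQ 0x01u
#define SKETCH_LOG_TCPOPT 0x02u
#define SKETCH_LOG_IPOPT  0x04u
#define SKETCH_LOG_UID    0x08u
#define SKETCH_LOG_MASK   0x0fu   /* every flag this sketch supports */

/* Returns 0 if the attribute combination is acceptable, -EINVAL otherwise. */
static int sketch_log_check(bool has_group, bool has_flags, uint32_t flags)
{
    /* The group and flags attributes are mutually exclusive. */
    if (has_group && has_flags)
        return -EINVAL;
    /* Reject any bit outside the supported mask. */
    if (has_flags && (flags & ~SKETCH_LOG_MASK))
        return -EINVAL;
    return 0;
}
```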
    • netfilter: nf_tables: add range expression · 0f3cd9b3
      Pablo Neira Ayuso committed
      Inverse ranges != [a,b] are not currently possible because rules are
      composites of && operations, and we need to express this:
      
      	data < a || data > b
      
      This patch adds a new range expression. Positive ranges can already be
      expressed through two cmp expressions:
      
      	cmp(sreg, data, >=)
      	cmp(sreg, data, <=)
      
      This new range expression provides an alternative way to express this.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
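
The point of the range expression is that a single evaluation can answer both "inside [a,b]" and "outside [a,b]", which a chain of AND-ed comparisons cannot. A minimal sketch, assuming hypothetical names and plain integers rather than the register data the real expression operates on:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operator names for this sketch. */
enum sketch_range_op { SKETCH_RANGE_EQ, SKETCH_RANGE_NEQ };

/* True when rule evaluation should continue for this value. */
static bool sketch_range_eval(uint32_t data, uint32_t a, uint32_t b,
                              enum sketch_range_op op)
{
    bool inside;

    assert(a <= b);
    inside = data >= a && data <= b;
    /* NEQ is the inverse range: data < a || data > b. */
    return op == SKETCH_RANGE_EQ ? inside : !inside;
}
```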
    • netfilter: xt_socket: fix transparent match for IPv6 request sockets · 7a682575
      KOVACS Krisztian committed
      The introduction of the TCP_NEW_SYN_RECV state and the addition of request
      sockets to the ehash table seem to have broken the --transparent option
      of the socket match for IPv6 (around commit a9407000).
      
      Now that the socket lookup finds the TCP_NEW_SYN_RECV socket instead of the
      listener, the --transparent option tries to match on the no_srccheck flag
      of the request socket.
      
      Unfortunately, that flag was only set for IPv4 sockets in tcp_v4_init_req()
      by copying the transparent flag of the listener socket. This effectively
      causes '-m socket --transparent' to not match on the ACK packet sent by the
      client in a TCP handshake.
      
      Based on the suggestion from Eric Dumazet, this change moves the code
      initializing no_srccheck to tcp_conn_request(), rendering the above
      scenario working again.
      
      Fixes: a9407000 ("netfilter: xt_socket: prepare for TCP_NEW_SYN_RECV support")
      Signed-off-by: Alex Badics <alex.badics@balabit.com>
      Signed-off-by: KOVACS Krisztian <hidden@balabit.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  5. 25 Sep, 2016 21 commits
    • netfilter: evict stale entries when user reads /proc/net/nf_conntrack · 58e207e4
      Florian Westphal committed
      Fabian reports a possible conntrack memory leak (could not reproduce so
      far), however, one minor issue can be easily resolved:
      
      > cat /proc/net/nf_conntrack | wc -l = 5
      > 4 minutes required to clean up the table.
      
      We should not report those timed-out entries to the user in the first place.
      And instead of just skipping those timed-out entries while iterating over
      the table we can also zap them (we already do this during ctnetlink
      walks, but I forgot about the /proc interface).
      
      Fixes: f330a7fd ("netfilter: conntrack: get rid of conntrack timer")
      Reported-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: xt_hashlimit: Create revision 2 to support higher pps rates · 11d5f157
      Vishwanath Pai committed
      Create a new revision for the hashlimit iptables extension module. Rev 2
      will support higher pps of up to 1 million, whereas rev 1 supports only 10k.
      
      To support this we have to increase the size of the variables avg and
      burst in hashlimit_cfg to 64-bit. Create two new structs hashlimit_cfg2
      and xt_hashlimit_mtinfo2 and also create newer versions of all the
      functions for match, checkentry and destroy.
      
      Some of the functions like hashlimit_mt, hashlimit_mt_check etc are very
      similar in both rev1 and rev2 with only minor changes, so I have split
      those functions and moved all the common code to a *_common function.
      Signed-off-by: Vishwanath Pai <vpai@akamai.com>
      Signed-off-by: Joshua Hunt <johunt@akamai.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
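
One way to see why a larger scale and 64-bit fields go together is the sketch below. It assumes the rate is stored as an average inter-packet interval multiplied by a fixed scale; the two scale constants are assumptions chosen to match the 10k and 1 million limits quoted in the message, and all `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SCALE_V1 10000ull     /* assumed revision 1 scale */
#define SKETCH_SCALE_V2 1000000ull   /* assumed revision 2 scale */

/* Average interval between packets, in 1/scale-second units, for a rate
 * of `pkts` packets every `secs` seconds.  A result of zero means the
 * rate is too fast to represent at this scale. */
static uint64_t sketch_avg_interval(uint64_t scale, uint64_t secs,
                                    uint64_t pkts)
{
    return scale * secs / pkts;
}
```

Under these assumptions 20000 packets per second rounds down to zero at the smaller scale but is representable at the larger one, while a slow rate such as one packet per day at the larger scale no longer fits in 32 bits.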
    • netfilter: xt_hashlimit: Prepare for revision 2 · 0dc60a45
      Vishwanath Pai committed
      I am planning to add a revision 2 for the hashlimit xtables module to
      support higher packets-per-second rates. This patch renames all the
      functions and variables related to revision 1 by adding _v1 at the
      end of the names.
      Signed-off-by: Vishwanath Pai <vpai@akamai.com>
      Signed-off-by: Joshua Hunt <johunt@akamai.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nft_ct: report error if mark and dir specified simultaneously · 7bfdde70
      Liping Zhang committed
      NFT_CT_MARK is unrelated to direction, so if the NFTA_CT_DIRECTION attr is
      specified, report EINVAL to userspace. This validation check was
      already done in nft_ct_get_init, but we missed it in nft_ct_set_init.
      Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nft_ct: unnecessary to require dir when use ct l3proto/protocol · d767ff2c
      Liping Zhang committed
      Currently, if the user wants to match ct l3proto, the direction must be
      specified, for example:
        # nft add rule filter input ct original l3proto ipv4
                                       ^^^^^^^^
      Otherwise, an error is reported:
        # nft add rule filter input ct l3proto ipv4
        nft add rule filter input ct l3proto ipv4
        <cmdline>:1:1-38: Error: Could not process rule: Invalid argument
        add rule filter input ct l3proto ipv4
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      
      Actually, there's no need to require NFTA_CT_DIRECTION attr, because
      ct l3proto and protocol are unrelated to direction.
      
      And for compatibility, even if the user specifies the NFTA_CT_DIRECTION
      attr, do not report an error; just skip it.
      Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack · 8d11350f
      Gao Feng committed
      It is valid for a TCP RST packet to have no ACK flag set and an all-zero
      ack number. However, the current seqadj code adjusts the "0" ack to an
      invalid ack number. seqadj needs to check the ACK flag before adjusting
      the ack number of such RST packets.
      
      The following is my test case
      
      The client is 10.26.98.245, with one iptables rule added:
      iptables  -I INPUT -p tcp --sport 12345 -m connbytes --connbytes 2:
      --connbytes-dir reply --connbytes-mode packets -j REJECT --reject-with
      tcp-reset
      This iptables rule can generate a TCP RST without the ACK flag.
      
      The server is 10.172.135.55.
      Enable synproxy with seqadjust through the following iptables rules:
      iptables -t raw -A PREROUTING -i eth0 -p tcp -d 10.172.135.55 --dport 12345
      -m tcp --syn -j CT --notrack
      
      iptables -A INPUT -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m conntrack
      --ctstate INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7
      --mss 1460
      iptables -A OUTPUT -o eth0 -p tcp -s 10.172.135.55 --sport 12345 -m conntrack
      --ctstate INVALID,UNTRACKED -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j ACCEPT
      
      The following is my test result.
      
      1. packet trace on client
      root@routers:/tmp# tcpdump -i eth0 tcp port 12345 -n
      tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
      listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
      IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [S], seq 3695959829,
      win 29200, options [mss 1460,sackOK,TS val 452367884 ecr 0,nop,wscale 7],
      length 0
      IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [S.], seq 546723266,
      ack 3695959830, win 0, options [mss 1460,sackOK,TS val 15643479 ecr 452367884,
      nop,wscale 7], length 0
      IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [.], ack 1, win 229,
      options [nop,nop,TS val 452367885 ecr 15643479], length 0
      IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [.], ack 1, win 226,
      options [nop,nop,TS val 15643479 ecr 452367885], length 0
      IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [R], seq 3695959830,
      win 0, length 0
      
      2. seqadj log on server
      [62873.867319] Adjusting sequence number from 602341895->546723267,
      ack from 3695959830->3695959830
      [62873.867644] Adjusting sequence number from 602341895->546723267,
      ack from 3695959830->3695959830
      [62873.869040] Adjusting sequence number from 3695959830->3695959830,
      ack from 0->55618628
      
      To summarize, the seqadj code clearly adjusts the 0 ack when it receives
      a TCP RST packet without the ACK flag.
      Signed-off-by: Gao Feng <fgao@ikuai8.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
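
The guard the seqadj message asks for can be sketched with the numbers from the log above, where the ack offset is 602341895 - 546723267 = 55618628. The structure and function are hypothetical simplifications working in host byte order, not the conntrack code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-byte-order view of the TCP header fields involved. */
struct sketch_tcp {
    uint32_t seq;
    uint32_t ack_seq;
    bool ack;               /* ACK flag */
};

/* Adjust the ack number only when the ACK flag marks it as meaningful. */
static void sketch_adjust_ack(struct sketch_tcp *th, uint32_t ack_off)
{
    if (!th->ack)
        return;             /* e.g. a bare RST: leave the zero ack alone */
    th->ack_seq += ack_off; /* wraps modulo 2^32, like sequence space */
}
```

With the guard in place, the RST from the trace keeps its zero ack number instead of being rewritten to 55618628.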
    • netfilter: replace list_head with single linked list · e3b37f11
      Aaron Conole committed
      The netfilter hook list never uses the prev pointer, and so can be trimmed to
      be a simple singly-linked list.
      
      In addition to having a more lightweight structure for hook traversal,
      struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
      2176 bytes (down from 2240).
      Signed-off-by: Aaron Conole <aconole@bytheb.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
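
The size saving comes from dropping one pointer per list node wherever a hook list head or entry is embedded. A minimal sketch with hypothetical entry types (the real hook entries carry more fields):

```c
#include <assert.h>
#include <stddef.h>

/* Doubly linked entry, in the style of the kernel's struct list_head. */
struct sketch_hook_dlist {
    struct sketch_hook_dlist *next, *prev;
    int priority;
};

/* Singly linked entry: hook traversal only ever follows next. */
struct sketch_hook_slist {
    struct sketch_hook_slist *next;
    int priority;
};

/* Forward-only walk, which is all the hook list needs. */
static size_t sketch_hook_count(const struct sketch_hook_slist *head)
{
    size_t n = 0;

    for (; head; head = head->next)
        n++;
    return n;
}
```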
    • gre: use nla_get_be32() to extract flowinfo · c2675de4
      Lance Richardson committed
      Eliminate a sparse endianness mismatch warning by using nla_get_be32() to
      extract a __be32 value instead of nla_get_u32().
      Signed-off-by: Lance Richardson <lrichard@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Implement slow-start · 57494343
      David Howells committed
      Implement RxRPC slow-start, which is similar to RFC 5681 for TCP.  A
      tracepoint is added to log the state of the congestion management algorithm
      and the decisions it makes.
      
      Notes:
      
       (1) Since we send fixed-size DATA packets (apart from the final packet in
           each phase), counters and calculations are in terms of packets rather
           than bytes.
      
       (2) The ACK packet carries the equivalent of TCP SACK.
      
       (3) The FLIGHT_SIZE calculation in RFC 5681 doesn't seem particularly
           suited to SACK of a small number of packets.  It seems that, almost
           inevitably, by the time three 'duplicate' ACKs have been seen, we have
           narrowed the loss down to one or two missing packets, and the
           FLIGHT_SIZE calculation ends up as 2.
      
       (4) In rxrpc_resend(), if there was no data that apparently needed
           retransmission, we transmit a PING ACK to ask the peer to tell us what
           its Rx window state is.
      Signed-off-by: David Howells <dhowells@redhat.com>
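
Note (1) says the counters are kept in packets rather than bytes. The following is a generic sketch of packet-counted slow start and congestion avoidance in the spirit of RFC 5681, under that assumption; it is not the rxrpc implementation, and the `sketch_` names and the timeout handling are illustrative.

```c
#include <assert.h>

struct sketch_cong {
    unsigned int cwnd;      /* congestion window, in packets */
    unsigned int ssthresh;  /* slow-start threshold, in packets */
    unsigned int acks;      /* ACKs counted since cwnd last grew */
};

/* Called for each ACK that acknowledges new data. */
static void sketch_on_ack(struct sketch_cong *c)
{
    if (c->cwnd < c->ssthresh) {
        /* Slow start: one more packet per ACK. */
        c->cwnd++;
        return;
    }
    /* Congestion avoidance: one more packet per window's worth of ACKs. */
    if (++c->acks >= c->cwnd) {
        c->acks = 0;
        c->cwnd++;
    }
}

/* Called on a retransmission timeout with the number of packets in flight. */
static void sketch_on_timeout(struct sketch_cong *c, unsigned int in_flight)
{
    unsigned int half = in_flight / 2;

    c->ssthresh = half > 2 ? half : 2;
    c->cwnd = 1;
    c->acks = 0;
}
```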
    • rxrpc: Schedule an ACK if the reply to a client call appears overdue · 0d967960
      David Howells committed
      If we've sent all the request data in a client call but haven't seen any
      sign of the reply data yet, schedule an ACK to be sent to the server to
      find out if the reply data got lost.
      
      If the server hasn't yet hard-ACK'd the request data, we send a PING ACK to
      demand a response to find out whether we need to retransmit.
      
      If the server says it has received all of the data, we send an IDLE ACK to
      tell the server that we haven't received anything in the receive phase as
      yet.
      
      To make this work, a non-immediate PING ACK must carry a delay.  I've chosen
      the same as the IDLE ACK for the moment.
      Signed-off-by: David Howells <dhowells@redhat.com>
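
The choice between the two ACK types described in this message reduces to a small decision function. A sketch under the assumption that the three conditions are available as booleans; the enum labels are hypothetical and are not the kernel's ACK reason codes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the possible outcomes. */
enum sketch_ack { SKETCH_ACK_NONE, SKETCH_ACK_PING, SKETCH_ACK_IDLE };

/* Decide which ACK, if any, to schedule once the reply looks overdue. */
static enum sketch_ack sketch_overdue_ack(bool all_request_sent,
                                          bool reply_seen,
                                          bool request_hard_acked)
{
    if (!all_request_sent || reply_seen)
        return SKETCH_ACK_NONE;
    /* The server has not confirmed the request: demand a response. */
    if (!request_hard_acked)
        return SKETCH_ACK_PING;
    /* The server has everything: report that no reply data has arrived. */
    return SKETCH_ACK_IDLE;
}
```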
    • rxrpc: Generate a summary of the ACK state for later use · 31a1b989
      David Howells committed
      Generate a summary of the Tx buffer packet state when an ACK is received
      for use in a later patch that does congestion management.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Delay the resend timer to allow for nsec->jiffies conv error · df0562a7
      David Howells committed
      When determining the resend timer value, we have a value in nsec but the
      timer is in jiffies which may be a million or more times more coarse.
      nsecs_to_jiffies() rounds down - which means that the resend timeout
      expressed as jiffies is very likely earlier than the one expressed as
      nanoseconds from which it was derived.
      
      The problem is that rxrpc_resend() gets triggered by the timer, but can't
      then find anything to resend yet.  It sets the timer again - but gets
      kicked off immediately again and again until the nanosecond-based expiry
      time is reached and we actually retransmit.
      
      Fix this by adding 1 to the jiffies-based resend_at value to counteract the
      rounding and make sure that the timer happens after the nanosecond-based
      expiry is passed.
      
      Alternatives would be to adjust the timestamp on the packets to align
      with the jiffy scale or to switch back to using jiffy timestamps.
      Signed-off-by: David Howells <dhowells@redhat.com>
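
The rounding problem and the one-tick pad can be shown with plain integer arithmetic. This sketch assumes a tick rate of 250 Hz (4,000,000 ns per jiffy); `SKETCH_HZ` and the `sketch_` helpers are hypothetical stand-ins for the kernel's conversion and timer code.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HZ           250ull          /* assumed tick rate */
#define SKETCH_NSEC_PER_SEC 1000000000ull

/* Round-down conversion, matching the behaviour the message describes. */
static uint64_t sketch_nsecs_to_jiffies(uint64_t ns)
{
    return ns / (SKETCH_NSEC_PER_SEC / SKETCH_HZ);
}

/* Timer expiry in jiffies, padded by one tick so that it cannot fire
 * before the nanosecond-based deadline has passed. */
static uint64_t sketch_resend_at(uint64_t now_jiffies, uint64_t delay_ns)
{
    return now_jiffies + sketch_nsecs_to_jiffies(delay_ns) + 1;
}
```

A delay just short of two ticks converts to one jiffy, so without the extra tick the timer could fire almost a full jiffy early and find nothing to resend.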
    • rxrpc: Reinitialise the call ACK and timer state for client reply phase · dd7c1ee5
      David Howells committed
      Clear the ACK reason, ACK timer and resend timer when entering the client
      reply phase when the first DATA packet is received.  New ACKs will be
      proposed once the data is queued.
      
      The resend timer is no longer relevant and we need to cancel ACKs scheduled
      to probe for a lost reply.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Include the last reply DATA serial number in the final ACK · b69d94d7
      David Howells committed
      In a client call, include the serial number of the last DATA packet of the
      reply in the final ACK.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Send an immediate ACK if we fill in a hole · a7056c5b
      David Howells committed
      Send an immediate ACK if we fill in a hole in the buffer left by an
      out-of-sequence packet.  This may allow the congestion management in the peer
      to avoid a retransmission if packets got reordered on the wire.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • netfilter: Only allow sane values in nf_register_net_hook · d4bb5caa
      Aaron Conole committed
      This commit adds an upfront check for sane values to be passed when
      registering a netfilter hook.  This will be used in a future patch for a
      simplified hook list traversal.
      Signed-off-by: Aaron Conole <aconole@bytheb.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: Remove explicit rcu_read_lock in nf_hook_slow · e2361cb9
      Aaron Conole committed
      All of the callers of nf_hook_slow already hold the rcu_read_lock, so this
      cleanup removes the recursive call.  This is just a cleanup, as the locking
      code gracefully handles this situation.
      Signed-off-by: Aaron Conole <aconole@bytheb.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: call nf_hook_ingress with rcu_read_lock · 2c1e2703
      Aaron Conole committed
      This commit ensures that the rcu read-side lock is held while the
      ingress hook is called.  This ensures that a call to nf_hook_slow (and
      ultimately nf_ingress) will be read protected.
      Signed-off-by: Aaron Conole <aconole@bytheb.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: bridge: add and use br_nf_hook_thresh · c5136b15
      Florian Westphal committed
      This replaces the last uses of NF_HOOK_THRESH().
      Followup patch will remove it and rename nf_hook_thresh.
      
      The reason is that inet (non-bridge) netfilter no longer invokes the
      hooks from hooks, so we no longer need the thresh value to skip hooks
      with a lower priority.
      
      The bridge netfilter however may need to do this. br_nf_hook_thresh is a
      wrapper that is supposed to do this, i.e. only call hooks with a
      priority that exceeds NF_BR_PRI_BRNF.
      
      It's used only in the recursion cases of br_netfilter.  It invokes
      nf_hook_slow while holding an rcu read-side critical section to make a
      future cleanup simpler.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Aaron Conole <aconole@bytheb.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: xt_TCPMSS: Refactor the codes to decrease one condition check and more readable · 50f4c7b7
      Gao Feng committed
      The original code performs two condition checks, on dst_mtu(skb_dst(skb))
      and on in_mtu, and its last statement is "min(dst_mtu(skb_dst(skb)),
      in_mtu) - minlen". This can leave the reader wondering whether the
      result could be negative.

      Now the result of min(dst_mtu(skb_dst(skb)), in_mtu) is assigned to a new
      variable, so only one condition check is needed and the code is more
      readable.
      Signed-off-by: Gao Feng <fgao@ikuai8.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
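
The refactored shape described in the xt_TCPMSS message can be sketched as a standalone function. It assumes the two MTUs and the minimum header length are passed in as plain integers; `sketch_clamp_mss` is a hypothetical name, not the function in the module.

```c
#include <assert.h>

/* Returns the clamped MSS, or -1 if the smaller of the two MTUs cannot
 * hold more than the minimum headers. */
static int sketch_clamp_mss(unsigned int dst_mtu, unsigned int in_mtu,
                            unsigned int minlen)
{
    unsigned int min_mtu = dst_mtu < in_mtu ? dst_mtu : in_mtu;

    /* A single check on the minimum makes it plain that the
     * subtraction below cannot underflow. */
    if (min_mtu <= minlen)
        return -1;
    return (int)(min_mtu - minlen);
}
```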
    • rxrpc: Send an ACK after every few DATA packets we receive · 805b21b9
      David Howells committed
      Send an ACK if we haven't sent one for the last two packets we've received.
      This keeps the other end apprised of where we've got to - which is
      important if they're doing slow-start.
      
      We do this in recvmsg so that we can dispatch a packet directly without the
      need to wake up the background thread.
      
      This should possibly be made configurable in future.
      Signed-off-by: David Howells <dhowells@redhat.com>
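
The "ACK every two packets" rule amounts to a small counter kept per call. A minimal sketch, assuming a hypothetical per-call structure and ignoring the other events that can also trigger or reset ACKs:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-call receive state. */
struct sketch_rx {
    unsigned int since_ack;  /* DATA packets consumed since the last ACK */
};

/* Called as each DATA packet is consumed; returns true when an ACK is due. */
static bool sketch_consume_data(struct sketch_rx *rx)
{
    if (++rx->since_ack < 2)
        return false;
    rx->since_ack = 0;
    return true;
}
```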
  6. 24 Sep, 2016 1 commit
    • net: Update API for VF vlan protocol 802.1ad support · 79aab093
      Moshe Shemesh committed
      Introduce a new rtnl UAPI that exposes a list of vlans per VF, giving
      user-space applications the ability to specify it for the VF, as an
      option to support 802.1ad.
      We adjusted the ip link tool to support this option.

      For future use cases, the new UAPI supports multiple vlans. For now we
      limit the list size to a single vlan in the kernel.
      Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward
      compatibility with older versions of the ip link tool.

      Add a vlan protocol parameter to the ndo_set_vf_vlan callback.
      We kept 802.1Q as the drivers' default vlan protocol.
      Suitable ip link tool command examples:
        Set vf vlan protocol 802.1ad:
          ip link set eth0 vf 1 vlan 100 proto 802.1ad
        Set vf to VST (802.1Q) mode:
          ip link set eth0 vf 1 vlan 100 proto 802.1Q
        Or by omitting the new parameter
          ip link set eth0 vf 1 vlan 100
      Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 23 Sep, 2016 4 commits