1. 07 9月, 2016 1 次提交
  2. 03 9月, 2016 1 次提交
  3. 02 9月, 2016 4 次提交
    • V
      net: dsa: remove ds_to_priv · 04bed143
      Vivien Didelot 提交于
      Access the priv member of the dsa_switch structure directly, instead of
      having an unnecessary helper.
      Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04bed143
    • R
      rtnetlink: fdb dump: optimize by saving last interface markers · d297653d
      Roopa Prabhu 提交于
      fdb dumps spanning multiple skb's currently restart from the first
      interface again for every skb. This results in unnecessary
      iterations on the already visited interfaces and their fdb
      entries. In large scale setups, we have seen this to slow
      down fdb dumps considerably. On a system with 30k macs we
      see fdb dumps spanning across more than 300 skbs.
      
      To fix the problem, this patch replaces the existing single fdb
      marker with three markers: netdev hash entries, netdevs and fdb
      index to continue where we left off instead of restarting from the
      first netdev. This is consistent with link dumps.
      
      In the process of fixing the performance issue, this patch also
      re-implements fix done by
      commit 472681d5 ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
      (with an internal fix from Wilson Kok) in the following ways:
      - change ndo_fdb_dump handlers to return error code instead
      of the last fdb index
      - use cb->args strictly for dump frag markers and not error codes.
      This is consistent with other dump functions.
      
      Below results were taken on a system with 1000 netdevs
      and 35085 fdb entries:
      before patch:
      $time bridge fdb show | wc -l
      15065
      
      real    1m11.791s
      user    0m0.070s
      sys 1m8.395s
      
      (existing code does not return all macs)
      
      after patch:
      $time bridge fdb show | wc -l
      35085
      
      real    0m2.017s
      user    0m0.113s
      sys 0m1.942s
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NWilson Kok <wkok@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d297653d
    • G
      rps: flow_dissector: Add the const for the parameter of flow_keys_have_l4 · 66fdd05e
      Gao Feng 提交于
      Add the const for the parameter of flow_keys_have_l4 for the readability.
      Signed-off-by: NGao Feng <fgao@ikuai8.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      66fdd05e
    • D
      rxrpc: Don't expose skbs to in-kernel users [ver #2] · d001648e
      David Howells 提交于
      Don't expose skbs to in-kernel users, such as the AFS filesystem, but
      instead provide a notification hook the indicates that a call needs
      attention and another that indicates that there's a new call to be
      collected.
      
      This makes the following possibilities more achievable:
      
       (1) Call refcounting can be made simpler if skbs don't hold refs to calls.
      
       (2) skbs referring to non-data events will be able to be freed much sooner
           rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
           will be able to consult the call state.
      
       (3) We can shortcut the receive phase when a call is remotely aborted
           because we don't have to go through all the packets to get to the one
           cancelling the operation.
      
       (4) It makes it easier to do encryption/decryption directly between AFS's
           buffers and sk_buffs.
      
       (5) Encryption/decryption can more easily be done in the AFS's thread
           contexts - usually that of the userspace process that issued a syscall
           - rather than in one of rxrpc's background threads on a workqueue.
      
       (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.
      
      To make this work, the following interface function has been added:
      
           int rxrpc_kernel_recv_data(
      		struct socket *sock, struct rxrpc_call *call,
      		void *buffer, size_t bufsize, size_t *_offset,
      		bool want_more, u32 *_abort_code);
      
      This is the recvmsg equivalent.  It allows the caller to find out about the
      state of a specific call and to transfer received data into a buffer
      piecemeal.
      
      afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
      logic between them.  They don't wait synchronously yet because the socket
      lock needs to be dealt with.
      
      Five interface functions have been removed:
      
      	rxrpc_kernel_is_data_last()
          	rxrpc_kernel_get_abort_code()
          	rxrpc_kernel_get_error_number()
          	rxrpc_kernel_free_skb()
          	rxrpc_kernel_data_consumed()
      
      As a temporary hack, sk_buffs going to an in-kernel call are queued on the
      rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
      in-kernel user.  To process the queue internally, a temporary function,
      temp_deliver_data() has been added.  This will be replaced with common code
      between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a
      future patch.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d001648e
  4. 01 9月, 2016 1 次提交
  5. 31 8月, 2016 1 次提交
    • R
      net: lwtunnel: Handle fragmentation · 14972cbd
      Roopa Prabhu 提交于
      Today mpls iptunnel lwtunnel_output redirect expects the tunnel
      output function to handle fragmentation. This is ok but can be
      avoided if we did not do the mpls output redirect too early.
      ie we could wait until ip fragmentation is done and then call
      mpls output for each ip fragment.
      
      To make this work we will need,
      1) the lwtunnel state to carry encap headroom
      2) and do the redirect to the encap output handler on the ip fragment
      (essentially do the output redirect after fragmentation)
      
      This patch adds tunnel headroom in lwtstate to make sure we
      account for tunnel data in mtu calculations during fragmentation
      and adds new xmit redirect handler to redirect to lwtunnel xmit func
      after ip fragmentation.
      
      This includes IPV6 and some mtu fixes and testing from David Ahern.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      14972cbd
  6. 30 8月, 2016 6 次提交
    • D
      rxrpc: Pass struct socket * to more rxrpc kernel interface functions · 4de48af6
      David Howells 提交于
      Pass struct socket * to more rxrpc kernel interface functions.  They should
      be starting from this rather than the socket pointer in the rxrpc_call
      struct if they need to access the socket.
      
      I have left:
      
      	rxrpc_kernel_is_data_last()
      	rxrpc_kernel_get_abort_code()
      	rxrpc_kernel_get_error_number()
      	rxrpc_kernel_free_skb()
      	rxrpc_kernel_data_consumed()
      
      unmodified as they're all about to be removed (and, in any case, don't
      touch the socket).
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      4de48af6
    • D
      rxrpc: Provide a way for AFS to ask for the peer address of a call · 8324f0bc
      David Howells 提交于
      Provide a function so that kernel users, such as AFS, can ask for the peer
      address of a call:
      
         void rxrpc_kernel_get_peer(struct rxrpc_call *call,
      			      struct sockaddr_rxrpc *_srx);
      
      In the future the kernel service won't get sk_buffs to look inside.
      Further, this allows us to hide any canonicalisation inside AF_RXRPC for
      when IPv6 support is added.
      
      Also propagate this through to afs_find_server() and issue a warning if we
      can't handle the address family yet.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      8324f0bc
    • G
      netfilter: log: Check param to avoid overflow in nf_log_set · 779994fa
      Gao Feng 提交于
      The nf_log_set is an interface function, so it should do the strict sanity
      check of parameters. Convert the return value of nf_log_set as int instead
      of void. When the pf is invalid, return -EOPNOTSUPP.
      Signed-off-by: NGao Feng <fgao@ikuai8.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      779994fa
    • F
      netfilter: remove __nf_ct_kill_acct helper · ad66713f
      Florian Westphal 提交于
      After timer removal this just calls nf_ct_delete so remove the __ prefix
      version and make nf_ct_kill a shorthand for nf_ct_delete.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      ad66713f
    • F
      netfilter: conntrack: get rid of conntrack timer · f330a7fd
      Florian Westphal 提交于
      With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
      Eric Dumazet pointed out during netfilter workshop 2016.
      
      Eric also says: "Another reason was the fact that Thomas was about to
      change max timer range [..]" (500462a9, 'timers: Switch to
      a non-cascading wheel').
      
      Remove the timer and use a 32bit jiffies value containing timestamp until
      entry is valid.
      
      During conntrack lookup, even before doing tuple comparision, check
      the timeout value and evict the entry in case it is too old.
      
      The dying bit is used as a synchronization point to avoid races where
      multiple cpus try to evict the same entry.
      
      Because lookup is always lockless, we need to bump the refcnt once
      when we evict, else we could try to evict already-dead entry that
      is being recycled.
      
      This is the standard/expected way when conntrack entries are destroyed.
      
      Followup patches will introduce garbage colliction via work queue
      and further places where we can reap obsoleted entries (e.g. during
      netlink dumps), this is needed to avoid expired conntracks from hanging
      around for too long when lookup rate is low after a busy period.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      f330a7fd
    • F
      netfilter: don't rely on DYING bit to detect when destroy event was sent · 616b14b4
      Florian Westphal 提交于
      The reliable event delivery mode currently (ab)uses the DYING bit to
      detect which entries on the dying list have to be skipped when
      re-delivering events from the eache worker in reliable event mode.
      
      Currently when we delete the conntrack from main table we only set this
      bit if we could also deliver the netlink destroy event to userspace.
      
      If we fail we move it to the dying list, the ecache worker will
      reattempt event delivery for all confirmed conntracks on the dying list
      that do not have the DYING bit set.
      
      Once timer is gone, we can no longer use if (del_timer()) to detect
      when we 'stole' the reference count owned by the timer/hash entry, so
      we need some other way to avoid racing with other cpu.
      
      Pablo suggested to add a marker in the ecache extension that skips
      entries that have been unhashed from main table but are still waiting
      for the last reference count to be dropped (e.g. because one skb waiting
      on nfqueue verdict still holds a reference).
      
      We do this by adding a tristate.
      If we fail to deliver the destroy event, make a note of this in the
      eache extension.  The worker can then skip all entries that are in
      a different state.  Either they never delivered a destroy event,
      e.g. because the netlink backend was not loaded, or redelivery took
      place already.
      
      Once the conntrack timer is removed we will now be able to replace
      del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid
      racing with other cpu that tries to evict the same conntrack.
      
      Because DYING will then be set right before we report the destroy event
      we can no longer skip event reporting when dying bit is set.
      Suggested-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      616b14b4
  7. 29 8月, 2016 4 次提交
    • E
      tcp: add tcp_add_backlog() · c9c33212
      Eric Dumazet 提交于
      When TCP operates in lossy environments (between 1 and 10 % packet
      losses), many SACK blocks can be exchanged, and I noticed we could
      drop them on busy senders, if these SACK blocks have to be queued
      into the socket backlog.
      
      While the main cause is the poor performance of RACK/SACK processing,
      we can try to avoid these drops of valuable information that can lead to
      spurious timeouts and retransmits.
      
      Cause of the drops is the skb->truesize overestimation caused by :
      
      - drivers allocating ~2048 (or more) bytes as a fragment to hold an
        Ethernet frame.
      
      - various pskb_may_pull() calls bringing the headers into skb->head
        might have pulled all the frame content, but skb->truesize could
        not be lowered, as the stack has no idea of each fragment truesize.
      
      The backlog drops are also more visible on bidirectional flows, since
      their sk_rmem_alloc can be quite big.
      
      Let's add some room for the backlog, as only the socket owner
      can selectively take action to lower memory needs, like collapsing
      receive queues or partial ofo pruning.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9c33212
    • T
      kcm: Remove TCP specific references from kcm and strparser · 96a59083
      Tom Herbert 提交于
      kcm and strparser need to work with any type of stream socket not just
      TCP. Eliminate references to TCP and call generic proto_ops functions of
      read_sock and peek_len. Also in strp_init check if the socket support
      the proto_ops read_sock and peek_len.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      96a59083
    • T
      tcp: Set read_sock and peek_len proto_ops · 32035585
      Tom Herbert 提交于
      In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
      tcp_peek_len (which is just a stub function that calls tcp_inq).
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32035585
    • T
      net: Add read_sock proto_op · 0294b625
      Tom Herbert 提交于
      Add new function in proto_ops structure. This includes moving the
      typedef got sk_read_actor into net.h and removing the definition from
      tcp.h.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0294b625
  8. 27 8月, 2016 2 次提交
    • I
      bridge: switchdev: Add forward mark support for stacked devices · 6bc506b4
      Ido Schimmel 提交于
      switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
      port netdevs so that packets being flooded by the device won't be
      flooded twice.
      
      It works by assigning a unique identifier (the ifindex of the first
      bridge port) to bridge ports sharing the same parent ID. This prevents
      packets from being flooded twice by the same switch, but will flood
      packets through bridge ports belonging to a different switch.
      
      This method is problematic when stacked devices are taken into account,
      such as VLANs. In such cases, a physical port netdev can have upper
      devices being members in two different bridges, thus requiring two
      different 'offload_fwd_mark's to be configured on the port netdev, which
      is impossible.
      
      The main problem is that packet and netdev marking is performed at the
      physical netdev level, whereas flooding occurs between bridge ports,
      which are not necessarily port netdevs.
      
      Instead, packet and netdev marking should really be done in the bridge
      driver with the switch driver only telling it which packets it already
      forwarded. The bridge driver will mark such packets using the mark
      assigned to the ingress bridge port and will prevent the packet from
      being forwarded through any bridge port sharing the same mark (i.e.
      having the same parent ID).
      
      Remove the current switchdev 'offload_fwd_mark' implementation and
      instead implement the proposed method. In addition, make rocker - the
      sole user of the mark - use the proposed method.
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6bc506b4
    • I
      devlink: remove unused priv_size · 2a313cdf
      Ivan Vecera 提交于
      Remove unused and useless priv_size member from struct devlink_ops.
      
      Cc: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: NIvan Vecera <ivecera@redhat.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2a313cdf
  9. 26 8月, 2016 2 次提交
  10. 25 8月, 2016 1 次提交
    • V
      net: dsa: rename switch operations structure · 9d490b4e
      Vivien Didelot 提交于
      Now that the dsa_switch_driver structure contains only function pointers
      as it is supposed to, rename it to the more appropriate dsa_switch_ops,
      uniformly to any other operations structure in the kernel.
      
      No functional changes here, basically just the result of something like:
      s/dsa_switch_driver *drv/dsa_switch_ops *ops/g
      
      However keep the {un,}register_switch_driver functions and their
      dsa_switch_drivers list as is, since they represent the -- likely to be
      deprecated soon -- legacy DSA registration framework.
      
      In the meantime, also fix the following checks from checkpatch.pl to
      make it happy with this patch:
      
          CHECK: Comparison to NULL could be written "!ops"
          #403: FILE: net/dsa/dsa.c:470:
          +	if (ops == NULL) {
      
          CHECK: Comparison to NULL could be written "ds->ops->get_strings"
          #773: FILE: net/dsa/slave.c:697:
          +		if (ds->ops->get_strings != NULL)
      
          CHECK: Comparison to NULL could be written "ds->ops->get_ethtool_stats"
          #824: FILE: net/dsa/slave.c:785:
          +	if (ds->ops->get_ethtool_stats != NULL)
      
          CHECK: Comparison to NULL could be written "ds->ops->get_sset_count"
          #835: FILE: net/dsa/slave.c:798:
          +		if (ds->ops->get_sset_count != NULL)
      
          total: 0 errors, 0 warnings, 4 checks, 784 lines checked
      Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Acked-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9d490b4e
  11. 24 8月, 2016 6 次提交
  12. 23 8月, 2016 2 次提交
  13. 20 8月, 2016 1 次提交
  14. 19 8月, 2016 4 次提交
  15. 18 8月, 2016 4 次提交
    • T
      kcm: Use stream parser · 9b73896a
      Tom Herbert 提交于
      Adapt KCM to use the stream parser. This mostly involves removing
      the RX handling and setting up the strparser using the interface.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b73896a
    • T
      strparser: Stream parser for messages · 43a0c675
      Tom Herbert 提交于
      This patch introduces a utility for parsing application layer protocol
      messages in a TCP stream. This is a generalization of the mechanism
      implemented of Kernel Connection Multiplexor.
      
      The API includes a context structure, a set of callbacks, utility
      functions, and a data ready function.
      
      A stream parser instance is defined by a strparse structure that
      is bound to a TCP socket. The function to initialize the structure
      is:
      
      int strp_init(struct strparser *strp, struct sock *csk,
                    struct strp_callbacks *cb);
      
      csk is the TCP socket being bound to and cb are the parser callbacks.
      
      The upper layer calls strp_tcp_data_ready when data is ready on the lower
      socket for strparser to process. This should be called from a data_ready
      callback that is set on the socket:
      
      void strp_tcp_data_ready(struct strparser *strp);
      
      A parser is bound to a TCP socket by setting data_ready function to
      strp_tcp_data_ready so that all receive indications on the socket
      go through the parser. This is assumes that sk_user_data is set to
      the strparser structure.
      
      There are four callbacks.
       - parse_msg is called to parse the message (returns length or error).
       - rcv_msg is called when a complete message has been received
       - read_sock_done is called when data_ready function exits
       - abort_parser is called to abort the parser
      
      The input to parse_msg is an skbuff which contains next message under
      construction. The backend processing of parse_msg will parse the
      application layer protocol headers to determine the length of
      the message in the stream. The possible return values are:
      
         >0 : indicates length of successfully parsed message
         0  : indicates more data must be received to parse the message
         -ESTRPIPE : current message should not be processed by the
            kernel, return control of the socket to userspace which
            can proceed to read the messages itself
         other < 0 : Error is parsing, give control back to userspace
            assuming that synchronzation is lost and the stream
            is unrecoverable (application expected to close TCP socket)
      
      In the case of error return (< 0) strparse will stop the parser
      and report and error to userspace. The application must deal
      with the error. To handle the error the strparser is unbound
      from the TCP socket. If the error indicates that the stream
      TCP socket is at recoverable point (ESTRPIPE) then the application
      can read the TCP socket to process the stream. Once the application
      has dealt with the exceptions in the stream, it may again bind the
      socket to a strparser to continue data operations.
      
      Note that ENODATA may be returned to the application. In this case
      parse_msg returned -ESTRPIPE, however strparser was unable to maintain
      synchronization of the stream (i.e. some of the message in question
      was already read by the parser).
      
      strp_pause and strp_unpause are used to provide flow control. For
      instance, if rcv_msg is called but the upper layer can't immediately
      consume the message it can hold the message and pause strparser.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43a0c675
    • W
      net_sched: convert tcf_exts from list to pointer array · 22dc13c8
      WANG Cong 提交于
      As pointed out by Jamal, an action could be shared by
      multiple filters, so we can't use list to chain them
      any more after we get rid of the original tc_action.
      Instead, we could just save pointers to these actions
      in tcf_exts, since they are refcount'ed, so convert
      the list to an array of pointers.
      
      The "ugly" part is the action API still accepts list
      as a parameter, I just introduce a helper function to
      convert the array of pointers to a list, instead of
      relying on the C99 feature to iterate the array.
      
      Fixes: a85a970a ("net_sched: move tc_action into tcf_common")
      Reported-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22dc13c8
    • W
      net_sched: move tc offload macros to pkt_cls.h · 2734437e
      WANG Cong 提交于
      struct tcf_exts belongs to filters, should not be visible
      to plain tc actions.
      
      Cc: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2734437e