1. 25 9月, 2016 1 次提交
  2. 13 9月, 2016 1 次提交
    • F
      netfilter: conntrack: remove packet hotpath stats · 8e8118f8
      Florian Westphal 提交于
      These counters sit in hot path and do show up in perf, this is especially
      true for 'found' and 'searched' which get incremented for every packet
      processed.
      
      Information like
      
      searched=212030105
      new=623431
      found=333613
      delete=623327
      
      does not seem too helpful nowadays:
      
      - on busy systems found and searched will overflow every few hours
      (these are 32bit integers), other more busy ones every few days.
      
      - for debugging there are better methods, such as iptables' trace target,
      the conntrack log sysctls.  Nowadays we also have perf tool.
      
      This removes packet path stat counters except those that
      are expected to be 0 (or close to 0) on a normal system, e.g.
      'insert_failed' (race happened) or 'invalid' (proto tracker rejects).
      
      The insert stat is retained for the ctnetlink case.
      The found stat is retained for the tuple-is-taken check when NAT has to
      determine if it needs to pick a different source address.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      8e8118f8
  3. 07 9月, 2016 2 次提交
  4. 03 9月, 2016 2 次提交
    • A
      perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs · aa6a5f3c
      Alexei Starovoitov 提交于
      Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
      via overflow_handler mechanism.
      When program is attached the overflow_handlers become stacked.
      The program acts as a filter.
      Returning zero from the program means that the normal perf_event_output handler
      will not be called and sampling event won't be stored in the ring buffer.
      
      The overflow_handler_context==NULL is an additional safety check
      to make sure programs are not attached to hw breakpoints and watchdog
      in case other checks (that prevent that now anyway) get accidentally
      relaxed in the future.
      
      The program refcnt is incremented in case perf_events are inhereted
      when target task is forked.
      Similar to kprobe and tracepoint programs there is no ioctl to
      detach the program or swap already attached program. The user space
      expected to close(perf_event_fd) like it does right now for kprobe+bpf.
      That restriction simplifies the code quite a bit.
      
      The invocation of overflow_handler in __perf_event_overflow() is now
      done via READ_ONCE, since that pointer can be replaced when the program
      is attached while perf_event itself could have been active already.
      There is no need to do similar treatment for event->prog, since it's
      assigned only once before it's accessed.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aa6a5f3c
    • A
      bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type · 0515e599
      Alexei Starovoitov 提交于
      Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
      HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
      correspondingly in uapi/linux/perf_event.h)
      
      The program visible context meta structure is
      struct bpf_perf_event_data {
          struct pt_regs regs;
           __u64 sample_period;
      };
      which is accessible directly from the program:
      int bpf_prog(struct bpf_perf_event_data *ctx)
      {
        ... ctx->sample_period ...
        ... ctx->regs.ip ...
      }
      
      The bpf verifier rewrites the accesses into kernel internal
      struct bpf_perf_event_data_kern which allows changing
      struct perf_sample_data without affecting bpf programs.
      New fields can be added to the end of struct bpf_perf_event_data
      in the future.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0515e599
  5. 02 9月, 2016 2 次提交
    • N
      net: bridge: add per-port multicast flood flag · b6cb5ac8
      Nikolay Aleksandrov 提交于
      Add a per-port flag to control the unknown multicast flood, similar to the
      unknown unicast flood flag and break a few long lines in the netlink flag
      exports.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b6cb5ac8
    • R
      rtnetlink: fdb dump: optimize by saving last interface markers · d297653d
      Roopa Prabhu 提交于
      fdb dumps spanning multiple skb's currently restart from the first
      interface again for every skb. This results in unnecessary
      iterations on the already visited interfaces and their fdb
      entries. In large scale setups, we have seen this to slow
      down fdb dumps considerably. On a system with 30k macs we
      see fdb dumps spanning across more than 300 skbs.
      
      To fix the problem, this patch replaces the existing single fdb
      marker with three markers: netdev hash entries, netdevs and fdb
      index to continue where we left off instead of restarting from the
      first netdev. This is consistent with link dumps.
      
      In the process of fixing the performance issue, this patch also
      re-implements fix done by
      commit 472681d5 ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
      (with an internal fix from Wilson Kok) in the following ways:
      - change ndo_fdb_dump handlers to return error code instead
      of the last fdb index
      - use cb->args strictly for dump frag markers and not error codes.
      This is consistent with other dump functions.
      
      Below results were taken on a system with 1000 netdevs
      and 35085 fdb entries:
      before patch:
      $time bridge fdb show | wc -l
      15065
      
      real    1m11.791s
      user    0m0.070s
      sys 1m8.395s
      
      (existing code does not return all macs)
      
      after patch:
      $time bridge fdb show | wc -l
      35085
      
      real    0m2.017s
      user    0m0.113s
      sys 0m1.942s
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NWilson Kok <wkok@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d297653d
  6. 29 8月, 2016 2 次提交
    • R
      net: smc91x: fix SMC accesses · 2fb04fdf
      Russell King 提交于
      Commit b70661c7 ("net: smc91x: use run-time configuration on all ARM
      machines") broke some ARM platforms through several mistakes.  Firstly,
      the access size must correspond to the following rule:
      
      (a) at least one of 16-bit or 8-bit access size must be supported
      (b) 32-bit accesses are optional, and may be enabled in addition to
          the above.
      
      Secondly, it provides no emulation of 16-bit accesses, instead blindly
      making 16-bit accesses even when the platform specifies that only 8-bit
      is supported.
      
      Reorganise smc91x.h so we can make use of the existing 16-bit access
      emulation already provided - if 16-bit accesses are supported, use
      16-bit accesses directly, otherwise if 8-bit accesses are supported,
      use the provided 16-bit access emulation.  If neither, BUG().  This
      exactly reflects the driver behaviour prior to the commit being fixed.
      
      Since the conversion incorrectly cut down the available access sizes on
      several platforms, we also need to go through every platform and fix up
      the overly-restrictive access size: Arnd assumed that if a platform can
      perform 32-bit, 16-bit and 8-bit accesses, then only a 32-bit access
      size needed to be specified - not so, all available access sizes must
      be specified.
      
      This likely fixes some performance regressions in doing this: if a
      platform does not support 8-bit accesses, 8-bit accesses have been
      emulated by performing a 16-bit read-modify-write access.
      
      Tested on the Intel Assabet/Neponset platform, which supports only 8-bit
      accesses, which was broken by the original commit.
      
      Fixes: b70661c7 ("net: smc91x: use run-time configuration on all ARM machines")
      Signed-off-by: NRussell King <rmk+kernel@armlinux.org.uk>
      Tested-by: NRobert Jarzmik <robert.jarzmik@free.fr>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2fb04fdf
    • T
      net: Add read_sock proto_op · 0294b625
      Tom Herbert 提交于
      Add new function in proto_ops structure. This includes moving the
      typedef got sk_read_actor into net.h and removing the definition from
      tcp.h.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0294b625
  7. 27 8月, 2016 3 次提交
    • S
      sysctl: handle error writing UINT_MAX to u32 fields · e7d316a0
      Subash Abhinov Kasiviswanathan 提交于
      We have scripts which write to certain fields on 3.18 kernels but this
      seems to be failing on 4.4 kernels.  An entry which we write to here is
      xfrm_aevent_rseqth which is u32.
      
        echo 4294967295  > /proc/sys/net/core/xfrm_aevent_rseqth
      
      Commit 230633d1 ("kernel/sysctl.c: detect overflows when converting
      to int") prevented writing to sysctl entries when integer overflow
      occurs.  However, this does not apply to unsigned integers.
      
      Heinrich suggested that we introduce a new option to handle 64 bit
      limits and set min as 0 and max as UINT_MAX.  This might not work as it
      leads to issues similar to __do_proc_doulongvec_minmax.  Alternatively,
      we would need to change the datatype of the entry to 64 bit.
      
        static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
        {
            i = (unsigned long *) data;   //This cast is causing to read beyond the size of data (u32)
            vleft = table->maxlen / sizeof(unsigned long); //vleft is 0 because maxlen is sizeof(u32) which is lesser than sizeof(unsigned long) on x86_64.
      
      Introduce a new proc handler proc_douintvec.  Individual proc entries
      will need to be updated to use the new handler.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Fixes: 230633d1 ("kernel/sysctl.c:detect overflows when converting to int")
      Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.orgSigned-off-by: NSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7d316a0
    • J
      byteswap: don't use __builtin_bswap*() with sparse · 101b29a2
      Johannes Berg 提交于
      Although sparse declares __builtin_bswap*(), it can't actually do
      constant folding inside them (yet).  As such, things like
      
        switch (protocol) {
        case htons(ETH_P_IP):
                break;
        }
      
      which we do all over the place cause sparse to warn that it expects a
      constant instead of a function call.
      
      Disable __HAVE_BUILTIN_BSWAP*__ if __CHECKER__ is defined to avoid this.
      
      Fixes: 7322dd75 ("byteswap: try to avoid __builtin_constant_p gcc bug")
      Link: http://lkml.kernel.org/r/1470914102-26389-1-git-send-email-johannes@sipsolutions.netSigned-off-by: NJohannes Berg <johannes.berg@intel.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      101b29a2
    • I
      bridge: switchdev: Add forward mark support for stacked devices · 6bc506b4
      Ido Schimmel 提交于
      switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
      port netdevs so that packets being flooded by the device won't be
      flooded twice.
      
      It works by assigning a unique identifier (the ifindex of the first
      bridge port) to bridge ports sharing the same parent ID. This prevents
      packets from being flooded twice by the same switch, but will flood
      packets through bridge ports belonging to a different switch.
      
      This method is problematic when stacked devices are taken into account,
      such as VLANs. In such cases, a physical port netdev can have upper
      devices being members in two different bridges, thus requiring two
      different 'offload_fwd_mark's to be configured on the port netdev, which
      is impossible.
      
      The main problem is that packet and netdev marking is performed at the
      physical netdev level, whereas flooding occurs between bridge ports,
      which are not necessarily port netdevs.
      
      Instead, packet and netdev marking should really be done in the bridge
      driver with the switch driver only telling it which packets it already
      forwarded. The bridge driver will mark such packets using the mark
      assigned to the ingress bridge port and will prevent the packet from
      being forwarded through any bridge port sharing the same mark (i.e.
      having the same parent ID).
      
      Remove the current switchdev 'offload_fwd_mark' implementation and
      instead implement the proposed method. In addition, make rocker - the
      sole user of the mark - use the proposed method.
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6bc506b4
  8. 26 8月, 2016 1 次提交
    • P
      rhashtable: add rhashtable_lookup_get_insert_key() · 5ca8cc5b
      Pablo Neira Ayuso 提交于
      This patch modifies __rhashtable_insert_fast() so it returns the
      existing object that clashes with the one that you want to insert.
      In case the object is successfully inserted, NULL is returned.
      Otherwise, you get an error via ERR_PTR().
      
      This patch adapts the existing callers of __rhashtable_insert_fast()
      so they handle this new logic, and it adds a new
      rhashtable_lookup_get_insert_key() interface to fetch this existing
      object.
      
      nf_tables needs this change to improve handling of EEXIST cases via
      honoring the NLM_F_EXCL flag and by checking if the data part of the
      mapping matches what we have.
      
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      5ca8cc5b
  9. 24 8月, 2016 3 次提交
  10. 23 8月, 2016 2 次提交
  11. 20 8月, 2016 2 次提交
    • Y
      qed: utilize FW 8.10.10.0 · 05fafbfb
      Yuval Mintz 提交于
      This new firmware for the qed* adpaters fixes several issues:
       - Better blocking of malicious VFs.
       - After FLR, Tx-switching [internal routing] of packets might
         be incorrect.
       - Deletion of unicast MAC filters would sometime have side-effect
         of corrupting the MAC filters configred for a device.
      It also contains fixes for future qed* drivers that *hopefully* would be
      sent for review in the near future.
      
      In addition, it would allow driver some new functionality, including:
       - Allowing PF/VF driver compaitibility with old drivers [running
         pre-8.10.5.0 firmware].
       - Better debug facilities.
      
      This would also bump the qed* driver versions to 8.10.9.20.
      Signed-off-by: NYuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      05fafbfb
    • H
      rhashtable: Remove GFP flag from rhashtable_walk_init · 246779dd
      Herbert Xu 提交于
      The commit 8f6fd83c ("rhashtable:
      accept GFP flags in rhashtable_walk_init") added a GFP flag argument
      to rhashtable_walk_init because some users wish to use the walker
      in an unsleepable context.
      
      In fact we don't need to allocate memory in rhashtable_walk_init
      at all.  The walker is always paired with an iterator so we could
      just stash ourselves there.
      
      This patch does that by introducing a new enter function to replace
      the existing init function.  This way we don't have to churn all
      the existing users again.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      246779dd
  12. 19 8月, 2016 4 次提交
    • D
      bpf: add bpf_skb_change_tail helper · 5293efe6
      Daniel Borkmann 提交于
      This work adds a bpf_skb_change_tail() helper for tc BPF programs. The
      basic idea is to expand or shrink the skb in a controlled manner. The
      eBPF program can then rewrite the rest via helpers like bpf_skb_store_bytes(),
      bpf_lX_csum_replace() and others rather than passing a raw buffer for
      writing here.
      
      bpf_skb_change_tail() is really a slow path helper and intended for
      replies with f.e. ICMP control messages. Concept is similar to other
      helpers like bpf_skb_change_proto() helper to keep the helper without
      protocol specifics and let the BPF program mangle the remaining parts.
      A flags field has been added and is reserved for now should we extend
      the helper in future.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5293efe6
    • R
      net: bgmac: support Ethernet core on BCM53573 SoCs · 1cb94db3
      Rafał Miłecki 提交于
      BCM53573 is a new series of Broadcom's SoCs. It's based on ARM and can
      be found in two packages (versions): BCM53573 and BCM47189. It shares
      some code with the Northstar family, but also requires some new quirks.
      
      First of all there can be up to 2 Ethernet cores on this SoC. If that is
      the case, they are connected to two different switch ports allowing some
      more complex/optimized setups. It seems the second unit doesn't come
      fully configured and requires some IRQ quirk.
      
      Other than that only the first core is connected to the PHY. For the
      second one we have to register fixed PHY (similarly to the Northstar),
      otherwise generic PHY driver would get some invalid info.
      
      This has been successfully tested on Tenda AC9 (BCM47189B0).
      Signed-off-by: NRafał Miłecki <rafal@milecki.pl>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1cb94db3
    • H
      flow_dissector: Get vlan priority in addition to vlan id · f6a66927
      Hadar Hen Zion 提交于
      Add vlan priority check to the flow dissector by adding new flow
      dissector struct, flow_dissector_key_vlan which includes vlan tag
      fields.
      
      vlan_id and flow_label fields were under the same struct
      (flow_dissector_key_tags). It was a convenient setting since struct
      flow_dissector_key_tags is used by struct flow_keys and by setting
      vlan_id and flow_label under the same struct, we get precisely 24 or 48
      bytes in flow_keys from flow_dissector_key_basic.
      
      Now, when adding vlan priority support, the code will be cleaner if
      flow_label and vlan tag won't be under the same struct anymore.
      Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6a66927
    • Y
      qed*: Fix pause setting · d194fd26
      Yuval Mintz 提交于
      When moving into using ethtool's link_ksetting, qed started
      supplying its own bitmask of speed/capabilities, but qede
      is still checking for the SUPPORTED value to determine whether
      it supports pause.
      
      Fixes: 054c67d1 ("qed*: Add support for ethtool link_ksettings callbacks")
      Signed-off-by: NYuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d194fd26
  13. 18 8月, 2016 14 次提交
  14. 17 8月, 2016 1 次提交