1. 04 Dec, 2016 · 40 commits
    • Merge branch 'mv88e6390-batch-three' · ce84c7c6
      David S. Miller committed
      Andrew Lunn says:

      ====================
      mv88e6390 batch 3

      More patches to support the MV88e6390. This is mostly refactoring
      existing code and adding implementations for the mv88e6390.  This
      patchset sets which reserved frames are sent to the CPU and the size of
      jumbo frames that will be accepted, turns off egress rate limiting, and
      configures pause frames.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Implement mv88e6390 pause control · 3ce0e65e
      Andrew Lunn committed
      The mv88e6390 has a number of flow control registers accessed via the
      Flow Control register. Use these to set the pause control.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Refactor pause configuration · b35d322a
      Andrew Lunn committed
      The mv88e6390 has a different mechanism for configuring pause.
      Refactor the code into an ops function, and for the moment, don't add
      any mv88e6390 code.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Refactor egress rate limiting · ef70b111
      Andrew Lunn committed
      There are two different rate limiting configurations, depending on the
      switch generation. Refactor this into ops.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
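
The "refactor into ops" pattern used throughout this series can be pictured with a minimal sketch. Everything below (`struct chip`, `struct chip_ops`, the `gen_a`/`gen_b` names and the register values) is hypothetical and is not the mv88e6xxx driver's code; it only shows a common setup path dispatching through a per-chip table of function pointers, with an optional op that may be absent.

```c
#include <errno.h>
#include <stddef.h>

#define NUM_PORTS 4

struct chip;

/* Hypothetical per-chip operations table; an op may be NULL. */
struct chip_ops {
	int (*port_egress_rate_limiting)(struct chip *chip, int port);
};

struct chip {
	const struct chip_ops *ops;
	unsigned int rate_reg[NUM_PORTS];	/* stand-in for hardware registers */
};

/* Two invented switch generations with different register programming. */
static int gen_a_egress_rate_limiting(struct chip *chip, int port)
{
	chip->rate_reg[port] = 0x0001;	/* made-up "disable" encoding */
	return 0;
}

static int gen_b_egress_rate_limiting(struct chip *chip, int port)
{
	chip->rate_reg[port] = 0x0000;	/* made-up "disable" encoding */
	return 0;
}

static const struct chip_ops gen_a_ops = {
	.port_egress_rate_limiting = gen_a_egress_rate_limiting,
};

static const struct chip_ops gen_b_ops = {
	.port_egress_rate_limiting = gen_b_egress_rate_limiting,
};

/* A chip without the feature simply leaves the op unset. */
static const struct chip_ops no_rate_ops = {
	.port_egress_rate_limiting = NULL,
};

/* Common code no longer tests chip families; it just calls the op. */
int chip_setup_port(struct chip *chip, int port)
{
	if (port < 0 || port >= NUM_PORTS)
		return -EINVAL;
	if (!chip->ops->port_egress_rate_limiting)
		return 0;
	return chip->ops->port_egress_rate_limiting(chip, port);
}
```

With this shape, supporting a new generation means adding one function and one table entry, rather than another family check in the common path.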
    • net: dsa: mv88e6xxx: Refactor setting of jumbo frames · 5f436666
      Andrew Lunn committed
      Some switches support jumbo frames. Refactor this code into operations
      in the ops structure.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Reserved Management frames to CPU · 6e55f698
      Andrew Lunn committed
      Older devices have a couple of registers in global2. The mv88e6390
      family has a single register in global1 behind which hides similar
      configuration. Implement an op for this.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mv88e6390-batch-two' · 7a6c5cb9
      David S. Miller committed
      Andrew Lunn says:
      
      ====================
      MV88E6390 batch two
      
      This is the second batch of patches adding support for the
      MV88e6390. They are not sufficient to make it work properly.
      
      The mv88e6390 has a much expanded set of priority maps. Refactor the
      existing code, and implement basic support for the new device.
      
      Similarly, the monitor control register has been reworked.
      
      The mv88e6390 has something odd in its EDSA tagging implementation,
      which means it is not possible to use it. So we need to use DSA
      tagging. This is the first device with EDSA support where we need to
      use DSA, and the code does not support this. So two patches refactor
      the existing code. The two different register definitions are
      separated out, and using DSA on an EDSA capable device is added.
      
      v2:
      Add port prefix
      Add helper function for 6390
      Add _IEEE_ into #defines
      Split monitor_ctrl into a number of separate ops.
      Remove 6390 code which is management, used in a later patch
      s/EGREES/EGRESS/.
      Broke up setup_port_dsa() and set_port_dsa() into a number of ops
      
      v3:
      Verify mandatory ops for port setup
      Don't set ether type for DSA port.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Refactor CPU and DSA port setup · 56995cbc
      Andrew Lunn committed
      Older chips only support DSA tagging. Newer chips have both DSA and
      EDSA tagging. Refactor the code by adding port functions for setting the
      frame mode, egress mode, and whether to forward unknown frames.

      This results in the helper mv88e6xxx_6065_family() becoming unused, so
      remove it.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      v3:
      Verify mandatory ops for port setup
      Don't set ether type for DSA port.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Move the tagging protocol into info · 443d5a1b
      Andrew Lunn committed
      Older chips support a single tagging protocol, DSA. New chips support
      both DSA and EDSA, an enhanced version. Having both as an option
      changes the register layouts. Up until now, it has been assumed that
      if EDSA is supported, it will be used. Hence the register layout has
      been determined by which protocol should be used. However, mv88e6390
      has a different implementation of EDSA, which requires us to use
      DSA tagging. Hence separate the selection of the protocol from the
      register layout.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Monitor and Management tables · 33641994
      Andrew Lunn committed
      The mv88e6390 changes the monitor control register into the Monitor
      and Management control, which is an indirection register to various
      registers.

      Add ops to set the CPU port and the ingress/egress port for both
      register layouts, to global1.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Implement mv88e6390 tag remap · ef0a7318
      Andrew Lunn committed
      The mv88e6390 does not have the two registers to set the frame
      priority map. Instead it has an indirection register for setting a
      number of different priority maps. Refactor the old code into a
      function, implement the mv88e6390 version, and use an op to call the
      right one.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'fib-notifier-event-replay' · 69248719
      David S. Miller committed
      Jiri Pirko says:
      
      ====================
      ipv4: fib: Replay events when registering FIB notifier
      
      Ido says:
      
      In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
      by a new FIB notification chain to which modules could register in order
      to be notified about the addition and deletion of FIB entries. The
      motivation for this change was that switchdev drivers need to be able to
      reflect the entire FIB table and not only FIBs configured on top of the
      port netdevs themselves. This is useful in case of in-band management.
      
      The fundamental problem with this approach is that upon registration
      listeners lose all the information previously sent in the chain and
      thus have an incomplete view of the FIB tables, which can result in
      packet loss. This patchset fixes that by dumping the FIB tables and
      replaying notifications previously sent in the chain for the registered
      notification block.
      
      The entire dump process is done under RCU and thus the FIB notification
      chain is converted to be atomic. The listeners are modified accordingly.
      This is done in the first eight patches.
      
      The ninth patch adds a change sequence counter to ensure the integrity
      of the FIB dump. The last patch adds the dump itself to the FIB chain
      registration function and modifies existing listeners to pass a callback
      to be executed in case the dump was inconsistent.
      
      ---
      v3->v4:
      - Register the notification block after the dump and protect it using
        the change sequence counter (Hannes Frederic Sowa).
      - Since we now integrate the dump into the registration function, drop
        the sysctl to set maximum number of retries and instead set it to a
        fixed number. Let's see if it's really a problem before adding something
        we can never remove.
      - For the same reason, dump FIB tables for all net namespaces.
      - Add a comment regarding guarantees provided by mutex semantics.
      
      v2->v3:
      - Add sysctl to set the number of FIB dump retries (Hannes Frederic Sowa).
      - Read the sequence counter under RTNL to ensure synchronization
        between the dump process and other processes changing the routing
        tables (Hannes Frederic Sowa).
      - Pass a callback to the dump function to be executed prior to a retry.
      - Limit the dump to a single net namespace.
      
      v1->v2:
      - Add a sequence counter to ensure the integrity of the FIB dump
        (David S. Miller, Hannes Frederic Sowa).
      - Protect notifications from re-ordering in listeners by using an
        ordered workqueue (Hannes Frederic Sowa).
      - Introduce fib_info_hold() (Jiri Pirko).
      - Relieve rocker from the need to invoke the FIB dump by registering
        to the FIB notification chain prior to ports creation.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: fib: Replay events when registering FIB notifier · c3852ef7
      Ido Schimmel committed
      Commit b90eb754 ("fib: introduce FIB notification infrastructure")
      introduced a new notification chain to notify listeners (e.g., switchdev
      drivers) about addition and deletion of routes.
      
      However, upon registration to the chain the FIB tables can already be
      populated, which means potential listeners will have an incomplete view
      of the tables.
      
      Solve that by dumping the FIB tables and replaying the events to the
      passed notification block. The dump itself is done using RCU in order
      not to starve consumers that need RTNL to make progress.
      
      The integrity of the dump is ensured by reading the FIB change sequence
      counter before and after the dump under RTNL. This allows us to avoid
      the problematic situation in which the dumping process sends an ENTRY_ADD
      notification following an ENTRY_DEL generated by another process holding
      RTNL.
      
      Callers of the registration function may pass a callback that is
      executed in case the dump was inconsistent with current FIB tables.
      
      The number of retries until a consistent dump is achieved is set to a
      fixed number to prevent callers from looping for long periods of time.
      In case the current limit proves to be problematic in the future, it can
      be easily converted to be configurable using a sysctl.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
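
The dump-and-retry scheme described in this commit can be sketched in plain C. This is a single-threaded model under stated assumptions, not the kernel implementation: `struct table`, `register_with_replay()`, the listener helpers and the retry limit of 5 are all invented for illustration, and the locking (RTNL, RCU) is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

#define DUMP_MAX_RETRIES 5	/* assumed value, for illustration only */

/* Hypothetical stand-in for a routing table with a change counter. */
struct table {
	unsigned int seq;	/* incremented on every change */
	int entries[8];
	int count;
};

typedef void (*replay_fn)(int entry, void *ctx);
typedef void (*retry_fn)(void *ctx);

/*
 * Replay every entry to the listener, then check that the sequence
 * counter did not move while dumping. On a mismatch, let the listener
 * discard what it received and try again, up to a fixed limit.
 */
bool register_with_replay(struct table *t, replay_fn replay,
			  retry_fn on_retry, void *ctx)
{
	for (int attempt = 0; attempt < DUMP_MAX_RETRIES; attempt++) {
		unsigned int before = t->seq;

		for (int i = 0; i < t->count; i++)
			replay(t->entries[i], ctx);

		if (t->seq == before)
			return true;	/* consistent dump */
		if (on_retry)
			on_retry(ctx);	/* listener flushes its state */
	}
	return false;
}

/* Demo listener: counts entries; optionally simulates one racing change. */
struct listener {
	int seen;
	int flushes;
	struct table *racing;	/* if set, bump its seq once mid-dump */
};

void listener_replay(int entry, void *ctx)
{
	struct listener *l = ctx;

	(void)entry;
	l->seen++;
	if (l->racing) {
		l->racing->seq++;
		l->racing = NULL;
	}
}

void listener_flush(void *ctx)
{
	struct listener *l = ctx;

	l->seen = 0;
	l->flushes++;
}
```

In this model a change during the first pass costs one flush and one extra pass; only a table that keeps changing through every attempt makes registration fail.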
    • ipv4: fib: Allow for consistent FIB dumping · cacaad11
      Ido Schimmel committed
      The next patch will enable listeners of the FIB notification chain to
      request a dump of the FIB tables. However, since RTNL isn't taken during
      the dump, it's possible for the FIB tables to change mid-dump, which
      will result in inconsistency between the listener's table and the
      kernel's.
      
      Allow listeners to know about changes that occurred mid-dump, by adding
      a change sequence counter to each net namespace. The counter is
      incremented just before a notification is sent in the FIB chain.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: fib: Convert FIB notification chain to be atomic · d3f706f6
      Ido Schimmel committed
      In order not to hold RTNL for long periods of time we're going to dump
      the FIB tables using RCU.

      Convert the FIB notification chain to be atomic, as we can't block in
      RCU critical sections.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rocker: Register FIB notifier before creating ports · 17f8be7d
      Ido Schimmel committed
      We can miss FIB notifications sent between the time the ports were
      created and the FIB notification block registered.

      Instead of receiving these notifications only when they are replayed for
      the FIB notification block during registration, just register the
      notification block before the ports are created.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rocker: Implement FIB offload in deferred work · db701955
      Ido Schimmel committed
      Convert rocker to offload FIBs in deferred work in a similar fashion to
      mlxsw, which was converted in the previous commits.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rocker: Create an ordered workqueue for FIB offload · c1bb279c
      Ido Schimmel committed
      As explained in the previous commits, we need to process FIB entry
      addition / deletion events in FIFO order, or otherwise we can have a
      mismatch between the kernel's FIB table and the device's.

      Create an ordered workqueue for rocker to which these work items will be
      submitted.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_router: Implement FIB offload in deferred work · 3057224e
      Ido Schimmel committed
      FIB offload is currently done in process context with RTNL held, but
      we're about to dump the FIB tables in an RCU critical section, so we can
      no longer sleep.
      
      Instead, defer the operation to process context using deferred work. Make
      sure fib info isn't freed while the work is queued by taking a reference
      on it and releasing it after the operation is done.
      
      Deferring the operation is valid because the upper layers always assume
      the operation was successful. If it's not, then the driver-specific
      abort mechanism is called and all routed traffic is directed to slow
      path.
      
      The work items are submitted to an ordered workqueue to prevent a
      mismatch between the kernel's FIB table and the device's.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core: Create an ordered workqueue for FIB offload · a3832b31
      Ido Schimmel committed
      We're going to start processing FIB entry addition / deletion events
      in deferred work. These work items must be processed in the order they
      were submitted, or otherwise we can have differences between the kernel's
      FIB table and the device's.

      Solve this by creating an ordered workqueue to which these work items
      will be submitted. Note that we can't simply convert the current
      workqueue to be ordered, as EMAD retransmissions are also processed in
      deferred work.
      
      Later on, we can migrate other work items to this workqueue, such as FDB
      notification processing and nexthop resolution, since they all take the
      same lock anyway.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: fib: Add fib_info_hold() helper · 1c677b3d
      Ido Schimmel committed
      As explained in the previous commit, modules are going to need to take a
      reference on fib info and then drop it using fib_info_put().

      Add the fib_info_hold() helper to make the code more readable and also
      symmetric with fib_info_put().
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Suggested-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
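
The hold/put symmetry this commit is after can be illustrated with a small userspace sketch. The `struct info` type and the `info_*` functions are hypothetical and the counter is a plain `int`; the kernel's fib_info uses its own reference counting, which this does not reproduce.

```c
#include <stdlib.h>

/* Hypothetical refcounted object; the flag lets a caller observe the free. */
struct info {
	int refcnt;
	int *freed_flag;
};

struct info *info_alloc(int *freed_flag)
{
	struct info *fi = malloc(sizeof(*fi));

	if (!fi)
		return NULL;
	fi->refcnt = 1;		/* the creator owns the first reference */
	fi->freed_flag = freed_flag;
	return fi;
}

/* Take an extra reference, e.g. before queueing deferred work. */
void info_hold(struct info *fi)
{
	fi->refcnt++;
}

/* Drop a reference; the last put frees the object. */
void info_put(struct info *fi)
{
	if (--fi->refcnt == 0) {
		if (fi->freed_flag)
			*fi->freed_flag = 1;
		free(fi);
	}
}
```

A deferred-work user would call `info_hold()` when queueing the item and `info_put()` when the work function finishes, so the object cannot disappear in between.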
    • ipv4: fib: Export free_fib_info() · b423cb10
      Ido Schimmel committed
      The FIB notification chain is going to be converted to an atomic chain,
      which means switchdev drivers will have to offload FIB entries in
      deferred work, as hardware operations entail sleeping.
      
      However, while the work is queued fib info might be freed, so a
      reference must be taken. To release the reference (and potentially free
      the fib info) fib_info_put() will be called, which in turn calls
      free_fib_info().
      
      Export free_fib_info() so that modules will be able to invoke
      fib_info_put().
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • act_mirred: fix a typo in get_dev · 548ed722
      WANG Cong committed
      Fixes: 255cb304 ("net/sched: act_mirred: Add new tc_action_ops get_dev()")
      Cc: Hadar Hen Zion <hadarh@mellanox.com>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · db7e9f7c
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2016-12-02
      
      This series contains updates to i40e and i40evf only.
      
      Alex provides changes so that we are much more robust about defining what
      we can and cannot offload in i40e and i40evf by doing additional checks
      other than L4 tunnel header length.
      
      Jake provides several fixes/changes: he cleans up an unnecessary label
      and the use of a "magic number", clarifies the code by separating the
      global private flags and the regular private flags per interface into
      two arrays, so that future additions will not produce duplication and
      buggy code, and adds additional checks to protect against NULL values
      for the msix_entries and q_vectors pointers.
      
      Michal adds Clause22 method for accessing registers for some external
      PHYs.
      
      Piotr adds additional protocol support for the admin queue discover
      capabilities function.
      
      Tushar Dave fixes a panic seen on SPARC, where writel() should not be
      used to write directly to a memory address but only to a memory mapped
      I/O address otherwise it causes data access exceptions.
      
      Joe Perches separates out a section of code into its own function, to
      help reduce i40evf_reset_task() a bit.
      
      Alan fixes an issue by checking for NULL before dereferencing msix_entries
      and returning early in the case where it is NULL within the i40evf_close()
      code path.
      
      Henry provides code cleanup to remove unreachable and redundant sections
      of code, and fixes an issue where new NICs were not identifying "unknown
      PHYs" correctly.

      Harshitha fixes an issue where the ethtool "Supported Link" modes list
      backplane interfaces on X722 devices for 10 GbE with SFP+ and Cortina
      retimer, where these interfaces should not be visible to the user since
      they cannot use them.
      
      Carolyn changes an X722 informational message so that it only appears
      when extra messages are desired.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix the missing avr32 SOF_TIMESTAMPING_OPT_STATS · 2bb14878
      Yuchung Cheng committed
      The commit of SOF_TIMESTAMPING_OPT_STATS didn't include the
      new header for avr32, causing build to break. The patch fixes it.

      Fixes: 1c885808 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
      Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: be less conservative with sock rmem accounting · 363dc73a
      Paolo Abeni committed
      Before commit 850cbadd ("udp: use it's own memory accounting
      schema"), the udp protocol allowed sk_rmem_alloc to grow beyond
      the rcvbuf by the whole current packet's truesize. After said commit
      we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue
      is empty. As reported by Jesper, this causes a performance regression
      for some (small) values of rcvbuf.

      This commit is intended to fix the regression by restoring the old
      handling of the rcvbuf limit.
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Fixes: 850cbadd ("udp: use it's own memory accounting schema")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: gen_estimator: account for timer drifts · 12efa1fa
      Eric Dumazet committed
      Under heavy stress, timers used in estimators tend to slowly be delayed
      by a few jiffies, leading to inaccuracies.

      Let's remember the last scheduled jiffies so that we get more
      precise estimations, without having to add a multiply/divide in the loop
      to account for the drifts.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
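
The drift problem can be seen in a small model. The two functions below are hypothetical and are not the gen_estimator code: one re-arms a periodic timer relative to the moment the handler actually ran, the other relative to the previously scheduled deadline, which is the idea the commit message describes. The "already late" fallback is an assumption of this sketch.

```c
/* Re-arm from "now": every late firing pushes all later deadlines back. */
unsigned long rearm_from_now(unsigned long now, unsigned long interval)
{
	return now + interval;
}

/*
 * Re-arm from the remembered deadline: lateness of one firing does not
 * accumulate. If that deadline has already passed, fire on the next tick.
 */
unsigned long rearm_from_deadline(unsigned long now,
				  unsigned long last_deadline,
				  unsigned long interval)
{
	unsigned long next = last_deadline + interval;

	/* Signed difference keeps the comparison correct across wraparound. */
	if ((long)(now - next) >= 0)
		next = now + 1;
	return next;
}
```

For a timer due at tick 100 that actually runs at tick 103 with an interval of 25, the first variant schedules tick 128 while the second keeps the nominal tick 125.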
    • sfc: remove EFX_BUG_ON_PARANOID, use EFX_WARN_ON_[ONCE_]PARANOID instead · e01b16a7
      Edward Cree committed
      Logically, EFX_BUG_ON_PARANOID can never be correct.  For, BUG_ON should
      only be used if it is not possible to continue without potential harm;
      and since the non-DEBUG driver will continue regardless (as the BUG_ON is
      compiled out), clearly the BUG_ON cannot be needed in the DEBUG driver.
      So, replace every EFX_BUG_ON_PARANOID with either an EFX_WARN_ON_PARANOID
      or the newly defined EFX_WARN_ON_ONCE_PARANOID.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'samples-bpf-automated-cgroup-tests' · 816fba35
      David S. Miller committed
      Sargun Dhillon says:

      ====================
      samples, bpf: Refactor; Add automated tests for cgroups

      These two patches refactor out some old, reusable code from the
      existing test_current_task_under_cgroup_user test, and add a new,
      automated test.

      There is some generic cgroupsv2 setup & cleanup code, given that most
      environments still don't have it set up by default. With this code, we're
      able to pretty easily add an automated test for future cgroupsv2
      functionality.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples, bpf: Add automated test for cgroup filter attachments · 9b474ece
      Sargun Dhillon committed
      This patch adds the sample program test_cgrp2_attach2. This program is
      similar to test_cgrp2_attach, but it performs automated testing of the
      cgroupv2 BPF attached filters. It runs the following checks:
      * Simple filter attachment
      * Application of filters to child cgroups
      * Overriding filters on child cgroups
        * Checking that this still works when the parent filter is removed

      The filters that are used here are simply allow all / deny all filters, so
      it isn't checking the actual functionality of the filters, but rather
      the behaviour around detachment / attachment. If net_cls is enabled,
      this test will fail.
      Signed-off-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples, bpf: Refactor test_current_task_under_cgroup - separate out helpers · 1a922fee
      Sargun Dhillon committed
      This patch modifies test_current_task_under_cgroup_user. The test has
      several helpers around creating a temporary environment for cgroup
      testing, and moving the current task around cgroups. This set of
      helpers can then be used in other tests.
      Signed-off-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples/bpf: silence compiler warnings · 69a9d09b
      Alexei Starovoitov committed
      silence some of the clang compiler warnings like:
      include/linux/fs.h:2693:9: warning: comparison of unsigned enum expression < 0 is always false
      arch/x86/include/asm/processor.h:491:30: warning: taking address of packed member 'sp0' of class or structure 'x86_hw_tss' may result in an unaligned pointer value
      include/linux/cgroup-defs.h:326:16: warning: field 'cgrp' with variable sized type 'struct cgroup' not at the end of a struct or class is a GNU extension
      since they add too much noise to samples/bpf/ build.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: fix net_generic() "id - 1" bloat · 6af2d5ff
      Alexey Dobriyan committed
      net_generic() function is both a) inline and b) used ~600 times.
      
      It has the following code inside
      
      		...
      	ptr = ng->ptr[id - 1];
      		...
      
      "id" is never compile time constant so compiler is forced to subtract 1.
      And those decrements or LEA [r32 - 1] instructions add up.
      
      We also start id'ing from 1 to catch bugs where pernet subsystem id
      is not initialized and 0. This is a quite pointless idea (nothing will
      work or immediate interference with first registered subsystem) in
      general but it hints what needs to be done for code size reduction.
      
      Namely, overlaying allocation of pointer array and fixed part of
      structure in the beginning and using usual base-0 addressing.
      
      Ids are just cookies, their exact values do not matter, so let's start
      with 3 on x86_64.
      
      Code size savings (oh boy): -4.2 KB
      
      As usual, ignore the initial compiler stupidity part of the table.
      
      	add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
      	function                                     old     new   delta
      	tipc_nametbl_insert_publ                    1250    1270     +20
      	nlmclnt_lookup_host                          686     703     +17
      	nfsd4_encode_fattr                          5930    5941     +11
      	nfs_get_client                              1050    1061     +11
      	register_pernet_operations                   333     342      +9
      	tcf_mirred_init                              843     849      +6
      	tcf_bpf_init                                1143    1149      +6
      	gss_setup_upcall                             990     994      +4
      	idmap_name_to_id                             432     434      +2
      	ops_init                                     274     275      +1
      	nfsd_inject_forget_client                    259     260      +1
      	nfs4_alloc_client                            612     613      +1
      	tunnel_key_walker                            164     163      -1
      
      		...
      
      	tipc_bcbase_select_primary                   392     360     -32
      	mac80211_hwsim_new_radio                    2808    2767     -41
      	ipip6_tunnel_ioctl                          2228    2186     -42
      	tipc_bcast_rcv                               715     672     -43
      	tipc_link_build_proto_msg                   1140    1089     -51
      	nfsd4_lock                                  3851    3796     -55
      	tipc_mon_rcv                                1012     956     -56
      	Total: Before=156643951, After=156639743, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
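
The overlay idea from this commit message (the fixed part of the structure shares storage with the first slots of the pointer array, so lookups use base-0 indexing and valid ids start past the header) can be sketched in userspace. The names `generic_hdr`, `GENERIC_MIN_ID` and the `generic_*` helpers are hypothetical, and the two reserved pointers are only an assumed stand-in for whatever bookkeeping the real header carries.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed header that overlays the first pointer slots. */
struct generic_hdr {
	unsigned int len;	/* total number of pointer-sized slots */
	void *reserved[2];	/* assumed stand-in for extra bookkeeping */
};

/* First usable id: the header size rounded up to whole pointer slots. */
#define GENERIC_MIN_ID \
	((sizeof(struct generic_hdr) + sizeof(void *) - 1) / sizeof(void *))

/* One allocation: slots [0, GENERIC_MIN_ID) hold the header. */
void **generic_alloc(unsigned int len)
{
	void **slots;

	if (len < GENERIC_MIN_ID)
		return NULL;
	slots = calloc(len, sizeof(void *));
	if (slots)
		((struct generic_hdr *)slots)->len = len;
	return slots;
}

/* Callers must pass GENERIC_MIN_ID <= id < len. */
void generic_set(void **slots, unsigned int id, void *data)
{
	slots[id] = data;
}

/* Base-0 lookup: no "id - 1" adjustment is needed at each call site. */
void *generic_get(void **slots, unsigned int id)
{
	return slots[id];
}
```

On a typical 64-bit ABI this header occupies three pointer slots, which matches the "start with 3 on x86_64" remark, though the exact value depends on the platform.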
    • netns: add dummy struct inside "struct net_generic" · 9bfc7b99
      Alexey Dobriyan committed
      This is a precursor to fixing "[id - 1]" bloat inside net_generic().

      Name "s" is chosen to complement name "u" often used for dummy unions.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: publish net_generic correctly · 1a9a0592
      Alexey Dobriyan committed
      Publishing net_generic pointer is done with a silly mistake: new array is
      published BEFORE setting freshly acquired pernet subsystem pointer.
      
      	memcpy
      	rcu_assign_pointer
      	kfree_rcu
      	ng->ptr[id - 1] = data;
      
      This bug was introduced with commit dec827d1
      ("[NETNS]: The generic per-net pointers.") in the glorious days of
      chopping networking stack into containers proper 8.5 years ago (whee...)
      
      How did it not trigger for so long?
      Well, you need a quite specific set of conditions:
      
      *) race window opens once per pernet subsystem addition
         (read: modprobe or boot)
      
      *) not every pernet subsystem is eligible (need ->id and ->size)
      
      *) not every pernet subsystem is vulnerable (need incorrect or absent
         ordering of register_pernet_subsys() and actually using net_generic())
      
      *) to hide the bug even more, default is to preallocate 13 pointers which
         is actually quite a lot. You need IPv6, netfilter, bridging etc. loaded
         together to trigger reallocation in the first place. Trimmed-down
         configs are OK.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
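
The ordering this commit restores (fill in the new entry first, publish the array afterwards) can be sketched with C11 atomics in place of the kernel's RCU primitives. This is a simplified userspace model: `struct ptr_array`, `assign_slot()` and `lookup_slot()` are hypothetical, the array is always copied rather than updated in place, and the old array is freed immediately where the kernel would defer the free until readers are done.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable pointer array reached through one shared pointer. */
struct ptr_array {
	unsigned int len;
	void *ptr[];
};

static _Atomic(struct ptr_array *) published;

int assign_slot(unsigned int idx, void *data)
{
	struct ptr_array *old =
		atomic_load_explicit(&published, memory_order_relaxed);
	unsigned int old_len = old ? old->len : 0;
	unsigned int new_len = idx + 1 > old_len ? idx + 1 : old_len;
	struct ptr_array *ng;

	ng = calloc(1, sizeof(*ng) + new_len * sizeof(void *));
	if (!ng)
		return -1;
	ng->len = new_len;
	if (old)
		memcpy(ng->ptr, old->ptr, old_len * sizeof(void *));

	/* 1. Set the new entry while the array is still private... */
	ng->ptr[idx] = data;
	/* 2. ...and only then make the array visible to readers. */
	atomic_store_explicit(&published, ng, memory_order_release);

	free(old);	/* demo only: a concurrent reader could still hold it */
	return 0;
}

void *lookup_slot(unsigned int idx)
{
	struct ptr_array *ng =
		atomic_load_explicit(&published, memory_order_acquire);

	return (ng && idx < ng->len) ? ng->ptr[idx] : NULL;
}
```

With steps 1 and 2 swapped, a reader that picks up the new array between them would see an empty slot, which is the race window the commit message describes.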
    • netlink: 2-clause nla_ok() · 4f7df337
      Alexey Dobriyan committed
      nla_ok() consists of 3 clauses:
      
      	1) int rem >= (int)sizeof(struct nlattr)
      
      	2) u16 nla_len >= sizeof(struct nlattr)
      
      	3) u16 nla_len <= int rem
      
      The statement is that clause (1) is redundant.
      
      What it does is ensure that "rem" is a positive number,
      so that in clause (3) a positive number will be compared to a positive
      number with no problems.
      
      However, "u16" fully fits into "int" and integers do not change value
      when upcasting even to signed type. Negative integers will be rejected
      by clause (3) just fine. Small positive integers will be rejected
      by transitivity of comparison operator.
      
      NOTE: all of the above DOES NOT apply to nlmsg_ok() where ->nlmsg_len is
      u32(!), so 3 clauses AND A CAST TO INT are necessary.
      
      Obligatory space savings report: -1.6 KB
      
      	$ ./scripts/bloat-o-meter ../vmlinux-000* ../vmlinux-001*
      	add/remove: 0/0 grow/shrink: 3/63 up/down: 35/-1692 (-1657)
      	function                                     old     new   delta
      	validate_scan_freqs                          142     155     +13
      	tcf_em_tree_validate                         867     879     +12
      	dcbnl_ieee_del                               328     338     +10
      	netlbl_cipsov4_add_common.isra               218     215      -3
      		...
      	ovs_nla_put_actions                          888     806     -82
      	netlbl_cipsov4_add_std                      1648    1566     -82
      	nl80211_parse_sched_scan                    2889    2780    -109
      	ip_tun_from_nlattr                          3086    2945    -141
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
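
The three clauses and the two-clause variant can be compared side by side in a userspace sketch. `struct attr_hdr` and the two `attr_ok_*` functions are hypothetical stand-ins modelled on the description above, not the kernel's nla_ok().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 4-byte attribute header: 16-bit length, 16-bit type. */
struct attr_hdr {
	uint16_t len;
	uint16_t type;
};

/* All three clauses, in the order listed in the commit message. */
bool attr_ok_3clause(const struct attr_hdr *a, int rem)
{
	return rem >= (int)sizeof(*a) &&
	       a->len >= sizeof(*a) &&
	       a->len <= rem;
}

/*
 * Clause (1) dropped. The uint16_t length promotes to int without
 * changing value, so a negative "rem" still fails the last comparison,
 * and a "rem" smaller than the header fails by transitivity.
 */
bool attr_ok_2clause(const struct attr_hdr *a, int rem)
{
	return a->len >= sizeof(*a) &&
	       a->len <= rem;
}
```

The two functions return the same result for every (len, rem) pair, but they differ in what they touch: the two-clause form reads `a->len` even when `rem` says fewer than four bytes remain, so it is only safe where that header memory is known to be readable.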
    • staging: wilc1000: use reset to set mac header · fe211cd8
      Zhang Shengju committed
      Since the offset is zero, it's not necessary to use the set function.
      The reset function is straightforward, and removes the unnecessary add
      operation in the set function.
      Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
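
Why "set with offset zero" and "reset" are interchangeable can be shown with a small model. `struct buf` and the two `buf_*` helpers are hypothetical simplifications, assuming a set helper that is a reset followed by adding the offset, as the commit message implies.

```c
#include <stddef.h>

/* Hypothetical packet buffer: header positions are offsets from head. */
struct buf {
	unsigned char *head;	/* start of the allocation */
	unsigned char *data;	/* current start of packet data */
	unsigned short mac_header;
};

/* Point the header at the current data pointer. */
void buf_reset_mac_header(struct buf *b)
{
	b->mac_header = (unsigned short)(b->data - b->head);
}

/* Same as reset, plus an extra add that is a no-op when offset is 0. */
void buf_set_mac_header(struct buf *b, int offset)
{
	buf_reset_mac_header(b);
	b->mac_header += offset;
}
```

Calling `buf_set_mac_header(b, 0)` and `buf_reset_mac_header(b)` leave `mac_header` identical, so the reset form states the intent directly and skips the addition.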
    • iwlwifi: use reset to set transport header · a52a8a4d
      Zhang Shengju committed
      Since the offset is zero, it's not necessary to use the set function.
      The reset function is straightforward, and removes the unnecessary add
      operation in the set function.
      Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlx4: use reset to set mac header · 69029109
      Zhang Shengju committed
      Since the offset is zero, it's not necessary to use the set function.
      The reset function is straightforward, and removes the unnecessary add
      operation in the set function.
      Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnx2x: use reset to set network header · 0e24c0ad
      Zhang Shengju committed
      Since the offset is zero, it's not necessary to use the set function.
      The reset function is straightforward, and removes the unnecessary add
      operation in the set function.
      Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>