1. 02 Jul, 2016 3 commits
    • Merge branch 'bpf-robustify' · 6bd3847b
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      Further robustify putting BPF progs
      
      This series addresses a potential issue reported to us by Jann Horn
      with regards to putting progs. First patch moves progs generally under
      RCU destruction and second patch refactors getting of progs to simplify
      code a bit. For details, please see individual patches. Note, we think
      that addressing this one in net-next should be sufficient.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: refactor bpf_prog_get and type check into helper · 113214be
      Daniel Borkmann committed
      Since bpf_prog_get() and the program type check are used in a couple of
      places, refactor this into a small helper function that we can make use of.
      Since the non-RO prog->aux part is not used in performance-critical paths
      and a program destruction via RCU is very unlikely when doing the put, we
      shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
      check, but actually not taking the ref at all (due to being in fdget() /
      fdput() section of the bpf fd) is even cleaner and makes the diff smaller
      as well, so just go for that. Callsites are changed to make use of the new
      helper where possible.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
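The idea in the commit above (check the program type first, and only take a reference when it matches) can be illustrated with a small userspace model. This is only a sketch: `struct toy_prog` and `toy_prog_get_type()` are hypothetical names invented for illustration and are not the kernel's actual helper or data structures.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a BPF program object. */
struct toy_prog {
	int type;    /* program type tag */
	int refcnt;  /* reference count */
};

/*
 * Toy helper: check the type first and take a reference only on a match.
 * Returns 0 on success, -EINVAL on a type mismatch (refcnt is untouched).
 */
static int toy_prog_get_type(struct toy_prog *prog, int type)
{
	if (prog == NULL || prog->type != type)
		return -EINVAL;
	prog->refcnt++;
	return 0;
}
```

With this shape, a caller passing the wrong type never has to perform a matching put, which is the simplification the commit message describes.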
    • bpf: generally move prog destruction to RCU deferral · 1aacde3d
      Daniel Borkmann committed
      Jann Horn reported the following analysis that could potentially result
      in a very hard to trigger (if not impossible) UAF race. To quote his
      event timeline:
      
       - Set up a process with threads T1, T2 and T3
       - Let T1 set up a socket filter F1 that invokes another filter F2
         through a BPF map [tail call]
       - Let T1 trigger the socket filter via a unix domain socket write,
         don't wait for completion
       - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
       - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
       - Let T3 close the file descriptor for F2, dropping the reference
         count of F2 to 2
       - At this point, T1 should have looked up F2 from the map, but not
         finished executing it
       - Let T3 remove F2 from the BPF map, dropping the reference count of
         F2 to 1
       - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
         the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
         via schedule_work()
       - At this point, the BPF program could be freed
       - BPF execution is still running in a freed BPF program
      
      While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
      event fd we're doing the syscall on doesn't disappear from underneath us
      for the whole syscall time, this may not be the case for the bpf fd, which
      is used only as an argument, after we did the put. It needs to be a valid
      fd pointing to a BPF program at the time of the call to make the
      bpf_prog_get(), and while T2 gets preempted, F2 must have dropped its
      reference count to 1 on the other CPU. The fput() from the close() in T3
      should also add additional delay to the reference drop via exit_task_work()
      when bpf_prog_release() gets called, as well as scheduling
      bpf_prog_free_deferred().
      
      That said, it nevertheless makes sense to move the BPF prog destruction
      generally after an RCU grace period to guarantee that the scenario above,
      but also others as recently fixed in ceb56070 ("bpf, perf: delay release
      of BPF prog after grace period") with regards to tail calls, won't happen.
      Integrating bpf_prog_free_deferred() directly into the RCU callback is
      not allowed since the invocation might happen from either softirq or
      process context, so we're not permitted to block. All bpf_prog_put()
      invocations from the eBPF side (note, cBPF -> eBPF progs don't use this
      for their destruction) look good to me with call_rcu() after review.
      
      Since we don't know at the time of attaching the program whether we're
      already part of a tail call map, we need to use the RCU variant. However,
      this won't put severely more stress on the RCU callback queue:
      situations with the above bpf_prog_get() and bpf_prog_put() combo in
      practice normally won't lead to releases, but even if they would, enough
      effort/cycles have to be put into loading a BPF program into the kernel
      already.
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
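The effect of deferring destruction past a grace period can be sketched with a single-threaded toy model. Everything here is an assumption made for illustration: `struct toy_rcu_prog`, `toy_prog_put()` and `toy_grace_period_elapsed()` are hypothetical names, and the explicit "grace period" call merely stands in for what call_rcu() provides in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a BPF program; not the kernel's struct bpf_prog. */
struct toy_rcu_prog {
	int refcnt;
	bool freed;
	struct toy_rcu_prog *next_deferred;
};

/* Objects whose last reference was dropped but which may still be in use. */
static struct toy_rcu_prog *toy_deferred_list;

/* Drop a reference; on the last put, queue the object instead of freeing it. */
static void toy_prog_put(struct toy_rcu_prog *prog)
{
	if (--prog->refcnt == 0) {
		prog->next_deferred = toy_deferred_list;
		toy_deferred_list = prog;
	}
}

/* Called once every reader that might still be running a queued object is done. */
static void toy_grace_period_elapsed(void)
{
	while (toy_deferred_list != NULL) {
		struct toy_rcu_prog *prog = toy_deferred_list;

		toy_deferred_list = prog->next_deferred;
		prog->next_deferred = NULL;
		prog->freed = true; /* a real implementation would release memory here */
	}
}
```

In terms of the timeline above, F2 starts with three references (fd, map entry, T2's get). After the third put it is only queued, so T1 can finish executing it, and it is marked freed only once the grace period has elapsed.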
  2. 01 Jul, 2016 21 commits
  3. 30 Jun, 2016 16 commits
    • Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 435c556c
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2016-06-29
      
      This series contains updates and fixes to e1000e, igb, ixgbe and fm10k.  A
      true smorgasbord of changes.
      
      Jake cleans up some obscurity by not using the BIT() macro on bitshift
      operation and also fixed the calculated index when looping through the
      indir array.  Fixes the issue with igb's workqueue item for overflow
      check from causing a surprise remove event.  The ptp_flags variable is
      added to simplify the work of writing several complex MAC type checks
      in the PTP code while fixing the workqueue.
      
      Alex Duyck fixes the receive buffers alignment which should not be L1
      cache aligned, but to 512 bytes instead.
      
      Denys Vlasenko prevents a division by zero which was reported under
      VMware for e1000e.
      
      Amritha fixes an issue where filters in a child hash table must be
      cleared from the hardware before deleting the filter links in ixgbe.
      
      Bhaktipriya Shridhar simply replaces the deprecated create_workqueue()
      with alloc_workqueue() for fm10k.
      
      Tony corrects ixgbe ethtool reporting to show x550 supports hardware
      timestamping of all packets.
      
      Emil fixes an issue where MAC-VLANs on the VF fail to pass traffic due
      to spoofed packets.
      
      Andrew Lunn increases performance on some systems where syncing a buffer
      for DMA is expensive.  So rather than sync the whole 2K receive buffer,
      only synchronize the length of the frame.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'nfp-next' · c435e6e0
      David S. Miller committed
      Jakub Kicinski says:
      
      ====================
      nfp: few code improvements
      
      Three small patches for net-next.  The first and second patches
      improve the code quality by spelling things correctly and
      removing unused parameters.  The third patch hooks in the standard
      kernel implementation of .get_link() in ethtool ops.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfp: implement ethtool .get_link() callback · 2370def2
      Jakub Kicinski committed
      Point the ethtool .get_link() callback to the standard
      ethtool_op_get_link() implementation.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfp: remove unused parameter from nfp_net_write_mac_addr() · f642963b
      Jakub Kicinski committed
      nfp_net_write_mac_addr() always writes to the BAR the current
      device address taken from netdev struct.  The address given
      as parameter is actually ignored.  Since all callers pass
      netdev->dev_addr simply remove the parameter.
      
      While at it improve the function's kdoc a bit.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfp: correct name of control BAR define · 796312cd
      Jakub Kicinski committed
      Spell abbreviation of control as ctrl not crtl.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • be2net: signedness bug in be_msix_enable() · 6fde0e63
      Dan Carpenter committed
      "num_vec" needs to be signed for the error handling to work.
      
      Fixes: e261768e ('be2net: support asymmetric rx/tx queue counts')
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Sathya Perla <sathya.perla@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
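Why the variable needs to be signed can be shown with a short, generic C sketch. The function names below are hypothetical and do not come from the be2net driver; `toy_enable_vectors()` simply stands in for any call that returns either a count or a negative errno.

```c
#include <errno.h>

/* Hypothetical stand-in for a call that returns a vector count or a negative errno. */
static int toy_enable_vectors(int requested)
{
	return requested > 0 ? requested : -ENOSPC;
}

/* Signed storage keeps the negative errno visible to a later "< 0" check. */
static int toy_store_signed(int requested)
{
	int num_vec = toy_enable_vectors(requested);

	return num_vec;
}

/* Unsigned storage turns the negative errno into a huge positive count. */
static unsigned int toy_store_unsigned(int requested)
{
	unsigned int num_vec = (unsigned int)toy_enable_vectors(requested);

	return num_vec;
}
```

With the unsigned variant, a failure comes back as a very large number, so an error check of the form `num_vec < 0` can never fire and the error path is silently skipped.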
    • net: netcp: Fix a typo in keystone-netcp.txt · 9b9a553c
      Masanari Iida committed
      This patch fixes a spelling typo in keystone-netcp.txt.
      Signed-off-by: Masanari Iida <standby24x7@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mediatek-next' · 833ba3d5
      David S. Miller committed
      John Crispin says:
      
      ====================
      net-next: mediatek: IRQ cleanups, fixes and grouping
      
      This series contains 2 small code cleanups that are leftovers from the
      MIPS support. There is also a small fix that adds proper locking to the
      code accessing the IRQ registers. Without this fix we saw deadlocks caused
      by the last patch of the series, which adds IRQ grouping. The grouping
      feature allows us to use different IRQs for TX and RX. By doing so we can
      use affinity to let the SoC handle the IRQs on different cores.
      
      This series depends on a previous series currently sitting in net.git
      starting with
      	commit 562c5a70 ("net: mediatek: only wake the queue if it is stopped")
      up to
      	commit 82c6544d ("net: mediatek: remove superfluous queue wake up call")
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-next: mediatek: add support for IRQ grouping · 80673029
      John Crispin committed
      The ethernet core has 3 IRQs. Using the IRQ grouping registers we are able
      to separate TX and RX IRQs, which allows us to service them on separate
      cores. This patch splits the IRQ handler into 2 separate functions, one for
      TX and another for RX. The TX housekeeping is split out into its own NAPI
      handler.
      Signed-off-by: John Crispin <john@phrozen.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-next: mediatek: add IRQ locking · 7bc9ccec
      John Crispin committed
      The code that enables and disables IRQs is missing proper locking. After
      adding the IRQ grouping patch and routing the RX and TX IRQs to different
      cores we experienced IRQ stalls. Fix this by adding proper locking.
      We use a dedicated lock to reduce the latency of the IRQ code.
      Signed-off-by: John Crispin <john@phrozen.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-next: mediatek: don't use intermediate variables to store IRQ masks · eece71e8
      John Crispin committed
      The code currently uses variables to store and never modify the bit masks
      of interrupts. This is legacy code from an early version of the driver
      that supported MIPS based SoCs where the IRQ bits depended on the actual
      SoC. As the bits are the same for all ARM based SoCs using this driver we
      can remove the intermediate variables.
      Signed-off-by: John Crispin <john@phrozen.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-next: mediatek: remove superfluous register reads · 6e6edd8b
      John Crispin committed
      The driver was originally written for MIPS based SoCs. These required the
      IRQ mask register to be read after writing it to ensure that the content
      was actually applied. As this version only works on ARM based SoCs, we can
      safely remove the 2 reads as they are no longer required.
      Signed-off-by: John Crispin <john@phrozen.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_rules: Added NLM_F_EXCL support to fib_nl_newrule · 153380ec
      Mateusz Bajorski committed
      When adding a rule with the NLM_F_EXCL flag, check whether the same rule
      already exists. If so, exit with -EEXIST.
      
      This is already implemented in iproute2:
              if (cmd == RTM_NEWRULE) {
                      req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL;
                      req.r.rtm_type = RTN_UNICAST;
              }
      
      Tested ipv4 and ipv6 with net-next linux on qemu x86
      
      expected behavior after patch:
      localhost ~ # ip rule
      0:    from all lookup local
      32766:    from all lookup main
      32767:    from all lookup default
      localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
      localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
      RTNETLINK answers: File exists
      localhost ~ # ip rule
      0:    from all lookup local
      1005:    from 10.46.177.97 lookup 104
      32766:    from all lookup main
      32767:    from all lookup default
      
      There was already a thread regarding this, but I don't see any changes
      merged and the problem still occurs.
      https://lkml.kernel.org/r/1135778809.5944.7.camel+%28%29+localhost+%21+localdomain
      Signed-off-by: Mateusz Bajorski <mateusz.bajorski@nokia.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
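The behaviour described in this commit (reject an exact duplicate only when NLM_F_EXCL is set) can be sketched as a small userspace model. The rule structure, the fixed-size table and every `toy_`/`TOY_` name are hypothetical simplifications, not the kernel's fib_rules code; the two flag constants mirror the values of NLM_F_EXCL and NLM_F_CREATE in <linux/netlink.h>.

```c
#include <errno.h>
#include <stddef.h>

/* Local copies of the netlink flag values from <linux/netlink.h>. */
#define TOY_NLM_F_EXCL   0x200
#define TOY_NLM_F_CREATE 0x400

#define TOY_MAX_RULES 16

/* Hypothetical, heavily simplified rule: source, table and preference only. */
struct toy_rule {
	unsigned int src;
	unsigned int table;
	unsigned int pref;
};

static struct toy_rule toy_rules[TOY_MAX_RULES];
static size_t toy_rule_count;

static int toy_rule_equal(const struct toy_rule *a, const struct toy_rule *b)
{
	return a->src == b->src && a->table == b->table && a->pref == b->pref;
}

/* Add a rule; with TOY_NLM_F_EXCL set, refuse an exact duplicate with -EEXIST. */
static int toy_newrule(const struct toy_rule *rule, unsigned int flags)
{
	size_t i;

	if (flags & TOY_NLM_F_EXCL) {
		for (i = 0; i < toy_rule_count; i++)
			if (toy_rule_equal(&toy_rules[i], rule))
				return -EEXIST;
	}
	if (toy_rule_count == TOY_MAX_RULES)
		return -ENOMEM;
	toy_rules[toy_rule_count++] = *rule;
	return 0;
}
```

Adding the same rule twice with `TOY_NLM_F_CREATE | TOY_NLM_F_EXCL` returns 0 and then -EEXIST, which corresponds to the "RTNETLINK answers: File exists" message in the session above.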
    • tcp: increase size at which tcp_bound_to_half_wnd bounds to > TCP_MSS_DEFAULT · 2631b79f
      Seymour, Shane M committed
      In previous commit 01f83d69
      the following comments were added:
      
      "When peer uses tiny windows, there is no use in packetizing to sub-MSS
      pieces for the sake of SWS or making sure there are enough packets in
      the pipe for fast recovery."
      
      The test should be > TCP_MSS_DEFAULT not >= 512. This allows low end
      devices that send an MSS of 536 (TCP_MSS_DEFAULT) to see better network
      performance by sending it 536 bytes of data at a time instead of bounding
      to half window size (268). Other network stacks work this way, e.g. HP-UX.
      Signed-off-by: Shane Seymour <shane.seymour@hpe.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
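The arithmetic in this commit message can be checked with a simplified model of the cutoff logic. This is a sketch under stated assumptions: the `toy_` functions are hypothetical, they take plain integers instead of a `struct tcp_sock`, and the floor of 68 bytes minus the header length is reproduced from memory of the kernel function, so it should be treated as an assumption. Only the two thresholds (`>= 512` before, `> TCP_MSS_DEFAULT` after) come from the text above.

```c
#define TOY_TCP_MSS_DEFAULT 536 /* value stated in the commit message */

static int toy_max(int a, int b)
{
	return a > b ? a : b;
}

/* Shared tail: clamp pktsize to the cutoff, with an assumed small lower floor. */
static int toy_apply_cutoff(int cutoff, int pktsize, int tcp_header_len)
{
	if (cutoff && pktsize > cutoff)
		return toy_max(cutoff, 68 - tcp_header_len);
	return pktsize;
}

/* Old test: windows of 512 bytes or more are bounded to half the window. */
static int toy_bound_old(int max_window, int pktsize, int tcp_header_len)
{
	int cutoff = (max_window >= 512) ? (max_window >> 1) : max_window;

	return toy_apply_cutoff(cutoff, pktsize, tcp_header_len);
}

/* New test: only windows larger than TCP_MSS_DEFAULT are bounded to half. */
static int toy_bound_new(int max_window, int pktsize, int tcp_header_len)
{
	int cutoff = (max_window > TOY_TCP_MSS_DEFAULT) ? (max_window >> 1)
							 : max_window;

	return toy_apply_cutoff(cutoff, pktsize, tcp_header_len);
}
```

For a peer with a maximum window of 536 bytes and a 536-byte packet, the old test yields 268 (half the window) while the new test yields the full 536, matching the numbers quoted in the commit message.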
    • tcp: add an ability to dump and restore window parameters · b1ed4c4f
      Andrey Vagin committed
      We found that sometimes a restored tcp socket doesn't work.
      
      The reason for this bug is incorrect window parameters: in this case
      tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
      other side drops packets with this seq, because seq is less than
      tp->rcv_nxt ( tcp_sequence() ).
      
      Data from a send queue is sent only if there is enough space in a
      window, so when we restore unacked data, we need to expand a window to
      fit this data.
      
      This was in a first version of this patch:
      "tcp: extend window to fit all restored unacked data in a send queue"
      
      Then Alexey recommended restoring the window parameters instead of
      adjusting them according to the data in the send queue. This sounds reasonable.
      
      rcv_wnd has to be restored, because it was reported to the other side
      and the offered window is never shrunk.
      One of the reasons why we need to restore snd_wnd was described above.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bridge-igmp-stats' · 641f7e40
      David S. Miller committed
      Nikolay Aleksandrov says:
      
      ====================
      net: bridge: add support for IGMP/MLD stats
      
      This patchset adds support for the new IFLA_STATS_LINK_XSTATS_SLAVE
      attribute which can be used with RTM_GETSTATS in order to export per-slave
      statistics. It works by passing the attribute to the linkxstats callback
      and if the callback user supports it - it should dump that slave's stats.
      This is much more scalable and permits us to request only a single port's
      statistics instead of dumping everything every time.
      The second patch adds support for per-port IGMP/MLD statistics and uses
      the new API to export them for the bridge and its ports. The stats are
      made in a very lightweight manner, the normal fast-path is not affected
      at all and the flood paths (br_flood/br_multicast_flood) are only affected
      if the packet is IGMP and the IGMP stats have been enabled using cache-hot
      data for the check.
      
      v2: Patch 01 is new, patch 02 has been reworked to use the new API, also
      in addition counters for IGMP/MLD parse errors have been added and members
      are added for per-port multicast traffic stats. The multicast counting has
      been slightly optimized (moved the br_multicast_count inside the IPv4/6
      IGMP functions after the checks for IGMP traffic) to avoid one conditional
      that was on all of the multicast traffic path (both IGMP and other).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>