1. 03 3月, 2015 2 次提交
  2. 02 3月, 2015 12 次提交
    • E
      net: move skb->dropcount to skb->cb[] · 744d5a3e
      Eyal Birger 提交于
      Commit 97775007 ("af_packet: add interframe drop cmsg (v6)")
      unionized skb->mark and skb->dropcount in order to allow recording
      of the socket drop count while maintaining struct sk_buff size.
      
      skb->dropcount was introduced since there was no available room
      in skb->cb[] in packet sockets. However, its introduction led to
      the inability to export skb->mark, or any other aliased field to
      userspace if so desired.
      
      Moving the dropcount metric to skb->cb[] eliminates this problem
      at the expense of 4 bytes less in skb->cb[] for protocol families
      using it.
      Signed-off-by: NEyal Birger <eyal.birger@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      744d5a3e
    • E
      net: add common accessor for setting dropcount on packets · 3bc3b96f
      Eyal Birger 提交于
      As part of an effort to move skb->dropcount to skb->cb[], use
      a common function in order to set dropcount in struct sk_buff.
      Signed-off-by: NEyal Birger <eyal.birger@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3bc3b96f
    • E
      net: use common macro for assering skb->cb[] available size in protocol families · b4772ef8
      Eyal Birger 提交于
      As part of an effort to move skb->dropcount to skb->cb[] use a common
      macro in protocol families using skb->cb[] for ancillary data to
      validate available room in skb->cb[].
      Signed-off-by: NEyal Birger <eyal.birger@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4772ef8
    • E
      net: packet: use sockaddr_ll fields as storage for skb original length in recvmsg path · 2472d761
      Eyal Birger 提交于
      As part of an effort to move skb->dropcount to skb->cb[], 4 bytes
      of additional room are needed in skb->cb[] in packet sockets.
      
      Store the skb original length in the first two fields of sockaddr_ll
      (sll_family and sll_protocol) as they can be derived from the skb when
      needed.
      Signed-off-by: NEyal Birger <eyal.birger@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2472d761
    • E
      net: rxrpc: change call to sock_recv_ts_and_drops() on rxrpc recvmsg to sock_recv_timestamp() · 2cfdf9fc
      Eyal Birger 提交于
      Commit 3b885787 ("net: Generalize socket rx gap / receive queue overflow cmsg")
      allowed receiving packet dropcount information as a socket level option.
      RXRPC sockets recvmsg function was changed to support this by calling
      sock_recv_ts_and_drops() instead of sock_recv_timestamp().
      
      However, protocol families wishing to receive dropcount should call
      sock_queue_rcv_skb() or set the dropcount specifically (as done
      in packet_rcv()). This was not done for rxrpc and thus this feature
      never worked on these sockets.
      
      Formalizing this by not calling sock_recv_ts_and_drops() in rxrpc as
      part of an effort to move skb->dropcount into skb->cb[]
      Signed-off-by: NEyal Birger <eyal.birger@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2cfdf9fc
    • E
      net: bluetooth: compact struct bt_skb_cb by converting boolean fields to bit fields · 6368c235
      Eyal Birger 提交于
      Convert boolean fields incoming and req_start to bit fields and move
      force_active in order save space in bt_skb_cb in an effort to use
      a portion of skb->cb[] for storing skb->dropcount.
      Signed-off-by: NEyal Birger <eyal.birger@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6368c235
    • E
      net: bluetooth: compact struct bt_skb_cb by inlining struct hci_req_ctrl · 49a6fe05
      Eyal Birger 提交于
      struct hci_req_ctrl is never used outside of struct bt_skb_cb;
      Inlining it frees 8 bytes on a 64 bit system in skb->cb[] allowing
      the addition of more ancillary data.
      Signed-off-by: NEyal Birger <eyal.birger@gmail.com>
      Reviewed-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49a6fe05
    • D
      cls_bpf: add initial eBPF support for programmable classifiers · e2e9b654
      Daniel Borkmann 提交于
      This work extends the "classic" BPF programmable tc classifier by
      extending its scope also to native eBPF code!
      
      This allows for user space to implement own custom, 'safe' C like
      classifiers (or whatever other frontend language LLVM et al may
      provide in future), that can then be compiled with the LLVM eBPF
      backend to an eBPF elf file. The result of this can be loaded into
      the kernel via iproute2's tc. In the kernel, they can be JITed on
      major archs and thus run in native performance.
      
      Simple, minimal toy example to demonstrate the workflow:
      
        #include <linux/ip.h>
        #include <linux/if_ether.h>
        #include <linux/bpf.h>
      
        #include "tc_bpf_api.h"
      
        __section("classify")
        int cls_main(struct sk_buff *skb)
        {
          return (0x800 << 16) | load_byte(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos));
        }
      
        char __license[] __section("license") = "GPL";
      
      The classifier can then be compiled into eBPF opcodes and loaded
      via tc, for example:
      
        clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
        tc filter add dev em1 parent 1: bpf cls.o [...]
      
      As it has been demonstrated, the scope can even reach up to a fully
      fledged flow dissector (similarly as in samples/bpf/sockex2_kern.c).
      
      For tc, maps are allowed to be used, but from kernel context only,
      in other words, eBPF code can keep state across filter invocations.
      In future, we perhaps may reattach from a different application to
      those maps e.g., to read out collected statistics/state.
      
      Similarly as in socket filters, we may extend functionality for eBPF
      classifiers over time depending on the use cases. For that purpose,
      cls_bpf programs are using BPF_PROG_TYPE_SCHED_CLS program type, so
      we can allow additional functions/accessors (e.g. an ABI compatible
      offset translation to skb fields/metadata). For an initial cls_bpf
      support, we allow the same set of helper functions as eBPF socket
      filters, but we could diverge at some point in time w/o problem.
      
      I was wondering whether cls_bpf and act_bpf could share C programs,
      I can imagine that at some point, we introduce i) further common
      handlers for both (or even beyond their scope), and/or if truly needed
      ii) some restricted function space for each of them. Both can be
      abstracted easily through struct bpf_verifier_ops in future.
      
      The context of cls_bpf versus act_bpf is slightly different though:
      a cls_bpf program will return a specific classid whereas act_bpf a
      drop/non-drop return code, latter may also in future mangle skbs.
      That said, we can surely have a "classify" and "action" section in
      a single object file, or considered mentioned constraint add a
      possibility of a shared section.
      
      The workflow for getting native eBPF running from tc [1] is as
      follows: for f_bpf, I've added a slightly modified ELF parser code
      from Alexei's kernel sample, which reads out the LLVM compiled
      object, sets up maps (and dynamically fixes up map fds) if any, and
      loads the eBPF instructions all centrally through the bpf syscall.
      
      The resulting fd from the loaded program itself is being passed down
      to cls_bpf, which looks up struct bpf_prog from the fd store, and
      holds reference, so that it stays available also after tc program
      lifetime. On tc filter destruction, it will then drop its reference.
      
      Moreover, I've also added the optional possibility to annotate an
      eBPF filter with a name (e.g. path to object file, or something
      else if preferred) so that when tc dumps currently installed filters,
      some more context can be given to an admin for a given instance (as
      opposed to just the file descriptor number).
      
      Last but not least, bpf_prog_get() and bpf_prog_put() needed to be
      exported, so that eBPF can be used from cls_bpf built as a module.
      Thanks to 60a3b225 ("net: bpf: make eBPF interpreter images
      read-only") I think this is of no concern since anything wanting to
      alter eBPF opcode after verification stage would crash the kernel.
      
        [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpfSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2e9b654
    • D
      ebpf: move read-only fields to bpf_prog and shrink bpf_prog_aux · 24701ece
      Daniel Borkmann 提交于
      is_gpl_compatible and prog_type should be moved directly into bpf_prog
      as they stay immutable during bpf_prog's lifetime, are core attributes
      and they can be locked as read-only later on via bpf_prog_select_runtime().
      
      With a bit of rearranging, this also allows us to shrink bpf_prog_aux
      to exactly 1 cacheline.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      24701ece
    • D
      ebpf: add sched_cls_type and map it to sk_filter's verifier ops · 96be4325
      Daniel Borkmann 提交于
      As discussed recently and at netconf/netdev01, we want to prevent making
      bpf_verifier_ops registration available for modules, but have them at a
      controlled place inside the kernel instead.
      
      The reason for this is, that out-of-tree modules can go crazy and define
      and register any verfifier ops they want, doing all sorts of crap, even
      bypassing available GPLed eBPF helper functions. We don't want to offer
      such a shiny playground, of course, but keep strict control to ourselves
      inside the core kernel.
      
      This also encourages us to design eBPF user helpers carefully and
      generically, so they can be shared among various subsystems using eBPF.
      
      For the eBPF traffic classifier (cls_bpf), it's a good start to share
      the same helper facilities as we currently do in eBPF for socket filters.
      
      That way, we have BPF_PROG_TYPE_SCHED_CLS look like it's own type, thus
      one day if there's a good reason to diverge the set of helper functions
      from the set available to socket filters, we keep ABI compatibility.
      
      In future, we could place all bpf_prog_type_list at a central place,
      perhaps.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      96be4325
    • D
      ebpf: remove CONFIG_BPF_SYSCALL ifdefs in socket filter code · d4052c4a
      Daniel Borkmann 提交于
      This gets rid of CONFIG_BPF_SYSCALL ifdefs in the socket filter code,
      now that the BPF internal header can deal with it.
      
      While going over it, I also changed eBPF related functions to a sk_filter
      prefix to be more consistent with the rest of the file.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4052c4a
    • D
      ebpf: constify various function pointer structs · a2c83fff
      Daniel Borkmann 提交于
      We can move bpf_map_ops and bpf_verifier_ops and other structs into ro
      section, bpf_map_type_list and bpf_prog_type_list into read mostly.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a2c83fff
  3. 01 3月, 2015 4 次提交
    • E
      tcp: cleanup static functions · 74abc20c
      Eric Dumazet 提交于
      tcp_fastopen_create_child() is static and should not be exported.
      
      tcp4_gso_segment() and tcp6_gso_segment() should be static.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74abc20c
    • E
      tcp: tso: allow CA_CWR state in tcp_tso_should_defer() · a0ea700e
      Eric Dumazet 提交于
      Another TCP issue is triggered by ECN.
      
      Under pressure, receiver gets ECN marks, and send back ACK packets
      with ECE TCP flag. Senders enter CA_CWR state.
      
      In this state, tcp_tso_should_defer() is short cut :
      
      if (icsk->icsk_ca_state != TCP_CA_Open)
          goto send_now;
      
      This means that about all ACK packets we receive are triggering
      a partial send, and because cwnd is kept small, we can only send
      a small amount of data for each incoming ACK,
      which in return generate more ACK packets.
      
      Allowing CA_Open and CA_CWR states to enable TSO defer in
      tcp_tso_should_defer() brings performance back :
      TSO autodefer has more chance to defer under pressure.
      
      This patch increases TSO and LRO/GRO efficiency back to normal levels,
      and does not impact overall ECN behavior.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a0ea700e
    • E
      tcp: tso: restore IW10 after TSO autosizing · 50c8339e
      Eric Dumazet 提交于
      With sysctl_tcp_min_tso_segs being 4, it is very possible
      that tcp_tso_should_defer() decides not sending last 2 MSS
      of initial window of 10 packets. This also applies if
      autosizing decides to send X MSS per GSO packet, and cwnd
      is not a multiple of X.
      
      This patch implements an heuristic based on age of first
      skb in write queue : If it was sent very recently (less than half srtt),
      we can predict that no ACK packet will come in less than half rtt,
      so deferring might cause an under utilization of our window.
      
      This is visible on initial send (IW10) on web servers,
      but more generally on some RPC, as the last part of the message
      might need an extra RTT to get delivered.
      
      Tested:
      
      Ran following packetdrill test
      // A simple server-side test that sends exactly an initial window (IW10)
      // worth of packets.
      
      `sysctl -e -q net.ipv4.tcp_min_tso_segs=4`
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +.1   < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.1   < . 1:1(0) ack 1 win 257
      +0    accept(3, ..., ...) = 4
      
      +0    write(4, ..., 14600) = 14600
      +0    > . 1:5841(5840) ack 1 win 457
      +0    > . 5841:11681(5840) ack 1 win 457
      // Following packet should be sent right now.
      +0    > P. 11681:14601(2920) ack 1 win 457
      
      +.1   < . 1:1(0) ack 14601 win 257
      
      +0    close(4) = 0
      +0    > F. 14601:14601(0) ack 1
      +.1   < F. 1:1(0) ack 14602 win 257
      +0    > . 14602:14602(0) ack 2
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      50c8339e
    • E
      tcp: tso: remove tp->tso_deferred · 5f852eb5
      Eric Dumazet 提交于
      TSO relies on ability to defer sending a small amount of packets.
      Heuristic is to wait for future ACKS in hope to send more packets at once.
      Current algorithm uses a per socket tso_deferred field as a pseudo timer.
      
      This pseudo timer relies on future ACK, but there is no guarantee
      we receive them in time.
      
      Fix would be to use a real timer, but cost of such timer is probably too
      expensive for typical cases.
      
      This patch changes the logic to test the time of last transmit,
      because we should not add bursts of more than 1ms for any given flow.
      
      We've used this patch for about two years at Google, before FQ/pacing
      as it would reduce a fair amount of bursts.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5f852eb5
  4. 28 2月, 2015 12 次提交
  5. 27 2月, 2015 1 次提交
    • R
      bridge: fix link notification skb size calculation to include vlan ranges · fed0a159
      Roopa Prabhu 提交于
      my previous patch skipped vlan range optimizations during skb size
      calculations for simplicity.
      
      This incremental patch considers vlan ranges during
      skb size calculations. This leads to a bit of code duplication
      in the fill and size calculation functions. But, I could not find a
      prettier way to do this. will take any suggestions.
      
      Previously, I had reused the existing br_get_link_af_size size calculation
      function to calculate skb size for notifications. Reusing it this time
      around creates some change in behaviour issues for the usual
      .get_link_af_size callback.
      
      This patch adds a new br_get_link_af_size_filtered() function to
      base the size calculation on the incoming filter flag and include
      vlan ranges.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Reviewed-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fed0a159
  6. 26 2月, 2015 3 次提交
    • G
      net: dsa: Introduce dsa_is_port_initialized · d79d2107
      Guenter Roeck 提交于
      To avoid race conditions when using the ds->ports[] array,
      we need to check if the accessed port has been initialized.
      Introduce and use helper function dsa_is_port_initialized
      for that purpose and use it where needed.
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d79d2107
    • F
      net: dsa: integrate with SWITCHDEV for HW bridging · b73adef6
      Florian Fainelli 提交于
      In order to support bridging offloads in DSA switch drivers, select
      NET_SWITCHDEV to get access to the port_stp_update and parent_get_id
      NDOs that we are required to implement.
      
      To facilitate the integratation at the DSA driver level, we implement 3
      types of operations:
      
      - port_join_bridge
      - port_leave_bridge
      - port_stp_update
      
      DSA will resolve which switch ports that are currently bridge port
      members as some Switch hardware/drivers need to know about that to limit
      the register programming to just the relevant registers (especially for
      slow MDIO buses).
      
      We also take care of setting the correct STP state when slave network
      devices are brought up/down while being bridge members.
      
      Finally, when a port is leaving the bridge, we make sure we set in
      BR_STATE_FORWARDING state, otherwise the bridge layer would leave it
      disabled as a result of having left the bridge.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: NGuenter Roeck <linux@roeck-us.net>
      Tested-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b73adef6
    • G
      net: dsa: Ensure that port array elements are initialized before being used · d87d6f44
      Guenter Roeck 提交于
      A network device notifier can be called for one or more of the created
      slave devices before all slave devices have been registered. This can
      result in a mismatch between ds->phys_port_mask and the registered devices
      by the time the call is made, and it can result in a slave device being
      added to a bridge before its entry in ds->ports[] has been initialized.
      
      Rework the initialization code to initialize entries in ds->ports[] in
      dsa_slave_create. With this change, dsa_slave_create no longer needs
      to return slave_dev but can return an error code instead.
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d87d6f44
  7. 25 2月, 2015 1 次提交
  8. 23 2月, 2015 4 次提交
  9. 22 2月, 2015 1 次提交