1. 29 Sep, 2014 5 commits
    • E
      tcp: change tcp_skb_pcount() location · cd7d8498
      Eric Dumazet committed
      Our goal is to touch no more than one cache line per skb in
      a write or receive queue when doing the various walks.
      
      After recent TCP_SKB_CB() reorganizations, it is almost done.
      
      Last part is tcp_skb_pcount() which currently uses
      skb_shinfo(skb)->gso_segs, which is a terrible choice, because it needs
      3 cache lines in current kernel (skb->head, skb->end, and
      shinfo->gso_segs are all in 3 different cache lines, far from skb->cb)
      
      This very simple patch reuses space currently taken by tcp_tw_isn
      only in input path, as tcp_skb_pcount is only needed for skb stored in
      write queue.
      
      This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags
      to get SKBTX_ACK_TSTAMP, which seems possible.
      
      This also speeds up all sack processing in general.
      
      This speeds up tcp_sendmsg() because it no longer has to access/dirty
      shinfo.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd7d8498
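The space-sharing idea described in this commit can be sketched in plain C. This is a simplified userspace model, not the kernel's definition: the `model_` names and the reduced field set are assumptions for illustration, and `tcp_gso_segs` is an assumed name for the per-skb segment count that shares storage with `tcp_tw_isn`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of the TCP control block kept in skb->cb[]. */
struct model_tcp_cb {
    uint32_t seq;     /* starting sequence number */
    uint32_t end_seq; /* ending sequence number */
    union {
        uint32_t tcp_tw_isn;   /* meaningful in the input path only */
        uint32_t tcp_gso_segs; /* meaningful for write-queue skbs only (assumed name) */
    };
};

/* Sharing the slot must not grow the control block. */
static_assert(sizeof(struct model_tcp_cb) == 12, "union adds no extra space");

/* Minimal stand-in for an skb: a list pointer plus the control block. */
struct model_skb {
    struct model_skb *next;
    struct model_tcp_cb cb;
};

/* Segment count read from the control block, right next to seq/end_seq. */
static inline int model_tcp_skb_pcount(const struct model_skb *skb)
{
    return (int)skb->cb.tcp_gso_segs;
}

/* Total segments in a write queue; each step reads only next and cb. */
static int model_queue_pcount(const struct model_skb *head)
{
    int total = 0;

    for (const struct model_skb *skb = head; skb != NULL; skb = skb->next)
        total += model_tcp_skb_pcount(skb);
    return total;
}
```

In this model a queue walk never dereferences anything outside the skb itself, which is the property the commit message is after.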
    • D
      Merge branch 'tcp_skb_cb' · dc83d4d8
      David S. Miller committed
      Eric Dumazet says:
      
      ====================
      tcp: better TCP_SKB_CB layout
      
      TCP had the assumption that IPCB and IP6CB are first members of skb->cb[]
      
      This is fine, except that IPCB/IP6CB are used in TCP for a very short time
      in input path.
      
      What really matters for TCP stack is to get skb->next,
      TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq in the same cache line.
      
      skbs that are immediately consumed do not care, because the whole skb->cb[] is
      hot in the CPU cache, while skbs that sit in socket write or receive queues
      do not need TCP_SKB_CB(skb)->header at all.
      
      This patch set implements the prereq for IPv4, IPv6, and TCP to make this
      possible. This makes TCP more efficient.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dc83d4d8
    • E
      tcp: better TCP_SKB_CB layout to reduce cache line misses · 971f10ec
      Eric Dumazet committed
      TCP maintains lists of skb in write queue, and in receive queues
      (in order and out of order queues)
      
      Scanning these lists both in input and output path usually requires
      access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq
      
      These fields are currently in two different cache lines, meaning we
      waste a lot of memory bandwidth when these queues are big and flows
      have either packet drops or packet reorders.
      
      We can move TCP_SKB_CB(skb)->header at the end of TCP_SKB_CB, because
      this header is not used in fast path. This allows TCP to search much faster
      in the skb lists.
      
      Even with regular flows, we save one cache line miss in fast path.
      
      Thanks to Christoph Paasch for noticing we need to clean up
      skb->cb[] (IPCB/IP6CB) before entering IP stack in tx path,
      and that I forgot IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      971f10ec
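The layout argument in this commit can be checked with `offsetof` on a toy structure. The following is a sketch under stated assumptions: a 64-byte cache line, an skb that starts on a cache line boundary, and invented `model_` types whose field sizes are placeholders rather than the kernel's real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_CACHE_LINE 64 /* assumed cache line size in bytes */

/* Stand-in for the IPCB/IP6CB data that TCP needs only briefly on input. */
struct model_ip_parm {
    uint8_t opaque[24];
};

/* Hot fields first, rarely used header last. */
struct model_tcp_cb {
    uint32_t seq;
    uint32_t end_seq;
    uint32_t other[4];           /* remaining per-skb TCP state (placeholder) */
    struct model_ip_parm header; /* moved to the end of the block */
};

/* Leading part of a toy skb: list linkage, a few fields, then cb[]. */
struct model_skb {
    struct model_skb *next;
    struct model_skb *prev;
    uint64_t tstamp;
    void *sk;
    void *dev;
    union {
        uint64_t align; /* keeps the control block 8-byte aligned */
        char cb[48];
        struct model_tcp_cb tcp;
    };
};

static_assert(sizeof(struct model_tcp_cb) <= 48, "control block must fit in cb[]");

/* Cache line index of a byte offset, assuming the skb starts on a line boundary. */
static size_t model_cache_line_of(size_t offset)
{
    return offset / MODEL_CACHE_LINE;
}
```

Comparing `model_cache_line_of(offsetof(struct model_skb, next))` with the line index of the last byte of `end_seq` shows whether a queue walk stays within one line; placing `header` first instead would push `seq` and `end_seq` further from `next`.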
    • E
      ipv6: add a struct inet6_skb_parm param to ipv6_opt_accepted() · a224772d
      Eric Dumazet committed
      ipv6_opt_accepted() assumes IP6CB(skb) holds the struct inet6_skb_parm
      that it needs. Let's not assume this, as the TCP stack might use a different
      place.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a224772d
    • E
      ipv4: rename ip_options_echo to __ip_options_echo() · 24a2d43d
      Eric Dumazet committed
      ip_options_echo() assumes struct ip_options is provided in &IPCB(skb)->opt.
      Let's break this assumption, but provide a helper to avoid changing all call sites.
      
      ip_send_unicast_reply() gets a new struct ip_options pointer.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      24a2d43d
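The refactoring pattern in this commit (a core routine that takes the source options explicitly, plus a thin helper that keeps old call sites unchanged) might look roughly like this. All `model_` names are hypothetical, and the plain copy stands in for the real option-echo logic, which is more involved.

```c
#include <assert.h>
#include <string.h>

/* Toy options block; the real struct ip_options has more fields. */
struct model_ip_options {
    unsigned char optlen;
    unsigned char data[40];
};

/* Toy skb whose control block holds the received options. */
struct model_skb {
    struct model_ip_options cb_opt; /* stands in for IPCB(skb)->opt */
};

/* Core routine: the caller says where the source options live. */
static int model_options_echo_from(struct model_ip_options *dopt,
                                   const struct model_skb *skb,
                                   const struct model_ip_options *sopt)
{
    (void)skb; /* unused in this model */
    assert(dopt != NULL && sopt != NULL);
    if (sopt->optlen > sizeof(dopt->data))
        return -1;
    dopt->optlen = sopt->optlen;
    memcpy(dopt->data, sopt->data, sopt->optlen); /* placeholder for echo logic */
    return 0;
}

/* Compatibility helper: existing callers keep the two-argument form. */
static inline int model_options_echo(struct model_ip_options *dopt,
                                     const struct model_skb *skb)
{
    return model_options_echo_from(dopt, skb, &skb->cb_opt);
}
```

A caller such as TCP that stores the options elsewhere would call the three-argument form directly, while everyone else continues to use the helper.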
  2. 27 Sep, 2014 29 commits
    • E
      net : optimize skb_release_data() · ff04a771
      Eric Dumazet committed
      Cache skb_shinfo(skb) in a variable to avoid computing it multiple
      times.
      
      Reorganize the tests to remove one indentation level.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff04a771
    • A
      sparc: bpf_jit: add support for BPF_LD(X) | BPF_LEN instructions · cec08315
      Alexei Starovoitov committed
      The BPF_LD | BPF_W | BPF_LEN instruction is occasionally used by tcpdump
      and is present in 11 tests in lib/test_bpf.c.
      Teach the sparc JIT compiler to emit it.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cec08315
    • T
      net: bcmgenet: Fix compile warning · 0a29b3da
      Tobias Klauser committed
      bcmgenet_wol_resume() is only used in bcmgenet_resume(), which is only
      defined when CONFIG_PM_SLEEP is enabled. This leads to the following
      compile warning when building with !CONFIG_PM_SLEEP:
      
      drivers/net/ethernet/broadcom/genet/bcmgenet.c:1967:12: warning: ‘bcmgenet_wol_resume’ defined but not used [-Wunused-function]
      
      Since bcmgenet_resume() is the only user of bcmgenet_wol_resume(), fix
      this by directly inlining the function there.
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a29b3da
    • W
      net/openvswitch: remove dup comment in vport.h · 8280bf00
      Wang Sheng-Hui committed
      Remove the duplicated comment
      "/* The following definitions are for users of the vport subsytem: */"
      in vport.h
      Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8280bf00
    • D
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · b1840060
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2014-09-23
      
      This patch series adds support for the FM10000 Ethernet switch host
      interface.  The Intel FM10000 Ethernet Switch is a 48-port Ethernet switch
      supporting both Ethernet ports and PCI Express host interfaces.  The fm10k
      driver provides support for the host interface portion of the switch, both
      PF and VF.
      
      As the host interfaces are directly connected to the switch this results in
      some significant differences versus a standard network driver.  For example
      there is no PHY or MII on the device.  Since packets are delivered directly
      from the switch to the host interface these are unnecessary.  Otherwise most
      of the functionality is very similar to our other network drivers such as
      ixgbe or igb.  For example we support all the standard network offloads,
      jumbo frames, SR-IOV (64 VFs), PTP, and some VXLAN and NVGRE offloads.
      
      v2: converted dev_consume_skb_any() to dev_kfree_skb_any()
          fix up PTP code based on feedback from the community
      v3: converted the use of smp_mb__before_clear_bit() to smp_mb__before_atomic()
          added vmalloc header to patch 15
          added prefetch header to patch 16
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b1840060
    • L
      net: optimise inet_proto_csum_replace4() · 58e3cac5
      LEROY Christophe committed
      csum_partial() is a generic function which is not optimised for small fixed
      length calculations, and using it requires storing the "from" and "to" values in
      memory while we already have them available in registers. This has a cost,
      especially on RISC processors. In the same spirit as the change done by
      Eric Dumazet on csum_replace2(), this patch rewrites inet_proto_csum_replace4()
      taking into account RFC1624.
      
      I spotted during a NATted TCP transfer that csum_partial() is one of the top 5
      CPU-consuming functions (around 8%), and the second user of csum_partial() is
      inet_proto_csum_replace4().
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58e3cac5
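The incremental update that RFC 1624 describes (equation 3: HC' = ~(~HC + ~m + m')) can be sketched for a 32-bit field as below. This is a userspace illustration of the arithmetic, not the kernel's code; the function names are invented for the example, and the checksum is treated as a host-order array of 16-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits with end-around carry. */
static uint16_t fold16(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Reference: full Internet checksum over n 16-bit words. */
static uint16_t checksum_full(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    assert(n <= 65536); /* keeps the 32-bit accumulator from overflowing */
    for (size_t i = 0; i < n; i++)
        sum += words[i];
    return (uint16_t)~fold16(sum);
}

/*
 * Incremental update when one 32-bit field changes from 'from' to 'to':
 * add the complement of the old halves and the new halves to ~check.
 * Everything stays in registers; no buffer is walked.
 */
static uint16_t checksum_replace32(uint16_t check, uint32_t from, uint32_t to)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~(from >> 16);
    sum += (uint16_t)~(from & 0xffff);
    sum += to >> 16;
    sum += to & 0xffff;
    return (uint16_t)~fold16(sum);
}
```

For a NAT-style address rewrite, the caller would pass the old and new addresses to `checksum_replace32()` instead of recomputing the checksum over the whole header.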
    • L
      net: optimise csum_replace4() · 4565af0d
      LEROY Christophe committed
      csum_partial() is a generic function which is not optimised for small fixed
      length calculations, and using it requires storing the "from" and "to" values in
      memory while we already have them available in registers. This has a cost,
      especially on RISC processors. In the same spirit as the change done by
      Eric Dumazet on csum_replace2(), this patch rewrites csum_replace4()
      taking into account RFC1624.
      
      I spotted during a NATted TCP transfer that csum_partial() is one of the top 5
      CPU-consuming functions (around 8%), and the second user of csum_partial() is
      inet_proto_csum_replace4().
      
      I have proposed the same modification to inet_proto_csum_replace4() in another
      patch.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4565af0d
    • D
      Merge branch 'fec' · 3290d655
      David S. Miller committed
      Fugang Duan says:
      
      ====================
      net: fec: Code cleanup
      
      This patch set does several things:
        - Fixing multiqueue issue.
        - Removing the unnecessary errata workaround.
        - Aligning the data buffer dma map/unmap size.
        - Freeing resource after probe failed.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3290d655
    • N
      net: fec: free resource after phy probe failed · e3c9614f
      Nimrod Andy committed
      Free memory and disable all related clocks when there is no phy
      connection or the phy probe failed.
      Signed-off-by: Fugang Duan <B38611@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3c9614f
    • N
      net: fec: align rx data buffer size for dma map/unmap · b64bf4b7
      Nimrod Andy committed
      Align the allocated rx data buffer size for dma map/unmap, otherwise
      the kernel prints a warning when DMA_API_DEBUG is enabled.
      Signed-off-by: Fugang Duan <B38611@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b64bf4b7
    • N
      net: fec: remove the ERR006358 workaround for imx6sx enet · f88c7ede
      Nimrod Andy committed
      Remove the ERR006358 workaround for imx6sx enet since the hw issue
      was fixed on the SOC.
      Signed-off-by: Fugang Duan <B38611@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f88c7ede
    • N
      net: fec: Add Ftype to BD to distinguish three tx queues for AVB · befe8213
      Nimrod Andy committed
      The current driver does not initialize the Ftype field of the BD, which
      causes tx queues #1 and #2 to not work properly.
      
      Add the Ftype field to the BD to distinguish the three queues for AVB:
      0 -> Best Effort
      1 -> ClassA
      2 -> ClassB
      Signed-off-by: Fugang Duan <B38611@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      befe8213
    • E
      net: introduce __skb_header_release() · f4a775d1
      Eric Dumazet committed
      While profiling TCP stack, I noticed one useless atomic operation
      in tcp_sendmsg(), caused by skb_header_release().
      
      It turns out all current skb_header_release() users have a fresh skb,
      that no other user can see, so we can avoid one atomic operation.
      
      Introduce __skb_header_release() to clearly document this.
      
      This gave me a 1.5 % improvement on TCP_RR workload.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f4a775d1
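The saving described here, replacing an atomic read-modify-write with a plain store when no one else can see the object yet, can be illustrated with C11 atomics. This is a userspace model under assumptions: the `model_` names are invented, and the reference-count encoding (a second count kept in the upper 16 bits) is assumed for illustration rather than taken from the message above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_DATAREF_SHIFT 16 /* assumed: upper half counts payload-only references */

struct model_skb {
    bool nohdr;
    atomic_int dataref;
};

/* A freshly allocated skb holds exactly one reference. */
static void model_skb_init(struct model_skb *skb)
{
    skb->nohdr = false;
    atomic_init(&skb->dataref, 1);
}

/* General case: other holders may exist, so use an atomic read-modify-write. */
static void model_skb_header_release(struct model_skb *skb)
{
    skb->nohdr = true;
    atomic_fetch_add(&skb->dataref, 1 << MODEL_DATAREF_SHIFT);
}

/*
 * Fresh skb: nobody else can observe it, so the final value is known in
 * advance and a relaxed store is enough.
 */
static void model_skb_header_release_fresh(struct model_skb *skb)
{
    assert(atomic_load(&skb->dataref) == 1); /* precondition: not shared yet */
    skb->nohdr = true;
    atomic_store_explicit(&skb->dataref, 1 + (1 << MODEL_DATAREF_SHIFT),
                          memory_order_relaxed);
}
```

Both variants leave the counter in the same state; the second is only valid when the caller can guarantee exclusive ownership, which is why a separately named function documents the requirement.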
    • F
      fec: Remove fec_enet_select_queue() · aebac744
      Fabio Estevam committed
      Sparse complains about fec_enet_select_queue() not being static.
      
      Feedback from David Miller [1] was to remove this function instead of making it
      static:
      
      "Please just delete this function.
      
      It's overriding code which does exactly the same thing.
      
      Actually, more precisely, this code is duplicating code in a way that
      bypasses many core facilities of the networking.  For example, this
      override means that socket based flow steering, XPS, etc. are all
      not happening on these devices.
      
      Without ->ndo_select_queue(), the flow dissector does __netdev_pick_tx
      which is exactly what you want to happen."
      
      [1] http://www.spinics.net/lists/netdev/msg297653.html
      Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aebac744
    • D
      Merge tag 'master-2014-09-16' of... · 57219dc7
      David S. Miller committed
      Merge tag 'master-2014-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
      
      John W. Linville says:
      
      ====================
      pull request: wireless-next 2014-09-22
      
      Please pull this batch of updates intended for the 3.18 stream...
      
      For the mac80211 bits, Johannes says:
      
      "This time, I have some rate minstrel improvements, support for a very
      small feature from CCX that Steinar reverse-engineered, dynamic ACK
      timeout support, a number of changes for TDLS, early support for radio
      resource measurement and many fixes. Also, I'm changing a number of
      places to clear key memory when it's freed and Intel claims copyright
      for code they developed."
      
      For the bluetooth bits, Johan says:
      
      "Here are some more patches intended for 3.18. Most of them are cleanups
      or fixes for SMP. The only exception is a fix for BR/EDR L2CAP fixed
      channels which should now work better together with the L2CAP
      information request procedure."
      
      For the iwlwifi bits, Emmanuel says:
      
      "I fix here dvm which was broken by my last pull request. Arik
      continues to work on TDLS and Luca solved a few issues in CT-Kill. Eyal
      keeps digging into rate scaling code, more to come soon. Besides this,
      nothing really special here."
      
      Beyond that, there are the usual big batches of updates to ath9k, b43,
      mwifiex, and wil6210 as well as a handful of other bits here and there.
      Also, rtlwifi gets some btcoexist attention from Larry.
      
      Please let me know if there are problems!
      ====================
      
      Had to adjust the wil6210 code to comply with Joe Perches's recent
      change in net-next to make the netdev_*() routines return void instead
      of 'int'.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      57219dc7
    • J
      net: Change netdev_<level> logging functions to return void · 6ea754eb
      Joe Perches committed
      No caller or macro uses the return value so make all
      the functions return void.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ea754eb
    • J
      mellanox: Change en_print to return void · 0c87b29c
      Joe Perches committed
      No caller or macro uses the return value so make it void.
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-By: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c87b29c
    • D
      Merge branch 'bpf-next' · b4fc1a46
      David S. Miller committed
      Alexei Starovoitov says:
      
      ====================
      eBPF syscall, verifier, testsuite
      
      v14 -> v15:
      - got rid of macros with hidden control flow (suggested by David)
        replaced macro with explicit goto or return and simplified
        where possible (affected patches #9 and #10)
      - rebased, retested
      
      v13 -> v14:
      - small change to 1st patch to ease 'new userspace with old kernel'
        problem (done similar to perf_copy_attr()) (suggested by Daniel)
      - the rest unchanged
      
      v12 -> v13:
      - replaced 'foo __user *' pointers with __aligned_u64 (suggested by David)
      - added __attribute__((aligned(8))) to 'union bpf_attr' to keep
        constant alignment between patches
      - updated manpage and syscall wrappers due to __aligned_u64
      - rebased, retested on x64 with 32-bit and 64-bit userspace and on i386,
        build tested on arm32,sparc64
      
      v11 -> v12:
      - dropped patch 11 and copied few macros to libbpf.h (suggested by Daniel)
      - replaced 'enum bpf_prog_type' with u32 to be safe in compat (.. Andy)
      - implemented and tested compat support (not part of this set) (.. Daniel)
      - changed 'void *log_buf' to 'char *' (.. Daniel)
      - combined struct bpf_work_struct and bpf_prog_info (.. Daniel)
      - added better return value explanation to manpage (.. Andy)
      - added log_buf/log_size explanation to manpage (.. Andy & Daniel)
      - added a lot more info about prog_type and map_type to manpage (.. Andy)
      - rebased, tweaked test_stubs
      
      Patches 1-4 establish BPF syscall shell for maps and programs.
      Patches 5-10 add verifier step by step
      Patch 11 adds test stubs for 'unspec' program type and verifier testsuite
        from user space
      
      Note that patches 1, 3, 4 and 7 add commands and attributes to the syscall
      while staying backwards compatible with each other, which should demonstrate
      how other commands can be added in the future.
      
      After this set the programs can be loaded for testing only. They cannot
      be attached to any events. Though manpage talks about tracing and sockets,
      it will be a subject of future patches.
      
      Please take a look at manpage:
      
      BPF(2)                     Linux Programmer's Manual                    BPF(2)
      
      NAME
             bpf - perform a command on eBPF map or program
      
      SYNOPSIS
             #include <linux/bpf.h>
      
             int bpf(int cmd, union bpf_attr *attr, unsigned int size);
      
      DESCRIPTION
             bpf()  syscall  is a multiplexor for a range of different operations on
             eBPF  which  can  be  characterized  as  "universal  in-kernel  virtual
             machine".  eBPF  is  similar  to  original  Berkeley  Packet Filter (or
             "classic BPF") used to filter network packets. Both statically  analyze
             the  programs  before  loading  them  into  the  kernel  to ensure that
             programs cannot harm the running system.
      
             eBPF extends classic BPF in multiple ways including ability to call in-
             kernel  helper  functions  and  access shared data structures like eBPF
             maps.  The programs can be written in a restricted C that  is  compiled
             into  eBPF  bytecode  and executed on the eBPF virtual machine or JITed
             into native instruction set.
      
         eBPF Design/Architecture
             eBPF maps is a generic storage of different types.   User  process  can
             create  multiple  maps  (with key/value being opaque bytes of data) and
             access them via file descriptor. In parallel eBPF programs  can  access
             maps  from inside the kernel.  It's up to user process and eBPF program
             to decide what they store inside maps.
      
             eBPF programs are similar to kernel modules. They  are  loaded  by  the
             user  process  and automatically unloaded when process exits. Each eBPF
             program is a safe run-to-completion set of instructions. eBPF  verifier
             statically  determines  that  the  program  terminates  and  is safe to
             execute. During verification the program takes a hold of maps  that  it
             intends to use, so selected maps cannot be removed until the program is
             unloaded. The program can be attached to different events. These events
             can  be packets, tracepoint events and other types in the future. A new
             event triggers execution of the program  which  may  store  information
             about the event in the maps.  Beyond storing data the programs may call
             into in-kernel helper functions which may, for example, dump stack,  do
             trace_printk  or other forms of live kernel debugging. The same program
             can be attached to multiple events. Different programs can  access  the
             same map:
               tracepoint  tracepoint  tracepoint    sk_buff    sk_buff
                event A     event B     event C      on eth0    on eth1
                 |             |          |            |          |
                 |             |          |            |          |
                 --> tracing <--      tracing       socket      socket
                      prog_1           prog_2       prog_3      prog_4
                      |  |               |            |
                   |---  -----|  |-------|           map_3
                 map_1       map_2
      
         Syscall Arguments
             bpf()  syscall  operation  is determined by cmd which can be one of the
             following:
      
             BPF_MAP_CREATE
                    Create a map with given type and attributes and return map FD
      
             BPF_MAP_LOOKUP_ELEM
                    Lookup element by key in a given map and return its value
      
             BPF_MAP_UPDATE_ELEM
                    Create or update element (key/value pair) in a given map
      
             BPF_MAP_DELETE_ELEM
                    Lookup and delete element by key in a given map
      
             BPF_MAP_GET_NEXT_KEY
                    Lookup element by key in a given map  and  return  key  of  next
                    element
      
             BPF_PROG_LOAD
                    Verify and load eBPF program
      
             attr   is a pointer to a union of type bpf_attr as defined below.
      
             size   is the size of the union.
      
             union bpf_attr {
                 struct { /* anonymous struct used by BPF_MAP_CREATE command */
                     __u32             map_type;
                     __u32             key_size;    /* size of key in bytes */
                     __u32             value_size;  /* size of value in bytes */
                     __u32             max_entries; /* max number of entries in a map */
                 };
      
                 struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
                     __u32             map_fd;
                     __aligned_u64     key;
                     union {
                         __aligned_u64 value;
                         __aligned_u64 next_key;
                     };
                 };
      
                 struct { /* anonymous struct used by BPF_PROG_LOAD command */
                     __u32         prog_type;
                     __u32         insn_cnt;
                     __aligned_u64 insns;     /* 'const struct bpf_insn *' */
                     __aligned_u64 license;   /* 'const char *' */
                     __u32         log_level; /* verbosity level of eBPF verifier */
                     __u32         log_size;  /* size of user buffer */
                     __aligned_u64 log_buf;   /* user supplied 'char *' buffer */
                 };
             } __attribute__((aligned(8)));
      
         eBPF maps
             maps  is  a generic storage of different types for sharing data between
             kernel and userspace.
      
             Any map type has the following attributes:
               . type
               . max number of elements
               . key size in bytes
               . value size in bytes
      
             The following wrapper functions demonstrate how  this  syscall  can  be
             used  to  access the maps. The functions use the cmd argument to invoke
             different operations.
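The wrapper functions in this man page rely on two helpers that it does not define, `bpf()` and `ptr_to_u64()`. A minimal sketch of what they might look like follows, assuming a Linux system whose headers provide `__NR_bpf` and `<linux/bpf.h>`; there is no libc wrapper assumed here, so the call goes through `syscall()`.

```c
#include <assert.h>
#include <linux/bpf.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The attr fields are 64 bits wide so one layout serves 32- and 64-bit userspace. */
static_assert(sizeof(uint64_t) >= sizeof(void *), "pointers must fit in 64 bits");

/* Widen a user pointer to the fixed-size field used in union bpf_attr. */
static inline uint64_t ptr_to_u64(const void *ptr)
{
    return (uint64_t)(uintptr_t)ptr;
}

/* Thin wrapper around the raw system call; returns -1 and sets errno on error. */
static inline int bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int)syscall(__NR_bpf, cmd, attr, size);
}
```

Going through `uintptr_t` first avoids sign extension surprises when a 32-bit pointer is widened.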
      
             BPF_MAP_CREATE
                    int bpf_create_map(enum bpf_map_type map_type, int key_size,
                                       int value_size, int max_entries)
                    {
                        union bpf_attr attr = {
                            .map_type = map_type,
                            .key_size = key_size,
                            .value_size = value_size,
                            .max_entries = max_entries
                        };
      
                        return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
                    }
                    bpf()  syscall  creates  a  map  of  map_type  type  and   given
                    attributes  key_size,  value_size,  max_entries.   On success it
                    returns process-local file descriptor. On error, -1 is  returned
                    and errno is set to EINVAL or EPERM or ENOMEM.
      
                    The  attributes key_size and value_size will be used by verifier
                    during  program  loading  to  check  that  program  is   calling
                    bpf_map_*_elem() helper functions with correctly initialized key
                    and  that  program  doesn't  access  map  element  value  beyond
                    specified  value_size.   For  example,  when map is created with
                    key_size = 8 and program does:
                    bpf_map_lookup_elem(map_fd, fp - 4)
                    such program will be rejected, since in-kernel  helper  function
                    bpf_map_lookup_elem(map_fd,  void  *key) expects to read 8 bytes
                    from 'key' pointer, but 'fp - 4' starting address will cause out
                    of bounds stack access.
      
                    Similarly,  when  map is created with value_size = 1 and program
                    does:
                    value = bpf_map_lookup_elem(...);
                    *(u32 *)value = 1;
                    such program will be rejected, since it accesses  value  pointer
                    beyond specified 1 byte value_size limit.
      
                    Currently only hash table map_type is supported:
                    enum bpf_map_type {
                       BPF_MAP_TYPE_UNSPEC,
                       BPF_MAP_TYPE_HASH,
                    };
                    map_type  selects  one  of  the available map implementations in
                    kernel. For all map_types eBPF programs  access  maps  with  the
                    same      bpf_map_lookup_elem()/bpf_map_update_elem()     helper
                    functions.
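The two rejection examples above come down to simple range checks. The sketch below models them in userspace; it is not the kernel verifier, the function names are invented, and the 512-byte stack limit is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_STACK_SIZE 512 /* assumed size of the program stack in bytes */

/*
 * A helper reading 'size' bytes at frame-pointer offset 'off' must stay
 * inside the stack, which occupies [fp - MODEL_STACK_SIZE, fp).
 */
static bool model_stack_access_ok(int off, int size)
{
    return size > 0 && off < 0 && off >= -MODEL_STACK_SIZE && off + size <= 0;
}

/* An access of 'size' bytes at 'off' into a map value must fit in value_size. */
static bool model_value_access_ok(int off, int size, int value_size)
{
    return size > 0 && off >= 0 && off + size <= value_size;
}
```

With `key_size = 8`, `model_stack_access_ok(-4, 8)` is false because the read would run 4 bytes past the frame pointer, while `model_stack_access_ok(-8, 8)` passes; likewise a 4-byte store into a 1-byte value fails `model_value_access_ok(0, 4, 1)`.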
      
             BPF_MAP_LOOKUP_ELEM
                    int bpf_lookup_elem(int fd, void *key, void *value)
                    {
                        union bpf_attr attr = {
                            .map_fd = fd,
                            .key = ptr_to_u64(key),
                            .value = ptr_to_u64(value),
                        };
      
                        return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
                    }
                    bpf() syscall looks up an element with given key in  a  map  fd.
                    If  element  is found it returns zero and stores element's value
                    into value.  If element is not found  it  returns  -1  and  sets
                    errno to ENOENT.
      
             BPF_MAP_UPDATE_ELEM
                    int bpf_update_elem(int fd, void *key, void *value)
                    {
                        union bpf_attr attr = {
                            .map_fd = fd,
                            .key = ptr_to_u64(key),
                            .value = ptr_to_u64(value),
                        };
      
                        return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
                    }
                    The  call  creates  or updates element with given key/value in a
                    map fd.  On success it returns zero.  On error, -1  is  returned
                    and  errno  is set to EINVAL or EPERM or ENOMEM or E2BIG.  E2BIG
                    indicates that number of elements in the map reached max_entries
                    limit specified at map creation time.
      
             BPF_MAP_DELETE_ELEM
                    int bpf_delete_elem(int fd, void *key)
                    {
                        union bpf_attr attr = {
                            .map_fd = fd,
                            .key = ptr_to_u64(key),
                        };
      
                        return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
                    }
                    The call deletes an element in a map fd with given key.  Returns
                    zero on success. If element is not found it returns -1 and  sets
                    errno to ENOENT.
      
             BPF_MAP_GET_NEXT_KEY
                    int bpf_get_next_key(int fd, void *key, void *next_key)
                    {
                        union bpf_attr attr = {
                            .map_fd = fd,
                            .key = ptr_to_u64(key),
                            .next_key = ptr_to_u64(next_key),
                        };
      
                        return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
                    }
                    The  call  looks  up  an  element  by  key in a given map fd and
                    returns key of the next element into next_key pointer. If key is
                     not  found, it returns zero and returns key of the first element
                    into next_key. If key is the last element,  it  returns  -1  and
                    sets  errno  to  ENOENT. Other possible errno values are ENOMEM,
                    EFAULT, EPERM, EINVAL.  This method can be used to iterate  over
                    all elements of the map.
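The iteration pattern described for BPF_MAP_GET_NEXT_KEY can be sketched as a loop that starts from a key assumed to be absent and stops on ENOENT. So that the loop can be exercised without privileges, the example walks a toy in-memory stand-in that mimics the documented contract; `toy_get_next_key()` and `walk_keys()` are hypothetical names, and real code would pass a wrapper around the syscall instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a kernel map: a fixed set of 32-bit keys. */
static const uint32_t toy_keys[] = { 10, 20, 30 };
#define TOY_COUNT (sizeof(toy_keys) / sizeof(toy_keys[0]))

/* Mimics the documented contract of BPF_MAP_GET_NEXT_KEY. */
static int toy_get_next_key(int fd, void *key, void *next_key)
{
    uint32_t k;
    size_t i;

    (void)fd;
    memcpy(&k, key, sizeof(k));
    for (i = 0; i < TOY_COUNT; i++)
        if (toy_keys[i] == k)
            break;
    if (i == TOY_COUNT) {     /* unknown key: hand back the first key */
        memcpy(next_key, &toy_keys[0], sizeof(k));
        return 0;
    }
    if (i + 1 == TOY_COUNT) { /* last key: end of iteration */
        errno = ENOENT;
        return -1;
    }
    memcpy(next_key, &toy_keys[i + 1], sizeof(k));
    return 0;
}

typedef int (*get_next_key_fn)(int fd, void *key, void *next_key);

/*
 * Visit every key once, starting from a key the caller believes is absent.
 * Returns the number of keys visited, or -1 on an error other than ENOENT.
 */
static int walk_keys(int fd, get_next_key_fn get_next, uint32_t absent_key,
                     uint64_t *key_sum)
{
    uint32_t key = absent_key, next;
    int visited = 0;

    assert(key_sum != NULL);
    *key_sum = 0;
    while (get_next(fd, &key, &next) == 0) {
        *key_sum += next;
        visited++;
        key = next;
    }
    return errno == ENOENT ? visited : -1;
}
```

If the starting key happens to exist in the map, the walk begins after it and skips earlier elements, so the choice of an absent starting key matters.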
      
             close(map_fd)
                    will  delete  the  map  map_fd.  Exiting process will delete all
                    maps automatically.
      
         eBPF programs
             BPF_PROG_LOAD
                    This cmd is used to load eBPF program into the kernel.
      
                    char bpf_log_buf[LOG_BUF_SIZE];
      
                    int bpf_prog_load(enum bpf_prog_type prog_type,
                                      const struct bpf_insn *insns, int insn_cnt,
                                      const char *license)
                    {
                        union bpf_attr attr = {
                            .prog_type = prog_type,
                            .insns = ptr_to_u64(insns),
                            .insn_cnt = insn_cnt,
                            .license = ptr_to_u64(license),
                            .log_buf = ptr_to_u64(bpf_log_buf),
                            .log_size = LOG_BUF_SIZE,
                            .log_level = 1,
                        };
      
                        return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
                    }
                    prog_type is one of the available program types:
                    enum bpf_prog_type {
                            BPF_PROG_TYPE_UNSPEC,
                            BPF_PROG_TYPE_SOCKET,
                            BPF_PROG_TYPE_TRACING,
                    };
                    By picking prog_type, the program author selects the set of
                    helper functions callable from the eBPF program and the
                    corresponding format of struct bpf_context (the data blob
                    passed into the program as the first argument). For example,
                    programs loaded with prog_type = TYPE_TRACING may call the
                    bpf_printk() helper, whereas TYPE_SOCKET programs may not. The
                    set of functions available to programs of a given type may
                    grow in the future.
      
                    Currently the set of functions for TYPE_TRACING is:
                    bpf_map_lookup_elem(map_fd, void *key)              // lookup key in a map_fd
                    bpf_map_update_elem(map_fd, void *key, void *value) // update key/value
                    bpf_map_delete_elem(map_fd, void *key)              // delete key in a map_fd
                    bpf_ktime_get_ns(void)                              // returns current ktime
                    bpf_printk(char *fmt, int fmt_size, ...)            // prints into trace buffer
                    bpf_memcmp(void *ptr1, void *ptr2, int size)        // non-faulting memcmp
                    bpf_fetch_ptr(void *ptr)    // non-faulting load pointer from any address
                    bpf_fetch_u8(void *ptr)     // non-faulting 1 byte load
                    bpf_fetch_u16(void *ptr)    // other non-faulting loads
                    bpf_fetch_u32(void *ptr)
                    bpf_fetch_u64(void *ptr)
      
                    and bpf_context is defined as:
                    struct bpf_context {
                        /* argN fields match one to one to arguments passed to trace events */
                        u64 arg1, arg2, arg3, arg4, arg5, arg6;
                        /* return value from kretprobe event or from syscall_exit event */
                        u64 ret;
                    };
      
                    The set of helper functions for TYPE_SOCKET is TBD.
      
                    More program types may be added in the future, such as
                    BPF_PROG_TYPE_USER_TRACING for unprivileged programs.
      
                    BPF_PROG_TYPE_UNSPEC is used for  testing  only.  Such  programs
                    cannot be attached to events.
      
                    insns array of "struct bpf_insn" instructions
      
                    insn_cnt number of instructions in the program
      
                    license  license  string,  which  must be GPL compatible to call
                    helper functions marked gpl_only
      
                    log_buf user supplied buffer that the in-kernel verifier uses to
                    store the verification log. The log is a multi-line string that
                    the program author can use to understand how the verifier came
                    to the conclusion that the program is unsafe. The format of the
                    output can change at any time as the verifier evolves.
      
                    log_size size of user buffer. If size of the buffer is not large
                    enough  to store all verifier messages, -1 is returned and errno
                    is set to ENOSPC.
      
                    log_level verbosity level of eBPF verifier, where zero means  no
                    logs provided
      
             close(prog_fd)
                    will unload eBPF program
      
             Maps are accessible from both eBPF programs and user space and
             generally tie the two together. Programs process various events
             (like tracepoint, kprobe, packets) and store the data into maps.
             User space fetches data from maps. Either the same or a different
             map may be used by user space as configuration space to alter
             program behavior on the fly.
      
         Events
             Once an eBPF program is loaded, it can be attached to an event. Various
             kernel subsystems have different ways to do so. For example:
      
             setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));
             will attach the program prog_fd to the socket sock, which was
             returned by a prior call to socket().
      
             ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
             will attach the program prog_fd to the perf event event_fd, which
             was returned by a prior call to perf_event_open().
      
             Another way to attach the program to a tracing event is:
             event_fd = open("/sys/kernel/debug/tracing/events/skb/kfree_skb/filter", O_WRONLY);
             write(event_fd, "bpf-123", 7); /* where 123 is eBPF program FD */
             /* here program is attached and will be triggered by events */
             close(event_fd); /* to detach from event */
      
      EXAMPLES
             /* eBPF+sockets example:
              * 1. create map with maximum of 2 elements
              * 2. set map[6] = 0 and map[17] = 0
              * 3. load eBPF program that counts number of TCP and UDP packets received
              *    via map[skb->ip->proto]++
              * 4. attach prog_fd to raw socket via setsockopt()
              * 5. print number of received TCP/UDP packets every second
              */
             int main(int ac, char **av)
             {
                 int sock, map_fd, prog_fd, key;
                 long long value = 0, tcp_cnt, udp_cnt;
      
                 map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 2);
                 if (map_fd < 0) {
                     printf("failed to create map '%s'\n", strerror(errno));
                     /* likely not run as root */
                     return 1;
                 }
      
                 key = 6; /* ip->proto == tcp */
                 assert(bpf_update_elem(map_fd, &key, &value) == 0);
      
                 key = 17; /* ip->proto == udp */
                 assert(bpf_update_elem(map_fd, &key, &value) == 0);
      
                 struct bpf_insn prog[] = {
                     BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),          /* r6 = r1 */
                     BPF_LD_ABS(BPF_B, 14 + 9),                    /* r0 = ip->proto */
                     BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),/* *(u32 *)(fp - 4) = r0 */
                     BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),         /* r2 = fp */
                     BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),        /* r2 = r2 - 4 */
                     BPF_LD_MAP_FD(BPF_REG_1, map_fd),             /* r1 = map_fd */
                     BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),      /* r0 = map_lookup(r1, r2) */
                     BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),        /* if (r0 == 0) goto pc+2 */
                     BPF_MOV64_IMM(BPF_REG_1, 1),                  /* r1 = 1 */
                     BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* lock *(u64 *)r0 += r1 */
                     BPF_MOV64_IMM(BPF_REG_0, 0),                  /* r0 = 0 */
                     BPF_EXIT_INSN(),                              /* return r0 */
                 };
                 /* insn_cnt is a number of instructions, not a size in bytes */
                 prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET, prog,
                                         sizeof(prog) / sizeof(prog[0]), "GPL");
                 assert(prog_fd >= 0);
      
                 sock = open_raw_sock("lo");
      
                 assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
                                   sizeof(prog_fd)) == 0);
      
                 for (;;) {
                     key = 6;
                     assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
                     key = 17;
                     assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
                     printf("TCP %lld UDP %lld packets\n", tcp_cnt, udp_cnt);
                     sleep(1);
                 }
      
                 return 0;
             }
      
      RETURN VALUE
             For a successful call, the return value depends on the operation:
      
             BPF_MAP_CREATE
                    The new file descriptor associated with eBPF map.
      
             BPF_PROG_LOAD
                    The new file descriptor associated with eBPF program.
      
             All other commands
                    Zero.
      
             On error, -1 is returned, and errno is set appropriately.
      
      ERRORS
             EPERM  bpf() syscall was made without sufficient privilege (without the
                    CAP_SYS_ADMIN capability).
      
             ENOMEM Cannot allocate sufficient memory.
      
             EBADF  fd is not an open file descriptor.

             EFAULT One of the pointers (key, value, log_buf or insns) is
                    outside the accessible address space.
      
             EINVAL The value specified in cmd is not recognized by this kernel.
      
             EINVAL For BPF_MAP_CREATE, either map_type or attributes are invalid.
      
             EINVAL For BPF_MAP_*_ELEM  commands,  some  of  the  fields  of  "union
                    bpf_attr" unused by this command are not set to zero.
      
             EINVAL For BPF_PROG_LOAD, attempt to load invalid program (unrecognized
                    instruction or uses reserved fields or jumps  out  of  range  or
                    loop detected or calls unknown function).
      
             EACCES For BPF_PROG_LOAD, though program has valid instructions, it was
                    rejected, since it was  deemed  unsafe  (may  access  disallowed
                    memory   region  or  uninitialized  stack/register  or  function
                    constraints don't match actual types or misaligned  access).  In
                    such case it is recommended to call bpf() again with log_level =
                    1 and examine log_buf for specific reason provided by verifier.
      
             ENOENT For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM,  indicates  that
                    element with given key was not found.
      
             E2BIG  The program is too large, or a map has reached its
                    max_entries limit (maximum number of elements).
      
      NOTES
             These commands may be used only by a privileged process (one having the
             CAP_SYS_ADMIN capability).
      
      SEE ALSO
             The eBPF architecture and instruction set are explained in
             Documentation/networking/filter.txt
      
      Linux                             2014-09-16                            BPF(2)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b4fc1a46
    • A
      bpf: mini eBPF library, test stubs and verifier testsuite · 3c731eba
      Alexei Starovoitov committed
      1.
      the library includes a trivial set of BPF syscall wrappers:
      int bpf_create_map(int key_size, int value_size, int max_entries);
      int bpf_update_elem(int fd, void *key, void *value);
      int bpf_lookup_elem(int fd, void *key, void *value);
      int bpf_delete_elem(int fd, void *key);
      int bpf_get_next_key(int fd, void *key, void *next_key);
      int bpf_prog_load(enum bpf_prog_type prog_type,
      		  const struct sock_filter_int *insns, int insn_len,
      		  const char *license);
      bpf_prog_load() stores verifier log into global bpf_log_buf[] array
      
      and BPF_*() macros to build instructions
      
      2.
      test stubs configure eBPF infra with 'unspec' map and program types.
      These are fake types used by user space testsuite only.
      
      3.
      verifier tests valid and invalid programs and expects predefined
      error log messages from kernel.
      40 tests so far.
      
      $ sudo ./test_verifier
       #0 add+sub+mul OK
       #1 unreachable OK
       #2 unreachable2 OK
       #3 out of range jump OK
       #4 out of range jump2 OK
       #5 test1 ld_imm64 OK
       ...
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3c731eba
    • A
      bpf: verifier (add verifier core) · 17a52670
      Alexei Starovoitov committed
      This patch adds verifier core which simulates execution of every insn and
      records the state of registers and program stack. Every branch instruction seen
      during simulation is pushed into state stack. When verifier reaches BPF_EXIT,
      it pops the state from the stack and continues until it reaches BPF_EXIT again.
      For program:
      1: bpf_mov r1, xxx
      2: if (r1 == 0) goto 5
      3: bpf_mov r0, 1
      4: goto 6
      5: bpf_mov r0, 2
      6: bpf_exit
      The verifier will walk insns: 1, 2, 3, 4, 6
      then it will pop the state recorded at insn#2 and will continue: 5, 6
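
The walk order above can be reproduced with a small user-space sketch. It is not the verifier: it tracks only the program counter, whereas the real code also records register and stack state at each branch. The names toy_insn, toy_walk and toy_prog are hypothetical, chosen for this illustration.

```c
#include <assert.h>

enum toy_op { TOY_MOV, TOY_JEQ, TOY_GOTO, TOY_EXIT };

struct toy_insn {
    enum toy_op op;
    int target;              /* 1-based jump target for TOY_JEQ/TOY_GOTO */
};

#define TOY_MAX 16

/* Record the order in which instructions are visited. Returns the
 * number of visits stored in 'order', or -1 if a toy limit is hit.
 */
static int toy_walk(const struct toy_insn *prog, int len, int *order, int max)
{
    int stack[TOY_MAX];      /* pending branch targets (the "state stack") */
    int sp = 0, n = 0, pc = 1;

    assert(prog && order && len > 0);
    for (;;) {
        if (pc < 1 || pc > len || n == max)
            return -1;
        order[n++] = pc;
        switch (prog[pc - 1].op) {
        case TOY_JEQ:        /* push the taken branch, fall through first */
            if (sp == TOY_MAX)
                return -1;
            stack[sp++] = prog[pc - 1].target;
            pc++;
            break;
        case TOY_GOTO:
            pc = prog[pc - 1].target;
            break;
        case TOY_EXIT:       /* path done: resume a saved branch, if any */
            if (sp == 0)
                return n;
            pc = stack[--sp];
            break;
        default:
            pc++;
            break;
        }
    }
}

/* The six-instruction program from the text above. */
static const struct toy_insn toy_prog[] = {
    { TOY_MOV, 0 }, { TOY_JEQ, 5 }, { TOY_MOV, 0 },
    { TOY_GOTO, 6 }, { TOY_MOV, 0 }, { TOY_EXIT, 0 },
};
```

Calling toy_walk(toy_prog, 6, order, 16) fills order with 1, 2, 3, 4, 6, 5, 6. The 'max' bound is what stops the sketch on a looping program; it does not detect loops by itself.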
      
      This way it walks all possible paths through the program and checks all
      possible values of registers. While doing so, it checks for:
      - invalid instructions
      - uninitialized register access
      - uninitialized stack access
      - misaligned stack access
      - out of range stack access
      - invalid calling convention
      - instruction encoding is not using reserved fields
      
      Kernel subsystem configures the verifier with two callbacks:
      
      - bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
        that provides information to the verifier which fields of 'ctx'
        are accessible (remember 'ctx' is the first argument to eBPF program)
      
      - const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
        returns argument constraints of kernel helper functions that eBPF program
        may call, so that verifier can check that R1-R5 types match the prototype
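
As an illustration of the first callback, a subsystem could express its policy roughly as below. This is a sketch under assumptions: the context layout mirrors struct bpf_context from the manpage in this series, the read-only 64-bit policy is invented for the example, and the toy_ names are hypothetical stand-ins for the kernel types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_access_type { TOY_READ = 1, TOY_WRITE = 2 };

/* Hypothetical context layout: seven 64-bit fields. */
struct toy_ctx {
    uint64_t arg1, arg2, arg3, arg4, arg5, arg6;
    uint64_t ret;
};

/* Example policy: the program may only read whole, naturally aligned
 * 64-bit fields that lie entirely inside the context.
 */
static bool toy_is_valid_access(int off, int size, enum toy_access_type type)
{
    assert(size > 0);
    if (type != TOY_READ)
        return false;
    if (size != (int)sizeof(uint64_t) || off % size != 0)
        return false;
    if (off < 0 || off + size > (int)sizeof(struct toy_ctx))
        return false;
    return true;
}
```

Under this policy a load of 'ret' at offset 48 is accepted, while a 4-byte load, a store, or a load at offset 56 is rejected.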
      
      More details in Documentation/networking/filter.txt and in kernel/bpf/verifier.c
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      17a52670
    • A
      bpf: verifier (add branch/goto checks) · 475fb78f
      Alexei Starovoitov committed
      check that control flow graph of eBPF program is a directed acyclic graph
      
      check_cfg() does:
      - detect loops
      - detect unreachable instructions
      - check that program terminates with BPF_EXIT insn
      - check that all branches are within program boundary
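
The four checks can be sketched with a depth-first search over a toy instruction array. This is an illustration only: the recursive form, the instruction encoding and the toy_ names are assumptions made for brevity and do not reproduce check_cfg() itself.

```c
#include <assert.h>
#include <string.h>

enum toy_op { TOY_ALU, TOY_JCOND, TOY_JA, TOY_EXIT };

struct toy_insn {
    enum toy_op op;
    int off;                 /* jump offset relative to the next insn */
};

enum { UNSEEN = 0, ON_PATH, DONE };
#define TOY_MAX 64

/* Depth-first visit of instruction t. Returns 0 if every path from t
 * stays in range, is loop-free and ends in TOY_EXIT.
 */
static int toy_visit(const struct toy_insn *p, int len, int t, char *st)
{
    int next[2], n = 0, i;

    if (t < 0 || t >= len)
        return -1;           /* jump or fall-through out of range */
    if (st[t] == ON_PATH)
        return -1;           /* back edge: loop detected */
    if (st[t] == DONE)
        return 0;
    st[t] = ON_PATH;
    switch (p[t].op) {
    case TOY_EXIT:
        break;
    case TOY_JA:
        next[n++] = t + 1 + p[t].off;
        break;
    case TOY_JCOND:
        next[n++] = t + 1;
        next[n++] = t + 1 + p[t].off;
        break;
    default:
        next[n++] = t + 1;
        break;
    }
    for (i = 0; i < n; i++)
        if (toy_visit(p, len, next[i], st))
            return -1;
    st[t] = DONE;
    return 0;
}

static int toy_check_cfg(const struct toy_insn *p, int len)
{
    char st[TOY_MAX];
    int i;

    assert(p && len > 0 && len <= TOY_MAX);
    memset(st, UNSEEN, sizeof(st));
    if (toy_visit(p, len, 0, st))
        return -1;
    for (i = 0; i < len; i++)
        if (st[i] != DONE)
            return -1;       /* unreachable instruction */
    return 0;
}
```

A program that falls off the end without an exit is caught by the range check, because the fall-through successor of the last instruction lies outside the array.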
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      475fb78f
    • A
      bpf: handle pseudo BPF_LD_IMM64 insn · 0246e64d
      Alexei Starovoitov committed
      eBPF programs passed from userspace are using pseudo BPF_LD_IMM64 instructions
      to refer to process-local map_fd. Scan the program for such instructions and
      if FDs are valid, convert them to 'struct bpf_map' pointers which will be used
      by verifier to check access to maps in bpf_map_lookup/update() calls.
      If program passes verifier, convert pseudo BPF_LD_IMM64 into generic by dropping
      BPF_PSEUDO_MAP_FD flag.
      
      Note that eBPF interpreter is generic and knows nothing about pseudo insns.
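
A user-space sketch of the scan-and-convert step follows. It assumes that a 64-bit immediate load occupies two instruction slots (low half in the first imm, high half in the second) and that the pseudo flag is carried in src_reg; the fd table, the addresses in it and all toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_LD_IMM64      0x18  /* BPF_LD | BPF_IMM | BPF_DW */
#define TOY_PSEUDO_MAP_FD 1

struct toy_insn {
    uint8_t code;
    uint8_t src_reg;
    int32_t imm;
};

/* Hypothetical fd table: index is the fd, value is the "map address". */
static const uint64_t toy_fd_table[] = { 0, 0, 0, 0x1122334455667788ULL };
#define TOY_NFDS ((int)(sizeof(toy_fd_table) / sizeof(toy_fd_table[0])))

/* Replace each pseudo map fd with the 64-bit address it refers to.
 * Returns 0 on success or -1 on a malformed program or bad fd.
 */
static int toy_resolve_map_fds(struct toy_insn *insn, int len)
{
    int i;

    assert(insn && len > 0);
    for (i = 0; i < len; i++) {
        uint64_t addr;
        int fd;

        if (insn[i].code != TOY_LD_IMM64)
            continue;
        if (i == len - 1)
            return -1;               /* second slot is missing */
        if (insn[i].src_reg == TOY_PSEUDO_MAP_FD) {
            fd = insn[i].imm;
            if (fd < 0 || fd >= TOY_NFDS || !toy_fd_table[fd])
                return -1;           /* not a valid map fd */
            addr = toy_fd_table[fd];
            insn[i].imm = (int32_t)(uint32_t)addr;
            insn[i + 1].imm = (int32_t)(uint32_t)(addr >> 32);
        }
        i++;                         /* skip the second slot */
    }
    return 0;
}

/* After verification, dropping the flag leaves a generic load. */
static void toy_drop_pseudo_flag(struct toy_insn *insn, int len)
{
    int i;

    for (i = 0; i < len - 1; i++)
        if (insn[i].code == TOY_LD_IMM64) {
            insn[i].src_reg = 0;
            i++;
        }
}
```

Keeping the flag until verification has finished is what lets a verifier tell a map pointer apart from an ordinary 64-bit constant; once it is dropped, an interpreter sees only a plain immediate load.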
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0246e64d
    • A
      bpf: verifier (add ability to receive verification log) · cbd35700
      Alexei Starovoitov committed
      add optional attributes for BPF_PROG_LOAD syscall:
      union bpf_attr {
          struct {
      	...
      	__u32         log_level; /* verbosity level of eBPF verifier */
      	__u32         log_size;  /* size of user buffer */
      	__aligned_u64 log_buf;   /* user supplied 'char *buffer' */
          };
      };
      
      when log_level > 0 the verifier will return its verification log in the user
      supplied buffer 'log_buf' which can be used by program author to analyze why
      verifier rejected given program.
      
      'Understanding eBPF verifier messages' section of Documentation/networking/filter.txt
      provides several examples of these messages, like the program:
      
        BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
        BPF_LD_MAP_FD(BPF_REG_1, 0),
        BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
        BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
        BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
        BPF_EXIT_INSN(),
      
      will be rejected with the following multi-line message in log_buf:
      
        0: (7a) *(u64 *)(r10 -8) = 0
        1: (bf) r2 = r10
        2: (07) r2 += -8
        3: (b7) r1 = 0
        4: (85) call 1
        5: (15) if r0 == 0x0 goto pc+1
         R0=map_ptr R10=fp
        6: (7a) *(u64 *)(r0 +4) = 0
        misaligned access off 4 size 8
      
      The format of the output can change at any time as verifier evolves.
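
The last line of the log above comes from an alignment test: an 8-byte store at offset 4 is not a multiple of its own size. A minimal sketch of such a test, with the hypothetical name toy_check_access and a caller-supplied log buffer, could look like this:

```c
#include <assert.h>
#include <stdio.h>

/* Reject an access whose offset is not a multiple of its size and
 * describe the problem in 'log'. Returns 0 if the access is aligned.
 */
static int toy_check_access(int off, int size, char *log, size_t log_size)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8);
    if (off % size != 0) {
        snprintf(log, log_size, "misaligned access off %d size %d",
                 off, size);
        return -1;
    }
    return 0;
}
```

For off = 4 and size = 8 the sketch fails and writes the same text as the log line above; the same store at offset 0 or 8 would pass this particular test.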
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cbd35700
    • A
      bpf: verifier (add docs) · 51580e79
      Alexei Starovoitov committed
      This patch adds all of the eBPF verifier documentation and an empty bpf_check()
      
      The end goal for the verifier is to statically check safety of the program.
      
      Verifier will catch:
      - loops
      - out of range jumps
      - unreachable instructions
      - invalid instructions
      - uninitialized register access
      - uninitialized stack access
      - misaligned stack access
      - out of range stack access
      - invalid calling convention
      
      More details in Documentation/networking/filter.txt
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      51580e79
    • A
      bpf: handle pseudo BPF_CALL insn · 0a542a86
      Alexei Starovoitov committed
      In native eBPF programs, userspace uses pseudo BPF_CALL instructions
      which encode one of 'enum bpf_func_id' inside the insn->imm field.
      The verifier checks that the program passes correct function arguments
      to the given func_id. If all checks pass, the kernel fixes up the
      BPF_CALL->imm fields by replacing func_id with the in-kernel function
      pointer. The eBPF interpreter then just calls the function.
      
      In-kernel eBPF users continue to use generic BPF_CALL.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a542a86
    • A
      bpf: expand BPF syscall with program load/unload · 09756af4
      Alexei Starovoitov committed
      eBPF programs are similar to kernel modules. They are loaded by the user
      process and automatically unloaded when process exits. Each eBPF program is
      a safe run-to-completion set of instructions. eBPF verifier statically
      determines that the program terminates and is safe to execute.
      
      The following syscall wrapper can be used to load the program:
      int bpf_prog_load(enum bpf_prog_type prog_type,
                        const struct bpf_insn *insns, int insn_cnt,
                        const char *license)
      {
          union bpf_attr attr = {
              .prog_type = prog_type,
              .insns = ptr_to_u64(insns),
              .insn_cnt = insn_cnt,
              .license = ptr_to_u64(license),
          };
      
          return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
      }
      where 'insns' is an array of eBPF instructions and 'license' is a string
      that must be GPL compatible to call helper functions marked gpl_only
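
The gpl_only rule can be pictured as a string comparison followed by a gate. In the sketch below the list of accepted strings is an assumption modelled on the license strings the kernel accepts for modules (include/linux/license.h is the authoritative source), and the toy_ function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if 'license' is one of the strings treated as GPL
 * compatible in this sketch, 0 otherwise.
 */
static int toy_license_is_gpl_compatible(const char *license)
{
    static const char *const ok[] = {
        "GPL", "GPL v2", "GPL and additional rights",
        "Dual BSD/GPL", "Dual MIT/GPL", "Dual MPL/GPL",
    };
    size_t i;

    assert(license);
    for (i = 0; i < sizeof(ok) / sizeof(ok[0]); i++)
        if (strcmp(license, ok[i]) == 0)
            return 1;
    return 0;
}

/* A helper marked gpl_only is callable only from a program whose
 * license string passed the check above.
 */
static int toy_may_call(int helper_gpl_only, const char *license)
{
    return !helper_gpl_only || toy_license_is_gpl_compatible(license);
}
```

A program loaded with any other license string can still use helpers that are not marked gpl_only.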
      
      Upon successful load the syscall returns prog_fd.
      Use close(prog_fd) to unload the program.
      
      User space tests and examples follow in the later patches
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      09756af4
    • A
      bpf: add lookup/update/delete/iterate methods to BPF maps · db20fd2b
      Alexei Starovoitov committed
      'maps' is a generic storage of different types for sharing data between kernel
      and userspace.
      
      The maps are accessed from user space via BPF syscall, which has commands:
      
      - create a map with given type and attributes
        fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
        returns fd or negative error
      
      - lookup key in a given map referenced by fd
        err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->value
        returns zero and stores found elem into value or negative error
      
      - create or update key/value pair in a given map
        err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->value
        returns zero or negative error
      
      - find and delete element by key in a given map
        err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key
      
      - iterate map elements (based on input key return next_key)
        err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->next_key
      
      - close(fd) deletes the map
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      db20fd2b
    • A
      bpf: enable bpf syscall on x64 and i386 · 749730ce
      Alexei Starovoitov committed
      done as separate commit to ease conflict resolution
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      749730ce
    • A
      bpf: introduce BPF syscall and maps · 99c55f7d
      Alexei Starovoitov committed
      BPF syscall is a multiplexor for a range of different operations on eBPF.
      This patch introduces syscall with single command to create a map.
      Next patch adds commands to access maps.
      
      'maps' is a generic storage of different types for sharing data between kernel
      and userspace.
      
      Userspace example:
      /* this syscall wrapper creates a map with given type and attributes
       * and returns map_fd on success.
       * use close(map_fd) to delete the map
       */
      int bpf_create_map(enum bpf_map_type map_type, int key_size,
                         int value_size, int max_entries)
      {
          union bpf_attr attr = {
              .map_type = map_type,
              .key_size = key_size,
              .value_size = value_size,
              .max_entries = max_entries
          };
      
          return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
      }
      
      'union bpf_attr' is backwards compatible with future extensions.
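
One way such compatibility can work, given that the caller passes sizeof(attr) along with the pointer, is sketched below. This is an assumption about the mechanism, not a copy of the kernel code: bytes the receiver does not know about must be zero, and fields an older caller did not supply read as zero. The names toy_attr and toy_copy_attr are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

struct toy_attr {            /* fields known to this "kernel" */
    unsigned int map_type;
    unsigned int key_size;
};

/* Copy a caller-supplied attribute blob of 'size' bytes into 'dst'.
 * Returns 0 on success or -E2BIG if the caller set unknown bytes.
 */
static int toy_copy_attr(struct toy_attr *dst, const void *src, size_t size)
{
    const unsigned char *p = src;
    size_t i;

    assert(dst && src);
    for (i = sizeof(*dst); i < size; i++)
        if (p[i])
            return -E2BIG;   /* caller relies on a feature we lack */
    if (size > sizeof(*dst))
        size = sizeof(*dst);
    memset(dst, 0, sizeof(*dst));   /* omitted fields read as zero */
    memcpy(dst, src, size);
    return 0;
}
```

With this scheme a newer user space keeps working on an older kernel as long as it leaves the newer fields at zero, and an older user space keeps working on a newer kernel because the missing tail is zero-filled.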
      
      More details in Documentation/networking/filter.txt and in manpage
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      99c55f7d
  3. 26 Sep, 2014 6 commits
    • E
      net: sched: use pinned timers · 4a8e320c
      Eric Dumazet committed
      While using a MQ + NETEM setup, I had confirmation that the default
      timer migration ( /proc/sys/kernel/timer_migration ) is killing us.
      
      Installing this on a receiver side of a TCP_STREAM test, (NIC has 8 TX
      queues) :
      
      EST="est 1sec 4sec"
      for ETH in eth1
      do
       tc qd del dev $ETH root 2>/dev/null
       tc qd add dev $ETH root handle 1: mq
       tc qd add dev $ETH parent 1:1 $EST netem limit 70000 delay 6ms
       tc qd add dev $ETH parent 1:2 $EST netem limit 70000 delay 8ms
       tc qd add dev $ETH parent 1:3 $EST netem limit 70000 delay 10ms
       tc qd add dev $ETH parent 1:4 $EST netem limit 70000 delay 12ms
       tc qd add dev $ETH parent 1:5 $EST netem limit 70000 delay 14ms
       tc qd add dev $ETH parent 1:6 $EST netem limit 70000 delay 16ms
       tc qd add dev $ETH parent 1:7 $EST netem limit 80000 delay 18ms
       tc qd add dev $ETH parent 1:8 $EST netem limit 90000 delay 20ms
      done
      
      We can see that timers get migrated into a single cpu, presumably idle
      at the time timers are set up.
      Then all qdisc dequeues run from this cpu and huge lock contention
      happens. This single cpu is stuck in softirq mode and cannot dequeue
      fast enough.
      
          39.24%  [kernel]          [k] _raw_spin_lock
           2.65%  [kernel]          [k] netem_enqueue
           1.80%  [kernel]          [k] netem_dequeue
           1.63%  [kernel]          [k] copy_user_enhanced_fast_string
           1.45%  [kernel]          [k] _raw_spin_lock_bh
      
      By pinning qdisc timers on the cpu running the qdisc, we respect proper
      XPS setting and remove this lock contention.
      
           5.84%  [kernel]          [k] netem_enqueue
           4.83%  [kernel]          [k] _raw_spin_lock
           2.92%  [kernel]          [k] copy_user_enhanced_fast_string
      
      Current Qdiscs that benefit from this change are :
      
      	netem, cbq, fq, hfsc, tbf, htb.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4a8e320c
    • D
      Merge branch 'gso_send_check' · 9fb426a6
      David S. Miller committed
      Tom Herbert says:
      
      ====================
      net: Eliminate gso_send_check
      
      gso_send_check presents a lot of complexity for what it is being used
      for. It seems that there are only two cases where it might be effective:
      TCP and UFO paths. In these cases, the gso_send_check function
      initializes the TCP or UDP checksum respectively to the pseudo header
      checksum so that the checksum computation is appropriately offloaded or
      computed in the gso_segment functions. The gso_send_check functions
      are only called from dev.c in skb_mac_gso_segment when ip_summed !=
      CHECKSUM_PARTIAL (which seems very unlikely in TCP case). We can move
      the logic of this into the respective gso_segment functions where the
      checksum is initialized if ip_summed != CHECKSUM_PARTIAL.
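
For readers unfamiliar with the pseudo header checksum mentioned above, the following user-space sketch computes the ones' complement sum over an IPv4 pseudo header. It is an illustration, not kernel code: values are taken in host byte order for readability and the toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t toy_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Ones' complement sum over the IPv4 pseudo header: source address,
 * destination address, zero + protocol, and the L4 length.
 */
static uint16_t toy_pseudo_hdr_sum(uint32_t saddr, uint32_t daddr,
                                   uint8_t proto, uint16_t len)
{
    uint32_t sum = 0;

    assert(proto == 6 || proto == 17);  /* TCP or UDP, as in the text */
    sum += saddr >> 16;
    sum += saddr & 0xffff;
    sum += daddr >> 16;
    sum += daddr & 0xffff;
    sum += proto;
    sum += len;
    return toy_fold(sum);
}
```

For 192.168.0.1 to 192.168.0.2, protocol 6 and a 20-byte segment, the sum is 0x816e. Seeding the checksum field with this value lets whoever finishes the checksum later add only the bytes of the TCP or UDP header and payload.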
      
      With the above cases handled, gso_send_check is no longer needed, so
      we can remove all uses of it and the fields in the offload callbacks.
      With this change, ip_summed in the skb should be preserved through all
      the layers of gso_segment calls.
      
      In follow-on patches, we may be able to remove the check setup code in
      tcp_gso_segment if we can guarantee that ip_summed will always be
      CHECKSUM_PARTIAL (verify all paths and probably add an assert in
      tcp_gro_segment).
      
      Tested these patches by:
        - netperf TCP_STREAM test with GSO enabled
        - Forced ip_summed != CHECKSUM_PARTIAL with above
        - Ran UDP_RR with 10000 request size over GRE tunnel. This exercised
          UFO path.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9fb426a6
    • T
      net: Remove gso_send_check as an offload callback · 53e50398
      Tom Herbert committed
      The send_check logic was only interesting in cases of TCP offload and
      UDP UFO where the checksum needed to be initialized to the pseudo
      header checksum. Now we've moved that logic into the related
      gso_segment functions so gso_send_check is no longer needed.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      53e50398
    • T
      udp: move logic out of udp[46]_ufo_send_check · f71470b3
      Tom Herbert committed
      In udp[46]_ufo_send_check the UDP checksum is initialized to the pseudo
      header checksum. We can move this logic into udp[46]_ufo_fragment.
      After this change udp[46]_ufo_send_check is a no-op.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f71470b3
    • T
      tcp: move logic out of tcp_v[64]_gso_send_check · d020f8f7
      Tom Herbert committed
      In tcp_v[46]_gso_send_check the TCP checksum is initialized to the
      pseudo header checksum using __tcp_v[46]_send_check. We can move this
      logic into new tcp[46]_gso_segment functions to be done when
      ip_summed != CHECKSUM_PARTIAL (ip_summed == CHECKSUM_PARTIAL should be
      the common case, possibly always true when taking the GSO path). After
      this change tcp_v[46]_gso_send_check is a no-op.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d020f8f7
    • D
      Merge branch 'stmmac' · 2fdbfea5
      David S. Miller committed
      Beniamino Galvani says:
      
      ====================
      net: stmmac glue layer for Amlogic Meson SoCs
      
      The Ethernet controller available in Amlogic Meson6 and Meson8 SoCs is
      a Synopsys DesignWare MAC IP core, already supported by the stmmac
      driver.
      
      These patches add a glue layer to the driver for the platform-specific
      settings required by the Amlogic variant.
      
      This has been tested on an Amlogic S802 device with the initial Meson
      support submitted by Carlo Caione [1].
      
      [1] http://lwn.net/Articles/612000/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2fdbfea5