1. 07 May, 2016 11 commits
    • Merge branch 'bpf-direct-pkt-access' · 4b307a8e
      David S. Miller committed
      Alexei Starovoitov says:
      
      ====================
      bpf: introduce direct packet access
      
      This set of patches introduces 'direct packet access' from
      cls_bpf and act_bpf programs (which are root only).
      
      Current bpf programs use LD_ABS, LD_IND instructions which have
      to do 'if (off < skb_headlen)' for every packet access.
      It's ok for socket filters, but too slow for XDP, since a single
      LD_ABS insn consumes 3% of cpu. Therefore we have to amortize the cost
      of the length check over multiple packet accesses via direct access
      to the skb->data, data_end pointers.
      
      Existing packet parsers typically look like:
        if (load_half(skb, offsetof(struct ethhdr, h_proto)) != ETH_P_IP)
           return 0;
        if (load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)) != IPPROTO_UDP ||
            load_byte(skb, ETH_HLEN) != 0x45)
           return 0;
        ...
      with 'direct packet access' the bpf program becomes:
         void *data = (void *)(long)skb->data;
         void *data_end = (void *)(long)skb->data_end;
         struct eth_hdr *eth = data;
         struct iphdr *iph = data + sizeof(*eth);
         struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph);
      
         if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
            return 0;
         if (eth->h_proto != htons(ETH_P_IP))
            return 0;
         if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
            return 0;
         ...
      which is more natural to write and significantly faster.
      See patch 6 for performance tests:
      21Mpps(old) vs 24Mpps(new) with just 5 loads.
      For more complex parsers the performance gain is higher.
      
      The other approach, implemented in [1], added two new instructions
      to the interpreter and JITs and was too hard to use from the llvm side.
      The approach presented here doesn't need any instruction changes,
      but the verifier has to work harder to check the safety of packet access.
      
      Patch 1 prepares the code and Patch 2 adds new checks for direct
      packet access; all of them are gated with 'env->allow_ptr_leaks',
      which is true for root only.
      Patch 3 improves search pruning for large programs.
      Patch 4 wires the verifier's changes into the net/core/filter side.
      Patch 5 updates docs.
      Patches 6 and 7 add tests.
      
      [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
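The single-check pattern described in the cover letter can be tried outside the kernel. The following is a minimal user-space sketch, not code from the patch set: `is_plain_udp()` and the `IP_HLEN`/`UDP_HLEN` constants are illustrative names, and a plain byte buffer stands in for skb->data and skb->data_end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN    14      /* Ethernet header length */
#define IP_HLEN     20      /* IPv4 header without options (assumed name) */
#define UDP_HLEN    8       /* UDP header length (assumed name) */
#define ETH_P_IP    0x0800
#define IPPROTO_UDP 17

/* Returns 1 for UDP over IPv4 without IP options, 0 otherwise. */
static int is_plain_udp(const uint8_t *data, const uint8_t *data_end)
{
    /* One length check covers every load below. */
    if ((size_t)(data_end - data) < ETH_HLEN + IP_HLEN + UDP_HLEN)
        return 0;

    /* EtherType sits in bytes 12..13, in network byte order. */
    uint16_t proto = (uint16_t)((data[12] << 8) | data[13]);
    if (proto != ETH_P_IP)
        return 0;

    const uint8_t *ip = data + ETH_HLEN;
    if (ip[0] != 0x45)          /* version 4, ihl 5 */
        return 0;
    if (ip[9] != IPPROTO_UDP)   /* protocol field */
        return 0;
    return 1;
}
```

Calling `is_plain_udp(buf, buf + len)` on a 42-byte Ethernet/IPv4/UDP frame returns 1, while a buffer one byte shorter is rejected by the single up-front check.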
    • samples/bpf: add verifier tests · 883e44e4
      Alexei Starovoitov committed
      Add a few tests for the "pointer to packet" logic of the verifier.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples/bpf: add 'pointer to packet' tests · 65d472fb
      Alexei Starovoitov committed
      parse_simple.c - packet parser example with a single length check that
      filters out udp packets for port 9
      
      parse_varlen.c - variable length parser that understands multiple vlan headers,
      ipip, ipip6 and ip options to filter out udp or tcp packets on port 9.
      The packet is parsed layer by layer with multiple length checks.
      
      parse_ldabs.c - classic style of packet parsing using LD_ABS instruction.
      Same functionality as parse_simple.
      
      simple = 24.1Mpps per core
      varlen = 22.7Mpps
      ldabs  = 21.4Mpps
      
      The parser with LD_ABS instructions is slower than even the full direct access
      parser (varlen), which does more packet accesses and checks.
      
      These examples demonstrate the choice bpf program authors can make between
      flexibility of the parser vs speed.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
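The layer-by-layer style with one length check per layer can be illustrated with VLAN tag skipping. This is a hedged user-space sketch and not the contents of parse_varlen.c: `skip_vlans()` and `MAX_VLAN_DEPTH` are hypothetical names, and the depth limit of 2 is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN       14
#define VLAN_HLEN      4
#define ETH_P_8021Q    0x8100
#define ETH_P_8021AD   0x88A8
#define MAX_VLAN_DEPTH 2    /* assumed limit for this sketch */

/*
 * Returns the offset of the first header after Ethernet and any VLAN
 * tags, or -1 if the packet is truncated. The final EtherType is
 * stored in *proto on success.
 */
static int skip_vlans(const uint8_t *data, const uint8_t *data_end,
                      uint16_t *proto)
{
    size_t len = (size_t)(data_end - data);
    size_t off = ETH_HLEN;

    if (off > len)                      /* check for the Ethernet layer */
        return -1;
    uint16_t p = (uint16_t)((data[12] << 8) | data[13]);

    for (int i = 0; i < MAX_VLAN_DEPTH &&
                    (p == ETH_P_8021Q || p == ETH_P_8021AD); i++) {
        if (off + VLAN_HLEN > len)      /* separate check for each tag */
            return -1;
        /* 2 bytes of TCI, then the encapsulated EtherType */
        p = (uint16_t)((data[off + 2] << 8) | data[off + 3]);
        off += VLAN_HLEN;
    }

    *proto = p;
    return (int)off;
}
```

Each layer pays for its own bounds check, which is the flexibility-versus-speed trade-off the commit message describes: the parser accepts more packet shapes, at the cost of more checks than the single-check variant.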
    • bpf: add documentation for 'direct packet access' · f9c8d19d
      Alexei Starovoitov committed
      Explain how the verifier checks the safety of packet access
      and update email addresses.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: wire in data and data_end for cls_act_bpf · db58ba45
      Alexei Starovoitov committed
      Allow cls_bpf and act_bpf programs to access the skb->data and skb->data_end pointers.
      The bpf helpers that change skb->data need to update the data_end pointer as well.
      The verifier checks that programs always reload the data, data_end pointers
      after calls to such bpf helpers.
      We cannot add the 'data_end' pointer to struct qdisc_skb_cb directly,
      since it's embedded as-is by infiniband ipoib, so a wrapper struct is needed.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
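The reload rule can be shown with a small user-space model. This is only a sketch under stated assumptions: `struct toy_skb`, `toy_strip_front()` and `first_byte_after_strip()` are hypothetical names, and the helper merely stands in for any bpf helper that changes skb->data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the context a program sees (assumed layout). */
struct toy_skb {
    uint8_t *data;
    uint8_t *data_end;
};

/* Hypothetical helper that changes the packet: strips n front bytes. */
static int toy_strip_front(struct toy_skb *skb, size_t n)
{
    if ((size_t)(skb->data_end - skb->data) < n)
        return -1;
    skb->data += n;
    return 0;
}

/* Returns the first byte after stripping n bytes, or -1 on error. */
static int first_byte_after_strip(struct toy_skb *skb, size_t n)
{
    uint8_t *data = skb->data;
    uint8_t *data_end = skb->data_end;

    if (data_end - data < 1)
        return -1;

    if (toy_strip_front(skb, n) != 0)
        return -1;

    /* The cached pointers are stale now: reload and re-check them. */
    data = skb->data;
    data_end = skb->data_end;
    if (data_end - data < 1)
        return -1;

    return data[0];
}
```

In the kernel the verifier enforces the reload; in this model, skipping it would simply read the wrong byte.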
    • bpf: improve verifier state equivalence · 735b4333
      Alexei Starovoitov committed
      Since the UNKNOWN_VALUE type is weaker than CONST_IMM, we can un-teach
      the verifier its recognition of constants in conditional branches
      without affecting safety.
      Ex:
      if (reg == 123) {
        .. here verifier was marking reg->type as CONST_IMM
           instead keep reg as UNKNOWN_VALUE
      }
      
      Two verifier states with UNKNOWN_VALUE are equivalent, whereas
      CONST_IMM_X != CONST_IMM_Y, since CONST_IMM is used for stack range
      verification and other cases.
      So help search pruning by marking registers as UNKNOWN_VALUE
      where possible instead of CONST_IMM.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
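Why UNKNOWN_VALUE helps pruning can be seen in a toy comparison of register states. This is a deliberately simplified model and not the kernel's state comparison: the `toy_` types and functions are hypothetical, and real pruning also accepts some states that are not exactly equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified register types; the kernel tracks many more. */
enum toy_reg_type { TOY_UNKNOWN_VALUE, TOY_CONST_IMM };

struct toy_reg {
    enum toy_reg_type type;
    long imm;               /* meaningful only for TOY_CONST_IMM */
};

static bool toy_reg_equal(const struct toy_reg *a, const struct toy_reg *b)
{
    if (a->type != b->type)
        return false;
    if (a->type == TOY_CONST_IMM)
        return a->imm == b->imm;    /* CONST_IMM_X != CONST_IMM_Y */
    return true;                    /* any two unknown values match */
}

/* Two states are equivalent when every register pair matches. */
static bool toy_states_equal(const struct toy_reg *old,
                             const struct toy_reg *cur, size_t nregs)
{
    for (size_t i = 0; i < nregs; i++)
        if (!toy_reg_equal(&old[i], &cur[i]))
            return false;
    return true;
}
```

Two paths that learned `reg == 123` and `reg == 456` compare as different states when tracked as constants, so neither can be pruned; tracked as unknown values, they compare equal and the second path can be skipped.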
    • bpf: direct packet access · 969bf05e
      Alexei Starovoitov committed
      Extended BPF carried over two instructions from classic to access
      packet data: LD_ABS and LD_IND. They're highly optimized in JITs,
      but due to their design they have to do a length check for every access.
      When BPF is processing 20M packets per second, a single LD_ABS after JIT
      is consuming 3% cpu. Hence the need to optimize it further by amortizing
      the cost of 'off < skb_headlen' over multiple packet accesses.
      One option is to introduce two new eBPF instructions LD_ABS_DW and LD_IND_DW
      with similar usage as skb_header_pointer().
      The kernel part for the interpreter and x64 JIT was implemented in [1], but such
      new insns behave like the old ld_abs and abort the program with 'return 0' if
      the access is beyond linear data. Such hidden control flow is hard to work around,
      plus changing JITs and rolling out a new llvm is inconvenient.
      
      Therefore allow cls_bpf/act_bpf programs to access skb->data directly:
      int bpf_prog(struct __sk_buff *skb)
      {
        struct iphdr *ip;
      
        if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end)
            /* packet too small */
            return 0;
      
        ip = skb->data + ETH_HLEN;
      
        /* access IP header fields with direct loads */
        if (ip->version != 4 || ip->saddr == 0x7f000001)
            return 1;
        [...]
      }
      
      This solution avoids introduction of new instructions. llvm stays
      the same and all JITs stay the same, but verifier has to work extra hard
      to prove safety of the above program.
      
      For XDP the direct store instructions can be allowed as well.
      
      The skb->data is NET_IP_ALIGNED, so for common cases the verifier can check
      the alignment. Complex packet parsers where the packet pointer is adjusted
      incrementally cannot be tracked for alignment, so allow byte access in such cases,
      and misaligned access on architectures that define efficient_unaligned_access.
      
      [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: cleanup verifier code · 1a0dc1ac
      Alexei Starovoitov committed
      Clean up the verifier code and prepare it for the addition of "pointer to packet" logic.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 95aef7ce
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2016-05-05
      
      This series contains updates to i40e and i40evf.
      
      The theme behind this series is code reduction, yeah!  Jesse provides
      most of the changes, starting with a refactor of the interpretation of
      a tunnel which lets us start using the hardware's parsing.  He removed
      the packet split receive routine and ancillary code in preparation
      for the Rx refactor.  The refactor of the receive routine
      aligns it with the one in ixgbe, which was highly
      optimized.  The hardware supports a 16 byte descriptor for receive,
      but the driver was never using it in production.  There was no performance
      benefit to the real driver of 16 byte descriptors, so drop a whole lot
      of complexity while getting rid of the code.  He also fixed a bug where, while
      changing the number of descriptors using ethtool, the driver did not
      test the limits of the system memory before permanently assuming it
      would be able to get receive buffer memory.
      
      Mitch fixes a memory leak of one page each time the driver is opened by
      allocating the correct number of receive buffers and not fiddling with
      next_to_use in the VF driver.
      
      Arnd Bergmann fixed an indentation issue by adding the appropriate
      curly braces in i40e_vc_config_promiscuous_mode_msg().
      
      Julia Lawall fixed an issue found by Coccinelle, where i40e_client_ops
      structure can be const since it is never modified.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: vrf: Create FIB tables on link create · b3b4663c
      David Ahern committed
      Tables have to exist for VRFs to function. Ensure they exist
      when the VRF device is created.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cnic: call cp->stop_hw() in cnic_start_hw() on allocation failure · f37bd0cc
      Jon Maxwell committed
      We recently had a system crash in the cnic module. Vmcore analysis confirmed
      that "ip link up" was executed, which failed due to an allocation failure
      because of memory fragmentation. Further analysis revealed that the cnic irq
      vector was still allocated after the "ip link up" that failed. When
      "ip link down" was executed it called free_msi_irqs(), which crashed the system
      because the cnic irq was still in use.
      
      PANIC: "kernel BUG at drivers/pci/msi.c:411!"
      
      The code execution was:
      
      cnic_netdev_event()
      if (event == NETDEV_UP) {
      .
      .
              if (!cnic_start_hw(dev))
      cnic_start_hw()
      calls cnic_cm_open() which failed with -ENOMEM
      cnic_start_hw() then took the err1 path:
      
      err1:
             cp->free_resc(dev); <---- frees resources but not irq vector
             pci_dev_put(dev->pcidev);
             return err;
      }
      
      This returns control back to cnic_netdev_event() but now the cnic irq vector
      is still allocated even though cnic_cm_open() failed. The next
      "ip link down" will trigger the crash.
      
      The cnic_start_hw() routine is not handling the allocation failure correctly.
      Fix this by checking whether CNIC_DRV_STATE_HANDLES_IRQ flag is set indicating
      that the hardware has been started in cnic_start_hw(). If it has then call
      cp->stop_hw() which frees the cnic irq vector and cnic resources. Otherwise
      just maintain the previous behaviour and free cnic resources.
      
      I reproduced this by injecting an ENOMEM error into cnic_cm_alloc_mem()'s return
      code.
      
      # ip link set dev enpX down
      # ip link set dev enpX up <--- hits allocation failure
      # ip link set dev enpX down <--- crashes here
      
      With this patch I confirmed there was no crash in the reproducer.
      Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
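The error-path fix described in this commit message can be modelled in a few lines. This is a toy reconstruction from the description only, not the driver patch: `struct toy_dev`, the `toy_` functions and `TOY_STATE_HANDLES_IRQ` are hypothetical stand-ins for the cnic structures and for the CNIC_DRV_STATE_HANDLES_IRQ flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_STATE_HANDLES_IRQ 0x1u  /* stand-in for the driver flag */

struct toy_dev {
    unsigned int drv_state;
    bool irq_allocated;
    bool resc_allocated;
};

/* Frees resources but leaves the irq vector alone. */
static void toy_free_resc(struct toy_dev *d)
{
    d->resc_allocated = false;
}

/* Frees the irq vector and the resources. */
static void toy_stop_hw(struct toy_dev *d)
{
    d->irq_allocated = false;
    d->drv_state &= ~TOY_STATE_HANDLES_IRQ;
    toy_free_resc(d);
}

static int toy_start_hw(struct toy_dev *d, bool fail_cm_open)
{
    d->resc_allocated = true;
    d->irq_allocated = true;
    d->drv_state |= TOY_STATE_HANDLES_IRQ;

    if (fail_cm_open)               /* injected allocation failure */
        goto err1;
    return 0;

err1:
    /* If the hardware was started, undo that too; else keep old path. */
    if (d->drv_state & TOY_STATE_HANDLES_IRQ)
        toy_stop_hw(d);
    else
        toy_free_resc(d);
    return -ENOMEM;
}
```

With the flag check in place, a failed `toy_start_hw()` leaves no irq vector behind, so a later teardown has nothing stale to free.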
  2. 06 May, 2016 12 commits
  3. 05 May, 2016 17 commits