1. 20 Oct, 2017 2 commits
  2. 18 Oct, 2017 3 commits
    • bpf: remove the verifier ops from program structure · 00176a34
      Committed by Jakub Kicinski
      Since the verifier ops don't have to be associated with
      the program for its entire lifetime, we can move them to
      the verifier's struct bpf_verifier_env.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: split verifier and program ops · 7de16e3a
      Committed by Jakub Kicinski
      struct bpf_verifier_ops contains both verifier ops and operations
      used later during the program's lifetime (test_run).  Split the
      runtime ops into a different structure.
      
      BPF_PROG_TYPE() will now append ## _prog_ops or ## _verifier_ops
      to the names.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP · 6710e112
      Committed by Jesper Dangaard Brouer
      The 'cpumap' is primarily used as a backend map for XDP BPF helper
      call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
      
      This patch implements the main part of the map.  It is not connected to
      the XDP redirect system yet, and no SKB allocations are done yet.

      The main concern in this patch is to ensure the datapath can run
      without any locking.  This adds complexity to the setup and tear-down
      procedures, whose assumptions are carefully documented in the
      code comments.
      
      V2:
       - make sure array isn't larger than NR_CPUS
       - make sure each CPU added is a valid possible CPU
      
      V3: fix nitpicks from Jakub Kicinski <kubakici@wp.pl>
      
      V5:
       - Restrict map allocation to root / CAP_SYS_ADMIN
       - WARN_ON_ONCE if queue is not empty on tear-down
       - Return -EPERM on memlock limit instead of -ENOMEM
       - Error code in __cpu_map_entry_alloc() also handle ptr_ring_cleanup()
       - Moved cpu_map_enqueue() to next patch
      
      V6: all noticed by Daniel Borkmann
       - Fix err return code in cpu_map_alloc() introduced in V5
       - Move cpu_possible() check after max_entries boundary check
       - Forbid usage initially in check_map_func_compatibility()
      
      V7:
       - Fix alloc error path spotted by Daniel Borkmann
       - Did stress test adding+removing CPUs from the map concurrently
       - Fixed refcnt issue on cpu_map_entry, kthread started too soon
       - Make sure packets are flushed during tear-down; this involved use of
         rcu_barrier(), and kthread_run only exits after the queue is empty
       - Fix alloc error path in __cpu_map_entry_alloc() for ptr_ring
      
      V8:
       - Nitpicking comments and grammar by Edward Cree
       - Fix missing semi-colon introduced in V7 due to rebasing
       - Move struct bpf_cpu_map_entry members cpu+map_id to tracepoint patch
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 08 Oct, 2017 1 commit
  4. 05 Oct, 2017 2 commits
    • bpf: introduce BPF_PROG_QUERY command · 468e2f64
      Committed by Alexei Starovoitov
      Introduce the BPF_PROG_QUERY command to retrieve either the set of
      programs attached to a given cgroup or the set of effective programs
      that will execute for events within a cgroup.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: multi program support for cgroup+bpf · 324bda9e
      Committed by Alexei Starovoitov
      Introduce the BPF_F_ALLOW_MULTI flag, which can be used to attach
      multiple bpf programs to a cgroup.
      
      The difference between three possible flags for BPF_PROG_ATTACH command:
      - NONE(default): No further bpf programs allowed in the subtree.
      - BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
        the program in this cgroup yields to sub-cgroup program.
      - BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
        that cgroup program gets run in addition to the program in this cgroup.
      
      NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
      change their behavior. It only clarifies the semantics in relation
      to the new flag.
      
      Only one program is allowed to be attached to a cgroup with the
      NONE or BPF_F_ALLOW_OVERRIDE flag.
      Multiple programs are allowed to be attached to a cgroup with the
      BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
      (those that were attached first run first).
      The programs of the sub-cgroup are executed first, then the programs
      of this cgroup, and then the programs of the parent cgroup.
      All eligible programs are executed regardless of the return code from
      earlier programs.
      
      To allow efficient execution of multiple programs attached to a cgroup,
      and to avoid penalizing cgroups without any programs attached,
      introduce 'struct bpf_prog_array', which is an RCU-protected array
      of pointers to bpf programs.
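
The ordering rules above can be illustrated with a small user-space sketch. It is not the kernel implementation: the struct, the function names and the integer program ids are hypothetical, every cgroup in the chain is assumed to have been attached with BPF_F_ALLOW_MULTI, and combining the per-program verdicts with AND is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROGS 4

/* Hypothetical, simplified cgroup: programs are stored in attach order. */
struct cgroup_sketch {
    const struct cgroup_sketch *parent;   /* NULL for the root */
    int prog_ids[MAX_PROGS];              /* attached programs, oldest first */
    size_t nr_progs;
};

/*
 * Flatten the programs that apply to 'cg' into 'out', in the order stated
 * in the commit message: this cgroup's programs first (FIFO), then each
 * ancestor's programs in turn. Returns the number of entries written.
 */
static size_t build_effective_array(const struct cgroup_sketch *cg,
                                    int *out, size_t cap)
{
    size_t n = 0;

    assert(out != NULL);
    for (; cg != NULL; cg = cg->parent)
        for (size_t i = 0; i < cg->nr_progs && n < cap; i++)
            out[n++] = cg->prog_ids[i];
    return n;
}

/*
 * Run every program in the array regardless of earlier return codes.
 * ANDing the verdicts together is an assumption of this sketch.
 */
static int run_effective_array(const int *progs, size_t n,
                               int (*run)(int prog_id))
{
    int verdict = 1;

    for (size_t i = 0; i < n; i++)
        verdict &= run(progs[i]) ? 1 : 0;
    return verdict;
}

/* Example callback: counts invocations and allows only even program ids. */
static int calls;
static int allow_even_ids(int prog_id)
{
    calls++;
    return prog_id % 2 == 0;
}
```

Building the flat array once, when programs are attached or detached, keeps the per-event path to a single loop over pointers, which is the motivation the commit message gives for 'struct bpf_prog_array'.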
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 01 Oct, 2017 1 commit
  6. 29 Sep, 2017 2 commits
  7. 20 Sep, 2017 1 commit
    • bpf: do not disable/enable BH in bpf_map_free_id() · 930651a7
      Committed by Eric Dumazet
      syzkaller reported the following splat [1].

      Since hard IRQs are disabled by the caller, bpf_map_free_id()
      should not try to enable/disable BH.

      Another solution would be to change htab_map_delete_elem() to
      defer the free_htab_elem() call until after
      raw_spin_unlock_irqrestore(&b->lock, flags), but this might not be
      enough to cover other code paths.
      
      [1]
      WARNING: CPU: 1 PID: 8052 at kernel/softirq.c:161 __local_bh_enable_ip+0x1e/0x160 kernel/softirq.c:161
      Kernel panic - not syncing: panic_on_warn set ...

      CPU: 1 PID: 8052 Comm: syz-executor1 Not tainted 4.13.0-next-20170915+ #23
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       panic+0x1e4/0x417 kernel/panic.c:181
       __warn+0x1c4/0x1d9 kernel/panic.c:542
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
       do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
       do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
       invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
      RIP: 0010:__local_bh_enable_ip+0x1e/0x160 kernel/softirq.c:161
      RSP: 0018:ffff8801cdcd7748 EFLAGS: 00010046
      RAX: 0000000000000082 RBX: 0000000000000201 RCX: 0000000000000000
      RDX: 1ffffffff0b5933c RSI: 0000000000000201 RDI: ffffffff85ac99e0
      RBP: ffff8801cdcd7758 R08: ffffffff85b87158 R09: 1ffff10039b9aec6
      R10: ffff8801c99f24c0 R11: 0000000000000002 R12: ffffffff817b0b47
      R13: dffffc0000000000 R14: ffff8801cdcd77e8 R15: 0000000000000001
       __raw_spin_unlock_bh include/linux/spinlock_api_smp.h:176 [inline]
       _raw_spin_unlock_bh+0x30/0x40 kernel/locking/spinlock.c:207
       spin_unlock_bh include/linux/spinlock.h:361 [inline]
       bpf_map_free_id kernel/bpf/syscall.c:197 [inline]
       __bpf_map_put+0x267/0x320 kernel/bpf/syscall.c:227
       bpf_map_put+0x1a/0x20 kernel/bpf/syscall.c:235
       bpf_map_fd_put_ptr+0x15/0x20 kernel/bpf/map_in_map.c:96
       free_htab_elem+0xc3/0x1b0 kernel/bpf/hashtab.c:658
       htab_map_delete_elem+0x74d/0x970 kernel/bpf/hashtab.c:1063
       map_delete_elem kernel/bpf/syscall.c:633 [inline]
       SYSC_bpf kernel/bpf/syscall.c:1479 [inline]
       SyS_bpf+0x2188/0x46a0 kernel/bpf/syscall.c:1451
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Fixes: f3f1c054 ("bpf: Introduce bpf_map ID")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 09 Sep, 2017 1 commit
  9. 06 Sep, 2017 1 commit
    • bpf: fix numa_node validation · 96e5ae4e
      Committed by Eric Dumazet
      syzkaller reported crashes in bpf map creation or map update [1].

      The problem is that nr_node_ids is a signed integer and
      NUMA_NO_NODE is also an integer, so it is very tempting
      to declare numa_node as a signed integer.

      This means the typical test to validate a user-provided value:
      
              if (numa_node != NUMA_NO_NODE &&
                  (numa_node >= nr_node_ids ||
                   !node_online(numa_node)))
      
      must be written:
      
              if (numa_node != NUMA_NO_NODE &&
                  ((unsigned int)numa_node >= nr_node_ids ||
                   !node_online(numa_node)))
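
The difference between the two forms can be reproduced in user space. The following is a minimal sketch, not the kernel code: the function names are hypothetical, node_online() is left out, and NUMA_NO_NODE is assumed to be -1.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE (-1)   /* assumed value for this sketch */

/*
 * Signed comparison: a negative numa_node other than NUMA_NO_NODE is
 * smaller than nr_node_ids, so it slips through the range test.
 */
static bool node_in_range_signed(int numa_node, int nr_node_ids)
{
    if (numa_node != NUMA_NO_NODE && numa_node >= nr_node_ids)
        return false;
    return true;
}

/*
 * Unsigned comparison: the cast turns a negative value into a very large
 * unsigned one, which then fails the range test as intended.
 */
static bool node_in_range_cast(int numa_node, int nr_node_ids)
{
    assert(nr_node_ids > 0);
    if (numa_node != NUMA_NO_NODE &&
        (unsigned int)numa_node >= (unsigned int)nr_node_ids)
        return false;
    return true;
}
```

With nr_node_ids equal to 2, a user-supplied value of -5 is accepted by the signed version and rejected by the cast version, while NUMA_NO_NODE and valid node ids behave the same in both.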
      
      [1]
      kernel BUG at mm/slab.c:3256!
      invalid opcode: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 2946 Comm: syzkaller916108 Not tainted 4.13.0-rc7+ #35
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      task: ffff8801d2bc60c0 task.stack: ffff8801c0c90000
      RIP: 0010:____cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292
      RSP: 0018:ffff8801c0c97638 EFLAGS: 00010096
      RAX: ffffffffffff8b7b RBX: 0000000001080220 RCX: 0000000000000000
      RDX: 00000000ffff8b7b RSI: 0000000001080220 RDI: ffff8801dac00040
      RBP: ffff8801c0c976c0 R08: 0000000000000000 R09: 0000000000000000
      R10: ffff8801c0c97620 R11: 0000000000000001 R12: ffff8801dac00040
      R13: ffff8801dac00040 R14: 0000000000000000 R15: 00000000ffff8b7b
      FS:  0000000002119940(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020001fec CR3: 00000001d2980000 CR4: 00000000001406f0
      Call Trace:
       __do_kmalloc_node mm/slab.c:3688 [inline]
       __kmalloc_node+0x33/0x70 mm/slab.c:3696
       kmalloc_node include/linux/slab.h:535 [inline]
       alloc_htab_elem+0x2a8/0x480 kernel/bpf/hashtab.c:740
       htab_map_update_elem+0x740/0xb80 kernel/bpf/hashtab.c:820
       map_update_elem kernel/bpf/syscall.c:587 [inline]
       SYSC_bpf kernel/bpf/syscall.c:1468 [inline]
       SyS_bpf+0x20c5/0x4c40 kernel/bpf/syscall.c:1443
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x440409
      RSP: 002b:00007ffd1f1792b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440409
      RDX: 0000000000000020 RSI: 0000000020006000 RDI: 0000000000000002
      RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401d70
      R13: 0000000000401e00 R14: 0000000000000000 R15: 0000000000000000
      Code: 83 c2 01 89 50 18 4c 03 70 08 e8 38 f4 ff ff 4d 85 f6 0f 85 3e ff ff ff 44 89 fe 4c 89 ef e8 94 fb ff ff 49 89 c6 e9 2b ff ff ff <0f> 0b 0f 0b 0f 0b 66 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41
      RIP: ____cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292 RSP: ffff8801c0c97638
      ---[ end trace d745f355da2e33ce ]---
      Kernel panic - not syncing: Fatal exception
      
      Fixes: 96eabe7a ("bpf: Allow selecting numa node during map creation")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 29 Aug, 2017 1 commit
  11. 20 Aug, 2017 1 commit
    • bpf: Allow selecting numa node during map creation · 96eabe7a
      Committed by Martin KaFai Lau
      The current map creation API does not allow specifying a numa-node
      preference.  The memory usually comes from the node where the
      map-creation-process is running.  The performance is not ideal if the
      bpf_prog is known to always run in a numa node different from that of
      the map-creation-process.

      One use case is sharding on CPU to different LRU maps (i.e.
      an array of LRU maps).  Here is the test result of map_perf_test on
      the INNER_LRU_HASH_PREALLOC test if we force the lru map used by
      CPU0 to be allocated from a remote numa node:
      
      [ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ]
      
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec  #<<<
      
      After specifying numa node:
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<<
      
      This patch adds one field, numa_node, to the bpf_attr.  Since numa node 0
      is a valid node, a new flag BPF_F_NUMA_NODE is also added.  The numa_node
      field is honored if and only if the BPF_F_NUMA_NODE flag is set.
      
      Numa node selection is not supported for percpu map.
      
      This patch does not change every kmalloc.  For example,
      'htab = kzalloc()' is not changed since the object
      is small enough to stay in the cache.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 19 Aug, 2017 1 commit
  13. 17 Aug, 2017 2 commits
    • bpf: sockmap with sk redirect support · 174a79ff
      Committed by John Fastabend
      Recently we added a new map type called dev map, used to forward XDP
      packets between ports (6093ec2d). This patch introduces a
      similar notion for sockets.

      A sockmap allows users to add participating sockets to a map. When
      sockets are added to the map, enough context is stored with the
      map entry to use the entry with a new helper:
      
        bpf_sk_redirect_map(map, key, flags)
      
      This helper (analogous to bpf_redirect_map in XDP) is given the map
      and an entry in the map. When called from a sockmap program, discussed
      below, the skb will be sent on the socket using skb_send_sock().
      
      With the above, we need a bpf program that calls the helper and
      thereby implements the send logic. The initial site implemented in
      this series is the recv_sock hook. For this to work we implemented a
      map attach command to add attributes to a map. In sockmap we add two
      programs: a parse program and a verdict program. The parse program
      uses strparser to build messages and pass them to the verdict program.
      The parse program uses the normal strparser semantics. The verdict
      program is of type SK_SKB.
      
      The verdict program returns a verdict of SK_DROP or SK_REDIRECT for
      now. Additional actions may be added later. When SK_REDIRECT is
      returned, which is expected when the bpf program uses
      bpf_sk_redirect_map(), the sockmap logic will consult per-cpu
      variables set by the helper routine and pull the sock entry out of
      the sock map. This pattern follows the existing redirect logic in cls
      and xdp programs.
      
      This gives the flow,
      
       recv_sock -> str_parser (parse_prog) -> verdict_prog -> skb_send_sock
                                                           \
                                                            -> kfree_skb
      
      As an example use case, a message-based load balancer may use specific
      logic in the verdict program to select the sock to send on.

      Sample programs that illustrate the user interfaces are provided in
      later patches. Selftests are also in follow-on patches.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: export bpf_prog_inc_not_zero · a6f6df69
      Committed by John Fastabend
      bpf_prog_inc_not_zero will be used by upcoming sockmap patches. This
      patch simply exports it so we can pull it in.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 09 Aug, 2017 2 commits
  15. 30 Jul, 2017 2 commits
  16. 03 Jul, 2017 1 commit
  17. 02 Jul, 2017 1 commit
    • bpf: BPF support for sock_ops · 40304b2a
      Committed by Lawrence Brakmo
      Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
      struct that allows BPF programs of this type to access some of the
      socket's fields (such as IP addresses, ports, etc.). It uses the
      existing bpf cgroups infrastructure so the programs can be attached per
      cgroup with full inheritance support. The program will be called at
      appropriate times to set relevant connection parameters such as buffer
      sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
      as IP addresses, port numbers, etc.
      
      Although there are already 3 mechanisms to set parameters (sysctls,
      route metrics and setsockopts), this new mechanism provides some
      distinct advantages. Unlike sysctls, it can set parameters per
      connection. In contrast to route metrics, it can also use port numbers
      and information provided by a user level program. In addition, it could
      set parameters probabilistically for evaluation purposes (i.e. do
      something different on 10% of the flows and compare results with the
      other 90% of the flows). Also, in cases where IPv6 addresses contain
      geographic information, the rules to make changes based on the distance
      (or RTT) between the hosts are much easier than route metric rules and
      can be global. Finally, unlike setsockopt, it does not require
      application changes and it can be updated easily at any time.
      
      Although the bpf cgroup framework already contains a sock related
      program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
      (BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
      only once during the connection's lifetime. In contrast, the new
      program type will be called multiple times from different places in the
      network stack code.  For example, before sending SYN and SYN-ACKs to set
      an appropriate timeout, when the connection is established to set
      congestion control, etc. As a result it has an "op" field to specify the
      type of operation requested.
      
      The purpose of this new program type is to simplify setting connection
      parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
      easy to use Facebook's internal IPv6 addresses to determine if both hosts
      of a connection are in the same datacenter. Therefore, it is easy to
      write a BPF program to choose a small SYN RTO value when both hosts are
      in the same datacenter.
      
      This patch only contains the framework to support the new BPF program
      type; following patches add the functionality to set various connection
      parameters.
      
      This patch defines a new BPF program type: BPF_PROG_TYPE_SOCK_OPS
      and a new bpf syscall command to load a new program of this type:
      BPF_PROG_LOAD_SOCKET_OPS.
      
      Two new corresponding structs (one for the kernel, one for the user/BPF
      program):
      
      /* kernel version */
      struct bpf_sock_ops_kern {
              struct sock *sk;
              __u32  op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
      };
      
      /* user version
       * Some fields are in network byte order reflecting the sock struct
       * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
       * convert them to host byte order.
       */
      struct bpf_sock_ops {
              __u32 op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
              __u32 family;
              __u32 remote_ip4;     /* In network byte order */
              __u32 local_ip4;      /* In network byte order */
              __u32 remote_ip6[4];  /* In network byte order */
              __u32 local_ip6[4];   /* In network byte order */
              __u32 remote_port;    /* In network byte order */
              __u32 local_port;     /* In host byte order */
      };
      
      Currently there are two types of ops. The first type expects the BPF
      program to return a value which is then used by the caller (or a
      negative value to indicate the operation is not supported). The second
      type expects state changes to be done by the BPF program, for example
      through a setsockopt BPF helper function, and they ignore the return
      value.
      
      The reply fields of the bpf_sock_ops struct are there in case a bpf
      program needs to return a value larger than an integer.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 30 Jun, 2017 2 commits
  19. 07 Jun, 2017 6 commits
  20. 03 Jun, 2017 1 commit
  21. 12 May, 2017 1 commit
  22. 09 May, 2017 1 commit
  23. 25 Apr, 2017 1 commit
  24. 18 Apr, 2017 2 commits
    • bpf: fix checking xdp_adjust_head on tail calls · c2002f98
      Committed by Daniel Borkmann
      Commit 17bedab2 ("bpf: xdp: Allow head adjustment in XDP prog")
      added the xdp_adjust_head bit to the BPF prog in order to tell drivers
      that the program that is to be attached requires support for the XDP
      bpf_xdp_adjust_head() helper such that drivers not supporting this
      helper can reject the program. There are also drivers that do support
      the helper, but need to check for xdp_adjust_head bit in order to move
      packet metadata prepended by the firmware out of the way to make headroom.
      
      For these cases, the current check for xdp_adjust_head bit is insufficient
      since there can be cases where the program itself does not use the
      bpf_xdp_adjust_head() helper, but tail calls into another program that
      uses bpf_xdp_adjust_head(). As such, the xdp_adjust_head bit is still
      set to 0. Since the first program has no control over which program it
      calls into, we need to assume that bpf_xdp_adjust_head() helper is used
      upon tail calls. Thus, for the very same reasons as for cb_access, set
      the xdp_adjust_head bit to 1 when the main program uses tail calls.
      
      Fixes: 17bedab2 ("bpf: xdp: Allow head adjustment in XDP prog")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix cb access in socket filter programs on tail calls · 6b1bb01b
      Committed by Daniel Borkmann
      Commit ff936a04 ("bpf: fix cb access in socket filter programs")
      added a fix for socket filter programs such that in i) AF_PACKET the
      20 bytes of skb->cb[] area gets zeroed before use in order to not leak
      data, and ii) socket filter programs attached to TCP/UDP sockets need
      to save/restore these 20 bytes since they are also used by protocol
      layers at that time.
      
      The problem is that bpf_prog_run_save_cb() and bpf_prog_run_clear_cb()
      only look at the actual attached program to determine whether to zero
      or save/restore the skb->cb[] parts. There can be cases where the
      actual attached program does not access the skb->cb[], but the program
      tail calls into another program which does access this area. In such
      a case, the zero or save/restore is currently not performed.
      
      Since the programs we tail call into are unknown at verification time
      and can dynamically change, we need to assume that whenever the attached
      program performs a tail call, later programs could access the
      skb->cb[], and therefore we need to always set cb_access to 1.
      
      Fixes: ff936a04 ("bpf: fix cb access in socket filter programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 12 Apr, 2017 1 commit