1. 05 Jul, 2017 1 commit
  2. 03 Jul, 2017 4 commits
    • bpf, verifier: add additional patterns to evaluate_reg_imm_alu · 43188702
      Committed by John Fastabend
      Currently the verifier does not track imm across alu operations when
      the source register is of unknown type. This adds additional pattern
      matching to catch this and track imm. We've seen LLVM generating this
      pattern while working on cilium.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: extend bpf_trace_printk to support %i · 7bda4b40
      Committed by John Fastabend
      Currently, bpf_trace_printk does not support the common format
      specifier '%i', although vsprintf, which the bpf helper eventually
      calls, does. If users are used to '%i' and make use of it,
      bpf_trace_printk just returns an error without dumping anything
      to the trace pipe, so add support for '%i' to the helper.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
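The idea of accepting '%i' wherever '%d' is accepted can be sketched as a small format-string check. This is only an illustration under stated assumptions: the function name `fmt_is_accepted` and the set of accepted conversions are hypothetical and much simpler than what the kernel helper really validates.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified check of a printk-style format string.
 * It accepts only %d, %i, %u, %x and %%. This is NOT the kernel's
 * validation logic, only an illustration of allowing '%i' next to '%d'.
 */
static bool fmt_is_accepted(const char *fmt)
{
	for (size_t i = 0; fmt[i] != '\0'; i++) {
		if (fmt[i] != '%')
			continue;
		i++;	/* look at the conversion character */
		switch (fmt[i]) {
		case 'd':
		case 'i':	/* treated exactly like 'd' */
		case 'u':
		case 'x':
		case '%':
			break;
		default:
			return false;	/* also catches a trailing lone '%' */
		}
	}
	return true;
}
```

With a check like this, a format such as "val=%i\n" passes, while an unknown conversion is rejected before anything is printed.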
    • bpf: export whether tail call has jited owner · 9780c0ab
      Committed by Daniel Borkmann
      We already export through fdinfo whether a prog is JITed or not.
      Since a program load can fail when only one of the prog and the
      tail call map has the JITed property (both must be JITed, or both
      not JITed), we can facilitate error reporting in loaders like
      iproute2 by also exporting owner_jited of the tail call map. We
      already export owner_prog_type through this facility, so a parser
      can pick up both for comparison.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
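A loader-side comparison of the two exported values might look roughly like the sketch below. It assumes fdinfo text made of "key:<tab>value" lines and assumes the prog side exposes a `prog_jited` key; the helper names `fdinfo_get_uint` and `jit_state_matches` are hypothetical and not taken from iproute2.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find "<key>:" at the start of a line in fdinfo-style text and parse
 * the unsigned value after it. Returns 0 on success, -1 otherwise.
 */
static int fdinfo_get_uint(const char *text, const char *key, unsigned int *out)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, key, klen) == 0 && line[klen] == ':')
			return sscanf(line + klen + 1, "%u", out) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline to the next line */
	}
	return -1;
}

/*
 * Returns 1 if prog and tail call map agree on their JIT state,
 * 0 if they differ, and -1 if either value could not be read.
 * The key names are assumptions about the fdinfo layout.
 */
static int jit_state_matches(const char *prog_fdinfo, const char *map_fdinfo)
{
	unsigned int prog_jited, owner_jited;

	if (fdinfo_get_uint(prog_fdinfo, "prog_jited", &prog_jited) ||
	    fdinfo_get_uint(map_fdinfo, "owner_jited", &owner_jited))
		return -1;
	return prog_jited == owner_jited;
}
```

A loader could use a result of 0 to print a targeted message about the JIT mismatch instead of a generic load failure.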
    • bpf: simplify narrower ctx access · f96da094
      Committed by Daniel Borkmann
      This work tries to make the semantics and code around the
      narrower ctx access a bit easier to follow. Right now
      everything is done inside the .is_valid_access(). Offset
      matching is done differently for read/write types: writes
      don't support narrower access, so matching only on
      offsetof(struct foo, bar) is enough, whereas for the read
      case, which supports narrower access, we must check the range
      offsetof(struct foo, bar) ... offsetof(struct foo, bar) +
      sizeof(<bar>) - 1 for each of the cases. For read cases of
      individual members that don't support narrower access (like
      packet pointers or the skb->cb[] case, which has its own narrow
      access logic), we check as usual only offsetof(struct foo,
      bar) like in the write case. Then, for the case where narrower
      access is allowed, we also need to set the aux info for the
      access, meaning ctx_field_size and converted_op_size have
      to be set. The first is the original field size, e.g.
      sizeof(<bar>) as in the above example from the user facing ctx,
      and the latter is the target size after the actual rewrite
      happened, thus for the kernel facing ctx. Also here we need the
      range match, and we need to keep track when changing
      convert_ctx_access() and converted_op_size from
      is_valid_access(), as both are not at the same location.
      
      We can simplify the code a bit: check_ctx_access() becomes
      simpler in that we only store ctx_field_size as metadata,
      and later in convert_ctx_accesses() we fetch the target_size
      right from the location where we do the conversion. Should the
      verifier be misconfigured, we reject BPF_WRITE cases or cases
      where target_size is not provided. For the subsystems, we always
      work on ranges in is_valid_access() and add small helpers for
      ranges and narrow access; convert_ctx_accesses() sets target_size
      for the relevant instruction.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Cc: Yonghong Song <yhs@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
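The difference between exact-offset matching for writes and range matching for narrower reads can be sketched as follows. `struct foo`, the `ACCESS_IN_FIELD` macro and `foo_is_valid_access()` are hypothetical stand-ins for illustration, not the helpers the patch adds, and only the member `bar` is treated as accessible.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-facing context, standing in for "struct foo". */
struct foo {
	uint32_t len;
	uint32_t bar;
	uint32_t baz;
};

/* True if the byte range [off, off + size) lies entirely within the member. */
#define ACCESS_IN_FIELD(type, member, off, size)			\
	((off) >= offsetof(type, member) &&				\
	 (off) + (size) <= offsetof(type, member) + sizeof(((type *)0)->member))

static bool foo_is_valid_access(size_t off, size_t size, bool is_write)
{
	/* Only naturally aligned 1-, 2- or 4-byte accesses are considered. */
	if (size != 1 && size != 2 && size != 4)
		return false;
	if (off % size != 0)
		return false;

	/* Writes: no narrower access, so exact offset and full width only. */
	if (is_write)
		return off == offsetof(struct foo, bar) &&
		       size == sizeof(uint32_t);

	/* Reads: any aligned load that stays inside 'bar' is accepted. */
	return ACCESS_IN_FIELD(struct foo, bar, off, size);
}
```

In this sketch a 2-byte read at offset 6 is accepted because it falls inside `bar`, whereas a 2-byte write at offset 4 is rejected because writes must cover the full field.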
  3. 02 Jul, 2017 1 commit
    • bpf: BPF support for sock_ops · 40304b2a
      Committed by Lawrence Brakmo
      Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
      struct that allows BPF programs of this type to access some of the
      socket's fields (such as IP addresses, ports, etc.). It uses the
      existing bpf cgroups infrastructure so the programs can be attached per
      cgroup with full inheritance support. The program will be called at
      appropriate times to set relevant connection parameters such as buffer
      sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
      as IP addresses, port numbers, etc.
      
      Although there are already 3 mechanisms to set parameters (sysctls,
      route metrics and setsockopts), this new mechanism provides some
      distinct advantages. Unlike sysctls, it can set parameters per
      connection. In contrast to route metrics, it can also use port numbers
      and information provided by a user level program. In addition, it could
      set parameters probabilistically for evaluation purposes (i.e. do
      something different on 10% of the flows and compare results with the
      other 90% of the flows). Also, in cases where IPv6 addresses contain
      geographic information, the rules to make changes based on the distance
      (or RTT) between the hosts are much easier than route metric rules and
      can be global. Finally, unlike setsockopt, it does not require
      application changes and it can be updated easily at any time.
      
      Although the bpf cgroup framework already contains a sock related
      program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
      (BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
      only once during the connection's lifetime. In contrast, the new
      program type will be called multiple times from different places in the
      network stack code.  For example, before sending SYN and SYN-ACKs to set
      an appropriate timeout, when the connection is established to set
      congestion control, etc. As a result it has an "op" field to specify the
      type of operation requested.
      
      The purpose of this new program type is to simplify setting connection
      parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
      easy to use Facebook's internal IPv6 addresses to determine if both hosts
      of a connection are in the same datacenter. Therefore, it is easy to
      write a BPF program to choose a small SYN RTO value when both hosts are
      in the same datacenter.
      
      This patch only contains the framework to support the new BPF program
      type; following patches add the functionality to set various connection
      parameters.
      
      This patch defines a new BPF program type: BPF_PROG_TYPE_SOCK_OPS
      and a new bpf syscall command to load a new program of this type:
      BPF_PROG_LOAD_SOCKET_OPS.
      
      Two new corresponding structs (one for the kernel, one for the user/BPF
      program):
      
      /* kernel version */
      struct bpf_sock_ops_kern {
              struct sock *sk;
              __u32  op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
      };
      
      /* user version
       * Some fields are in network byte order reflecting the sock struct
       * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
       * convert them to host byte order.
       */
      struct bpf_sock_ops {
              __u32 op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
              __u32 family;
              __u32 remote_ip4;     /* In network byte order */
              __u32 local_ip4;      /* In network byte order */
              __u32 remote_ip6[4];  /* In network byte order */
              __u32 local_ip6[4];   /* In network byte order */
              __u32 remote_port;    /* In network byte order */
              __u32 local_port;     /* In host byte order */
      };
      
      Currently there are two types of ops. The first type expects the BPF
      program to return a value which is then used by the caller (or a
      negative value to indicate the operation is not supported). The second
      type expects state changes to be done by the BPF program, for example
      through a setsockopt BPF helper function, and they ignore the return
      value.
      
      The reply fields of the bpf_sock_ops struct are there in case a bpf
      program needs to return a value larger than an integer.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
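The first kind of op described above, where the program returns a value or a negative number for "not supported", can be sketched in plain C. Everything beyond the struct layout quoted in the commit message is an assumption made for illustration: the op value `EXAMPLE_OP_SYN_RTO`, the returned RTO value of 10, and the rule that two hosts share a datacenter when the first 64 bits of their IPv6 addresses match.

```c
#include <stdint.h>
#include <string.h>

/* Local copy of the user-facing struct from the commit message. */
struct bpf_sock_ops {
	uint32_t op;
	union {
		uint32_t reply;
		uint32_t replylong[4];
	};
	uint32_t family;
	uint32_t remote_ip4;	/* network byte order */
	uint32_t local_ip4;	/* network byte order */
	uint32_t remote_ip6[4];	/* network byte order */
	uint32_t local_ip6[4];	/* network byte order */
	uint32_t remote_port;	/* network byte order */
	uint32_t local_port;	/* host byte order */
};

enum { EXAMPLE_OP_SYN_RTO = 1 };	/* hypothetical op value */
#define EXAMPLE_AF_INET6 10		/* value of AF_INET6 on Linux */

/*
 * Hypothetical handler: return a small SYN RTO when both hosts appear
 * to be in the same datacenter, and a negative value otherwise so the
 * caller keeps its default.
 */
static int example_sock_ops(const struct bpf_sock_ops *ctx)
{
	if (ctx->op != EXAMPLE_OP_SYN_RTO)
		return -1;	/* operation not supported */

	/* Assumption: the first two 32-bit words identify the datacenter. */
	if (ctx->family == EXAMPLE_AF_INET6 &&
	    memcmp(ctx->remote_ip6, ctx->local_ip6, 2 * sizeof(uint32_t)) == 0)
		return 10;	/* hypothetical small SYN RTO value */

	return -1;
}
```

Because the prefix comparison only tests equality, it works directly on the network-byte-order fields without conversion.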
  4. 30 Jun, 2017 14 commits
  5. 29 Jun, 2017 5 commits
    • ftrace: Fix regression with module command in stack_trace_filter · 0f179765
      Committed by Steven Rostedt (VMware)
      Running the following command:
      
       # echo ":mod:kvm_intel" > /sys/kernel/tracing/stack_trace_filter
      
      triggered a crash.
      
      This happened with the clean up of probes. It required all callers to the
      regex function (doing ftrace filtering) to have ops->private be a pointer to
      a trace_array. But for the stack tracer, that is not the case.
      
      Allow for the ops->private to be NULL, and change the function command
      callbacks to handle the trace_array pointer being NULL as well.
      
      Fixes: d2afd57a ("tracing/ftrace: Allow instances to have their own function probes")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • sched/numa: Hide numa_wake_affine() from UP build · ff801b71
      Committed by Thomas Gleixner
      Stephen reported the following build warning in UP:
      
      kernel/sched/fair.c:2657:9: warning: 'struct sched_domain' declared inside
      parameter list
               ^
      /home/sfr/next/next/kernel/sched/fair.c:2657:9: warning: its scope is only this
      definition or declaration, which is probably not what you want
      
      Hide the numa_wake_affine() inline stub on UP builds to get rid of it.
      
      Fixes: 3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • timers: Make the cpu base lock raw · 2287d866
      Committed by Sebastian Andrzej Siewior
      The timers cpu base lock could not be converted to a raw spinlock because
      the lock held time was non-deterministic due to cascading and long lasting
      timer wheel traversals.
      
      The rework of the timer wheel to the new non-cascading model also removed
      the wheel traversals, and the lock held times are deterministic now. This
      allows making the lock raw and thereby unbreaks NOHz* on preempt-RT.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: http://lkml.kernel.org/r/20170627161538.30257-1-bigeasy@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • cgroup: implement "nsdelegate" mount option · 5136f636
      Committed by Tejun Heo
      Currently, cgroup only supports delegation to !root users and cgroup
      namespaces don't get any special treatment.  This limits the
      usefulness of cgroup namespaces as they by themselves can't be safe
      delegation boundaries.  A process inside a cgroup can change the
      resource control knobs of the parent in the namespace root and may
      move processes in and out of the namespace if cgroups outside its
      namespace are visible somehow.
      
      This patch adds a new mount option "nsdelegate" which makes cgroup
      namespaces delegation boundaries.  If set, cgroup behaves as if write
      permission based delegation took place at namespace boundaries -
      writes to the resource control knobs from the namespace root are
      denied and migrations crossing the namespace boundary aren't allowed
      from inside the namespace.
      
      This allows cgroup namespace to function as a delegation boundary by
      itself.
      
      v2: Silently ignore nsdelegate specified on !init mounts.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Aravind Anbudurai <aru7@fb.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
    • cgroup: restructure cgroup_procs_write_permission() · 824ecbe0
      Committed by Tejun Heo
      Restructure cgroup_procs_write_permission() to make extending
      permission logic easier.
      
      This patch doesn't cause any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  6. 28 Jun, 2017 6 commits
  7. 27 Jun, 2017 4 commits
  8. 26 Jun, 2017 5 commits