1. 02 Jul 2017, 1 commit
    • bpf: BPF support for sock_ops · 40304b2a
      Authored by Lawrence Brakmo
      Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
      struct that allows BPF programs of this type to access some of the
      socket's fields (such as IP addresses, ports, etc.). It uses the
      existing bpf cgroups infrastructure so the programs can be attached per
      cgroup with full inheritance support. The program will be called at
      appropriate times to set relevant connection parameters such as buffer
      sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
      as IP addresses, port numbers, etc.
      
      Although there are already 3 mechanisms to set parameters (sysctls,
      route metrics and setsockopts), this new mechanism provides some
      distinct advantages. Unlike sysctls, it can set parameters per
      connection. In contrast to route metrics, it can also use port numbers
      and information provided by a user level program. In addition, it could
      set parameters probabilistically for evaluation purposes (e.g. do
      something different on 10% of the flows and compare results with the
      other 90% of the flows). Also, in cases where IPv6 addresses contain
      geographic information, the rules to make changes based on the distance
      (or RTT) between the hosts are much easier to express than route metric
      rules, and they can be global. Finally, unlike setsockopt, it does not
      require application changes and it can be updated easily at any time.
      
      Although the bpf cgroup framework already contains a sock related
      program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
      (BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
      only once during the connection's lifetime. In contrast, the new
      program type will be called multiple times from different places in the
      network stack code.  For example, before sending SYN and SYN-ACKs to set
      an appropriate timeout, when the connection is established to set
      congestion control, etc. As a result it has an "op" field to specify the
      type of operation requested.
      
      The purpose of this new program type is to simplify setting connection
      parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
      easy to use Facebook's internal IPv6 addresses to determine if both hosts
      of a connection are in the same datacenter. Therefore, it is easy to
      write a BPF program to choose a small SYN RTO value when both hosts are
      in the same datacenter.
      
      This patch only contains the framework to support the new BPF program
      type; follow-up patches add the functionality to set various connection
      parameters.
      
      This patch defines a new BPF program type: BPF_PROG_TYPE_SOCK_OPS
      and a new bpf syscall command to load a new program of this type:
      BPF_PROG_LOAD_SOCKET_OPS.
      
      Two new corresponding structs (one for the kernel, one for the user/BPF
      program):
      
      /* kernel version */
      struct bpf_sock_ops_kern {
              struct sock *sk;
              __u32  op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
      };
      
      /* user version
       * Some fields are in network byte order reflecting the sock struct
       * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
       * convert them to host byte order.
       */
      struct bpf_sock_ops {
              __u32 op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
              __u32 family;
              __u32 remote_ip4;     /* In network byte order */
              __u32 local_ip4;      /* In network byte order */
              __u32 remote_ip6[4];  /* In network byte order */
              __u32 local_ip6[4];   /* In network byte order */
              __u32 remote_port;    /* In network byte order */
        __u32 local_port;     /* In host byte order */
      };
      
      Currently there are two types of ops. The first type expects the BPF
      program to return a value which is then used by the caller (or a
      negative value to indicate the operation is not supported). The second
      type expects state changes to be done by the BPF program, for example
      through a setsockopt BPF helper function, and the return value is
      ignored.
      
      The reply fields of the bpf_sock_ops struct are there in case a bpf
      program needs to return a value larger than an integer.
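
      As an illustration only (not part of this patch: the BPF_SOCK_OPS_*
      op constants and helpers land in follow-up patches of this series,
      and the IPv6 prefix below is made up), a program of the new type
      could look roughly like:

        /* minimal sketch of a sock_ops program */
        #include <uapi/linux/bpf.h>
        #include "bpf_helpers.h"
        #include "bpf_endian.h"

        SEC("sockops")
        int set_conn_params(struct bpf_sock_ops *skops)
        {
                int rv = -1;    /* negative: operation not supported */

                switch (skops->op) {
                case BPF_SOCK_OPS_TIMEOUT_INIT:
                        /* hypothetical "same datacenter" IPv6 prefix check;
                         * remote_ip6 is in network byte order, hence bpf_ntohl()
                         */
                        if ((bpf_ntohl(skops->remote_ip6[0]) >> 16) == 0x2604)
                                rv = 10;        /* short SYN RTO */
                        break;
                default:
                        rv = -1;
                }
                skops->reply = rv;      /* value read back by the caller */
                return 1;
        }
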
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      40304b2a
  2. 30 Jun 2017, 3 commits
    • bpf: prevent leaking pointer via xadd on unprivileged · 6bdf6abc
      Authored by Daniel Borkmann
      Leaking kernel addresses to unprivileged users is generally disallowed;
      for example, the verifier rejects the following:
      
        0: (b7) r0 = 0
        1: (18) r2 = 0xffff897e82304400
        3: (7b) *(u64 *)(r1 +48) = r2
        R2 leaks addr into ctx
      
      Doing pointer arithmetic on them is also forbidden, so that they
      don't turn into an unknown value and then get leaked out. However,
      there's xadd as a special case, where we don't check the src reg
      for being a pointer register, e.g. the following will pass:
      
        0: (b7) r0 = 0
        1: (7b) *(u64 *)(r1 +48) = r0
        2: (18) r2 = 0xffff897e82304400 ; map
        4: (db) lock *(u64 *)(r1 +48) += r2
        5: (95) exit
      
      We could store the pointer into skb->cb, lose the type context,
      and then read it out from there again to eventually leak it out
      of a map value. Or, more easily, in a different variant:
      
         0: (bf) r6 = r1
         1: (7a) *(u64 *)(r10 -8) = 0
         2: (bf) r2 = r10
         3: (07) r2 += -8
         4: (18) r1 = 0x0
         6: (85) call bpf_map_lookup_elem#1
         7: (15) if r0 == 0x0 goto pc+3
         R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R6=ctx R10=fp
         8: (b7) r3 = 0
         9: (7b) *(u64 *)(r0 +0) = r3
        10: (db) lock *(u64 *)(r0 +0) += r6
        11: (b7) r0 = 0
        12: (95) exit
      
        from 7 to 11: R0=inv,min_value=0,max_value=0 R6=ctx R10=fp
        11: (b7) r0 = 0
        12: (95) exit
      
      Prevent this by checking the xadd src reg for pointer types. Also
      add a couple of test cases related to this.
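
      In sketch form, the added guard in the verifier's xadd handling is
      (simplified; surrounding code omitted):

        if (is_pointer_value(env, insn->src_reg)) {
                verbose("R%d leaks addr into mem\n", insn->src_reg);
                return -EACCES;
        }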
      
      Fixes: 1be7f75d ("bpf: enable non-root eBPF programs")
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6bdf6abc
    • bpf: Fix out-of-bound access on interpreters[] · 8007e40a
      Authored by Martin KaFai Lau
      The index into interpreters[] is off-by-one when fp->aux->stack_depth
      has already been rounded up to a multiple of 32. In particular,
      if stack_depth is 512, the index will be 16, one past the end of the
      16-entry array.

      The fix is to round_up() and then subtract 1, instead of round_down().
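
      In code, the change amounts to roughly (details approximate):

        /* before: index 16 for stack_depth 512, one past the end */
        fp->bpf_func = interpreters[round_down(stack_depth, 32) / 32];

        /* after: round up, then subtract 1; stack_depth 512 -> index 15 */
        fp->bpf_func = interpreters[(round_up(stack_depth, 32) / 32) - 1];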
      
      [   22.318680] ==================================================================
      [   22.319745] BUG: KASAN: global-out-of-bounds in bpf_prog_select_runtime+0x48a/0x670
      [   22.320737] Read of size 8 at addr ffffffff82aadae0 by task sockex3/1946
      [   22.321646]
      [   22.321858] CPU: 1 PID: 1946 Comm: sockex3 Tainted: G        W       4.12.0-rc6-01680-g2ee87db3 #22
      [   22.323061] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
      [   22.324260] Call Trace:
      [   22.324612]  dump_stack+0x67/0x99
      [   22.325081]  print_address_description+0x1e8/0x290
      [   22.325734]  ? bpf_prog_select_runtime+0x48a/0x670
      [   22.326360]  kasan_report+0x265/0x350
      [   22.326860]  __asan_report_load8_noabort+0x19/0x20
      [   22.327484]  bpf_prog_select_runtime+0x48a/0x670
      [   22.328109]  bpf_prog_load+0x626/0xd40
      [   22.328637]  ? __bpf_prog_charge+0xc0/0xc0
      [   22.329222]  ? check_nnp_nosuid.isra.61+0x100/0x100
      [   22.329890]  ? __might_fault+0xf6/0x1b0
      [   22.330446]  ? lock_acquire+0x360/0x360
      [   22.331013]  SyS_bpf+0x67c/0x24d0
      [   22.331491]  ? trace_hardirqs_on+0xd/0x10
      [   22.332049]  ? __getnstimeofday64+0xaf/0x1c0
      [   22.332635]  ? bpf_prog_get+0x20/0x20
      [   22.333135]  ? __audit_syscall_entry+0x300/0x600
      [   22.333770]  ? syscall_trace_enter+0x540/0xdd0
      [   22.334339]  ? exit_to_usermode_loop+0xe0/0xe0
      [   22.334950]  ? do_syscall_64+0x48/0x410
      [   22.335446]  ? bpf_prog_get+0x20/0x20
      [   22.335954]  do_syscall_64+0x181/0x410
      [   22.336454]  entry_SYSCALL64_slow_path+0x25/0x25
      [   22.337121] RIP: 0033:0x7f263fe81f19
      [   22.337618] RSP: 002b:00007ffd9a3440c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
      [   22.338619] RAX: ffffffffffffffda RBX: 0000000000aac5fb RCX: 00007f263fe81f19
      [   22.339600] RDX: 0000000000000030 RSI: 00007ffd9a3440d0 RDI: 0000000000000005
      [   22.340470] RBP: 0000000000a9a1e0 R08: 0000000000a9a1e0 R09: 0000009d00000001
      [   22.341430] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000010000
      [   22.342411] R13: 0000000000a9a023 R14: 0000000000000001 R15: 0000000000000003
      [   22.343369]
      [   22.343593] The buggy address belongs to the variable:
      [   22.344241]  interpreters+0x80/0x980
      [   22.344708]
      [   22.344908] Memory state around the buggy address:
      [   22.345556]  ffffffff82aad980: 00 00 00 04 fa fa fa fa 04 fa fa fa fa fa fa fa
      [   22.346449]  ffffffff82aada00: 00 00 00 00 00 fa fa fa fa fa fa fa 00 00 00 00
      [   22.347361] >ffffffff82aada80: 00 00 00 00 00 00 00 00 00 00 00 00 fa fa fa fa
      [   22.348301]                                                        ^
      [   22.349142]  ffffffff82aadb00: 00 01 fa fa fa fa fa fa 00 00 00 00 00 00 00 00
      [   22.350058]  ffffffff82aadb80: 00 00 07 fa fa fa fa fa 00 00 05 fa fa fa fa fa
      [   22.350984] ==================================================================
      
      Fixes: b870aa90 ("bpf: use different interpreter depending on required stack size")
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8007e40a
    • bpf: Add syscall lookup support for fd array and htab · 14dc6f04
      Authored by Martin KaFai Lau
      This patch allows userspace to do BPF_MAP_LOOKUP_ELEM on
      BPF_MAP_TYPE_PROG_ARRAY,
      BPF_MAP_TYPE_ARRAY_OF_MAPS and
      BPF_MAP_TYPE_HASH_OF_MAPS.
      
      The lookup returns a prog-id or map-id to userspace.
      Userspace can then use BPF_PROG_GET_FD_BY_ID
      or BPF_MAP_GET_FD_BY_ID to get a fd (see the sketch below).
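
      A hypothetical userspace flow (the wrapper names mimic tools/lib/bpf
      and are illustrative only):

        __u32 key = 0, prog_id;
        int prog_fd;

        if (bpf_map_lookup_elem(array_fd, &key, &prog_id) == 0)
                /* turn the returned id back into a usable fd */
                prog_fd = bpf_prog_get_fd_by_id(prog_id);
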
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14dc6f04
  3. 24 Jun 2017, 1 commit
    • bpf: possibly avoid extra masking for narrower load in verifier · 23994631
      Authored by Yonghong Song
      Commit 31fd8581 ("bpf: permits narrower load from bpf program
      context fields") permits narrower load for certain ctx fields.
      However, that commit generates a masking operation even when the
      prog-specific ctx conversion already produces a result of the
      narrower size.
      
      For example, for __sk_buff->protocol, the ctx conversion
      loads the data into a register with a 2-byte load.
      A narrower 2-byte load should not generate masking.
      For __sk_buff->vlan_present, the conversion function
      sets the result to either 0 or 1, essentially a byte.
      A narrower 2-byte or 1-byte load should not generate masking either.
      
      To avoid unnecessary masking, the prog-specific *_is_valid_access
      now passes converted_op_size back to the verifier, which indicates
      the valid data width after the anticipated conversion.
      Based on this information, the verifier is able to avoid
      the unnecessary masking.
      
      Since we now want more information back from the prog-specific
      *_is_valid_access checks, the returned fields are packed into
      one data structure for clarity.
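
      The packed structure looks roughly like this (layout approximate):

        /* filled in by the prog-specific *_is_valid_access() callbacks */
        struct bpf_insn_access_aux {
                enum bpf_reg_type reg_type;
                int ctx_field_size;
                int converted_op_size;
        };
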
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      23994631
  4. 21 Jun 2017, 1 commit
  5. 20 Jun 2017, 3 commits
    • livepatch: Fix stacking of patches with respect to RCU · 842c0884
      Authored by Petr Mladek
      rcu_read_(un)lock(), list_*_rcu(), and synchronize_rcu() are used for safe
      access to, and manipulation of, the list of patches that modify the same
      function. In particular, they protect the variable func_stack, which is
      accessible from the ftrace handler via struct ftrace_ops and klp_ops.
      
      Of course, this mechanism also synchronizes some state of the patch on top
      of the stack, e.g. func->transition in klp_ftrace_handler().
      
      At the same time, it also guards the manipulation of task->patch_state,
      which is modified according to the state of the transition and the state
      of the process.
      
      Now, all this works well as long as RCU works well. Sadly, livepatching
      can hit corner cases where this is not true. For example, RCU is not
      watching when rcu_read_lock() is taken in idle threads, because idle
      threads might sleep and prevent the grace period from completing for too
      long.
      
      There are ways to make RCU watch even in idle threads, see
      rcu_irq_enter(). But there is a small window inside the RCU
      infrastructure where even this does not work.
      
      This small problematic window can be detected either before calling
      rcu_irq_enter(), by rcu_irq_enter_disabled(), or later by
      rcu_is_watching(). Sadly, there is no safe way to handle it. Once we
      detect that RCU was not watching, we might see an inconsistent state of
      the function stack and the related variables in klp_ftrace_handler().
      Then we could make a wrong decision, use an incompatible implementation
      of the function, and break the consistency of the system. We could warn,
      but we could not avoid the damage.
      
      Fortunately, ftrace has similar problems, and they seem to be solved well
      there. It uses a heavy-weight implementation of some RCU operations. In
      particular, it replaces:
      
        + rcu_read_lock() with preempt_disable_notrace()
        + rcu_read_unlock() with preempt_enable_notrace()
        + synchronize_rcu() with schedule_on_each_cpu(sync_work)
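
      Transposed to livepatch, the replacement can be as small as the
      following sketch (helper names illustrative, mirroring ftrace_sync()):

        static void klp_sync(struct work_struct *work)
        {
                /*
                 * Empty. schedule_on_each_cpu() returns only after this
                 * dummy work item has run on every CPU, i.e. after every
                 * CPU has left its preempt-disabled "read section".
                 */
        }

        static void klp_synchronize_transition(void)
        {
                schedule_on_each_cpu(klp_sync);
        }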
      
      My understanding is that this is an RCU implementation from the stone
      age. It meets the core RCU requirements but is rather inefficient; in
      particular, it does not allow batching or speeding up the synchronize
      calls.

      On the other hand, it is very simple. It allows safely tracing and/or
      livepatching even the RCU core infrastructure. And the inefficiency is
      not a big issue, because using ftrace or livepatches on production
      systems is a rare operation. Safety is much more important than a
      negligible extra load.
      
      Note that the alternative implementation still follows the RCU
      principles. Therefore, we could, and actually must, use the list_*_rcu()
      variants when manipulating func_stack. These functions access the
      pointers in the right order and with the right barriers, but they do not
      use any other information that would be set only by rcu_read_lock().
      
      Also note that there are actually two problems solved in ftrace:
      
      First, it cares about the consistency of RCU read sections. This is
      solved in the way described above and used by this patch.
      
      Second, ftrace needs to make sure that nobody is inside the dynamic
      trampoline when it is being freed. For this, it also calls
      synchronize_rcu_tasks() in preemptible kernels in ftrace_shutdown().
      
      Livepatch has a similar problem, but it is solved by ftrace for free.
      klp_ftrace_handler() never sleeps. In addition, it is registered with
      FTRACE_OPS_FL_DYNAMIC, which causes unregister_ftrace_function() to call:

      	* schedule_on_each_cpu(ftrace_sync) - always
      	* synchronize_rcu_tasks() - in preemptible kernels
      
      The effect is that nobody is inside the dynamic trampoline or the ftrace
      handler after unregister_ftrace_function() returns.
      
      [jkosina@suse.cz: reformat changelog, fix comment]
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      842c0884
    • time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting · 3d88d56c
      Authored by John Stultz
      Due to how the MONOTONIC_RAW accumulation logic was handled,
      there is the potential for a 1ns discontinuity when we do
      accumulations. This small discontinuity has for the most part
      gone un-noticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW
      in their vDSO clock_gettime implementation, we've seen failures
      with the inconsistency-check test in kselftest.
      
      This patch addresses the issue by using the same sub-ns
      accumulation handling that CLOCK_MONOTONIC uses, which avoids
      the issue for in-kernel users.
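
      Conceptually, the accumulation step keeps the shifted sub-nanosecond
      remainder instead of truncating it (field names approximate):

        tk->tkr_raw.xtime_nsec += tk->raw_interval << shift;
        snsec_per_sec = (u64)NSEC_PER_SEC << tk->tkr_raw.shift;
        while (tk->tkr_raw.xtime_nsec >= snsec_per_sec) {
                tk->tkr_raw.xtime_nsec -= snsec_per_sec;
                tk->raw_time.tv_sec++;
        }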
      
      Since the ARM64 vDSO implementation has its own clock_gettime
      calculation logic, this patch reduces the frequency of errors,
      but failures are still seen. The ARM64 vDSO will need to be
      updated to include the sub-nanosecond xtime_nsec values in its
      calculation for this issue to be completely fixed.
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Tested-by: Daniel Mentz <danielmentz@google.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Stephen Boyd <stephen.boyd@linaro.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: stable #4.8+ <stable@vger.kernel.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Link: http://lkml.kernel.org/r/1496965462-20003-3-git-send-email-john.stultz@linaro.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      3d88d56c
    • time: Fix clock->read(clock) race around clocksource changes · ceea5e37
      Authored by John Stultz
      In tests that exercise switching of clocksources, a NULL
      pointer dereference can be observed on ARM64 platforms in the
      clocksource read() function:
      
      u64 clocksource_mmio_readl_down(struct clocksource *c)
      {
      	return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask;
      }
      
      This is called from the core timekeeping code via:
      
      	cycle_now = tkr->read(tkr->clock);
      
      tkr->read is the cached tkr->clock->read() function pointer.
      When the clocksource is changed, tkr->clock and tkr->read
      are updated sequentially. The code above results in a sequential
      load operation of tkr->read and tkr->clock as well.
      
      If the store to tkr->clock hits between the loads of tkr->read
      and tkr->clock, then the old read() function is called with the
      new clock pointer. As a consequence the read() function
      dereferences a different data structure and the resulting 'reg'
      pointer can point anywhere including NULL.
      
      This problem was introduced when the timekeeping code was
      switched over to use struct tk_read_base. Before that, it was
      theoretically possible as well, if the compiler decided to
      reload clock in the code sequence:
      
           now = tk->clock->read(tk->clock);
      
      Add a helper function which avoids the issue by reading
      tk_read_base->clock once into a local variable clk and then issuing
      the read function via clk->read(clk). This guarantees that the
      read() function always gets the proper clocksource pointer handed
      in.
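
      In sketch form, the helper reads the clocksource pointer exactly once,
      so both the function pointer and its argument come from the same
      structure:

        static inline u64 tk_clock_read(const struct tk_read_base *tkr)
        {
                struct clocksource *clock = READ_ONCE(tkr->clock);

                return clock->read(clock);
        }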
      
      Since there is now no use for the tkr.read pointer, this patch
      also removes it. To address stopping the fast timekeeper
      during suspend/resume, it introduces a dummy clocksource to use
      rather than just a dummy read function.
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Stephen Boyd <stephen.boyd@linaro.org>
      Cc: stable <stable@vger.kernel.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Daniel Mentz <danielmentz@google.com>
      Link: http://lkml.kernel.org/r/1496965462-20003-2-git-send-email-john.stultz@linaro.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ceea5e37
  6. 18 Jun 2017, 1 commit
    • signal: Only reschedule timers on signals timers have sent · 57db7e4a
      Authored by Eric W. Biederman
      Thomas Gleixner wrote:
      > The CRIU support added a 'feature' which allows a user space task to send
      > arbitrary (kernel) signals to itself. The changelog says:
      >
      >   The kernel prevents sending of siginfo with positive si_code, because
      >   these codes are reserved for kernel.  I think we can allow a task to
      >   send such a siginfo to itself.  This operation should not be dangerous.
      >
      > Quite contrary to that claim, it turns out that it is outright dangerous
      > for signals with info->si_code == SI_TIMER. The following code sequence in
      > a user space task can crash the kernel:
      >
      >    id = timer_create(CLOCK_XXX, ..... signo = SIGX);
      >    timer_set(id, ....);
      >    info->si_signo = SIGX;
      >    info->si_code = SI_TIMER;
      >    info->_sifields._timer._tid = id;
      >    info->_sifields._timer._sys_private = 2;
      >    rt_[tg]sigqueueinfo(..., SIGX, info);
      >    sigemptyset(&sigset);
      >    sigaddset(&sigset, SIGX);
      >    rt_sigtimedwait(sigset, info);
      >
      > For timers based on CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID this
      > results in a kernel crash because sigwait() dequeues the signal and the
      > dequeue code observes:
      >
      >   info->si_code == SI_TIMER && info->_sifields._timer._sys_private != 0
      >
      > which triggers the following callchain:
      >
      >  do_schedule_next_timer() -> posix_cpu_timer_schedule() -> arm_timer()
      >
      > arm_timer() executes a list_add() on the timer, which is already armed via
      > the timer_set() syscall. That's a double list add which corrupts the posix
      > cpu timer list. As a consequence the kernel crashes on the next operation
      > touching the posix cpu timer list.
      >
      > Posix clocks which are internally implemented based on hrtimers are not
      > affected by this because hrtimer_start() can handle already armed timers
      > nicely, but it's a reliable way to trigger the WARN_ON() in
      > hrtimer_forward(), which complains about calling that function on an
      > already armed timer.
      
      This problem has existed since the posix timer code was merged into
      2.5.63. A few releases earlier, in 2.5.60, ptrace gained the ability to
      inject not just a signal (which linux has supported since 1.0) but the
      full siginfo of a signal.
      
      The core problem is that the code will reschedule timers in response to
      signals getting dequeued, not just for signals the timers sent but
      for any signal that happens to have an si_code of SI_TIMER.
      
      Avoid this confusion by testing whether the queued signal was
      preallocated: all timer signals are preallocated, and so far
      only the timer code preallocates signals.
      
      Move the check for whether a timer needs to be rescheduled up into
      collect_signal(), where the preallocation check must be performed, and
      pass the result back to dequeue_signal(), where the code reschedules
      timers. This makes it clear why the code cares about preallocated
      timers.
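
      In sketch form (signatures simplified), collect_signal() computes the
      hint and dequeue_signal() acts on it:

        static void collect_signal(int sig, struct sigpending *list,
                                   siginfo_t *info, bool *resched_timer)
        {
                struct sigqueue *q;     /* the entry being dequeued */
                ...
                /* only a preallocated (timer) sigqueue may reschedule */
                *resched_timer = (q->flags & SIGQUEUE_PREALLOC) &&
                                 (info->si_code == SI_TIMER) &&
                                 (info->si_sys_private);
        }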
      
      Cc: stable@vger.kernel.org
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      Reference: 66dd34ad ("signal: allow to send any siginfo to itself")
      Reference: 1669ce53e2ff ("Add PTRACE_GETSIGINFO and PTRACE_SETSIGINFO")
      Fixes: db8b50ba75f2 ("[PATCH] POSIX clocks & timers")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      57db7e4a
  7. 15 Jun 2017, 1 commit
    • bpf: permits narrower load from bpf program context fields · 31fd8581
      Authored by Yonghong Song
      Currently, the verifier will reject a program if it contains a
      narrower load from the bpf context structure. For example,
              __u8 h = __sk_buff->hash;
              __u16 p = __sk_buff->protocol;
              __u32 sample_period = bpf_perf_event_data->sample_period;
      which are narrower loads of 4-byte or 8-byte fields.
      
      This patch solves the issue by:
        . Introducing a new parameter ctx_field_size to carry the
          field size of a narrower load from the prog-type-specific
          *__is_valid_access validator back to the verifier.
        . A non-zero ctx_field_size for a memory access indicates that
          (1) the underlying prog-type-specific convert_ctx_accesses
          supports non-whole-field access, and (2) the current insn is
          a narrower or whole-field access.
        . In the verifier, for loads where the load memory size is
          less than ctx_field_size, transforming the load into a full
          field load followed by proper masking (see the sketch after
          this list).
        . Currently, __sk_buff and bpf_perf_event_data->sample_period
          support narrower loads.
        . Narrower stores are still not allowed, as typical ctx stores
          are just normal stores.
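
      For example, a 1-byte load from the 4-byte __sk_buff->hash field is
      conceptually rewritten by the verifier into (sketch, little-endian
      case shown; not the exact insn sequence):

        r0 = *(u32 *)(r1 + offsetof(struct __sk_buff, hash))  /* full load */
        r0 &= 0xff                                            /* masking   */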
      
      Because of this change, some tests in the verifier test suite fail,
      so these tests are removed. As a bonus, rename some out-of-bound
      __sk_buff->cb accesses to proper field names and remove two
      redundant "skb cb oob" tests.
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      31fd8581
  8. 13 Jun 2017, 2 commits
  9. 12 Jun 2017, 1 commit
  10. 11 Jun 2017, 6 commits
  11. 08 Jun 2017, 4 commits
    • srcu: Allow use of Classic SRCU from both process and interrupt context · 1123a604
      Authored by Paolo Bonzini
      Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
      down a guest running iperf on a VFIO assigned device.  This happens
      because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
      context, while a worker thread does the same inside kvm_set_irq().  If the
      interrupt happens while the worker thread is executing __srcu_read_lock(),
      updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
      ->srcu_lock_count[] field can be lost.
      
      The docs say you are not supposed to call srcu_read_lock() and
      srcu_read_unlock() from irq context, but KVM interrupt injection happens
      from (host) interrupt context and it would be nice if SRCU supported the
      use case.  KVM is using SRCU here not really for the "sleepable" part,
      but rather due to its IPI-free fast detection of grace periods.  It is
      therefore not desirable to switch back to RCU, which would effectively
      revert commit 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
      2014-01-16).
      
      However, the docs are overly conservative.  You can have an SRCU instance
      that only has users in irq context, and you can mix process and irq
      context as long as process context users disable interrupts.  In addition,
      __srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
      Classic SRCU.  For those two implementations, only srcu_read_lock()
      is unsafe.
      
      When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
      in commit 5a41344a ("srcu: Simplify __srcu_read_unlock() via
      this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
      Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
      the caller.  Tree SRCU however only does one increment, so on most
      architectures it is more efficient for __srcu_read_lock() to use
      this_cpu_inc(), and any performance differences appear to be down in
      the noise.
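
      With that, __srcu_read_lock() can use an interrupt-safe per-CPU
      increment (Classic SRCU shown in simplified sketch form; field names
      approximate):

        int __srcu_read_lock(struct srcu_struct *sp)
        {
                int idx;

                idx = READ_ONCE(sp->completed) & 0x1;
                this_cpu_inc(sp->per_cpu_ref->lock_count[idx]); /* was __this_cpu_inc() */
                smp_mb(); /* B */  /* Avoid leaking the critical section. */
                return idx;
        }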
      
      Cc: stable@vger.kernel.org
      Fixes: 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
      Reported-by: Linu Cherian <linuc.decode@gmail.com>
      Suggested-by: Linu Cherian <linuc.decode@gmail.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      1123a604
    • srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context · cdf7abc4
      Authored by Paolo Bonzini
      Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
      down a guest running iperf on a VFIO assigned device.  This happens
      because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
      context, while a worker thread does the same inside kvm_set_irq().  If the
      interrupt happens while the worker thread is executing __srcu_read_lock(),
      updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
      ->srcu_lock_count[] field can be lost.
      
      The docs say you are not supposed to call srcu_read_lock() and
      srcu_read_unlock() from irq context, but KVM interrupt injection happens
      from (host) interrupt context and it would be nice if SRCU supported the
      use case.  KVM is using SRCU here not really for the "sleepable" part,
      but rather due to its IPI-free fast detection of grace periods.  It is
      therefore not desirable to switch back to RCU, which would effectively
      revert commit 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
      2014-01-16).
      
      However, the docs are overly conservative.  You can have an SRCU instance
      that only has users in irq context, and you can mix process and irq
      context as long as process context users disable interrupts.  In addition,
      __srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
      Classic SRCU.  For those two implementations, only srcu_read_lock()
      is unsafe.
      
      When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
      in commit 5a41344a ("srcu: Simplify __srcu_read_unlock() via
      this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
      Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
      the caller.  Tree SRCU however only does one increment, so on most
      architectures it is more efficient for __srcu_read_lock() to use
      this_cpu_inc(), and any performance differences appear to be down in
      the noise.
      
      Unlike Classic and Tree SRCU, Tiny SRCU does increments and decrements on
      a single variable.  Therefore, as Peter Zijlstra pointed out, Tiny SRCU's
      implementation already supports mixed-context use of srcu_read_lock()
      and srcu_read_unlock(), at least as long as uses of srcu_read_lock()
      and srcu_read_unlock() in each handler are nested and paired properly.
      In other words, it is still illegal to (say) invoke srcu_read_lock()
      in an interrupt handler and to invoke the matching srcu_read_unlock()
      in a softirq handler.  Therefore, the only change required for Tiny SRCU
      is to its comments.
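
      As an illustration (not from the patch), the legal and illegal Tiny
      SRCU usages differ only in where the matching unlock runs:

        /* legal: lock and unlock nested and paired in one handler */
        static irqreturn_t my_handler(int irq, void *dev)       /* hypothetical */
        {
                int idx = srcu_read_lock(&my_srcu);
                /* ... read-side critical section ... */
                srcu_read_unlock(&my_srcu, idx);
                return IRQ_HANDLED;
        }

        /* illegal: srcu_read_lock() in an interrupt handler with the
         * matching srcu_read_unlock() deferred to a softirq
         */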
      
      Fixes: 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
      Reported-by: Linu Cherian <linuc.decode@gmail.com>
      Suggested-by: Linu Cherian <linuc.decode@gmail.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Paolo Bonzini <pbonzini@redhat.com>
      cdf7abc4
    • Revert "printk: fix double printing with earlycon" · dac8bbba
      Authored by Petr Mladek
      This reverts commit cf39bf58.
      
      The commit caused a regression for users that define both console=ttyS1
      and console=ttyS0 on the command line; see
      https://lkml.kernel.org/r/20170509082915.GA13236@bistromath.localdomain
      
      The kernel log messages always appeared only on one serial port. It is
      even documented in Documentation/admin-guide/serial-console.rst:
      
      "Note that you can only define one console per device type (serial,
      video)."
      
      The above-mentioned commit changed the order in which the command line
      parameters are searched. As a result, the kernel log messages go to
      the last mentioned ttyS* instead of the first one.
      
      We long thought that using two console=ttyS* parameters on the command
      line did not make sense. But then we realized that console= parameters
      are also handled by systemd; see
      http://0pointer.de/blog/projects/serial-console.html
      
      "By default systemd will instantiate one serial-getty@.service on
      the main kernel console, if it is not a virtual terminal."
      
      where
      
      "[4] If multiple kernel consoles are used simultaneously, the main
      console is the one listed first in /sys/class/tty/console/active,
      which is the last one listed on the kernel command line."
      
      This puts the original report into another light. The system is running
      in qemu. The first serial port is used to store the messages into a file.
      The second one is used to log in to the system via a socket. This setup
      depends on systemd and the historic kernel behavior.
      
      In other words, systemd makes it meaningful to define both
      console=ttyS1 and console=ttyS0 on the command line. The kernel fix
      caused a regression visible to userspace (systemd) and needs to be
      reverted.
      
      In addition, it turned out that the fix helped only partially.
      The messages were still duplicated when the boot console was
      removed early by late_initcall(printk_late_init), and the entire
      log was then replayed when the same console was registered as a
      normal one.
      
      Link: 20170606160339.GC7604@pathway.suse.cz
      Cc: Aleksey Makarov <aleksey.makarov@linaro.org>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "Nair, Jayachandran" <Jayachandran.Nair@cavium.com>
      Cc: linux-serial@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reported-by: Sabrina Dubroca <sd@queasysnail.net>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      dac8bbba
    • perf/core: Drop kernel samples even though :u is specified · cc1582c2
      Authored by Jin Yao
      When doing sampling, for example:
      
        perf record -e cycles:u ...
      
      On workloads that do a lot of kernel entry/exits we see kernel
      samples, even though :u is specified. This is due to sampling skid.
      
      This might be a security issue because it can leak kernel addresses even
      though kernel sampling support is disabled.
      
      The patch drops the kernel samples if exclude_kernel is specified.
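
      The check is conceptually (simplified from the patch):

        /* drop a skid sample whose IP landed in the kernel although
         * the event excludes kernel sampling
         */
        static bool sample_is_allowed(struct perf_event *event,
                                      struct pt_regs *regs)
        {
                if (event->attr.exclude_kernel && !user_mode(regs))
                        return false;

                return true;
        }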
      
      For example, a test on a Haswell desktop:
      
        perf record -e cycles:u <mgen>
        perf report --stdio
      
      Before patch applied:
      
          99.77%  mgen     mgen              [.] buf_read
           0.20%  mgen     mgen              [.] rand_buf_init
           0.01%  mgen     [kernel.vmlinux]  [k] apic_timer_interrupt
           0.00%  mgen     mgen              [.] last_free_elem
           0.00%  mgen     libc-2.23.so      [.] __random_r
           0.00%  mgen     libc-2.23.so      [.] _int_malloc
           0.00%  mgen     mgen              [.] rand_array_init
           0.00%  mgen     [kernel.vmlinux]  [k] page_fault
           0.00%  mgen     libc-2.23.so      [.] __random
           0.00%  mgen     libc-2.23.so      [.] __strcasestr
           0.00%  mgen     ld-2.23.so        [.] strcmp
           0.00%  mgen     ld-2.23.so        [.] _dl_start
           0.00%  mgen     libc-2.23.so      [.] sched_setaffinity@@GLIBC_2.3.4
           0.00%  mgen     ld-2.23.so        [.] _start
      
      We can see the kernel symbols apic_timer_interrupt and page_fault.
      
      After patch applied:
      
          99.79%  mgen     mgen           [.] buf_read
           0.19%  mgen     mgen           [.] rand_buf_init
           0.00%  mgen     libc-2.23.so   [.] __random_r
           0.00%  mgen     mgen           [.] rand_array_init
           0.00%  mgen     mgen           [.] last_free_elem
           0.00%  mgen     libc-2.23.so   [.] vfprintf
           0.00%  mgen     libc-2.23.so   [.] rand
           0.00%  mgen     libc-2.23.so   [.] __random
           0.00%  mgen     libc-2.23.so   [.] _int_malloc
           0.00%  mgen     libc-2.23.so   [.] _IO_doallocbuf
           0.00%  mgen     ld-2.23.so     [.] do_lookup_x
           0.00%  mgen     ld-2.23.so     [.] open_verify.constprop.7
           0.00%  mgen     ld-2.23.so     [.] _dl_important_hwcaps
           0.00%  mgen     libc-2.23.so   [.] sched_setaffinity@@GLIBC_2.3.4
           0.00%  mgen     ld-2.23.so     [.] _start
      
      There are only userspace symbols.
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Cc: kan.liang@intel.com
      Cc: mark.rutland@arm.com
      Cc: will.deacon@arm.com
      Cc: yao.jin@intel.com
      Link: http://lkml.kernel.org/r/1495706947-3744-1-git-send-email-yao.jin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cc1582c2
  12. 07 Jun 2017, 8 commits
  13. 05 Jun 2017, 1 commit
  14. 04 Jun 2017, 2 commits
    • alarmtimer: Rate limit periodic intervals · ff86bf0c
      Authored by Thomas Gleixner
      The alarmtimer code has another source of potentially rearming itself too
      fast. Interval timers with a very small interval have a similar CPU-hog
      effect as the previously fixed overflow issue.
      
      The reason is that alarmtimers do not implement the normal protection
      against this kind of problem which the other posix timers use:
      
        timer expires -> queue signal -> deliver signal -> rearm timer
      
      This scheme brings the rearming under scheduler control and prevents
      permanently firing timers which hog the CPU.
      
      Bringing this scheme to the alarm timer code is a major overhaul because it
      lacks all the necessary mechanisms completely.
      
      So for a quick fix, limit the interval to one jiffy. This is not
      problematic in practice, as alarmtimers are usually backed by an RTC for
      suspend, which has 1 second resolution. It could therefore be argued that
      the resolution of this clock should be set to 1 second in general, but
      that's outside the scope of this fix.
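
      The clamp itself is a one-liner in the timer-arming path (placement
      approximate):

        /* Rate limit periodic intervals to the tick as a quick fix */
        if (timr->it.alarm.interval < TICK_NSEC)
                timr->it.alarm.interval = TICK_NSEC;
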
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170530211655.896767100@linutronix.de
      ff86bf0c
    • alarmtimer: Prevent overflow of relative timers · f4781e76
      Authored by Thomas Gleixner
      Andrey reported an alarmtimer-related RCU stall while fuzzing the kernel
      with syzkaller.
      
      The reason for this is an overflow in ktime_add() which brings the
      resulting time into negative space and causes immediate expiry of the
      timer. The following rearm with a small interval does not bring the timer
      back into positive space due to the same issue.
      
      This results in a permanently firing alarmtimer which hogs the CPU.
      
      Use ktime_add_safe() instead which detects the overflow and clamps the
      result to KTIME_SEC_MAX.
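
      In the expiry/rearm paths, the change is essentially (sketch):

        /* before: can wrap into negative time and expire immediately */
        alarm->node.expires = ktime_add(alarm->node.expires, interval);

        /* after: clamps the result to KTIME_SEC_MAX on overflow */
        alarm->node.expires = ktime_add_safe(alarm->node.expires, interval);
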
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170530211655.802921648@linutronix.de
      f4781e76
  15. 03 Jun 2017, 3 commits
  16. 01 Jun 2017, 2 commits