1. 07 May 2016 (1 commit)
  2. 29 Apr 2016 (6 commits)
  3. 28 Apr 2016 (1 commit)
    • perf/core: Fix perf_event_open() vs. execve() race · 79c9ce57
      Committed by Peter Zijlstra
      Jann reported that the ptrace_may_access() check in
      find_lively_task_by_vpid() is racy against exec().
      
      Specifically:
      
        perf_event_open()		execve()
      
        ptrace_may_access()
      				commit_creds()
        ...				if (get_dumpable() != SUID_DUMP_USER)
      				  perf_event_exit_task();
        perf_install_in_context()
      
      would result in installing a counter across the creds boundary.
      
      Fix this by wrapping the bulk of perf_event_open() in cred_guard_mutex.
      This should be fine as perf_event_exit_task() is already called with
      cred_guard_mutex held, so all perf locks already nest inside it.
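      
      A minimal sketch of the resulting locking shape, assuming the 4.6-era
      layout where cred_guard_mutex lives in the target's signal_struct; the
      surrounding setup and error handling are elided:
      
        /* Sketch only: hold the target's cred_guard_mutex from the
         * ptrace_may_access() check until the counter is installed, so
         * that execve()'s commit_creds() cannot slip in between. */
        if (task) {
                err = mutex_lock_interruptible(&task->signal->cred_guard_mutex);
                if (err)
                        goto err_task;
                if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
                        err = -EACCES;
                        goto err_cred;
                }
        }
        /* ... event allocation and context lookup ... */
        perf_install_in_context(ctx, event, event->cpu);
        if (task)
                mutex_unlock(&task->signal->cred_guard_mutex);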
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      79c9ce57
  4. 27 Apr 2016 (1 commit)
  5. 26 Apr 2016 (2 commits)
    • workqueue: fix ghost PENDING flag while doing MQ IO · 346c09f8
      Committed by Roman Pen
      A bug in the workqueue code leads to a stalled IO request in MQ ctx->rq_list
      with the following backtrace:
      
      [  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
      [  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
      [  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
      [  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
      [  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
      [  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
      [  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
      [  601.350965] Call Trace:
      [  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
      [  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
      [  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
      [  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
      [  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
      [  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
      [  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
      [  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
      [  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
      [  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
      [  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
      [  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
      [  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
      [  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
      [  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
      [  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
      [  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
      [  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
      [  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
      [  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
      [  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50
      
      The underlying device is a null_blk, with default parameters:
      
        queue_mode    = MQ
        submit_queues = 1
      
      Verification that nullb0 has something inflight:
      
      root@pserver8:~# cat /sys/block/nullb0/inflight
             0        1
      root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
      ...
      /sys/block/nullb0/mq/0/cpu2/rq_list
      CTX pending:
              ffff8838038e2400
      ...
      
      During debugging it became clear that the stalled request is always inserted in
      the rq_list from the following path:
      
         save_stack_trace_tsk + 34
         blk_mq_insert_requests + 231
         blk_mq_flush_plug_list + 281
         blk_flush_plug_list + 199
         wait_on_page_bit + 192
         __filemap_fdatawait_range + 228
         filemap_fdatawait_range + 20
         filemap_write_and_wait_range + 63
         blkdev_fsync + 27
         vfs_fsync_range + 73
         blkdev_write_iter + 202
         __vfs_write + 170
         vfs_write + 169
         kernel_write + 56
      
      So blk_flush_plug_list() was called with from_schedule == true.
      
      If from_schedule is true, blk_mq_insert_requests() offloads execution of
      __blk_mq_run_hw_queue() to the kblockd workqueue, i.e. it calls
      kblockd_schedule_delayed_work_on().
      
      That means we race with another CPU, which is about to execute the
      __blk_mq_run_hw_queue() work.
      
      Further debugging shows the following traces from different CPUs:
      
        CPU#0                                  CPU#1
        ----------------------------------     -------------------------------
        request A inserted
        STORE hctx->ctx_map[0] bit marked
        kblockd_schedule...() returns 1
        <schedule to kblockd workqueue>
                                               request B inserted
                                               STORE hctx->ctx_map[1] bit marked
                                               kblockd_schedule...() returns 0
        *** WORK PENDING bit is cleared ***
        flush_busy_ctxs() is executed, but
        bit 1, set by CPU#1, is not observed
      
      As a result, request B is pending forever.
      
      This behaviour can be explained by a speculative LOAD of hctx->ctx_map on
      CPU#0, which is reordered with the clearing of the PENDING bit and executed
      _before_ the actual STORE of bit 1 on CPU#1.
      
      The proper fix is an explicit full barrier (mfence on x86), which guarantees
      that the clearing of the PENDING bit is executed before any speculative
      LOADs or STOREs inside the actual work function.
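      
      A sketch of where the barrier lands, modeled on the fix to
      set_work_pool_and_clear_pending(); treat the body as illustrative rather
      than a verbatim diff:
      
        static void set_work_pool_and_clear_pending(struct work_struct *work,
                                                    int pool_id)
        {
                /* Pairs with the implied full barrier in
                 * test_and_set_bit(PENDING); makes updates to @work visible
                 * to the next PENDING owner. */
                smp_wmb();
                set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
                /* Full barrier: the clear of PENDING above must not be
                 * reordered with any LOADs or STOREs performed by
                 * work->current_func, which runs afterwards. */
                smp_mb();
        }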
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
      Cc: Michael Wang <yun.wang@profitbricks.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      346c09f8
    • cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback · 5cf1cacb
      Committed by Tejun Heo
      Since e93ad19d ("cpuset: make mm migration asynchronous"), cpuset
      kicks off asynchronous NUMA node migration if necessary during task
      migration and flushes it from cpuset_post_attach_flush() which is
      called at the end of __cgroup_procs_write().  This is to avoid
      performing migration with cgroup_threadgroup_rwsem write-locked which
      can lead to deadlock through dependency on kworker creation.
      
      memcg has a similar issue with charge moving, so let's convert it to
      an official callback rather than the current one-off cpuset specific
      function.  This patch adds cgroup_subsys->post_attach callback and
      makes cpuset register cpuset_post_attach_flush() as its ->post_attach.
      
      The conversion is mostly one-to-one except that the new callback is
      called under cgroup_mutex.  This is to guarantee that no other
      migration operations are started before ->post_attach callbacks are
      finished.  cgroup_mutex is one of the outermost mutexes in the system,
      has never been a problem here, and shouldn't become one.  We could add
      specialized synchronization around __cgroup_procs_write(), but I don't
      think there's any noticeable benefit.
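      
      A sketch of the shape of the conversion; field and function names follow
      the changelog, but the snippet is abridged, not the full patch:
      
        /* In struct cgroup_subsys: a new optional callback, invoked under
         * cgroup_mutex after a migration completes. */
        struct cgroup_subsys {
                /* ... existing callbacks ... */
                void (*post_attach)(void);
        };
      
        /* cpuset registers its flush function here instead of the one-off
         * cpuset_post_attach_flush() call in __cgroup_procs_write(). */
        struct cgroup_subsys cpuset_cgrp_subsys = {
                /* ... */
                .post_attach    = cpuset_post_attach,
        };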
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
      5cf1cacb
  6. 24 Apr 2016 (1 commit)
  7. 23 Apr 2016 (3 commits)
  8. 22 Apr 2016 (2 commits)
    • cpu/hotplug: Fix rollback during error-out in __cpu_disable() · 3b9d6da6
      Committed by Sebastian Andrzej Siewior
      The recent introduction of the hotplug thread, which invokes the callbacks on
      the plugged cpu, caused the following regression:
      
      If takedown_cpu() fails, then we run into several issues:
      
       1) The rollback of the target cpu states is not invoked. That leaves the smp
          threads and the hotplug thread in disabled state.
      
       2) notify_online() is executed due to a missing skip_onerr flag. That causes
          both CPU_DOWN_FAILED and CPU_ONLINE notifications to be invoked, which
          confuses quite a few notifiers.
      
       3) The CPU_DOWN_FAILED notification is not invoked on the target CPU. That's
          not an issue per se, but it is inconsistent and in consequence blocks the
          patches which rely on these states being invoked on the target CPU and not
          on the controlling cpu. It also does not preserve the strict call order on
          rollback which is problematic for the ongoing state machine conversion as
          well.
      
      To fix this we add a rollback flag to the remote callback machinery and invoke
      the rollback, including the CPU_DOWN_FAILED notification, on the remote
      cpu. Further, mark the notify online state with 'skip_onerr' so we don't get a
      double invocation.
      
      This workaround will go away once we move the unplug invocation to the target
      cpu itself.
      
      [ tglx: Massaged changelog and moved the CPU_DOWN_FAILED notification to the
        target cpu ]
      
      Fixes: 4cb28ced ("cpu/hotplug: Create hotplug threads")
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: linux-s390@vger.kernel.org
      Cc: rt@linutronix.de
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Link: http://lkml.kernel.org/r/20160408124015.GA21960@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      3b9d6da6
    • perf, bpf: minimize the size of perf_trace_() tracepoint handler · 85b67bcb
      Committed by Alexei Starovoitov
      Move trace_call_bpf() into a helper function to minimize the size
      of the perf_trace_*() tracepoint handlers:
          text	   data	    bss	    dec	 	   hex	filename
      10541679	5526646	2945024	19013349	1221ee5	vmlinux_before
      10509422	5526646	2945024	18981092	121a0e4	vmlinux_after
      
      It may seem that perf_fetch_caller_regs() can also be moved,
      but that is incorrect, since ip/sp will be wrong.
      
      bpf+tracepoint performance is not affected, since
      perf_swevent_put_recursion_context() is now inlined; its
      EXPORT_SYMBOL_GPL can also be dropped.
      
      No measurable change in normal perf tracepoints.
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      85b67bcb
  9. 21 Apr 2016 (2 commits)
  10. 20 Apr 2016 (3 commits)
    • futex: Handle unlock_pi race gracefully · 89e9e66b
      Committed by Sebastian Andrzej Siewior
      If userspace calls UNLOCK_PI unconditionally, without trying the TID -> 0
      transition in user space first, then the user space value might not have the
      waiters bit set. This opens the following race:
      
      CPU0	    	      	    CPU1
      uval = get_user(futex)
      			    lock(hb)
      lock(hb)
      			    futex |= FUTEX_WAITERS
      			    ....
      			    unlock(hb)
      
      cmpxchg(futex, uval, newval)
      
      So the cmpxchg fails and returns -EINVAL to user space, which is wrong because
      the futex value is valid.
      
      To handle this (yes, yet another) corner case gracefully, check for a flag
      change and retry.
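      
      A sketch of the retry logic in the unlock path, simplified from the
      actual patch (variable names follow the surrounding futex code):
      
        /* If the cmpxchg failed only because another task set FUTEX_WAITERS
         * in the meantime, the TID part is still ours: retry instead of
         * failing with -EINVAL. */
        if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
                ret = -EFAULT;
        else if (curval != uval) {
                if ((FUTEX_TID_MASK & curval) == uval)
                        ret = -EAGAIN;  /* only flag bits changed: retry */
                else
                        ret = -EINVAL;  /* genuinely not the owner */
        }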
      
      [ tglx: Massaged changelog and slightly reworked implementation ]
      
      Fixes: ccf9e6a8 ("futex: Make unlock_pi more robust")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: stable@vger.kernel.org
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1460723739-5195-1-git-send-email-bigeasy@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      89e9e66b
    • bpf: add event output helper for notifications/sampling/logging · bd570ff9
      Committed by Daniel Borkmann
      This patch adds a new helper for cls/act programs that can push events
      to user space applications. For networking, this can be used e.g. for
      sampling, debugging or logging purposes, or for pushing arbitrary wake-up
      events. The idea is similar to a43eec30 ("bpf: introduce
      bpf_perf_event_output() helper") and 39111695 ("samples: bpf: add
      bpf_perf_event_output example").
      
      The eBPF program utilizes a perf event array map that user space populates
      with fds from perf_event_open(); the eBPF program then calls into the helper,
      e.g. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw)),
      so that the raw data is pushed into the fd, e.g. at the map index of the
      current CPU.
      
      User space can poll/mmap/etc on this and has a data channel for receiving
      events that can be post-processed. The nice thing is that since the eBPF
      program and user space application making use of it are tightly coupled,
      they can define their own arbitrary raw data format and what/when they
      want to push.
      
      While e.g. packet headers could be one part of the meta data that is being
      pushed, this is not a substitute for things like packet sockets, as the whole
      packet is not pushed and pushing happens in a single direction only. The
      intention is more of a generically usable, efficient event pipe to applications.
      Workflow is that tc can pin the map and applications can attach themselves
      e.g. after cls/act setup to one or multiple map slots, demuxing is done by
      the eBPF program.
      
      Adding this facility takes minimal effort: it reuses the helper
      introduced in a43eec30 ("bpf: introduce bpf_perf_event_output() helper"),
      and we get its functionality for free by overloading its BPF_FUNC_ identifier
      for cls/act programs. ctx is currently unused, but will be made use of in
      the future. An example will be added to iproute2's BPF example files; a
      hypothetical sketch follows below.
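      
      A hypothetical restricted-C fragment of such a cls/act program (map name,
      record layout and the bpf_helpers.h conventions are illustrative, not
      part of this patch):
      
        #include <uapi/linux/bpf.h>
        #include <uapi/linux/pkt_cls.h>
        #include "bpf_helpers.h"
      
        struct bpf_map_def SEC("maps") my_map = {
                .type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
                .key_size    = sizeof(int),
                .value_size  = sizeof(__u32),
                .max_entries = 64,    /* >= number of possible CPUs */
        };
      
        SEC("classifier")
        int cls_sample(struct __sk_buff *skb)
        {
                /* Arbitrary, application-defined raw record. */
                struct { __u32 len; __u32 hash; } raw = {
                        .len  = skb->len,
                        .hash = skb->hash,
                };
      
                /* Push the record into the perf event fd at this CPU's slot. */
                bpf_perf_event_output(skb, &my_map, BPF_F_CURRENT_CPU,
                                      &raw, sizeof(raw));
                return TC_ACT_OK;
        }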
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bd570ff9
    • bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output · 1e33759c
      Committed by Daniel Borkmann
      Add a BPF_F_CURRENT_CPU flag to optimize the use-case where user space has
      per-CPU ring buffers and the eBPF program pushes the data into the current
      CPU's ring buffer which saves us an extra helper function call in eBPF.
      Also, make sure to properly reserve the remaining flags which are not used.
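      
      The saving is exactly one helper call; a hypothetical fragment (ctx,
      events and rec are placeholders):
      
        /* Before: the CPU index is computed with an extra helper call. */
        u32 cpu = bpf_get_smp_processor_id();
        bpf_perf_event_output(ctx, &events, cpu, &rec, sizeof(rec));
      
        /* After: the kernel substitutes the current CPU for the flag. */
        bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &rec, sizeof(rec));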
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e33759c
  11. 19 Apr 2016 (2 commits)
    • locking/pvqspinlock: Fix division by zero in qstat_read() · 66876595
      Committed by Davidlohr Bueso
      While playing with the qstat statistics (in <debugfs>/qlockstat/) I ran into
      the following splat on a VM when opening pv_hash_hops:
      
        divide error: 0000 [#1] SMP
        ...
        RIP: 0010:[<ffffffff810b61fe>]  [<ffffffff810b61fe>] qstat_read+0x12e/0x1e0
        ...
        Call Trace:
          [<ffffffff811cad7c>] ? mem_cgroup_commit_charge+0x6c/0xd0
          [<ffffffff8119750c>] ? page_add_new_anon_rmap+0x8c/0xd0
          [<ffffffff8118d3b9>] ? handle_mm_fault+0x1439/0x1b40
          [<ffffffff811937a9>] ? do_mmap+0x449/0x550
          [<ffffffff811d3de3>] ? __vfs_read+0x23/0xd0
          [<ffffffff811d4ab2>] ? rw_verify_area+0x52/0xd0
          [<ffffffff811d4bb1>] ? vfs_read+0x81/0x120
          [<ffffffff811d5f12>] ? SyS_read+0x42/0xa0
          [<ffffffff815720f6>] ? entry_SYSCALL_64_fastpath+0x1e/0xa8
      
      Fix this by verifying that qstat_pv_kick_unlock is in fact non-zero,
      similarly to what the qstat_pv_latency_wake case does. If nothing else,
      the zero can come from resetting the statistics, so having 0 kicks is
      quite valid in this context.
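      
      The shape of the guard, following the changelog's reference to the
      qstat_pv_latency_wake case (sketch; variable names illustrative):
      
        /* Only compute the average when there was at least one kick;
         * a freshly reset counter file then reads as 0 instead of oopsing. */
        if (kicks)
                avg_hops = div64_u64(total_hops, kicks);
        else
                avg_hops = 0;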
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Waiman Long <Waiman.Long@hpe.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: waiman.long@hpe.com
      Link: http://lkml.kernel.org/r/1460961103-24953-1-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      66876595
    • bpf: avoid warning for wrong pointer cast · 266a0a79
      Committed by Arnd Bergmann
      Two new functions in bpf contain a cast from a 'u64' to a
      pointer. This works on 64-bit architectures but causes a warning
      on all 32-bit architectures:
      
      kernel/trace/bpf_trace.c: In function 'bpf_perf_event_output_tp':
      kernel/trace/bpf_trace.c:350:13: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
        u64 ctx = *(long *)r1;
      
      This changes the cast to first convert the u64 argument into a uintptr_t,
      which is guaranteed to be the same size as a pointer.
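      
      The two-step cast in isolation, as a stand-alone illustration that
      compiles cleanly on both 32- and 64-bit targets:
      
        #include <stdint.h>
      
        /* Casting a u64 register value to a pointer directly warns on 32-bit
         * targets; going through uintptr_t keeps the cast size-correct. */
        static long read_ctx_word(uint64_t r1)
        {
                return *(long *)(uintptr_t)r1;  /* was: *(long *)r1 */
        }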
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: 9940d67c ("bpf: support bpf_get_stackid() and bpf_perf_event_output() in tracepoint programs")
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      266a0a79
  12. 15 Apr 2016 (4 commits)
    • bpf: convert relevant helper args to ARG_PTR_TO_RAW_STACK · 074f528e
      Committed by Daniel Borkmann
      This patch converts all helpers that can use ARG_PTR_TO_RAW_STACK as argument
      type. For tc programs this is bpf_skb_load_bytes(), bpf_skb_get_tunnel_key(),
      bpf_skb_get_tunnel_opt(). For tracing, this optimizes bpf_get_current_comm()
      and bpf_probe_read(). The check in bpf_skb_load_bytes() for MAX_BPF_STACK can
      also be removed since the verifier already makes sure we stay within bounds
      on stack buffers.
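      
      How a converted helper advertises the raw-stack buffer in its proto; a
      sketch modeled on the bpf_func_proto convention (the paired size
      argument keeps its existing type):
      
        static const struct bpf_func_proto bpf_skb_load_bytes_proto = {
                .func           = bpf_skb_load_bytes,
                .gpl_only       = false,
                .ret_type       = RET_INTEGER,
                .arg1_type      = ARG_PTR_TO_CTX,
                .arg2_type      = ARG_ANYTHING,
                /* Buffer may arrive uninitialized: the helper fills it on
                 * every path, so the verifier skips the STACK_MISC check. */
                .arg3_type      = ARG_PTR_TO_RAW_STACK,
                .arg4_type      = ARG_CONST_STACK_SIZE,
        };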
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      074f528e
    • bpf, verifier: add ARG_PTR_TO_RAW_STACK type · 435faee1
      Committed by Daniel Borkmann
      When passing buffers from eBPF stack space into a helper function, the
      ARG_PTR_TO_STACK argument type is available for helpers. The verifier makes
      sure that such buffers are initialized, within boundaries, etc.
      
      However, the downside of this is that we have a couple of helper functions,
      such as bpf_skb_load_bytes(), that fill out the passed buffer in the expected
      success case anyway, so zero-initializing it prior to the helper call means
      unneeded/wasted instructions in the eBPF program that can be avoided.
      
      Therefore, add a new helper function argument type called ARG_PTR_TO_RAW_STACK.
      The idea is to skip the STACK_MISC check in check_stack_boundary() and color
      the related stack slots as STACK_MISC after we checked all call arguments.
      
      Helper functions using ARG_PTR_TO_RAW_STACK must make sure that every path of
      the helper function will fill the provided buffer area, so that we cannot leak
      any uninitialized stack memory. This means, for example, that error paths need
      to memset() the buffers, but the expected fast path doesn't have to do this
      anymore.
      
      Since there's no helper needing more than at most one ARG_PTR_TO_RAW_STACK
      argument, we can keep it simple and don't need to check for multiple areas.
      Should such a use case really appear in the future, we have check_raw_mode()
      that will make sure we implement support for it first.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      435faee1
    • bpf, verifier: add bpf_call_arg_meta for passing meta data · 33ff9823
      Committed by Daniel Borkmann
      Currently, when the verifier checks calls in the check_call() function, we
      call check_func_arg() for all 5 arguments, e.g. to make sure expected types
      are correct. In some cases, we collect meta data (here: the map pointer) to
      perform additional checks, such as checking stack boundaries on key/value
      sizes for subsequent arguments. As we're going to extend the meta data,
      add a generic struct bpf_call_arg_meta that we can use for passing into
      check_func_arg().
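      
      The struct starts minimal and grows with later patches; its initial
      shape per this description (sketch; the env type name is assumed from
      the 4.6-era verifier):
      
        /* Per-call scratch data collected while checking helper arguments,
         * threaded through check_func_arg() instead of an ad-hoc map pointer. */
        struct bpf_call_arg_meta {
                struct bpf_map *map_ptr;
        };
      
        static int check_func_arg(struct verifier_env *env, u32 regno,
                                  enum bpf_arg_type arg_type,
                                  struct bpf_call_arg_meta *meta);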
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33ff9823
    • /proc/iomem: only expose physical resource addresses to privileged users · 51d7b120
      Committed by Linus Torvalds
      In commit c4004b02 ("x86: remove the kernel code/data/bss resources
      from /proc/iomem") I was hoping to remove the physical kernel address
      data from /proc/iomem entirely, but that had to be reverted because some
      system programs actually use it.
      
      This limits all the detailed resource information to properly
      credentialed users instead.
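      
      A sketch of the gating in the /proc/iomem seq_file show path, assuming a
      per-open-file capability check as the changelog implies (abridged; not
      the verbatim patch):
      
        static int r_show(struct seq_file *m, void *v)
        {
                struct resource *r = v;
                unsigned long long start, end;
      
                /* Unprivileged readers still see the resource tree, but the
                 * physical start/end addresses are replaced with zeroes. */
                if (file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) {
                        start = r->start;
                        end = r->end;
                } else {
                        start = end = 0;
                }
                /* ... print the entry using the (possibly zeroed) range ... */
                return 0;
        }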
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51d7b120
  13. 14 Apr 2016 (1 commit)
  14. 11 Apr 2016 (1 commit)
  15. 09 Apr 2016 (1 commit)
    • bpf, verifier: further improve search pruning · 07016151
      Committed by Daniel Borkmann
      The verifier needs to go through every path of the program in
      order to check that it terminates safely, which can be quite a
      lot of instructions to process, e.g. in cases with branchier
      programs. With search pruning from f1bca824 ("bpf:
      add search pruning optimization to verifier"), the search space can
      already be reduced significantly when the verifier detects that
      a previously walked path with the same register and stack contents
      already terminated (see the verifier's states_equal()), so the search
      can skip walking those states.
      
      When working with larger programs of > ~2000 (out of max 4096)
      insns, we found that the current limit of 32k processed instructions is
      easily hit. For example, a case we ran into is that the search space cannot
      be pruned due to branches at the beginning of the program that make
      use of certain stack space slots (STACK_MISC), which are never used
      in the remaining program (STACK_INVALID). Therefore, the verifier
      needs to walk paths for the slots in STACK_INVALID state, but also
      all remaining paths with a stack structure where the slots are in
      STACK_MISC, which can nearly double the search space needed. After
      various experiments, we find that a limit of 64k processed insns is
      a more reasonable choice when dealing with larger programs in practice.
      This still allows rejecting extreme, crafted cases that can have a
      much higher complexity (e.g. > ~300k) within the 4096 insns limit
      due to search pruning not being able to take effect.
      
      Furthermore, we found that a lot of states can be pruned after a
      call instruction; e.g. we were able to reduce the search state by
      ~35% in some cases with this heuristic. The trade-off is keeping a bit
      more states in env->explored_states. Usually, call instructions
      have a number of preceding register assignments and/or stack stores,
      where search pruning has a better chance to succeed in the states_equal()
      test. The current code marks the branch targets with STATE_LIST_MARK
      in case of conditional jumps, and the next (t + 1) instruction in
      case of an unconditional jump so that e.g. a backjump will walk it. We
      also did experiments with using t + insns[t].off + 1 as a marker in
      the unconditional jump case instead of t + 1, with the rationale
      that these two branches of execution that converge after the label
      might have more potential for pruning. We found that it was a bit
      better, but not necessarily significantly better than the current
      state, perhaps also due to clang not generating back jumps often.
      Hence, we left that as is for now.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      07016151
  16. 08 Apr 2016 (6 commits)
  17. 05 Apr 2016 (1 commit)
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And it is unlikely it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion whether
      PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using the
      script below.  For some reason, coccinelle doesn't patch header files;
      I've run spatch on them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  18. 04 Apr 2016 (1 commit)
  19. 31 Mar 2016 (1 commit)