  1. 10 Oct, 2014 12 commits
    • S
      kernel/sys.c: compat sysinfo syscall: fix undefined behavior · 0baae41e
      Scotty Bauer committed
      Fix undefined behavior and a compiler warning by replacing the right
      shift by 32 with the upper_32_bits macro.
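
The reason a plain shift by 32 is a problem can be sketched in user space. Shifting a value by an amount equal to or larger than its width is undefined in C, so `x >> 32` is only safe when `x` is known to be 64 bits wide. The two-step shift below is modelled on the kernel's upper_32_bits idiom; the helper name split_u64 is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Two shifts by 16 instead of one shift by 32: each shift count is smaller
 * than the operand width, so the expression stays defined even when the
 * operand happens to be only 32 bits wide. */
#define upper_32_bits(n) ((uint32_t)(((n) >> 16) >> 16))
#define lower_32_bits(n) ((uint32_t)(n))

/* Hypothetical helper: split a 64-bit value into its two halves. */
static void split_u64(uint64_t v, uint32_t *hi, uint32_t *lo)
{
    assert(hi && lo);
    *hi = upper_32_bits(v);
    *lo = lower_32_bits(v);
}
```

With a 32-bit operand the macro simply yields 0, where a direct `>> 32` would have been undefined.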
      Signed-off-by: Scotty Bauer <sbauer@eng.utah.edu>
      Cc: Clemens Ladisch <clemens@ladisch.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0baae41e
    • V
      kernel/sys.c: whitespace fixes · ec94fc3d
      vishnu.ps committed
      Fix minor errors and warning messages in kernel/sys.c.  These errors were
      reported by checkpatch while working with some modifications in sys.c
      file.  Fixing this first will help me to improve my further patches.
      
      ERROR: trailing whitespace - 9
      ERROR: do not use assignment in if condition - 4
      ERROR: spaces required around that '?' (ctx:VxO) - 10
      ERROR: switch and case should be at the same indent - 3
      
      total 26 errors & 3 warnings fixed.
      Signed-off-by: vishnu.ps <vishnu.ps@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec94fc3d
    • Y
      acct: eliminate compile warning · 067b722f
      Ying Xue committed
      If ACCT_VERSION is not defined to 3, the warning below appears:
        CC      kernel/acct.o
        kernel/acct.c: In function `do_acct_process':
        kernel/acct.c:475:24: warning: unused variable `ns' [-Wunused-variable]
      
      [akpm@linux-foundation.org: retain the local for code size improvements]
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      067b722f
    • I
      kernel/async.c: switch to pr_foo() · 27fb10ed
      Ionut Alexa committed
      Signed-off-by: Ionut Alexa <ionut.m.alexa@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27fb10ed
    • S
      mm: use VM_BUG_ON_MM where possible · 96dad67f
      Sasha Levin committed
      Dump the contents of the relevant mm_struct when we hit the bug condition.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96dad67f
    • O
      mempolicy: remove the "task" arg of vma_policy_mof() and simplify it · 6b6482bb
      Oleg Nesterov committed
      1. vma_policy_mof(task) is simply not safe unless task == current,
         it can race with do_exit()->mpol_put(). Remove this arg and update
         its single caller.
      
      2. vma can not be NULL, remove this check and simplify the code.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b6482bb
    • J
      mm: remove noisy remainder of the scan_unevictable interface · 1f13ae39
      Johannes Weiner committed
      The deprecation warnings for the scan_unevictable interface are triggered
      by scripts doing `sysctl -a | grep something else'.  This is annoying and
      not helpful.
      
      The interface has been defunct since 264e56d8 ("mm: disable user
      interface to manually rescue unevictable pages"), which was in 2011, and
      there haven't been any reports of use cases for it, only reports that the
      deprecation warnings are annoying.  It's unlikely that anybody is using
      this interface specifically at this point, so remove it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f13ae39
    • C
      prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation · f606b77f
      Cyrill Gorcunov committed
      During development of checkpoint/restore (c/r) we noticed that supporting
      user namespaces exposes a problem with capabilities in the
      prctl(PR_SET_MM, ...) call: once a new user namespace is created,
      capable(CAP_SYS_RESOURCE) no longer passes.

      One approach is to eliminate the CAP_SYS_RESOURCE check but pass all new
      values in one bundle, which allows the kernel to run more thorough sanity
      tests on the values and at the same time lets us support
      checkpoint/restore of user namespaces.

      Thus a new command, PR_SET_MM_MAP, is introduced.  It takes a pointer to a
      prctl_mm_map structure which carries all the members to be updated.
      
      	prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)
      
      	struct prctl_mm_map {
      		__u64	start_code;
      		__u64	end_code;
      		__u64	start_data;
      		__u64	end_data;
      		__u64	start_brk;
      		__u64	brk;
      		__u64	start_stack;
      		__u64	arg_start;
      		__u64	arg_end;
      		__u64	env_start;
      		__u64	env_end;
      		__u64	*auxv;
      		__u32	auxv_size;
      		__u32	exe_fd;
      	};
      
      All members except @exe_fd correspond to members of struct mm_struct.  To
      show which values these members may take, here are the meanings of the
      members.
      
       - start_code, end_code: represent bounds of executable code area
       - start_data, end_data: represent bounds of data area
       - start_brk, brk: used to calculate bounds for brk() syscall
       - start_stack: used when accounting space needed for command
         line arguments, environment and shmat() syscall
       - arg_start, arg_end, env_start, env_end: represent memory area
         supplied for command line arguments and environment variables
       - auxv, auxv_size: carries auxiliary vector, Elf format specifics
       - exe_fd: file descriptor number for executable link (/proc/self/exe)
      
      Thus we apply the following requirements to the values
      
      1) Any member except @auxv, @auxv_size, @exe_fd is an address in user
         space, so it must lie inside the [mmap_min_addr, mmap_max_addr)
         interval.

      2) While @[start|end]_code and @[start|end]_data may point to nonexistent
         VMAs (say a program maps its own new .text and .data segments during
         execution), the rest of the members must belong to an existing VMA.

      3) Addresses must be ordered, i.e. a @start_ member must not be greater
         than or equal to the corresponding @end_ member.
      
      4) As in regular Elf loading procedure we require that @start_brk and
         @brk be greater than @end_data.
      
      5) If RLIMIT_DATA rlimit is set to non-infinity new values should not
         exceed existing limit. Same applies to RLIMIT_STACK.
      
      6) Auxiliary vector size must not exceed existing one (which is
         predefined as AT_VECTOR_SIZE and depends on architecture).
      
      7) File descriptor passed in @exe_fd should be pointing
         to an executable file (because we use the existing
         prctl_set_mm_exe_file_locked helper, it ensures that the file we are
         going to use as exe link has all required permissions granted).
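
Requirements 1, 3 and 4 above are pure comparisons on the supplied values, so they can be sketched outside the kernel. The following is a minimal user-space sketch, not the kernel's validation code: the struct mm_map_sketch type and the mm_map_sane() helper are hypothetical names, and the VMA, rlimit, auxv and exe_fd checks (requirements 2, 5, 6 and 7) are deliberately left out because they need kernel state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Address members of struct prctl_mm_map, as listed in the commit message. */
struct mm_map_sketch {
    uint64_t start_code, end_code;
    uint64_t start_data, end_data;
    uint64_t start_brk, brk;
    uint64_t start_stack;
    uint64_t arg_start, arg_end;
    uint64_t env_start, env_end;
};

/* Requirement 1: the address lies inside [min_addr, max_addr). */
static bool in_range(uint64_t addr, uint64_t min_addr, uint64_t max_addr)
{
    return addr >= min_addr && addr < max_addr;
}

/* Returns true when the map satisfies requirements 1, 3 and 4. */
static bool mm_map_sane(const struct mm_map_sketch *m,
                        uint64_t min_addr, uint64_t max_addr)
{
    assert(m);
    const uint64_t addrs[] = {
        m->start_code, m->end_code, m->start_data, m->end_data,
        m->start_brk, m->brk, m->start_stack,
        m->arg_start, m->arg_end, m->env_start, m->env_end,
    };

    for (size_t i = 0; i < sizeof(addrs) / sizeof(addrs[0]); i++)
        if (!in_range(addrs[i], min_addr, max_addr))
            return false;

    /* Requirement 3: every @start_ member is strictly below its @end_ member. */
    if (m->start_code >= m->end_code || m->start_data >= m->end_data ||
        m->arg_start >= m->arg_end || m->env_start >= m->env_end)
        return false;

    /* Requirement 4: @start_brk and @brk are both greater than @end_data. */
    if (m->start_brk <= m->end_data || m->brk <= m->end_data)
        return false;

    return true;
}
```

Checking the whole bundle in one pass is what lets the kernel compare members against each other, which a one-value-at-a-time interface cannot do.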
      
      Now about where these members are involved inside kernel code:
      
       - @start_code and @end_code are used in /proc/$pid/[stat|statm] output;
      
       - @start_data and @end_data are used in /proc/$pid/[stat|statm] output,
         also they are considered when checking if there is enough space for
         the brk() syscall result if RLIMIT_DATA is set;

       - @start_brk is shown in /proc/$pid/stat output and accounted in the
         brk() syscall if RLIMIT_DATA is set; also this member is tested to
         find a symbolic name of an mmap event for the perf system (we choose
         if the event is generated for the "heap" area); one more application
         is selinux -- we test if a process has PROCESS__EXECHEAP permission
         when it tries to make the heap area executable with the mprotect()
         syscall;

       - @brk is the current value for the brk() syscall, which lies inside the
         heap area; it's shown in /proc/$pid/stat. When the brk() syscall
         successfully provides a new memory area to user space, mm::brk is
         updated upon brk() completion to carry the new value;
      
         Both @start_brk and @brk are actively used in /proc/$pid/maps
         and /proc/$pid/smaps output to find a symbolic name "heap" for
         VMA being scanned;
      
       - @start_stack is printed out in /proc/$pid/stat and used to
         find a symbolic name "stack" for task and threads in
         /proc/$pid/maps and /proc/$pid/smaps output, and as the same
         as with @start_brk -- perf system uses it for event naming.
         Also the kernel treats this member as the start address of where
         to map vDSO pages and to check if there is enough space
         for the shmat() syscall;

       - @arg_start, @arg_end, @env_start and @env_end are printed out
         in /proc/$pid/stat. Another way to access the data these members
         represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
         The kernel checks any attempt to read these areas with the
         access_process_vm helper, so a user must have enough rights for this
         action;
      
       - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
         speaking kernel doesn't care much about which exactly data is
         sitting there because it is solely for userspace;
      
       - @exe_fd is referred to from /proc/$pid/exe and when generating a
         coredump. We use the prctl_set_mm_exe_file_locked helper to update
         this member, so exe-file link modification remains a one-shot
         action.

      Still note that updating the exe-file link no longer requires the
      sys-resource capability; after all, there is not much profit in preventing
      a process from setting up its own file link (there are a number of ways to
      execute own code -- ptrace, ld-preload -- so the only reliable way to find
      out exactly which code is executed is to inspect the running program's
      memory).  Still we require the caller to be at least user-namespace root
      user.

      I believe the old interface should be deprecated and ripped out in a
      couple of kernel releases if no one is against it.
      
      To test if new interface is implemented in the kernel one can pass
      PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently
      supported struct prctl_mm_map.
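
A user-space probe for the new interface might look like the sketch below. It assumes a Linux system with glibc's prctl() wrapper; the fallback values for PR_SET_MM and PR_SET_MM_MAP_SIZE are assumptions copied from the uapi header for the case where the installed headers predate this commit, and probe_mm_map_size() is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older headers (assumed to match include/uapi/linux/prctl.h). */
#ifndef PR_SET_MM
#define PR_SET_MM 35
#endif
#ifndef PR_SET_MM_MAP_SIZE
#define PR_SET_MM_MAP_SIZE 15
#endif

/* Ask the kernel for the size of the struct prctl_mm_map it supports.
 * Returns 0 and fills *size on success, or a negative errno value when the
 * running kernel (or its configuration) does not offer the interface. */
static int probe_mm_map_size(unsigned int *size)
{
    assert(size);
    if (prctl(PR_SET_MM, PR_SET_MM_MAP_SIZE, (unsigned long)size, 0UL, 0UL) != 0)
        return -errno;
    return 0;
}
```

A caller would treat any failure as "fall back to the older per-field PR_SET_MM operations" and compare a successful size against its own sizeof(struct prctl_mm_map).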
      
      [akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Andrew Vagin <avagin@openvz.org>
      Tested-by: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f606b77f
    • C
      prctl: PR_SET_MM -- factor out mmap_sem when updating mm::exe_file · 71fe97e1
      Cyrill Gorcunov committed
      Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file(), move it out
      and rename the helper to prctl_set_mm_exe_file_locked().  This allows the
      function to be reused in the next patch.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71fe97e1
    • C
      mm: use may_adjust_brk helper · 8764b338
      Cyrill Gorcunov committed
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8764b338
    • N
      kernel/kthread.c: partial revert of 81c98869 ("kthread: ensure locality of... · 10922838
      Nishanth Aravamudan committed
      kernel/kthread.c: partial revert of 81c98869 ("kthread: ensure locality of task_struct allocations")
      
      After discussions with Tejun, we don't want to spread the use of
      cpu_to_mem() (and thus knowledge of allocators/NUMA topology details) into
      callers, but would rather ensure the callees correctly handle memoryless
      nodes.  With the previous patches ("topology: add support for
      node_to_mem_node() to determine the fallback node" and "slub: fallback to
      node_to_mem_node() node if allocating on memoryless node") adding and
      using node_to_mem_node(), we can safely undo part of the change to the
      kthread logic from 81c98869.
      Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10922838
    • C
      softlockup: make detector be aware of task switch of processes hogging cpu · b1a8de1f
      chai wen committed
      For now, soft lockup detector warns once for each case of process
      softlockup.  But the thread 'watchdog/n' may not always get the cpu at the
      time slot between the task switch of two processes hogging that cpu to
      reset soft_watchdog_warn.
      
      An example would be two processes hogging the cpu.  Process A causes the
      softlockup warning and is killed manually by a user.  Process B
      immediately becomes the new process hogging the cpu preventing the
      softlockup code from resetting the soft_watchdog_warn variable.
      
      This case is a false negative of "warn only once for a process", as there
      may be a different process that is going to hog the cpu.  Resolve this by
      saving/checking the task pointer of the hogging process and use that to
      reset soft_watchdog_warn too.
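
The idea of remembering which task triggered the warning can be modelled with a small state machine. This is a simplified user-space sketch, not the kernel code: struct lockup_state and lockup_tick() are hypothetical names, tasks are represented by opaque pointers, and the sketch warns immediately when a new hog is seen, which is a simplification of however the kernel re-arms its detector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct lockup_state {
    bool warned;            /* stands in for soft_watchdog_warn */
    const void *saved_task; /* task that triggered the last warning */
};

/* Called once per watchdog timer tick.  'stuck' says whether the CPU is
 * currently considered locked up; 'cur_task' identifies the task hogging it.
 * Returns true when a warning should be printed on this tick. */
static bool lockup_tick(struct lockup_state *s, const void *cur_task, bool stuck)
{
    assert(s);
    if (!stuck) {
        s->warned = false;      /* a quiet tick re-arms the detector */
        return false;
    }
    if (s->warned && s->saved_task == cur_task)
        return false;           /* same hog: warn only once */

    /* First warning, or a different task took over without a quiet tick. */
    s->saved_task = cur_task;
    s->warned = true;
    return true;
}
```

Without the saved_task comparison, the second hog in the example above would never be reported because no quiet tick occurs between the two processes.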
      
      [dzickus@redhat.com: update comment]
      Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1a8de1f
  2. 03 Oct, 2014 3 commits
    • P
      perf: fix perf bug in fork() · 6c72e350
      Peter Zijlstra committed
      Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
      calling perf_event_free_task() when failing sched_fork() we will not yet
      have done the memset() on ->perf_event_ctxp[] and will therefore try and
      'free' the inherited contexts, which are still in use by the parent
      process.  This is bad.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Sylvain 'ythier' Hitier <sylvain.hitier@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c72e350
    • S
      ring-buffer: Fix infinite spin in reading buffer · 24607f11
      Steven Rostedt (Red Hat) committed
      Commit 651e22f2 "ring-buffer: Always reset iterator to reader page"
      fixed one bug but in the process caused another one. The reset is to
      update the header page, but that fix also changed the way the cached
      reads were updated. The cache reads are used to test if an iterator
      needs to be updated or not.
      
      A ring buffer iterator, when created, disables writes to the ring buffer
      but does not stop other readers or consuming reads from happening.
      Although all readers are synchronized via a lock, they are only
      synchronized when in the ring buffer functions. Those functions may
      be called by any number of readers. The iterator continues down when
      it's not interrupted by a consuming reader. If a consuming read
      occurs, the iterator starts from the beginning of the buffer.
      
      The way the iterator sees that a consuming read has happened since
      its last read is by checking the reader "cache". The cache holds the
      last counts of the read and the reader page itself.
      
      Commit 651e22f2 changed what was saved by the cache_read when
      the rb_iter_reset() occurred, making the iterator never match the cache.
      Then if the iterator calls rb_iter_reset(), it will go into an
      infinite loop by checking if the cache doesn't match, doing the reset
      and retrying, just to see that the cache still doesn't match! Which
      should never happen as the reset is supposed to set the cache to the
      current value and there are locks that keep a consuming reader from
      having access to the data.
      
      Fixes: 651e22f2 "ring-buffer: Always reset iterator to reader page"
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      24607f11
    • K
      aarch64: filter $x from kallsyms · 6c34f1f5
      Kyle McMartin committed
      Similar to ARM, AArch64 is generating $x and $d syms... which isn't
      terribly helpful when looking at %pF output and the like. Filter those
      out in kallsyms, modpost and when looking at module symbols.
      
      Seems simplest since none of these check EM_ARM anyway, to just add it
      to the strchr used, rather than trying to make things overly
      complicated.
      
      initcall_debug improves:
      dmesg_before.txt: initcall $x+0x0/0x154 [sg] returned 0 after 26331 usecs
      dmesg_after.txt: initcall init_sg+0x0/0x154 [sg] returned 0 after 15461 usecs
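
The filter described above amounts to a short string test. The sketch below is a user-space illustration under the assumption that mapping symbols have the form "$a", "$t", "$d" or "$x", optionally followed by a "." suffix; the function name is_mapping_symbol() is hypothetical and the extra end-of-string guard is an addition of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true for ARM/AArch64-style mapping symbols such as "$x" or "$d.12".
 * Adding 'x' to the character set given to strchr() is the whole change. */
static bool is_mapping_symbol(const char *name)
{
    assert(name);
    return name[0] == '$' &&
           name[1] != '\0' &&           /* strchr() would match the terminator */
           strchr("axtd", name[1]) != NULL &&
           (name[2] == '\0' || name[2] == '.');
}
```

A symbol table walker would skip every entry for which this returns true, so that the real function name (init_sg in the dmesg example) is reported instead.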
      Signed-off-by: Kyle McMartin <kyle@redhat.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      6c34f1f5
  3. 02 Oct, 2014 1 commit
    • A
      bpf: add search pruning optimization to verifier · f1bca824
      Alexei Starovoitov committed
      Consider a C program represented in eBPF:
      int filter(int arg)
      {
          int a, b, c, *ptr;
      
          if (arg == 1)
              ptr = &a;
          else if (arg == 2)
              ptr = &b;
          else
              ptr = &c;
      
          *ptr = 0;
          return 0;
      }
      eBPF verifier has to follow all possible paths through the program
      to recognize that '*ptr = 0' instruction would be safe to execute
      in all situations.
      It does so by picking a path towards the end and observing changes
      to registers and stack at every insn until it reaches bpf_exit.
      Then it comes back to one of the previous branches and goes towards
      the end again with potentially different values in registers.
      When a program has a lot of branches, the number of possible combinations
      of branches is huge, so the verifier has a hard limit of walking no more
      than 32k instructions. This limit can be reached and complex (but valid)
      programs could be rejected. Therefore it's important to recognize equivalent
      verifier states to prune this depth first search.
      
      Basic idea can be illustrated by the program (where .. are some eBPF insns):
          1: ..
          2: if (rX == rY) goto 4
          3: ..
          4: ..
          5: ..
          6: bpf_exit
      In the first pass towards bpf_exit the verifier will walk insns: 1, 2, 3, 4, 5, 6
      Since insn#2 is a branch the verifier will remember its state in verifier stack
      to come back to it later.
      Since insn#4 is marked as 'branch target', the verifier will remember its state
      in explored_states[4] linked list.
      Once it reaches insn#6 successfully it will pop the state recorded at insn#2 and
      will continue.
      Without search pruning optimization verifier would have to walk 4, 5, 6 again,
      effectively simulating execution of insns 1, 2, 4, 5, 6
      With search pruning it will check whether state at #4 after jumping from #2
      is equivalent to one recorded in explored_states[4] during first pass.
      If there is an equivalent state, verifier can prune the search at #4 and declare
      this path to be safe as well.
      In other words two states at #4 are equivalent if execution of 1, 2, 3, 4 insns
      and 1, 2, 4 insns produces equivalent registers and stack.
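
The bookkeeping behind explored_states[] can be sketched as a per-instruction list of previously seen states. This is a toy model under strong assumptions: a state is just a small array of register type codes, equivalence is plain equality (the real verifier's notion of equivalence may be more permissive), and struct vstate, struct explored and is_state_visited() as written here are illustrative names, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NREGS      4
#define MAX_INSNS  16
#define MAX_STATES 8

/* Toy verifier state: one type code per register. */
struct vstate {
    int reg_type[NREGS];
};

/* States already seen at one instruction (a fixed array instead of a list). */
struct explored {
    struct vstate seen[MAX_STATES];
    int count;
};

static struct explored explored_states[MAX_INSNS];

static bool states_equal(const struct vstate *a, const struct vstate *b)
{
    return memcmp(a->reg_type, b->reg_type, sizeof(a->reg_type)) == 0;
}

/* Returns true when an equivalent state was already explored at 'insn', so
 * the current path can be pruned.  Otherwise records the state and returns
 * false, meaning the walk has to continue. */
static bool is_state_visited(int insn, const struct vstate *cur)
{
    assert(insn >= 0 && insn < MAX_INSNS && cur);
    struct explored *e = &explored_states[insn];

    for (int i = 0; i < e->count; i++)
        if (states_equal(&e->seen[i], cur))
            return true;

    if (e->count < MAX_STATES)
        e->seen[e->count++] = *cur;
    return false;
}
```

In the six-instruction example, the first pass records its state at insn 4; when the second path arrives at insn 4 with the same registers and stack, the lookup succeeds and insns 4, 5, 6 are not walked again.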
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f1bca824
  4. 01 Oct, 2014 4 commits
    • J
      PM / hibernate: Iterate over set bits instead of PFNs in swsusp_free() · fdd64ed5
      Joerg Roedel committed
      The existing implementation of swsusp_free iterates over all
      pfns in the system and checks every bit in the two memory
      bitmaps.
      
      This doesn't scale very well with large numbers of pfns,
      especially when the bitmaps are not populated very densely.
      Change the algorithm to iterate over the set bits in the
      bitmaps instead to make it scale better in large memory
      configurations.
      
      Also add a memory_bm_clear_current() helper function that
      clears the bit for the last position returned from the
      memory bitmap.
      
      This new version adds a !NULL check for the memory bitmaps
      before they are walked. Not doing so causes a kernel crash
      when the bitmaps are NULL.
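
The difference between scanning every PFN and visiting only the set bits can be illustrated on a flat array of 64-bit words. This is a generic sketch rather than the hibernation code: the real memory bitmaps have their own layout, the names for_each_set_bit_clear() and sum_bits() are hypothetical, and __builtin_ctzll is a GCC/Clang builtin assumed to be available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*bit_fn)(size_t bit, void *ctx);

/* Visit only the set bits, clearing each one after it has been handled.
 * The cost is proportional to the number of set bits plus the number of
 * words, not to the total number of bit positions.
 * Returns the number of bits visited; a NULL bitmap is treated as empty. */
static size_t for_each_set_bit_clear(uint64_t *words, size_t nwords,
                                     bit_fn fn, void *ctx)
{
    size_t visited = 0;

    assert(fn);
    if (!words)
        return 0;               /* mirrors the !NULL check in the commit */

    for (size_t w = 0; w < nwords; w++) {
        while (words[w] != 0) {
            unsigned int tz = (unsigned int)__builtin_ctzll(words[w]);

            fn(w * 64 + tz, ctx);
            words[w] &= words[w] - 1;   /* clear the lowest set bit */
            visited++;
        }
    }
    return visited;
}

/* Example callback: accumulate the visited bit numbers. */
static void sum_bits(size_t bit, void *ctx)
{
    *(size_t *)ctx += bit;
}
```

Clearing the bit inside the loop plays the role that memory_bm_clear_current() plays in the commit: the position just returned is cleared before moving on.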
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      fdd64ed5
    • R
      ACPI / sleep: Rework the handling of ACPI GPE wakeup from suspend-to-idle · a8d46b9e
      Rafael J. Wysocki committed
      The ACPI GPE wakeup from suspend-to-idle is currently based on using
      the IRQF_NO_SUSPEND flag for the ACPI SCI, but that is problematic
      for a couple of reasons.  First, in principle the ACPI SCI may be
      shared and IRQF_NO_SUSPEND does not really work well with shared
      interrupts.  Second, it may require the ACPI subsystem to special-case
      the handling of device notifications depending on whether or not
      they are received during suspend-to-idle in some places which would
      lead to fragile code.  Finally, it's better to handle ACPI wakeup
      interrupts consistently with wakeup interrupts from other sources.
      
      For this reason, remove the IRQF_NO_SUSPEND flag from the ACPI SCI
      and use enable_irq_wake()/disable_irq_wake() with it instead, which
      requires two additional platform hooks to be added to struct
      platform_freeze_ops.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      a8d46b9e
    • R
      PM / sleep: Rename platform suspend/resume functions in suspend.c · ebc3e41e
      Rafael J. Wysocki committed
      Rename several local functions related to platform handling during
      system suspend/resume in suspend.c so that their names better
      reflect their roles.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      ebc3e41e
    • R
      PM / sleep: Export dpm_suspend_late/noirq() and dpm_resume_early/noirq() · 2a8a8ce6
      Rafael J. Wysocki committed
      Subsequent change sets will add platform-related operations between
      dpm_suspend_late() and dpm_suspend_noirq() as well as between
      dpm_resume_noirq() and dpm_resume_early() in suspend_enter(), so
      export these functions for suspend_enter() to be able to call them
      separately and split the invocations of dpm_suspend_end() and
      dpm_resume_start() in there accordingly.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      2a8a8ce6
  5. 28 Sep, 2014 1 commit
  6. 27 Sep, 2014 11 commits
    • A
      bpf: mini eBPF library, test stubs and verifier testsuite · 3c731eba
      Alexei Starovoitov committed
      1.
      the library includes a trivial set of BPF syscall wrappers:
      int bpf_create_map(int key_size, int value_size, int max_entries);
      int bpf_update_elem(int fd, void *key, void *value);
      int bpf_lookup_elem(int fd, void *key, void *value);
      int bpf_delete_elem(int fd, void *key);
      int bpf_get_next_key(int fd, void *key, void *next_key);
      int bpf_prog_load(enum bpf_prog_type prog_type,
      		  const struct sock_filter_int *insns, int insn_len,
      		  const char *license);
      bpf_prog_load() stores verifier log into global bpf_log_buf[] array
      
      and BPF_*() macros to build instructions
      
      2.
      test stubs configure eBPF infra with 'unspec' map and program types.
      These are fake types used by user space testsuite only.
      
      3.
      verifier tests valid and invalid programs and expects predefined
      error log messages from kernel.
      40 tests so far.
      
      $ sudo ./test_verifier
       #0 add+sub+mul OK
       #1 unreachable OK
       #2 unreachable2 OK
       #3 out of range jump OK
       #4 out of range jump2 OK
       #5 test1 ld_imm64 OK
       ...
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3c731eba
    • A
      bpf: verifier (add verifier core) · 17a52670
      Alexei Starovoitov committed
      This patch adds verifier core which simulates execution of every insn and
      records the state of registers and program stack. Every branch instruction seen
      during simulation is pushed into state stack. When verifier reaches BPF_EXIT,
      it pops the state from the stack and continues until it reaches BPF_EXIT again.
      For program:
      1: bpf_mov r1, xxx
      2: if (r1 == 0) goto 5
      3: bpf_mov r0, 1
      4: goto 6
      5: bpf_mov r0, 2
      6: bpf_exit
      The verifier will walk insns: 1, 2, 3, 4, 6
      then it will pop the state recorded at insn#2 and will continue: 5, 6
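
The walk order described for this six-instruction program can be reproduced with a small simulation. This is a toy sketch: it tracks only the program counter, whereas the verifier described here also records register and stack state, and it assumes the program is well formed (every jump target is valid and every path ends in an exit). The names struct insn, enum kind and walk_paths() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum kind { K_ALU, K_JCOND, K_JA, K_EXIT };

struct insn {
    enum kind kind;
    int target;         /* jump target for K_JCOND and K_JA, unused otherwise */
};

#define MAX_STACK 16

/* Walk every path of 'prog', which is 1-indexed to match the listing above
 * (prog[0] is unused).  Visited instruction numbers are written to 'trace'.
 * Returns the number of entries written. */
static size_t walk_paths(const struct insn *prog, int *trace, size_t max_trace)
{
    int stack[MAX_STACK];       /* pending branch targets */
    int sp = 0;
    size_t n = 0;
    int pc = 1;

    assert(prog && trace);
    for (;;) {
        if (n == max_trace)
            return n;
        trace[n++] = pc;

        switch (prog[pc].kind) {
        case K_JCOND:
            assert(sp < MAX_STACK);
            stack[sp++] = prog[pc].target;  /* come back to the taken branch */
            pc++;                           /* explore the fall-through first */
            break;
        case K_JA:
            pc = prog[pc].target;
            break;
        case K_EXIT:
            if (sp == 0)
                return n;                   /* no pending branches left */
            pc = stack[--sp];
            break;
        default:
            pc++;
        }
    }
}
```

For the program above the recorded trace is 1, 2, 3, 4, 6 followed by 5, 6, matching the order given in the commit message.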
      
      This way it walks all possible paths through the program and checks all
      possible values of registers. While doing so, it checks for:
      - invalid instructions
      - uninitialized register access
      - uninitialized stack access
      - misaligned stack access
      - out of range stack access
      - invalid calling convention
      - instruction encoding is not using reserved fields
      
      Kernel subsystem configures the verifier with two callbacks:
      
      - bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
        that provides information to the verifier which fields of 'ctx'
        are accessible (remember 'ctx' is the first argument to eBPF program)
      
      - const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
        returns argument constraints of kernel helper functions that eBPF program
        may call, so that verifier can check that R1-R5 types match the prototype
      
      More details in Documentation/networking/filter.txt and in kernel/bpf/verifier.c
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      17a52670
    • A
      bpf: verifier (add branch/goto checks) · 475fb78f
      Alexei Starovoitov committed
      check that control flow graph of eBPF program is a directed acyclic graph
      
      check_cfg() does:
      - detect loops
      - detect unreachable instructions
      - check that program terminates with BPF_EXIT insn
      - check that all branches are within program boundary
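
The four checks listed above map naturally onto a depth-first search with three node colours. The sketch below is an illustration only: it is recursive for brevity, uses 0-indexed instructions, and the names struct cfg_insn, visit() and cfg_ok() are hypothetical rather than taken from the kernel's check_cfg().

```c
#include <assert.h>
#include <stdbool.h>

enum cfg_kind { C_ALU, C_JCOND, C_JA, C_EXIT };

struct cfg_insn {
    enum cfg_kind kind;
    int target;         /* jump target for C_JCOND and C_JA */
};

#define CFG_MAX 64
enum { WHITE = 0, GREY, BLACK };    /* unvisited, on current path, finished */

static bool visit(const struct cfg_insn *p, int len, int pc, unsigned char *color)
{
    bool ok = true;

    if (pc < 0 || pc >= len)
        return false;               /* branch or fall-through leaves the program */
    if (color[pc] == GREY)
        return false;               /* back-edge: the program contains a loop */
    if (color[pc] == BLACK)
        return true;                /* already fully explored */

    color[pc] = GREY;
    switch (p[pc].kind) {
    case C_EXIT:
        break;
    case C_JA:
        ok = visit(p, len, p[pc].target, color);
        break;
    case C_JCOND:
        ok = visit(p, len, pc + 1, color) &&
             visit(p, len, p[pc].target, color);
        break;
    default:
        ok = visit(p, len, pc + 1, color);
    }
    color[pc] = BLACK;
    return ok;
}

/* Returns true when the control flow graph is a DAG, every instruction is
 * reachable, every path ends in an exit and all jumps stay in bounds. */
static bool cfg_ok(const struct cfg_insn *p, int len)
{
    unsigned char color[CFG_MAX] = { 0 };

    assert(p && len > 0 && len <= CFG_MAX);
    if (!visit(p, len, 0, color))
        return false;
    for (int i = 0; i < len; i++)
        if (color[i] != BLACK)
            return false;           /* unreachable instruction */
    return true;
}
```

Running this pass before any path simulation means the simulation can assume that every walk terminates.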
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      475fb78f
    • A
      bpf: handle pseudo BPF_LD_IMM64 insn · 0246e64d
      Alexei Starovoitov committed
      eBPF programs passed from userspace are using pseudo BPF_LD_IMM64 instructions
      to refer to process-local map_fd. Scan the program for such instructions and
      if FDs are valid, convert them to 'struct bpf_map' pointers which will be used
      by verifier to check access to maps in bpf_map_lookup/update() calls.
      If program passes verifier, convert pseudo BPF_LD_IMM64 into generic by dropping
      BPF_PSEUDO_MAP_FD flag.
      
      Note that eBPF interpreter is generic and knows nothing about pseudo insns.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0246e64d
    • A
      bpf: verifier (add ability to receive verification log) · cbd35700
      Alexei Starovoitov committed
      add optional attributes for BPF_PROG_LOAD syscall:
      union bpf_attr {
          struct {
      	...
      	__u32         log_level; /* verbosity level of eBPF verifier */
      	__u32         log_size;  /* size of user buffer */
      	__aligned_u64 log_buf;   /* user supplied 'char *buffer' */
          };
      };
      
      when log_level > 0 the verifier will return its verification log in the user
      supplied buffer 'log_buf' which can be used by program author to analyze why
      verifier rejected given program.
      
      'Understanding eBPF verifier messages' section of Documentation/networking/filter.txt
      provides several examples of these messages, like the program:
      
        BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
        BPF_LD_MAP_FD(BPF_REG_1, 0),
        BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
        BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
        BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
        BPF_EXIT_INSN(),
      
      will be rejected with the following multi-line message in log_buf:
      
        0: (7a) *(u64 *)(r10 -8) = 0
        1: (bf) r2 = r10
        2: (07) r2 += -8
        3: (b7) r1 = 0
        4: (85) call 1
        5: (15) if r0 == 0x0 goto pc+1
         R0=map_ptr R10=fp
        6: (7a) *(u64 *)(r0 +4) = 0
        misaligned access off 4 size 8
      
      The format of the output can change at any time as verifier evolves.
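
      The last line of the log above, "misaligned access off 4 size 8", can be
      read as a simple divisibility rule. A minimal sketch, assuming the rule
      is "the offset must be a multiple of the access size"; the helper name
      access_is_aligned() is hypothetical and this is an illustration rather
      than the verifier's own code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper modeling the rule behind the log line
 * "misaligned access off 4 size 8": an access of `size` bytes at offset
 * `off` counts as aligned only when off is a multiple of size.
 */
static bool access_is_aligned(int off, int size)
{
    assert(size > 0);        /* access sizes are expected to be 1, 2, 4 or 8 */
    return off % size == 0;  /* remainder is also 0 for negative multiples */
}
```

      Under this rule the store at r10 - 8 in instruction 0 is accepted
      (offset -8, size 8), while the 8-byte store at r0 + 4 in instruction 6
      is rejected.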
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: verifier (add docs) · 51580e79
      Committed by Alexei Starovoitov
      This patch adds all of the eBPF verifier documentation and an empty bpf_check().
      
      The end goal for the verifier is to statically check safety of the program.
      
      Verifier will catch:
      - loops
      - out of range jumps
      - unreachable instructions
      - invalid instructions
      - uninitialized register access
      - uninitialized stack access
      - misaligned stack access
      - out of range stack access
      - invalid calling convention
      
      More details in Documentation/networking/filter.txt
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: handle pseudo BPF_CALL insn · 0a542a86
      Committed by Alexei Starovoitov
      In native eBPF programs, userspace uses pseudo BPF_CALL instructions,
      which encode one of 'enum bpf_func_id' inside the insn->imm field.
      The verifier checks that the program passes correct function arguments for
      the given func_id. If all checks pass, the kernel fixes up the BPF_CALL->imm
      fields by replacing func_id with an in-kernel function pointer.
      The eBPF interpreter just calls the function.
      
      In-kernel eBPF users continue to use generic BPF_CALL.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: expand BPF syscall with program load/unload · 09756af4
      Committed by Alexei Starovoitov
      eBPF programs are similar to kernel modules. They are loaded by the user
      process and automatically unloaded when process exits. Each eBPF program is
      a safe run-to-completion set of instructions. eBPF verifier statically
      determines that the program terminates and is safe to execute.
      
      The following syscall wrapper can be used to load the program:
      int bpf_prog_load(enum bpf_prog_type prog_type,
                        const struct bpf_insn *insns, int insn_cnt,
                        const char *license)
      {
          union bpf_attr attr = {
              .prog_type = prog_type,
              .insns = ptr_to_u64(insns),
              .insn_cnt = insn_cnt,
              .license = ptr_to_u64(license),
          };
      
          return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
      }
      where 'insns' is an array of eBPF instructions and 'license' is a string
      that must be GPL compatible to call helper functions marked gpl_only.
      
      Upon successful load the syscall returns prog_fd.
      Use close(prog_fd) to unload the program.
      
      User space tests and examples follow in the later patches
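
      The wrapper above relies on a ptr_to_u64() helper that is not shown in
      the message. A plausible definition, offered as an assumption rather
      than the author's exact code, widens a pointer to a 64-bit integer so
      that the attribute layout is the same for 32-bit and 64-bit callers:

```c
#include <assert.h>
#include <stdint.h>

/* The pointer must fit in the 64-bit attribute field. */
static_assert(sizeof(void *) <= sizeof(uint64_t),
              "pointers wider than 64 bits are not supported");

/* Assumed shape of the helper: go through uintptr_t first so the
 * conversion is well defined, then widen to 64 bits.
 */
static inline uint64_t ptr_to_u64(const void *ptr)
{
    return (uint64_t)(uintptr_t)ptr;
}
```

      Passing pointers as fixed-width integers keeps 'union bpf_attr' free of
      fields whose size depends on the caller's word size.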
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add lookup/update/delete/iterate methods to BPF maps · db20fd2b
      Committed by Alexei Starovoitov
      'maps' is a generic storage of different types for sharing data between kernel
      and userspace.
      
      The maps are accessed from user space via BPF syscall, which has commands:
      
      - create a map with given type and attributes
        fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
        returns fd or negative error
      
      - lookup key in a given map referenced by fd
        err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->value
        returns zero and stores found elem into value or negative error
      
      - create or update key/value pair in a given map
        err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->value
        returns zero or negative error
      
      - find and delete element by key in a given map
        err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key
      
      - iterate map elements (based on input key return next_key)
        err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->next_key
      
      - close(fd) deletes the map
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: enable bpf syscall on x64 and i386 · 749730ce
      Committed by Alexei Starovoitov
      done as separate commit to ease conflict resolution
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: introduce BPF syscall and maps · 99c55f7d
      Committed by Alexei Starovoitov
      The BPF syscall is a multiplexor for a range of different operations on eBPF.
      This patch introduces the syscall with a single command to create a map.
      The next patch adds commands to access maps.
      
      'maps' is a generic storage of different types for sharing data between kernel
      and userspace.
      
      Userspace example:
      /* this syscall wrapper creates a map with given type and attributes
       * and returns map_fd on success.
       * use close(map_fd) to delete the map
       */
      int bpf_create_map(enum bpf_map_type map_type, int key_size,
                         int value_size, int max_entries)
      {
          union bpf_attr attr = {
              .map_type = map_type,
              .key_size = key_size,
              .value_size = value_size,
              .max_entries = max_entries
          };
      
          return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
      }
      
      'union bpf_attr' is backwards compatible with future extensions.
      
      More details in Documentation/networking/filter.txt and in manpage
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 26 Sep, 2014 2 commits
  8. 25 Sep, 2014 5 commits
    • irq: Export handle_fasteoi_irq · 4fdea267
      Committed by Vincent Stehlé
      Export handle_fasteoi_irq to be able to use it in e.g. the Zynq gpio driver
      since commit 6dd85950 ("gpio: zynq: Fix IRQ handlers").
      
      This fixes the following link issue:
      
        ERROR: "handle_fasteoi_irq" [drivers/gpio/gpio-zynq.ko] undefined!
      Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Vincent Stehle <vincent.stehle@laposte.net>
      Cc: Lars-Peter Clausen <lars@metafoo.de>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Link: http://lkml.kernel.org/r/1408663880-29179-1-git-send-email-vincent.stehle@laposte.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • SCHED: add some "wait..on_bit...timeout()" interfaces. · cbbce822
      Committed by NeilBrown
      In commit c1221321
         sched: Allow wait_on_bit_action() functions to support a timeout
      
      I suggested that a "wait_on_bit_timeout()" interface would not meet my
      need.  This isn't true - I was just over-engineering.
      
      Including a 'private' field in wait_bit_key instead of a focused
      "timeout" field was just premature generalization.  If some other
      use is ever found, it can be generalized or added later.
      
      So this patch renames "private" to "timeout", meaning "stop waiting
      when 'jiffies' reaches or passes 'timeout'", and adds two of the many
      possible wait..bit..timeout() interfaces:
      
      wait_on_page_bit_killable_timeout(), which is the one I want to use,
      and out_of_line_wait_on_bit_timeout() which is a reasonably general
      example.  Others can be added as needed.
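
      The "reaches or passes" condition has to survive counter wraparound. A
      minimal userspace sketch, assuming a comparison in the style of the
      kernel's time_after_eq() macro; the function name timeout_reached() is
      hypothetical and plain unsigned long values stand in for jiffies:

```c
#include <assert.h>
#include <stdbool.h>

/* The signed cast below assumes a two's-complement representation. */
static_assert(-1L == ~0L, "two's complement assumed");

/* Returns true once `now` has reached or passed `timeout`, even if the
 * tick counter wrapped around in between. The unsigned subtraction wraps
 * modulo 2^N; reading the result as signed gives the distance between the
 * two values, valid while they are less than half the counter range apart.
 */
static bool timeout_reached(unsigned long now, unsigned long timeout)
{
    return (long)(now - timeout) >= 0;
}
```

      A plain `now >= timeout` comparison would report an expired timeout as
      still pending whenever the counter wraps between setting the deadline
      and checking it.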
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
    • cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags · 2ad654bc
      Committed by Zefan Li
      When we change cpuset.memory_spread_{page,slab}, cpuset will flip
      PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
      This should be done using atomic bitops, but currently we don't,
      which is broken.
      
      Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
      when one thread tried to clear PF_USED_MATH while at the same time another
      thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
      the same task.
      
      Here's the full report:
      https://lkml.org/lkml/2014/9/19/230
      
      To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
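
      The race arises because a plain `flags |= bit` or `flags &= ~bit` is a
      separate load, modify and store, so two threads updating different bits
      of the same word can overwrite each other's change. The sketch below
      models the atomic alternative in userspace with C11 atomics. The struct,
      the flag names and the helpers are hypothetical; kernel code would use
      its own bit operations such as set_bit() and clear_bit() instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits; names and values are illustrative only. */
#define FLAG_SPREAD_PAGE (1UL << 0)
#define FLAG_SPREAD_SLAB (1UL << 1)

/* Stand-in for a task with a word reserved for atomically updated flags. */
struct task_model {
    atomic_ulong atomic_flags;
};

/* Set a bit with one atomic read-modify-write, so a concurrent update
 * to another bit in the same word cannot be lost.
 */
static void flag_set(struct task_model *t, unsigned long bit)
{
    assert(bit != 0);
    atomic_fetch_or(&t->atomic_flags, bit);
}

/* Clear a bit atomically. */
static void flag_clear(struct task_model *t, unsigned long bit)
{
    assert(bit != 0);
    atomic_fetch_and(&t->atomic_flags, ~bit);
}

/* Read the current state of a bit. */
static bool flag_test(struct task_model *t, unsigned long bit)
{
    return (atomic_load(&t->atomic_flags) & bit) != 0;
}
```

      Flags that are only ever changed by the owning task can stay in a
      non-atomic word; the ones that other contexts may flip are the ones that
      need this treatment.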
      
      v4:
      - updated mm/slab.c. (Fengguang Wu)
      - updated Documentation.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Kees Cook <keescook@chromium.org>
      Fixes: 950592f7 ("cpusets: update tasks' page/slab spread flags in time")
      Cc: <stable@vger.kernel.org> # 2.6.31+
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • Revert "PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()" · 5c4dd348
      Committed by Rafael J. Wysocki
      Revert commit 6efde38f (PM / Hibernate: Iterate over set bits
      instead of PFNs in swsusp_free()) that introduced a NULL pointer
      dereference during system resume from hibernation:
      
      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<ffffffff810a8cc1>] swsusp_free+0x21/0x190
      PGD b39c2067 PUD b39c1067 PMD 0
      Oops: 0000 [#1] SMP
      Modules linked in: <irrelevant list of modules>
      CPU: 1 PID: 4898 Comm: s2disk Tainted: G         C     3.17-rc5-amd64 #1 Debian 3.17~rc5-1~exp1
      Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011
      task: ffff88023155ea40 ti: ffff8800b3b14000 task.ti: ffff8800b3b14000
      RIP: 0010:[<ffffffff810a8cc1>]  [<ffffffff810a8cc1>] swsusp_free+0x21/0x190
      RSP: 0018:ffff8800b3b17ea8  EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff8800b39bab00 RCX: 0000000000000001
      RDX: ffff8800b39bab10 RSI: ffff8800b39bab00 RDI: 0000000000000000
      RBP: 0000000000000010 R08: 0000000000000000 R09: 0000000000000000
      R10: ffff8800b39bab10 R11: 0000000000000246 R12: ffffea0000000000
      R13: ffff880232f485a0 R14: ffff88023ac27cd8 R15: ffff880232927590
      FS:  00007f406d83b700(0000) GS:ffff88023bc80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000000 CR3: 00000000b3a62000 CR4: 00000000000007e0
      Stack:
       ffff8800b39bab00 0000000000000010 ffff880232927590 ffffffff810acb4a
       ffff8800b39bab00 ffffffff811a955a ffff8800b39bab10 0000000000000000
       ffff88023155f098 ffffffff81a6b8c0 ffff88023155ea40 0000000000000007
      Call Trace:
       [<ffffffff810acb4a>] ? snapshot_release+0x2a/0xb0
       [<ffffffff811a955a>] ? __fput+0xca/0x1d0
       [<ffffffff81080627>] ? task_work_run+0x97/0xd0
       [<ffffffff81012d89>] ? do_notify_resume+0x69/0xa0
       [<ffffffff8151452a>] ? int_signal+0x12/0x17
      Code: 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 54 48 8b 05 ba 62 9c 00 49 bc 00 00 00 00 00 ea ff ff 48 8b 3d a1 62 9c 00 55 53 <48> 8b 10 48 89 50 18 48 8b 52 20 48 c7 40 28 00 00 00 00 c7 40
      RIP  [<ffffffff810a8cc1>] swsusp_free+0x21/0x190
       RSP <ffff8800b3b17ea8>
      CR2: 0000000000000000
      ---[ end trace f02be86a1ec0cccb ]---
      
      due to forbidden_pages_map being NULL in swsusp_free().
      
      Fixes: 6efde38f "PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()"
      Reported-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • percpu_ref: add PERCPU_REF_INIT_* flags · 2aad2a86
      Committed by Tejun Heo
      With the recent addition of percpu_ref_reinit(), percpu_ref now can be
      used as a persistent switch which can be turned on and off repeatedly
      where turning off maps to killing the ref and waiting for it to drain;
      however, there currently isn't a way to initialize a percpu_ref in its
      off (killed and drained) state, which can be inconvenient for certain
      persistent switch use cases.
      
      Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
      selection of operation mode; however, currently a newly initialized
      percpu_ref is always in percpu mode making it impossible to avoid the
      latency overhead of switching to atomic mode.
      
      This patch adds @flags to percpu_ref_init() and implements the
      following flags.
      
      * PERCPU_REF_INIT_ATOMIC	: start ref in atomic mode
      * PERCPU_REF_INIT_DEAD		: start ref killed and drained
      
      These flags should be able to serve the above two use cases.
      
      v2: target_core_tpg.c conversion was missing.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
  9. 22 Sep, 2014 1 commit