1. 09 8月, 2014 22 次提交
    • V
      resource: provide new functions to walk through resources · 8c86e70a
      Vivek Goyal 提交于
      I have added two more functions to walk through resources.
      
      Currently walk_system_ram_range() deals with pfn and /proc/iomem can
      contain partial pages.  By dealing in pfn, callback function loses the
      info that last page of a memory range is a partial page and not the full
      page.  So I implemented walk_system_ram_res() which returns u64 values to
      callback functions and now it properly return start and end address.
      
      walk_system_ram_range() uses find_next_system_ram() to find the next ram
      resource.  This in turn only travels through siblings of top level child
      and does not travers through all the nodes of the resoruce tree.  I also
      need another function where I can walk through all the resources, for
      example figure out where "GART" aperture is.  Figure out where ACPI memory
      is.
      
      So I wrote another function walk_iomem_res() which walks through all
      /proc/iomem resources and returns matches as asked by caller.  Caller can
      specify "name" of resource, start and end and flags.
      
      Got rid of find_next_system_ram_res() and instead implemented more generic
      find_next_iomem_res() which can be used to traverse top level children
      only based on an argument.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8c86e70a
    • V
      kexec: use common function for kimage_normal_alloc() and kimage_crash_alloc() · 255aedd9
      Vivek Goyal 提交于
      kimage_normal_alloc() and kimage_crash_alloc() are doing lot of similar
      things and differ only little.  So instead of having two separate
      functions create a common function kimage_alloc_init() and pass it the
      "flags" argument which tells whether it is normal kexec or kexec_on_panic.
       And this function should be able to deal with both the cases.
      
      This consolidation also helps later where we can use a common function
      kimage_file_alloc_init() to handle normal and crash cases for new file
      based kexec syscall.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      255aedd9
    • V
      kexec: move segment verification code in a separate function · dabe7862
      Vivek Goyal 提交于
      Previously do_kimage_alloc() will allocate a kimage structure, copy
      segment list from user space and then do the segment list sanity
      verification.
      
      Break down this function in 3 parts.  do_kimage_alloc_init() to do actual
      allocation and basic initialization of kimage structure.
      copy_user_segment_list() to copy segment list from user space and
      sanity_check_segment_list() to verify the sanity of segment list as passed
      by user space.
      
      In later patches, I need to only allocate kimage and not copy segment list
      from user space.  So breaking down in smaller functions enables re-use of
      code at other places.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dabe7862
    • V
      kexec: rename unusebale_pages to unusable_pages · 7d3e2bca
      Vivek Goyal 提交于
      Let's use the more common "unusable".
      
      This patch was originally written and posted by Boris. I am including it
      in this patch series.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d3e2bca
    • V
      bin2c: move bin2c in scripts/basic · 8370edea
      Vivek Goyal 提交于
      This patch series does not do kernel signature verification yet.  I plan
      to post another patch series for that.  Now distributions are already
      signing PE/COFF bzImage with PKCS7 signature I plan to parse and verify
      those signatures.
      
      Primary goal of this patchset is to prepare groundwork so that kernel
      image can be signed and signatures be verified during kexec load.  This
      should help with two things.
      
      - It should allow kexec/kdump on secureboot enabled machines.
      
      - In general it can help even without secureboot. By being able to verify
        kernel image signature in kexec, it should help with avoiding module
        signing restrictions. Matthew Garret showed how to boot into a custom
        kernel, modify first kernel's memory and then jump back to old kernel and
        bypass any policy one wants to.
      
      This patch (of 15):
      
      Kexec wants to use bin2c and it wants to use it really early in the build
      process. See arch/x86/purgatory/ code in later patches.
      
      So move bin2c in scripts/basic so that it can be built very early and
      be usable by arch/x86/purgatory/
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8370edea
    • D
      shm: add memfd_create() syscall · 9183df25
      David Herrmann 提交于
      memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
      that you can pass to mmap().  It can support sealing and avoids any
      connection to user-visible mount-points.  Thus, it's not subject to quotas
      on mounted file-systems, but can be used like malloc()'ed memory, but with
      a file-descriptor to it.
      
      memfd_create() returns the raw shmem file, so calls like ftruncate() can
      be used to modify the underlying inode.  Also calls like fstat() will
      return proper information and mark the file as regular file.  If you want
      sealing, you can specify MFD_ALLOW_SEALING.  Otherwise, sealing is not
      supported (like on all other regular files).
      
      Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
      subject to a filesystem size limit.  It is still properly accounted to
      memcg limits, though, and to the same overcommit or no-overcommit
      accounting as all user memory.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9183df25
    • D
      mm: allow drivers to prevent new writable mappings · 4bb5f5d9
      David Herrmann 提交于
      This patch (of 6):
      
      The i_mmap_writable field counts existing writable mappings of an
      address_space.  To allow drivers to prevent new writable mappings, make
      this counter signed and prevent new writable mappings if it is negative.
      This is modelled after i_writecount and DENYWRITE.
      
      This will be required by the shmem-sealing infrastructure to prevent any
      new writable mappings after the WRITE seal has been set.  In case there
      exists a writable mapping, this operation will fail with EBUSY.
      
      Note that we rely on the fact that iff you already own a writable mapping,
      you can increase the counter without using the helpers.  This is the same
      that we do for i_writecount.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4bb5f5d9
    • I
    • J
      shm: make exit_shm work proportional to task activity · ab602f79
      Jack Miller 提交于
      This is small set of patches our team has had kicking around for a few
      versions internally that fixes tasks getting hung on shm_exit when there
      are many threads hammering it at once.
      
      Anton wrote a simple test to cause the issue:
      
        http://ozlabs.org/~anton/junkcode/bust_shm_exit.c
      
      Before applying this patchset, this test code will cause either hanging
      tracebacks or pthread out of memory errors.
      
      After this patchset, it will still produce output like:
      
        root@somehost:~# ./bust_shm_exit 1024 160
        ...
        INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113)
        INFO: Stall ended before state dump start
        ...
      
      But the task will continue to run along happily, so we consider this an
      improvement over hanging, even if it's a bit noisy.
      
      This patch (of 3):
      
      exit_shm obtains the ipc_ns shm rwsem for write and holds it while it
      walks every shared memory segment in the namespace.  Thus the amount of
      work is related to the number of shm segments in the namespace not the
      number of segments that might need to be cleaned.
      
      In addition, this occurs after the task has been notified the thread has
      exited, so the number of tasks waiting for the ns shm rwsem can grow
      without bound until memory is exausted.
      
      Add a list to the task struct of all shmids allocated by this task.  Init
      the list head in copy_process.  Use the ns->rwsem for locking.  Add
      segments after id is added, remove before removing from id.
      
      On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar
      to handling of semaphore undo.
      
      I chose a define for the init sequence since its a simple list init,
      otherwise it would require a function call to avoid include loops between
      the semaphore code and the task struct.  Converting the list_del to
      list_del_init for the unshare cases would remove the exit followed by
      init, but I left it blow up if not inited.
      Signed-off-by: NMilton Miller <miltonm@bga.com>
      Signed-off-by: NJack Miller <millerjo@us.ibm.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab602f79
    • J
      panic: add TAINT_SOFTLOCKUP · 69361eef
      Josh Hunt 提交于
      This taint flag will be set if the system has ever entered a softlockup
      state.  Similar to TAINT_WARN it is useful to know whether or not the
      system has been in a softlockup state when debugging.
      
      [akpm@linux-foundation.org: apply the taint before calling panic()]
      Signed-off-by: NJosh Hunt <johunt@akamai.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69361eef
    • F
      kernel/gcov/fs.c: remove unnecessary null test before debugfs_remove · 834b18b2
      Fabian Frederick 提交于
      This fixes checkpatch warning:
      
        WARNING: debugfs_remove(NULL) is safe this check is probably not required
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      834b18b2
    • V
      kernel/fork.c: make mm_init_owner static · 33144e84
      Vladimir Davydov 提交于
      It's only used in fork.c:mm_init().
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33144e84
    • V
      fork: copy mm's vm usage counters under mmap_sem · 4f7d4614
      Vladimir Davydov 提交于
      If a forking process has a thread calling (un)mmap (silly but still),
      the child process may have some of its mm's vm usage counters (total_vm
      and friends) screwed up, because currently they are copied from oldmm
      w/o holding any locks (memcpy in dup_mm).
      
      This patch moves the counters initialization to dup_mmap() to be called
      under oldmm->mmap_sem, which eliminates any possibility of race.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f7d4614
    • V
      fork: reset mm->pinned_vm · ce65cefa
      Vladimir Davydov 提交于
      mm->pinned_vm counts pages of mm's address space that were permanently
      pinned in memory by increasing their reference counter. The counter was
      introduced by commit bc3e53f6 ("mm: distinguish between mlocked and
      pinned pages"), while before it locked_vm had been used for such pages.
      
      Obviously, we should reset the counter on fork if !CLONE_VM, just like
      we do with locked_vm, but currently we don't. Let's fix it.
      
      This patch will fix the contents of /proc/pid/status:VmPin.
      
      ib_umem_get[infiniband] and perf_mmap still check pinned_vm against
      RLIMIT_MEMLOCK.  It's left from the times when pinned pages were accounted
      under locked_vm, but today it looks wrong.  It isn't clear how we should
      deal with it.
      
      We still have some drivers accounting pinned pages under mm->locked_vm -
      this is what commit bc3e53f6 was fighting against.  It's
      infiniband/usnic and vfio.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce65cefa
    • V
      fork/exec: cleanup mm initialization · 41f727fd
      Vladimir Davydov 提交于
      mm initialization on fork/exec is spread all over the place, which makes
      the code look inconsistent.
      
      We have mm_init(), which is supposed to init/nullify mm's internals, but
      it doesn't init all the fields it should:
      
       - on fork ->mmap,mm_rb,vmacache_seqnum,map_count,mm_cpumask,locked_vm
         are zeroed in dup_mmap();
      
       - on fork ->pmd_huge_pte is zeroed in dup_mm(), immediately before
         calling mm_init();
      
       - ->cpu_vm_mask_var ptr is initialized by mm_init_cpumask(), which is
         called before mm_init() on both fork and exec;
      
       - ->context is initialized by init_new_context(), which is called after
         mm_init() on both fork and exec;
      
      Let's consolidate all the initializations in mm_init() to make the code
      look cleaner.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41f727fd
    • F
      proc: constify seq_operations · ccf94f1b
      Fabian Frederick 提交于
      proc_uid_seq_operations, proc_gid_seq_operations and
      proc_projid_seq_operations are only called in proc_id_map_open with
      seq_open as const struct seq_operations so we can constify the 3
      structures and update proc_id_map_open prototype.
      
         text    data     bss     dec     hex filename
         6817     404    1984    9205    23f5 kernel/user_namespace.o-before
         6913     308    1984    9205    23f5 kernel/user_namespace.o-after
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ccf94f1b
    • I
      kernel/exit.c: fix coding style warnings and errors · a0be55de
      Ionut Alexa 提交于
      Fixed coding style warnings and errors.
      Signed-off-by: NIonut Alexa <ionut.m.alexa@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0be55de
    • F
      kernel/test_kprobes.c: use current logging functions · 4878b14b
      Fabian Frederick 提交于
      - Add pr_fmt
      - Coalesce formats
      - Use current pr_foo() functions instead of printk
      - Remove unnecessary "failed" display (already in log level).
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4878b14b
    • N
      kernel/kallsyms.c: fix %pB when there's no symbol at the address · b86280aa
      Namhyung Kim 提交于
      __sprint_symbol() should restore original address when kallsyms_lookup()
      failed to find a symbol.  It's reported when dumpstack shows an address in
      a dynamically allocated trampoline for ftrace.
      
        [ 1314.612287]  [<ffffffff81700312>] dump_stack+0x45/0x56
        [ 1314.612290]  [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
        [ 1314.612293]  [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
        [ 1314.612306]  [<ffffffffa00160c4>] 0xffffffffa00160c3
      
      You can see a difference in the hex address - c4 and c3.  Fix it.
      Signed-off-by: NNamhyung Kim <namhyung@kernel.org>
      Reported-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b86280aa
    • V
      page-cgroup: get rid of NR_PCG_FLAGS · 9a3f4d85
      Vladimir Davydov 提交于
      It's not used anywhere today, so let's remove it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a3f4d85
    • J
      mm: memcontrol: use page lists for uncharge batching · 747db954
      Johannes Weiner 提交于
      Pages are now uncharged at release time, and all sources of batched
      uncharges operate on lists of pages.  Directly use those lists, and
      get rid of the per-task batching state.
      
      This also batches statistics accounting, in addition to the res
      counter charges, to reduce IRQ-disabling and re-enabling.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      747db954
    • J
      mm: memcontrol: rewrite charge API · 00501b53
      Johannes Weiner 提交于
      These patches rework memcg charge lifetime to integrate more naturally
      with the lifetime of user pages.  This drastically simplifies the code and
      reduces charging and uncharging overhead.  The most expensive part of
      charging and uncharging is the page_cgroup bit spinlock, which is removed
      entirely after this series.
      
      Here are the top-10 profile entries of a stress test that reads a 128G
      sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
       executing in the root memcg).  Before:
      
          15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.31%              cat  [kernel.kallsyms]   [k] memset
          11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
           4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.38%              cat  [kernel.kallsyms]   [k] put_page
           2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
           2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
           1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
      
      After:
      
          15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.48%           cat  [kernel.kallsyms]   [k] memset
          11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
           3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.46%           cat  [kernel.kallsyms]   [k] put_page
           2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
           1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
           1.30%           cat  [kernel.kallsyms]   [k] kfree
      
      As you can see, the memcg footprint has shrunk quite a bit.
      
         text    data     bss     dec     hex filename
        37970    9892     400   48262    bc86 mm/memcontrol.o.old
        35239    9892     400   45531    b1db mm/memcontrol.o
      
      This patch (of 4):
      
      The memcg charge API charges pages before they are rmapped - i.e.  have an
      actual "type" - and so every callsite needs its own set of charge and
      uncharge functions to know what type is being operated on.  Worse,
      uncharge has to happen from a context that is still type-specific, rather
      than at the end of the page's lifetime with exclusive access, and so
      requires a lot of synchronization.
      
      Rewrite the charge API to provide a generic set of try_charge(),
      commit_charge() and cancel_charge() transaction operations, much like
      what's currently done for swap-in:
      
        mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
        pages from the memcg if necessary.
      
        mem_cgroup_commit_charge() commits the page to the charge once it
        has a valid page->mapping and PageAnon() reliably tells the type.
      
        mem_cgroup_cancel_charge() aborts the transaction.
      
      This reduces the charge API and enables subsequent patches to
      drastically simplify uncharging.
      
      As pages need to be committed after rmap is established but before they
      are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
      additions again.  Revive lru_cache_add_active_or_unevictable().
      
      [hughd@google.com: fix shmem_unuse]
      [hughd@google.com: Add comments on the private use of -EAGAIN]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00501b53
  2. 07 8月, 2014 16 次提交
    • N
      kernel/printk/printk.c: fix bool assignements · d25d9fec
      Neil Zhang 提交于
      Fix coccinelle warnings.
      Signed-off-by: NNeil Zhang <zhangwm@marvell.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d25d9fec
    • J
      printk: enable interrupts before calling console_trylock_for_printk() · 5874af20
      Jan Kara 提交于
      We need interrupts disabled when calling console_trylock_for_printk()
      only so that cpu id we pass to can_use_console() remains valid (for
      other things console_sem provides all the exclusion we need and
      deadlocks on console_sem due to interrupts are impossible because we use
      down_trylock()).  However if we are rescheduled, we are guaranteed to
      run on an online cpu so we can easily just get the cpu id in
      can_use_console().
      
      We can lose a bit of performance when we enable interrupts in
      vprintk_emit() and then disable them again in console_unlock() but OTOH
      it can somewhat reduce interrupt latency caused by console_unlock().
      
      We differ from (reverted) commit 939f04be in that we avoid calling
      console_unlock() from vprintk_emit() with lockdep enabled as that has
      unveiled quite some bugs leading to system freezes during boot (e.g.
        https://lkml.org/lkml/2014/5/30/242,
        https://lkml.org/lkml/2014/6/28/521).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Tested-by: NAndreas Bombe <aeb@debian.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5874af20
    • A
      printk: miscellaneous cleanups · 249771b8
      Alex Elder 提交于
      Some small cleanups to kernel/printk/printk.c.  None of them should
      cause any change in behavior.
      
        - When CONFIG_PRINTK is defined, parenthesize the value of LOG_LINE_MAX.
        - When CONFIG_PRINTK is *not* defined, there is an extra LOG_LINE_MAX
          definition; delete it.
        - Pull an assignment out of a conditional expression in console_setup().
        - Use isdigit() in console_setup() rather than open coding it.
        - In update_console_cmdline(), drop a NUL-termination assignment;
          the strlcpy() call that precedes it guarantees it's not needed.
        - Simplify some logic in printk_timed_ratelimit().
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Reviewed-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      249771b8
    • A
      printk: use a clever macro · e99aa461
      Alex Elder 提交于
      Use the IS_ENABLED() macro rather than #ifdef blocks to set certain
      global values.
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Acked-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e99aa461
    • A
      printk: fix some comments · 0b90fec3
      Alex Elder 提交于
      Fix a few comments that don't accurately describe their corresponding
      code.  It also fixes some minor typographical errors.
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Reviewed-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b90fec3
    • A
      printk: rename DEFAULT_MESSAGE_LOGLEVEL · 42a9dc0b
      Alex Elder 提交于
      Commit a8fe19eb ("kernel/printk: use symbolic defines for console
      loglevels") makes consistent use of symbolic values for printk() log
      levels.
      
      The naming scheme used is different from the one used for
      DEFAULT_MESSAGE_LOGLEVEL though.  Change that symbol name to be
      MESSAGE_LOGLEVEL_DEFAULT for consistency.  And because the value of that
      symbol comes from a similarly-named config option, rename
      CONFIG_DEFAULT_MESSAGE_LOGLEVEL as well.
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42a9dc0b
    • A
      printk: tweak do_syslog() to match comments · e97e1267
      Alex Elder 提交于
      In do_syslog() there's a path used by kmsg_poll() and kmsg_read() that
      only needs to know whether there's any data available to read (and not
      its size).  These callers only check for non-zero return.  As a
      shortcut, do_syslog() returns the difference between what has been
      logged and what has been "seen."
      
      The comments say that the "count of records" should be returned but it's
      not.  Instead it returns (log_next_idx - syslog_idx), which is a
      difference between buffer offsets--and the result could be negative.
      
      The behavior is the same (it'll be zero or not in the same cases), but
      the count of records is more meaningful and it matches what the comments
      say.  So change the code to return that.
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e97e1267
    • L
      printk: allow increasing the ring buffer depending on the number of CPUs · 23b2899f
      Luis R. Rodriguez 提交于
      The default size of the ring buffer is too small for machines with a
      large amount of CPUs under heavy load.  What ends up happening when
      debugging is the ring buffer overlaps and chews up old messages making
      debugging impossible unless the size is passed as a kernel parameter.
      An idle system upon boot up will on average spew out only about one or
      two extra lines but where this really matters is on heavy load and that
      will vary widely depending on the system and environment.
      
      There are mechanisms to help increase the kernel ring buffer for tracing
      through debugfs, and those interfaces even allow growing the kernel ring
      buffer per CPU.  We also have a static value which can be passed upon
      boot.  Relying on debugfs however is not ideal for production, and
      relying on the value passed upon bootup is can only used *after* an
      issue has creeped up.  Instead of being reactive this adds a proactive
      measure which lets you scale the amount of contributions you'd expect to
      the kernel ring buffer under load by each CPU in the worst case
      scenario.
      
      We use num_possible_cpus() to avoid complexities which could be
      introduced by dynamically changing the ring buffer size at run time,
      num_possible_cpus() lets us use the upper limit on possible number of
      CPUs therefore avoiding having to deal with hotplugging CPUs on and off.
      This introduces the kernel configuration option LOG_CPU_MAX_BUF_SHIFT
      which is used to specify the maximum amount of contributions to the
      kernel ring buffer in the worst case before the kernel ring buffer flips
      over, the size is specified as a power of 2.  The total amount of
      contributions made by each CPU must be greater than half of the default
      kernel ring buffer size (1 << LOG_BUF_SHIFT bytes) in order to trigger
      an increase upon bootup.  The kernel ring buffer is increased to the
      next power of two that would fit the required minimum kernel ring buffer
      size plus the additional CPU contribution.  For example if LOG_BUF_SHIFT
      is 18 (256 KB) you'd require at least 128 KB contributions by other CPUs
      in order to trigger an increase of the kernel ring buffer.  With a
      LOG_CPU_BUF_SHIFT of 12 (4 KB) you'd require at least anything over > 64
      possible CPUs to trigger an increase.  If you had 128 possible CPUs the
      amount of minimum required kernel ring buffer bumps to:
      
         ((1 << 18) + ((128 - 1) * (1 << 12))) / 1024 = 764 KB
      
      Since we require the ring buffer to be a power of two the new required
      size would be 1024 KB.
      
      This CPU contributions are ignored when the "log_buf_len" kernel
      parameter is used as it forces the exact size of the ring buffer to an
      expected power of two value.
      
      [pmladek@suse.cz: fix build]
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Signed-off-by: NPetr Mladek <pmladek@suse.cz>
      Tested-by: NDavidlohr Bueso <davidlohr@hp.com>
      Tested-by: NPetr Mladek <pmladek@suse.cz>
      Reviewed-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23b2899f
    • L
      printk: make dynamic units clear for the kernel ring buffer · f5405172
      Luis R. Rodriguez 提交于
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Suggested-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5405172
    • L
      printk: move power of 2 practice of ring buffer size to a helper · c0a318a3
      Luis R. Rodriguez 提交于
      In practice the power of 2 practice of the size of the kernel ring
      buffer remains purely historical but not a requirement, specially now
      that we have LOG_ALIGN and use it for both static and dynamic
      allocations.  It could have helped with implicit alignment back in the
      days given the even the dynamically sized ring buffer was guaranteed to
      be aligned so long as CONFIG_LOG_BUF_SHIFT was set to produce a
      __LOG_BUF_LEN which is architecture aligned, since log_buf_len=n would
      be allowed only if it was > __LOG_BUF_LEN and we always ended up
      rounding the log_buf_len=n to the next power of 2 with
      roundup_pow_of_two(), any multiple of 2 then should be also architecture
      aligned.  These assumptions of course relied heavily on
      CONFIG_LOG_BUF_SHIFT producing an aligned value but users can always
      change this.
      
      We now have precise alignment requirements set for the log buffer size
      for both static and dynamic allocations, but lets upkeep the old
      practice of using powers of 2 for its size to help with easy expected
      scalable values and the allocators for dynamic allocations.  We'll reuse
      this later so move this into a helper.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0a318a3
    • L
      printk: make dynamic kernel ring buffer alignment explicit · 70300177
      Luis R. Rodriguez 提交于
      We have to consider alignment for the ring buffer both for the default
      static size, and then also for when an dynamic allocation is made when
      the log_buf_len=n kernel parameter is passed to set the size
      specifically to a size larger than the default size set by the
      architecture through CONFIG_LOG_BUF_SHIFT.
      
      The default static kernel ring buffer can be aligned properly if
      architectures set CONFIG_LOG_BUF_SHIFT properly, we provide ranges for
      the size though so even if CONFIG_LOG_BUF_SHIFT has a sensible aligned
      value it can be reduced to a non aligned value.  Commit 6ebb017d
      ("printk: Fix alignment of buf causing crash on ARM EABI") by Andrew
      Lunn ensures the static buffer is always aligned and the decision of
      alignment is done by the compiler by using __alignof__(struct log).
      
      When log_buf_len=n is used we allocate the ring buffer dynamically.
      Dynamic allocation varies, for the early allocation called before
      setup_arch() memblock_virt_alloc() requests a page aligment and for the
      default kernel allocation memblock_virt_alloc_nopanic() requests no
      special alignment, which in turn ends up aligning the allocation to
      SMP_CACHE_BYTES, which is L1 cache aligned.
      
      Since we already have the required alignment for the kernel ring buffer
      though we can do better and request explicit alignment for LOG_ALIGN.
      This does that to be safe and make dynamic allocation alignment
      explicit.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Tested-by: NPetr Mladek <pmladek@suse.cz>
      Acked-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70300177
    • S
      kernel/smp.c:on_each_cpu_cond(): fix warning in fallback path · 618fde87
      Sasha Levin 提交于
      The rarely-executed memry-allocation-failed callback path generates a
      WARN_ON_ONCE() when smp_call_function_single() succeeds.  Presumably
      it's supposed to warn on failures.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      618fde87
    • D
      mm, oom: remove unnecessary exit_state check · fb794bcb
      David Rientjes 提交于
      The oom killer scans each process and determines whether it is eligible
      for oom kill or whether the oom killer should abort because of
      concurrent memory freeing.  It will abort when an eligible process is
      found to have TIF_MEMDIE set, meaning it has already been oom killed and
      we're waiting for it to exit.
      
      Processes with task->mm == NULL should not be considered because they
      are either kthreads or have already detached their memory and killing
      them would not lead to memory freeing.  That memory is only freed after
      exit_mm() has returned, however, and not when task->mm is first set to
      NULL.
      
      Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
      is no longer considered for oom kill, but only until exit_mm() has
      returned.  This was fragile in the past because it relied on
      exit_notify() to be reached before no longer considering TIF_MEMDIE
      processes.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb794bcb
    • D
      mm, hugetlb: remove hugetlb_zero and hugetlb_infinity · ed4d4902
      David Rientjes 提交于
      They are unnecessary: "zero" can be used in place of "hugetlb_zero" and
      passing extra2 == NULL is equivalent to infinity.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NLuiz Capitulino <lcapitulino@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed4d4902
    • F
      kernel/watchdog.c: convert printk/pr_warning to pr_foo() · 656c3b79
      Fabian Frederick 提交于
      Replace some obsolete functions.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      656c3b79
    • F
      kernel/auditfilter.c: replace count*size kmalloc by kcalloc · bab5e2d6
      Fabian Frederick 提交于
      kcalloc manages count*sizeof overflow.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bab5e2d6
  3. 03 8月, 2014 2 次提交
    • A
      net: filter: split 'struct sk_filter' into socket and bpf parts · 7ae457c1
      Alexei Starovoitov 提交于
      clean up names related to socket filtering and bpf in the following way:
      - everything that deals with sockets keeps 'sk_*' prefix
      - everything that is pure BPF is changed to 'bpf_*' prefix
      
      split 'struct sk_filter' into
      struct sk_filter {
      	atomic_t        refcnt;
      	struct rcu_head rcu;
      	struct bpf_prog *prog;
      };
      and
      struct bpf_prog {
              u32                     jited:1,
                                      len:31;
              struct sock_fprog_kern  *orig_prog;
              unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                                  const struct bpf_insn *filter);
              union {
                      struct sock_filter      insns[0];
                      struct bpf_insn         insnsi[0];
                      struct work_struct      work;
              };
      };
      so that 'struct bpf_prog' can be used independent of sockets and cleans up
      'unattached' bpf use cases
      
      split SK_RUN_FILTER macro into:
          SK_RUN_FILTER to be used with 'struct sk_filter *' and
          BPF_PROG_RUN to be used with 'struct bpf_prog *'
      
      __sk_filter_release(struct sk_filter *) gains
      __bpf_prog_release(struct bpf_prog *) helper function
      
      also perform related renames for the functions that work
      with 'struct bpf_prog *', since they're on the same lines:
      
      sk_filter_size -> bpf_prog_size
      sk_filter_select_runtime -> bpf_prog_select_runtime
      sk_filter_free -> bpf_prog_free
      sk_unattached_filter_create -> bpf_prog_create
      sk_unattached_filter_destroy -> bpf_prog_destroy
      sk_store_orig_filter -> bpf_prog_store_orig_filter
      sk_release_orig_filter -> bpf_release_orig_filter
      __sk_migrate_filter -> bpf_migrate_filter
      __sk_prepare_filter -> bpf_prepare_filter
      
      API for attaching classic BPF to a socket stays the same:
      sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
      and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
      which is used by sockets, tun, af_packet
      
      API for 'unattached' BPF programs becomes:
      bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
      and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
      which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ae457c1
    • A
      net: filter: rename sk_convert_filter() -> bpf_convert_filter() · 8fb575ca
      Alexei Starovoitov 提交于
      to indicate that this function is converting classic BPF into eBPF
      and not related to sockets
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8fb575ca