1. 10 Jun 2018 (1 commit)
  2. 08 Jun 2018 (3 commits)
    •
      kernel/hung_task.c: show all hung tasks before panic · 401c636a
      Committed by Tetsuo Handa
      When we get a hung task it can often be valuable to see _all_ the hung
      tasks on the system before calling panic().
      
      Quoting from https://syzkaller.appspot.com/text?tag=CrashReport&id=5316056503549952
      ----------------------------------------
      INFO: task syz-executor0:6540 blocked for more than 120 seconds.
            Not tainted 4.16.0+ #13
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      syz-executor0   D23560  6540   4521 0x80000004
      Call Trace:
       context_switch kernel/sched/core.c:2848 [inline]
       __schedule+0x8fb/0x1ef0 kernel/sched/core.c:3490
       schedule+0xf5/0x430 kernel/sched/core.c:3549
       schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3607
       __mutex_lock_common kernel/locking/mutex.c:833 [inline]
       __mutex_lock+0xb7f/0x1810 kernel/locking/mutex.c:893
       mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
       lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355
       __blkdev_driver_ioctl block/ioctl.c:303 [inline]
       blkdev_ioctl+0x1759/0x1e00 block/ioctl.c:601
       ioctl_by_bdev+0xa5/0x110 fs/block_dev.c:2060
       isofs_get_last_session fs/isofs/inode.c:567 [inline]
       isofs_fill_super+0x2ba9/0x3bc0 fs/isofs/inode.c:660
       mount_bdev+0x2b7/0x370 fs/super.c:1119
       isofs_mount+0x34/0x40 fs/isofs/inode.c:1560
       mount_fs+0x66/0x2d0 fs/super.c:1222
       vfs_kern_mount.part.26+0xc6/0x4a0 fs/namespace.c:1037
       vfs_kern_mount fs/namespace.c:2514 [inline]
       do_new_mount fs/namespace.c:2517 [inline]
       do_mount+0xea4/0x2b90 fs/namespace.c:2847
       ksys_mount+0xab/0x120 fs/namespace.c:3063
       SYSC_mount fs/namespace.c:3077 [inline]
       SyS_mount+0x39/0x50 fs/namespace.c:3074
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7
      (...snipped...)
      Showing all locks held in the system:
      (...snipped...)
      2 locks held by syz-executor0/6540:
       #0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: alloc_super fs/super.c:211 [inline]
       #0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: sget_userns+0x3b2/0xe60 fs/super.c:502 /* down_write_nested(&s->s_umount, SINGLE_DEPTH_NESTING); */
       #1: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 /* mutex_lock_nested(&lo->lo_ctl_mutex, 1); */
      (...snipped...)
      3 locks held by syz-executor7/6541:
       #0: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 /* mutex_lock_nested(&lo->lo_ctl_mutex, 1); */
       #1: 000000007bf3d3f9 (&bdev->bd_mutex){+.+.}, at: blkdev_reread_part+0x1e/0x40 block/ioctl.c:192
       #2: 00000000566d4c39 (&type->s_umount_key#50){.+.+}, at: __get_super.part.10+0x1d3/0x280 fs/super.c:663 /* down_read(&sb->s_umount); */
      ----------------------------------------
      
      When reporting an AB-BA deadlock like shown above, it would be nice if
      trace of PID=6541 is printed as well as trace of PID=6540 before calling
      panic().
      
      Showing hung tasks up to /proc/sys/kernel/hung_task_warnings could delay
      calling panic() but normally there should not be so many hung tasks.
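      The walk described above can be sketched in plain C. This is a hypothetical user-space simulation of the idea (report every hung task, bounded by a hung_task_warnings-style budget, before panicking); the names and types are illustrative, not the kernel's actual API:

```c
#include <assert.h>
#include <stdio.h>

struct task { int pid; int hung; };

/* Returns how many hung tasks were reported before the (simulated) panic.
 * Mirrors the ordering the patch wants: show all hung tasks first, up to
 * the warnings budget, then panic once at the end. */
int report_hung_tasks(const struct task *tasks, int ntasks, int warnings_budget)
{
    int reported = 0;

    for (int i = 0; i < ntasks; i++) {
        if (!tasks[i].hung)
            continue;
        if (warnings_budget > 0) {      /* like sysctl hung_task_warnings */
            printf("INFO: task pid=%d blocked\n", tasks[i].pid);
            warnings_budget--;
            reported++;
        }
    }
    /* panic() would be called here, after all tasks were shown */
    return reported;
}
```

The real check_hung_uninterruptible_tasks() walks the kernel task list under RCU; the point here is only the ordering: all reports first, panic last.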
      
      Link: http://lkml.kernel.org/r/201804050705.BHE57833.HVFOFtSOMQJFOL@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: split page_type out from _mapcount · 6e292b9b
      Committed by Matthew Wilcox
      We're already using a union of many fields here, so stop abusing the
      _mapcount and make page_type its own field.  That implies renaming some of
      the machinery that creates PageBuddy, PageBalloon and PageKmemcg; bring
      back the PG_buddy, PG_balloon and PG_kmemcg names.
      
      As suggested by Kirill, make page_type a bitmask.  Because it starts out
      life as -1 (thanks to sharing the storage with _mapcount), setting a page
      flag means clearing the appropriate bit.  This gives us space for probably
      twenty or so extra bits (depending how paranoid we want to be about
      _mapcount underflow).
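      The inverted-bitmask trick can be illustrated with a small stand-alone sketch. The constants mirror the kernel's PAGE_TYPE_BASE scheme, but the helpers here are simplified and hypothetical:

```c
#include <assert.h>

/* Simplified model of the inverted page_type bitmask: the field starts out
 * as -1 (all bits set, because it shares storage with _mapcount), so
 * "setting" a page type means clearing its bit. */
#define PAGE_TYPE_BASE  0xf0000000u
#define PG_buddy        0x00000080u
#define PG_balloon      0x00000100u

static unsigned int page_type_init(void)
{
    return (unsigned int)-1;            /* fresh page: all bits set */
}

/* A type is "set" when the base pattern is intact and the flag bit is clear. */
static int page_type_has(unsigned int page_type, unsigned int flag)
{
    return (page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE;
}

static unsigned int page_type_set(unsigned int page_type, unsigned int flag)
{
    return page_type & ~flag;           /* clear the bit to set the type */
}
```

Because only a few high bits form the base pattern, the remaining low bits stay available as a sanity margin against _mapcount underflow, as the text notes.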
      
      Link: http://lkml.kernel.org/r/20180518194519.3820-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct · 88aa7cc6
      Committed by Yang Shi
      mmap_sem is on the kernel's hot path and is heavily contended, but it
      is also abused.  It is used to protect arg_start|end and env_start|end
      when reading /proc/$PID/cmdline and /proc/$PID/environ, but that makes
      little sense: those proc files just expect to read four values
      atomically, the values are unrelated to VM, and they can be set to
      arbitrary values by C/R (checkpoint/restore).

      And the mmap_sem contention may cause unexpected issues like the one
      below:
      
      INFO: task ps:14018 blocked for more than 120 seconds.
             Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
      message.
       ps              D    0 14018      1 0x00000004
       Call Trace:
         schedule+0x36/0x80
         rwsem_down_read_failed+0xf0/0x150
         call_rwsem_down_read_failed+0x18/0x30
         down_read+0x20/0x40
         proc_pid_cmdline_read+0xd9/0x4e0
         __vfs_read+0x37/0x150
         vfs_read+0x96/0x130
         SyS_read+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xc5
      
      Both Alexey Dobriyan and Michal Hocko suggested using a dedicated lock
      for these fields to mitigate the abuse of mmap_sem.

      So, introduce a new spinlock in mm_struct to protect concurrent access
      to arg_start|end and env_start|end, and switch prctl from taking
      mmap_sem for write to taking it for read, which still guards the race
      with sys_brk that might break check_data_rlimit() and makes prctl
      friendlier to other VM operations.

      This patch just eliminates the abuse of mmap_sem; it cannot resolve the
      above hung-task warning completely, since the later access_remote_vm()
      call still needs to acquire mmap_sem.  The mmap_sem scalability issue
      will be solved in the future.
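      As a rough user-space analogy of the change (hypothetical names, with a C11 atomic_flag standing in for the kernel spinlock): a tiny dedicated lock protects only the four values, so readers no longer need the heavyweight mmap_sem at all:

```c
#include <assert.h>
#include <stdatomic.h>

struct mm_demo {
    atomic_flag lock;                   /* plays the role of mm->arg_lock */
    unsigned long arg_start, arg_end, env_start, env_end;
};

static void demo_lock(struct mm_demo *mm)
{
    while (atomic_flag_test_and_set_explicit(&mm->lock, memory_order_acquire))
        ;                               /* spin until the writer is done */
}

static void demo_unlock(struct mm_demo *mm)
{
    atomic_flag_clear_explicit(&mm->lock, memory_order_release);
}

/* prctl-side writer: all four values change together under the lock. */
static void set_mm_args(struct mm_demo *mm, unsigned long as, unsigned long ae,
                        unsigned long es, unsigned long ee)
{
    demo_lock(mm);
    mm->arg_start = as; mm->arg_end = ae;
    mm->env_start = es; mm->env_end = ee;
    demo_unlock(mm);
}

/* /proc-side reader: snapshots the four values consistently. */
static void get_mm_args(struct mm_demo *mm, unsigned long out[4])
{
    demo_lock(mm);
    out[0] = mm->arg_start; out[1] = mm->arg_end;
    out[2] = mm->env_start; out[3] = mm->env_end;
    demo_unlock(mm);
}
```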
      
      [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
        Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 07 Jun 2018 (1 commit)
    •
      treewide: Use struct_size() for kmalloc()-family · acafe7e3
      Committed by Kees Cook
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct foo {
          int stuff;
          void *entry[];
      };
      
      instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
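      The benefit over the open-coded form is overflow safety: the multiplication cannot silently wrap to a small allocation. A minimal user-space sketch of what such a helper does (illustrative implementation, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct foo {
    int stuff;
    void *entry[];                      /* flexible array member */
};

/* Overflow-checked count * elem + base.  A huge count saturates to SIZE_MAX
 * (so the allocation fails) instead of wrapping to a small size, which is
 * the intent of the kernel's struct_size() helper. */
static size_t struct_size_demo(size_t base, size_t elem, size_t count)
{
    size_t bytes;

    if (__builtin_mul_overflow(count, elem, &bytes) ||
        __builtin_add_overflow(bytes, base, &bytes))
        return SIZE_MAX;                /* saturate on overflow */
    return bytes;
}
```

With the open-coded `sizeof(struct foo) + sizeof(void *) * count`, an attacker-controlled `count` could wrap the size; the saturating form makes the subsequent allocation fail instead.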
      
      This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
      uses. It was done via automatic conversion with manual review for the
      "CHECKME" non-standard cases noted below, using the following Coccinelle
      script:
      
      // pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
      //                      sizeof *pkey_cache->table, GFP_KERNEL);
      @@
      identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
      expression GFP;
      identifier VAR, ELEMENT;
      expression COUNT;
      @@
      
      - alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
      + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
      
      // mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
      @@
      identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
      expression GFP;
      identifier VAR, ELEMENT;
      expression COUNT;
      @@
      
      - alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
      + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
      
      // Same pattern, but can't trivially locate the trailing element name,
      // or variable name.
      @@
      identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
      expression GFP;
      expression SOMETHING, COUNT, ELEMENT;
      @@
      
      - alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
      + alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)
      Signed-off-by: Kees Cook <keescook@chromium.org>
  4. 06 Jun 2018 (2 commits)
  5. 05 Jun 2018 (3 commits)
  6. 04 Jun 2018 (1 commit)
    •
      bpf: implement bpf_get_current_cgroup_id() helper · bf6fa2c8
      Committed by Yonghong Song
      bpf has been used extensively for tracing. For example, bcc
      contains an almost full set of bpf-based tools to trace kernel
      and user functions/events. Most tracing tools are currently
      either filtered based on pid or system-wide.
      
      Containers have been used quite extensively in industry and
      cgroup is often used together to provide resource isolation
      and protection. Several processes may run inside the same
      container. It is often desirable to get container-level tracing
      results as well, e.g. syscall count, function count, I/O
      activity, etc.
      
      This patch implements a new helper, bpf_get_current_cgroup_id(),
      which will return cgroup id based on the cgroup within which
      the current task is running.
      
      A later patch will provide an example showing that userspace can
      obtain the same cgroup id, so it can configure a filter or policy in
      the bpf program based on the task's cgroup id.
      
      The helper is currently implemented for tracing. It can
      be added to other program types as well when needed.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  7. 03 Jun 2018 (9 commits)
    •
      bpf/xdp: devmap can avoid calling ndo_xdp_flush · c1ece6b2
      Committed by Jesper Dangaard Brouer
      The XDP_REDIRECT devmap can avoid using ndo_xdp_flush by instead
      instructing ndo_xdp_xmit to flush via the XDP_XMIT_FLUSH flag in the
      appropriate places.

      Notice that after this patch it is possible to remove ndo_xdp_flush
      completely, as this was its last user.  That is left for later
      patches, to keep the driver changes separate.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      xdp: add flags argument to ndo_xdp_xmit API · 42b33468
      Committed by Jesper Dangaard Brouer
      This patch only changes the API and rejects any use of flags.  This is
      an intermediate step that allows us to implement the flush flag
      operation later, in a separate patch for each individual driver.
      
      The plan is to implement flush operation via XDP_XMIT_FLUSH flag
      and then remove XDP_XMIT_FLAGS_NONE when done.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      bpf: fix context access in tracing progs on 32 bit archs · bc23105c
      Committed by Daniel Borkmann
      Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT
      program type in test_verifier report the following errors on x86_32:
      
        172/p unpriv: spill/fill of different pointers ldx FAIL
        Unexpected error message!
        0: (bf) r6 = r10
        1: (07) r6 += -8
        2: (15) if r1 == 0x0 goto pc+3
        R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1
        3: (bf) r2 = r10
        4: (07) r2 += -76
        5: (7b) *(u64 *)(r6 +0) = r2
        6: (55) if r1 != 0x0 goto pc+1
        R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp
        7: (7b) *(u64 *)(r6 +0) = r1
        8: (79) r1 = *(u64 *)(r6 +0)
        9: (79) r1 = *(u64 *)(r1 +68)
        invalid bpf_context access off=68 size=8
      
        378/p check bpf_perf_event_data->sample_period byte load permitted FAIL
        Failed to load prog 'Permission denied'!
        0: (b7) r0 = 0
        1: (71) r0 = *(u8 *)(r1 +68)
        invalid bpf_context access off=68 size=1
      
        379/p check bpf_perf_event_data->sample_period half load permitted FAIL
        Failed to load prog 'Permission denied'!
        0: (b7) r0 = 0
        1: (69) r0 = *(u16 *)(r1 +68)
        invalid bpf_context access off=68 size=2
      
        380/p check bpf_perf_event_data->sample_period word load permitted FAIL
        Failed to load prog 'Permission denied'!
        0: (b7) r0 = 0
        1: (61) r0 = *(u32 *)(r1 +68)
        invalid bpf_context access off=68 size=4
      
        381/p check bpf_perf_event_data->sample_period dword load permitted FAIL
        Failed to load prog 'Permission denied'!
        0: (b7) r0 = 0
        1: (79) r0 = *(u64 *)(r1 +68)
        invalid bpf_context access off=68 size=8
      
      The reason is that struct pt_regs on x86_32 is not fully aligned to an
      8-byte boundary, since its size is 68 bytes.  bpf_ctx_narrow_access_ok()
      therefore bails out, because off & (size_default - 1), i.e. 68 & 7,
      does not align cleanly for the sample_period access from struct
      bpf_perf_event_data; the verifier wrongly assumes an unaligned access
      even though the underlying arch can handle it just fine.  Adjust the
      check down to machine word size and rewrite the offset for narrow
      access on that basis.  We also need to fix the corresponding
      pe_prog_is_valid_access(), since the first and last tests hit the
      off % size != 0 check (e.g. 68 % 8 -> 4).  With that in place, tracing
      progs work on x86_32.
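      The arithmetic can be checked directly; the helper below is an illustrative reduction of the alignment test, using the offsets from the report:

```c
#include <assert.h>

/* Power-of-two alignment test, as used in the verifier's narrow-access
 * check: off & (align - 1) must be zero.  On x86_32, sizeof(struct pt_regs)
 * is 68, so an 8-byte test rejects the sample_period offset even though it
 * is 4-byte (machine word) aligned.  The helper name is illustrative. */
static int narrow_access_ok(unsigned int off, unsigned int align)
{
    return (off & (align - 1)) == 0;    /* align must be a power of two */
}
```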
      Reported-by: Wang YanQing <udknight@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Tested-by: Wang YanQing <udknight@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      bpf: avoid retpoline for lookup/update/delete calls on maps · 09772d92
      Committed by Daniel Borkmann
      While some of the BPF maps provide a ->map_gen_lookup() callback for
      inlining the map lookup altogether, it is not available for every map,
      so the remaining ones have to call the bpf_map_lookup_elem() helper,
      which dispatches to map->ops->map_lookup_elem().  In times of
      retpolines, this indirect call is trapped rather than executed
      speculatively, and therefore causes a slowdown.  Likewise,
      bpf_map_update_elem() and bpf_map_delete_elem() have no inlined
      version and need to call into their map->ops->map_update_elem() resp.
      map->ops->map_delete_elem() handlers.
      
      Before:
      
        # bpftool prog dump xlated id 1
          0: (bf) r2 = r10
          1: (07) r2 += -8
          2: (7a) *(u64 *)(r2 +0) = 0
          3: (18) r1 = map[id:1]
          5: (85) call __htab_map_lookup_elem#232656
          6: (15) if r0 == 0x0 goto pc+4
          7: (71) r1 = *(u8 *)(r0 +35)
          8: (55) if r1 != 0x0 goto pc+1
          9: (72) *(u8 *)(r0 +35) = 1
         10: (07) r0 += 56
         11: (15) if r0 == 0x0 goto pc+4
         12: (bf) r2 = r0
         13: (18) r1 = map[id:1]
         15: (85) call bpf_map_delete_elem#215008  <-- indirect call via
         16: (95) exit                                 helper
      
      After:
      
        # bpftool prog dump xlated id 1
          0: (bf) r2 = r10
          1: (07) r2 += -8
          2: (7a) *(u64 *)(r2 +0) = 0
          3: (18) r1 = map[id:1]
          5: (85) call __htab_map_lookup_elem#233328
          6: (15) if r0 == 0x0 goto pc+4
          7: (71) r1 = *(u8 *)(r0 +35)
          8: (55) if r1 != 0x0 goto pc+1
          9: (72) *(u8 *)(r0 +35) = 1
         10: (07) r0 += 56
         11: (15) if r0 == 0x0 goto pc+4
         12: (bf) r2 = r0
         13: (18) r1 = map[id:1]
         15: (85) call htab_lru_map_delete_elem#238240  <-- direct call
         16: (95) exit
      
      In all three lookup/update/delete cases, however, we can use the
      actual address of the map callback directly if we find that there is
      only a single path with a map pointer leading to the helper call,
      i.e. when the map pointer has not been poisoned from the verifier
      side.  Example code for the delete case can be seen above.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      bpf: show prog and map id in fdinfo · 4316b409
      Committed by Daniel Borkmann
      It's trivial and straightforward to expose these ids for scripts,
      which can then use them along with bpftool to inspect an individual
      application's maps and progs.  Right now we dump some basic
      information in the fdinfo file; with the map/prog id, full
      introspection becomes possible.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      bpf: fixup error message from gpl helpers on license mismatch · 3fe2867c
      Committed by Daniel Borkmann
      Stating 'proprietary program' in the error is just silly, since it can
      also be a different open source license that is simply not compatible.
      
      Reference: https://twitter.com/majek04/status/998531268039102465
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      libnvdimm, e820: Register all pmem resources · d76401ad
      Committed by Dan Williams
      There is currently a mismatch between the resources that will trigger
      the e820_pmem driver to register/load and the resources that will
      actually be surfaced as pmem ranges. register_e820_pmem() uses
      walk_iomem_res_desc() which includes children and siblings. In contrast,
      e820_pmem_probe() only considers top level resources. For example the
      following resource tree results in the driver being loaded, but no
      resources being registered:
      
          398000000000-39bfffffffff : PCI Bus 0000:ae
            39be00000000-39bf07ffffff : PCI Bus 0000:af
              39be00000000-39beffffffff : 0000:af:00.0
                39be10000000-39beffffffff : Persistent Memory (legacy)
      
      Fix this up to allow definitions of "legacy" pmem ranges anywhere in
      system-physical address space.  It is not recommended or safe to
      define a pmem range in PCI space, but it is useful for debug /
      experimentation, and the restriction to top-level resources was
      arbitrary.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    •
      bpf: btf: Ensure t->type == 0 for BTF_KIND_FWD · 8175383f
      Committed by Martin KaFai Lau
      The t->type field of a BTF_KIND_FWD is not used and must be 0.
      This patch enforces that and also adds a test case to test_btf.c.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      bpf: btf: Check array t->size · b9308ae6
      Committed by Martin KaFai Lau
      This patch ensures that an array's t->size is 0.

      An array's size is determined by its element size and the number of
      elements; hence t->size is not used and must be 0.

      A test case is added to test_btf.c.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  8. 31 May 2018 (5 commits)
    •
      sched/headers: Fix typo · 595058b6
      Committed by Davidlohr Bueso
      I cannot spell 'throttling'.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180530224940.17839-1-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    •
      sched/deadline: Fix missing clock update · ecda2b66
      Committed by Juri Lelli
      A missing clock update is causing the following warning:
      
       rq->clock_update_flags < RQCF_ACT_SKIP
       WARNING: CPU: 10 PID: 0 at kernel/sched/sched.h:963 inactive_task_timer+0x5d6/0x720
       Call Trace:
        <IRQ>
        __hrtimer_run_queues+0x10f/0x530
        hrtimer_interrupt+0xe5/0x240
        smp_apic_timer_interrupt+0x79/0x2b0
        apic_timer_interrupt+0xf/0x20
        </IRQ>
        do_idle+0x203/0x280
        cpu_startup_entry+0x6f/0x80
        start_secondary+0x1b0/0x200
        secondary_startup_64+0xa5/0xb0
       hardirqs last  enabled at (793919): [<ffffffffa27c5f6e>] cpuidle_enter_state+0x9e/0x360
       hardirqs last disabled at (793920): [<ffffffffa2a0096e>] interrupt_entry+0xce/0xe0
       softirqs last  enabled at (793922): [<ffffffffa20bef78>] irq_enter+0x68/0x70
       softirqs last disabled at (793921): [<ffffffffa20bef5d>] irq_enter+0x4d/0x70
      
      This happens because inactive_task_timer() calls sub_running_bw() (if
      TASK_DEAD and non_contending), which might trigger a schedutil update
      that accesses the clock.  The clock, however, is currently updated
      only later in inactive_task_timer().

      Fix the problem by updating the clock right after task_rq_lock().
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180530160809.9074-1-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    •
      sched/core: Require cpu_active() in select_task_rq(), for user tasks · 7af443ee
      Committed by Paul Burton
      select_task_rq() is used in a few paths to select the CPU upon which a
      thread should be run - for example it is used by try_to_wake_up() & by
      fork or exec balancing. As-is it allows use of any online CPU that is
      present in the task's cpus_allowed mask.
      
      This presents a problem, because there is a period whilst CPUs are
      being brought online during which a CPU is marked online but is not
      yet fully initialized, i.e. the period where CPUHP_AP_ONLINE_IDLE <=
      state < CPUHP_ONLINE.  Usually we don't run any user tasks during this
      window, but there are corner cases where this can happen.  An example
      observed is:
      
        - Some user task A, running on CPU X, forks to create task B.
      
        - sched_fork() calls __set_task_cpu() with cpu=X, setting task B's
          task_struct::cpu field to X.
      
        - CPU X is offlined.
      
        - Task A, currently somewhere between the __set_task_cpu() in
          copy_process() and the call to wake_up_new_task(), is migrated to
          CPU Y by migrate_tasks() when CPU X is offlined.
      
        - CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The
          scheduler is now active on CPU X, but there are no user tasks on
          the runqueue.
      
        - Task A runs on CPU Y & reaches wake_up_new_task(). This calls
          select_task_rq() with cpu=X, taken from task B's task_struct,
          and select_task_rq() allows CPU X to be returned.
      
        - Task A enqueues task B on CPU X's runqueue, via activate_task() &
          enqueue_task().
      
        - CPU X now has a user task on its runqueue before it has reached the
          CPUHP_ONLINE state.
      
      In most cases, the user tasks that schedule on the newly onlined CPU
      have no idea that anything went wrong, but one case observed to be
      problematic is if the task goes on to invoke the sched_setaffinity
      syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state
      before the CPU that brought it online calls stop_machine_unpark(). This
      means that for a portion of the window of time between
      CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct
      cpu_stopper has its enabled field set to false. If a user thread is
      executed on the CPU during this window and it invokes sched_setaffinity
      with a CPU mask that does not include the CPU it's running on, then when
      __set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke
      migration_cpu_stop() and perform the actual migration away from the CPU
      it will simply return -ENOENT rather than calling migration_cpu_stop().
      We then return from the sched_setaffinity syscall back to the user task
      that is now running on a CPU which it just asked not to run on, and
      which is not present in its cpus_allowed mask.
      
      This patch resolves the problem by having select_task_rq() enforce that
      user tasks run on CPUs that are active - the same requirement that
      select_fallback_rq() already enforces. This should ensure that newly
      onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to
      schedule user tasks, and also implies that bringup_wait_for_ap() will
      have called stop_machine_unpark() which resolves the sched_setaffinity
      issue above.
      
      I haven't yet investigated them, but it may be of interest to review
      whether any of the actions performed by hotplug states between
      CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended
      effects on user tasks that might schedule before they are reached, which
      might widen the scope of the problem from just affecting the behaviour
      of sched_setaffinity.
      Signed-off-by: Paul Burton <paul.burton@mips.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180526154648.11635-2-paul.burton@mips.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    •
      sched/core: Fix rules for running on online && !active CPUs · 175f0e25
      Committed by Peter Zijlstra
      As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules
      for running on an online && !active CPU are stricter than just being a
      kthread, you need to be a per-cpu kthread.
      
      If you're not strictly per-CPU, you have better CPUs to run on and
      don't need the partially booted one to get your work done.
      
      The exception is to allow smpboot threads to bootstrap the CPU itself
      and get kernel 'services' initialized before we allow userspace on it.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 955dbdf4 ("sched: Allow migrating kthreads into online but inactive CPUs")
      Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    •
      bpf: devmap: remove redundant assignment of dev = dev · 71b2c87d
      Committed by Colin Ian King
      The assignment dev = dev is redundant and should be removed.
      
      Detected by CoverityScan, CID#1469486 ("Evaluation order violation")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  9. 30 May 2018 (2 commits)
  10. 29 May 2018 (11 commits)
  11. 28 May 2018 (2 commits)
    •
      bpf: Hooks for sys_sendmsg · 1cedee13
      Committed by Andrey Ignatov
      In addition to already existing BPF hooks for sys_bind and sys_connect,
      the patch provides new hooks for sys_sendmsg.
      
      It leverages the existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`,
      which provides access to the socket itself (properties like family,
      type, protocol) and the user-passed `struct sockaddr *`, so that a BPF
      program can override the destination IP and port for system calls
      such as sendto(2) or sendmsg(2) and/or assign a source IP to the
      socket.
      
      The hooks are implemented as two new attach types:
      `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and
      UDPv6 correspondingly.
      
      UDPv4 and UDPv6 get separate attach types for the same reason as the
      sys_bind and sys_connect hooks, i.e. to prevent reading from / writing
      to e.g. user_ip6 fields when the user passes a sockaddr_in, since that
      would be out-of-bounds.

      The difference from the existing hooks is that the sys_sendmsg hooks
      are implemented only for unconnected UDP.
      
      For TCP it doesn't make sense to change user-provided `struct sockaddr *`
      at sendto(2)/sendmsg(2) time since socket either was already connected
      and has source/destination set or wasn't connected and call to
      sendto(2)/sendmsg(2) would lead to ENOTCONN anyway.
      
      Connected UDP is already handled by sys_connect hooks that can override
      source/destination at connect time and use fast-path later, i.e. these
      hooks don't affect UDP fast-path.
      
      Rewriting source IP is implemented differently than that in sys_connect
      hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to
      just bind socket to desired local IP address since source IP can be set
      on per-packet basis by using ancillary data (cmsg(3)). So no matter if
      socket is bound or not, source IP has to be rewritten on every call to
      sys_sendmsg.
      
      To do so two new fields are added to UAPI `struct bpf_sock_addr`;
      * `msg_src_ip4` to set source IPv4 for UDPv4;
      * `msg_src_ip6` to set source IPv6 for UDPv6.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    •
      bpf: avoid -Wmaybe-uninitialized warning · dc3b8ae9
      Committed by Arnd Bergmann
      The stack_map_get_build_id_offset() function is too long for gcc to track
      whether 'work' may or may not be initialized at the end of it, leading
      to a false-positive warning:
      
      kernel/bpf/stackmap.c: In function 'stack_map_get_build_id_offset':
      kernel/bpf/stackmap.c:334:13: error: 'work' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      This removes the 'in_nmi_ctx' flag and instead uses the state of the
      'work' variable itself to see whether it got initialized.
      
      Fixes: bae77c5e ("bpf: enable stackmap with build_id in nmi context")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>