1. 26 January 2019, 40 commits
    • quota: Lock s_umount in exclusive mode for Q_XQUOTA{ON,OFF} quotactls. · 876b79b9
      Committed by Javier Barrio
      [ Upstream commit 41c4f85cdac280d356df1f483000ecec4a8868be ]
      
      Commit 1fa5efe3 (ext4: Use generic helpers for quotaon
      and quotaoff) made it possible to call quotactl(Q_XQUOTAON/OFF) on ext4 filesystems
      with sysfile quota support. This leads to calling dquot_enable/disable without s_umount
      held in exclusive mode, because quotactl_cmd_onoff checks only for Q_QUOTAON/OFF.
      
      The following WARN_ON_ONCE triggers (in this case for dquot_enable, ext4, latest Linus' tree):
      
      [  117.807056] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: quota,prjquota
      
      [...]
      
      [  155.036847] WARNING: CPU: 0 PID: 2343 at fs/quota/dquot.c:2469 dquot_enable+0x34/0xb9
      [  155.036851] Modules linked in: quota_v2 quota_tree ipv6 af_packet joydev mousedev psmouse serio_raw pcspkr i2c_piix4 intel_agp intel_gtt e1000 ttm drm_kms_helper drm agpgart fb_sys_fops syscopyarea sysfillrect sysimgblt i2c_core input_leds kvm_intel kvm irqbypass qemu_fw_cfg floppy evdev parport_pc parport button crc32c_generic dm_mod ata_generic pata_acpi ata_piix libata loop ext4 crc16 mbcache jbd2 usb_storage usbcore sd_mod scsi_mod
      [  155.036901] CPU: 0 PID: 2343 Comm: qctl Not tainted 4.20.0-rc6-00025-gf5d582777bcb #9
      [  155.036903] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [  155.036911] RIP: 0010:dquot_enable+0x34/0xb9
      [  155.036915] Code: 41 56 41 55 41 54 55 53 4c 8b 6f 28 74 02 0f 0b 4d 8d 7d 70 49 89 fc 89 cb 41 89 d6 89 f5 4c 89 ff e8 23 09 ea ff 85 c0 74 0a <0f> 0b 4c 89 ff e8 8b 09 ea ff 85 db 74 6a 41 8b b5 f8 00 00 00 0f
      [  155.036918] RSP: 0018:ffffb09b00493e08 EFLAGS: 00010202
      [  155.036922] RAX: 0000000000000001 RBX: 0000000000000008 RCX: 0000000000000008
      [  155.036924] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9781b67cd870
      [  155.036926] RBP: 0000000000000002 R08: 0000000000000000 R09: 61c8864680b583eb
      [  155.036929] R10: ffffb09b00493e48 R11: ffffffffff7ce7d4 R12: ffff9781b7ee8d78
      [  155.036932] R13: ffff9781b67cd800 R14: 0000000000000004 R15: ffff9781b67cd870
      [  155.036936] FS:  00007fd813250b88(0000) GS:ffff9781ba000000(0000) knlGS:0000000000000000
      [  155.036939] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  155.036942] CR2: 00007fd812ff61d6 CR3: 000000007c882000 CR4: 00000000000006b0
      [  155.036951] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  155.036953] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  155.036955] Call Trace:
      [  155.037004]  dquot_quota_enable+0x8b/0xd0
      [  155.037011]  kernel_quotactl+0x628/0x74e
      [  155.037027]  ? do_mprotect_pkey+0x2a6/0x2cd
      [  155.037034]  __x64_sys_quotactl+0x1a/0x1d
      [  155.037041]  do_syscall_64+0x55/0xe4
      [  155.037078]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [  155.037105] RIP: 0033:0x7fd812fe1198
      [  155.037109] Code: 02 77 0d 48 89 c1 48 c1 e9 3f 75 04 48 8b 04 24 48 83 c4 50 5b c3 48 83 ec 08 49 89 ca 48 63 d2 48 63 ff b8 b3 00 00 00 0f 05 <48> 89 c7 e8 c1 eb ff ff 5a c3 48 63 ff b8 bb 00 00 00 0f 05 48 89
      [  155.037112] RSP: 002b:00007ffe8cd7b050 EFLAGS: 00000206 ORIG_RAX: 00000000000000b3
      [  155.037116] RAX: ffffffffffffffda RBX: 00007ffe8cd7b148 RCX: 00007fd812fe1198
      [  155.037119] RDX: 0000000000000000 RSI: 00007ffe8cd7cea9 RDI: 0000000000580102
      [  155.037121] RBP: 00007ffe8cd7b0f0 R08: 000055fc8eba8a9d R09: 0000000000000000
      [  155.037124] R10: 00007ffe8cd7b074 R11: 0000000000000206 R12: 00007ffe8cd7b168
      [  155.037126] R13: 000055fc8eba8897 R14: 0000000000000000 R15: 0000000000000000
      [  155.037131] ---[ end trace 210f864257175c51 ]---
      
      and then the syscall proceeds without s_umount locking.
      
      This patch locks the superblock's ->s_umount semaphore in exclusive mode for all
      Q_XQUOTAON/OFF quotactls too, in addition to Q_QUOTAON/OFF.
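
      A minimal sketch of the resulting check (following fs/quota/quota.c; the
      exact shape in this tree may differ):

        /* Commands that turn quotas on/off need s_umount held exclusively. */
        static bool quotactl_cmd_onoff(int cmd)
        {
                return (cmd == Q_QUOTAON) || (cmd == Q_QUOTAOFF) ||
                       (cmd == Q_XQUOTAON) || (cmd == Q_XQUOTAOFF);
        }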
      
      AFAICT, other than ext4, only xfs and ocfs2 are affected by this change.
      The VFS will now call into the xfs_quota_* functions with s_umount held, which
      wasn't the case before. This looks good to me but I cannot say for sure. Ext4
      and ocfs2 were already being called with s_umount held exclusively via
      quota_quotaon/off, which is basically the same.
      Signed-off-by: Javier Barrio <javier.barrio.mart@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf tools: Add missing open_memstream() prototype for systems lacking it · 77f14a49
      Committed by Arnaldo Carvalho de Melo
      [ Upstream commit d7a8c4a6a055097a67ccfa3ca7c9ff1b64603a70 ]
      
      There are systems, such as the Android NDK API level 24, that have the
      open_memstream() function but don't provide a prototype, adding noise
      to the build:
      
        builtin-timechart.c: In function 'cat_backtrace':
        builtin-timechart.c:486:2: warning: implicit declaration of function 'open_memstream' [-Wimplicit-function-declaration]
          FILE *f = open_memstream(&p, &p_len);
          ^
        builtin-timechart.c:486:2: warning: nested extern declaration of 'open_memstream' [-Wnested-externs]
        builtin-timechart.c:486:12: warning: initialization makes pointer from integer without a cast
          FILE *f = open_memstream(&p, &p_len);
                    ^
      
      Define a LACKS_OPEN_MEMSTREAM_PROTOTYPE define so that code needing that
      can get a prototype.
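
      A minimal sketch of how such a guard can supply the prototype (the exact
      header it lands in within the perf sources is an assumption):

        #ifdef LACKS_OPEN_MEMSTREAM_PROTOTYPE
        /* The C library ships the symbol but no declaration for it. */
        FILE *open_memstream(char **ptr, size_t *sizeloc);
        #endif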
      
      Checked in the bionic git repo to be available since level 23:
      
      https://android.googlesource.com/platform/bionic/+/master/libc/include/stdio.h#241
      
        FILE* open_memstream(char** __ptr, size_t* __size_ptr) __INTRODUCED_IN(23);
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-343ashae97e5bq6vizusyfno@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf tools: Add missing sigqueue() prototype for systems lacking it · e2a1f8d6
      Committed by Arnaldo Carvalho de Melo
      [ Upstream commit 748fe0889c1ff12d378946bd5326e8ee8eacf5cf ]
      
      There are systems, such as the Android NDK API level 24, that have the
      sigqueue() function but don't provide a prototype, adding noise to the
      build:
      
        util/evlist.c: In function 'perf_evlist__prepare_workload':
        util/evlist.c:1494:4: warning: implicit declaration of function 'sigqueue' [-Wimplicit-function-declaration]
            if (sigqueue(getppid(), SIGUSR1, val))
            ^
        util/evlist.c:1494:4: warning: nested extern declaration of 'sigqueue' [-Wnested-externs]
      
      Define a LACKS_SIGQUEUE_PROTOTYPE define so that code needing that can
      get a prototype.
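
      A minimal sketch, analogous to the open_memstream() case above (the exact
      header it lands in is an assumption):

        #ifdef LACKS_SIGQUEUE_PROTOTYPE
        /* The C library ships the symbol but no declaration for it. */
        int sigqueue(pid_t pid, int sig, const union sigval value);
        #endif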
      
      Checked in the bionic git repo to be available since level 23:
      
      https://android.googlesource.com/platform/bionic/+/master/libc/include/signal.h#123
      
        int sigqueue(pid_t __pid, int __signal, const union sigval __value) __INTRODUCED_IN(23);
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-lmhpev1uni9kdrv7j29glyov@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf cs-etm: Correct packets swapping in cs_etm__flush() · 4bc4b575
      Committed by Leo Yan
      [ Upstream commit 43fd56669c28cd354e9228bdb58e4bca1c1a8b66 ]
      
      The structure cs_etm_queue uses 'prev_packet' to point to the previous
      packet, which can be combined with the newly arriving packet to generate
      samples.
      
      In function cs_etm__flush() the packets are swapped only when the flag
      'etm->synth_opts.last_branch' is true; this means packets are not swapped
      unless the option '--itrace=il' is used to generate last branch entries.
      In that case 'prev_packet' doesn't point to the correct previous packet
      and the stale packet will still be used to generate the sequential
      sample, so dumping the trace with the 'perf script' command shows an
      incorrect flow with the stale packet's address info.
      
      This patch corrects the packet swapping in cs_etm__flush(): besides
      checking the flag 'etm->synth_opts.last_branch' it also checks the flag
      'etm->sample_branches', and if either flag is true it swaps the packets
      so the correct content is saved to 'prev_packet'.  This fixes the wrong
      program flow dumping issue.
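
      A minimal sketch of the corrected swap condition (following
      tools/perf/util/cs-etm.c; the surrounding flush logic is elided):

        if (etm->sample_branches || etm->synth_opts.last_branch) {
                struct cs_etm_packet *tmp;

                /* Keep prev_packet pointing at the last packet seen. */
                tmp = etmq->packet;
                etmq->packet = etmq->prev_packet;
                etmq->prev_packet = tmp;
        }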
      
      The patch has a minor refactoring to use 'etm->synth_opts.last_branch'
      instead of 'etmq->etm->synth_opts.last_branch' for the condition check;
      this is consistent with what is done in cs_etm__sample().
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mike Leach <mike.leach@linaro.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Robert Walker <robert.walker@arm.com>
      Cc: coresight@lists.linaro.org
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/1544513908-16805-2-git-send-email-leo.yan@linaro.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • dm snapshot: Fix excessive memory usage and workqueue stalls · 9e5be33b
      Committed by Nikos Tsironis
      [ Upstream commit 721b1d98fb517ae99ab3b757021cf81db41e67be ]
      
      kcopyd has no upper limit to the number of jobs one can allocate and
      issue. Under certain workloads this can lead to excessive memory usage
      and workqueue stalls, for example when creating multiple dm-snapshot
      targets with a 4K chunk size and then writing to the origin through the
      page cache. Syncing the page cache causes a large number of BIOs to be
      issued to the dm-snapshot origin target, which itself issues an even
      larger (because of the BIO splitting taking place) number of kcopyd
      jobs.
      
      Running the following test, from the device mapper test suite [1],
      
        dmtest run --suite snapshot -n many_snapshots_of_same_volume_N
      
      , with 8 active snapshots, results in the kcopyd job slab cache growing
      to 10G. Depending on the available system RAM this can lead to the OOM
      killer killing user processes:
      
      [463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
                    nodemask=(null), order=1, oom_score_adj=0
      [463.492894] kthreadd cpuset=/ mems_allowed=0
      [463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
      [463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      [463.492952] Call Trace:
      [463.492964]  dump_stack+0x7d/0xbb
      [463.492973]  dump_header+0x6b/0x2fc
      [463.492987]  ? lockdep_hardirqs_on+0xee/0x190
      [463.493012]  oom_kill_process+0x302/0x370
      [463.493021]  out_of_memory+0x113/0x560
      [463.493030]  __alloc_pages_slowpath+0xf40/0x1020
      [463.493055]  __alloc_pages_nodemask+0x348/0x3c0
      [463.493067]  cache_grow_begin+0x81/0x8b0
      [463.493072]  ? cache_grow_begin+0x874/0x8b0
      [463.493078]  fallback_alloc+0x1e4/0x280
      [463.493092]  kmem_cache_alloc_node+0xd6/0x370
      [463.493098]  ? copy_process.part.31+0x1c5/0x20d0
      [463.493105]  copy_process.part.31+0x1c5/0x20d0
      [463.493115]  ? __lock_acquire+0x3cc/0x1550
      [463.493121]  ? __switch_to_asm+0x34/0x70
      [463.493129]  ? kthread_create_worker_on_cpu+0x70/0x70
      [463.493135]  ? finish_task_switch+0x90/0x280
      [463.493165]  _do_fork+0xe0/0x6d0
      [463.493191]  ? kthreadd+0x19f/0x220
      [463.493233]  kernel_thread+0x25/0x30
      [463.493235]  kthreadd+0x1bf/0x220
      [463.493242]  ? kthread_create_on_cpu+0x90/0x90
      [463.493248]  ret_from_fork+0x3a/0x50
      [463.493279] Mem-Info:
      [463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
      [463.493285]  active_file:80216 inactive_file:80107 isolated_file:435
      [463.493285]  unevictable:0 dirty:51266 writeback:109372 unstable:0
      [463.493285]  slab_reclaimable:31191 slab_unreclaimable:3483521
      [463.493285]  mapped:526 shmem:4903 pagetables:1759 bounce:0
      [463.493285]  free:33623 free_pcp:2392 free_cma:0
      ...
      [463.493489] Unreclaimable slab info:
      [463.493513] Name                      Used          Total
      [463.493522] bio-6                   1028KB       1028KB
      [463.493525] bio-5                   1028KB       1028KB
      [463.493528] dm_snap_pending_exception     236783KB     243789KB
      [463.493531] dm_exception              41KB         42KB
      [463.493534] bio-4                   1216KB       1216KB
      [463.493537] bio-3                 439396KB     439396KB
      [463.493539] kcopyd_job           6973427KB    6973427KB
      ...
      [463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
      [463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
      [463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      Moreover, issuing a large number of kcopyd jobs results in kcopyd
      hogging the CPU while processing them. As a result, processing of work
      items, queued for execution on the same CPU as the currently running
      kcopyd thread, is stalled for long periods of time, hurting performance.
      Running the aforementioned test we get, in dmesg, messages like the
      following:
      
      [67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
      [67501.195586] Showing busy workqueues and worker pools:
      [67501.195591] workqueue events: flags=0x0
      [67501.195597]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195611]     pending: cache_reap
      [67501.195641] workqueue mm_percpu_wq: flags=0x8
      [67501.195645]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195656]     pending: vmstat_update
      [67501.195682] workqueue kblockd: flags=0x18
      [67501.195687]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
      [67501.195698]     pending: blk_timeout_work
      [67501.195753] workqueue kcopyd: flags=0x8
      [67501.195757]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195768]     pending: do_work [dm_mod]
      [67501.195802] workqueue kcopyd: flags=0x8
      [67501.195806]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195817]     pending: do_work [dm_mod]
      [67501.195834] workqueue kcopyd: flags=0x8
      [67501.195838]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195848]     pending: do_work [dm_mod]
      [67501.195881] workqueue kcopyd: flags=0x8
      [67501.195885]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195896]     pending: do_work [dm_mod]
      [67501.195920] workqueue kcopyd: flags=0x8
      [67501.195924]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
      [67501.195935]     in-flight: 67:do_work [dm_mod]
      [67501.195945]     pending: do_work [dm_mod]
      [67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765
      
      The root cause for these issues is the way dm-snapshot uses kcopyd. In
      particular, the lack of an explicit or implicit limit to the maximum
      number of in-flight COW jobs. The merging path is not affected because
      it implicitly limits the in-flight kcopyd jobs to one.
      
      Fix these issues by using a semaphore to limit the maximum number of
      in-flight kcopyd jobs. We grab the semaphore before allocating a new
      kcopyd job in start_copy() and start_full_bio() and release it after the
      job finishes in copy_callback().
      
      The initial semaphore value is configurable through a module parameter,
      to allow fine tuning the maximum number of in-flight COW jobs. Setting
      this parameter to zero initializes the semaphore to INT_MAX.
      
      A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
      value was decided experimentally as a trade-off between memory
      consumption, stalling the kernel's workqueues and maintaining a high
      enough throughput.
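
      A minimal sketch of the scheme, with assumed field and parameter names
      (the actual patch touches drivers/md/dm-snap.c and is larger):

        static unsigned cow_threshold = 2048;   /* module param; 0 => INT_MAX */

        /* At snapshot construction time: */
        sema_init(&s->cow_count, cow_threshold ? cow_threshold : INT_MAX);

        static void start_copy(struct dm_snap_pending_exception *pe)
        {
                down(&pe->snap->cow_count);     /* wait for a free job slot */
                /* ... allocate and submit the kcopyd job ... */
        }

        static void copy_callback(int read_err, unsigned long write_err,
                                  void *context)
        {
                struct dm_snap_pending_exception *pe = context;

                up(&pe->snap->cow_count);       /* job done, release the slot */
                /* ... complete the pending exception ... */
        }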
      
      Re-running the aforementioned test:
      
        * Workqueue stalls are eliminated
        * kcopyd's job slab cache uses a maximum of 130MB
        * The time taken by the test to write to the snapshot-origin target is
          reduced from 05m20.48s to 03m26.38s
      
      [1] https://github.com/jthornber/device-mapper-test-suite
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • tools lib subcmd: Don't add the kernel sources to the include path · d9513fdb
      Committed by Arnaldo Carvalho de Melo
      [ Upstream commit ece9804985b57e1ccd83b1fb6288520955a29d51 ]
      
      At some point we decided not to directly include kernel source files
      when building tools/perf/, but when tools/lib/subcmd/ was forked from
      tools/perf it somehow ended up adding them via these two lines in its
      Makefile:
      
        CFLAGS += -I$(srctree)/include/uapi
        CFLAGS += -I$(srctree)/include
      
      Here $(srctree) points to the kernel sources.
      
      Removing those lines and keeping just:
      
        CFLAGS += -I$(srctree)/tools/include/
      
      is enough to build tools/perf and tools/objtool.
      
      This fixes the build when building from the sources in environments such
      as the Android NDK crossbuilding from a fedora:26 system:
      
        subcmd-util.h:11:15: error: expected ',' or ';' before 'void'
         static inline void report(const char *prefix, const char *err, va_list params)
                       ^
        In file included from /git/perf/include/uapi/linux/stddef.h:2:0,
                         from /git/perf/include/uapi/linux/posix_types.h:5,
                         from /opt/android-ndk-r12b/platforms/android-24/arch-arm/usr/include/sys/types.h:36,
                         from /opt/android-ndk-r12b/platforms/android-24/arch-arm/usr/include/unistd.h:33,
                         from run-command.c:2:
        subcmd-util.h:18:17: error: '__no_instrument_function__' attribute applies only to functions
      
      The /opt/android-ndk-r12b/platforms/android-24/arch-arm/usr/include/sys/types.h
      file that includes linux/posix_types.h ends up getting the one in the kernel
      sources causing the breakage. Fix it.
      
      Test built tools/objtool/ too.
      Reported-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Fixes: 4b6ab94e ("perf subcmd: Create subcmd library")
      Link: https://lkml.kernel.org/n/tip-5lhaoecrj12t0bqwvpiu14sm@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf stat: Avoid segfaults caused by negated options · 8603cac2
      Committed by Michael Petlan
      [ Upstream commit 51433ead1460fb3f46e1c34f68bb22fd2dd0f5d0 ]
      
      Some 'perf stat' options do not make sense when negated (event,
      cgroup), and some do not have the negated path implemented (metrics).
      Because of that, it is better to disable the "no-" prefix for them,
      since otherwise the subsequent option parsing segfaults.
      
      Before:
      
        $ perf stat --no-metrics -- ls
        Segmentation fault (core dumped)
      
      After:
      
        $ perf stat --no-metrics -- ls
         Error: option `no-metrics' isn't available
         Usage: perf stat [<options>] [<command>]
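
      A minimal sketch of the kind of change involved (PARSE_OPT_NONEG is the
      flag perf's option parser uses for this; the option table entry shown is
      illustrative, not the verbatim patch):

        {
                .type       = OPTION_CALLBACK,
                .short_name = 'M',
                .long_name  = "metrics",
                .flags      = PARSE_OPT_NONEG,  /* reject --no-metrics */
                /* ... value, help, callback ... */
        },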
      Signed-off-by: Michael Petlan <mpetlan@redhat.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      LPU-Reference: 1485912065.62416880.1544457604340.JavaMail.zimbra@redhat.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • dm kcopyd: Fix bug causing workqueue stalls · cbd257f3
      Committed by Nikos Tsironis
      [ Upstream commit d7e6b8dfc7bcb3f4f3a18313581f67486a725b52 ]
      
      When using kcopyd to run callbacks through dm_kcopyd_do_callback() or
      submitting copy jobs with a source size of 0, the jobs are pushed
      directly to the complete_jobs list, which could be under processing by
      the kcopyd thread. As a result, the kcopyd thread can continue running
      completed jobs indefinitely, without releasing the CPU, as long as
      someone keeps submitting new completed jobs through the aforementioned
      paths. Processing of work items, queued for execution on the same CPU as
      the currently running kcopyd thread, is thus stalled for excessive
      amounts of time, hurting performance.
      
      Running the following test, from the device mapper test suite [1],
      
        dmtest run --suite snapshot -n parallel_io_to_many_snaps_N
      
      , with 8 active snapshots, we get, in dmesg, messages like the
      following:
      
      [68899.948523] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 95s!
      [68899.949282] Showing busy workqueues and worker pools:
      [68899.949288] workqueue events: flags=0x0
      [68899.949295]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
      [68899.949306]     pending: vmstat_shepherd, cache_reap
      [68899.949331] workqueue mm_percpu_wq: flags=0x8
      [68899.949337]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949345]     pending: vmstat_update
      [68899.949387] workqueue dm_bufio_cache: flags=0x8
      [68899.949392]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
      [68899.949400]     pending: work_fn [dm_bufio]
      [68899.949423] workqueue kcopyd: flags=0x8
      [68899.949429]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949437]     pending: do_work [dm_mod]
      [68899.949452] workqueue kcopyd: flags=0x8
      [68899.949458]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
      [68899.949466]     in-flight: 13:do_work [dm_mod]
      [68899.949474]     pending: do_work [dm_mod]
      [68899.949487] workqueue kcopyd: flags=0x8
      [68899.949493]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949501]     pending: do_work [dm_mod]
      [68899.949515] workqueue kcopyd: flags=0x8
      [68899.949521]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949529]     pending: do_work [dm_mod]
      [68899.949541] workqueue kcopyd: flags=0x8
      [68899.949547]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949555]     pending: do_work [dm_mod]
      [68899.949568] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=95s workers=4 idle: 27130 27223 1084
      
      Fix this by splitting the complete_jobs list into two parts: A user
      facing part, named callback_jobs, and one used internally by kcopyd,
      retaining the name complete_jobs. dm_kcopyd_do_callback() and
      dispatch_job() now push their jobs to the callback_jobs list, which is
      spliced to the complete_jobs list once, every time the kcopyd thread
      wakes up. This prevents kcopyd from hogging the CPU indefinitely and
      causing workqueue stalls.
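
      A minimal sketch of the splice at the top of the kcopyd worker (names
      follow drivers/md/dm-kcopyd.c; the remaining processing is elided):

        static void do_work(struct work_struct *work)
        {
                struct dm_kcopyd_client *kc = container_of(work,
                                struct dm_kcopyd_client, kcopyd_work);
                unsigned long flags;

                /*
                 * Adopt the user-facing callback_jobs list exactly once per
                 * wakeup, so newly submitted completed jobs cannot keep the
                 * thread spinning on complete_jobs forever.
                 */
                spin_lock_irqsave(&kc->job_lock, flags);
                list_splice_tail_init(&kc->callback_jobs, &kc->complete_jobs);
                spin_unlock_irqrestore(&kc->job_lock, flags);

                /* ... process complete_jobs, pages_jobs and io_jobs ... */
        }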
      
      Re-running the aforementioned test:
      
        * Workqueue stalls are eliminated
        * The maximum writing time among all targets is reduced from 09m37.10s
          to 06m04.85s and the total run time of the test is reduced from
          10m43.591s to 7m19.199s
      
      [1] https://github.com/jthornber/device-mapper-test-suite
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • dm crypt: use u64 instead of sector_t to store iv_offset · 4e26ee31
      Committed by AliOS system security
      [ Upstream commit 8d683dcd65c037efc9fb38c696ec9b65b306e573 ]
      
      The iv_offset in the mapping table of the crypt target is a 64-bit number when
      the IV algorithm is plain64, plain64be, essiv or benbi. It is assigned to
      iv_offset of struct crypt_config, cc_sector of struct convert_context and
      iv_sector of struct dm_crypt_request. These structure members are defined
      as sector_t. But sector_t is 32-bit when CONFIG_LBDAF is not set in a 32-bit
      kernel, and in that situation sector_t is not big enough to store the 64-bit
      iv_offset.
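
      A minimal sketch of the type change (struct and member names as described
      above; unrelated members elided):

        struct crypt_config {
                /* ... */
                u64 iv_offset;     /* was sector_t: 32-bit without CONFIG_LBDAF */
        };

        struct convert_context {
                /* ... */
                u64 cc_sector;     /* was sector_t */
        };

        struct dm_crypt_request {
                /* ... */
                u64 iv_sector;     /* was sector_t */
        };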
      
      Here is a reproducer.
      Prepare test image and device (loop is automatically allocated by cryptsetup):
      
        # dd if=/dev/zero of=tst.img bs=1M count=1
        # echo "tst"|cryptsetup open --type plain -c aes-xts-plain64 \
        --skip 500000000000000000 tst.img test
      
      On a 32-bit system (using an IV offset value that overflows to 64 bits;
      CONFIG_LBDAF is off), the stored offset is truncated and the device checksum is wrong:
      
        # dmsetup table test --showkeys
        0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 3551657984 7:0 0
      
        # sha256sum /dev/mapper/test
        533e25c09176632b3794f35303488c4a8f3f965dffffa6ec2df347c168cb6c19 /dev/mapper/test
      
      On a 64-bit system (and on a 32-bit system with the patch), the table and checksum are now correct:
      
        # dmsetup table test --showkeys
        0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 500000000000000000 7:0 0
      
        # sha256sum /dev/mapper/test
        5d16160f9d5f8c33d8051e65fdb4f003cc31cd652b5abb08f03aa6fce0df75fc /dev/mapper/test
      Signed-off-by: AliOS system security <alios_sys_security@linux.alibaba.com>
      Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • x86/topology: Use total_cpus for max logical packages calculation · a4772e8b
      Committed by Hui Wang
      [ Upstream commit aa02ef099cff042c2a9109782ec2bf1bffc955d4 ]
      
      nr_cpu_ids can be limited on the command line via nr_cpus=. This can break
      logical package management because it results in a smaller number of packages
      when running in a kdump kernel.
      
      Consider the following case:
      There is a two-socket system; each socket has 8 cores, which means 16 logical
      cpus per socket with HT turned on.
      
       0  1  2  3  4  5  6  7     |    16 17 18 19 20 21 22 23
       cores on socket 0               threads on socket 0
       8  9 10 11 12 13 14 15     |    24 25 26 27 28 29 30 31
       cores on socket 1               threads on socket 1
      
      When the kdump kernel is started with command line option nr_cpus=16 and the
      panic was triggered on one of the cpus 24-31, e.g. 26, the online cpus will be
      1-15 and 26 (cpu 0 is disabled in the kdump kernel), ncpus will be 16 and
      __max_logical_packages will be 1, but actually two packages were booted.
      
      This issue can be reproduced by setting the kdump option nr_cpus=<number of
      real physical cores> and then triggering the panic on the last socket's thread, for example:
      
      taskset -c 26 echo c > /proc/sysrq-trigger
      
      Use total_cpus, which is not limited by the nr_cpus command line option, to
      calculate the value of __max_logical_packages.
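
      A minimal sketch of the resulting calculation (the surrounding context in
      arch/x86/kernel/smpboot.c is assumed):

        ncpus = cpu_data(0).booted_cores * topology_max_smt_threads();
        __max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);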
      Signed-off-by: Hui Wang <john.wanghui@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: <guijianfeng@huawei.com>
      Cc: <wencongyang2@huawei.com>
      Cc: <douliyang1@huawei.com>
      Cc: <qiaonuohan@huawei.com>
      Link: https://lkml.kernel.org/r/20181107023643.22174-1-john.wanghui@huawei.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • netfilter: ipt_CLUSTERIP: fix deadlock in netns exit routine · 9d51378a
      Committed by Taehee Yoo
      [ Upstream commit 5a86d68bcf02f2d1e9a5897dd482079fd5f75e7f ]
      
      When a network namespace is destroyed, cleanup_net() is called.
      cleanup_net() holds pernet_ops_rwsem and then calls each ->exit callback,
      so clusterip_tg_destroy() is called by cleanup_net(),
      and clusterip_tg_destroy() calls unregister_netdevice_notifier().
      
      But both cleanup_net() and clusterip_tg_destroy() take the same
      lock (pernet_ops_rwsem), hence a deadlock occurs.
      
      After this patch, only one notifier is registered, at module insertion time,
      and all configs are added to a per-net list.
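
      A minimal sketch of the new registration scheme (names are assumptions
      and the real patch is larger; it also moves the configs to per-net data):

        static struct notifier_block cip_netdev_notifier = {
                .notifier_call = clusterip_netdev_event,
        };

        static int __init clusterip_tg_init(void)
        {
                int ret;

                ret = xt_register_target(&clusterip_tg_reg);
                if (ret < 0)
                        return ret;

                /*
                 * One notifier for the whole module instead of one per
                 * config: unregistering no longer happens from a netns
                 * ->exit path that already holds pernet_ops_rwsem.
                 */
                ret = register_netdevice_notifier(&cip_netdev_notifier);
                if (ret < 0)
                        xt_unregister_target(&clusterip_tg_reg);
                return ret;
        }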
      
      test commands:
         %ip netns add vm1
         %ip netns exec vm1 bash
         %ip link set lo up
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	-j CLUSTERIP --new --hashmode sourceip \
      	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
         %exit
         %ip netns del vm1
      
      splat looks like:
      [  341.809674] ============================================
      [  341.809674] WARNING: possible recursive locking detected
      [  341.809674] 4.19.0-rc5+ #16 Tainted: G        W
      [  341.809674] --------------------------------------------
      [  341.809674] kworker/u4:2/87 is trying to acquire lock:
      [  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: unregister_netdevice_notifier+0x8c/0x460
      [  341.809674]
      [  341.809674] but task is already holding lock:
      [  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
      [  341.809674]
      [  341.809674] other info that might help us debug this:
      [  341.809674]  Possible unsafe locking scenario:
      [  341.809674]
      [  341.809674]        CPU0
      [  341.809674]        ----
      [  341.809674]   lock(pernet_ops_rwsem);
      [  341.809674]   lock(pernet_ops_rwsem);
      [  341.809674]
      [  341.809674]  *** DEADLOCK ***
      [  341.809674]
      [  341.809674]  May be due to missing lock nesting notation
      [  341.809674]
      [  341.809674] 3 locks held by kworker/u4:2/87:
      [  341.809674]  #0: 00000000d9df6c92 ((wq_completion)"%s""netns"){+.+.}, at: process_one_work+0xafe/0x1de0
      [  341.809674]  #1: 00000000c2cbcee2 (net_cleanup_work){+.+.}, at: process_one_work+0xb60/0x1de0
      [  341.809674]  #2: 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
      [  341.809674]
      [  341.809674] stack backtrace:
      [  341.809674] CPU: 1 PID: 87 Comm: kworker/u4:2 Tainted: G        W         4.19.0-rc5+ #16
      [  341.809674] Workqueue: netns cleanup_net
      [  341.809674] Call Trace:
      [ ... ]
      [  342.070196]  down_write+0x93/0x160
      [  342.070196]  ? unregister_netdevice_notifier+0x8c/0x460
      [  342.070196]  ? down_read+0x1e0/0x1e0
      [  342.070196]  ? sched_clock_cpu+0x126/0x170
      [  342.070196]  ? find_held_lock+0x39/0x1c0
      [  342.070196]  unregister_netdevice_notifier+0x8c/0x460
      [  342.070196]  ? register_netdevice_notifier+0x790/0x790
      [  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
      [  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
      [  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
      [  342.070196]  ? trace_hardirqs_on+0x93/0x210
      [  342.070196]  ? __bpf_trace_preemptirq_template+0x10/0x10
      [  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
      [  342.123094]  clusterip_tg_destroy+0x3ad/0x650 [ipt_CLUSTERIP]
      [  342.123094]  ? clusterip_net_init+0x3d0/0x3d0 [ipt_CLUSTERIP]
      [  342.123094]  ? cleanup_match+0x17d/0x200 [ip_tables]
      [  342.123094]  ? xt_unregister_table+0x215/0x300 [x_tables]
      [  342.123094]  ? kfree+0xe2/0x2a0
      [  342.123094]  cleanup_entry+0x1d5/0x2f0 [ip_tables]
      [  342.123094]  ? cleanup_match+0x200/0x200 [ip_tables]
      [  342.123094]  __ipt_unregister_table+0x9b/0x1a0 [ip_tables]
      [  342.123094]  iptable_filter_net_exit+0x43/0x80 [iptable_filter]
      [  342.123094]  ops_exit_list.isra.10+0x94/0x140
      [  342.123094]  cleanup_net+0x45b/0x900
      [ ... ]
      
      Fixes: 202f59af ("netfilter: ipt_CLUSTERIP: do not hold dev")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine · bb7b6c49
      Committed by Taehee Yoo
      [ Upstream commit b12f7bad5ad3724d19754390a3e80928525c0769 ]
      
      When a network namespace is destroyed, both clusterip_tg_destroy() and
      clusterip_net_exit() are called, with clusterip_net_exit() being called
      before clusterip_tg_destroy().
      Hence the cleanup check code in clusterip_net_exit() doesn't make sense.
      
      test commands:
         %ip netns add vm1
         %ip netns exec vm1 bash
         %ip link set lo up
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	-j CLUSTERIP --new --hashmode sourceip \
      	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
         %exit
         %ip netns del vm1
      
      splat looks like:
      [  341.184508] WARNING: CPU: 1 PID: 87 at net/ipv4/netfilter/ipt_CLUSTERIP.c:840 clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
      [  341.184850] Modules linked in: ipt_CLUSTERIP nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp iptable_filter bpfilter ip_tables x_tables
      [  341.184850] CPU: 1 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc5+ #16
      [  341.227509] Workqueue: netns cleanup_net
      [  341.227509] RIP: 0010:clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
      [  341.227509] Code: 0f 85 7f fe ff ff 48 c7 c2 80 64 2c c0 be a8 02 00 00 48 c7 c7 a0 63 2c c0 c6 05 18 6e 00 00 01 e8 bc 38 ff f5 e9 5b fe ff ff <0f> 0b e9 33 ff ff ff e8 4b 90 50 f6 e9 2d fe ff ff 48 89 df e8 de
      [  341.227509] RSP: 0018:ffff88011086f408 EFLAGS: 00010202
      [  341.227509] RAX: dffffc0000000000 RBX: 1ffff1002210de85 RCX: 0000000000000000
      [  341.227509] RDX: 1ffff1002210de85 RSI: ffff880110813be8 RDI: ffffed002210de58
      [  341.227509] RBP: ffff88011086f4d0 R08: 0000000000000000 R09: 0000000000000000
      [  341.227509] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1002210de81
      [  341.227509] R13: ffff880110625a48 R14: ffff880114cec8c8 R15: 0000000000000014
      [  341.227509] FS:  0000000000000000(0000) GS:ffff880116600000(0000) knlGS:0000000000000000
      [  341.227509] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  341.227509] CR2: 00007f11fd38e000 CR3: 000000013ca16000 CR4: 00000000001006e0
      [  341.227509] Call Trace:
      [  341.227509]  ? __clusterip_config_find+0x460/0x460 [ipt_CLUSTERIP]
      [  341.227509]  ? default_device_exit+0x1ca/0x270
      [  341.227509]  ? remove_proc_entry+0x1cd/0x390
      [  341.227509]  ? dev_change_net_namespace+0xd00/0xd00
      [  341.227509]  ? __init_waitqueue_head+0x130/0x130
      [  341.227509]  ops_exit_list.isra.10+0x94/0x140
      [  341.227509]  cleanup_net+0x45b/0x900
      [ ... ]
      
      Fixes: 613d0776 ("netfilter: exit_net cleanup check added")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • netfilter: ipt_CLUSTERIP: check MAC address when duplicate config is set · 744383c8
      Committed by Taehee Yoo
      [ Upstream commit 06aa151ad1fc74a49b45336672515774a678d78d ]
      
      If a config for the same destination IP address already exists, that
      config is just reused. The MAC address should then also be the same;
      however, there was no MAC address checking routine, so this patch
      adds one.
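
      A minimal sketch of the added check (field names follow ipt_CLUSTERIP;
      the error handling is simplified):

        config = clusterip_config_find_get(par->net, e->ip.dst.s_addr, 0);
        if (config) {
                /* Reusing an existing config: its clustermac must match. */
                if (memcmp(&config->clustermac, &cipinfo->clustermac,
                           ETH_ALEN) != 0) {
                        clusterip_config_put(config);
                        return -EINVAL;
                }
        }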
      
      test commands:
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	   -j CLUSTERIP --new --hashmode sourceip \
      	   --clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	   -j CLUSTERIP --new --hashmode sourceip \
      	   --clustermac 01:00:5e:00:00:21 --total-nodes 2 --local-node 1
      
      After this patch, the second of the above commands is rejected.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf vendor events intel: Fix Load_Miss_Real_Latency on SKL/SKX · bd1040e6
      Committed by Andi Kleen
      [ Upstream commit 91b2b97025097ce7ca7536bc87eba2bf14760fb4 ]
      
      Fix incorrect event names for the Load_Miss_Real_Latency metric for
      Skylake and Skylake Server.
      
      Fixes https://github.com/andikleen/pmu-tools/issues/158
      
      Before:
      
        % perf stat -M Load_Miss_Real_Latency true
        event syntax error: '..ss.pending,mem_load_retired.l1_miss_ps,mem_load_retired.fb_hit_ps}:W'
                                          \___ parser error
      
         Usage: perf stat [<options>] [<command>]
      
            -M, --metrics <metric/metric group list>
                                  monitor specified metrics or metric groups (separated by ,)
      
      After:
      
        % perf stat -M Load_Miss_Real_Latency true
      
         Performance counter stats for 'true':
      
                   279,204      l1d_pend_miss.pending     #     14.0 Load_Miss_Real_Latency
                     4,784      mem_load_uops_retired.l1_miss
                    15,188      mem_load_uops_retired.hit_lfb
      
               0.000899640 seconds time elapsed
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Link: http://lkml.kernel.org/r/20181120050635.4215-1-andi@firstfloor.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf parse-events: Fix unchecked usage of strncpy() · 58c67a0b
      Committed by Arnaldo Carvalho de Melo
      [ Upstream commit bd8d57fb7e25e9fcf67a9eef5fa13aabe2016e07 ]
      
      The strncpy() function may leave the destination string buffer
      unterminated; better to use strlcpy(), for which we have a __weak
      fallback implementation for systems without it.
      
      This fixes this warning on an Alpine Linux Edge system with gcc 8.2:
      
        util/parse-events.c: In function 'print_symbol_events':
        util/parse-events.c:2465:4: error: 'strncpy' specified bound 100 equals destination size [-Werror=stringop-truncation]
            strncpy(name, syms->symbol, MAX_NAME_LEN);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        In function 'print_symbol_events.constprop',
            inlined from 'print_events' at util/parse-events.c:2508:2:
        util/parse-events.c:2465:4: error: 'strncpy' specified bound 100 equals destination size [-Werror=stringop-truncation]
            strncpy(name, syms->symbol, MAX_NAME_LEN);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        In function 'print_symbol_events.constprop',
            inlined from 'print_events' at util/parse-events.c:2511:2:
        util/parse-events.c:2465:4: error: 'strncpy' specified bound 100 equals destination size [-Werror=stringop-truncation]
            strncpy(name, syms->symbol, MAX_NAME_LEN);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        cc1: all warnings being treated as errors
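
      A minimal sketch of the change in print_symbol_events()
      (util/parse-events.c):

        char name[MAX_NAME_LEN];

        /* Before: may leave 'name' without a nul terminator. */
        strncpy(name, syms->symbol, MAX_NAME_LEN);

        /* After: strlcpy() always nul-terminates within MAX_NAME_LEN. */
        strlcpy(name, syms->symbol, MAX_NAME_LEN);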
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Fixes: 947b4ad1 ("perf list: Fix max event string size")
      Link: https://lkml.kernel.org/n/tip-b663e33bm6x8hrkie4uxh7u2@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf svghelper: Fix unchecked usage of strncpy() · b332b4cd
      Committed by Arnaldo Carvalho de Melo
      [ Upstream commit 2f5302533f306d5ee87bd375aef9ca35b91762cb ]
      
      The strncpy() function may leave the destination string buffer
      unterminated; better to use strlcpy(), for which we have a __weak
      fallback implementation for systems without it.
      
      In this specific case this would only happen if fgets() was buggy, as
      its man page states that it should read one less byte than the size of
      the destination buffer, so that it can put the nul byte at the end of
      it, so it would never copy 255 non-nul chars, as fgets reads into the
      orig buffer at most 254 non-nul chars and terminates it. But let's just
      switch to strlcpy to keep the original intent and silence the gcc 8.2
      warning.
      
      This fixes this warning on an Alpine Linux Edge system with gcc 8.2:
      
        In function 'cpu_model',
            inlined from 'svg_cpu_box' at util/svghelper.c:378:2:
        util/svghelper.c:337:5: error: 'strncpy' output may be truncated copying 255 bytes from a string of length 255 [-Werror=stringop-truncation]
             strncpy(cpu_m, &buf[13], 255);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Fixes: f48d55ce ("perf: Add a SVG helper library file")
      Link: https://lkml.kernel.org/n/tip-xzkoo0gyr56gej39ltivuh9g@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf tests ARM: Disable breakpoint tests 32-bit · f54fc4c2
      Committed by Florian Fainelli
      [ Upstream commit 24f967337f6d6bce931425769c0f5ff5cf2d212e ]
      
      The breakpoint tests on the ARM 32-bit kernel are broken in several
      ways.
      
      The breakpoint length requested does not necessarily match whether the
      function address has the Thumb bit (bit 0) set or not, and this does
      matter to the ARM kernel hw_breakpoint infrastructure. See [1] for
      background.
      
      [1]: https://lkml.org/lkml/2018/11/15/205
      
      As Will indicated, the overflow handling would require single-stepping
      which is not supported at the moment. Just disable those tests for the
      ARM 32-bit platforms and update the comment above to explain these
      limitations.
      Co-developed-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20181203191138.2419-1-f.fainelli@gmail.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf intel-pt: Fix error with config term "pt=0" · c3e8c335
      Committed by Adrian Hunter
      [ Upstream commit 1c6f709b9f96366cc47af23c05ecec9b8c0c392d ]
      
      Users should never use 'pt=0', but if they do it may give a meaningless
      error:
      
      	$ perf record -e intel_pt/pt=0/u uname
      	Error:
      	The sys_perf_event_open() syscall returned with 22 (Invalid argument) for
      	event (intel_pt/pt=0/u).
      
      Fix that by forcing 'pt=1'.
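
      A minimal sketch of the idea (the bit for the 'pt' config term is an
      assumption here; the real code derives it from the PMU format directory):

        /* If the user explicitly passed pt=0, warn and force pt=1. */
        if (!(evsel->attr.config & pt_bit)) {
                pr_warning("pt=0 doesn't make sense, forcing pt=1\n");
                evsel->attr.config |= pt_bit;
        }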
      
      Committer testing:
      
        # perf record -e intel_pt/pt=0/u uname
        Error:
        The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (intel_pt/pt=0/u).
        /bin/dmesg | grep -i perf may provide additional information.
      
        # perf record -e intel_pt/pt=0/u uname
        pt=0 doesn't make sense, forcing pt=1
        Linux
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.020 MB perf.data ]
        #
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: http://lkml.kernel.org/r/b7c5b4e5-9497-10e5-fd43-5f3e4a0fe51d@intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • tty/serial: do not free transmit buffer page under port lock · f74fc96e
      Committed by Sergey Senozhatsky
      [ Upstream commit d72402145ace0697a6a9e8e75a3de5bf3375f78d ]
      
      LKP has hit yet another circular locking dependency between uart
      console drivers and debugobjects [1]:
      
           CPU0                                    CPU1
      
                                                  rhltable_init()
                                                   __init_work()
                                                    debug_object_init
           uart_shutdown()                          /* db->lock */
            /* uart_port->lock */                    debug_print_object()
             free_page()                              printk()
                                                       call_console_drivers()
              debug_check_no_obj_freed()                /* uart_port->lock */
               /* db->lock */
                debug_print_object()
      
      So there are two dependency chains:
      	uart_port->lock -> db->lock
      And
      	db->lock -> uart_port->lock
      
      This particular circular locking dependency can be addressed in several
      ways:
      
      a) One way would be to move debug_print_object() out of db->lock scope
         and, thus, break the db->lock -> uart_port->lock chain.
      b) Another one would be to free() transmit buffer page out of db->lock
         in UART code; which is what this patch does.
      
      It makes sense to apply a) and b) independently: there are too many things
      going on behind free(), none of which depend on uart_port->lock.
      
      The patch fixes transmit buffer page free() in uart_shutdown() and,
      additionally, in uart_port_startup() (as was suggested by Dmitry Safonov).
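
      A minimal sketch of the reordering in uart_shutdown() (simplified; the
      real code also handles uart_port_startup()):

        unsigned long flags;
        char *xmit_buf = NULL;

        /* Detach the buffer under the port lock... */
        spin_lock_irqsave(&uport->lock, flags);
        xmit_buf = state->xmit.buf;
        state->xmit.buf = NULL;
        spin_unlock_irqrestore(&uport->lock, flags);

        /*
         * ...and free it only after the lock is dropped, so debugobjects
         * cannot printk() back into a uart console while we hold it.
         */
        if (xmit_buf)
                free_page((unsigned long)xmit_buf);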
      
      [1] https://lore.kernel.org/lkml/20181211091154.GL23332@shao2-debian/T/#u
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Dmitry Safonov <dima@arista.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • btrfs: improve error handling of btrfs_add_link · 310f8296
      Committed by Johannes Thumshirn
      [ Upstream commit 1690dd41e0cb1dade80850ed8a3eb0121b96d22f ]
      
      In the error handling block, err holds the return value of either
      btrfs_del_root_ref() or btrfs_del_inode_ref(), but it hasn't been checked
      since its introduction with commit fe66a05a (Btrfs: improve error
      handling for btrfs_insert_dir_item callers) in 2012.
      
      If the error handling in the error handling fails, there's not much left
      to do and the abort either happened earlier in the callees or is
      necessary here.
      
      So if one of btrfs_del_root_ref() or btrfs_del_inode_ref() failed, abort
      the transaction, but still return the original code of the failure
      stored in 'ret' as this will be reported to the user.
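
      A minimal sketch of the resulting error path (simplified; argument lists
      and the surrounding cleanup in fs/btrfs/inode.c are elided):

        err = btrfs_del_inode_ref(trans, root, name, name_len,
                                  ino, parent_ino, &local_index);
        if (err)
                btrfs_abort_transaction(trans, err);

        /* Report the original failure, not the cleanup's result. */
        btrfs_abort_transaction(trans, ret);
        return ret;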
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • btrfs: fix use-after-free due to race between replace start and cancel · 38b17eee
      Committed by Anand Jain
      [ Upstream commit d189dd70e2556181732598956d808ea53cc8774e ]
      
      The device replace cancel thread can race with the replace start thread
      and if fs_info::scrubs_running is not yet set, btrfs_scrub_cancel() will
      fail to stop the scrub thread.
      
      The scrub thread continues with the scrub for replace, which will then
      try to write to the target device, which has already been freed by the
      cancel thread.
      
      scrub_setup_ctx() warns as tgtdev is NULL.
      
        struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
        {
        ...
      	  if (is_dev_replace) {
      		  WARN_ON(!fs_info->dev_replace.tgtdev);  <===
      		  sctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO;
      		  sctx->wr_tgtdev = fs_info->dev_replace.tgtdev;
      		  sctx->flush_all_writes = false;
      	  }
      
        [ 6724.497655] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc started
        [ 6753.945017] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc canceled
        [ 6852.426700] WARNING: CPU: 0 PID: 4494 at fs/btrfs/scrub.c:622 scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.428928] RIP: 0010:scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.432970] Call Trace:
        [ 6852.433202]  btrfs_scrub_dev+0x19b/0x5c0 [btrfs]
        [ 6852.433471]  btrfs_dev_replace_start+0x48c/0x6a0 [btrfs]
        [ 6852.433800]  btrfs_dev_replace_by_ioctl+0x3a/0x60 [btrfs]
        [ 6852.434097]  btrfs_ioctl+0x2476/0x2d20 [btrfs]
        [ 6852.434365]  ? do_sigaction+0x7d/0x1e0
        [ 6852.434623]  do_vfs_ioctl+0xa9/0x6c0
        [ 6852.434865]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435124]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435387]  ksys_ioctl+0x60/0x90
        [ 6852.435663]  __x64_sys_ioctl+0x16/0x20
        [ 6852.435907]  do_syscall_64+0x50/0x180
        [ 6852.436150]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Further, as the replace thread enters scrub_write_page_to_dev_replace()
      without the target device it panics:
      
        static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
      				      struct scrub_page *spage)
        {
        ...
      	bio_set_dev(bio, sbio->dev->bdev); <======
      
        [ 6929.715145] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
        ..
        [ 6929.717106] Workqueue: btrfs-scrub btrfs_scrub_helper [btrfs]
        [ 6929.717420] RIP: 0010:scrub_write_page_to_dev_replace+0xb4/0x260
        [btrfs]
        ..
        [ 6929.721430] Call Trace:
        [ 6929.721663]  scrub_write_block_to_dev_replace+0x3f/0x60 [btrfs]
        [ 6929.721975]  scrub_bio_end_io_worker+0x1af/0x490 [btrfs]
        [ 6929.722277]  normal_work_helper+0xf0/0x4c0 [btrfs]
        [ 6929.722552]  process_one_work+0x1f4/0x520
        [ 6929.722805]  ? process_one_work+0x16e/0x520
        [ 6929.723063]  worker_thread+0x46/0x3d0
        [ 6929.723313]  kthread+0xf8/0x130
        [ 6929.723544]  ? process_one_work+0x520/0x520
        [ 6929.723800]  ? kthread_delayed_work_timer_fn+0x80/0x80
        [ 6929.724081]  ret_from_fork+0x3a/0x50
      
      Fix this by letting btrfs_dev_replace_finishing() do the job of
      cleaning up after the cancel, including the freeing of the target device.
      btrfs_dev_replace_finishing() is called when btrfs_scrub_dev() returns
      along with the scrub return status.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • btrfs: alloc_chunk: fix more DUP stripe size handling · 720b86a5
      Committed by Hans van Kranenburg
      [ Upstream commit baf92114c7e6dd6124aa3d506e4bc4b694da3bc3 ]
      
      Commit 92e222df "btrfs: alloc_chunk: fix DUP stripe size handling"
      fixed the calculation of stripe_size for a new DUP chunk.
      
      However, the same calculation reappears a bit later, and that one was
      not changed yet. The resulting bug that is exposed is that the newly
      allocated device extents ('stripes') can have a few MiB overlap with the
      next thing stored after them, which is another device extent or the end
      of the disk.
      
      The scenario in which this can happen is:
      * The block device for the filesystem is less than 10GiB in size.
      * The amount of contiguous free unallocated disk space chosen to use for
        chunk allocation is 20% of the total device size, or a few MiB more or
        less.
      
      An example:
      - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
      - There's 1578MiB unallocated raw disk space left in one contiguous
        piece.
      
      In this case stripe_size is first calculated as 789MiB, (half of
      1578MiB).
      
      Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
      enter the if block. Now stripe_size value is immediately overwritten
      while calculating an adjusted value based on max_chunk_size, which ends
      up as 788MiB.
      
      Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
      actually more than the value we had before. However, the last comparison
      fails to detect this, because it's comparing the value with the total
      amount of free space, which is about twice the size of stripe_size.
      
      In the example above, this means that the resulting raw disk space being
      allocated is 1600MiB, while only a gap of 1578MiB has been found. The
      second device extent object for this DUP chunk will overlap for 22MiB
      with whatever comes next.
      
      The underlying problem here is that stripe_size is reused all the
      time for different things. So, when entering the code in the if block,
      stripe_size is immediately overwritten with something else. If later we
      decide we want the previous value back, the logic to compute it gets
      copy-pasted in again.
      
      With this change, the value in stripe_size is not unnecessarily
      destroyed, so the duplicated calculation is not needed any more.
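
      A minimal sketch of the reworked clamping (treat this as the shape of the
      fix in fs/btrfs/volumes.c, not a verbatim diff):

        if (stripe_size * data_stripes > max_chunk_size) {
                /*
                 * Reduce stripe_size, but never above its previous value,
                 * so the allocation cannot grow past the free space found.
                 */
                stripe_size = min(round_up(div_u64(max_chunk_size,
                                                   data_stripes), SZ_16M),
                                  stripe_size);
        }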
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • btrfs: volumes: Make sure there is no overlap of dev extents at mount time · bb5717a4
      Committed by Qu Wenruo
      [ Upstream commit 5eb193812a42dc49331f25137a38dfef9612d3e4 ]
      
      Enhance btrfs_verify_dev_extents() to remember the previously checked dev
      extent, so it can verify that no dev extents overlap.
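
      A minimal sketch of the added overlap check (the walk over the dev extent
      tree in btrfs_verify_dev_extents() is elided; names are assumptions):

        u64 prev_devid = 0;
        u64 prev_dev_ext_end = 0;

        /* Dev extents are visited in (devid, physical_offset) order. */
        if (devid == prev_devid && physical_offset < prev_dev_ext_end) {
                btrfs_err(fs_info,
                          "dev extent devid %llu offset %llu overlaps with previous dev extent end %llu",
                          devid, physical_offset, prev_dev_ext_end);
                ret = -EUCLEAN;
                goto out;
        }
        prev_devid = devid;
        prev_dev_ext_end = physical_offset + physical_len;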
      
      Analysis from Hans:
      
      "Imagine allocating a DATA|DUP chunk.
      
       In the chunk allocator, we first set...
         max_stripe_size = SZ_1G;
         max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
       ... which is 10GiB.
      
       Then...
         /* we don't want a chunk larger than 10% of writeable space */
         max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
             		 max_chunk_size);
      
       Imagine we only have one 7880MiB block device in this filesystem. Now
       max_chunk_size is down to 788MiB.
      
       The next step in the code is to search for max_stripe_size * dev_stripes
       amount of free space on the device, which is in our example 1GiB * 2 =
       2GiB. Imagine the device has exactly 1578MiB free in one contiguous
       piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
      
       Next we recalculate the stripe_size (which is actually the device extent
       length), based on the actual maximum amount of available raw disk space:
         stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
      
       stripe_size is now 789MiB
      
       Next we do...
         data_stripes = num_stripes / ncopies
       ...where data_stripes ends up as 1, because num_stripes is 2 (the amount
       of device extents we're going to have), and DUP has ncopies 2.
      
       Next there's a check...
         if (stripe_size * data_stripes > max_chunk_size)
       ...which matches because 789MiB * 1 > 788MiB.
      
       We go into the if code, and next is...
         stripe_size = div_u64(max_chunk_size, data_stripes);
       ...which resets stripe_size to max_chunk_size: 788MiB
      
       Next is a fun one...
         /* bump the answer up to a 16MB boundary */
         stripe_size = round_up(stripe_size, SZ_16M);
       ...which changes stripe_size from 788MiB to 800MiB.
      
       We're not done changing stripe_size yet...
         /* But don't go higher than the limits we found while searching
          * for free extents
          */
          stripe_size = min(devices_info[ndevs - 1].max_avail,
                            stripe_size);
      
       This is bad. max_avail is twice the stripe_size (we need to fit 2 device
       extents on the same device for DUP).
      
       The result here is that 800MiB < 1578MiB, so it's unchanged. However,
       the resulting DUP chunk will need 1600MiB disk space, which isn't there,
       and the second dev_extent might extend into the next thing (next
       dev_extent? end of device?) for 22MiB.
      
       The last shown line of code relies on a situation where there's twice
       the value of stripe_size present as value for the variable stripe_size
       when it's DUP. This was actually the case before commit 92e222df
       "btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
         "[...] in the meantime there's a check to see if the stripe_size does
       not exceed max_chunk_size. Since during this check stripe_size is twice
       the amount as intended, the check will reduce the stripe_size to
       max_chunk_size if the actual correct to be used stripe_size is more than
       half the amount of max_chunk_size."
      
       In the previous version of the code, the 16MiB alignment (why is this
       done, by the way?) would result in a 50% chance that it would actually
       do an 8MiB alignment for the individual dev_extents, since it was
       operating on double the size. Does this matter?
      
       Does it matter that stripe_size can be set to anything which is not
       16MiB aligned because of the amount of remaining available disk space
       which is just taken?
      
       What is the main purpose of this round_up?
      
       The most straightforward thing to do seems something like...
         stripe_size = min(
             div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
             stripe_size
         )
       ..just putting half of the max_avail into stripe_size."
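      
      The mount-time check itself is straightforward: walk the dev extents in
      physical order and remember where the previous one ended. A minimal
      userspace model of the idea (simplified types, not the kernel code):
      
        #include <stdio.h>
        
        struct dev_extent {
                unsigned long long physical_offset;
                unsigned long long length;
        };
        
        /* assumes the extents arrive sorted by physical_offset, as they
         * do when iterating the device tree at mount time */
        static int verify_dev_extents(const struct dev_extent *ext, int n)
        {
                unsigned long long prev_end = 0;
                int i;
        
                for (i = 0; i < n; i++) {
                        if (ext[i].physical_offset < prev_end) {
                                fprintf(stderr, "overlap at %llu (prev end %llu)\n",
                                        ext[i].physical_offset, prev_end);
                                return -1;
                        }
                        prev_end = ext[i].physical_offset + ext[i].length;
                }
                return 0;
        }
        
        int main(void)
        {
                /* two 800MiB extents overlapping by 22MiB, as in the DUP
                 * example above */
                struct dev_extent ext[] = {
                        {            0, 800 * (1ULL << 20) },
                        { 778ULL << 20, 800 * (1ULL << 20) },
                };
        
                return verify_dev_extents(ext, 2) ? 1 : 0;
        }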
      
      Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/
      Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ add analysis from report ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      bb5717a4
    • J
      mmc: atmel-mci: do not assume idle after atmci_request_end · c21991ed
      Jonas Danielsson committed
      [ Upstream commit ae460c115b7aa50c9a36cf78fced07b27962c9d0 ]
      
      On our AT91SAM9260 board we use the same sdio bus for wifi and for the
      sd card slot. This caused the atmel-mci to give the following splat on
      the serial console:
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 538 at drivers/mmc/host/atmel-mci.c:859 atmci_send_command+0x24/0x44
        Modules linked in:
        CPU: 0 PID: 538 Comm: mmcqd/0 Not tainted 4.14.76 #14
        Hardware name: Atmel AT91SAM9
        [<c000fccc>] (unwind_backtrace) from [<c000d3dc>] (show_stack+0x10/0x14)
        [<c000d3dc>] (show_stack) from [<c0017644>] (__warn+0xd8/0xf4)
        [<c0017644>] (__warn) from [<c0017704>] (warn_slowpath_null+0x1c/0x24)
        [<c0017704>] (warn_slowpath_null) from [<c033bb9c>] (atmci_send_command+0x24/0x44)
        [<c033bb9c>] (atmci_send_command) from [<c033e984>] (atmci_start_request+0x1f4/0x2dc)
        [<c033e984>] (atmci_start_request) from [<c033f3b4>] (atmci_request+0xf0/0x164)
        [<c033f3b4>] (atmci_request) from [<c0327108>] (mmc_start_request+0x280/0x2d0)
        [<c0327108>] (mmc_start_request) from [<c032800c>] (mmc_start_areq+0x230/0x330)
        [<c032800c>] (mmc_start_areq) from [<c03366f8>] (mmc_blk_issue_rw_rq+0xc4/0x310)
        [<c03366f8>] (mmc_blk_issue_rw_rq) from [<c03372c4>] (mmc_blk_issue_rq+0x118/0x5ac)
        [<c03372c4>] (mmc_blk_issue_rq) from [<c033781c>] (mmc_queue_thread+0xc4/0x118)
        [<c033781c>] (mmc_queue_thread) from [<c002daf8>] (kthread+0x100/0x118)
        [<c002daf8>] (kthread) from [<c000a580>] (ret_from_fork+0x14/0x34)
        ---[ end trace 594371ddfa284bd6 ]---
      
      This is:
        WARN_ON(host->cmd);
      
      This was fixed on our board by letting atmci_request_end determine what
      state we are in, instead of unconditionally setting it to STATE_IDLE on
      STATE_END_REQUEST.
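      
      A rough, hypothetical sketch of that idea (not the actual driver diff;
      the atmci_start_next_request() helper is assumed, not taken from the
      driver):
      
        /* hypothetical: derive the next state from pending work instead
         * of forcing STATE_IDLE unconditionally */
        if (list_empty(&host->queue))
                host->state = STATE_IDLE;
        else
                atmci_start_next_request(host); /* assumed helper */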
      Signed-off-by: Jonas Danielsson <jonas@orbital-systems.com>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c21991ed
    • M
      kconfig: fix memory leak when EOF is encountered in quotation · 46199110
      Masahiro Yamada committed
      [ Upstream commit fbac5977d81cb2b2b7e37b11c459055d9585273c ]
      
      An unterminated string literal followed by a new line is passed to the
      parser (with a "multi-line strings not supported" warning shown), and
      is then handled properly there.
      
      On the other hand, an unterminated string literal at the end of a file
      is never passed to the parser, and so results in a memory leak.
      
      [Test Code]
      
        ----------(Kconfig begin)----------
        source "Kconfig.inc"
      
        config A
                bool "a"
        -----------(Kconfig end)-----------
      
        --------(Kconfig.inc begin)--------
        config B
                bool "b\No new line at end of file
        ---------(Kconfig.inc end)---------
      
      [Summary from Valgrind]
      
        Before the fix:
      
          LEAK SUMMARY:
             definitely lost: 16 bytes in 1 blocks
             ...
      
        After the fix:
      
          LEAK SUMMARY:
             definitely lost: 0 bytes in 0 blocks
             ...
      
      Eliminate the memory leak path by handling this case. Of course, such
      a Kconfig file is wrong already, so I will add an error message later.
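      
      The leak pattern is easy to model in plain C: whatever buffer the lexer
      accumulates must end up owned by someone. A minimal userspace sketch,
      assuming (as in kconfig) that the consumer frees tokens handed to it:
      
        #include <stdio.h>
        #include <stdlib.h>
        
        /* toy scanner: reads up to a closing quote; on EOF inside the
         * literal it must still hand the buffer back, or it leaks */
        static char *scan_string(FILE *f)
        {
                char *buf = malloc(64);
                size_t n = 0;
                int c;
        
                while (buf && (c = fgetc(f)) != EOF && c != '"' && n < 63)
                        buf[n++] = (char)c;
                if (buf)
                        buf[n] = '\0';
                return buf; /* caller owns it, even on the EOF path */
        }
        
        int main(void)
        {
                char *tok = scan_string(stdin);
        
                if (tok) {
                        printf("token: %s\n", tok);
                        free(tok); /* valgrind: definitely lost: 0 bytes */
                }
                return 0;
        }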
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      46199110
    • M
      kconfig: fix file name and line number of warn_ignored_character() · ba8efcdc
      Masahiro Yamada committed
      [ Upstream commit 77c1c0fa8b1477c5799bdad65026ea5ff676da44 ]
      
      Currently, warn_ignored_character() displays an invalid file name and
      line number.
      
      The lexer should use current_file->name and yylineno, while the parser
      should use zconf_curname() and zconf_lineno().
      
      This difference comes from the fact that the lexer always runs ahead
      of the parser: the parser needs to look ahead one token to make a
      shift/reduce decision, so the lexer is asked to scan more text from
      the input file.
      
      This commit fixes the warning message from warn_ignored_character().
      
      [Test Code]
      
        ----(Kconfig begin)----
        /
        -----(Kconfig end)-----
      
      [Output]
      
        Before the fix:
      
        <none>:0:warning: ignoring unsupported character '/'
      
        After the fix:
      
        Kconfig:1:warning: ignoring unsupported character '/'
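      
      With those accessors, the fixed lexer-side helper plausibly looks
      something like this (current_file and yylineno are the lexer's own
      globals; a sketch, not the verbatim patch):
      
        static void warn_ignored_character(char chr)
        {
                fprintf(stderr,
                        "%s:%d:warning: ignoring unsupported character '%c'\n",
                        current_file->name, yylineno, chr);
        }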
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ba8efcdc
    • J
      bpf: relax verifier restriction on BPF_MOV | BPF_ALU · 344b51e7
      Jiong Wang committed
      [ Upstream commit e434b8cdf788568ba65a0a0fd9f3cb41f3ca1803 ]
      
      Currently, the destination register is marked as unknown for 32-bit
      sub-register move (BPF_MOV | BPF_ALU) whenever the source register type is
      SCALAR_VALUE.
      
      This is so conservative that some valid cases are rejected. In
      particular, it may turn a constant scalar value into an unknown value,
      which can break assumptions made elsewhere in the verifier.
      
      For example, test_l4lb_noinline.c has the following C code:
      
          struct real_definition *dst
      
      1:  if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6))
      2:    return TC_ACT_SHOT;
      3:
      4:  if (dst->flags & F_IPV6) {
      
      get_packet_dst is responsible for initializing "dst" into valid pointer and
      return true (1), otherwise return false (0). The compiled instruction
      sequence using alu32 will be:
      
        412: (54) (u32) r7 &= (u32) 1
        413: (bc) (u32) r0 = (u32) r7
        414: (95) exit
      
      Insn 413, a BPF_MOV | BPF_ALU, will however turn r0 into an unknown
      value even though r7 contains the SCALAR_VALUE 1.
      
      This causes trouble when the verifier walks the code path that hasn't
      initialized "dst" inside get_packet_dst, in which case 0 is returned.
      We would then expect the verifier to conclude that line 1 in the above
      C code passes the "if" check and therefore to skip the fall-through
      path starting at line 4. But because the r0 returned from the callee
      has become an unknown value, the verifier won't skip analyzing the
      path starting at line 4, and "dst->flags" requires dereferencing the
      pointer "dst", which actually hasn't been initialized for this path.
      
      This patch relaxes the code that marks the sub-register move
      destination. For a SCALAR_VALUE, it is safe to just copy the value
      from the source and then truncate it to 32 bits, as sketched below.
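      
      In verifier terms, the relaxed marking amounts to roughly the
      following in the BPF_MOV handling (a sketch of the logic, not the
      verbatim patch):
      
        if (BPF_CLASS(insn->code) == BPF_ALU64) {
                /* R1 = R2 (64-bit): copy the full register state */
                *dst_reg = *src_reg;
        } else if (src_reg->type == SCALAR_VALUE) {
                /* R1 = (u32) R2: keep the known scalar value, then
                 * truncate it to 32 bits */
                *dst_reg = *src_reg;
                coerce_reg_to_size(dst_reg, 4);
        } else {
                /* a non-scalar 32-bit move still becomes unknown */
                mark_reg_unknown(env, regs, insn->dst_reg);
        }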
      
      A unit test is also included to demonstrate this issue; it fails
      before this patch.
      
      This relaxation could let the verifier skip more paths for conditional
      comparisons against immediates. It also lets the verifier record a
      more accurate/strict value for a register in a given state; if that
      state ends up going through exit without rejection and is later used
      for state comparison, an inaccurate/permissive value might have been
      better. So the real impact on the number of processed insns is
      complex. But overall, without this fix, valid programs could be
      rejected.
      
      From real benchmarking on kernel selftests and Cilium bpf tests, there
      is no impact on the processed instruction count when tests are
      compiled with the default compilation options. There are slight
      improvements when they are compiled with -mattr=+alu32 after this
      patch.
      
      Also, test_xdp_noinline/-mattr=+alu32 now passes verification; it was
      rejected before this fix.
      
      Insn processed before/after this patch:
      
                              default     -mattr=+alu32
      
      Kernel selftest
      
      ===
      test_xdp.o              371/371      369/369
      test_l4lb.o             6345/6345    5623/5623
      test_xdp_noinline.o     2971/2971    rejected/2727
      test_tcp_estates.o      429/429      430/430
      
      Cilium bpf
      ===
      bpf_lb-DLB_L3.o:        2085/2085     1685/1687
      bpf_lb-DLB_L4.o:        2287/2287     1986/1982
      bpf_lb-DUNKNOWN.o:      690/690       622/622
      bpf_lxc.o:              95033/95033   N/A
      bpf_netdev.o:           7245/7245     N/A
      bpf_overlay.o:          2898/2898     3085/2947
      
      NOTE:
        - bpf_lxc.o and bpf_netdev.o compiled by -mattr=+alu32 are rejected by
          verifier due to another issue inside verifier on supporting alu32
          binary.
        - Each cilium bpf program could generate several processed insn number,
          above number is sum of them.
      
      v1->v2:
       - Restrict the change on SCALAR_VALUE.
       - Update benchmark numbers on Cilium bpf tests.
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      344b51e7
    • W
      arm64: Fix minor issues with the dcache_by_line_op macro · dfbf8c98
      Will Deacon committed
      [ Upstream commit 33309ecda0070506c49182530abe7728850ebe78 ]
      
      The dcache_by_line_op macro suffers from a couple of small problems:
      
      First, the GAS directives that are currently being used rely on
      assembler behavior that is not documented, and probably not guaranteed
      to produce the correct behavior going forward. As a result, we end up
      with some undefined symbols in cache.o:
      
      $ nm arch/arm64/mm/cache.o
               ...
               U civac
               ...
               U cvac
               U cvap
               U cvau
      
      This is because the comparisons used to select the operation type in
      the dcache_by_line_op macro compare symbols, not strings. Even though
      GAS appears to do the right thing here (undefined symbols with the
      same name compare equal to each other), it seems unwise to rely on
      this.
      
      Second, when patching in a DC CVAP instruction on CPUs that support it,
      the fallback path consists of a DC CVAU instruction which may be
      affected by CPU errata that require ARM64_WORKAROUND_CLEAN_CACHE.
      
      Solve these issues by unrolling the various maintenance routines and
      using the conditional directives that are documented as operating on
      strings. To avoid the complexity of nested alternatives, we move the
      DC CVAP patching to __clean_dcache_area_pop, falling back to a branch
      to __clean_dcache_area_poc if DCPOP is not supported by the CPU.
      Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Suggested-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      dfbf8c98
    • L
      clk: imx6q: reset exclusive gates on init · 73f0b2e3
      Lucas Stach committed
      [ Upstream commit f7542d817733f461258fd3a47d77da35b2d9fc81 ]
      
      The exclusive gates may be set up in the wrong way by software running
      before the clock driver comes up. In that case the exclusive setup is
      locked in its initial state, as the complementary function can't be
      activated without disabling the initial setup first.
      
      To avoid this lock situation, reset the exclusive gates to the off
      state and allow the kernel to provide the proper setup.
      Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
      Reviewed-by: Dong Aisheng <Aisheng.dong@nxp.com>
      Signed-off-by: Stephen Boyd <sboyd@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      73f0b2e3
    • Q
      arm64: kasan: Increase stack size for KASAN_EXTRA · 8f183b33
      Qian Cai committed
      [ Upstream commit 6e8830674ea77f57d57a33cca09083b117a71f41 ]
      
      If the kernel is configured with KASAN_EXTRA, the stack size is
      increased significantly due to setting the GCC -fstack-reuse option to
      "none" [1]. As a result, it can trigger a stack overrun quite often with
      32k stack size compiled using GCC 8. For example, this reproducer
      
        https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise06.c
      
      can trigger a "corrupted stack end detected inside scheduler" very
      reliably with CONFIG_SCHED_STACK_END_CHECK enabled. There are other
      reports at:
      
        https://lore.kernel.org/lkml/1542144497.12945.29.camel@gmx.us/
        https://lore.kernel.org/lkml/721E7B42-2D55-4866-9C1A-3E8D64F33F9C@gmx.us/
      
      There are just too many functions that can have a large stack with
      KASAN_EXTRA due to large local variables, and they are called over and
      over again without being able to reuse the stacks. Some noticeable
      ones are:
      
      size
      7536 shrink_inactive_list
      7440 shrink_page_list
      6560 fscache_stats_show
      3920 jbd2_journal_commit_transaction
      3216 try_to_unmap_one
      3072 migrate_page_move_mapping
      3584 migrate_misplaced_transhuge_page
      3920 ip_vs_lblcr_schedule
      4304 lpfc_nvme_info_show
      3888 lpfc_debugfs_nvmestat_data.constprop
      
      There are 49 other functions over 2k in size while compiling the
      kernel with "-Wframe-larger-than=" on this machine. Hence, it is too
      much work to change the Makefiles for each object to compile without
      -fsanitize-address-use-after-scope individually.
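      
      The change itself is a small bump of the arm64 stack-size shift; its
      shape, as a sketch of the upstream patch:
      
        /* arch/arm64/include/asm/memory.h */
        #ifdef CONFIG_KASAN
        #ifdef CONFIG_KASAN_EXTRA
        #define KASAN_THREAD_SHIFT      2       /* 64k stacks */
        #else
        #define KASAN_THREAD_SHIFT      1       /* 32k stacks */
        #endif
        #else
        #define KASAN_THREAD_SHIFT      0
        #endif
        
        #define MIN_THREAD_SHIFT        (14 + KASAN_THREAD_SHIFT)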
      
      [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715#c23
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8f183b33
    • D
      selftests: do not macro-expand failed assertion expressions · 656257cf
      Dmitry V. Levin committed
      [ Upstream commit b708a3cc9600390ccaa2b68a88087dd265154b2b ]
      
      I've stumbled over the current macro-expand behaviour of the test
      harness:
      
      $ gcc -Wall -xc - <<'__EOF__'
      TEST(macro) {
      	int status = 0;
      	ASSERT_TRUE(WIFSIGNALED(status));
      }
      TEST_HARNESS_MAIN
      __EOF__
      $ ./a.out
      [==========] Running 1 tests from 1 test cases.
      [ RUN      ] global.macro
      <stdin>:4:global.macro:Expected 0 (0) != (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) (0)
      global.macro: Test terminated by assertion
      [     FAIL ] global.macro
      [==========] 0 / 1 tests passed.
      [  FAILED  ]
      
      With this change the output of the same test looks much more
      comprehensible:
      
      [==========] Running 1 tests from 1 test cases.
      [ RUN      ] global.macro
      <stdin>:4:global.macro:Expected 0 (0) != WIFSIGNALED(status) (0)
      global.macro: Test terminated by assertion
      [     FAIL ] global.macro
      [==========] 0 / 1 tests passed.
      [  FAILED  ]
      
      The issue is very similar to the bug fixed in glibc assert(3)
      three years ago:
      https://sourceware.org/bugzilla/show_bug.cgi?id=18604
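      
      The underlying preprocessor behaviour is easy to demonstrate on its
      own: stringizing an argument directly preserves its spelling, while
      routing it through a helper macro expands it first. A standalone
      example:
      
        #include <stdio.h>
        
        #define STR_DIRECT(x)   #x            /* no expansion */
        #define STR_EXPANDED(x) STR_DIRECT(x) /* expands x first */
        
        #define CHECK(s) (((signed char)(((s) & 0x7f) + 1) >> 1) > 0)
        
        int main(void)
        {
                printf("%s\n", STR_DIRECT(CHECK(status)));   /* terse */
                printf("%s\n", STR_EXPANDED(CHECK(status))); /* the mess */
                return 0;
        }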
      
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will Drewry <wad@chromium.org>
      Cc: linux-kselftest@vger.kernel.org
      Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      656257cf
    • B
      scsi: target/core: Make sure that target_wait_for_sess_cmds() waits long enough · 3ad8148c
      Bart Van Assche committed
      [ Upstream commit ad669505c4e9db9af9faeb5c51aa399326a80d91 ]
      
      A session must only be released after all code that accesses the session
      structure has finished. Make sure that this is the case by introducing a
      new command counter per session that is only decremented after the
      .release_cmd() callback has finished. This patch fixes the following crash:
      
      BUG: KASAN: use-after-free in do_raw_spin_lock+0x1c/0x130
      Read of size 4 at addr ffff8801534b16e4 by task rmdir/14805
      CPU: 16 PID: 14805 Comm: rmdir Not tainted 4.18.0-rc2-dbg+ #5
      Call Trace:
      dump_stack+0xa4/0xf5
      print_address_description+0x6f/0x270
      kasan_report+0x241/0x360
      __asan_load4+0x78/0x80
      do_raw_spin_lock+0x1c/0x130
      _raw_spin_lock_irqsave+0x52/0x60
      srpt_set_ch_state+0x27/0x70 [ib_srpt]
      srpt_disconnect_ch+0x1b/0xc0 [ib_srpt]
      srpt_close_session+0xa8/0x260 [ib_srpt]
      target_shutdown_sessions+0x170/0x180 [target_core_mod]
      core_tpg_del_initiator_node_acl+0xf3/0x200 [target_core_mod]
      target_fabric_nacl_base_release+0x25/0x30 [target_core_mod]
      config_item_release+0x9c/0x110 [configfs]
      config_item_put+0x26/0x30 [configfs]
      configfs_rmdir+0x3b8/0x510 [configfs]
      vfs_rmdir+0xb3/0x1e0
      do_rmdir+0x262/0x2c0
      do_syscall_64+0x77/0x230
      entry_SYSCALL_64_after_hwframe+0x49/0xbe
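      
      A sketch of the counting scheme described above (names close to, but
      not verbatim, the patch): each command pins the session, and the pin
      is dropped only after the release callback has completed.
      
        static void target_release_cmd_and_unpin(struct se_cmd *cmd)
        {
                struct se_session *sess = cmd->se_sess;
        
                cmd->se_tfo->release_cmd(cmd);    /* may still touch sess */
                percpu_ref_put(&sess->cmd_count); /* drop the ref only now */
        }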
      
      Cc: Nicholas Bellinger <nab@linux-iscsi.org>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Disseldorp <ddiss@suse.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3ad8148c
    • D
      scsi: target: use consistent left-aligned ASCII INQUIRY data · 25d3546a
      David Disseldorp committed
      [ Upstream commit 0de263577de5d5e052be5f4f93334e63cc8a7f0b ]
      
      spc5r17.pdf specifies:
      
        4.3.1 ASCII data field requirements
        ASCII data fields shall contain only ASCII printable characters (i.e.,
        code values 20h to 7Eh) and may be terminated with one or more ASCII null
        (00h) characters.  ASCII data fields described as being left-aligned
        shall have any unused bytes at the end of the field (i.e., highest
        offset) and the unused bytes shall be filled with ASCII space characters
        (20h).
      
      LIO currently space-pads the T10 VENDOR IDENTIFICATION and PRODUCT
      IDENTIFICATION fields in the standard INQUIRY data. However, the PRODUCT
      REVISION LEVEL field in the standard INQUIRY data as well as the T10 VENDOR
      IDENTIFICATION field in the INQUIRY Device Identification VPD Page are
      zero-terminated/zero-padded.
      
      Fix this inconsistency by using space-padding for all of the above fields.
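      
      A small self-contained helper shows the required padding behaviour
      (the function name is mine; LIO's actual code differs):
      
        #include <stdio.h>
        #include <string.h>
        
        /* left-align an ASCII field and fill the tail with spaces (20h),
         * per the SPC ASCII data field requirements */
        static void inquiry_pad_field(unsigned char *dst, size_t len,
                                      const char *src)
        {
                size_t n = strnlen(src, len);
        
                memcpy(dst, src, n);
                memset(dst + n, 0x20, len - n); /* space-pad, not zero-pad */
        }
        
        int main(void)
        {
                unsigned char vendor[8];
        
                inquiry_pad_field(vendor, sizeof(vendor), "LIO-ORG");
                printf("[%.8s]\n", vendor); /* prints [LIO-ORG ] */
                return 0;
        }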
      Signed-off-by: David Disseldorp <ddiss@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bryant G. Ly <bly@catalogicsoftware.com>
      Reviewed-by: Lee Duncan <lduncan@suse.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      25d3546a
    • Y
      net: call sk_dst_reset when set SO_DONTROUTE · 50deccdc
      yupeng committed
      [ Upstream commit 0fbe82e628c817e292ff588cd5847fc935e025f2 ]
      
      After SO_DONTROUTE is set to 1, the IP layer should not route packets
      if the destination IP address is not in link scope. But if the socket
      has cached the dst_entry, such packets are still routed until the
      sk_dst_cache expires. So we should clean the sk_dst_cache when a user
      sets the SO_DONTROUTE option. Below are server/client python scripts
      which can reproduce this issue (a sketch of the kernel-side fix
      follows the scripts):
      
      server side code:
      
      ==========================================================================
      import socket
      import struct
      import time
      
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.bind(('0.0.0.0', 9000))
      s.listen(1)
      sock, addr = s.accept()
      sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1))
      while True:
          sock.send(b'foo')
          time.sleep(1)
      ==========================================================================
      
      client side code:
      ==========================================================================
      import socket
      import time
      
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.connect(('server_address', 9000))
      while True:
          data = s.recv(1024)
          print(data)
      ==========================================================================
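      
      The kernel-side change is small: in sock_setsockopt(), the
      SO_DONTROUTE case drops the cached route along with setting the flag
      (a sketch of the upstream one-liner):
      
        case SO_DONTROUTE:
                sock_valbool_flag(sk, SOCK_LOCALROUTE, valbool);
                sk_dst_reset(sk); /* added: drop the cached dst_entry */
                break;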
      Signed-off-by: yupeng <yupeng0921@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      50deccdc
    • G
      staging: erofs: fix use-after-free of on-stack `z_erofs_vle_unzip_io' · fd4c7fe1
      Gao Xiang committed
      [ Upstream commit 848bd9acdcd00c164b42b14aacec242949ecd471 ]
      
      The root cause is the race as follows:
       Thread #0                         Thread #1
      
       z_erofs_vle_unzip_kickoff         z_erofs_submit_and_unzip
      
                                          struct z_erofs_vle_unzip_io io[]
         atomic_add_return()
                                          wait_event()
                                          [end of function]
         wake_up()
      
      Fix it by taking the waitqueue lock between atomic_add_return and
      wake_up to close the race.
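      
      A sketch of the fixed kickoff path, simplified to the on-stack
      (foreground) case and close to, but not verbatim, the erofs code:
      holding the waitqueue lock across both steps pins the on-stack
      structure until the wakeup is done.
      
        static void z_erofs_vle_unzip_kickoff(struct z_erofs_vle_unzip_io *io,
                                              int bios)
        {
                unsigned long flags;
        
                spin_lock_irqsave(&io->u.wait.lock, flags);
                if (!atomic_add_return(bios, &io->pending_bios))
                        wake_up_locked(&io->u.wait);
                spin_unlock_irqrestore(&io->u.wait.lock, flags);
        }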
      
      kernel message:
      
      Unable to handle kernel paging request at virtual address 97f7052caa1303dc
      ...
      Workqueue: kverityd verity_work
      task: ffffffe32bcb8000 task.stack: ffffffe3298a0000
      PC is at __wake_up_common+0x48/0xa8
      LR is at __wake_up+0x3c/0x58
      ...
      Call trace:
      ...
      [<ffffff94a08ff648>] __wake_up_common+0x48/0xa8
      [<ffffff94a08ff8b8>] __wake_up+0x3c/0x58
      [<ffffff94a0c11b60>] z_erofs_vle_unzip_kickoff+0x40/0x64
      [<ffffff94a0c118e4>] z_erofs_vle_read_endio+0x94/0x134
      [<ffffff94a0c83c9c>] bio_endio+0xe4/0xf8
      [<ffffff94a1076540>] dec_pending+0x134/0x32c
      [<ffffff94a1076f28>] clone_endio+0x90/0xf4
      [<ffffff94a0c83c9c>] bio_endio+0xe4/0xf8
      [<ffffff94a1095024>] verity_work+0x210/0x368
      [<ffffff94a08c4150>] process_one_work+0x188/0x4b4
      [<ffffff94a08c45bc>] worker_thread+0x140/0x458
      [<ffffff94a08cad48>] kthread+0xec/0x108
      [<ffffff94a0883ab4>] ret_from_fork+0x10/0x1c
      Code: d1006273 54000260 f9400804 b9400019 (b85fc081)
      ---[ end trace be9dde154f677cd1 ]---
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      fd4c7fe1
    • V
      media: venus: core: Set dma maximum segment size · 38be2cba
      Vivek Gautam committed
      [ Upstream commit de2563bce7a157f5296bab94f3843d7d64fb14b4 ]
      
      Turning on CONFIG_DMA_API_DEBUG_SG results in the following error:
      
      [  460.308650] ------------[ cut here ]------------
      [  460.313490] qcom-venus aa00000.video-codec: DMA-API: mapping sg segment longer than device claims to support [len=4194304] [max=65536]
      [  460.326017] WARNING: CPU: 3 PID: 3555 at src/kernel/dma/debug.c:1301 debug_dma_map_sg+0x174/0x254
      [  460.338888] Modules linked in: venus_dec venus_enc videobuf2_dma_sg videobuf2_memops hci_uart btqca bluetooth venus_core v4l2_mem2mem videobuf2_v4l2 videobuf2_common ath10k_snoc ath10k_core ath lzo lzo_compress zramjoydev
      [  460.375811] CPU: 3 PID: 3555 Comm: V4L2DecoderThre Tainted: G        W         4.19.1 #82
      [  460.384223] Hardware name: Google Cheza (rev1) (DT)
      [  460.389251] pstate: 60400009 (nZCv daif +PAN -UAO)
      [  460.394191] pc : debug_dma_map_sg+0x174/0x254
      [  460.398680] lr : debug_dma_map_sg+0x174/0x254
      [  460.403162] sp : ffffff80200c37d0
      [  460.406583] x29: ffffff80200c3830 x28: 0000000000010000
      [  460.412056] x27: 00000000ffffffff x26: ffffffc0f785ea80
      [  460.417532] x25: 0000000000000000 x24: ffffffc0f4ea1290
      [  460.423001] x23: ffffffc09e700300 x22: ffffffc0f4ea1290
      [  460.428470] x21: ffffff8009037000 x20: 0000000000000001
      [  460.433936] x19: ffffff80091b0000 x18: 0000000000000000
      [  460.439411] x17: 0000000000000000 x16: 000000000000f251
      [  460.444885] x15: 0000000000000006 x14: 0720072007200720
      [  460.450354] x13: ffffff800af536e0 x12: 0000000000000000
      [  460.455822] x11: 0000000000000000 x10: 0000000000000000
      [  460.461288] x9 : 537944d9c6c48d00 x8 : 537944d9c6c48d00
      [  460.466758] x7 : 0000000000000000 x6 : ffffffc0f8d98f80
      [  460.472230] x5 : 0000000000000000 x4 : 0000000000000000
      [  460.477703] x3 : 000000000000008a x2 : ffffffc0fdb13948
      [  460.483170] x1 : ffffffc0fdb0b0b0 x0 : 000000000000007a
      [  460.488640] Call trace:
      [  460.491165]  debug_dma_map_sg+0x174/0x254
      [  460.495307]  vb2_dma_sg_alloc+0x260/0x2dc [videobuf2_dma_sg]
      [  460.501150]  __vb2_queue_alloc+0x164/0x374 [videobuf2_common]
      [  460.507076]  vb2_core_reqbufs+0xfc/0x23c [videobuf2_common]
      [  460.512815]  vb2_reqbufs+0x44/0x5c [videobuf2_v4l2]
      [  460.517853]  v4l2_m2m_reqbufs+0x44/0x78 [v4l2_mem2mem]
      [  460.523144]  v4l2_m2m_ioctl_reqbufs+0x1c/0x28 [v4l2_mem2mem]
      [  460.528976]  v4l_reqbufs+0x30/0x40
      [  460.532480]  __video_do_ioctl+0x36c/0x454
      [  460.536610]  video_usercopy+0x25c/0x51c
      [  460.540572]  video_ioctl2+0x38/0x48
      [  460.544176]  v4l2_ioctl+0x60/0x74
      [  460.547602]  do_video_ioctl+0x948/0x3520
      [  460.551648]  v4l2_compat_ioctl32+0x60/0x98
      [  460.555872]  __arm64_compat_sys_ioctl+0x134/0x20c
      [  460.560718]  el0_svc_common+0x9c/0xe4
      [  460.564498]  el0_svc_compat_handler+0x2c/0x38
      [  460.568982]  el0_svc_compat+0x8/0x18
      [  460.572672] ---[ end trace ce209b87b2f3af88 ]---
      
      From the above warning one would deduce that the sg segment will
      overflow the device's capacity. In reality, the hardware can
      accommodate larger sg segments, so initialize the maximum segment size
      properly to weed out this warning.
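      
      The fix boils down to one call in the core probe path, declaring what
      the hardware can actually handle (a sketch; the 32-bit bound is an
      assumption based on the driver's addressing, not quoted from the
      patch):
      
        /* in venus_probe(), before any vb2 buffers are mapped */
        ret = dma_set_max_seg_size(dev, DMA_BIT_MASK(32));
        if (ret)
                return ret;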
      
      Based on a similar patch sent by Sean Paul for mdss:
      https://patchwork.kernel.org/patch/10671457/
      Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
      Acked-by: Stanimir Varbanov <stanimir.varbanov@linaro.org>
      Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      38be2cba
    • Y
      ASoC: use dma_ops of parent device for acp_audio_dma · 9df6861a
      Yu Zhao committed
      [ Upstream commit 23aa128bb28d9da69bb1bdb2b70e50128857884a ]
      
      The AMD platform device acp_audio_dma can only be created by the
      parent PCI device driver (drivers/gpu/drm/amd/amdgpu/amdgpu_acp.c).
      Pass the struct device of the parent to
      snd_pcm_lib_preallocate_pages() so dma_alloc_coherent() can use the
      correct dma_ops. Otherwise, it will use the default dma_ops, which is
      nommu_dma_ops on x86_64, even when the IOMMU is enabled and set to
      non-passthrough mode.
      
      Though the platform device inherits some dma-related fields during its
      creation in mfd_add_device(), we can't simply pass its struct device
      to snd_pcm_lib_preallocate_pages() because dma_ops is not among the
      inherited fields. Even if it were, drivers/iommu/amd_iommu.c would
      ignore it because get_device_id() doesn't handle platform devices.
      
      This change shouldn't give us any trouble even if the parent's struct
      device becomes null or represents some non-PCI device in the future,
      because get_dma_ops() correctly handles a null struct device and falls
      back to the default dma_ops if the struct device doesn't have one set.
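      
      The gist of the change, sketched with assumed surrounding names (rtd,
      component, MIN_BUFFER, MAX_BUFFER are placeholders): pass the parent's
      struct device where the preallocation helper expects the DMA device.
      
        struct device *parent = component->dev->parent; /* the PCI device */
        
        snd_pcm_lib_preallocate_pages_for_all(rtd->pcm, SNDRV_DMA_TYPE_DEV,
                                              parent, MIN_BUFFER, MAX_BUFFER);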
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9df6861a
    • N
      media: firewire: Fix app_info parameter type in avc_ca{,_app}_info · 597a09e0
      Nathan Chancellor committed
      [ Upstream commit b2e9a4eda11fd2cb1e6714e9ad3f455c402568ff ]
      
      Clang warns:
      
      drivers/media/firewire/firedtv-avc.c:999:45: warning: implicit
      conversion from 'int' to 'char' changes value from 159 to -97
      [-Wconstant-conversion]
              app_info[0] = (EN50221_TAG_APP_INFO >> 16) & 0xff;
                          ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
      drivers/media/firewire/firedtv-avc.c:1000:45: warning: implicit
      conversion from 'int' to 'char' changes value from 128 to -128
      [-Wconstant-conversion]
              app_info[1] = (EN50221_TAG_APP_INFO >>  8) & 0xff;
                          ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
      drivers/media/firewire/firedtv-avc.c:1040:44: warning: implicit
      conversion from 'int' to 'char' changes value from 159 to -97
      [-Wconstant-conversion]
              app_info[0] = (EN50221_TAG_CA_INFO >> 16) & 0xff;
                          ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
      drivers/media/firewire/firedtv-avc.c:1041:44: warning: implicit
      conversion from 'int' to 'char' changes value from 128 to -128
      [-Wconstant-conversion]
              app_info[1] = (EN50221_TAG_CA_INFO >>  8) & 0xff;
                          ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
      4 warnings generated.
      
      Change app_info's type to unsigned char to match the type of the
      member msg in struct ca_msg, which is the only thing passed into the
      app_info parameter in this function.
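      
      The warning is easy to reproduce in isolation, since plain char may be
      signed; a standalone example:
      
        #include <stdio.h>
        
        int main(void)
        {
                char          s = (0x9F8001 >> 16) & 0xff; /* 159 -> -97 */
                unsigned char u = (0x9F8001 >> 16) & 0xff; /* 159, fits  */
        
                printf("%d %u\n", s, u); /* -97 159 where char is signed */
                return 0;
        }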
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/105
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      597a09e0
    • B
      powerpc/pseries/cpuidle: Fix preempt warning · 3049cdc2
      Breno Leitao committed
      [ Upstream commit 2b038cbc5fcf12a7ee1cc9bfd5da1e46dacdee87 ]
      
      When booting a pseries kernel with PREEMPT enabled, it dumps the
      following warning:
      
         BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
         caller is pseries_processor_idle_init+0x5c/0x22c
         CPU: 13 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc3-00090-g12201a0128bc-dirty #828
         Call Trace:
         [c000000429437ab0] [c0000000009c8878] dump_stack+0xec/0x164 (unreliable)
         [c000000429437b00] [c0000000005f2f24] check_preemption_disabled+0x154/0x160
         [c000000429437b90] [c000000000cab8e8] pseries_processor_idle_init+0x5c/0x22c
         [c000000429437c10] [c000000000010ed4] do_one_initcall+0x64/0x300
         [c000000429437ce0] [c000000000c54500] kernel_init_freeable+0x3f0/0x500
         [c000000429437db0] [c0000000000112dc] kernel_init+0x2c/0x160
         [c000000429437e20] [c00000000000c1d0] ret_from_kernel_thread+0x5c/0x6c
      
      This happens because the code calls get_lppaca(), which calls
      get_paca(), which in turn checks whether preemption is disabled
      through check_preemption_disabled().
      
      Preemption should be disabled because the per-CPU variable may make no
      sense if there is a preemption (and a CPU switch) between reading the
      per-CPU data and using it.
      
      In this device driver specifically, it is not a problem, because this
      code just needs access to one lppaca struct, and it does not matter
      whether it is the current CPU's lppaca struct or not (i.e. when there
      is a preemption and a CPU migration).
      
      That said, the most appropriate fix seems to be avoiding the
      debug_smp_processor_id() call in get_paca() rather than calling
      preempt_disable() before get_paca().
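      
      In code, that direction looks roughly like this (a sketch; the use
      site is assumed, and local_paca bypasses the debug check performed by
      get_paca()):
      
        /* reading through local_paca skips debug_smp_processor_id();
         * any CPU's lppaca is acceptable for this query */
        if (lppaca_shared_proc(local_paca->lppaca_ptr))
                cpuidle_state_table = shared_states; /* assumed use site */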
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3049cdc2
    • B
      powerpc/xmon: Fix invocation inside lock region · 115a0d66
      Breno Leitao committed
      [ Upstream commit 8d4a862276a9c30a269d368d324fb56529e6d5fd ]
      
      Currently xmon needs to take devtree_lock (through rtas_token())
      during its invocation (at crash time). If there is a crash while
      devtree_lock is held, xmon tries to take the lock but spins forever
      and never gets into the interactive debugger, as in the following
      case:
      
      	int *ptr = NULL;
      	raw_spin_lock_irqsave(&devtree_lock, flags);
      	*ptr = 0xdeadbeef;
      
      This patch avoids calling rtas_token(), and thus trying to take the
      same lock, at crash time. The new mechanism gets the token at
      initialization time (xmon_init()) and just consumes it at crash time.
      
      This allows xmon to be invoked regardless of whether devtree_lock is
      held.
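      
      A sketch of that shape (the token name is a placeholder, not taken
      from the patch):
      
        static int xmon_rtas_token = RTAS_UNKNOWN_SERVICE;
        
        void __init xmon_init(void)
        {
                /* takes devtree_lock here, at boot, not at crash time */
                xmon_rtas_token = rtas_token("set-indicator"); /* placeholder */
        }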
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      115a0d66