1. 08 Dec 2019, 4 commits
  2. 05 Dec 2019, 3 commits
  3. 29 Nov 2019, 3 commits
    • alios: mm, memcg: fix possible soft lockup in try_charge · 1f6142a0
      Committed by Xu Yu
      When events such as direct reclaim and OOM occur intensively, a soft
      lockup is very likely to happen on instances with one vCPU and kernel
      preemption disabled.
      
      The example soft lockup is as follows.
      
      [  160.555984] watchdog: BUG: soft lockup - CPU#0 stuck for 112s! [malloc:2188]
      [  160.557975] Modules linked in: button
      [  160.559495] CPU: 0 PID: 2188 Comm: malloc Not tainted 4.19.57-15.457.al7.x86_64 #1
      [  160.561546] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
      [  160.563707] RIP: 0010:shrink_node+0x1ae/0x450
      [  160.565391] Code: 00 00 00 49 8b 4f 20 ba 01 00 00 00 4c 8b 74 24 10 4d 8b 47 28 49 8b 77 10 48 2b 4c 24 08 41 8b 7f 1c 4d8
      [  160.570747] RSP: 0000:ffff9d0ec07a3b58 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
      [  160.572889] RAX: ffff982ab6014330 RBX: ffff982ab6014000 RCX: 0000000000000000
      [  160.574992] RDX: 0000000000000001 RSI: ffff982ab6014000 RDI: ffff982ab6014000
      [  160.577106] RBP: ffff982afffb6000 R08: 0000000000000000 R09: ffff982ab6014000
      [  160.579219] R10: 0000000000000004 R11: 0000000000aaaaaa R12: 0000000000000000
      [  160.581326] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9d0ec07a3c50
      [  160.583450] FS:  00007f8b414f7740(0000) GS:ffff982afda00000(0000) knlGS:0000000000000000
      [  160.585704] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  160.587662] CR2: 00007f8adb800010 CR3: 000000007ac9e001 CR4: 00000000003606b0
      [  160.589835] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  160.591971] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  160.594133] Call Trace:
      [  160.595602]  do_try_to_free_pages+0xcc/0x390
      [  160.597356]  try_to_free_mem_cgroup_pages+0xf9/0x1d0
      [  160.599198]  ? out_of_memory+0xb5/0x4a0
      [  160.600882]  try_charge+0x244/0x750
      [  160.602522]  ? __pagevec_lru_add_fn+0x1d0/0x330
      [  160.604310]  mem_cgroup_try_charge+0xb4/0x1d0
      [  160.606085]  mem_cgroup_try_charge_delay+0x1c/0x40
      [  160.607892]  do_anonymous_page+0xf7/0x540
      [  160.609574]  __handle_mm_fault+0x665/0xa00
      [  160.611233]  ? __switch_to_asm+0x35/0x70
      [  160.612838]  handle_mm_fault+0x122/0x1e0
      [  160.614407]  __do_page_fault+0x1b7/0x470
      [  160.615962]  do_page_fault+0x32/0x140
      [  160.617474]  ? async_page_fault+0x8/0x30
      [  160.619012]  async_page_fault+0x1e/0x30
      [  160.620526] RIP: 0033:0x40068e
      
      Fix it by adding cond_resched() in try_charge(), just before the goto
      retry after OOM_SUCCESS, in order to let the OOM killer free some memory
      first.
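      
      A minimal sketch of where the call lands, assuming the 4.19-era layout of
      try_charge() in mm/memcontrol.c (the surrounding code in this tree may
      differ slightly):
      
      	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
      		       get_order(nr_pages * PAGE_SIZE));
      	switch (oom_status) {
      	case OOM_SUCCESS:
      		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
      		oomed = true;
      		/* new: yield the single vCPU so the OOM killer can make progress */
      		cond_resched();
      		goto retry;
      	case OOM_FAILED:
      		goto force;
      	default:
      		goto nomem;
      	}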
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      1f6142a0
    • iocost: add ioc_gq stat · 86068d0f
      Committed by Jiufei Xue
      Add a stat file to monitor ioc_gq statistics.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      86068d0f
    • dm thin: wakeup worker only when deferred bios exist · 6a2b7b88
      Committed by Jeffle Xu
      commit d256d796279de0bdc227ff4daef565aa7e80c898 upstream.
      
      Single thread fio test (read, bs=4k, ioengine=libaio, iodepth=128,
      numjobs=1) over dm-thin device has poor performance versus bare nvme
      device.
      
      Further investigation with perf indicates that queue_work_on() consumes
      over 20% CPU time when doing IO over dm-thin device. The call stack is
      as follows.
      
      - 40.57% thin_map
          + 22.07% queue_work_on
          + 9.95% dm_thin_find_block
          + 2.80% cell_defer_no_holder
            1.91% inc_all_io_entry.isra.33.part.34
          + 1.78% bio_detain.isra.35
      
      In cell_defer_no_holder(), wake_worker() is always called, no matter
      whether the tc->deferred_bio_list list is empty or not. In the single
      thread IO model, this list is most likely empty. So skip waking up the
      worker thread if the tc->deferred_bio_list list is empty.
      
      Single thread IO performance improves from 448 MiB/s to 646 MiB/s (+44%)
      once the needless wake_worker() calls are properly skipped.
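      
      A sketch of the resulting helper, along the lines of the upstream change
      (field and function names as they appear in drivers/md/dm-thin.c; details
      may differ in this tree):
      
      	static void cell_defer_no_holder(struct thin_c *tc,
      					 struct dm_bio_prison_cell *cell)
      	{
      		struct pool *pool = tc->pool;
      		unsigned long flags;
      		int has_work;
      
      		spin_lock_irqsave(&tc->lock, flags);
      		cell_release_no_holder(pool, cell, &tc->deferred_bio_list);
      		has_work = !bio_list_empty(&tc->deferred_bio_list);
      		spin_unlock_irqrestore(&tc->lock, flags);
      
      		/* only kick the worker if there is actually deferred work */
      		if (has_work)
      			wake_worker(pool);
      	}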
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      6a2b7b88
  4. 28 Nov 2019, 15 commits
    • alios: blk-throttle: limit bios to fix amount of pages entering writeback prematurely · 06a67773
      Committed by Xiaoguang Wang
      Currently in blk_throtl_bio(), if a bio exceeds its throtl_grp's bps or
      iops limit, the bio is queued on the throtl_grp's throtl_service_queue.
      The mm subsystem will then keep submitting more pages even though the
      underlying device cannot handle these io requests, and this also makes a
      large amount of pages enter writeback prematurely; if some process later
      writes some of these pages, it will have to wait for a long time.
      
      I have done some tests: one process does buffered writes to a 1GB file
      with its blkcg max bps limit set to 10MB/s, and I observe this:
      	#cat /proc/meminfo  | grep -i back
      	Writeback:        900024 kB
      	WritebackTmp:          0 kB
      
      This Writeback value is just too big: many bios have been queued in the
      throtl_grp's throtl_service_queue, and if a process tries to write the
      page of the last bio in this queue, it will call
      wait_on_page_writeback(page), which must wait for all the previous bios
      to finish and can take a long time. We have also seen 120s hung task
      warnings on our servers.
      
       INFO: task kworker/u128:0:30072 blocked for more than 120 seconds.
             Tainted: G            E 4.9.147-013.ali3000_015_test.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       kworker/u128:0  D    0 30072      2 0x00000000
       Workqueue: writeback wb_workfn (flush-8:16)
        ffff882ddd066b40 0000000000000000 ffff882e5cad3400 ffff882fbe959e80
        ffff882fa50b1a00 ffffc9003a5a3768 ffffffff8173325d ffffc9003a5a3780
        00ff882e5cad3400 ffff882fbe959e80 ffffffff81360b49 ffff882e5cad3400
       Call Trace:
        [<ffffffff8173325d>] ? __schedule+0x23d/0x6d0
        [<ffffffff81360b49>] ? alloc_request_struct+0x19/0x20
        [<ffffffff81733726>] schedule+0x36/0x80
        [<ffffffff81736c56>] schedule_timeout+0x206/0x4b0
        [<ffffffff81036c69>] ? sched_clock+0x9/0x10
        [<ffffffff81363073>] ? get_request+0x403/0x810
        [<ffffffff8110ca10>] ? ktime_get+0x40/0xb0
        [<ffffffff81732f8a>] io_schedule_timeout+0xda/0x170
        [<ffffffff81733f90>] ? bit_wait+0x60/0x60
        [<ffffffff81733fab>] bit_wait_io+0x1b/0x60
        [<ffffffff81733b28>] __wait_on_bit+0x58/0x90
        [<ffffffff811b0d91>] ? find_get_pages_tag+0x161/0x2e0
        [<ffffffff811aff62>] wait_on_page_bit+0x82/0xa0
        [<ffffffff810d47f0>] ? wake_atomic_t_function+0x60/0x60
        [<ffffffffa02fc181>] mpage_prepare_extent_to_map+0x2d1/0x310 [ext4]
        [<ffffffff8121ff65>] ? kmem_cache_alloc+0x185/0x1a0
        [<ffffffffa0305a2f>] ? ext4_init_io_end+0x1f/0x40 [ext4]
        [<ffffffffa0300294>] ext4_writepages+0x404/0xef0 [ext4]
        [<ffffffff81508c64>] ? scsi_init_io+0x44/0x200
        [<ffffffff81398a0f>] ? fprop_fraction_percpu+0x2f/0x80
        [<ffffffff811c139e>] do_writepages+0x1e/0x30
        [<ffffffff8127c0f5>] __writeback_single_inode+0x45/0x320
        [<ffffffff8127c942>] writeback_sb_inodes+0x272/0x600
        [<ffffffff8127cf6b>] wb_writeback+0x10b/0x300
        [<ffffffff8127d884>] wb_workfn+0xb4/0x380
        [<ffffffff810b85e9>] ? try_to_wake_up+0x59/0x3e0
        [<ffffffff810a5759>] process_one_work+0x189/0x420
        [<ffffffff810a5a3e>] worker_thread+0x4e/0x4b0
        [<ffffffff810a59f0>] ? process_one_work+0x420/0x420
        [<ffffffff810ac026>] kthread+0xe6/0x100
        [<ffffffff810abf40>] ? kthread_park+0x60/0x60
        [<ffffffff81738499>] ret_from_fork+0x39/0x50
      
      To fix this issue, we can simply limit the number of bios queued on a
      throtl_service_queue. Currently we derive the limit from the throtl_grp's
      bps or iops limit; if it is still exceeded, we just sleep for a while.
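      
      A purely illustrative sketch of the idea, not the actual patch: the
      bytes_queued field and the limit policy below are made up for
      illustration, while tg_bps_limit()/tg_iops_limit() and nr_queued[] do
      exist in block/blk-throttle.c:
      
      	/* hypothetical: true when this group's queue already holds roughly
      	 * one second's worth of bios/bytes at its configured limits */
      	static bool tg_backlog_full(struct throtl_grp *tg, int rw)
      	{
      		struct throtl_service_queue *sq = &tg->service_queue;
      
      		return sq->nr_queued[rw] >= tg_iops_limit(tg, rw) ||
      		       tg->bytes_queued[rw] >= tg_bps_limit(tg, rw);
      	}
      
      In blk_throtl_bio(), the submitter would then sleep and re-check instead
      of queueing another bio while tg_backlog_full() returns true.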
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      06a67773
    • alios: block-throttle: add counters for completed io · 6bb5d410
      Committed by Jiufei Xue
      We already have counters for wait_time and service_time, but none for
      completed ios, so the average latency cannot be calculated.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      6bb5d410
    • alios: block-throttle: code cleanup · 4a7c0663
      Committed by Jiufei Xue
      This patch cleans up the code, since the seq_show handlers for the tg
      counters are the same. No functional changes.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      4a7c0663
    • alios: blk-throttle: add throttled io/bytes counter · eeb720d8
      Committed by Joseph Qi
      Add 2 interfaces to show io throttle statistics:
        blkio.throttle.total_io_queued
        blkio.throttle.total_bytes_queued
      
      These interfaces are used for monitoring throttled io/bytes and for
      analyzing whether an observed delay is related to io throttling.
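      
      A minimal userspace reader for such monitoring, assuming a cgroup v1
      blkio hierarchy mounted at /sys/fs/cgroup/blkio (adjust the path to the
      cgroup being watched):
      
      	#include <stdio.h>
      
      	/* dump one blkio.throttle counter file verbatim */
      	static void dump(const char *path)
      	{
      		char line[256];
      		FILE *f = fopen(path, "r");
      
      		if (!f) {
      			perror(path);
      			return;
      		}
      		while (fgets(line, sizeof(line), f))
      			printf("%s: %s", path, line);
      		fclose(f);
      	}
      
      	int main(void)
      	{
      		dump("/sys/fs/cgroup/blkio/blkio.throttle.total_io_queued");
      		dump("/sys/fs/cgroup/blkio/blkio.throttle.total_bytes_queued");
      		return 0;
      	}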
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      eeb720d8
    • alios: blk-throttle: fix tg NULL pointer dereference · 4667e926
      Committed by Joseph Qi
      The io throttle stats code calls blkg_get at the beginning of throttling
      and blkg_put in the newly introduced bi_tg_end_io. This causes the blkg
      to be freed if end_io is called twice, as dm-thin does: it saves the
      original end_io first, calls its own end_io, and then the saved end_io.
      After that, accessing the blkg is invalid and finally triggers a BUG:
      
      [ 4417.235048] BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0
      [ 4417.236475] IP: [<ffffffff812e7c71>] throtl_update_dispatch_stats+0x21/0xb0
      [ 4417.237865] PGD 98395067 PUD 362e1067 PMD 0
      [ 4417.239232] Oops: 0000 [#1] SMP
      ......
      [ 4417.274070] Call Trace:
      [ 4417.275407]  [<ffffffff812ea93d>] blk_throtl_bio+0xfd/0x630
      [ 4417.276760]  [<ffffffff810b3613>] ? wake_up_process+0x23/0x40
      [ 4417.278079]  [<ffffffff81094c04>] ? wake_up_worker+0x24/0x30
      [ 4417.279387]  [<ffffffff81095772>] ? insert_work+0x62/0xa0
      [ 4417.280697]  [<ffffffff8116c2c7>] ? mempool_free_slab+0x17/0x20
      [ 4417.282019]  [<ffffffff8116c6c9>] ? mempool_free+0x49/0x90
      [ 4417.283326]  [<ffffffff812c9acf>] generic_make_request_checks+0x16f/0x360
      [ 4417.284637]  [<ffffffffa0340d97>] ? thin_map+0x227/0x2c0 [dm_thin_pool]
      [ 4417.285951]  [<ffffffff812c9ce7>] generic_make_request+0x27/0x130
      [ 4417.287240]  [<ffffffffa0230b3d>] __map_bio+0xad/0x100 [dm_mod]
      [ 4417.288503]  [<ffffffffa023257e>] __clone_and_map_data_bio+0x15e/0x240 [dm_mod]
      [ 4417.289778]  [<ffffffffa02329ea>] __split_and_process_bio+0x38a/0x500 [dm_mod]
      [ 4417.291062]  [<ffffffffa0232c91>] dm_make_request+0x131/0x1a0 [dm_mod]
      [ 4417.292344]  [<ffffffff812c9da2>] generic_make_request+0xe2/0x130
      [ 4417.293626]  [<ffffffff812c9e61>] submit_bio+0x71/0x150
      [ 4417.294909]  [<ffffffff8121ab1d>] ? bio_alloc_bioset+0x20d/0x360
      [ 4417.296195]  [<ffffffff81215acb>] _submit_bh+0x14b/0x220
      [ 4417.297484]  [<ffffffff81215bb0>] submit_bh+0x10/0x20
      [ 4417.298744]  [<ffffffffa016d8d8>] jbd2_journal_commit_transaction+0x6c8/0x19a0 [jbd2]
      [ 4417.300014]  [<ffffffff810135b8>] ? __switch_to+0xf8/0x4c0
      [ 4417.301268]  [<ffffffffa01731e9>] kjournald2+0xc9/0x270 [jbd2]
      [ 4417.302524]  [<ffffffff810a0fd0>] ? wake_up_atomic_t+0x30/0x30
      [ 4417.303753]  [<ffffffffa0173120>] ? commit_timeout+0x10/0x10 [jbd2]
      [ 4417.304950]  [<ffffffff8109ffef>] kthread+0xcf/0xe0
      [ 4417.306107]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
      [ 4417.307255]  [<ffffffff81647f18>] ret_from_fork+0x58/0x90
      [ 4417.308349]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
      ......
      
      Now we introduce a new bio flag BIO_THROTL_STATED to make sure
      blkg_get/blkg_put are only called once for the same bio.
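      
      A hedged sketch of the intended pattern (BIO_THROTL_STATED and
      bi_tg_end_io come from this commit; the bi_tg field used below to reach
      the throtl_grp is assumed here for illustration only):
      
      	/* where throttling begins (e.g. in blk_throtl_bio()): take the
      	 * stats reference at most once per bio */
      	if (!bio_flagged(bio, BIO_THROTL_STATED)) {
      		blkg_get(tg_to_blkg(tg));
      		bio_set_flag(bio, BIO_THROTL_STATED);
      	}
      
      	/* completion side: a second invocation (e.g. dm-thin calling the
      	 * saved end_io again) finds the flag cleared and does nothing */
      	static void bi_tg_end_io(struct bio *bio)
      	{
      		struct throtl_grp *tg = bio->bi_tg;	/* bi_tg is assumed */
      
      		if (!bio_flagged(bio, BIO_THROTL_STATED))
      			return;
      		bio_clear_flag(bio, BIO_THROTL_STATED);
      		blkg_put(tg_to_blkg(tg));
      	}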
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      4667e926
    • alios: blk-throttle: support io delay stats · 65e6966a
      Committed by Joseph Qi
      Add blkio.throttle.io_service_time and blkio.throttle.io_wait_time to
      get per-cgroup io delay statistics.
      io_service_time represents the time spent from the end of throttling to
      io completion, while io_wait_time represents the time spent on the
      throttle queue.
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      65e6966a
    • alios: nvme-pci: Disable discard zero-out functionality on Intel's P3600 NVMe disk drive · d79c6eda
      Committed by Wenwei Tao
      We found a huge performance loss on the particular Intel disk drive below
      when the discard zero-out functionality is enabled on it. The issue was
      found with an ext4 filesystem mounted on the disk drive while running
      regular FIO tests. With the functionality disabled, we no longer observe
      the performance loss.
      
      81:00.0 Non-Volatile memory controller: Intel Corporation \
                   PCIe Data Center SSD (rev 01)
      
      Therefore, disable the discard zero-out functionality on the above disk
      drive in order to regain the high performance that the NVMe disk driver
      is supposed to provide.
      
      Differential Revision: https://aone.alibaba-inc.com/code/D377540
      Signed-off-by: Wenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      d79c6eda
    • alios: memcg: Point wb to root memcg/blkcg when offlining to avoid zombie · 8514dbc7
      Committed by Xunlei Pang
      After turning off memcg kmem charging, we still suffer from various
      zombie memcg problems in production environments because of non-zero
      reference counts held by both page caches and the per-memcg writeback
      structure (bdi_writeback takes a reference).
      
      Even after we reclaim all the page caches of a zombie memcg, it still
      can't be dropped due to its bdi_writeback.
      
      The bdi_writeback is further referenced by the inodes of files, so the
      memcg can't be truly released until those inodes are destroyed, which is
      quite unlikely in the short term.
      
      When a memcg is being offlined, switch its bdi_writeback to the root and
      call css_put to formally release it. We've tested this in the production
      environment and it yields a pretty good effect.
      
      Ditto for wb_blkcg_offline().
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      8514dbc7
    • alios: block: add counter to track io request's d2c time · 07232d74
      Committed by Xiaoguang Wang
      The await value reported by the iostat tool is not good enough: it is
      somewhat coarse and cannot show a request's latency on the device
      driver's side.
      
      Here we add a new counter to track an io request's d2c time; with this
      patch we can also easily extend iostat to show this value.
      
      Note:
      I have checked how iostat is implemented; it only reads the fields it
      needs, so iostat won't be affected by this change, and neither will tsar.
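      
      A hedged sketch of where such accounting could hook in, assuming the
      4.19-era blk_account_io_done(); the d2c_nsecs field name is hypothetical
      and only illustrates the dispatch-to-complete measurement:
      
      	/* inside blk_account_io_done(req, now), next to the existing stats: */
      	part_stat_add(cpu, part, nsecs[sgrp], now - req->start_time_ns);
      	/* new (hypothetical field): dispatch-to-complete (d2c) time */
      	part_stat_add(cpu, part, d2c_nsecs[sgrp], now - req->io_start_time_ns);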
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      07232d74
    • alios: fuse: add sysfs api to flush processing queue requests · 2292db66
      Committed by Ma Jie Yue
      The failover of the fuse userspace daemon reuses the existing fuse conn,
      without unmounting it, during the daemon crash-and-recovery procedure.
      But some requests might still be in process in the daemon, before a reply
      has been sent out, when the crash happens. This will stall the
      application, since it will never get the reply after the failover.
      
      We add a sysfs api to flush these requests after the daemon crash and
      before recovery. It is easy to reproduce the issue in the fuse userspace
      daemon: just exit after receiving a request and before sending the reply
      back. The application will hang in some read/write operation until
      echo 1 > /sys/fs/fuse/connection/xxx/flush is run. The flush operation
      makes the io fail and returns the error to the application.
      Signed-off-by: Ma Jie Yue <majieyue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      2292db66
    • alios: jbd2: add proc entry to control whether doing buffer copy-out · ac452d09
      Committed by Xiaoguang Wang
      When jbd2 tries to get write access to a buffer that is under writeback
      with the BH_Shadow flag set, jbd2 will wait until this buffer has been
      written to disk, but sometimes the wait can be very long, especially when
      the disk is almost full.
      
      Here we add a proc entry "force-copy"; if its value is non-zero, jbd2
      will always do the meta buffer copy-out, so we can eliminate the
      unnecessary waiting time here and reduce the long tail latency of
      buffered writes.
      
      I constructed the test case below:
      
      $cat offline.fio
      ; fio-rand-RW.job for fiotest
      
      [global]
      name=fio-rand-RW
      filename=fio-rand-RW
      rw=randrw
      rwmixread=60
      rwmixwrite=40
      bs=4K
      direct=0
      numjobs=4
      time_based=1
      runtime=900
      
      [file1]
      size=60G
      ioengine=sync
      iodepth=16
      
      $cat online.fio
      ; fio-seq-write.job for fiotest
      
      [global]
      name=fio-seq-write
      filename=fio-seq-write
      rw=write
      bs=256K
      direct=0
      numjobs=1
      time_based=1
      runtime=60
      
      [file1]
      rate=50m
      size=10G
      ioengine=sync
      iodepth=16
      
      With this patch:
      $cat /proc/fs/jbd2/sda5-8/force_copy
      0
      
      the online fio job almost always gets a long tail latency like this:
      
      Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=17855: Thu Nov 15 09:45:57 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=135, max=4086.6k, avg=867.21, stdev=50338.22
           lat (usec): min=139, max=4086.6k, avg=871.16, stdev=50338.22
          clat percentiles (usec):
           |  1.00th=[    141],  5.00th=[    143], 10.00th=[    145],
           | 20.00th=[    147], 30.00th=[    147], 40.00th=[    149],
           | 50.00th=[    149], 60.00th=[    151], 70.00th=[    153],
           | 80.00th=[    155], 90.00th=[    159], 95.00th=[    163],
           | 99.00th=[    255], 99.50th=[    273], 99.90th=[    429],
           | 99.95th=[    441], 99.99th=[3640656]
      
      $cat /proc/fs/jbd2/sda5-8/force_copy
      1
      
      online fio latency is much better.
      
      Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=8084: Thu Nov 15 09:31:15 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=137, max=545, avg=151.35, stdev=16.22
           lat (usec): min=140, max=548, avg=155.31, stdev=16.65
          clat percentiles (usec):
           |  1.00th=[  143],  5.00th=[  145], 10.00th=[  145], 20.00th=[  147],
           | 30.00th=[  147], 40.00th=[  147], 50.00th=[  149], 60.00th=[  149],
           | 70.00th=[  151], 80.00th=[  155], 90.00th=[  157], 95.00th=[  161],
           | 99.00th=[  239], 99.50th=[  269], 99.90th=[  420], 99.95th=[  429],
           | 99.99th=[  537]
      
      As for the cost: because we always need to copy the meta buffer, this
      consumes a little extra cpu time and some memory (at most 32MB for a
      128MB journal size).
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      ac452d09
    • alios: ext4: don't submit unwritten extent while holding active jbd2 handle · a8366d32
      Committed by Xiaoguang Wang
      In ext4_writepages(), for every iteration, mpage_prepare_extent_to_map()
      will try to find 2048 pages to map and normally one bio can contain 256
      pages at most. If we really found 2048 pages to map, there will be 4 bios
      and 4 ext4_io_submit() calls which are called both in ext4_writepages()
      and mpage_map_and_submit_extent().
      
      But note that in mpage_map_and_submit_extent() we hold a valid jbd2
      handle. When dioread_nolock is enabled and the extent is unwritten, the
      jbd2 commit thread will wait for this handle to finish, and hence for the
      unwritten extent to be written to disk. This introduces unnecessary stall
      time, which gets even longer when the writeback operation is io
      throttled, so we need to fix this.
      
      For this case, we accumulate the bios in ext4_io_submit's io_bio and only
      submit them after dropping the jbd2 handle.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      a8366d32
    • alios: fs,ext4: remove projid limit when create hard link · 28df06b3
      Committed by zhangliguang
      This is a temporary workaround to avoid the limitation when creating a
      hard link across two projids.
      Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      28df06b3
    • alios: jbd2: add new "stats" proc file · 3550da0c
      Committed by Xiaoguang Wang
      /proc/fs/jbd2/${device}/info only shows average statistics over jbd2's
      whole life cycle; it cannot show jbd2 info for a specified time interval,
      and sometimes that capability is very useful for trouble shooting. For
      example, we cannot see how rs_locked and rs_flushing grow within a
      specified time interval, yet these two indexes can explain some of an
      app's behaviour.
      
      Here we add a new "stats" proc file similar to /proc/diskstats, so we can
      implement a simple tool, jbd2_stats, which displays detailed jbd2 info
      over a specified time interval. For example (5s interval):
      
      [lege@localhost ~]$ cat /proc/fs/jbd2/vdb1-8/stats
      51 30 8192 0 1 241616 0 0 22 0 47158 891 942 1000 1000
      
      [lege@localhost ~]$ gcc -o jbd2_stat jbd2_stat.c ; ./jbd2_stat
      
      Device              tid     trans   handles    locked  flushing   logging
      vdb1-8             1861       158       359     13.00      0.00      2.00
      
      Device              tid     trans   handles    locked  flushing   logging
      vdb1-8             1974       113       389     26.00      0.00      5.00
      
      Device              tid     trans   handles    locked  flushing   logging
      vdb1-8             2188       214       308     10.00      0.00      7.00
      
      Device              tid     trans   handles    locked  flushing   logging
      vdb1-8             2344       156       332     19.00      0.00      4.00
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      3550da0c
    • alios: jbd2: create jbd2-ckpt thread for journal checkpoint · c31b17e5
      Committed by Joseph Qi
      Do the jbd2 checkpoint in a dedicated kernel thread, so that
      checkpointing is not subject to io throttle control.
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
      Reviewed-by: Baoyou Xie <baoyou.xie@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      c31b17e5
  5. 21 Nov 2019, 1 commit
  6. 20 Nov 2019, 14 commits
    • mm: thp: handle page cache THP correctly in PageTransCompoundMap · c9f8166a
      Committed by Yang Shi
      commit 169226f7e0d275c1879551f37484ef6683579a5c upstream
      
      We have a usecase to use tmpfs as QEMU memory backend and we would like
      to take the advantage of THP as well.  But, our test shows the EPT is
      not PMD mapped even though the underlying THP are PMD mapped on host.
      The number showed by /sys/kernel/debug/kvm/largepage is much less than
      the number of PMD mapped shmem pages as the below:
      
        7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   579584 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        12
      
      And some benchmarks do worse than with anonymous THPs.
      
      By digging into the code we figured out that commit 127393fb ("mm:
      thp: kvm: fix memory corruption in KVM with THP enabled") checks if
      there is a single PTE mapping on the page for anonymous THP when setting
      up EPT map.  But the _mapcount < 0 check doesn't work for page cache THP
      since every subpage of page cache THP would get _mapcount inc'ed once it
      is PMD mapped, so PageTransCompoundMap() always returns false for page
      cache THP.  This would prevent KVM from setting up PMD mapped EPT entry.
      
      So we need to handle page cache THP correctly.  However, when a page
      cache THP's PMD gets split, the kernel just removes the map instead of
      setting up a PTE map the way anonymous THP does.  Before KVM calls
      get_user_pages(), the subpages may get PTE mapped even though it is still
      a THP, since the page cache THP may be mapped by other processes in the
      meantime.
      
      So check its _mapcount and whether the THP has any PTE mapping or not.
      Although this may report some false negatives (PTE mapped by other
      processes), it does not look trivial to make this fully accurate.
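      
      The resulting check ends up looking roughly like the following (a sketch
      of the upstream helper in include/linux/page-flags.h, quoted from memory,
      so treat the details as approximate):
      
      	static inline int PageTransCompoundMap(struct page *page)
      	{
      		struct page *head;
      
      		if (!PageTransCompound(page))
      			return 0;
      
      		if (PageAnon(page))
      			return atomic_read(&page->_mapcount) < 0;
      
      		head = compound_head(page);
      		/* File THP is PMD mapped and not double mapped */
      		return atomic_read(&page->_mapcount) ==
      		       atomic_read(compound_mapcount_ptr(head));
      	}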
      
      With this fix /sys/kernel/debug/kvm/largepage would show reasonable
      pages are PMD mapped by EPT as the below:
      
        7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   557056 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        271
      
      And the benchmarks are as same as anonymous THPs.
      
      [yang.shi@linux.alibaba.com: v4]
        Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: dd78fedd ("rmap: support file thp")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reported-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Tested-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>    [4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      c9f8166a
    • ICX: perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x register · c154e184
      Committed by Yunying Sun
      commit 3b238a64c3009fed36eaea1af629d9377759d87d upstream.
      
      The Intel SDM states that bit 13 of Icelake's MSR_OFFCORE_RSP_x
      register is valid, and used for counting hardware generated prefetches
      of L3 cache. Update the bitmask to allow bit 13.
      
      Before:
      $ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3
       Performance counter stats for 'sleep 3':
         <not supported>      cpu/event=0xb7,umask=0x1,config1=0x1bfff/u
      
      After:
      $ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3
       Performance counter stats for 'sleep 3':
                   9,293      cpu/event=0xb7,umask=0x1,config1=0x1bfff/u
      Signed-off-by: Yunying Sun <yunying.sun@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: alexander.shishkin@linux.intel.com
      Cc: bp@alien8.de
      Cc: hpa@zytor.com
      Cc: jolsa@redhat.com
      Cc: namhyung@kernel.org
      Link: https://lkml.kernel.org/r/20190724082932.12833-1-yunying.sun@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Lin Wang <lin.x.wang@intel.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      c154e184
    • ICX: perf/x86/intel: Add more Icelake CPUIDs · e4ed6f52
      Committed by Kan Liang
      commit faaeff98666c24376cebd0b106504d05a36881d1 upstream.
      
      Add new model number for Icelake desktop and server to perf.
      
      The data source encoding for Icelake server is the same as Skylake
      server.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@alien8.de
      Cc: qiuxu.zhuo@intel.com
      Cc: rui.zhang@intel.com
      Cc: tony.luck@intel.com
      Link: https://lkml.kernel.org/r/20190603134122.13853-2-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Lin Wang <lin.x.wang@intel.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      e4ed6f52
    • resource/docs: Complete kernel-doc style function documentation · 1de9c7c3
      Committed by Borislav Petkov
      commit f26621e60b35369bca9228bc936dc723b3e421af upstream.
      
      Add the missing kernel-doc style function parameters documentation.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: linux-tip-commits@vger.kernel.org
      Cc: rdunlap@infradead.org
      Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
      Link: http://lkml.kernel.org/r/20181105093307.GA12445@zn.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      [joseph: fix find_next_iomem_res() documentation]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
      1de9c7c3
    • resource/docs: Fix new kernel-doc warnings · 39cecf2f
      Committed by Randy Dunlap
      commit f75d651587f719a813ebbbfeee570e6570731d55 upstream.
      
      The first group of warnings is caused by a "/**" kernel-doc notation
      marker but the function comments are not in kernel-doc format.
      Also add another error return value here.
      
        ../kernel/resource.c:337: warning: Function parameter or member 'start' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'end' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'flags' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'desc' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'first_lvl' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'res' not described in 'find_next_iomem_res'
      
      Add the missing function parameter documentation for the other warnings:
      
        ../kernel/resource.c:409: warning: Function parameter or member 'arg' not described in 'walk_iomem_res_desc'
        ../kernel/resource.c:409: warning: Function parameter or member 'func' not described in 'walk_iomem_res_desc'
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
      Link: http://lkml.kernel.org/r/dda2e4d8-bedd-3167-20fe-8c7d2d35b354@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      [joseph: fix find_next_iomem_res() documentation]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
      39cecf2f
    • acpi/hmat: fix an uninitialized memory_target · ffd11878
      Committed by Qian Cai
      commit ab3a9f2ccc080d27873f76869c9a780be45e581e upstream.
      
      The commit 665ac7e92757 ("acpi/hmat: Register processor domain to its
      memory") introduced an uninitialized "struct memory_target" that could
      cause an incorrect branching.
      
      drivers/acpi/hmat/hmat.c:385:6: warning: variable 'target' is used
      uninitialized whenever 'if' condition is false
      [-Wsometimes-uninitialized]
              if (p->flags & ACPI_HMAT_MEMORY_PD_VALID) {
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/acpi/hmat/hmat.c:392:6: note: uninitialized use occurs here
              if (target && p->flags & ACPI_HMAT_PROCESSOR_PD_VALID) {
                  ^~~~~~
      drivers/acpi/hmat/hmat.c:385:2: note: remove the 'if' if its condition
      is always true
              if (p->flags & ACPI_HMAT_MEMORY_PD_VALID) {
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/acpi/hmat/hmat.c:369:30: note: initialize the variable 'target'
      to silence this warning
              struct memory_target *target;
                                          ^
                                           = NULL
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
      Fixes: 665ac7e92757 ("acpi/hmat: Register processor domain to its memory")
      Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
      ffd11878
    • ICX: EDAC, i10nm: Fix randconfig builds · ea396c30
      Committed by Tony Luck
      commit d6a9f7336d925364daca00557afa59a68e78b422 upstream.
      
      I10NM_EDAC depends on CONFIG_ACPI so make that dependency explicit.
      Reported-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20190205180200.26865-1-tony.luck@intel.com
      Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      ea396c30
    • tools x86 uapi asm: Sync the pt_regs.h copy with the kernel sources · fae5be70
      Committed by Arnaldo Carvalho de Melo
      commit 0ceb5499a8001e5ddac2c8bd7b45eb4c643469ad upstream.
      
      To get the changes in:
      
        878068ea270e ("perf/x86: Support outputting XMM registers")
      
      That will be used in a followup patch to allow users to ask for some or
      all of those registers to be collected in certain contexts.
      
      This silences the following perf build warning:
      
        Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/perf_regs.h' differs from latest version at 'arch/x86/include/uapi/asm/perf_regs.h'
        diff -u tools/arch/x86/include/uapi/asm/perf_regs.h arch/x86/include/uapi/asm/perf_regs.h
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/n/tip-6pjnnrzqt3x3n2cd6br3wk7k@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      fae5be70
    • device-dax: fix memory and resource leak if hotplug fails · d0219d42
      Committed by Pavel Tatashin
      commit 31e4ca92a7dd4cdebd7fe1456b3b0b6ace9a816f upstream
      
      Patch series ""Hotremove" persistent memory", v6.
      
      Recently, adding a persistent memory to be used like a regular RAM was
      added to Linux.  This work extends this functionality to also allow hot
      removing persistent memory.
      
      We (Microsoft) have an important use case for this functionality.
      
      The requirement is for physical machines with small amount of RAM (~8G)
      to be able to reboot in a very short period of time (<1s).  Yet, there
      is a userland state that is expensive to recreate (~2G).
      
      The solution is to boot machines with 2G preserved for persistent
      memory.
      
      Copy the state, and hotadd the persistent memory so machine still has
      all 8G available for runtime.  Before reboot, offline and hotremove
      device-dax 2G, copy the memory that is needed to be preserved to pmem0
      device, and reboot.
      
      The series of operations look like this:
      
      1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
         and free ramdisk.
      2. Convert raw pmem0 to devdax
         ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
      3. Hotadd to System RAM
         echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
         echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
         echo online_movable > /sys/devices/system/memoryXXX/state
      4. Before reboot hotremove device-dax memory from System RAM
         echo offline > /sys/devices/system/memoryXXX/state
         echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
      5. Create raw pmem0 device
         ndctl create-namespace --mode raw  -e namespace0.0 -f
      6. Copy the state that was stored by apps to ramdisk to pmem device
      7. Do kexec reboot or reboot through firmware if firmware does not
         zero memory in pmem0 region (These machines have only regular
         volatile memory). So to have pmem0 device either memmap kernel
         parameter is used, or devices nodes in dtb are specified.
      
      This patch (of 3):
      
      When add_memory() fails, the resource and the memory should be freed.
      
      Link: http://lkml.kernel.org/r/20190517215438.6487-2-pasha.tatashin@soleen.com
      Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      d0219d42
    • device-dax: Add a 'resource' attribute · 651aadb6
      Committed by Vishal Verma
      commit 40cdc60ac16a42eb4e013f84d0e7aa1d6ee060d3 upstream
      
      device-dax based devices were missing a 'resource' attribute to indicate
      the physical address range contributed by the device in question. This
      information is desirable to userspace tooling that may want to use the
      dax device as system-ram, and wants to selectively hotplug and online
      the memory blocks associated with a given device.
      
      Without this, the tooling would have to parse /proc/iomem for the memory
      ranges contributed by dax devices, which can be a workaround, but it is
      far easier to provide this information in the sysfs hierarchy.
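      
      The new attribute boils down to a read-only sysfs show routine along
      these lines (a sketch of the upstream change in drivers/dax; the exact
      file and field layout may differ in this tree):
      
      	static ssize_t resource_show(struct device *dev,
      			struct device_attribute *attr, char *buf)
      	{
      		struct dev_dax *dev_dax = to_dev_dax(dev);
      
      		/* start of the physical address range backing this device */
      		return sprintf(buf, "%#llx\n",
      			       (unsigned long long)dev_dax->region->res.start);
      	}
      	static DEVICE_ATTR_RO(resource);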
      
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      651aadb6
    • drivers/dax: Allow to include DEV_DAX_PMEM as builtin · c83b6a3a
      Committed by Aneesh Kumar K.V
      commit 67476656febd7ec5f1fe1aeec3c441fcf53b1e45 upstream
      
      This moves the dependency to DEV_DAX_PMEM_COMPAT such that the compat
      support is only allowed when DEV_DAX_PMEM is built as a module.
      
      This allows the new code to be tested easily in an emulation setup where
      we often build things without module support.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 730926c3b099 ("device-dax: Add /sys/class/dax backwards compatibility")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      c83b6a3a
    • device-dax: "Hotplug" persistent memory for use like normal RAM · 370de25f
      Committed by Dave Hansen
      commit c221c0b0308fd01d9fb33a16f64d2fd95f8830a4 upstream
      
      This is intended for use with NVDIMMs that are physically persistent
      (physically like flash) so that they can be used as a cost-effective
      RAM replacement.  Intel Optane DC persistent memory is one
      implementation of this kind of NVDIMM.
      
      Currently, a persistent memory region is "owned" by a device driver,
      either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
      allow applications to explicitly use persistent memory, generally
      by being modified to use special, new libraries. (DIMM-based
      persistent memory hardware/software is described in great detail
      here: Documentation/nvdimm/nvdimm.txt).
      
      However, this limits persistent memory use to applications which
      *have* been modified.  To make it more broadly usable, this driver
      "hotplugs" memory into the kernel, to be managed and used just like
      normal RAM would be.
      
      To make this work, management software must remove the device from
      being controlled by the "Device DAX" infrastructure:
      
      	echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
      
      and then tell the new driver that it can bind to the device:
      
      	echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
      
      After this, there will be a number of new memory sections visible
      in sysfs that can be onlined, or that may get onlined by existing
      udev-initiated memory hotplug rules.
      
      This rebinding procedure is currently a one-way trip.  Once memory
      is bound to "kmem", it's there permanently and can not be
      unbound and assigned back to device_dax.
      
      The kmem driver will never bind to a dax device unless the device
      is *explicitly* bound to the driver.  There are two reasons for
      this: One, since it is a one-way trip, it can not be undone if
      bound incorrectly.  Two, the kmem driver destroys data on the
      device.  Think of if you had good data on a pmem device.  It
      would be catastrophic if you compile-in "kmem", but leave out
      the "device_dax" driver.  kmem would take over the device and
      write volatile data all over your good data.
      
      This inherits any existing NUMA information for the newly-added
      memory from the persistent memory device that came from the
      firmware.  On Intel platforms, the firmware has guarantees that
      require each socket's persistent memory to be in a separate
      memory-only NUMA node.  That means that this patch is not expected
      to create NUMA nodes, but will simply hotplug memory into existing
      nodes.
      
      Because NUMA nodes are created, the existing NUMA APIs and tools
      are sufficient to create policies for applications or memory areas
      to have affinity for or an aversion to using this memory.
      
      There is currently some metadata at the beginning of pmem regions.
      The section-size memory hotplug restrictions, plus this small
      reserved area can cause the "loss" of a section or two of capacity.
      This should be fixable in follow-on patches.  But, as a first step,
      losing 256MB of memory (worst case) out of hundreds of gigabytes
      is a good tradeoff vs. the required code to fix this up precisely.
      This calculation is also the reason we export
      memory_block_size_bytes().
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      370de25f
    • mm/resource: Let walk_system_ram_range() search child resources · 3ed62604
      Committed by Dave Hansen
      commit 2b539aefe9e48e3908cff02699aa63a8b9bd268e upstream
      
      In the process of onlining memory, we use walk_system_ram_range()
      to find the actual RAM areas inside of the area being onlined.
      
      However, it currently only finds memory resources which are
      "top-level" iomem_resources.  Children are not currently
      searched which causes it to skip System RAM in areas like this
      (in the format of /proc/iomem):
      
      a0000000-bfffffff : Persistent Memory (legacy)
        a0000000-afffffff : System RAM
      
      Changing the true->false here allows children to be searched
      as well.  We need this because we add a new "System RAM"
      resource underneath the "persistent memory" resource when
      we use persistent memory in a volatile mode.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      3ed62604
    • mm/memory-hotplug: Allow memory resources to be children · 71009d66
      Committed by Dave Hansen
      commit 2794129e902d8eb69413d884dc6404b8716ed9ed upstream
      
      The mm/resource.c code is used to manage the physical address
      space.  The current resource configuration can be viewed in
      /proc/iomem.  An example of this is at the bottom of this
      description.
      
      The nvdimm subsystem "owns" the physical address resources which
      map to persistent memory and has resources inserted for them as
      "Persistent Memory".  The best way to repurpose this for volatile
      use is to leave the existing resource in place, but add a "System
      RAM" resource underneath it. This clearly communicates the
      ownership relationship of this memory.
      
      The request_resource_conflict() API only deals with the
      top-level resources.  Replace it with __request_region() which
      will search for !IORESOURCE_BUSY areas lower in the resource
      tree than the top level.
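      
      A sketch of what this looks like in register_memory_resource() in
      mm/memory_hotplug.c, quoted from memory of the upstream change, so the
      details may vary:
      
      	static struct resource *register_memory_resource(u64 start, u64 size)
      	{
      		struct resource *res;
      		unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
      		char *resource_name = "System RAM";
      
      		/* __request_region() can nest the new "System RAM" resource
      		 * under a non-busy parent such as "Persistent Memory (legacy)" */
      		res = __request_region(&iomem_resource, start, size,
      				       resource_name, flags);
      		if (!res) {
      			pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
      				 start, start + size);
      			return ERR_PTR(-EEXIST);
      		}
      		return res;
      	}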
      
      We *could* also simply truncate the existing top-level
      "Persistent Memory" resource and take over the released address
      space.  But, this means that if we ever decide to hot-unplug the
      "RAM" and give it back, we need to recreate the original setup,
      which may mean going back to the BIOS tables.
      
      This should have no real effect on the existing collision
      detection because the areas that truly conflict should be marked
      IORESOURCE_BUSY.
      
      00000000-00000fff : Reserved
      00001000-0009fbff : System RAM
      0009fc00-0009ffff : Reserved
      000a0000-000bffff : PCI Bus 0000:00
      000c0000-000c97ff : Video ROM
      000c9800-000ca5ff : Adapter ROM
      000f0000-000fffff : Reserved
        000f0000-000fffff : System ROM
      00100000-9fffffff : System RAM
        01000000-01e071d0 : Kernel code
        01e071d1-027dfdff : Kernel data
        02dc6000-0305dfff : Kernel bss
      a0000000-afffffff : Persistent Memory (legacy)
        a0000000-a7ffffff : System RAM
      b0000000-bffdffff : System RAM
      bffe0000-bfffffff : Reserved
      c0000000-febfffff : PCI Bus 0000:00
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      71009d66