1. 17 1月, 2020 1 次提交
    • X
      alinux: hotfix: Add Cloud Kernel hotfix enhancement · f94e5b1a
      Xunlei Pang 提交于
      We reserve some fields beforehand for core structures prone to change,
      so that we won't hurt when extra fields have to be added for hotfix,
      thereby inceasing the success rate, we even can hot add features with
      this enhancement.
      
      After reserving, normally cache does not matter as the reserved fields
      (usually at tail) are not accessed at all.
      
      Currently involve the following structures:
          MM:
          struct zone
          struct pglist_data
          struct mm_struct
          struct vm_area_struct
          struct mem_cgroup
          struct writeback_control
      
          Block:
          struct gendisk
          struct backing_dev_info
          struct bio
          struct queue_limits
          struct request_queue
          struct blkcg
          struct blkcg_policy
          struct blk_mq_hw_ctx
          struct blk_mq_tag_set
          struct blk_mq_queue_data
          struct blk_mq_ops
          struct elevator_mq_ops
          struct inode
          struct dentry
          struct address_space
          struct block_device
          struct hd_struct
          struct bio_set
      
          Network:
          struct sk_buff
          struct sock
          struct net_device_ops
          struct xt_target
          struct dst_entry
          struct dst_ops
          struct fib_rule
      
          Scheduler:
          struct task_struct
          struct cfs_rq
          struct rq
          struct sched_statistics
          struct sched_entity
          struct signal_struct
          struct task_group
          struct cpuacct
      
          cgroup:
          struct cgroup_root
          struct cgroup_subsys_state
          struct cgroup_subsys
          struct css_set
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      [ caspar: use SPDX-License-Identifier ]
      Signed-off-by: NCaspar Zhang <caspar@linux.alibaba.com>
      f94e5b1a
  2. 15 1月, 2020 1 次提交
    • X
      alinux: blk-throttle: limit bios to fix amount of pages entering writeback prematurely · 0fd4aa6d
      Xiaoguang Wang 提交于
      Currently in blk_throtl_bio(), if one bio exceeds its throtl_grp's bps
      or iops limit, this bio will be queued throtl_grp's throtl_service_queue,
      then obviously mm subsys will submit more pages, even underlying device
      can not handle these io requests, also this will make large amount of pages
      entering writeback prematurely, later if some process writes some of these
      pages, it will wait for long time.
      
      I have done some tests: one process does buffered writes on a 1GB file,
      and make this process's blkcg max bps limit be 10MB/s, I observe this:
      	#cat /proc/meminfo  | grep -i back
      	Writeback:        900024 kB
      	WritebackTmp:          0 kB
      
      I think this Writeback value is just too big, indeed many bios have been
      queued in throtl_grp's throtl_service_queue, if one process try to write
      the last bio's page in this queue, it will call wait_on_page_writeback(page),
      which must wait the previous bios to finish and will take long time, we
      have also see 120s hung task warning in our server.
      
       INFO: task kworker/u128:0:30072 blocked for more than 120 seconds.
             Tainted: G            E 4.9.147-013.ali3000_015_test.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       kworker/u128:0  D    0 30072      2 0x00000000
       Workqueue: writeback wb_workfn (flush-8:16)
        ffff882ddd066b40 0000000000000000 ffff882e5cad3400 ffff882fbe959e80
        ffff882fa50b1a00 ffffc9003a5a3768 ffffffff8173325d ffffc9003a5a3780
        00ff882e5cad3400 ffff882fbe959e80 ffffffff81360b49 ffff882e5cad3400
       Call Trace:
        [<ffffffff8173325d>] ? __schedule+0x23d/0x6d0
        [<ffffffff81360b49>] ? alloc_request_struct+0x19/0x20
        [<ffffffff81733726>] schedule+0x36/0x80
        [<ffffffff81736c56>] schedule_timeout+0x206/0x4b0
        [<ffffffff81036c69>] ? sched_clock+0x9/0x10
        [<ffffffff81363073>] ? get_request+0x403/0x810
        [<ffffffff8110ca10>] ? ktime_get+0x40/0xb0
        [<ffffffff81732f8a>] io_schedule_timeout+0xda/0x170
        [<ffffffff81733f90>] ? bit_wait+0x60/0x60
        [<ffffffff81733fab>] bit_wait_io+0x1b/0x60
        [<ffffffff81733b28>] __wait_on_bit+0x58/0x90
        [<ffffffff811b0d91>] ? find_get_pages_tag+0x161/0x2e0
        [<ffffffff811aff62>] wait_on_page_bit+0x82/0xa0
        [<ffffffff810d47f0>] ? wake_atomic_t_function+0x60/0x60
        [<ffffffffa02fc181>] mpage_prepare_extent_to_map+0x2d1/0x310 [ext4]
        [<ffffffff8121ff65>] ? kmem_cache_alloc+0x185/0x1a0
        [<ffffffffa0305a2f>] ? ext4_init_io_end+0x1f/0x40 [ext4]
        [<ffffffffa0300294>] ext4_writepages+0x404/0xef0 [ext4]
        [<ffffffff81508c64>] ? scsi_init_io+0x44/0x200
        [<ffffffff81398a0f>] ? fprop_fraction_percpu+0x2f/0x80
        [<ffffffff811c139e>] do_writepages+0x1e/0x30
        [<ffffffff8127c0f5>] __writeback_single_inode+0x45/0x320
        [<ffffffff8127c942>] writeback_sb_inodes+0x272/0x600
        [<ffffffff8127cf6b>] wb_writeback+0x10b/0x300
        [<ffffffff8127d884>] wb_workfn+0xb4/0x380
        [<ffffffff810b85e9>] ? try_to_wake_up+0x59/0x3e0
        [<ffffffff810a5759>] process_one_work+0x189/0x420
        [<ffffffff810a5a3e>] worker_thread+0x4e/0x4b0
        [<ffffffff810a59f0>] ? process_one_work+0x420/0x420
        [<ffffffff810ac026>] kthread+0xe6/0x100
        [<ffffffff810abf40>] ? kthread_park+0x60/0x60
        [<ffffffff81738499>] ret_from_fork+0x39/0x50
      
      To fix this issue, we can simply limit throtl_service_queue's max queued
      bios, currently we limit it to throtl_grp's bps_limit or iops limit, if it
      still exteeds, we just sleep for a while.
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      0fd4aa6d
  3. 27 12月, 2019 2 次提交
  4. 01 9月, 2018 2 次提交
    • D
      blkcg: delay blkg destruction until after writeback has finished · 59b57717
      Dennis Zhou (Facebook) 提交于
      Currently, blkcg destruction relies on a sequence of events:
        1. Destruction starts. blkcg_css_offline() is called and blkgs
           release their reference to the blkcg. This immediately destroys
           the cgwbs (writeback).
        2. With blkgs giving up their reference, the blkcg ref count should
           become zero and eventually call blkcg_css_free() which finally
           frees the blkcg.
      
      Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
      and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
      on the completion of all writeback associated with the blkcg. A count of
      the number of cgwbs is maintained and once that goes to zero, blkg
      destruction can follow. This should prevent premature blkg destruction
      related to writeback.
      
      The new process for blkcg cleanup is as follows:
        1. Destruction starts. blkcg_css_offline() is called which offlines
           writeback. Blkg destruction is delayed on the cgwb_refcnt count to
           avoid punting potentially large amounts of outstanding writeback
           to root while maintaining any ongoing policies. Here, the base
           cgwb_refcnt is put back.
        2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
           and handles destruction of blkgs. This is where the css reference
           held by each blkg is released.
        3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
           This finally frees the blkg.
      
      It seems in the past blk-throttle didn't do the most understandable
      things with taking data from a blkg while associating with current. So,
      the simplification and unification of what blk-throttle is doing caused
      this.
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      59b57717
    • D
      Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()" · 6b065462
      Dennis Zhou (Facebook) 提交于
      This reverts commit 4c699480.
      
      Destroying blkgs is tricky because of the nature of the relationship. A
      blkg should go away when either a blkcg or a request_queue goes away.
      However, blkg's pin the blkcg to ensure they remain valid. To break this
      cycle, when a blkcg is offlined, blkgs put back their css ref. This
      eventually lets css_free() get called which frees the blkcg.
      
      The above commit (4c699480) breaks this order of events by trying to
      destroy blkgs in css_free(). As the blkgs still hold references to the
      blkcg, css_free() is never called.
      
      The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
      addressed in the following patch by delaying destruction of a blkg until
      all writeback associated with the blkcg has been finished.
      
      Fixes: 4c699480 ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6b065462
  5. 12 8月, 2018 1 次提交
  6. 09 8月, 2018 1 次提交
    • B
      blkcg: Introduce blkg_root_lookup() · 6bad9b21
      Bart Van Assche 提交于
      This new function will be used in a later patch to verify whether a
      queue has been dissociated from the cgroup controller before being
      released.
      Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Alexandru Moise <00moses.alexander00@gmail.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6bad9b21
  7. 30 7月, 2018 1 次提交
  8. 18 7月, 2018 1 次提交
  9. 09 7月, 2018 4 次提交
    • J
      blkcg: add generic throttling mechanism · d09d8df3
      Josef Bacik 提交于
      Since IO can be issued from literally anywhere it's almost impossible to
      do throttling without having some sort of adverse effect somewhere else
      in the system because of locking or other dependencies.  The best way to
      solve this is to do the throttling when we know we aren't holding any
      other kernel resources.  Do this by tracking throttling in a per-blkg
      basis, and if we require throttling flag the task that it needs to check
      before it returns to user space and possibly sleep there.
      
      This is to address the case where a process is doing work that is
      generating IO that can't be throttled, whether that is directly with a
      lot of REQ_META IO, or indirectly by allocating so much memory that it
      is swamping the disk with REQ_SWAP.  We can't use task_add_work as we
      don't want to induce a memory allocation in the IO path, so simply
      saving the request queue in the task and flagging it to do the
      notify_resume thing achieves the same result without the overhead of a
      memory allocation.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d09d8df3
    • J
      blk: introduce REQ_SWAP · 0d1e0c7c
      Josef Bacik 提交于
      Just like REQ_META, it's important to know the IO coming down is swap
      in order to guard against potential IO priority inversion issues with
      cgroups.  Add REQ_SWAP and use it for all swap IO, and add it to our
      bio_issue_as_root_blkg helper.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0d1e0c7c
    • J
      blk-cgroup: allow controllers to output their own stats · 903d23f0
      Josef Bacik 提交于
      blk-iolatency has a few stats that it would like to print out, and
      instead of adding a bunch of crap to the generic code just provide a
      helper so that controllers can add stuff to the stat line if they want
      to.
      
      Hide it behind a boot option since it changes the output of io.stat from
      normal, and these stats are only interesting to developers.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      903d23f0
    • J
      block: introduce bio_issue_as_root_blkg · c7c98fd3
      Josef Bacik 提交于
      Instead of forcing all file systems to get the right context on their
      bio's, simply check for REQ_META to see if we need to issue as the root
      blkg.  We don't want to force all bio's to have the root blkg associated
      with them if REQ_META is set, as some controllers (blk-iolatency) need
      to know who the originating cgroup is so it can backcharge them for the
      work they are doing.  This helper will make sure that the controllers do
      the proper thing wrt the IO priority and backcharging.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c7c98fd3
  10. 17 3月, 2018 1 次提交
    • J
      blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir() · 4c699480
      Joseph Qi 提交于
      We've triggered a WARNING in blk_throtl_bio() when throttling writeback
      io, which complains blkg->refcnt is already 0 when calling blkg_get(),
      and then kernel crashes with invalid page request.
      After investigating this issue, we've found it is caused by a race
      between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
      below:
      
      writeback kworker               cgroup_rmdir
                                        cgroup_destroy_locked
                                          kill_css
                                            css_killed_ref_fn
                                              css_killed_work_fn
                                                offline_css
                                                  blkcg_css_offline
        blkcg_bio_issue_check
          rcu_read_lock
          blkg_lookup
                                                    spin_trylock(q->queue_lock)
                                                    blkg_destroy
                                                    spin_unlock(q->queue_lock)
          blk_throtl_bio
          spin_lock_irq(q->queue_lock)
          ...
          spin_unlock_irq(q->queue_lock)
        rcu_read_unlock
      
      Since rcu can only prevent blkg from releasing when it is being used,
      the blkg->refcnt can be decreased to 0 during blkg_destroy() and schedule
      blkg release.
      Then trying to blkg_get() in blk_throtl_bio() will complains the WARNING.
      And then the corresponding blkg_put() will schedule blkg release again,
      which result in double free.
      This race is introduced by commit ae118896 ("blkcg: consolidate blkg
      creation in blkcg_bio_issue_check()"). Before this commit, it will
      lookup first and then try to lookup/create again with queue_lock. Since
      revive this logic is a bit drastic, so fix it by only offlining pd during
      blkcg_css_offline(), and move the rest destruction (especially
      blkg_put()) into blkcg_css_free(), which should be the right way as
      discussed.
      
      Fixes: ae118896 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()")
      Reported-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4c699480
  11. 16 1月, 2018 1 次提交
    • A
      blkcg: simplify statistic accumulation code · ddc21231
      Arnd Bergmann 提交于
      Some older compilers (gcc-4.4 through 4.6 in particular) struggle
      with the way that blkg_rwstat_read() returns a structure, leading
      to excessive stack usage and rather inefficient code:
      
      block/blk-cgroup.c: In function 'blkg_destroy':
      block/blk-cgroup.c:354:1: error: the frame size of 1296 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      block/cfq-iosched.c: In function 'cfqg_stats_add_aux':
      block/cfq-iosched.c:753:1: error: the frame size of 1928 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      block/bfq-cgroup.c: In function 'bfqg_stats_add_aux':
      block/bfq-cgroup.c:299:1: error: the frame size of 1928 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      
      I also notice that there is no point in using atomic accesses
      for the local variables, so storing the temporaries in simple 'u64'
      variables not only avoids the stack usage on older compilers but
      also improves the object code on modern versions.
      
      Fixes: e6269c44 ("blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it")
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ddc21231
  12. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  13. 26 9月, 2017 2 次提交
  14. 29 7月, 2017 1 次提交
  15. 21 6月, 2017 1 次提交
    • N
      percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch · 104b4e51
      Nikolay Borisov 提交于
      Currently, percpu_counter_add is a wrapper around __percpu_counter_add
      which is preempt safe due to explicit calls to preempt_disable.  Given
      how __ prefix is used in percpu related interfaces, the naming
      unfortunately creates the false sense that __percpu_counter_add is
      less safe than percpu_counter_add.  In terms of context-safety,
      they're equivalent.  The only difference is that the __ version takes
      a batch parameter.
      
      Make this a bit more explicit by just renaming __percpu_counter_add to
      percpu_counter_add_batch.
      
      This patch doesn't cause any functional changes.
      
      tj: Minor updates to patch description for clarity.  Cosmetic
          indentation updates.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: linux-mm@kvack.org
      Cc: "David S. Miller" <davem@davemloft.net>
      104b4e51
  16. 30 3月, 2017 1 次提交
  17. 29 3月, 2017 1 次提交
    • T
      blkcg: allocate struct blkcg_gq outside request queue spinlock · 7fc6b87a
      Tahsin Erdogan 提交于
      blkg_conf_prep() currently calls blkg_lookup_create() while holding
      request queue spinlock. This means allocating memory for struct
      blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
      failures in call paths like below:
      
        pcpu_alloc+0x68f/0x710
        __alloc_percpu_gfp+0xd/0x10
        __percpu_counter_init+0x55/0xc0
        cfq_pd_alloc+0x3b2/0x4e0
        blkg_alloc+0x187/0x230
        blkg_create+0x489/0x670
        blkg_lookup_create+0x9a/0x230
        blkg_conf_prep+0x1fb/0x240
        __cfqg_set_weight_device.isra.105+0x5c/0x180
        cfq_set_weight_on_dfl+0x69/0xc0
        cgroup_file_write+0x39/0x1c0
        kernfs_fop_write+0x13f/0x1d0
        __vfs_write+0x23/0x120
        vfs_write+0xc2/0x1f0
        SyS_write+0x44/0xb0
        entry_SYSCALL_64_fastpath+0x18/0xad
      
      In the code path above, percpu allocator cannot call vmalloc() due to
      queue spinlock.
      
      A failure in this call path gives grief to tools which are trying to
      configure io weights. We see occasional failures happen shortly after
      reboots even when system is not under any memory pressure. Machines
      with a lot of cpus are more vulnerable to this condition.
      
      Update blkg_create() function to temporarily drop the rcu and queue
      locks when it is allowed by gfp mask.
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7fc6b87a
  18. 01 11月, 2016 1 次提交
  19. 28 10月, 2016 1 次提交
    • C
      block: better op and flags encoding · ef295ecf
      Christoph Hellwig 提交于
      Now that we don't need the common flags to overflow outside the range
      of a 32-bit type we can encode them the same way for both the bio and
      request fields.  This in addition allows us to place the operation
      first (and make some room for more ops while we're at it) and to
      stop having to shift around the operation values.
      
      In addition this allows passing around only one value in the block layer
      instead of two (and eventuall also in the file systems, but we can do
      that later) and thus clean up a lot of code.
      
      Last but not least this allows decreasing the size of the cmd_flags
      field in struct request to 32-bits.  Various functions passing this
      value could also be updated, but I'd like to avoid the churn for now.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      ef295ecf
  20. 24 9月, 2016 1 次提交
  21. 10 8月, 2016 1 次提交
    • T
      cgroup: make cgroup_path() and friends behave in the style of strlcpy() · 4c737b41
      Tejun Heo 提交于
      cgroup_path() and friends used to format the path from the end and
      thus the resulting path usually didn't start at the start of the
      passed in buffer.  Also, when the buffer was too small, the partial
      result was truncated from the head rather than tail and there was no
      way to tell how long the full path would be.  These make the functions
      less robust and more awkward to use.
      
      With recent updates to kernfs_path(), cgroup_path() and friends can be
      made to behave in strlcpy() style.
      
      * cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now
        always return the length of the full path.  If buffer is too small,
        it contains nul terminated truncated output.
      
      * All users updated accordingly.
      
      v2: cgroup_path() usage in kernel/sched/debug.c converted.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      4c737b41
  22. 08 8月, 2016 1 次提交
    • J
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe 提交于
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokeness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1eff9d32
  23. 08 6月, 2016 2 次提交
  24. 27 10月, 2015 1 次提交
  25. 19 8月, 2015 9 次提交
    • T
      blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy · 69d7fde5
      Tejun Heo 提交于
      cgroup is trying to make interface consistent across different
      controllers.  For weight based resource control, the knob should have
      the range [1, 10000] and default to 100.  This patch updates
      cfq-iosched so that the weight range conforms.  The internal
      calculations have enough range and the widening of the weight range
      shouldn't cause any problem.
      
      * blkcg_policy->cpd_bind_fn() is added.  If present, this is invoked
        when blkcg is attached to a hierarchy.
      
      * cfq_cpd_init() is updated to use the new default value on the
        unified hierarchy.
      
      * cfq_cpd_bind() callback is implemented to clear per-blkg configs and
        apply the default config matching the hierarchy type.
      
      * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
        is moved into !CONFIG_CFQ_GROUP_IOSCHED block.  cfq_cpd_bind() is
        now responsible for initializing the initial weights when blkcg is
        enabled.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      69d7fde5
    • T
      blkcg: implement interface for the unified hierarchy · 2ee867dc
      Tejun Heo 提交于
      blkcg interface grew to be the biggest of all controllers and
      unfortunately most inconsistent too.  The interface files are
      inconsistent with a number of cloes duplicates.  Some files have
      recursive variants while others don't.  There's distinction between
      normal and leaf weights which isn't intuitive and there are a lot of
      stat knobs which don't make much sense outside of debugging and expose
      too much implementation details to userland.
      
      In the unified hierarchy, everything is always hierarchical and
      internal nodes can't have tasks rendering the two structural issues
      twisting the current interface.  The interface has to be updated in a
      significant anyway and this is a good chance to revamp it as a whole.
      This patch implements blkcg interface for the unified hierarchy.
      
      * (from a previous patch) blkcg is identified by "io" instead of
        "blkio" on the unified hierarchy.  Given that the whole interface is
        updated anyway, the rename shouldn't carry noticeable conversion
        overhead.
      
      * The original interface consisted of 27 files is replaced with the
        following three files.
      
        blkio.stat	: per-blkcg stats
        blkio.weight	: per-cgroup and per-cgroup-queue weight settings
        blkio.max	: per-cgroup-queue bps and iops max limits
      
      Documentation/cgroups/unified-hierarchy.txt updated accordingly.
      
      v2: blkcg_policy->dfl_cftypes wasn't removed on
          blkcg_policy_unregister() corrupting the cftypes list.  Fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2ee867dc
    • T
      blkcg: misc preparations for unified hierarchy interface · dd165eb3
      Tejun Heo 提交于
      * Export blkg_dev_name()
      
      * Drop unnecessary @cft from __cfq_set_weight().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dd165eb3
    • T
      blkcg: move body parsing from blkg_conf_prep() to its callers · 36aa9e5f
      Tejun Heo 提交于
      Currently, blkg_conf_prep() expects input to be of the following form
      
       MAJ:MIN NUM
      
      and reads the NUM part into blkg_conf_ctx->v.  This is quite
      restrictive and gets in the way in implementing blkcg interface for
      the unified hierarchy.  This patch updates blkg_conf_prep() so that it
      expects
      
       MAJ:MIN BODY_STR
      
      where BODY_STR is an arbitrary string.  blkg_conf_ctx->v is replaced
      with ->body which is a char pointer pointing to the start of BODY_STR.
      Parsing of the body is moved to blkg_conf_prep()'s callers.
      
      To allow using, for example, strsep() on blkg_conf_ctx->val, it is a
      non-const pointer and to accommodate that const is dropped from @input
      too.
      
      This doesn't cause any behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      36aa9e5f
    • T
      blkcg: mark existing cftypes as legacy · 880f50e2
      Tejun Heo 提交于
      blkcg is about to grow interface for the unified hierarchy.  Add
      legacy to existing cftypes.
      
      * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
      * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
      * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
      * blk-throttle.c:throtl_files -> throtl_legacy_files
      
      Pure renames.  No functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      880f50e2
    • T
      blkcg: rename subsystem name from blkio to io · c165b3e3
      Tejun Heo 提交于
      blkio interface has become messy over time and is currently the
      largest.  In addition to the inconsistent naming scheme, it has
      multiple stat files which report more or less the same thing, a number
      of debug stat files which expose internal details which shouldn't have
      been part of the public interface in the first place, recursive and
      non-recursive stats and leaf and non-leaf knobs.
      
      Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
      don't make any sense on the unified hierarchy as only leaf cgroups can
      contain processes.  cgroups is going through a major interface
      revision with the unified hierarchy involving significant fundamental
      usage changes and given that a significant portion of the interface
      doesn't make sense anymore, it's a good time to reorganize the
      interface.
      
      As the first step, this patch renames the external visible subsystem
      name from "blkio" to "io".  This is more concise, matches the other
      two major subsystem names, "cpu" and "memory", and better suited as
      blkcg will be involved in anything writeback related too whether an
      actual block device is involved or not.
      
      As the subsystem legacy_name is set to "blkio", the only userland
      visible change outside the unified hierarchy is that blkcg is reported
      as "io" instead of "blkio" in the subsystem initialized message during
      boot.  On the unified hierarchy, blkcg now appears as "io".
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: cgroups@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c165b3e3
    • T
      blkcg: move io_service_bytes and io_serviced stats into blkcg_gq · 77ea7338
      Tejun Heo 提交于
      Currently, both cfq-iosched and blk-throttle keep track of
      io_service_bytes and io_serviced stats.  While keeping track of them
      separately may be useful during development, it doesn't make much
      sense otherwise.  Also, blk-throttle was counting bio's as IOs while
      cfq-iosched request's, which is more confusing than informative.
      
      This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
      removes the counterparts from cfq-iosched and blk-throttle and let
      them print from the common blkg counters.  The common counters are
      incremented during bio issue in blkcg_bio_issue_check().
      
      The outputs are still filtered by whether the policy has
      blkg_policy_data on a given blkg, so cfq's output won't show up if it
      has never been used for a given blkg.  The only times when the outputs
      would differ significantly are when policies are attached on the fly
      or elevators are switched back and forth.  Those are quite exceptional
      operations and I don't think they warrant keeping separate counters.
      
      v3: Update blkio-controller.txt accordingly.
      
      v2: Account IOs during bio issues instead of request completions so
          that bio-based drivers can be handled the same way.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      77ea7338
    • T
      blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq · f12c74ca
      Tejun Heo 提交于
      Currently, blkg_[rw]stat_recursive_sum() assume that the target
      counter is located in pd (blkg_policy_data); however, some counters
      are planned to be moved to blkg (blkcg_gq).
      
      This patch updates blkg_[rw]stat_recursive_sum() to take blkg and
      blkg_policy pointers instead of pd.  If policy is NULL, it indexes
      into blkg.  If non-NULL, into the blkg's pd of the policy.
      
      The existing usages are updated to maintain the current behaviors.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f12c74ca
    • T
      blkcg: make blkcg_[rw]stat per-cpu · 24bdb8ef
      Tejun Heo 提交于
      blkcg_[rw]stat are used as stat counters for blkcg policies.  It isn't
      per-cpu by itself and blk-throttle makes it per-cpu by wrapping around
      it.  This patch makes blkcg_[rw]stat per-cpu and drop the ad-hoc
      per-cpu wrapping in blk-throttle.
      
      * blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
        percpu_counter.  This makes syncp unnecessary as remote accesses are
        handled by percpu_counter itself.
      
      * blkg_[rw]stat_init() can now fail due to percpu allocation failure
        and thus are updated to return int.
      
      * percpu_counters need explicit freeing.  blkg_[rw]stat_exit() added.
      
      * As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
        and summing results are stored in ->aux_cnt[] instead.
      
      * Custom per-cpu stat implementation in blk-throttle is removed.
      
      This makes all blkcg stat counters per-cpu without complicating policy
      implmentations.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      24bdb8ef