1. 24 January 2020, 15 commits
    • bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan() · d5c9c470
      Committed by Coly Li
      In order to skip the most recently freed btree node caches, bch_mca_scan()
      currently skips the first 3 caches on the c->btree_cache_freeable list
      when shrinking bcache node caches. The related code in bch_mca_scan() is,
      
       list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
               if (nr <= 0)
                       goto out;

               if (++i > 3 &&
                   !mca_reap(b, 0, false)) {
                       /* ... lines that free the cache memory ... */
               }
               nr--;
       }
      
      The problem is, if virtual memory code calls bch_mca_scan() and the
      calculated 'nr' is 1 or 2, then in the above loop nothing will be
      shrunk. In such a case, if the slub/slab manager calls bch_mca_scan()
      many times with a small scan number, it does not help to shrink cache
      memory and just wastes CPU cycles.
      
      This patch instead selects btree node caches from the tail of the
      c->btree_cache_freeable list, so that the newly freed caches near the
      head of the list can still be allocated by mca_alloc(), and at least
      1 node can be shrunk.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d5c9c470
    • bcache: remove member accessed from struct btree · 125d98ed
      Committed by Coly Li
      The member 'accessed' of struct btree is used in bch_mca_scan() when
      shrinking btree node caches. The original idea is: if b->accessed is
      set, clear it and move on to the next btree node cache on the
      c->btree_cache list, and only shrink the caches whose b->accessed is
      already cleared. Then only cold btree node caches will be shrunk.
      
      But when I/O pressure is high, it is very probable that b->accessed of
      a btree node cache will be set again in bch_btree_node_get() before
      bch_mca_scan() selects it again. Then bch_mca_scan() has no chance to
      shrink enough memory back to the slub or slab system.
      
      This patch removes the member accessed from struct btree, so once a
      btree node cache is selected, it is shrunk immediately. With this
      change, bch_mca_scan() may release btree node caches more efficiently.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      125d98ed
    • bcache: print written and keys in trace_bcache_btree_write · d44330b7
      Committed by Guoju Fang
      It's useful to dump the written block and keys on btree write; this
      patch adds them to trace_bcache_btree_write.
      Signed-off-by: Guoju Fang <fangguoju@gmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d44330b7
    • bcache: avoid unnecessary btree nodes flushing in btree_flush_write() · 2aa8c529
      Committed by Coly Li
      The commit 91be66e1 ("bcache: performance improvement for
      btree_flush_write()") was an effort to flush the btree node with the
      oldest journal entry faster, by the following methods,
      - Only iterate dirty btree nodes in c->btree_cache, avoid scanning a lot
        of clean btree nodes.
      - Take c->btree_cache as a LRU-like list, aggressively flushing all
        dirty nodes from the tail of c->btree_cache until the btree node
        with the oldest journal entry is flushed. This reduces the time of
        holding c->bucket_lock.
      
      Guoju Fang and Shuang Li reported that they observed unexpected extra
      write I/Os on the cache device after applying the above patch. Guoju
      Fang provided more detailed diagnosis information: the aggressive
      btree node flushing may cause 10x more btree nodes to flush in his
      workload. He points out that when system memory is large enough to
      hold all btree nodes in memory, c->btree_cache is not a LRU-like list
      any more. Then the btree node with the oldest journal entry is very
      probably not close to the tail of the c->btree_cache list. In such a
      situation many more dirty btree nodes will be aggressively flushed
      before the target node is flushed. When a slow SATA SSD is used as
      the cache device, such over-aggressive flushing behavior causes a
      performance regression.
      
      After spending a lot of time on debugging and diagnosis, I find the
      real situation is more complicated; aggressively flushing dirty btree
      nodes from the tail of the c->btree_cache list is not a good solution.
      - When all btree nodes are cached in memory, c->btree_cache is not a
        LRU-like list, and the btree nodes with the oldest journal entry
        won't be close to the tail of the list.
      - There can be hundreds of dirty btree nodes referencing the oldest
        journal entry; before all those nodes are flushed, the oldest
        journal entry cannot be reclaimed.
      When the above two conditions are mixed together, simply flushing
      from the tail of the c->btree_cache list is really NOT a good idea.
      
      Fortunately there is still a chance to make btree_flush_write() work
      better. Here is how this patch avoids unnecessary btree node flushing,
      - Only acquire c->journal.lock when getting the oldest journal entry
        of the fifo c->journal.pin. In all other locations check the journal
        entries locklessly, so their values can be changed on other cores
        in parallel.
      - In the loop list_for_each_entry_safe_reverse(), check the latest
        front point of the fifo c->journal.pin. If it is different from the
        original front point which we got while holding c->journal.lock, it
        means the oldest journal entry has been reclaimed on another core.
        At this moment, all selected dirty nodes recorded in the array
        btree_nodes[] have already been flushed and are clean on other CPU
        cores, so it is unnecessary to iterate c->btree_cache any longer.
        Just quit the list_for_each_entry_safe_reverse() loop, and the
        following for-loop will skip all the selected clean nodes.
      - Find a proper time to quit the list_for_each_entry_safe_reverse()
        loop. Check the refcount value of the original fifo front point; if
        the value is larger than the number of nodes selected into
        btree_nodes[], more matching btree nodes should be scanned.
        Otherwise there are no more matching btree nodes in the rest of the
        c->btree_cache list, and the loop can be quit. If the original
        oldest journal entry is reclaimed and the fifo front point is
        updated, the refcount of the original fifo front point will be 0,
        and the loop will be quit too.
      - Do not hold c->bucket_lock for too long. c->bucket_lock is also
        required for space allocation for cached data; holding it for too
        long blocks regular I/O requests. When iterating the list
        c->btree_cache, even if there are a lot of matching btree nodes, in
        order not to hold c->bucket_lock too long, only BTREE_FLUSH_NR
        nodes are selected for flushing in the following for-loop.
      With this patch, only btree nodes referencing the oldest journal entry
      are flushed to the cache device; there is no more aggressive flushing
      of unnecessary btree nodes. And in order to avoid blocking regular I/O
      requests, each time btree_flush_write() is called, at most
      BTREE_FLUSH_NR btree nodes are selected to flush, even if there are
      more matching btree nodes on the list c->btree_cache.
      
      At last, one more thing to explain: why is it safe to read the front
      point of c->journal.pin without holding c->journal.lock inside the
      list_for_each_entry_safe_reverse() loop?
      
      Here is my answer: when reading the front point of the fifo
      c->journal.pin, we don't need to know its exact value, we just want to
      check whether it is different from the original front point (which is
      an accurate value, because we got it while c->journal.lock was held).
      For this purpose, it works as expected without holding c->journal.lock.
      Even if the front point is changed on another CPU core and not yet
      observed on the local core, and the currently iterated btree node has
      a journal entry identical to the originally fetched fifo front point,
      it is still safe. Because after taking the mutex b->write_lock (with
      its memory barrier) this btree node will be found clean and skipped,
      and the loop will quit later when iterating the next node of the list
      c->btree_cache.
      
      Fixes: 91be66e1 ("bcache: performance improvement for btree_flush_write()")
      Reported-by: Guoju Fang <fangguoju@gmail.com>
      Reported-by: Shuang Li <psymon@bonuscloud.io>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2aa8c529
    • bcache: add code comments for state->pool in __btree_sort() · 7a0bc2a8
      Committed by Coly Li
      Explain that the pages allocated from the mempool state->pool can be
      swapped in __btree_sort(), because state->pool is a page pool, which
      indeed allocates pages by alloc_pages().
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7a0bc2a8
    • lib: crc64: include <linux/crc64.h> for 'crc64_be' · 0e0c1231
      Committed by Ben Dooks (Codethink)
      crc64_be() is declared in <linux/crc64.h>, so include that header
      where the symbol is defined to avoid the following warning:
      
      lib/crc64.c:43:12: warning: symbol 'crc64_be' was not declared. Should it be static?
      Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0e0c1231
    • bcache: use read_cache_page_gfp to read the superblock · 6321bef0
      Committed by Christoph Hellwig
      Avoid a pointless dependency on buffer heads in bcache by simply open
      coding reading a single page.  Also add a SB_OFFSET define for the
      byte offset of the superblock instead of using magic numbers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6321bef0
    • bcache: store a pointer to the on-disk sb in the cache and cached_dev structures · 475389ae
      Committed by Christoph Hellwig
      This allows properly building the superblock bio, including the offset
      in the page, using the normal bio helpers.  This fixes writing the
      superblock for page sizes larger than 4k, where the sb write bio would
      need an offset in the bio_vec.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      475389ae
    • bcache: return a pointer to the on-disk sb from read_super · cfa0c56d
      Committed by Christoph Hellwig
      Returning the properly typed actual data structure instead of the
      containing struct page will save the callers some work going forward.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cfa0c56d
    • bcache: transfer the sb_page reference to register_{bdev,cache} · fc8f19cc
      Committed by Christoph Hellwig
      Avoid an extra reference count roundtrip by transferring the sb_page
      ownership to the lower level register helpers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fc8f19cc
    • bcache: fix use-after-free in register_bcache() · ae3cd299
      Committed by Coly Li
      The patch "bcache: rework error unwinding in register_bcache" introduces
      a use-after-free regression in register_bcache(). Here is the current
      code,
      	2510 out_free_path:
      	2511         kfree(path);
      	2512 out_module_put:
      	2513         module_put(THIS_MODULE);
      	2514 out:
      	2515         pr_info("error %s: %s", path, err);
      	2516         return ret;
      If an error happens and the above code path is executed, path is
      released at line 2511 but still referenced at line 2515. Then KASAN
      reports a use-after-free error.
      
      This patch changes line 2515 in the following way to fix the problem,
      	2515         pr_info("error %s: %s", path?path:"", err);
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ae3cd299
    • bcache: properly initialize 'path' and 'err' in register_bcache() · 29cda393
      Committed by Coly Li
      The patch "bcache: rework error unwinding in register_bcache" from
      Christoph Hellwig leaves the local variables 'path' and 'err' in an
      undefined initial state. If the code in register_bcache() jumps to the
      label 'out:' or 'out_module_put:' by goto, these two variables might be
      referenced with undefined values by the following lines,
      
      	out_module_put:
      	        module_put(THIS_MODULE);
      	out:
      	        pr_info("error %s: %s", path, err);
      	        return ret;
      
      Therefore this patch initializes these two local variables properly in
      register_bcache() to avoid such an issue.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      29cda393
    • bcache: rework error unwinding in register_bcache · 50246693
      Committed by Christoph Hellwig
      Split the successful and error return path, and use one goto label for each
      resource to unwind.  This also fixes some small errors like leaking the
      module reference count in the reboot case (which seems entirely harmless)
      or printing the wrong warning messages for early failures.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      50246693
    • bcache: use a separate data structure for the on-disk super block · a702a692
      Committed by Christoph Hellwig
      Split out an on-disk version struct cache_sb with the proper endianness
      annotations.  This fixes a fair chunk of sparse warnings, but there are
      some left due to the way the checksum is defined.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a702a692
    • bcache: cached_dev_free needs to put the sb page · e8547d42
      Committed by Liang Chen
      Same as for the cache device, the buffer page needs to be put while
      freeing cached_dev. Otherwise a page is leaked every time a cached_dev
      is stopped.
      Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e8547d42
  2. 14 January 2020, 16 commits
  3. 09 January 2020, 1 commit
  4. 07 January 2020, 2 commits
  5. 19 December 2019, 4 commits
    • blk-mq: optimise blk_mq_flush_plug_list() · 95ed0c5b
      Committed by Pavel Begunkov
      Instead of using list_del_init() in a loop, which generates a lot of
      unnecessary memory reads and writes, iterate from the first request of
      a batch and cut out a sublist with list_cut_before().

      Apart from removing the list node initialisation part, this is more
      register-friendly, and the assembly uses the stack less intensively.

      The list_empty() check at the beginning is done in the hope that the
      compiler can optimise out the same check in the following
      list_splice_init().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      95ed0c5b
    • list: introduce list_for_each_continue() · 28ca0d6d
      Committed by Pavel Begunkov
      Like the other *_continue() helpers, this continues iteration from a
      given position.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      28ca0d6d
    • blk-mq: optimise rq sort function · 7d30a621
      Committed by Pavel Begunkov
      Check "!=" in multi-layer comparisons. The memory usage is the same,
      there are fewer instructions, and 2 of the 4 jumps are replaced with
      SETcc.

      Note that list_sort() doesn't distinguish 0 from <0.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7d30a621
    • Merge tag 'sound-5.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 80a0c2e5
      Committed by Linus Torvalds
      Pull sound fixes from Takashi Iwai:
       "A slightly high amount at this time, but all good and small fixes:
      
         - A PCM core fix that initializes the buffer properly for avoiding
           information leaks; it is a long-standing minor problem, but good to
           fix better now
      
         - A few ASoC core fixes for the init / cleanup ordering issues that
           surfaced after the recent refactoring
      
         - Lots of SOF and topology-related fixes went in, as usual as such
           hot topics
      
         - Several ASoC codec and platform-specific small fixes: wm89xx,
           realtek, and max98090, AMD, Intel-SST
      
         - A fix for the previous incomplete regression of HD-audio, now
           hitting Nvidia HDMI
      
         - A few HD-audio CA0132 codec fixes"
      
      * tag 'sound-5.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (27 commits)
        ALSA: hda - Downgrade error message for single-cmd fallback
        ASoC: wm8962: fix lambda value
        ALSA: hda: Fix regression by strip mask fix
        ALSA: hda/ca0132 - Fix work handling in delayed HP detection
        ALSA: hda/ca0132 - Avoid endless loop
        ALSA: hda/ca0132 - Keep power on during processing DSP response
        ALSA: pcm: Avoid possible info leaks from PCM stream buffers
        ASoC: Intel: common: work-around incorrect ACPI HID for CML boards
        ASoC: SOF: Intel: split cht and byt debug window sizes
        ASoC: SOF: loader: fix snd_sof_fw_parse_ext_data
        ASoC: SOF: loader: snd_sof_fw_parse_ext_data log warning on unknown header
        ASoC: simple-card: Don't create separate link when platform is present
        ASoC: topology: Check return value for soc_tplg_pcm_create()
        ASoC: topology: Check return value for snd_soc_add_dai_link()
        ASoC: core: only flush inited work during free
        ASoC: Intel: bytcr_rt5640: Update quirk for Teclast X89
        ASoC: core: Init pcm runtime work early to avoid warnings
        ASoC: Intel: sst: Add missing include <linux/io.h>
        ASoC: max98090: fix possible race conditions
        ASoC: max98090: exit workaround earlier if PLL is locked
        ...
      80a0c2e5
  6. 18 December 2019, 2 commits
    • Merge tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 2187f215
      Committed by Linus Torvalds
      Pull btrfs fixes from David Sterba:
       "A mix of regression fixes and regular fixes for stable trees:
      
         - fix swapped error messages for qgroup enable/rescan
      
         - fixes for NO_HOLES feature with clone range
      
         - fix deadlock between iget/srcu lock/synchronize srcu while freeing
           an inode
      
         - fix double lock on subvolume cross-rename
      
         - tree log fixes
            * fix missing data checksums after replaying a log tree
            * also teach tree-checker about this problem
            * skip log replay on orphaned roots
      
         - fix maximum devices constraints for RAID1C -3 and -4
      
         - send: don't print warning on read-only mount regarding orphan
           cleanup
      
         - error handling fixes"
      
      * tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: send: remove WARN_ON for readonly mount
        btrfs: do not leak reloc root if we fail to read the fs root
        btrfs: skip log replay on orphaned roots
        btrfs: handle ENOENT in btrfs_uuid_tree_iterate
        btrfs: abort transaction after failed inode updates in create_subvol
        Btrfs: fix hole extent items with a zero size after range cloning
        Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues
        Btrfs: make tree checker detect checksum items with overlapping ranges
        Btrfs: fix missing data checksums after replaying a log tree
        btrfs: return error pointer from alloc_test_extent_buffer
        btrfs: fix devs_max constraints for raid1c3 and raid1c4
        btrfs: tree-checker: Fix error format string for size_t
        btrfs: don't double lock the subvol_sem for rename exchange
        btrfs: handle error in btrfs_cache_block_group
        btrfs: do not call synchronize_srcu() in inode_tree_del
        Btrfs: fix cloning range with a hole when using the NO_HOLES feature
        btrfs: Fix error messages in qgroup_rescan_init
      2187f215
    • early init: fix error handling when opening /dev/console · 2d3145f8
      Committed by Linus Torvalds
      The comment says "this should never fail", but it definitely can fail
      when you have odd initial boot filesystems, or kernel configurations.
      
      So get the error handling right: filp_open() returns an error pointer.
      Reported-by: Jesse Barnes <jsbarnes@google.com>
      Reported-by: youling 257 <youling257@gmail.com>
      Fixes: 8243186f ("fs: remove ksys_dup()")
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d3145f8