1. 23 2月, 2016 8 次提交
    • J
      jbd2: factor out common descriptor block initialization · 32ab6715
      Jan Kara 提交于
      Descriptor block header is initialized in several places. Factor out the
      common code into jbd2_journal_get_descriptor_buffer().
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      32ab6715
    • J
      jbd2: remove unnecessary arguments of jbd2_journal_write_revoke_records · 9bcf976c
      Jan Kara 提交于
      jbd2_journal_write_revoke_records() takes journal pointer and write_op,
      although journal can be obtained from the passed transaction and
      write_op is always WRITE_SYNC. Remove these superfluous arguments.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      9bcf976c
    • A
      mbcache: add reusable flag to cache entries · 6048c64b
      Andreas Gruenbacher 提交于
      To reduce amount of damage caused by single bad block, we limit number
      of inodes sharing an xattr block to 1024. Thus there can be more xattr
      blocks with the same contents when there are lots of files with the same
      extended attributes. These xattr blocks naturally result in hash
      collisions and can form long hash chains and we unnecessarily check each
      such block only to find out we cannot use it because it is already
      shared by too many inodes.
      
      Add a reusable flag to cache entries which is cleared when a cache entry
      has reached its maximum refcount.  Cache entries which are not marked
      reusable are skipped by mb_cache_entry_find_{first,next}. This
      significantly speeds up mbcache when there are many same xattr blocks.
      For example for xattr-bench with 5 values and each process handling
      20000 files, the run for 64 processes is 25x faster with this patch.
      Even for 8 processes the speedup is almost 3x. We have also verified
      that for situations where there is only one xattr block of each kind,
      the patch doesn't have a measurable cost.
      
      [JK: Remove handling of setting the same value since it is not needed
      anymore, check for races in e_reusable setting, improve changelog,
      add measurements]
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6048c64b
    • A
      mbcache: get rid of _e_hash_list_head · dc8d5e56
      Andreas Gruenbacher 提交于
      Get rid of field _e_hash_list_head in cache entries and add bit field
      e_referenced instead.
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      dc8d5e56
    • J
      mbcache2: rename to mbcache · 7a2508e1
      Jan Kara 提交于
      Since old mbcache code is gone, let's rename new code to mbcache since
      number 2 is now meaningless. This is just a mechanical replacement.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      7a2508e1
    • J
      mbcache2: Use referenced bit instead of LRU · f0c8b462
      Jan Kara 提交于
      Currently we maintain perfect LRU list by moving entry to the tail of
      the list when it gets used. However these operations on cache-global
      list are relatively expensive.
      
      In this patch we switch to lazy updates of LRU list. Whenever entry gets
      used, we set a referenced bit in it. When reclaiming entries, we give
      referenced entries another round in the LRU. Since the list is not a
      real LRU anymore, rename it to just 'list'.
      
      In my testing this logic gives about 30% boost to workloads with mostly
      unique xattr blocks (e.g. xattr-bench with 10 files and 10000 unique
      xattr values).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      f0c8b462
    • J
      mbcache: remove mbcache · ecd1e644
      Jan Kara 提交于
      Both ext2 and ext4 are now converted to mbcache2. Remove the old mbcache
      code.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ecd1e644
    • J
      mbcache2: reimplement mbcache · f9a61eb4
      Jan Kara 提交于
      Original mbcache was designed to have more features than what ext?
      filesystems ended up using. It supported entry being in more hashes, it
      had a home-grown rwlocking of each entry, and one cache could cache
      entries from multiple filesystems. This genericity also resulted in more
      complex locking, larger cache entries, and generally more code
      complexity.
      
      This is reimplementation of the mbcache functionality to exactly fit the
      purpose ext? filesystems use it for. Cache entries are now considerably
      smaller (7 instead of 13 longs), the code is considerably smaller as
      well (414 vs 913 lines of code), and IMO also simpler. The new code is
      also much more lightweight.
      
      I have measured the speed using artificial xattr-bench benchmark, which
      spawns P processes, each process sets xattr for F different files, and
      the value of xattr is randomly chosen from a pool of V values. Averages
      of runtimes for 5 runs for various combinations of parameters are below.
      The first value in each cell is old mbache, the second value is the new
      mbcache.
      
      V=10
      F\P	1		2		4		8		16		32		64
      10	0.158,0.157	0.208,0.196	0.500,0.277	0.798,0.400	3.258,0.584	13.807,1.047	61.339,2.803
      100	0.172,0.167	0.279,0.222	0.520,0.275	0.825,0.341	2.981,0.505	12.022,1.202	44.641,2.943
      1000	0.185,0.174	0.297,0.239	0.445,0.283	0.767,0.340	2.329,0.480	6.342,1.198	16.440,3.888
      
      V=100
      F\P	1		2		4		8		16		32		64
      10	0.162,0.153	0.200,0.186	0.362,0.257	0.671,0.496	1.433,0.943	3.801,1.345	7.938,2.501
      100	0.153,0.160	0.221,0.199	0.404,0.264	0.945,0.379	1.556,0.485	3.761,1.156	7.901,2.484
      1000	0.215,0.191	0.303,0.246	0.471,0.288	0.960,0.347	1.647,0.479	3.916,1.176	8.058,3.160
      
      V=1000
      F\P	1		2		4		8		16		32		64
      10	0.151,0.129	0.210,0.163	0.326,0.245	0.685,0.521	1.284,0.859	3.087,2.251	6.451,4.801
      100	0.154,0.153	0.211,0.191	0.276,0.282	0.687,0.506	1.202,0.877	3.259,1.954	8.738,2.887
      1000	0.145,0.179	0.202,0.222	0.449,0.319	0.899,0.333	1.577,0.524	4.221,1.240	9.782,3.579
      
      V=10000
      F\P	1		2		4		8		16		32		64
      10	0.161,0.154	0.198,0.190	0.296,0.256	0.662,0.480	1.192,0.818	2.989,2.200	6.362,4.746
      100	0.176,0.174	0.236,0.203	0.326,0.255	0.696,0.511	1.183,0.855	4.205,3.444	19.510,17.760
      1000	0.199,0.183	0.240,0.227	1.159,1.014	2.286,2.154	6.023,6.039	---,10.933	---,36.620
      
      V=100000
      F\P	1		2		4		8		16		32		64
      10	0.171,0.162	0.204,0.198	0.285,0.230	0.692,0.500	1.225,0.881	2.990,2.243	6.379,4.771
      100	0.151,0.171	0.220,0.210	0.295,0.255	0.720,0.518	1.226,0.844	3.423,2.831	19.234,17.544
      1000	0.192,0.189	0.249,0.225	1.162,1.043	2.257,2.093	5.853,4.997	---,10.399	---,32.198
      
      We see that the new code is faster in pretty much all the cases and
      starting from 4 processes there are significant gains with the new code
      resulting in upto 20-times shorter runtimes. Also for large numbers of
      cached entries all values for the old code could not be measured as the
      kernel started hitting softlockups and died before the test completed.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      f9a61eb4
  2. 29 1月, 2016 3 次提交
  3. 27 1月, 2016 3 次提交
  4. 26 1月, 2016 1 次提交
    • M
      irqdomain: Allow domain lookup with DOMAIN_BUS_WIRED token · 530cbe10
      Marc Zyngier 提交于
      Let's take the (outlandish) example of an interrupt controller
      capable of handling both wired interrupts and PCI MSIs.
      
      With the current code, the PCI MSI domain is going to be tagged
      with DOMAIN_BUS_PCI_MSI, and the wired domain with DOMAIN_BUS_ANY.
      
      Things get hairy when we start looking up the domain for a wired
      interrupt (typically when creating it based on some firmware
      information - DT or ACPI).
      
      In irq_create_fwspec_mapping(), we perform the lookup using
      DOMAIN_BUS_ANY, which is actually used as a wildcard. This gives
      us one chance out of two to end up with the wrong domain, and
      we try to configure a wired interrupt with the MSI domain.
      Everything grinds to a halt pretty quickly.
      
      What we really need to do is to start looking for a domain that
      would uniquely identify a wired interrupt domain, and only use
      DOMAIN_BUS_ANY as a fallback.
      
      In order to solve this, let's introduce a new DOMAIN_BUS_WIRED
      token, which is going to be used exactly as described above.
      Of course, this depends on the irqchip to setup the domain
      bus_token, and nobody had to implement this so far.
      
      Only so far.
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Link: http://lkml.kernel.org/r/1453816347-32720-2-git-send-email-marc.zyngier@arm.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      530cbe10
  5. 24 1月, 2016 5 次提交
  6. 23 1月, 2016 6 次提交
    • R
      dax: add support for fsync/sync · 9973c98e
      Ross Zwisler 提交于
      To properly handle fsync/msync in an efficient way DAX needs to track
      dirty pages so it is able to flush them durably to media on demand.
      
      The tracking of dirty pages is done via the radix tree in struct
      address_space.  This radix tree is already used by the page writeback
      infrastructure for tracking dirty pages associated with an open file,
      and it already has support for exceptional (non struct page*) entries.
      We build upon these features to add exceptional entries to the radix
      tree for DAX dirty PMD or PTE pages at fault time.
      
      [dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9973c98e
    • R
      mm: add find_get_entries_tag() · 7e7f7749
      Ross Zwisler 提交于
      Add find_get_entries_tag() to the family of functions that include
      find_get_entries(), find_get_pages() and find_get_pages_tag().  This is
      needed for DAX dirty page handling because we need a list of both page
      offsets and radix tree entries ('indices' and 'entries' in this
      function) that are marked with the PAGECACHE_TAG_TOWRITE tag.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7e7f7749
    • R
      dax: support dirty DAX entries in radix tree · f9fe48be
      Ross Zwisler 提交于
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9fe48be
    • R
      pmem: add wb_cache_pmem() to the PMEM API · 3f4a2670
      Ross Zwisler 提交于
      __arch_wb_cache_pmem() was already an internal implementation detail of
      the x86 PMEM API, but this functionality needs to be exported as part of
      the general PMEM API to handle the fsync/msync case for DAX mmaps.
      
      One thing worth noting is that we really do want this to be part of the
      PMEM API as opposed to a stand-alone function like clflush_cache_range()
      because of ordering restrictions.  By having wb_cache_pmem() as part of
      the PMEM API we can leave it unordered, call it multiple times to write
      back large amounts of memory, and then order the multiple calls with a
      single wmb_pmem().
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f4a2670
    • A
      make sure that freeing shmem fast symlinks is RCU-delayed · 3ed47db3
      Al Viro 提交于
      Cc: stable@vger.kernel.org # v4.2+
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3ed47db3
    • A
      wrappers for ->i_mutex access · 5955102c
      Al Viro 提交于
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  7. 22 1月, 2016 10 次提交
  8. 21 1月, 2016 4 次提交