1. 04 4月, 2014 3 次提交
    • A
      nilfs2: implementation of NILFS_IOCTL_SET_SUINFO ioctl · 2cc88f3a
      Andreas Rohner 提交于
      With this ioctl the segment usage entries in the SUFILE can be updated
      from userspace.
      
      This is useful, because it allows the userspace GC to modify and update
      segment usage entries for specific segments, which enables it to avoid
      unnecessary write operations.
      
      If a segment needs to be cleaned, but there is no or very little
      reclaimable space in it, the cleaning operation basically degrades to a
      useless moving operation.  In the end the only thing that changes is the
      location of the data and a timestamp in the segment usage information.
      With this ioctl the GC can skip the cleaning and update the segment
      usage entries directly instead.
      
      This is basically a shortcut to cleaning the segment.  It is still
      necessary to read the segment summary information, but the writing of
      the live blocks can be skipped if it's not worth it.
      
      [konishi.ryusuke@lab.ntt.co.jp: add description of NILFS_IOCTL_SET_SUINFO ioctl]
      Signed-off-by: NAndreas Rohner <andreas.rohner@gmx.net>
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2cc88f3a
    • A
      nilfs2: add nilfs_sufile_set_suinfo to update segment usage · 00e9ffcd
      Andreas Rohner 提交于
      Introduce nilfs_sufile_set_suinfo(), which expects an array of
      nilfs_suinfo_update structures and updates the segment usage information
      accordingly.
      
      This is basically a helper function for the newly introduced
      NILFS_IOCTL_SET_SUINFO ioctl.
      
      [konishi.ryusuke@lab.ntt.co.jp: use put_bh() instead of brelse() because we know bh != NULL]
      Signed-off-by: NAndreas Rohner <andreas.rohner@gmx.net>
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00e9ffcd
    • J
      mm + fs: store shadow entries in page cache · 91b0abe3
      Johannes Weiner 提交于
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91b0abe3
  2. 24 1月, 2014 2 次提交
  3. 15 1月, 2014 1 次提交
    • A
      nilfs2: fix segctor bug that causes file system corruption · 70f2fe3a
      Andreas Rohner 提交于
      There is a bug in the function nilfs_segctor_collect, which results in
      active data being written to a segment, that is marked as clean.  It is
      possible, that this segment is selected for a later segment
      construction, whereby the old data is overwritten.
      
      The problem shows itself with the following kernel log message:
      
        nilfs_sufile_do_cancel_free: segment 6533 must be clean
      
      Usually a few hours later the file system gets corrupted:
      
        NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0
        NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660)
      
      The issue can be reproduced with a file system that is nearly full and
      with the cleaner running, while some IO intensive task is running.
      Although it is quite hard to reproduce.
      
      This is what happens:
      
       1. The cleaner starts the segment construction
       2. nilfs_segctor_collect is called
       3. sc_stage is on NILFS_ST_SUFILE and segments are freed
       4. sc_stage is on NILFS_ST_DAT current segment is full
       5. nilfs_segctor_extend_segments is called, which
          allocates a new segment
       6. The new segment is one of the segments freed in step 3
       7. nilfs_sufile_cancel_freev is called and produces an error message
       8. Loop around and the collection starts again
       9. sc_stage is on NILFS_ST_SUFILE and segments are freed
          including the newly allocated segment, which will contain active
          data and can be allocated at a later time
      10. A few hours later another segment construction allocates the
          segment and causes file system corruption
      
      This can be prevented by simply reordering the statements.  If
      nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments
      the freed segments are marked as dirty and cannot be allocated any more.
      Signed-off-by: NAndreas Rohner <andreas.rohner@gmx.net>
      Reviewed-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: NAndreas Rohner <andreas.rohner@gmx.net>
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70f2fe3a
  4. 24 11月, 2013 1 次提交
    • K
      block: Abstract out bvec iterator · 4f024f37
      Kent Overstreet 提交于
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>6
      4f024f37
  5. 01 10月, 2013 1 次提交
    • V
      nilfs2: fix issue with race condition of competition between segments for dirty blocks · 7f42ec39
      Vyacheslav Dubeyko 提交于
      Many NILFS2 users were reported about strange file system corruption
      (for example):
      
         NILFS: bad btree node (blocknr=185027): level = 0, flags = 0x0, nchildren = 768
         NILFS error (device sda4): nilfs_bmap_last_key: broken bmap (inode number=11540)
      
      But such error messages are consequence of file system's issue that takes
      place more earlier.  Fortunately, Jerome Poulin <jeromepoulin@gmail.com>
      and Anton Eliasson <devel@antoneliasson.se> were reported about another
      issue not so recently.  These reports describe the issue with segctor
      thread's crash:
      
        BUG: unable to handle kernel paging request at 0000000000004c83
        IP: nilfs_end_page_io+0x12/0xd0 [nilfs2]
      
        Call Trace:
         nilfs_segctor_do_construct+0xf25/0x1b20 [nilfs2]
         nilfs_segctor_construct+0x17b/0x290 [nilfs2]
         nilfs_segctor_thread+0x122/0x3b0 [nilfs2]
         kthread+0xc0/0xd0
         ret_from_fork+0x7c/0xb0
      
      These two issues have one reason.  This reason can raise third issue
      too.  Third issue results in hanging of segctor thread with eating of
      100% CPU.
      
      REPRODUCING PATH:
      
      One of the possible way or the issue reproducing was described by
      Jermoe me Poulin <jeromepoulin@gmail.com>:
      
      1. init S to get to single user mode.
      2. sysrq+E to make sure only my shell is running
      3. start network-manager to get my wifi connection up
      4. login as root and launch "screen"
      5. cd /boot/log/nilfs which is a ext3 mount point and can log when NILFS dies.
      6. lscp | xz -9e > lscp.txt.xz
      7. mount my snapshot using mount -o cp=3360839,ro /dev/vgUbuntu/root /mnt/nilfs
      8. start a screen to dump /proc/kmsg to text file since rsyslog is killed
      9. start a screen and launch strace -f -o find-cat.log -t find
      /mnt/nilfs -type f -exec cat {} > /dev/null \;
      10. start a screen and launch strace -f -o apt-get.log -t apt-get update
      11. launch the last command again as it did not crash the first time
      12. apt-get crashes
      13. ps aux > ps-aux-crashed.log
      13. sysrq+W
      14. sysrq+E  wait for everything to terminate
      15. sysrq+SUSB
      
      Simplified way of the issue reproducing is starting kernel compilation
      task and "apt-get update" in parallel.
      
      REPRODUCIBILITY:
      
      The issue is reproduced not stable [60% - 80%].  It is very important to
      have proper environment for the issue reproducing.  The critical
      conditions for successful reproducing:
      
      (1) It should have big modified file by mmap() way.
      
      (2) This file should have the count of dirty blocks are greater that
          several segments in size (for example, two or three) from time to time
          during processing.
      
      (3) It should be intensive background activity of files modification
          in another thread.
      
      INVESTIGATION:
      
      First of all, it is possible to see that the reason of crash is not valid
      page address:
      
        NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82
        NILFS [nilfs_segctor_complete_write]:2101 segbuf->sb_segnum 6783
      
      Moreover, value of b_page (0x1a82) is 6786.  This value looks like segment
      number.  And b_blocknr with b_size values look like block numbers.  So,
      buffer_head's pointer points on not proper address value.
      
      Detailed investigation of the issue is discovered such picture:
      
        [-----------------------------SEGMENT 6783-------------------------------]
        NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
        NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
        NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
        NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
        NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
        NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
        NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111149024, segbuf->sb_segnum 6783
      
        [-----------------------------SEGMENT 6784-------------------------------]
        NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
        NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
        NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff8802174a6798, bh->b_assoc_buffers.prev ffff880221cffee8
        NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
        NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
        NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
        NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
        NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
        NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6784
        NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111150080, segbuf->sb_segnum 6784, segbuf->sb_nbio 0
        [----------] ditto
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111164416, segbuf->sb_segnum 6784, segbuf->sb_nbio 15
      
        [-----------------------------SEGMENT 6785-------------------------------]
        NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
        NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
        NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff880219277e80, bh->b_assoc_buffers.prev ffff880221cffc88
        NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
        NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
        NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
        NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
        NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6785
        NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111165440, segbuf->sb_segnum 6785, segbuf->sb_nbio 0
        [----------] ditto
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111177728, segbuf->sb_segnum 6785, segbuf->sb_nbio 12
      
        NILFS [nilfs_segctor_do_construct]:2399 nilfs_segctor_wait
        NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6783
        NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6784
        NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6785
      
        NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82
      
        BUG: unable to handle kernel paging request at 0000000000001a82
        IP: [<ffffffffa024d0f2>] nilfs_end_page_io+0x12/0xd0 [nilfs2]
      
      Usually, for every segment we collect dirty files in list.  Then, dirty
      blocks are gathered for every dirty file, prepared for write and
      submitted by means of nilfs_segbuf_submit_bh() call.  Finally, it takes
      place complete write phase after calling nilfs_end_bio_write() on the
      block layer.  Buffers/pages are marked as not dirty on final phase and
      processed files removed from the list of dirty files.
      
      It is possible to see that we had three prepare_write and submit_bio
      phases before segbuf_wait and complete_write phase.  Moreover, segments
      compete between each other for dirty blocks because on every iteration
      of segments processing dirty buffer_heads are added in several lists of
      payload_buffers:
      
        [SEGMENT 6784]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
        [SEGMENT 6785]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8
      
      The next pointer is the same but prev pointer has changed.  It means
      that buffer_head has next pointer from one list but prev pointer from
      another.  Such modification can be made several times.  And, finally, it
      can be resulted in various issues: (1) segctor hanging, (2) segctor
      crashing, (3) file system metadata corruption.
      
      FIX:
      This patch adds:
      
      (1) setting of BH_Async_Write flag in nilfs_segctor_prepare_write()
          for every proccessed dirty block;
      
      (2) checking of BH_Async_Write flag in
          nilfs_lookup_dirty_data_buffers() and
          nilfs_lookup_dirty_node_buffers();
      
      (3) clearing of BH_Async_Write flag in nilfs_segctor_complete_write(),
          nilfs_abort_logs(), nilfs_forget_buffer(), nilfs_clear_dirty_page().
      Reported-by: NJerome Poulin <jeromepoulin@gmail.com>
      Reported-by: NAnton Eliasson <devel@antoneliasson.se>
      Cc: Paul Fertser <fercerpav@gmail.com>
      Cc: ARAI Shun-ichi <hermes@ceres.dti.ne.jp>
      Cc: Piotr Szymaniak <szarpaj@grubelek.pl>
      Cc: Juan Barry Manuel Canham <Linux@riotingpacifist.net>
      Cc: Zahid Chowdhury <zahid.chowdhury@starsolutions.com>
      Cc: Elmer Zhang <freeboy6716@gmail.com>
      Cc: Kenneth Langga <klangga@gmail.com>
      Signed-off-by: NVyacheslav Dubeyko <slava@dubeyko.com>
      Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f42ec39
  6. 13 9月, 2013 1 次提交
  7. 04 9月, 2013 1 次提交
  8. 24 8月, 2013 2 次提交
  9. 05 7月, 2013 1 次提交
  10. 04 7月, 2013 2 次提交
  11. 29 6月, 2013 1 次提交
  12. 25 5月, 2013 1 次提交
    • R
      nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundary · 136e8770
      Ryusuke Konishi 提交于
      nilfs2: fix issue of nilfs_set_page_dirty for page at EOF boundary
      
      DESCRIPTION:
       There are use-cases when NILFS2 file system (formatted with block size
      lesser than 4 KB) can be remounted in RO mode because of encountering of
      "broken bmap" issue.
      
      The issue was reported by Anthony Doggett <Anthony2486@interfaces.org.uk>:
       "The machine I've been trialling nilfs on is running Debian Testing,
        Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc
        version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.35-2), but I've
        also reproduced it (identically) with Debian Unstable amd64 and Debian
        Experimental (using the 3.8-trunk kernel).  The problematic partitions
        were formatted with "mkfs.nilfs2 -b 1024 -B 8192"."
      
      SYMPTOMS:
      (1) System log contains error messages likewise:
      
          [63102.496756] nilfs_direct_assign: invalid pointer: 0
          [63102.496786] NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28)
          [63102.496798]
          [63102.524403] Remounting filesystem read-only
      
      (2) The NILFS2 file system is remounted in RO mode.
      
      REPRODUSING PATH:
      (1) Create volume group with name "unencrypted" by means of vgcreate utility.
      (2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):
      
      ----------------[BEGIN SCRIPT]--------------------
      
      VG=unencrypted
      lvcreate --size 2G --name ntest $VG
      mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
      mkdir /var/tmp/n
      mkdir /var/tmp/n/ntest
      mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
      mkdir /var/tmp/n/ntest/thedir
      cd /var/tmp/n/ntest/thedir
      sleep 2
      date
      darcs init
      sleep 2
      dmesg|tail -n 5
      date
      darcs whatsnew || true
      date
      sleep 2
      dmesg|tail -n 5
      ----------------[END SCRIPT]--------------------
      
      REPRODUCIBILITY: 100%
      
      INVESTIGATION:
      As it was discovered, the issue takes place during segment
      construction after executing such sequence of user-space operations:
      
        open("_darcs/index", O_RDWR|O_CREAT|O_NOCTTY, 0666) = 7
        fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
        ftruncate(7, 60)
      
      The error message "NILFS error (device dm-17): nilfs_bmap_assign: broken
      bmap (inode number=28)" takes place because of trying to get block
      number for third block of the file with logical offset #3072 bytes.  As
      it is possible to see from above output, the file has 60 bytes of the
      whole size.  So, it is enough one block (1 KB in size) allocation for
      the whole file.  Trying to operate with several blocks instead of one
      takes place because of discovering several dirty buffers for this file
      in nilfs_segctor_scan_file() method.
      
      The root cause of this issue is in nilfs_set_page_dirty function which
      is called just before writing to an mmapped page.
      
      When nilfs_page_mkwrite function handles a page at EOF boundary, it
      fills hole blocks only inside EOF through __block_page_mkwrite().
      
      The __block_page_mkwrite() function calls set_page_dirty() after filling
      hole blocks, thus nilfs_set_page_dirty function (=
      a_ops->set_page_dirty) is called.  However, the current implementation
      of nilfs_set_page_dirty() wrongly marks all buffers dirty even for page
      at EOF boundary.
      
      As a result, buffers outside EOF are inconsistently marked dirty and
      queued for write even though they are not mapped with nilfs_get_block
      function.
      
      FIX:
      This modifies nilfs_set_page_dirty() not to mark hole blocks dirty.
      
      Thanks to Vyacheslav Dubeyko for his effort on analysis and proposals
      for this issue.
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Reported-by: NAnthony Doggett <Anthony2486@interfaces.org.uk>
      Reported-by: NVyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
      Tested-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      136e8770
  13. 08 5月, 2013 1 次提交
  14. 01 5月, 2013 3 次提交
    • V
      nilfs2: remove unneeded test in nilfs_writepage() · eb53b6db
      Vyacheslav Dubeyko 提交于
      page->mapping->host cannot be NULL in nilfs_writepage(), so remove the
      unneeded test.
      
      The fixes the smatch warning: "fs/nilfs2/inode.c:211 nilfs_writepage()
      error: we previously assumed 'inode' could be null (see line 195)".
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NVyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eb53b6db
    • V
      nilfs2: fix using of PageLocked() in nilfs_clear_dirty_page() · dc33f5f3
      Vyacheslav Dubeyko 提交于
      Change test_bit(PG_locked, &page->flags) to PageLocked().
      Signed-off-by: NVyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc33f5f3
    • V
      nilfs2: fix issue with flush kernel thread after remount in RO mode because of... · 8c26c4e2
      Vyacheslav Dubeyko 提交于
      nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption
      
      The NILFS2 driver remounts itself in RO mode in the case of discovering
      metadata corruption (for example, discovering a broken bmap).  But
      usually, this takes place when there have been file system operations
      before remounting in RO mode.
      
      Thereby, NILFS2 driver can be in RO mode with presence of dirty pages in
      modified inodes' address spaces.  It results in flush kernel thread's
      infinite trying to flush dirty pages in RO mode.  As a result, it is
      possible to see such side effects as: (1) flush kernel thread occupies
      50% - 99% of CPU time; (2) system can't be shutdowned without manual
      power switch off.
      
      SYMPTOMS:
      (1) System log contains error message: "Remounting filesystem read-only".
      (2) The flush kernel thread occupies 50% - 99% of CPU time.
      (3) The system can't be shutdowned without manual power switch off.
      
      REPRODUCTION PATH:
      (1) Create volume group with name "unencrypted" by means of vgcreate utility.
      (2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):
      
        ----------------[BEGIN SCRIPT]--------------------
        #!/bin/bash
      
        VG=unencrypted
        #apt-get install nilfs-tools darcs
        lvcreate --size 2G --name ntest $VG
        mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
        mkdir /var/tmp/n
        mkdir /var/tmp/n/ntest
        mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
        mkdir /var/tmp/n/ntest/thedir
        cd /var/tmp/n/ntest/thedir
        sleep 2
        date
        darcs init
        sleep 2
        dmesg|tail -n 5
        date
        darcs whatsnew || true
        date
        sleep 2
        dmesg|tail -n 5
        ----------------[END SCRIPT]--------------------
      
      (3) Try to shutdown the system.
      
      REPRODUCIBILITY: 100%
      
      FIX:
      
      This patch implements checking mount state of NILFS2 driver in
      nilfs_writepage(), nilfs_writepages() and nilfs_mdt_write_page()
      methods.  If it is detected the RO mount state then all dirty pages are
      simply discarded with warning messages is written in system log.
      
      [akpm@linux-foundation.org: fix printk warning]
      Signed-off-by: NVyacheslav Dubeyko <slava@dubeyko.com>
      Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Anthony Doggett <Anthony2486@interfaces.org.uk>
      Cc: ARAI Shun-ichi <hermes@ceres.dti.ne.jp>
      Cc: Piotr Szymaniak <szarpaj@grubelek.pl>
      Cc: Zahid Chowdhury <zahid.chowdhury@starsolutions.com>
      Cc: Elmer Zhang <freeboy6716@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8c26c4e2
  15. 04 3月, 2013 1 次提交
    • E
      fs: Limit sys_mount to only request filesystem modules. · 7f78e035
      Eric W. Biederman 提交于
      Modify the request_module to prefix the file system type with "fs-"
      and add aliases to all of the filesystems that can be built as modules
      to match.
      
      A common practice is to build all of the kernel code and leave code
      that is not commonly needed as modules, with the result that many
      users are exposed to any bug anywhere in the kernel.
      
      Looking for filesystems with a fs- prefix limits the pool of possible
      modules that can be loaded by mount to just filesystems trivially
      making things safer with no real cost.
      
      Using aliases means user space can control the policy of which
      filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
      with blacklist and alias directives.  Allowing simple, safe,
      well understood work-arounds to known problematic software.
      
      This also addresses a rare but unfortunate problem where the filesystem
      name is not the same as it's module name and module auto-loading
      would not work.  While writing this patch I saw a handful of such
      cases.  The most significant being autofs that lives in the module
      autofs4.
      
      This is relevant to user namespaces because we can reach the request
      module in get_fs_type() without having any special permissions, and
      people get uncomfortable when a user specified string (in this case
      the filesystem type) goes all of the way to request_module.
      
      After having looked at this issue I don't think there is any
      particular reason to perform any filtering or permission checks beyond
      making it clear in the module request that we want a filesystem
      module.  The common pattern in the kernel is to call request_module()
      without regards to the users permissions.  In general all a filesystem
      module does once loaded is call register_filesystem() and go to sleep.
      Which means there is not much attack surface exposed by loading a
      filesytem module unless the filesystem is mounted.  In a user
      namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
      which most filesystems do not set today.
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Reported-by: NKees Cook <keescook@google.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      7f78e035
  16. 26 2月, 2013 1 次提交
  17. 23 2月, 2013 1 次提交
  18. 22 2月, 2013 1 次提交
    • D
      mm: only enforce stable page writes if the backing device requires it · 1d1d1a76
      Darrick J. Wong 提交于
      Create a helper function to check if a backing device requires stable
      page writes and, if so, performs the necessary wait.  Then, make it so
      that all points in the memory manager that handle making pages writable
      use the helper function.  This should provide stable page write support
      to most filesystems, while eliminating unnecessary waiting for devices
      that don't require the feature.
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own stable page guarantees or they don't block at all.
      The blocking behavior is back to what it was before 3.0 if you don't
      have a disk requiring stable page writes.
      
      Here's the result of using dbench to test latency on ext2:
      
      3.8.0-rc3:
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       WriteX        109347     0.028    59.817
       ReadX         347180     0.004     3.391
       Flush          15514    29.828   287.283
      
      Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
       WriteX        105556     0.029     4.273
       ReadX         335004     0.005     4.112
       Flush          14982    30.540   298.634
      
      Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, the maximum write latency drops considerably with this
      patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
      similarly, but see the cover letter for those results.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Acked-by: NSteven Whitehouse <swhiteho@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d1d1a76
  19. 05 2月, 2013 1 次提交
    • V
      nilfs2: fix fix very long mount time issue · a9bae189
      Vyacheslav Dubeyko 提交于
      There exists a situation when GC can work in background alone without
      any other filesystem activity during significant time.
      
      The nilfs_clean_segments() method calls nilfs_segctor_construct() that
      updates superblocks in the case of NILFS_SC_SUPER_ROOT and
      THE_NILFS_DISCONTINUED flags are set.  But when GC is working alone the
      nilfs_clean_segments() is called with unset THE_NILFS_DISCONTINUED flag.
      As a result, the update of superblocks doesn't occurred all this time
      and in the case of SPOR superblocks keep very old values of last super
      root placement.
      
      SYMPTOMS:
      
      Trying to mount a NILFS2 volume after SPOR in such environment ends with
      very long mounting time (it can achieve about several hours in some
      cases).
      
      REPRODUCING PATH:
      
      1. It needs to use external USB HDD, disable automount and doesn't
         make any additional filesystem activity on the NILFS2 volume.
      
      2. Generate temporary file with size about 100 - 500 GB (for example,
         dd if=/dev/zero of=<file_name> bs=1073741824 count=200).  The size of
         file defines duration of GC working.
      
      3. Then it needs to delete file.
      
      4. Start GC manually by means of command "nilfs-clean -p 0".  When you
         start GC by means of such way then, at the end, superblocks is updated
         by once.  So, for simulation of SPOR, it needs to wait sometime (15 -
         40 minutes) and simply switch off USB HDD manually.
      
      5. Switch on USB HDD again and try to mount NILFS2 volume.  As a
         result, NILFS2 volume will mount during very long time.
      
      REPRODUCIBILITY: 100%
      
      FIX:
      
      This patch adds checking that superblocks need to update and set
      THE_NILFS_DISCONTINUED flag before nilfs_clean_segments() call.
      Reported-by: NSergey Alexandrov <splavgm@gmail.com>
      Signed-off-by: NVyacheslav Dubeyko <slava@dubeyko.com>
      Tested-by: NVyacheslav Dubeyko <slava@dubeyko.com>
      Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9bae189
  20. 12 1月, 2013 1 次提交
  21. 21 12月, 2012 1 次提交
  22. 12 12月, 2012 1 次提交
    • R
      mm: redefine address_space.assoc_mapping · 252aa6f5
      Rafael Aquini 提交于
      Overhaul struct address_space.assoc_mapping renaming it to
      address_space.private_data and its type is redefined to void*.  By this
      approach we consistently name the .private_* elements from struct
      address_space as well as allow extended usage for address_space
      association with other data structures through ->private_data.
      
      Also, all users of old ->assoc_mapping element are converted to reflect
      its new name and type change (->private_data).
      Signed-off-by: NRafael Aquini <aquini@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      252aa6f5
  23. 09 10月, 2012 1 次提交
    • K
      mm: kill vma flag VM_CAN_NONLINEAR · 0b173bc4
      Konstantin Khlebnikov 提交于
      Move actual pte filling for non-linear file mappings into the new special
      vma operation: ->remap_pages().
      
      Filesystems must implement this method to get non-linear mapping support,
      if it uses filemap_fault() then generic_file_remap_pages() can be used.
      
      Now device drivers can implement this method and obtain nonlinear vma support.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b173bc4
  24. 03 10月, 2012 1 次提交
  25. 01 10月, 2012 1 次提交
    • T
      ext4: fix mtime update in nodelalloc mode · 041bbb6d
      Theodore Ts'o 提交于
      Commits 5e8830dc and 41c4d25f introduced a regression into
      v3.6-rc1 for ext4 in nodealloc mode, such that mtime updates would not
      take place for files modified via mmap if the page was already in the
      page cache.  This would also affect ext3 file systems mounted using
      the ext4 file system driver.
      
      The problem was that ext4_page_mkwrite() had a shortcut which would
      avoid calling __block_page_mkwrite() under some circumstances, and the
      above two commit transferred the responsibility of calling
      file_update_time() to __block_page_mkwrite --- which woudln't get
      called in some circumstances.
      
      Since __block_page_mkwrite() only has three callers,
      block_page_mkwrite(), ext4_page_mkwrite, and nilfs_page_mkwrite(), the
      best way to solve this is to move the responsibility for calling
      file_update_time() to its caller.
      
      This problem was found via xfstests #215 with a file system mounted
      with -o nodelalloc.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: stable@vger.kernel.org
      041bbb6d
  26. 21 9月, 2012 1 次提交
  27. 04 8月, 2012 1 次提交
  28. 31 7月, 2012 6 次提交