1. 07 Jan, 2011 1 commit
  2. 29 Dec, 2010 1 commit
    • pstore: new filesystem interface to platform persistent storage · ca01d6dd
      Committed by Tony Luck
      Some platforms have a small amount of non-volatile storage that
      can be used to store information useful to diagnose the cause of
      a system crash.  This is the generic part of a file system interface
      that presents information from the crash as a series of files in
      /dev/pstore.  Once the information has been seen, the underlying
      storage is freed by deleting the files.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
  3. 21 Dec, 2010 1 commit
  4. 18 Dec, 2010 2 commits
  5. 16 Dec, 2010 4 commits
  6. 15 Dec, 2010 2 commits
    • ext4: fix typo which broke '..' detection in ext4_find_entry() · 6d5c3aa8
      Committed by Aaro Koskinen
      There should be a check for the NUL character instead of '0'.
      
      Fortunately the only thing that cares about this is NFS serving, which
      is why we didn't notice this in the merge window testing.
      Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
      Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
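The difference between the NUL character and the digit '0' can be illustrated with a minimal sketch. This is not the kernel's code: the helper names are hypothetical, and the sketch assumes the check has the general shape of "first byte is a dot, second byte is a dot or the terminator". In this sketch the typo variant misses "." and wrongly accepts ".0".

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the kernel's code): returns 1 when a
 * NUL-terminated name of length namelen is "." or "..".
 */
int is_dot_or_dotdot(const char *name, size_t namelen)
{
    assert(name != NULL);
    if (namelen == 0 || namelen > 2 || name[0] != '.')
        return 0;
    /* The second byte is either '.' ("..") or the terminating NUL ("."). */
    return name[1] == '.' || name[1] == '\0';
}

/* Same test with the typo: '0' is the ASCII digit zero (0x30), not NUL. */
int is_dot_or_dotdot_typo(const char *name, size_t namelen)
{
    assert(name != NULL);
    if (namelen == 0 || namelen > 2 || name[0] != '.')
        return 0;
    return name[1] == '.' || name[1] == '0';
}
```

Calling `is_dot_or_dotdot(".", 1)` returns 1, while `is_dot_or_dotdot_typo(".", 1)` returns 0 because the terminator never compares equal to the character '0'.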
    • ext4: Turn off multiple page-io submission by default · 1449032b
      Committed by Theodore Ts'o
      Jon Nelson has found a test case which causes postgresql to fail with
      the error:
      
      psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581
      
      Under memory pressure, it looks like part of a file can end up getting
      replaced by zeros.  Until we can figure out the cause, we'll roll
      back the change and use block_write_full_page() instead of
      ext4_bio_write_page().  The new, more efficient writing function can
      be used via the mount option mblk_io_submit, so we can test and fix
      the new page I/O code.
      
      To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
      memory that the system is just at the point of triggering writeback
      before running the following SQL script:
      
      begin;
      create temporary table foo as select x as a, ARRAY[x] as b FROM
      generate_series(1, 10000000 ) AS x;
      create index foo_a_idx on foo (a);
      create index foo_b_idx on foo USING GIN (b);
      rollback;
      
      If the temporary table is created on a hard drive partition which is
      encrypted using dm_crypt, then under memory pressure, approximately
      30-40% of the time, pgsql will issue the above failure.
      
      This patch should fix this problem, and the problem will come back if
      the file system is mounted with the mblk_io_submit mount option.
      Reported-by: Jon Nelson <jnelson@jamponi.net>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  7. 14 Dec, 2010 3 commits
    • Btrfs: prevent RAID level downgrades when space is low · 83a50de9
      Committed by Chris Mason
      The extent allocator has code that allows us to fill
      allocations from any available block group, even if it doesn't
      match the raid level we've requested.
      
      This was put in because adding a new drive to a filesystem
      made with the default mkfs options actually upgrades the metadata from
      single spindle dup to full RAID1.
      
      But, the code also allows us to allocate from a raid0 chunk when we
      really want a raid1 or raid10 chunk.  This can cause big trouble because
      mkfs creates a small (4MB) raid0 chunk for data and metadata which then
      goes unused for raid1/raid10 installs.
      
      The allocator will happily wander in and allocate from that chunk when
      things get tight, which is not correct.
      
      The fix here is to make sure that we provide duplication when the
      caller has asked for it.  It allows the dups to be any raid level,
      which preserves the dup->raid1 upgrade abilities.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: account for missing devices in RAID allocation profiles · cd02dca5
      Committed by Chris Mason
      When we mount in RAID degraded mode without adding a new device to
      replace the failed one, we can end up using the wrong RAID flags for
      allocations.
      
      This results in strange combinations of block groups (raid1 in a raid10
      filesystem) and corruptions when we try to allocate blocks from single
      spindle chunks on drives that are actually missing.
      
      The first device has two small 4MB chunks in it that mkfs creates and
      these are usually unused in a raid1 or raid10 setup.  But, in -o degraded,
      the allocator will fall back to these because the mask of desired raid groups
      isn't correct.
      
      The fix here is to count the missing devices as we build up the list
      of devices in the system.  This count is used when picking the
      raid level to make sure we continue using the same levels that were
      in place before we lost a drive.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: EIO when we fail to read tree roots · 68433b73
      Committed by Chris Mason
      If we just get a plain IO error when we read tree roots, the code
      wasn't properly sending that error up the chain.  This allowed mounts to
      continue when they should have failed, and allowed operations
      on partially set up root structs.  The end result was usually oopsen
      on spinlocks that hadn't been spun up correctly.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  8. 11 Dec, 2010 8 commits
  9. 10 Dec, 2010 5 commits
    • xfs: log timestamp changes to the source inode in rename · 05340d4a
      Committed by Christoph Hellwig
      Now that we don't mark VFS inodes dirty anymore for internal
      timestamp changes, but rely on the transaction subsystem to push
      them out, we need to explicitly log the source inode in rename after
      updating its timestamps to make sure the changes actually get
      forced out by sync/fsync or an AIL push.
      
      We already account for the fourth inode in the log reservation, as a
      rename of directories needs to update the nlink field, so just
      adding the xfs_trans_log_inode call is enough.
      
      This fixes the xfsqa 065 regression introduced by:
      
      	"xfs: don't use vfs writeback for pure metadata modifications"
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • Btrfs: fixup return code for btrfs_del_orphan_item · 7e1fea73
      Committed by Josef Bacik
      If the orphan item doesn't exist, we return 1, which doesn't make any sense to
      the callers.  Instead return -ENOENT if we didn't find the item.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
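The return-code change can be sketched as a small mapping function. This is an illustration only: `orphan_del_result` is a hypothetical name, and the sketch assumes the common search convention where a negative value is an error, 0 means the item was found, and a positive value means it was not found.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for handling a tree search result, assuming the
 * convention: < 0 is an error, 0 means found, > 0 means not found.
 */
int orphan_del_result(int search_ret)
{
    if (search_ret > 0)
        return -ENOENT;   /* "not found" becomes a real errno value */
    return search_ret;    /* 0 or an existing negative errno passes through */
}

/* Small usage check of the mapping above. */
void orphan_del_result_examples(void)
{
    assert(orphan_del_result(1) == -ENOENT);
    assert(orphan_del_result(0) == 0);
    assert(orphan_del_result(-EIO) == -EIO);
}
```

Callers that only test for a negative return then treat a missing item like any other failure instead of having to interpret a bare 1.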
    • Btrfs: do not do fast caching if we are allocating blocks for tree_root · b8399dee
      Committed by Josef Bacik
      Since the fast caching uses normal tree locking, we can possibly deadlock if we
      get to the caching via a btrfs_search_slot() on the tree_root.  So just check to
      see if the root we are on is the tree root, and just don't do the fast caching.
      Reported-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: deal with space cache errors better · 2b20982e
      Committed by Josef Bacik
      Currently if the space cache inode generation number doesn't match the
      generation number in the space cache header we will just fail to load the space
      cache, but we won't mark the space cache as an error, so we'll keep getting that
      error each time somebody tries to cache that block group until we actually clear
      the thing.  Fix this by marking the space cache as having an error so we only
      get the message once.  This patch also makes it so that we don't try to set up
      space cache for a block group that isn't cached, since we won't be able to write
      it out anyway.  None of these problems are actual problems, they are just
      annoying and sub-optimal.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
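The "report once, then remember the failure" idea can be sketched with a mock structure. Everything here is hypothetical (the struct, its fields, and the function names are not Btrfs definitions); the sketch only assumes that a generation mismatch is detected on load and that an error flag suppresses repeated attempts.

```c
#include <assert.h>
#include <stddef.h>

/* Mock block group; all fields are hypothetical, not Btrfs structures. */
struct mock_block_group {
    unsigned long inode_generation;
    unsigned long header_generation;
    int cache_error;   /* set once the cache is known to be unusable */
    int warnings;      /* how many times the mismatch was reported */
};

/* Returns 1 if the cache was loaded, 0 otherwise. */
int mock_load_space_cache(struct mock_block_group *bg)
{
    assert(bg != NULL);
    if (bg->cache_error)
        return 0;                 /* known bad: skip quietly */
    if (bg->inode_generation != bg->header_generation) {
        bg->warnings++;           /* report the mismatch once ... */
        bg->cache_error = 1;      /* ... and remember the failure */
        return 0;
    }
    return 1;
}

/* Number of warnings produced by n load attempts on a mismatched cache. */
int warnings_after_attempts(int n)
{
    struct mock_block_group bg = { 7, 8, 0, 0 };
    for (int i = 0; i < n; i++)
        mock_load_space_cache(&bg);
    return bg.warnings;
}
```

Without the `cache_error` flag, every attempt in the loop would hit the mismatch branch and report it again.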
    • Btrfs: fix use after free in O_DIRECT · 955256f2
      Committed by Josef Bacik
      This fixes a bug where we use dip after we have freed it.  Instead just use the
      file_offset that was passed to the function.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  10. 09 Dec, 2010 2 commits
  11. 08 Dec, 2010 11 commits
    • nfs: remove extraneous and problematic calls to nfs_clear_request · 2df485a7
      Committed by Trond Myklebust
      When a nfs_page is freed, nfs_free_request is called which also calls
      nfs_clear_request to clean out the lock and open contexts and free the
      pagecache page.
      
      However, a couple of places in the nfs code call nfs_clear_request
      themselves. What happens here if the refcount on the request is still high?
      We'll be releasing contexts and freeing pointers while the request is
      possibly still in use.
      
      Remove those bare calls to nfs_clear_request. That should only be done when
      the request is being freed.
      
      Note that when doing this, we need to watch out for tests of req->wb_page.
      Previously, nfs_set_page_tag_locked() and nfs_clear_page_tag_locked()
      would check the value of req->wb_page to figure out if the page is mapped
      into the nfsi->nfs_page_tree. We now indicate the page is mapped using
      the new bit PG_MAPPED in req->wb_flags.
      Reported-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • nfs: kernel should return EPROTONOSUPPORT when not support NFSv4 · 0de1b7e8
      Committed by Mi Jinlong
        When the NFS client (kernel) doesn't support NFSv4, for example
        because the user built the kernel without NFSv4, there is a problem.
      
        Using the command "mount SERVER-IP:/nfsv3 /mnt/" to mount an NFSv3
        filesystem, the mount should succeed, but it fails with the error:
      
          "mount.nfs: an incorrect mount option was specified"
      
        The mount system call is issued with type "nfs" (not "nfs4") and
        "vers=4"; if CONFIG_NFS_V4 is not defined, "vers=4" is parsed
        as an invalid argument and the kernel returns EINVAL to nfs-utils.
      
        In that case we really want to get EPROTONOSUPPORT rather than
        EINVAL. This patch makes sure the kernel parses the argument
        successfully and returns EPROTONOSUPPORT from nfs_validate_mount_data().
      Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
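The split between "parse successfully" and "reject during validation" can be sketched in user-space C. This is a simplified model, not the NFS mount code: all three function names are hypothetical, and the `v4_built_in` parameter stands in for whether CONFIG_NFS_V4 is defined.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Parse step: accepts "vers=2", "vers=3" and "vers=4" unconditionally. */
int parse_vers_option(const char *opt, int *version)
{
    assert(opt != NULL && version != NULL);
    if (strncmp(opt, "vers=", 5) != 0)
        return -EINVAL;
    if (strcmp(opt + 5, "2") == 0)
        *version = 2;
    else if (strcmp(opt + 5, "3") == 0)
        *version = 3;
    else if (strcmp(opt + 5, "4") == 0)
        *version = 4;          /* accepted even when v4 is not built in */
    else
        return -EINVAL;        /* genuinely malformed value */
    return 0;
}

/* Validation step: a known but unsupported version gets a distinct error. */
int validate_version(int version, int v4_built_in)
{
    if (version == 4 && !v4_built_in)
        return -EPROTONOSUPPORT;
    return 0;
}

/* Combined check, as a caller would see it. */
int check_mount_option(const char *opt, int v4_built_in)
{
    int version = 0;
    int ret = parse_vers_option(opt, &version);
    if (ret)
        return ret;
    return validate_version(version, v4_built_in);
}
```

With this split, `check_mount_option("vers=4", 0)` yields `-EPROTONOSUPPORT`, which lets a caller distinguish "protocol version not available" from a malformed option such as `"vers=9"`, which still yields `-EINVAL`.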
    • NFS: Fix fcntl F_GETLK not reporting some conflicts · 21ac19d4
      Committed by Sergey Vlasov
      The commit 129a84de (locks: fix F_GETLK
      regression (failure to find conflicts)) fixed the posix_test_lock()
      function by itself, however, its usage in NFS changed by the commit
      9d6a8c5c (locks: give posix_test_lock
      same interface as ->lock) remained broken - subsequent NFS-specific
      locking code received F_UNLCK instead of the user-specified lock type.
      To fix the problem, fl->fl_type needs to be saved before the
      posix_test_lock() call and restored if no local conflicts were reported.
      
      Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23892
      Tested-by: Alexander Morozov <amorozov@etersoft.ru>
      Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
      Cc: <stable@kernel.org>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
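The save-and-restore pattern described in this commit can be sketched with a mock. This is not the NFS code: the function names are hypothetical, and the mock assumes only what the message states, namely that the local test overwrites the lock type with F_UNLCK when it finds no local conflict. For simplicity a local conflict leaves the type unchanged here; real code would report the conflict and stop.

```c
#include <assert.h>
#include <fcntl.h>

/*
 * Mock of a local conflict test, assuming it replaces the lock type
 * with F_UNLCK when no local conflict exists.
 */
short mock_local_test(short type, int local_conflict)
{
    return local_conflict ? type : (short)F_UNLCK;
}

/* Lock type handed to the later, server-side check (with the fix). */
short type_for_server_check(short requested, int local_conflict)
{
    short saved = requested;                 /* save before the local test */
    short type = mock_local_test(requested, local_conflict);
    if (type == F_UNLCK)
        type = saved;                        /* no local conflict: restore */
    return type;
}

/* Without save/restore, the server-side check would see F_UNLCK. */
short type_for_server_check_unfixed(short requested, int local_conflict)
{
    return mock_local_test(requested, local_conflict);
}

/* Small usage check contrasting the two variants. */
void getlk_examples(void)
{
    assert(type_for_server_check(F_WRLCK, 0) == F_WRLCK);
    assert(type_for_server_check_unfixed(F_WRLCK, 0) == F_UNLCK);
}
```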
    • nfs: Discard ACL cache on mode update · 08a22b39
      Committed by Aneesh Kumar K.V
      An update of the mode bits can result in the ACL value being changed. We need
      to mark the ACL cache invalid when we update the mode. Similarly, we need
      to update the file attributes when we change the ACL value.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • fanotify: Dont try to open a file descriptor for the overflow event · fdbf3cee
      Committed by Lino Sanfilippo
      We should not try to open a file descriptor for the overflow event since this
      will always fail.
      Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Signed-off-by: Eric Paris <eparis@redhat.com>
    • fanotify: do not leak user reference on allocation failure · 26379198
      Committed by Eric Paris
      If fanotify_init is unable to allocate a new fsnotify group it will
      return but will not drop its reference on the associated user struct.
      Drop that reference on error.
      Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: Eric Paris <eparis@redhat.com>
    • inotify: stop kernel memory leak on file creation failure · a2ae4cc9
      Committed by Eric Paris
      If inotify_init is unable to allocate a new file for the new inotify
      group we leak the new group.  This patch drops the reference on the
      group on file allocation failure.
      Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
      cc: stable@kernel.org
      Signed-off-by: Eric Paris <eparis@redhat.com>
    • fanotify: on group destroy allow all waiters to bypass permission check · 09e5f14e
      Committed by Lino Sanfilippo
      When fanotify_release() is called, there may still be processes waiting for
      access permission. Currently only processes for which an event has already been
      queued into the group's access list will be woken up.  Processes for which no
      event has been queued will continue to sleep and thus cause a deadlock when
      fsnotify_put_group() is called.
      Furthermore there is a race allowing further processes to be waiting on the
      access wait queue after wake_up (if they arrive before clear_marks_by_group()
      is called).
      This patch corrects this by setting a flag to inform processes that the group
      is about to be destroyed and thus not to wait for access permission.
      
      [additional changelog from eparis]
      Let's think about the 4 relevant code paths from the PoV of the
      'operator' 'listener' 'responder' and 'closer'.  Where operator is the
      process doing an action (like open/read) which could require permission.
      Listener is the task (or in this case thread) slated with reading from
      the fanotify file descriptor.  The 'responder' is the thread responsible
      for responding to access requests.  'Closer' is the thread attempting to
      close the fanotify file descriptor.
      
      The 'operator' is going to end up in:
      fanotify_handle_event()
        get_response_from_access()
          (THIS BLOCKS WAITING ON USERSPACE)
      
      The 'listener' interesting code path:
      fanotify_read()
        copy_event_to_user()
          prepare_for_access_response()
            (THIS CREATES AN fanotify_response_event)
      
      The 'responder' code path:
      fanotify_write()
        process_access_response()
          (REMOVE A fanotify_response_event, SET RESPONSE, WAKE UP 'operator')
      
      The 'closer':
      fanotify_release()
        (SUPPOSED TO CLEAN UP THE REST OF THIS MESS)
      
      What we have today is that in the closer we remove all of the
      fanotify_response_events and set a bit so no more response events are
      ever created in prepare_for_access_response().
      
      The bug is that we never wake all of the operators up and tell them to
      move along.  You fix that in fanotify_get_response_from_access().  You
      also fix other operators which haven't gotten there yet.  So I agree
      that's a good fix.
      [/additional changelog from eparis]
      
      [remove additional changes to minimize patch size]
      [move initialization so it was inside CONFIG_FANOTIFY_PERMISSION]
      Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Signed-off-by: Eric Paris <eparis@redhat.com>
    • fanotify: Dont allow a mask of 0 if setting or removing a mark · 1734dee4
      Committed by Lino Sanfilippo
      In mark_remove_from_mask() we destroy marks that have their event mask cleared.
      Thus we should not allow the creation of those marks in the first place.
      With this patch we check if the mask given from user is 0 in case of FAN_MARK_ADD.
      If so we return an error. Same for FAN_MARK_REMOVE since this does not have any
      effect.
      Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Signed-off-by: Eric Paris <eparis@redhat.com>
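The zero-mask check can be sketched as a small validation function. The constants and function names below are hypothetical placeholders, not the real FAN_MARK_* values, and the use of `-EINVAL` is an assumption: the message only says that an error is returned.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical action values for illustration; not the real FAN_MARK_* flags. */
enum { DEMO_MARK_ADD = 1, DEMO_MARK_REMOVE = 2 };

/* Rejects add/remove requests that carry an empty event mask. */
int validate_mark_request(unsigned int action, unsigned int mask)
{
    switch (action) {
    case DEMO_MARK_ADD:
    case DEMO_MARK_REMOVE:
        if (mask == 0)
            return -EINVAL;   /* assumed error code for an empty mask */
        return 0;
    default:
        return -EINVAL;       /* unknown action in this sketch */
    }
}

/* Small usage check of the validation above. */
void validate_mark_request_examples(void)
{
    assert(validate_mark_request(DEMO_MARK_ADD, 0) == -EINVAL);
    assert(validate_mark_request(DEMO_MARK_ADD, 0x1) == 0);
}
```

Rejecting the request up front means a mark with an empty mask is never created, so there is nothing for the removal path to destroy later.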
    • fanotify: correct broken ref counting in case adding a mark failed · fa218ab9
      Committed by Lino Sanfilippo
      If adding a mount or inode mark failed fanotify_free_mark() is called explicitly.
      But at this time the mark has already been put into the destroy list of the
      fsnotify_mark kernel thread. If the thread is too slow it will try to decrease
      the reference of a mark, that has already been freed by fanotify_free_mark().
      (If it's fast enough it will only decrease the mark's ref counter from 2 to 1 - note
      that the counter has been increased to 2 in add_mark() - which has practically no
      effect.)
      
      This patch fixes the ref counting by not calling free_mark() explicitly, but
      decreasing the ref counter and relying on the fsnotify_mark thread to clean up in
      case adding the mark has failed.
      Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Signed-off-by: Eric Paris <eparis@redhat.com>
    • fanotify: if set by user unset FMODE_NONOTIFY before fsnotify_perm() is called · b1085ba8
      Committed by Lino Sanfilippo
      Unsetting FMODE_NONOTIFY in fsnotify_open() is too late, since fsnotify_perm()
      is called before. If FMODE_NONOTIFY is set fsnotify_perm() will skip permission
      checks, so a user can still disable permission checks by setting this flag
      in an open() call.
      This patch corrects this by unsetting the flag before fsnotify_perm is called.
      Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Signed-off-by: Eric Paris <eparis@redhat.com>