1. 26 9月, 2017 5 次提交
  2. 15 9月, 2017 9 次提交
  3. 14 9月, 2017 3 次提交
    • M
      mm: treewide: remove GFP_TEMPORARY allocation flag · 0ee931c4
      Michal Hocko 提交于
      GFP_TEMPORARY was introduced by commit e12ba74d ("Group short-lived
      and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  It's
      primary motivation was to allow users to tell that an allocation is
      short lived and so the allocator can try to place such allocations close
      together and prevent long term fragmentation.  As much as this sounds
      like a reasonable semantic it becomes much less clear when to use the
      highlevel GFP_TEMPORARY allocation flag.  How long is temporary? Can the
      context holding that memory sleep? Can it take locks? It seems there is
      no good answer for those questions.
      
      The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
      __GFP_RECLAIMABLE which in itself is tricky because basically none of
      the existing caller provide a way to reclaim the allocated memory.  So
      this is rather misleading and hard to evaluate for any benefits.
      
      I have checked some random users and none of them has added the flag
      with a specific justification.  I suspect most of them just copied from
      other existing users and others just thought it might be a good idea to
      use without any measuring.  This suggests that GFP_TEMPORARY just
      motivates for cargo cult usage without any reasoning.
      
      I believe that our gfp flags are quite complex already and especially
      those with highlevel semantic should be clearly defined to prevent from
      confusion and abuse.  Therefore I propose dropping GFP_TEMPORARY and
      replace all existing users to simply use GFP_KERNEL.  Please note that
      SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
      so they will be placed properly for memory fragmentation prevention.
      
      I can see reasons we might want some gfp flag to reflect shorterm
      allocations but I propose starting from a clear semantic definition and
      only then add users with proper justification.
      
      This was been brought up before LSF this year by Matthew [1] and it
      turned out that GFP_TEMPORARY really doesn't have a clear semantic.  It
      seems to be a heuristic without any measured advantage for most (if not
      all) its current users.  The follow up discussion has revealed that
      opinions on what might be temporary allocation differ a lot between
      developers.  So rather than trying to tweak existing users into a
      semantic which they haven't expected I propose to simply remove the flag
      and start from scratch if we really need a semantic for short term
      allocations.
      
      [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
      
      [akpm@linux-foundation.org: fix typo]
      [akpm@linux-foundation.org: coding-style fixes]
      [sfr@canb.auug.org.au: drm/i915: fix up]
        Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ee931c4
    • A
      fscache: fix fscache_objlist_show format processing · ebfddb3d
      Arnd Bergmann 提交于
      gcc points out a minor bug in the handling of unknown cookie types,
      which could result in a string overflow when the integer is copied into
      a 3-byte string:
      
        fs/fscache/object-list.c: In function 'fscache_objlist_show':
        fs/fscache/object-list.c:265:19: error: 'sprintf' may write a terminating nul past the end of the destination [-Werror=format-overflow=]
         sprintf(_type, "%02u", cookie->def->type);
                        ^~~~~~
        fs/fscache/object-list.c:265:4: note: 'sprintf' output between 3 and 4 bytes into a destination of size 3
      
      This is currently harmless as no code sets a type other than 0 or 1, but
      it makes sense to use snprintf() here to avoid overflowing the array if
      that changes.
      
      Link: http://lkml.kernel.org/r/20170714120720.906842-22-arnd@arndb.deSigned-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebfddb3d
    • A
      procfs: remove unused variable · 6dec0dd4
      Arnd Bergmann 提交于
      In NOMMU configurations, we get a warning about a variable that has become
      unused:
      
        fs/proc/task_nommu.c: In function 'nommu_vma_show':
        fs/proc/task_nommu.c:148:28: error: unused variable 'priv' [-Werror=unused-variable]
      
      Link: http://lkml.kernel.org/r/20170911200231.3171415-1-arnd@arndb.de
      Fixes: 1240ea0d ("fs, proc: remove priv argument from is_stack")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6dec0dd4
  4. 13 9月, 2017 4 次提交
    • R
      xfs: XFS_IS_REALTIME_INODE() should be false if no rt device present · b31ff3cd
      Richard Wareing 提交于
      If using a kernel with CONFIG_XFS_RT=y and we set the RHINHERIT flag on
      a directory in a filesystem that does not have a realtime device and
      create a new file in that directory, it gets marked as a real time file.
      When data is written and a fsync is issued, the filesystem attempts to
      flush a non-existent rt device during the fsync process.
      
      This results in a crash dereferencing a null buftarg pointer in
      xfs_blkdev_issue_flush():
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: xfs_blkdev_issue_flush+0xd/0x20
        .....
        Call Trace:
          xfs_file_fsync+0x188/0x1c0
          vfs_fsync_range+0x3b/0xa0
          do_fsync+0x3d/0x70
          SyS_fsync+0x10/0x20
          do_syscall_64+0x4d/0xb0
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Setting RT inode flags does not require special privileges so any
      unprivileged user can cause this oops to occur.  To reproduce, confirm
      kernel is compiled with CONFIG_XFS_RT=y and run:
      
        # mkfs.xfs -f /dev/pmem0
        # mount /dev/pmem0 /mnt/test
        # mkdir /mnt/test/foo
        # xfs_io -c 'chattr +t' /mnt/test/foo
        # xfs_io -f -c 'pwrite 0 5m' -c fsync /mnt/test/foo/bar
      
      Or just run xfstests with MKFS_OPTIONS="-d rtinherit=1" and wait.
      
      Kernels built with CONFIG_XFS_RT=n are not exposed to this bug.
      
      Fixes: f538d4da ("[XFS] write barrier support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NRichard Wareing <rwareing@fb.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b31ff3cd
    • C
      f2fs: hurry up to issue discard after io interruption · e6c6de18
      Chao Yu 提交于
      Once we encounter I/O interruption during issuing discards, we will delay
      long time before next round, but if system status is I/O idle during the
      time, it may loses opportunity to issue discards. So this patch changes
      to hurry up to issue discard after io interruption.
      
      Besides, this patch also fixes to issue discards accurately with assigned
      rate.
      Signed-off-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      e6c6de18
    • C
      f2fs: fix to show correct discard_granularity in sysfs · 80647e5f
      Chao Yu 提交于
      Fix below incorrect display when reading discard_granularity sysfs node.
      
      $ cat /sys/fs/f2fs/<device>/discard_granularity
      $ 16
      $ echo 32 > /sys/fs/f2fs/<device>/discard_granularity
      $ cat /sys/fs/f2fs/<device>/discard_granularity
      $ 16
      Signed-off-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      80647e5f
    • C
      f2fs: detect dirty inode in evict_inode · ca7d802a
      Chao Yu 提交于
      Add a bugon in f2fs_evict_inode to detect inconsistent status between
      inode cache and related node page cache.
      Signed-off-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      ca7d802a
  5. 12 9月, 2017 10 次提交
    • A
      ovl: fix false positive ESTALE on lookup · 939ae4ef
      Amir Goldstein 提交于
      Commit b9ac5c27 ("ovl: hash overlay non-dir inodes by copy up origin")
      verifies that the origin lower inode stored in the overlayfs inode matched
      the inode of a copy up origin dentry found by lookup.
      
      There is a false positive result in that check when lower fs does not
      support file handles and copy up origin cannot be followed by file handle
      at lookup time.
      
      The false negative happens when finding an overlay inode in cache on a
      copied up overlay dentry lookup. The overlay inode still 'remembers' the
      copy up origin inode, but the copy up origin dentry is not available for
      verification.
      
      Relax the check in case copy up origin dentry is not available.
      
      Fixes: b9ac5c27 ("ovl: hash overlay non-dir inodes by copy up...")
      Cc: <stable@vger.kernel.org> # v4.13
      Reported-by: NJordi Pujol <jordipujolp@gmail.com>
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      939ae4ef
    • M
      fuse: getattr cleanup · 5b97eeac
      Miklos Szeredi 提交于
      The refreshed argument isn't used by any caller, get rid of it.
      
      Use a helper for just updating the inode (no need to fill in a kstat).
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      5b97eeac
    • M
      fuse: honor iocb sync flags on write · e1c0eecb
      Miklos Szeredi 提交于
      If the IOCB_DSYNC flag is set a sync is not being performed by
      fuse_file_write_iter.
      
      Honor IOCB_DSYNC/IOCB_SYNC by setting O_DYSNC/O_SYNC respectively in the
      flags filed of the write request.
      
      We don't need to sync data or metadata, since fuse_perform_write() does
      write-through and the filesystem is responsible for updating file times.
      
      Original patch by Vitaly Zolotusky.
      Reported-by: NNate Clark <nate@neworld.us>
      Cc: Vitaly Zolotusky <vitaly@unitc.com>.
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      e1c0eecb
    • M
      fuse: allow server to run in different pid_ns · 5d6d3a30
      Miklos Szeredi 提交于
      Commit 0b6e9ea0 ("fuse: Add support for pid namespaces") broke
      Sandstorm.io development tools, which have been sending FUSE file
      descriptors across PID namespace boundaries since early 2014.
      
      The above patch added a check that prevented I/O on the fuse device file
      descriptor if the pid namespace of the reader/writer was different from the
      pid namespace of the mounter.  With this change passing the device file
      descriptor to a different pid namespace simply doesn't work.  The check was
      added because pids are transferred to/from the fuse userspace server in the
      namespace registered at mount time.
      
      To fix this regression, remove the checks and do the following:
      
      1) the pid in the request header (the pid of the task that initiated the
      filesystem operation) is translated to the reader's pid namespace.  If a
      mapping doesn't exist for this pid, then a zero pid is used.  Note: even if
      a mapping would exist between the initiator task's pid namespace and the
      reader's pid namespace the pid will be zero if either mapping from
      initator's to mounter's namespace or mapping from mounter's to reader's
      namespace doesn't exist.
      
      2) The lk.pid value in setlk/setlkw requests and getlk reply is left alone.
      Userspace should not interpret this value anyway.  Also allow the
      setlk/setlkw operations if the pid of the task cannot be represented in the
      mounter's namespace (pid being zero in that case).
      Reported-by: NKenton Varda <kenton@sandstorm.io>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Fixes: 0b6e9ea0 ("fuse: Add support for pid namespaces")
      Cc: <stable@vger.kernel.org> # v4.12+
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Seth Forshee <seth.forshee@canonical.com>
      5d6d3a30
    • D
      f2fs: clear radix tree dirty tag of pages whose dirty flag is cleared · 0abd8e70
      Daeho Jeong 提交于
      On a senario like writing out the first dirty page of the inode
      as the inline data, we only cleared dirty flags of the pages, but
      didn't clear the dirty tags of those pages in the radix tree.
      
      If we don't clear the dirty tags of the pages in the radix tree, the
      inodes which contain the pages will be marked with I_DIRTY_PAGES again
      and again, and writepages() for the inodes will be invoked in every
      writeback period. As a result, nothing will be done in every
      writepages() for the inodes and it will just consume CPU time
      meaninglessly.
      Signed-off-by: NDaeho Jeong <daeho.jeong@samsung.com>
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      0abd8e70
    • N
      NFS: various changes relating to reporting IO errors. · bf4b4905
      NeilBrown 提交于
      1/ remove 'start' and 'end' args from nfs_file_fsync_commit().
         They aren't used.
      
      2/ Make nfs_context_set_write_error() a "static inline" in internal.h
         so we can...
      
      3/ Use nfs_context_set_write_error() instead of mapping_set_error()
         if nfs_pageio_add_request() fails before sending any request.
         NFS generally keeps errors in the open_context, not the mapping,
         so this is more consistent.
      
      4/ If filemap_write_and_write_range() reports any error, still
         check ctx->error.  The value in ctx->error is likely to be
         more useful.  As part of this, NFS_CONTEXT_ERROR_WRITE is
         cleared slightly earlier, before nfs_file_fsync_commit() is called,
         rather than at the start of that function.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      bf4b4905
    • C
      NFS: Add static NFS I/O tracepoints · 8224b273
      Chuck Lever 提交于
      Tools like tcpdump and rpcdebug can be very useful. But there are
      plenty of environments where they are difficult or impossible to
      use. For example, we've had customers report I/O failures during
      workloads so heavy that collecting network traffic or enabling
      RPC debugging are themselves onerous.
      
      The kernel's static tracepoints are lightweight (less likely to
      introduce timing changes) and efficient (the trace data is compact).
      They also work in scenarios where capturing network traffic is not
      possible due to lack of hardware support (some InfiniBand HCAs) or
      where data or network privacy is a concern.
      
      Introduce tracepoints that show when an NFS READ, WRITE, or COMMIT
      is initiated, and when it completes. Record the arguments and
      results of each operation, which are not shown by existing sunrpc
      module's tracepoints.
      
      For instance, the recorded offset and count can be used to match an
      "initiate" event to a "done" event. If an NFS READ result returns
      fewer bytes than requested or zero, seeing the EOF flag can be
      probative. Seeing an NFS4ERR_BAD_STATEID result is also indication
      of a particular class of problems. The timing information attached
      to each event record can often be useful as well.
      
      Usage example:
      
      [root@manet tmp]# trace-cmd record -e nfs:*initiate* -e nfs:*done
      /sys/kernel/debug/tracing/events/nfs/*initiate*/filter
      /sys/kernel/debug/tracing/events/nfs/*done/filter
      Hit Ctrl^C to stop recording
      ^CKernel buffer statistics:
        Note: "entries" are the entries left in the kernel ring buffer and are not
              recorded in the trace data. They should all be zero.
      
      CPU: 0
      entries: 0
      overrun: 0
      commit overrun: 0
      bytes: 3680
      oldest event ts:    78.367422
      now ts:   100.124419
      dropped events: 0
      read events: 74
      
      ... and so on.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      8224b273
    • T
      pNFS: Use the standard I/O stateid when calling LAYOUTGET · 70d2f7b1
      Trond Myklebust 提交于
      Instead of having a private method for copying the open/delegation stateid,
      use the same call that is used for standard I/O through the MDS.
      
      Note that this means we transmit the stateid with a zero seqid, avoiding
      issues with NFS4ERR_OLD_STATEID.
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      70d2f7b1
    • J
      f2fs: speed up gc_urgent mode with SSR · b3a97a2a
      Jaegeuk Kim 提交于
      This patch activates SSR in gc_urgent mode.
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      b3a97a2a
    • J
      f2fs: better to wait for fstrim completion · 1eb1ef4a
      Jaegeuk Kim 提交于
      In android, we'd better wait for fstrim completion instead of issuing the
      discard commands asynchronous.
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      1eb1ef4a
  6. 11 9月, 2017 1 次提交
    • M
      dax: remove the pmem_dax_ops->flush abstraction · c3ca015f
      Mikulas Patocka 提交于
      Commit abebfbe2 ("dm: add ->flush() dax operation support") is
      buggy. A DM device may be composed of multiple underlying devices and
      all of them need to be flushed. That commit just routes the flush
      request to the first device and ignores the other devices.
      
      It could be fixed by adding more complex logic to the device mapper. But
      there is only one implementation of the method pmem_dax_ops->flush - that
      is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
      don't need the pmem_dax_ops->flush abstraction at all, we can call
      arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
      can't ever reach anything different from arch_wb_cache_pmem().
      
      It should be also pointed out that for some uses of persistent memory it
      is needed to flush only a very small amount of data (such as 1 cacheline),
      and it would be overkill if we go through that device mapper machinery for
      a single flushed cache line.
      
      Fix this by removing the pmem_dax_ops->flush abstraction and call
      arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
      mapper code that forwards the flushes.
      
      Fixes: abebfbe2 ("dm: add ->flush() dax operation support")
      Cc: stable@vger.kernel.org
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      c3ca015f
  7. 10 9月, 2017 4 次提交
  8. 09 9月, 2017 4 次提交