1. 12 1月, 2016 3 次提交
    • D
      dde7f55b
    • D
      xfs: handle dquot buffer readahead in log recovery correctly · 7d6a13f0
      Dave Chinner 提交于
      When we do dquot readahead in log recovery, we do not use a verifier
      as the underlying buffer may not have dquots in it. e.g. the
      allocation operation hasn't yet been replayed. Hence we do not want
      to fail recovery because we detect an operation to be replayed has
      not been run yet. This problem was addressed for inodes in commit
      d8914002 ("xfs: inode buffers may not be valid during recovery
      readahead") but the problem was not recognised to exist for dquots
      and their buffers as the dquot readahead did not have a verifier.
      
      The result of not using a verifier is that when the buffer is then
      next read to replay a dquot modification, the dquot buffer verifier
      will only be attached to the buffer if *readahead is not complete*.
      Hence we can read the buffer, replay the dquot changes and then add
      it to the delwri submission list without it having a verifier
      attached to it. This then generates warnings in xfs_buf_ioapply(),
      which catches and warns about this case.
      
      Fix this and make it handle the same readahead verifier error cases
      as for inode buffers by adding a new readahead verifier that has a
      write operation as well as a read operation that marks the buffer as
      not done if any corruption is detected.  Also make sure we don't run
      readahead if the dquot buffer has been marked as cancelled by
      recovery.
      
      This will result in readahead either succeeding and the buffer
      having a valid write verifier, or readahead failing and the buffer
      state requiring the subsequent read to resubmit the IO with the new
      verifier.  In either case, this will result in the buffer always
      ending up with a valid write verifier on it.
      
      Note: we also need to fix the inode buffer readahead error handling
      to mark the buffer with EIO. Brian noticed the code I copied from
      there wrong during review, so fix it at the same time. Add comments
      linking the two functions that handle readahead verifier errors
      together so we don't forget this behavioural link in future.
      
      cc: <stable@vger.kernel.org> # 3.12 - current
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      7d6a13f0
    • D
      xfs: inode recovery readahead can race with inode buffer creation · b79f4a1c
      Dave Chinner 提交于
      When we do inode readahead in log recovery, we do can do the
      readahead before we've replayed the icreate transaction that stamps
      the buffer with inode cores. The inode readahead verifier catches
      this and marks the buffer as !done to indicate that it doesn't yet
      contain valid inodes.
      
      In adding buffer error notification  (i.e. setting b_error = -EIO at
      the same time as as we clear the done flag) to such a readahead
      verifier failure, we can then get subsequent inode recovery failing
      with this error:
      
      XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32
      
      This occurs when readahead completion races with icreate item replay
      such as:
      
      	inode readahead
      		find buffer
      		lock buffer
      		submit RA io
      	....
      	icreate recovery
      	    xfs_trans_get_buffer
      		find buffer
      		lock buffer
      		<blocks on RA completion>
      	.....
      	<ra completion>
      		fails verifier
      		clear XBF_DONE
      		set bp->b_error = -EIO
      		release and unlock buffer
      	<icreate gains lock>
      	icreate initialises buffer
      	marks buffer as done
      	adds buffer to delayed write queue
      	releases buffer
      
      At this point, we have an initialised inode buffer that is up to
      date but has an -EIO state registered against it. When we finally
      get to recovering an inode in that buffer:
      
      	inode item recovery
      	    xfs_trans_read_buffer
      		find buffer
      		lock buffer
      		sees XBF_DONE is set, returns buffer
      	    sees bp->b_error is set
      		fail log recovery!
      
      Essentially, we need xfs_trans_get_buf_map() to clear the error status of
      the buffer when doing a lookup. This function returns uninitialised
      buffers, so the buffer returned can not be in an error state and
      none of the code that uses this function expects b_error to be set
      on return. Indeed, there is an ASSERT(!bp->b_error); in the
      transaction case in xfs_trans_get_buf_map() that would have caught
      this if log recovery used transactions....
      
      This patch firstly changes the inode readahead failure to set -EIO
      on the buffer, and secondly changes xfs_buf_get_map() to never
      return a buffer with an error state set so this first change doesn't
      cause unexpected log recovery failures.
      
      cc: <stable@vger.kernel.org> # 3.12 - current
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      b79f4a1c
  2. 11 1月, 2016 1 次提交
    • E
      xfs: eliminate committed arg from xfs_bmap_finish · f6106efa
      Eric Sandeen 提交于
      Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
      associated comments were replicated several times across
      the attribute code, all dealing with what to do if the
      transaction was or wasn't committed.
      
      And in that replicated code, an ASSERT() test of an
      uninitialized variable occurs in several locations:
      
      	error = xfs_attr_thing(&args);
      	if (!error) {
      		error = xfs_bmap_finish(&args.trans, args.flist,
      					&committed);
      	}
      	if (error) {
      		ASSERT(committed);
      
      If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
      never set "committed", and then test it in the ASSERT.
      
      Fix this up by moving the committed state internal to xfs_bmap_finish,
      and add a new inode argument.  If an inode is passed in, it is passed
      through to __xfs_trans_roll() and joined to the transaction there if
      the transaction was committed.
      
      xfs_qm_dqalloc() was a little unique in that it called bjoin rather
      than ijoin, but as Dave points out we can detect the committed state
      but checking whether (*tpp != tp).
      
      Addresses-Coverity-Id: 102360
      Addresses-Coverity-Id: 102361
      Addresses-Coverity-Id: 102363
      Addresses-Coverity-Id: 102364
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      f6106efa
  3. 08 1月, 2016 2 次提交
    • D
      xfs: bmapbt checking on debug kernels too expensive · e3543819
      Dave Chinner 提交于
      For large sparse or fragmented files, checking every single entry in
      the bmapbt on every operation is prohibitively expensive. Especially
      as such checks rarely discover problems during normal operations on
      high extent coutn files. Our regression tests don't tend to exercise
      files with hundreds of thousands to millions of extents, so mostly
      this isn't noticed.
      
      However, trying to run things like xfs_mdrestore of large filesystem
      dumps on a debug kernel quickly becomes impossible as the CPU is
      completely burnt up repeatedly walking the sparse file bmapbt that
      is generated for every allocation that is made.
      
      Hence, if the file has more than 10,000 extents, just don't bother
      with walking the tree to check it exhaustively. The btree code has
      checks that ensure that the newly inserted/removed/modified record
      is correctly ordered, so the entrie tree walk in thses cases has
      limited additional value.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      e3543819
    • D
      xfs: add tracepoints to readpage calls · 121e213e
      Dave Chinner 提交于
      This allows us to see page cache driven readahead in action as it
      passes through XFS. This helps to understand buffered read
      throughput problems such as readahead IO IO sizes being too small
      for the underlying device to reach max throughput.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      121e213e
  4. 05 1月, 2016 4 次提交
    • D
      Merge branch 'xfs-dax-fixes-for-4.5' into for-next · 4922be51
      Dave Chinner 提交于
      4922be51
    • D
      Merge branch 'xfs-misc-fixes-for-4.5' into for-next · 7eeabbd4
      Dave Chinner 提交于
      7eeabbd4
    • B
      xfs: debug mode log record crc error injection · 609adfc2
      Brian Foster 提交于
      XFS now uses CRC verification over a limited section of the log to
      detect torn writes prior to a crash. This is difficult to test directly
      due to the timing and hardware requirements to cause a short write.
      
      Add a mechanism to inject CRC errors into log records to facilitate
      testing torn write detection during log recovery. This mechanism is
      dangerous and can result in filesystem corruption. Thus, it is only
      available in DEBUG mode for testing/development purposes. Set a non-zero
      value to the following sysfs entry to enable error injection:
      
      	/sys/fs/xfs/<dev>/log/log_badcrc_factor
      
      Once enabled, XFS intentionally writes an invalid CRC to a log record at
      some random point in the future based on the provided frequency. The
      filesystem immediately shuts down once the record has been written to
      the physical log to prevent metadata writeback (e.g., AIL insertion)
      once the log write completes. This helps reasonably simulate a torn
      write to the log as the affected record must be safe to discard. The
      next mount after the intentional shutdown requires log recovery and
      should detect and recover from the torn write.
      
      Note again that this _will_ result in data loss or worse. For testing
      and development purposes only!
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      609adfc2
    • B
      xfs: detect and trim torn writes during log recovery · 7088c413
      Brian Foster 提交于
      Certain types of storage, such as persistent memory, do not provide
      sector atomicity for writes. This means that if a crash occurs while XFS
      is writing log records, only part of those records might make it to the
      storage. This is problematic because log recovery uses the cycle value
      packed at the top of each log block to locate the head/tail of the log.
      This can lead to CRC verification failures during log recovery and an
      unmountable fs for a filesystem that is otherwise consistent.
      
      Update log recovery to incorporate log record CRC verification as part
      of the head/tail discovery process. Once the head is located via the
      traditional algorithm, run a CRC-only pass over the records up to the
      head of the log. If CRC verification fails, assume that the records are
      torn as a matter of policy and trim the head block back to the start of
      the first bad record.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      7088c413
  5. 04 1月, 2016 22 次提交
  6. 01 1月, 2016 5 次提交
    • L
      Merge tag 'pci-v4.4-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 9c982e86
      Linus Torvalds 提交于
      Pull PCI bugfix from Bjorn Helgaas:
       "Here's another fix for v4.4.
      
        This fixes 32-bit config reads for the HiSilicon driver.  Obviously
        the driver is completely broken without this fix (apparently it
        actually was tested internally, but got broken somehow in the process
        of upstreaming it).
      
        Summary:
      
        HiSilicon host bridge driver
          Fix 32-bit config reads (Dongdong Liu)"
      
      * tag 'pci-v4.4-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
        PCI: hisi: Fix hisi_pcie_cfg_read() 32-bit reads
      9c982e86
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · 7c672dd6
      Linus Torvalds 提交于
      Pull sparc fixes from David Miller:
       "Just some missing syscall wire ups"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
        sparc: Wire up mlock2 system call.
        sparc: Add all necessary direct socket system calls.
      7c672dd6
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 8f5daf2a
      Linus Torvalds 提交于
      Pull networking fixes from David Miller:
      
       1) Prevent XFRM per-cpu counter updates for one namespace from being
          applied to another namespace.  Fix from DanS treetman.
      
       2) Fix RCU de-reference in iwl_mvm_get_key_sta_id(), from Johannes
          Berg.
      
       3) Remove ethernet header assumption in nft_do_chain_netdev(), from
          Pablo Neira Ayuso.
      
       4) Fix cpsw PHY ident with multiple slaves and fixed-phy, from Pascal
          Speck.
      
       5) Fix use after free in sixpack_close and mkiss_close.
      
       6) Fix VXLAN fw assertion on bnx2x, from Yuval Mintz.
      
       7) natsemi doesn't check for DMA mapping errors, from Alexey
          Khoroshilov.
      
       8) Fix inverted test in ip6addrlbl_get(), from ANdrey Ryabinin.
      
       9) Missing initialization of needed_headroom in geneve tunnel driver,
          from Paolo Abeni.
      
      10) Fix conntrack template leak in openvswitch, from Joe Stringer.
      
      11) Mission initialization of wq->flags in sock_alloc_inode(), from
          Nicolai Stange.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
        sctp: sctp should release assoc when sctp_make_abort_user return NULL in sctp_close
        net, socket, socket_wq: fix missing initialization of flags
        drivers: net: cpsw: fix error return code
        openvswitch: Fix template leak in error cases.
        sctp: label accepted/peeled off sockets
        sctp: use GFP_USER for user-controlled kmalloc
        qlcnic: fix a loop exit condition better
        net: cdc_ncm: avoid changing RX/TX buffers on MTU changes
        geneve: initialize needed_headroom
        ipv6: honor ifindex in case we receive ll addresses in router advertisements
        addrconf: always initialize sysctl table data
        ipv6/addrlabel: fix ip6addrlbl_get()
        switchdev: bridge: Pass ageing time as clock_t instead of jiffies
        sh_eth: fix 16-bit descriptor field access endianness too
        veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
        net: usb: cdc_ncm: Adding Dell DW5813 LTE AT&T Mobile Broadband Card
        net: usb: cdc_ncm: Adding Dell DW5812 LTE Verizon Mobile Broadband Card
        natsemi: add checks for dma mapping errors
        rhashtable: Kill harmless RCU warning in rhashtable_walk_init
        openvswitch: correct encoding of set tunnel action attributes
        ...
      8f5daf2a
    • D
      sparc: Wire up mlock2 system call. · 42d85c52
      David S. Miller 提交于
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      42d85c52
    • D
      sparc: Add all necessary direct socket system calls. · 8b30ca73
      David S. Miller 提交于
      The GLIBC folks would like to eliminate socketcall support
      eventually, and this makes sense regardless so wire them
      all up.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b30ca73
  7. 31 12月, 2015 3 次提交
    • X
      sctp: sctp should release assoc when sctp_make_abort_user return NULL in sctp_close · 068d8bd3
      Xin Long 提交于
      In sctp_close, sctp_make_abort_user may return NULL because of memory
      allocation failure. If this happens, it will bypass any state change
      and never free the assoc. The assoc has no chance to be freed and it
      will be kept in memory with the state it had even after the socket is
      closed by sctp_close().
      
      So if sctp_make_abort_user fails to allocate memory, we should abort
      the asoc via sctp_primitive_ABORT as well. Just like the annotation in
      sctp_sf_cookie_wait_prm_abort and sctp_sf_do_9_1_prm_abort said,
      "Even if we can't send the ABORT due to low memory delete the TCB.
      This is a departure from our typical NOMEM handling".
      
      But then the chunk is NULL (low memory) and the SCTP_CMD_REPLY cmd would
      dereference the chunk pointer, and system crash. So we should add
      SCTP_CMD_REPLY cmd only when the chunk is not NULL, just like other
      places where it adds SCTP_CMD_REPLY cmd.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      068d8bd3
    • D
      Merge tag 'wireless-drivers-for-davem-2015-12-28' of... · a0ccc3f2
      David S. Miller 提交于
      Merge tag 'wireless-drivers-for-davem-2015-12-28' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers
      
      Kalle Valo says:
      
      ====================
      iwlwifi
      
      * don't load firmware that won't exist for 7260
      * fix RCU splat
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a0ccc3f2
    • N
      net, socket, socket_wq: fix missing initialization of flags · 574aab1e
      Nicolai Stange 提交于
      Commit ceb5d58b ("net: fix sock_wake_async() rcu protection") from
      the current 4.4 release cycle introduced a new flags member in
      struct socket_wq and moved SOCKWQ_ASYNC_NOSPACE and SOCKWQ_ASYNC_WAITDATA
      from struct socket's flags member into that new place.
      
      Unfortunately, the new flags field is never initialized properly, at least
      not for the struct socket_wq instance created in sock_alloc_inode().
      
      One particular issue I encountered because of this is that my GNU Emacs
      failed to draw anything on my desktop -- i.e. what I got is a transparent
      window, including the title bar. Bisection lead to the commit mentioned
      above and further investigation by means of strace told me that Emacs
      is indeed speaking to my Xorg through an O_ASYNC AF_UNIX socket. This is
      reproducible 100% of times and the fact that properly initializing the
      struct socket_wq ->flags fixes the issue leads me to the conclusion that
      somehow SOCKWQ_ASYNC_WAITDATA got set in the uninitialized ->flags,
      preventing my Emacs from receiving any SIGIO's due to data becoming
      available and it got stuck.
      
      Make sock_alloc_inode() set the newly created struct socket_wq's ->flags
      member to zero.
      
      Fixes: ceb5d58b ("net: fix sock_wake_async() rcu protection")
      Signed-off-by: NNicolai Stange <nicstange@gmail.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      574aab1e