1. 08 5月, 2013 13 次提交
    • K
      aio: kill ki_retry · 41ef4eb8
      Kent Overstreet 提交于
      Thanks to Zach Brown's work to rip out the retry infrastructure, we don't
      need this anymore - ki_retry was only called right after the kiocb was
      initialized.
      
      This also refactors and trims some duplicated code, as well as cleaning up
      the refcounting/error handling a bit.
      
      [akpm@linux-foundation.org: use fmode_t in aio_run_iocb()]
      [akpm@linux-foundation.org: fix file_start_write/file_end_write tests]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41ef4eb8
    • K
      aio: kill ki_key · 8a660890
      Kent Overstreet 提交于
      ki_key wasn't actually used for anything previously - it was always 0.
      Drop it to trim struct kiocb a bit.
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a660890
    • K
      aio: kill batch allocation · a1c8eae7
      Kent Overstreet 提交于
      Previously, allocating a kiocb required touching quite a few global
      (well, per kioctx) cachelines...  so batching up allocation to amortize
      those was worthwhile.  But we've gotten rid of some of those, and in
      another couple of patches kiocb allocation won't require writing to any
      shared cachelines, so that means we can just rip this code out.
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a1c8eae7
    • K
      aio: use cancellation list lazily · 0460fef2
      Kent Overstreet 提交于
      Cancelling kiocbs requires adding them to a per kioctx linked list,
      which is one of the few things we need to take the kioctx lock for in
      the fast path.  But most kiocbs can't be cancelled - so if we just do
      this lazily, we can avoid quite a bit of locking overhead.
      
      While we're at it, instead of using a flag bit switch to using ki_cancel
      itself to indicate that a kiocb has been cancelled/completed.  This lets
      us get rid of ki_flags entirely.
      
      [akpm@linux-foundation.org: remove buggy BUG()]
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0460fef2
    • K
      wait: add wait_event_hrtimeout() · 774a08b3
      Kent Overstreet 提交于
      Analagous to wait_event_timeout() and friends, this adds
      wait_event_hrtimeout() and wait_event_interruptible_hrtimeout().
      
      Note that unlike the versions that use regular timers, these don't
      return the amount of time remaining when they return - instead, they
      return 0 or -ETIME if they timed out.  because I was uncomfortable with
      the semantics of doing it the other way (that I could get it right,
      anyways).
      
      If the timer expires, there's no real guarantee that expire_time -
      current_time would be <= 0 - due to timer slack certainly, and I'm not
      sure I want to know the implications of the different clock bases in
      hrtimers.
      
      If the timer does expire and the code calculates that the time remaining
      is nonnegative, that could be even worse if the calling code then reuses
      that timeout.  Probably safer to just return 0 then, but I could imagine
      weird bugs or at least unintended behaviour arising from that too.
      
      I came to the conclusion that if other users end up actually needing the
      amount of time remaining, the sanest thing to do would be to create a
      version that uses absolute timeouts instead of relative.
      
      [akpm@linux-foundation.org: fix description of `timeout' arg]
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      774a08b3
    • K
      aio: make aio_put_req() lockless · 11599eba
      Kent Overstreet 提交于
      Freeing a kiocb needed to touch the kioctx for three things:
      
       * Pull it off the reqs_active list
       * Decrementing reqs_active
       * Issuing a wakeup, if the kioctx was in the process of being freed.
      
      This patch moves these to aio_complete(), for a couple reasons:
      
       * aio_complete() already has to issue the wakeup, so if we drop the
         kioctx refcount before aio_complete does its wakeup we don't have to
         do it twice.
       * aio_complete currently has to take the kioctx lock, so it makes sense
         for it to pull the kiocb off the reqs_active list too.
       * A later patch is going to change reqs_active to include unreaped
         completions - this will mean allocating a kiocb doesn't have to look
         at the ringbuffer. So taking the decrement of reqs_active out of
         kiocb_free() is useful prep work for that patch.
      
      This doesn't really affect cancellation, since existing (usb) code that
      implements a cancel function still calls aio_complete() - we just have
      to make sure that aio_complete does the necessary teardown for cancelled
      kiocbs.
      
      It does affect code paths where we free kiocbs that were never
      submitted; they need to decrement reqs_active and pull the kiocb off the
      reqs_active list.  This occurs in two places: kiocb_batch_free(), which
      is going away in a later patch, and the error path in io_submit_one.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11599eba
    • K
      aio: move private stuff out of aio.h · 4e179bca
      Kent Overstreet 提交于
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e179bca
    • K
      aio: kill return value of aio_complete() · 2d68449e
      Kent Overstreet 提交于
      Nothing used the return value, and it probably wasn't possible to use it
      safely for the locked versions (aio_complete(), aio_put_req()).  Just
      kill it.
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Acked-by: NZach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d68449e
    • Z
      aio: remove retry-based AIO · 41003a7b
      Zach Brown 提交于
      This removes the retry-based AIO infrastructure now that nothing in tree
      is using it.
      
      We want to remove retry-based AIO because it is fundemantally unsafe.
      It retries IO submission from a kernel thread that has only assumed the
      mm of the submitting task.  All other task_struct references in the IO
      submission path will see the kernel thread, not the submitting task.
      This design flaw means that nothing of any meaningful complexity can use
      retry-based AIO.
      
      This removes all the code and data associated with the retry machinery.
      The most significant benefit of this is the removal of the locking
      around the unused run list in the submission path.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Signed-off-by: NZach Brown <zab@redhat.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41003a7b
    • Z
      aio: remove dead code from aio.h · 4b49bb8a
      Zach Brown 提交于
      Signed-off-by: NZach Brown <zab@redhat.com>
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b49bb8a
    • A
      remove unused random32() and srandom32() · 22ea9c07
      Akinobu Mita 提交于
      After finishing a naming transition, remove unused backward
      compatibility wrapper macros
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22ea9c07
    • N
      hugetlbfs: fix mmap failure in unaligned size request · af73e4d9
      Naoya Horiguchi 提交于
      The current kernel returns -EINVAL unless a given mmap length is
      "almost" hugepage aligned.  This is because in sys_mmap_pgoff() the
      given length is passed to vm_mmap_pgoff() as it is without being aligned
      with hugepage boundary.
      
      This is a regression introduced in commit 40716e29 ("hugetlbfs: fix
      alignment of huge page requests"), where alignment code is pushed into
      hugetlb_file_setup() and the variable len in caller side is not changed.
      
      To fix this, this patch partially reverts that commit, and adds
      alignment code in caller side.  And it also introduces hstate_sizelog()
      in order to get proper hstate to specified hugepage size.
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881
      
      [akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: <iceman_dvd@yahoo.com>
      Cc: Steven Truelove <steven.truelove@utoronto.ca>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af73e4d9
    • A
      include/linux/mm.h: complete the mm_walk definition · 0f157a5b
      Andrew Morton 提交于
      That nameless-function-arguments thing drives me batty.  Fix.
      
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f157a5b
  2. 06 5月, 2013 2 次提交
  3. 05 5月, 2013 2 次提交
  4. 04 5月, 2013 1 次提交
    • F
      sched: Keep at least 1 tick per second for active dynticks tasks · 265f22a9
      Frederic Weisbecker 提交于
      The scheduler doesn't yet fully support environments
      with a single task running without a periodic tick.
      
      In order to ensure we still maintain the duties of scheduler_tick(),
      keep at least 1 tick per second.
      
      This makes sure that we keep the progression of various scheduler
      accounting and background maintainance even with a very low granularity.
      Examples include cpu load, sched average, CFS entity vruntime,
      avenrun and events such as load balancing, amongst other details
      handled in sched_class::task_tick().
      
      This limitation will be removed in the future once we get
      these individual items to work in full dynticks CPUs.
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      265f22a9
  5. 03 5月, 2013 1 次提交
  6. 02 5月, 2013 21 次提交
    • D
      net: Restore NETIF_F_* bit ordering. · 4ada8db3
      David Miller 提交于
      Commit 8ad227ff ("net: vlan: add 802.1ad support") added some new
      NETIF_F_* features bits, but it added them in the middle of existing
      values.
      
      Userland depends upon the flag bits via the per-netdevice 'flags' sysfs
      file.
      
      So restore the previous ordering by adding the new flags at the end.
      Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4ada8db3
    • P
      KVM: PPC: Book3S: Add API for in-kernel XICS emulation · 5975a2e0
      Paul Mackerras 提交于
      This adds the API for userspace to instantiate an XICS device in a VM
      and connect VCPUs to it.  The API consists of a new device type for
      the KVM_CREATE_DEVICE ioctl, a new capability KVM_CAP_IRQ_XICS, which
      functions similarly to KVM_CAP_IRQ_MPIC, and the KVM_IRQ_LINE ioctl,
      which is used to assert and deassert interrupt inputs of the XICS.
      
      The XICS device has one attribute group, KVM_DEV_XICS_GRP_SOURCES.
      Each attribute within this group corresponds to the state of one
      interrupt source.  The attribute number is the same as the interrupt
      source number.
      
      This does not support irq routing or irqfd yet.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Acked-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      5975a2e0
    • A
      libceph: create source file "net/ceph/snapshot.c" · 4f0dcb10
      Alex Elder 提交于
      This creates a new source file "net/ceph/snapshot.c" to contain
      utility routines related to ceph snapshot contexts.  The main
      motivation was to define ceph_create_snap_context() as a common way
      to create these structures, but I've moved the definitions of
      ceph_get_snap_context() and ceph_put_snap_context() there too.
      (The benefit of inlining those is very small, and I'd rather
      keep this collection of functions together.)
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      4f0dcb10
    • A
      libceph: validate timespec conversions · c3f56102
      Alex Elder 提交于
      A ceph timespec contains 32-bit unsigned values for its seconds and
      nanoseconds components.  For a standard timespec, both fields are
      signed, and the seconds field is almost surely 64 bits.
      
      Add some explicit casts so the fact that this conversion is taking
      place is obvious.  Also trip a bug if we ever try to put out of
      range (negative or too big) values into a ceph timespec.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      c3f56102
    • A
      libceph: add signed type limits · b587398a
      Alex Elder 提交于
      Flesh out the limits defined in <linux/ceph/decode.h> to include the
      maximum and minimum values for signed type S8, S16, S32, and S64.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      b587398a
    • A
      libceph: support pages for class request data · 6c57b554
      Alex Elder 提交于
      Add the ability to provide an array of pages as outbound request
      data for object class method calls.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      6c57b554
    • A
      libceph: support raw data requests · 49719778
      Alex Elder 提交于
      Allow osd request ops that aren't otherwise structured (not class,
      extent, or watch ops) to specify "raw" data to be used to hold
      incoming data for the op.  Make use of this capability for the osd
      STAT op.
      
      Prefix the name of the private function osd_req_op_init() with "_",
      and expose a new function by that (earlier) name whose purpose is to
      initialize osd ops with (only) implied data.
      
      For now we'll just support the use of a page array for an osd op
      with incoming raw data.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      49719778
    • A
      libceph: kill off osd data write_request parameters · 406e2c9f
      Alex Elder 提交于
      In the incremental move toward supporting distinct data items in an
      osd request some of the functions had "write_request" parameters to
      indicate, basically, whether the data belonged to in_data or the
      out_data.  Now that we maintain the data fields in the op structure
      there is no need to indicate the direction, so get rid of the
      "write_request" parameters.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      406e2c9f
    • A
      libceph: change how "safe" callback is used · 26be8808
      Alex Elder 提交于
      An osd request currently has two callbacks.  They inform the
      initiator of the request when we've received confirmation for the
      target osd that a request was received, and when the osd indicates
      all changes described by the request are durable.
      
      The only time the second callback is used is in the ceph file system
      for a synchronous write.  There's a race that makes some handling of
      this case unsafe.  This patch addresses this problem.  The error
      handling for this callback is also kind of gross, and this patch
      changes that as well.
      
      In ceph_sync_write(), if a safe callback is requested we want to add
      the request on the ceph inode's unsafe items list.  Because items on
      this list must have their tid set (by ceph_osd_start_request()), the
      request added *after* the call to that function returns.  The
      problem with this is that there's a race between starting the
      request and adding it to the unsafe items list; the request may
      already be complete before ceph_sync_write() even begins to put it
      on the list.
      
      To address this, we change the way the "safe" callback is used.
      Rather than just calling it when the request is "safe", we use it to
      notify the initiator the bounds (start and end) of the period during
      which the request is *unsafe*.  So the initiator gets notified just
      before the request gets sent to the osd (when it is "unsafe"), and
      again when it's known the results are durable (it's no longer
      unsafe).  The first call will get made in __send_request(), just
      before the request message gets sent to the messenger for the first
      time.  That function is only called by __send_queued(), which is
      always called with the osd client's request mutex held.
      
      We then have this callback function insert the request on the ceph
      inode's unsafe list when we're told the request is unsafe.  This
      will avoid the race because this call will be made under protection
      of the osd client's request mutex.  It also nicely groups the setup
      and cleanup of the state associated with managing unsafe requests.
      
      The name of the "safe" callback field is changed to "unsafe" to
      better reflect its new purpose.  It has a Boolean "unsafe" parameter
      to indicate whether the request is becoming unsafe or is now safe.
      Because the "msg" parameter wasn't used, we drop that.
      
      This resolves the original problem reportedin:
          http://tracker.ceph.com/issues/4706Reported-by: NYan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      26be8808
    • A
      libceph: make method call data be a separate data item · 04017e29
      Alex Elder 提交于
      Right now the data for a method call is specified via a pointer and
      length, and it's copied--along with the class and method name--into
      a pagelist data item to be sent to the osd.  Instead, encode the
      data in a data item separate from the class and method names.
      
      This will allow large amounts of data to be supplied to methods
      without copying.  Only rbd uses the class functionality right now,
      and when it really needs this it will probably need to use a page
      array rather than a page list.  But this simple implementation
      demonstrates the functionality on the osd client, and that's enough
      for now.
      
      This resolves:
          http://tracker.ceph.com/issues/4104Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      04017e29
    • A
      libceph: add, don't set data for a message · 90af3602
      Alex Elder 提交于
      Change the names of the functions that put data on a pagelist to
      reflect that we're adding to whatever's already there rather than
      just setting it to the one thing.  Currently only one data item is
      ever added to a message, but that's about to change.
      
      This resolves:
          http://tracker.ceph.com/issues/2770Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      90af3602
    • A
      libceph: implement multiple data items in a message · ca8b3a69
      Alex Elder 提交于
      This patch adds support to the messenger for more than one data item
      in its data list.
      
      A message data cursor has two more fields to support this:
          - a count of the number of bytes left to be consumed across
            all data items in the list, "total_resid"
          - a pointer to the head of the list (for validation only)
      
      The cursor initialization routine has been split into two parts: the
      outer one, which initializes the cursor for traversing the entire
      list of data items; and the inner one, which initializes the cursor
      to start processing a single data item.
      
      When a message cursor is first initialized, the outer initialization
      routine sets total_resid to the length provided.  The data pointer
      is initialized to the first data item on the list.  From there, the
      inner initialization routine finishes by setting up to process the
      data item the cursor points to.
      
      Advancing the cursor consumes bytes in total_resid.  If the resid
      field reaches zero, it means the current data item is fully
      consumed.  If total_resid indicates there is more data, the cursor
      is advanced to point to the next data item, and then the inner
      initialization routine prepares for using that.  (A check is made at
      this point to make sure we don't wrap around the front of the list.)
      
      The type-specific init routines are modified so they can be given a
      length that's larger than what the data item can support.  The resid
      field is initialized to the smaller of the provided length and the
      length of the entire data item.
      
      When total_resid reaches zero, we're done.
      
      This resolves:
          http://tracker.ceph.com/issues/3761Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      ca8b3a69
    • A
      libceph: replace message data pointer with list · 5240d9f9
      Alex Elder 提交于
      In place of the message data pointer, use a list head which links
      through message data items.  For now we only support a single entry
      on that list.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      5240d9f9
    • A
      libceph: have cursor point to data · 8ae4f4f5
      Alex Elder 提交于
      Rather than having a ceph message data item point to the cursor it's
      associated with, have the cursor point to a data item.  This will
      allow a message cursor to be used for more than one data item.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      8ae4f4f5
    • A
      libceph: move cursor into message · 36153ec9
      Alex Elder 提交于
      A message will only be processing a single data item at a time, so
      there's no need for each data item to have its own cursor.
      
      Move the cursor embedded in the message data structure into the
      message itself.  To minimize the impact, keep the data->cursor
      field, but make it be a pointer to the cursor in the message.
      
      Move the definition of ceph_msg_data above ceph_msg_data_cursor so
      the cursor can point to the data without a forward definition rather
      than vice-versa.
      
      This and the upcoming patches are part of:
          http://tracker.ceph.com/issues/3761Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      36153ec9
    • A
      libceph: record bio length · c851c495
      Alex Elder 提交于
      The bio is the only data item type that doesn't record its full
      length.  Fix that.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      c851c495
    • A
      libceph: fix possible CONFIG_BLOCK build problem · ea96571f
      Alex Elder 提交于
      This patch:
          15a0d7b libceph: record message data length
      did not enclose some bio-specific code inside CONFIG_BLOCK as
      it should have.  Fix that.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      ea96571f
    • A
      libceph: kill off osd request r_data_in and r_data_out · 5476492f
      Alex Elder 提交于
      Finally!  Convert the osd op data pointers into real structures, and
      make the switch over to using them instead of having all ops share
      the in and/or out data structures in the osd request.
      
      Set up a new function to traverse the set of ops and release any
      data associated with them (pages).
      
      This and the patches leading up to it resolve:
          http://tracker.ceph.com/issues/4657Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      5476492f
    • A
      libceph: set the data pointers when encoding ops · ec9123c5
      Alex Elder 提交于
      Still using the osd request r_data_in and r_data_out pointer, but
      we're basically only referring to it via the data pointers in the
      osd ops.  And we're transferring that information to the request
      or reply message only when the op indicates it's needed, in
      osd_req_encode_op().
      
      To avoid a forward reference, ceph_osdc_msg_data_set() was moved up
      in the file.
      
      Don't bother calling ceph_osd_data_init(), in ceph_osd_alloc(),
      because the ops array will already be zeroed anyway.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      ec9123c5
    • A
      libceph: combine initializing and setting osd data · a4ce40a9
      Alex Elder 提交于
      This ends up being a rather large patch but what it's doing is
      somewhat straightforward.
      
      Basically, this is replacing two calls with one.  The first of the
      two calls is initializing a struct ceph_osd_data with data (either a
      page array, a page list, or a bio list); the second is setting an
      osd request op so it associates that data with one of the op's
      parameters.  In place of those two will be a single function that
      initializes the op directly.
      
      That means we sort of fan out a set of the needed functions:
          - extent ops with pages data
          - extent ops with pagelist data
          - extent ops with bio list data
      and
          - class ops with page data for receiving a response
      
      We also have define another one, but it's only used internally:
          - class ops with pagelist data for request parameters
      
      Note that we *still* haven't gotten rid of the osd request's
      r_data_in and r_data_out fields.  All the osd ops refer to them for
      their data.  For now, these data fields are pointers assigned to the
      appropriate r_data_* field when these new functions are called.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      a4ce40a9
    • A
      libceph: format class info at init time · 5f562df5
      Alex Elder 提交于
      An object class method is formatted using a pagelist which contains
      the class name, the method name, and the data concatenated into an
      osd request's outbound data.
      
      Currently when a class op is initialized in osd_req_op_cls_init(),
      the lengths of and pointers to these three items are recorded.
      Later, when the op is getting formatted into the request message, a
      new pagelist is created and that is when these items get copied into
      the pagelist.
      
      This patch makes it so the pagelist to hold these items is created
      when the op is initialized instead.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      5f562df5