1. 03 Jul, 2013 1 commit
    • J
      vfs: export lseek_execute() to modules · 46a1c2c7
      Jie Liu authored
      For the file systems (btrfs/ext4/ocfs2/tmpfs) that support
      SEEK_DATA/SEEK_HOLE, we end up handling a similar
      matter in lseek_execute(): updating the current file offset
      to the desired offset if it is valid. ceph also does
      similar things in ceph_llseek().
      
      To reduce the duplication, this patch makes lseek_execute()
      publicly accessible so that we can call it directly from the
      underlying file systems.
      
      Thanks Dave Chinner for this suggestion.
      
      [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
      
      v2->v1:
      - Add kernel-doc comments for lseek_execute()
      - Call lseek_execute() in ceph->llseek()
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Ted Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Sage Weil <sage@inktank.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      46a1c2c7
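The offset check described in this commit can be sketched in user space as follows. This is a minimal illustration, not the kernel code: `struct mock_file` and `mock_vfs_setpos()` are hypothetical stand-ins, and the sketch assumes negative offsets are always invalid.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct file; only the fields
 * this sketch touches are modelled. */
struct mock_file {
    int64_t  f_pos;      /* current file offset */
    uint64_t f_version;  /* reset whenever the offset changes */
};

/* Validate 'offset' against [0, maxsize] and, if it is valid, store it
 * as the new file position. Returns the new offset, or -EINVAL.
 * Assumption: negative offsets are never allowed in this sketch. */
static int64_t mock_vfs_setpos(struct mock_file *file, int64_t offset,
                               int64_t maxsize)
{
    if (offset < 0 || offset > maxsize)
        return -EINVAL;

    if (offset != file->f_pos) {
        file->f_pos = offset;
        file->f_version = 0;
    }
    return offset;
}
```

A file system's llseek handler would compute the candidate offset for SEEK_DATA or SEEK_HOLE and then hand it to a helper of this shape instead of repeating the validation itself.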
  2. 29 Jun, 2013 2 commits
    • J
      locks: protect most of the file_lock handling with i_lock · 1c8c601a
      Jeff Layton authored
      Having a global lock that protects all of this code is a clear
      scalability problem. Instead of doing that, move most of the code to be
      protected by the i_lock instead. The exceptions are the global lists
      that the ->fl_link sits on, and the ->fl_block list.
      
      ->fl_link is what connects these structures to the
      global lists, so we must ensure that we hold those locks when iterating
      over or updating these lists.
      
      Furthermore, sound deadlock detection requires that we hold the
      blocked_list state steady while checking for loops. We also must ensure
      that the search and update to the list are atomic.
      
      For the checking and insertion side of the blocked_list, push the
      acquisition of the global lock into __posix_lock_file and ensure that
      checking and update of the  blocked_list is done without dropping the
      lock in between.
      
      On the removal side, when waking up blocked lock waiters, take the
      global lock before walking the blocked list and dequeue the waiters from
      the global list prior to removal from the fl_block list.
      
      With this, deadlock detection should be race free while we minimize
      excessive file_lock_lock thrashing.
      
      Finally, in order to avoid a lock inversion problem when handling
      /proc/locks output we must ensure that manipulations of the fl_block
      list are also protected by the file_lock_lock.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1c8c601a
    • A
      [readdir] convert ceph · 77acfa29
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      77acfa29
  3. 22 May, 2013 2 commits
    • L
      ceph: use ->invalidatepage() length argument · 569d39fc
      Lukas Czerner authored
      The ->invalidatepage() aop now accepts a range to invalidate, so we can make
      use of it in ceph_invalidatepage().
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Acked-by: Sage Weil <sage@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      569d39fc
    • L
      mm: change invalidatepage prototype to accept length · d47992f8
      Lukas Czerner authored
      Currently there is no way to truncate a partial page where the end
      truncate point is not at the end of the page. This is because it was not
      needed and the functionality was enough for the file system truncate
      operation to work properly. However, more file systems now support the punch
      hole feature, and it can benefit from mm supporting truncating a page just
      up to a certain point.
      
      Specifically, with this functionality truncate_inode_pages_range() can
      be changed so it supports truncating partial page at the end of the
      range (currently it will BUG_ON() if 'end' is not at the end of the
      page).
      
      This commit changes the invalidatepage() address space operation
      prototype to accept range to be invalidated and update all the instances
      for it.
      
      We also change block_invalidatepage() in the same way and actually
      make use of the new length argument, implementing range invalidation.
      
      Actual file system implementations will follow, except for the file systems
      where the changes are really simple and should not change the behaviour
      in any way. An implementation of truncate_page_range(), which will be able
      to accept page-unaligned ranges, will follow as well.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      d47992f8
  4. 18 May, 2013 2 commits
    • J
      ceph: ceph_pagelist_append might sleep while atomic · 39be95e9
      Jim Schutt authored
      Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc()
      while holding a lock, but it's spoiled because ceph_pagelist_addpage()
      always calls kmap(), which might sleep.  Here's the result:
      
      [13439.295457] ceph: mds0 reconnect start
      [13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58
      [13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1
          . . .
      [13439.376225] Call Trace:
      [13439.378757]  [<ffffffff81076f4c>] __might_sleep+0xfc/0x110
      [13439.384353]  [<ffffffffa03f4ce0>] ceph_pagelist_append+0x120/0x1b0 [libceph]
      [13439.391491]  [<ffffffffa0448fe9>] ceph_encode_locks+0x89/0x190 [ceph]
      [13439.398035]  [<ffffffff814ee849>] ? _raw_spin_lock+0x49/0x50
      [13439.403775]  [<ffffffff811cadf5>] ? lock_flocks+0x15/0x20
      [13439.409277]  [<ffffffffa045e2af>] encode_caps_cb+0x41f/0x4a0 [ceph]
      [13439.415622]  [<ffffffff81196748>] ? igrab+0x28/0x70
      [13439.420610]  [<ffffffffa045e9f8>] ? iterate_session_caps+0xe8/0x250 [ceph]
      [13439.427584]  [<ffffffffa045ea25>] iterate_session_caps+0x115/0x250 [ceph]
      [13439.434499]  [<ffffffffa045de90>] ? set_request_path_attr+0x2d0/0x2d0 [ceph]
      [13439.441646]  [<ffffffffa0462888>] send_mds_reconnect+0x238/0x450 [ceph]
      [13439.448363]  [<ffffffffa0464542>] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph]
      [13439.455250]  [<ffffffffa0462e42>] check_new_map+0x352/0x500 [ceph]
      [13439.461534]  [<ffffffffa04631ad>] ceph_mdsc_handle_map+0x1bd/0x260 [ceph]
      [13439.468432]  [<ffffffff814ebc7e>] ? mutex_unlock+0xe/0x10
      [13439.473934]  [<ffffffffa043c612>] extra_mon_dispatch+0x22/0x30 [ceph]
      [13439.480464]  [<ffffffffa03f6c2c>] dispatch+0xbc/0x110 [libceph]
      [13439.486492]  [<ffffffffa03eec3d>] process_message+0x1ad/0x1d0 [libceph]
      [13439.493190]  [<ffffffffa03f1498>] ? read_partial_message+0x3e8/0x520 [libceph]
          . . .
      [13439.587132] ceph: mds0 reconnect success
      [13490.720032] ceph: mds0 caps stale
      [13501.235257] ceph: mds0 recovery completed
      [13501.300419] ceph: mds0 caps renewed
      
      Fix it up by encoding locks into a buffer first, and when the number
      of encoded locks is stable, copy that into a ceph_pagelist.
      
      [elder@inktank.com: abbreviated the stack info a bit.]
      
      Cc: stable@vger.kernel.org # 3.4+
      Signed-off-by: Jim Schutt <jaschut@sandia.gov>
      Reviewed-by: Alex Elder <elder@inktank.com>
      39be95e9
    • J
      ceph: add cpu_to_le32() calls when encoding a reconnect capability · c420276a
      Jim Schutt authored
      In his review, Alex Elder mentioned that he hadn't checked that
      num_fcntl_locks and num_flock_locks were properly decoded on the
      server side, from a le32 over-the-wire type to a cpu type.
      I checked, and AFAICS it is done; those interested can consult
          Locker::_do_cap_update()
      in src/mds/Locker.cc and src/include/encoding.h in the Ceph server
      code (git://github.com/ceph/ceph).
      
      I also checked the server side for flock_len decoding, and I believe
      that also happens correctly, by virtue of having been declared
      __le32 in struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h.
      
      Cc: stable@vger.kernel.org # 3.4+
      Signed-off-by: Jim Schutt <jaschut@sandia.gov>
      Reviewed-by: Alex Elder <elder@inktank.com>
      c420276a
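What a CPU-to-little-endian conversion guarantees on the wire can be sketched in portable user-space C. The helpers `put_le32()` and `get_le32()` are hypothetical names for this illustration; the kernel's cpu_to_le32() achieves the same byte layout by different means.

```c
#include <stdint.h>

/* Store a 32-bit value in little-endian byte order, whatever the host
 * byte order is: least significant byte first. */
static void put_le32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)((v >> 8) & 0xff);
    out[2] = (uint8_t)((v >> 16) & 0xff);
    out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Decode the same four bytes back into a host-order value. */
static uint32_t get_le32(const uint8_t in[4])
{
    return (uint32_t)in[0] |
           ((uint32_t)in[1] << 8) |
           ((uint32_t)in[2] << 16) |
           ((uint32_t)in[3] << 24);
}
```

Encoding counts such as num_fcntl_locks this way means the sender and receiver agree on the byte layout even when their native byte orders differ.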
  5. 08 May, 2013 1 commit
  6. 02 May, 2013 32 commits
    • A
      ceph: use ceph_create_snap_context() · 812164f8
      Alex Elder authored
      Now that we have a library routine to create snap contexts, use it.
      
      This is part of:
          http://tracker.ceph.com/issues/4857
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      812164f8
    • A
      libceph: kill off osd data write_request parameters · 406e2c9f
      Alex Elder authored
      In the incremental move toward supporting distinct data items in an
      osd request some of the functions had "write_request" parameters to
      indicate, basically, whether the data belonged to in_data or the
      out_data.  Now that we maintain the data fields in the op structure
      there is no need to indicate the direction, so get rid of the
      "write_request" parameters.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      406e2c9f
    • R
      ceph: fix printk format warnings in file.c · ac7f29bf
      Randy Dunlap authored
      Fix printk format warnings by using %zd for 'ssize_t' variables:
      
      fs/ceph/file.c:751:2: warning: format '%ld' expects argument of type 'long int', but argument 11 has type 'ssize_t' [-Wformat]
      fs/ceph/file.c:762:2: warning: format '%ld' expects argument of type 'long int', but argument 11 has type 'ssize_t' [-Wformat]
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: ceph-devel@vger.kernel.org
      Signed-off-by: Sage Weil <sage@inktank.com>
      ac7f29bf
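The format fix in this commit has a direct user-space analogue: `%zd` is the standard conversion for `ssize_t`, whereas `%ld` only matches on platforms where `ssize_t` happens to be `long`. A minimal sketch, with `format_result()` as a hypothetical helper:

```c
#include <stdio.h>
#include <sys/types.h>

/* Format a signed size into 'buf' using %zd, the conversion that matches
 * ssize_t on every platform. Returns snprintf()'s result. */
static int format_result(char *buf, size_t buflen, ssize_t ret)
{
    return snprintf(buf, buflen, "ret=%zd", ret);
}
```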
    • Y
      ceph: fix race between writepages and truncate · 1ac0fc8a
      Yan, Zheng authored
      ceph_writepages_start() reads inode->i_size in two places. It can get
      different values between successive reads, because truncate can change
      inode->i_size at any time. The race can lead to a mismatch between the data
      length of the osd request and the pages marked as writeback. When the osd request
      finishes, it clears writeback pages according to its data length, so
      some pages can be left in the writeback state forever. The fix is to
      read inode->i_size only once, save its value to a local variable, and use
      the local variable wherever i_size is needed.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: Alex Elder <elder@inktank.com>
      1ac0fc8a
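The "read once, then use the local copy" pattern from this fix can be sketched as below. `struct mock_inode` and `clamp_write_len()` are hypothetical names for illustration; the point is only that the bounds check and the length computation see the same value even if another thread changes the size in between.

```c
#include <stdint.h>

/* Hypothetical stand-in for an inode whose size may change concurrently. */
struct mock_inode {
    volatile int64_t i_size;
};

/* Clamp a write of 'len' bytes starting at 'offset' to the file size.
 * i_size is read exactly once; every later use refers to the snapshot. */
static int64_t clamp_write_len(const struct mock_inode *inode,
                               int64_t offset, int64_t len)
{
    int64_t i_size = inode->i_size;   /* single read of the shared value */

    if (offset >= i_size)
        return 0;                     /* nothing inside the file to write */
    if (offset + len > i_size)
        len = i_size - offset;        /* uses the same snapshot as above */
    return len;
}
```

Had the function read `inode->i_size` separately for each comparison, a truncate between the two reads could produce a length that does not match the pages already selected.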
    • Y
      ceph: apply write checks in ceph_aio_write · 03d254ed
      Yan, Zheng authored
      Copy the write checks in __generic_file_aio_write() to ceph_aio_write(),
      so that these checks also cover the sync write path.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: Alex Elder <elder@inktank.com>
      03d254ed
    • Y
      ceph: take i_mutex before getting Fw cap · 37505d57
      Yan, Zheng authored
      There is a deadlock, as illustrated below. The fix is to take i_mutex
      before getting the Fw cap reference.
      
            write                    truncate                 MDS
      ---------------------     --------------------      --------------
      get Fw cap
                                lock i_mutex
      lock i_mutex (blocked)
                                request setattr.size  ->
                                                      <-   revoke Fw cap
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
      37505d57
    • A
      libceph: change how "safe" callback is used · 26be8808
      Alex Elder authored
      An osd request currently has two callbacks.  They inform the
      initiator of the request when we've received confirmation for the
      target osd that a request was received, and when the osd indicates
      all changes described by the request are durable.
      
      The only time the second callback is used is in the ceph file system
      for a synchronous write.  There's a race that makes some handling of
      this case unsafe.  This patch addresses this problem.  The error
      handling for this callback is also kind of gross, and this patch
      changes that as well.
      
      In ceph_sync_write(), if a safe callback is requested we want to add
      the request on the ceph inode's unsafe items list.  Because items on
      this list must have their tid set (by ceph_osd_start_request()), the
      request is added *after* the call to that function returns.  The
      problem with this is that there's a race between starting the
      request and adding it to the unsafe items list; the request may
      already be complete before ceph_sync_write() even begins to put it
      on the list.
      
      To address this, we change the way the "safe" callback is used.
      Rather than just calling it when the request is "safe", we use it to
      notify the initiator the bounds (start and end) of the period during
      which the request is *unsafe*.  So the initiator gets notified just
      before the request gets sent to the osd (when it is "unsafe"), and
      again when it's known the results are durable (it's no longer
      unsafe).  The first call will get made in __send_request(), just
      before the request message gets sent to the messenger for the first
      time.  That function is only called by __send_queued(), which is
      always called with the osd client's request mutex held.
      
      We then have this callback function insert the request on the ceph
      inode's unsafe list when we're told the request is unsafe.  This
      will avoid the race because this call will be made under protection
      of the osd client's request mutex.  It also nicely groups the setup
      and cleanup of the state associated with managing unsafe requests.
      
      The name of the "safe" callback field is changed to "unsafe" to
      better reflect its new purpose.  It has a Boolean "unsafe" parameter
      to indicate whether the request is becoming unsafe or is now safe.
      Because the "msg" parameter wasn't used, we drop that.
      
      This resolves the original problem reported in:
          http://tracker.ceph.com/issues/4706
      Reported-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
      26be8808
    • A
      ceph: let osd client clean up for interrupted request · 7d7d51ce
      Alex Elder authored
      In ceph_sync_write(), if a safe callback is supplied with a request,
      and an error is returned by ceph_osdc_wait_request(), a block of
      code is executed to remove the request from the unsafe writes list
      and drop references to capabilities acquired just prior to a call to
      ceph_osdc_wait_request().
      
      The only function used for this callback is sync_write_commit(),
      and it does *exactly* what that block of error handling code does.
      
      Now in ceph_osdc_wait_request(), if an error occurs (due to an
      interrupt during a wait_for_completion_interruptible() call),
      complete_request() gets called, and that calls the request's
      safe_callback method if it's defined.
      
      So this means that this cleanup activity gets called twice in this
      case, which is erroneous (and in fact leads to a crash).
      
      Fix this by just letting the osd client handle the cleanup in
      the event of an interrupt.
      
      This resolves one problem mentioned in:
          http://tracker.ceph.com/issues/4706
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
      7d7d51ce
    • Y
      ceph: fix symlink inode operations · 0b932672
      Yan, Zheng authored
      Add getattr/setattr and xattr-related methods.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: Greg Farnum <greg@inktank.com>
      0b932672
    • S
      ceph: Use pseudo-random numbers to choose mds · a84cd293
      Sam Lang authored
      We don't need to use up entropy to choose an mds,
      so use prandom_u32() to get a pseudo-random number.
      
      Also, we don't need to choose a random mds if only
      one mds is available, so add special casing for the
      common case.
      
      Fixes http://tracker.ceph.com/issues/3579
      Signed-off-by: Sam Lang <sam.lang@inktank.com>
      Reviewed-by: Greg Farnum <greg@inktank.com>
      Reviewed-by: Alex Elder <elder@inktank.com>
      a84cd293
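The selection logic this commit describes can be sketched in user space as follows. `choose_mds()` is a hypothetical helper, and `rand()` stands in for the kernel's prandom_u32(); both are cheap pseudo-random sources that do not draw on the entropy pool.

```c
#include <stdlib.h>

/* Pick an mds rank in [0, num_mds). The single-mds case is handled
 * first so that no random number is generated at all.
 * Returns -1 if no mds is available. */
static int choose_mds(int num_mds)
{
    if (num_mds <= 0)
        return -1;
    if (num_mds == 1)
        return 0;               /* common case: nothing to choose */
    return rand() % num_mds;    /* pseudo-random is good enough here */
}
```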
    • A
      libceph: add, don't set data for a message · 90af3602
      Alex Elder authored
      Change the names of the functions that put data on a pagelist to
      reflect that we're adding to whatever's already there rather than
      just setting it to the one thing.  Currently only one data item is
      ever added to a message, but that's about to change.
      
      This resolves:
          http://tracker.ceph.com/issues/2770
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      90af3602
    • A
      libceph: combine initializing and setting osd data · a4ce40a9
      Alex Elder authored
      This ends up being a rather large patch but what it's doing is
      somewhat straightforward.
      
      Basically, this is replacing two calls with one.  The first of the
      two calls is initializing a struct ceph_osd_data with data (either a
      page array, a page list, or a bio list); the second is setting an
      osd request op so it associates that data with one of the op's
      parameters.  In place of those two will be a single function that
      initializes the op directly.
      
      That means we sort of fan out a set of the needed functions:
          - extent ops with pages data
          - extent ops with pagelist data
          - extent ops with bio list data
      and
          - class ops with page data for receiving a response
      
      We also have to define another one, but it's only used internally:
          - class ops with pagelist data for request parameters
      
      Note that we *still* haven't gotten rid of the osd request's
      r_data_in and r_data_out fields.  All the osd ops refer to them for
      their data.  For now, these data fields are pointers assigned to the
      appropriate r_data_* field when these new functions are called.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      a4ce40a9
    • A
      libceph: specify osd op by index in request · c99d2d4a
      Alex Elder authored
      An osd request now holds all of its source op structures, and every
      place that initializes one of these is in fact initializing one
      of the entries in the osd request's array.
      
      So rather than supplying the address of the op to initialize, have
      the caller specify the osd request and an indication of which op it
      would like to initialize.  This better hides the details of the
      op structure (and facilitates moving the data pointers they use).
      
      Since osd_req_op_init() is a common routine, and it's not used
      outside the osd client code, give it static scope.  Also make
      it return the address of the specified op (so all the other
      init routines don't have to repeat that code).
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      c99d2d4a
    • A
      libceph: add data pointers in osd op structures · 8c042b0d
      Alex Elder authored
      An extent type osd operation currently implies that there will
      be corresponding data supplied in the data portion of the request
      (for write) or response (for read) message.  Similarly, an osd class
      method operation implies a data item will be supplied to receive
      the response data from the operation.
      
      Add a ceph_osd_data pointer to each of those structures, and assign
      it to point to either the incoming or the outgoing data structure in
      the osd message.  The data is not always available when an op is
      initially set up, so add two new functions to allow setting them
      after the op has been initialized.
      
      Begin to make use of the data item pointer available in the osd
      operation rather than the request data in or out structure in
      places where it's convenient.  Add some assertions to verify
      pointers are always set the way they're expected to be.
      
      This is a sort of stepping stone toward really moving the data
      into the osd request ops, to allow for some validation before
      making that jump.
      
      This is the first in a series of patches that resolve:
          http://tracker.ceph.com/issues/4657
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      8c042b0d
    • A
      libceph: keep source rather than message osd op array · 79528734
      Alex Elder authored
      An osd request keeps a pointer to the osd operations (ops) array
      that it builds in its request message.
      
      In order to allow each op in the array to have its own distinct
      data, we will need to keep track of each op's data, and that
      information does not go over the wire.
      
      As long as we're tracking the data we might as well just track the
      entire (source) op definition for each of the ops.  And if we're
      doing that, we'll have no more need to keep a pointer to the
      wire-encoded version.
      
      This patch makes the array of source ops be kept with the osd
      request structure, and uses that instead of the version encoded in
      the message in places where that was previously used.  The array
      will be embedded in the request structure, and the maximum number of
      ops we ever actually use is currently 2.  So reduce CEPH_OSD_MAX_OP
      to 2 to reduce the size of the structure.
      
      The result of doing this sort of ripples back up, and as a result
      various function parameters and local variables become unnecessary.
      
      Make r_num_ops be unsigned, and move the definition of struct
      ceph_osd_req_op earlier to ensure it's defined where needed.
      
      It does not yet add per-op data, that's coming soon.
      
      This resolves:
          http://tracker.ceph.com/issues/4656
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      79528734
    • A
      libceph: a few more osd data cleanups · 87060c10
      Alex Elder authored
      These are very small changes that make use of osd_data local pointers
      as shorthands for structures being operated on.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      87060c10
    • A
      libceph: define osd data initialization helpers · 43bfe5de
      Alex Elder authored
      Define and use functions that encapsulate the initialization of a
      ceph_osd_data structure.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      43bfe5de
    • A
      ceph: build osd request message later for writepages · e5975c7c
      Alex Elder authored
      Hold off building the osd request message in ceph_writepages_start()
      until just before it will be submitted to the osd client for
      execution.
      
      We'll still create the request and allocate the page pointer array
      after we learn we have at least one page to write.  A local variable
      will be used to keep track of the allocated array of pages.  Wait
      until just before submitting the request for assigning that page
      array pointer to the request message.
      
      Create and use a new function osd_req_op_extent_update() whose
      purpose is to serve this one spot where the length value supplied
      when an osd request's op was initially formatted might need to get
      changed (reduced, never increased) before submitting the request.
      
      Previously, ceph_writepages_start() assigned the message header's
      data length because of this update.  That's no longer necessary,
      because ceph_osdc_build_request() will recalculate the right
      value to use based on the content of the ops in the request.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      e5975c7c
    • A
      libceph: hold off building osd request · 02ee07d3
      Alex Elder authored
      Defer building the osd request until just before submitting it in
      all callers except ceph_writepages_start().  (That caller will be
      handled in the next patch.)
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      02ee07d3
    • A
      ceph: kill ceph alloc_page_vec() · 88486957
      Alex Elder authored
      There is a helper function alloc_page_vec() that, despite its
      generic-sounding name, depends heavily on an osd request structure
      being populated with certain information.
      
      There is only one place this function is used, and it ends up
      being a bit simpler to just open code what it does, so get
      rid of the helper.
      
      The real motivation for this is deferring the building of the osd
      request message, and this is a step in that direction.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      88486957
    • A
      ceph: define ceph_writepages_osd_request() · 94fe8420
      Alex Elder authored
      Mostly for readability, define ceph_writepages_osd_request() and
      use it to allocate the osd request for ceph_writepages_start().
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      94fe8420
    • A
      libceph: don't build request in ceph_osdc_new_request() · acead002
      Alex Elder authored
      This patch moves the call to ceph_osdc_build_request() out of
      ceph_osdc_new_request() and into its caller.
      
      This is in order to defer formatting osd operation information into
      the request message until just before request is started.
      
      The only unusual (ab)user of ceph_osdc_build_request() is
      ceph_writepages_start(), where the final length of write request may
      change (downward) based on the current inode size or the oldest
      snapshot context with dirty data for the inode.
      
      The remaining callers don't change anything in the request after it has
      been built.
      
      This means the ops array is now supplied by the caller.  It also
      means there is no need to pass the mtime to ceph_osdc_new_request()
      (it gets provided to ceph_osdc_build_request()).  And rather than
      passing a do_sync flag, have the number of ops in the ops array
      supplied imply adding a second STARTSYNC operation after the READ or
      WRITE requested.
      
      This and some of the patches that follow are related to having the
      messenger (only) be responsible for filling the content of the
      message header, as described here:
          http://tracker.ceph.com/issues/4589
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      acead002
    • A
      ceph: use page_offset() in ceph_writepages_start() · 25d71cb9
      Alex Elder authored
      There's one spot in ceph_writepages_start() that open-codes what
      page_offset() does safely.  Use the macro so we don't have to worry
      about wrapping.
      
      This resolves:
          http://tracker.ceph.com/issues/4648
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      25d71cb9
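The wrapping concern mentioned here arises when a page index is shifted in a type too narrow for the resulting byte offset. The sketch below illustrates this with a 32-bit index; `MOCK_PAGE_SHIFT`, `mock_page_offset()` and `wrapped_offset()` are hypothetical names, and a 4 KiB page size is assumed.

```c
#include <stdint.h>

#define MOCK_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Widen the index to 64 bits before shifting, so the byte offset
 * cannot wrap for any 32-bit page index. */
static int64_t mock_page_offset(uint32_t index)
{
    return (int64_t)index << MOCK_PAGE_SHIFT;
}

/* The open-coded variant: the shift happens in 32 bits, so the high
 * bits are lost once the offset reaches 4 GiB. */
static uint32_t wrapped_offset(uint32_t index)
{
    return index << MOCK_PAGE_SHIFT;
}
```

For page index 0x100000 (the page starting at 4 GiB), the widened version yields 0x100000000 while the 32-bit version wraps to 0.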
    • A
      ceph: set up page array mempool with correct size · 3bf53337
      Alex Elder authored
      In create_fs_client() a memory pool is set up to be used for arrays of
      pages that might be needed in ceph_writepages_start() if memory is
      tight.  There are two problems with the way it's initialized:
          - The size provided is the number of pages we want in the
            array, but it should be the number of bytes required for
            that many page pointers.
          - The number of pages computed can end up being 0, while we
            will always need at least one page.
      
      This patch fixes both of these problems.
      
      This resolves the two simple problems defined in:
          http://tracker.ceph.com/issues/4603
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      3bf53337
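Both corrections listed in this commit, sizing in bytes rather than in pages and never going below one page, can be sketched as a small size computation. `page_array_bytes()` and `MOCK_PAGE_SHIFT` are hypothetical, a 4 KiB page size is assumed, and `wsize` stands for a write size in bytes.

```c
#include <stddef.h>

struct page;                 /* opaque; only pointers to it are sized */

#define MOCK_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Return the number of bytes needed for an array of page pointers
 * covering 'wsize' bytes, with a minimum of one entry. */
static size_t page_array_bytes(size_t wsize)
{
    size_t pages = wsize >> MOCK_PAGE_SHIFT;

    if (pages == 0)
        pages = 1;                         /* always need at least one page */
    return pages * sizeof(struct page *);  /* bytes, not a page count */
}
```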
    • S
      libceph: wrap auth ops in wrapper functions · 27859f97
      Sage Weil authored
      Use wrapper functions that check whether the auth op exists so that callers
      do not need a bunch of conditional checks.  Simplifies the external
      interface.
      Signed-off-by: Sage Weil <sage@inktank.com>
      Reviewed-by: Alex Elder <elder@inktank.com>
      27859f97
    • S
      libceph: add update_authorizer auth method · 0bed9b5c
      Sage Weil authored
      Currently the messenger calls out to a get_authorizer con op, which will
      create a new authorizer if it doesn't yet have one.  In the meantime, when
      we rotate our service keys, the authorizer doesn't get updated.  Eventually
      it will be rejected by the server on a new connection attempt and get
      invalidated, and we will then rebuild a new authorizer, but this is not
      ideal.
      
      Instead, if we do have an authorizer, call a new update_authorizer op that
      will verify that the current authorizer is using the latest secret.  If it
      is not, we will build a new one that does.  This avoids the transient
      failure.
      
      This fixes one of the sorry sequence of events for bug
      
      	http://tracker.ceph.com/issues/4282
      Signed-off-by: Sage Weil <sage@inktank.com>
      Reviewed-by: Alex Elder <elder@inktank.com>
      0bed9b5c
    • H
      ceph: fix buffer pointer advance in ceph_sync_write · 022f3e2e
      Henry C Chang authored
      We should advance the user data pointer by _len_ instead of _written_.
      _len_ is the data length written in each iteration while _written_ is the
      accumulated data length we have written out.
      Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
      Reviewed-by: Greg Farnum <greg@inktank.com>
      Tested-by: Sage Weil <sage@inktank.com>
      022f3e2e
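The distinction between the per-iteration length and the running total can be shown with a chunked copy loop. `chunked_copy()` is a hypothetical user-space helper, not the ceph code; it only demonstrates why the source pointer must advance by `len`.

```c
#include <stddef.h>
#include <string.h>

/* Copy 'total' bytes from src to dst in chunks of at most 'chunk' bytes.
 * Returns the number of bytes copied (0 if chunk is 0). */
static size_t chunked_copy(char *dst, const char *src,
                           size_t total, size_t chunk)
{
    size_t written = 0;

    if (chunk == 0)
        return 0;

    while (written < total) {
        size_t len = total - written;

        if (len > chunk)
            len = chunk;
        memcpy(dst + written, src, len);
        src += len;       /* advance by this iteration's length */
        written += len;   /* 'written' accumulates across iterations */
    }
    return written;
}
```

Advancing `src` by `written` instead would skip ahead by the whole running total on every pass, so data after the first chunk would be read from the wrong place.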
    • Y
      ceph: use i_release_count to indicate dir's completeness · 2f276c51
      Yan, Zheng authored
      Current ceph code tracks directory's completeness in two places.
      ceph_readdir() checks i_release_count to decide if it can set the
      I_COMPLETE flag in i_ceph_flags. All other places check the I_COMPLETE
      flag. This indirection introduces locking complexity.
      
      This patch adds a new variable i_complete_count to ceph_inode_info.
      Set i_release_count's value to it when marking a directory complete.
      By comparing the two variables, we know if a directory is complete.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      2f276c51
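The two-counter scheme described above can be sketched as follows. The struct and helper names are hypothetical, and the initial values (release count 1, complete count 0, so a fresh directory starts out incomplete) are an assumption of this sketch rather than something stated in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters kept per directory inode. */
struct mock_dir_info {
    uint32_t release_count;   /* bumped whenever cached entries are dropped */
    uint32_t complete_count;  /* snapshot taken when marked complete */
};

/* Assumption: start with differing counters, i.e. not complete. */
static void dir_init(struct mock_dir_info *d)
{
    d->release_count = 1;
    d->complete_count = 0;
}

/* Mark complete by recording the current release count. */
static void dir_mark_complete(struct mock_dir_info *d)
{
    d->complete_count = d->release_count;
}

/* Any release invalidates completeness without touching a flag. */
static void dir_release(struct mock_dir_info *d)
{
    d->release_count++;
}

static bool dir_is_complete(const struct mock_dir_info *d)
{
    return d->complete_count == d->release_count;
}
```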
    • A
      ceph: only set message data pointers if non-empty · ebf18f47
      Alex Elder authored
      Change it so we only assign outgoing data information for messages
      if there is outgoing data to send.
      
      This then allows us to add a few more (currently commented-out)
      assertions.
      
      This is related to:
          http://tracker.ceph.com/issues/4284
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Greg Farnum <greg@inktank.com>
      ebf18f47
    • A
      libceph: isolate other message data fields · 27fa8385
      Alex Elder authored
      Define ceph_msg_data_set_pagelist(), ceph_msg_data_set_bio(), and
      ceph_msg_data_set_trail() to clearly abstract the assignment of the
      remaining data-related fields in a ceph message structure.  Use the
      new functions in the osd client and mds client.
      
      This partially resolves:
          http://tracker.ceph.com/issues/4263
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      27fa8385
    • A
      libceph: set page info with byte length · f1baeb2b
      Alex Elder authored
      When setting page array information for message data, provide the
      byte length rather than the page count to ceph_msg_data_set_pages().
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      f1baeb2b
    • A
      libceph: isolate message page field manipulation · 02afca6c
      Alex Elder authored
      Define a function ceph_msg_data_set_pages(), which more clearly
      abstracts the assignment of page-related fields for data in a ceph
      message structure.  Use this new function in the osd client and mds
      client.
      
      Ideally, these fields would never be set more than once (with
      BUG_ON() calls to guarantee that).  At the moment though the osd
      client sets these every time it receives a message, and in the event
      of a communication problem this can happen more than once.  (This
      will be resolved shortly, but setting up these helpers first makes
      it all a bit easier to work with.)
      
      Rearrange the field order in a ceph_msg structure to group those
      that are used to define the possible data payloads.
      
      This partially resolves:
          http://tracker.ceph.com/issues/4263
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
      02afca6c