1. 12 Nov, 2017 1 commit
  2. 02 Nov, 2017 1 commit
    • M
      WritePrepared Txn: Optimize for recoverable state · 17731a43
      Maysam Yabandeh committed
      Summary:
      GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that is needed only during recovery, so it does not need to be stored in the memtable right after each commit.
      This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush.
      Closes https://github.com/facebook/rocksdb/pull/3071
      
      Differential Revision: D6148023
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7
      17731a43
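The WAL-only optimization described in 17731a43 can be illustrated with a minimal sketch. `ToyDB` and all of its members are hypothetical names for illustration, not RocksDB's API: commit-time state goes to the log only, is buffered in memory, and is applied to the memtable either during replay or right before a flush.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model of a DB with a WAL and a memtable (not RocksDB's API).
struct ToyDB {
  using KV = std::pair<std::string, std::string>;
  std::vector<KV> wal;                          // log replayed during recovery
  std::map<std::string, std::string> memtable;  // in-memory state
  std::vector<KV> buffered;                     // logged but not yet in memtable

  // Commit-time state is appended to the WAL only and buffered in memory.
  void WriteRecoverableState(const std::string& key, const std::string& value) {
    wal.emplace_back(key, value);
    buffered.emplace_back(key, value);
  }

  // Right before a flush, the buffered state is applied to the memtable so
  // that it survives deletion of the WAL. Returns the flushed contents.
  std::map<std::string, std::string> Flush() {
    for (const auto& kv : buffered) memtable[kv.first] = kv.second;
    buffered.clear();
    std::map<std::string, std::string> flushed;
    flushed.swap(memtable);
    wal.clear();  // the toy model drops the WAL once its data is flushed
    return flushed;
  }

  // Simulates a crash followed by recovery: in-memory state is lost and the
  // WAL is replayed into a fresh memtable.
  void CrashAndRecover() {
    memtable.clear();
    buffered.clear();
    for (const auto& kv : wal) memtable[kv.first] = kv.second;
  }
};
```

In this sketch the state is never visible in the memtable between commit and flush, which matches the commit message's point that the data is needed only during recovery.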
  3. 03 Oct, 2017 1 commit
    • M
      WritePrepared Txn: Rollback · d27258d3
      Maysam Yabandeh committed
      Summary:
      Implement the rollback of WritePrepared txns. For each modified value, it reads the value before the txn and writes it back. This cancels out the effect of the transaction. It also removes the rolled-back txn from the prepared heap.
      Closes https://github.com/facebook/rocksdb/pull/2946
      
      Differential Revision: D5937575
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a6d3c47f44db3729f44b287a80f97d08dc4e888d
      d27258d3
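The rollback approach in d27258d3 (write back the pre-transaction value of every modified key, then drop the txn from the prepared heap) can be sketched as follows. `ToyTxnDB`, `Store`, and `Rollback` are hypothetical names, and a `std::set` stands in for the prepared heap; the real implementation reads prior values from the DB rather than from a separate map.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

using Store = std::map<std::string, std::string>;

// Hypothetical toy database: current data plus the set of prepared txns.
struct ToyTxnDB {
  Store data;
  std::set<uint64_t> prepared;  // stand-in for the prepared heap
};

// Rolls back txn `prepare_seq`: for every key it modified, write back the
// value the key had before the txn (or delete the key if it did not exist).
void Rollback(ToyTxnDB& db, const Store& before_txn,
              const std::vector<std::string>& modified_keys,
              uint64_t prepare_seq) {
  for (const auto& key : modified_keys) {
    auto it = before_txn.find(key);
    if (it != before_txn.end()) {
      db.data[key] = it->second;
    } else {
      db.data.erase(key);
    }
  }
  db.prepared.erase(prepare_seq);
}
```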
  4. 14 Sep, 2017 1 commit
    • M
      WritePrepared Txn: Lock-free CommitMap · 09713a64
      Maysam Yabandeh committed
      Summary:
      We had two proposals for lock-free commit maps. This patch implements the latter one, which was simpler. We can later experiment with both proposals.

      In this impl each entry is a std::atomic of uint64_t, accessed via memory_order_acquire/release. On the x86_64 arch this compiles to simple reads and writes from memory.
      Closes https://github.com/facebook/rocksdb/pull/2861
      
      Differential Revision: D5800724
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 41abae9a4a5df050a8eb696c43de11c2770afdda
      09713a64
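A minimal sketch of the idea in 09713a64: a fixed-size array of `std::atomic<uint64_t>` entries read and written with acquire/release ordering. The class name, the cache size, and the packing of a 32-bit prepare sequence and a 32-bit commit sequence into one word are assumptions made for illustration; the commit message does not describe the actual encoding.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size commit cache mapping prepare seq -> commit seq.
// Assumed packing: high 32 bits = prepare seq, low 32 bits = commit seq.
// The value 0 marks an empty slot, so sequence numbers must be nonzero.
class CommitCache {
 public:
  static constexpr std::size_t kSize = 1024;

  CommitCache() {
    for (auto& e : entries_) e.store(0, std::memory_order_relaxed);
  }

  // Publishes the commit; overwrites whatever occupied the slot before.
  void AddCommitted(uint32_t prepare_seq, uint32_t commit_seq) {
    assert(prepare_seq != 0);
    uint64_t packed = (static_cast<uint64_t>(prepare_seq) << 32) | commit_seq;
    entries_[prepare_seq % kSize].store(packed, std::memory_order_release);
  }

  // Returns true and sets *commit_seq if the slot still holds prepare_seq.
  bool Lookup(uint32_t prepare_seq, uint32_t* commit_seq) const {
    uint64_t packed =
        entries_[prepare_seq % kSize].load(std::memory_order_acquire);
    if (packed == 0 || static_cast<uint32_t>(packed >> 32) != prepare_seq) {
      return false;  // empty slot, or the entry was evicted by another txn
    }
    *commit_seq = static_cast<uint32_t>(packed);
    return true;
  }

 private:
  std::array<std::atomic<uint64_t>, kSize> entries_;
};
```

Because each entry fits in a single atomic word, a reader sees either the old pair or the new pair, never a mix. The sketch simply reports a miss for evicted entries and leaves their handling out.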
  5. 09 Sep, 2017 1 commit
  6. 17 Aug, 2017 1 commit
    • M
      Update WritePrepared with the pseudo code · eb642530
      Maysam Yabandeh committed
      Summary:
      Implement the main body of the WritePrepared pseudo code. This includes PrepareInternal and CommitInternal, as well as AddCommitted, which updates the commit map. It also provides an IsInSnapshot method that could later be called from the read path to decide whether a version is in the read snapshot or should otherwise be skipped.

      This patch lacks unit tests and does not attempt to offer an efficient implementation. The idea is to have the API specified so that we can work on related tasks in parallel.
      Closes https://github.com/facebook/rocksdb/pull/2713
      
      Differential Revision: D5640021
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: bfa7a05e8d8498811fab714ce4b9c21530514e1c
      eb642530
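The role of AddCommitted and IsInSnapshot in eb642530 can be sketched with a simplified model. `ToyWritePrepared` is a hypothetical type that keeps an unbounded commit map, so it ignores the eviction cases a real implementation would need to handle: a version written at `prepare_seq` is visible to a snapshot only if its transaction committed at or before the snapshot's sequence number.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <unordered_map>

// Hypothetical simplified model; the commit map here is unbounded.
struct ToyWritePrepared {
  std::set<uint64_t> prepared;                        // prepared, uncommitted
  std::unordered_map<uint64_t, uint64_t> commit_map;  // prepare -> commit seq

  void Prepare(uint64_t prepare_seq) { prepared.insert(prepare_seq); }

  // Records the commit and removes the txn from the prepared set.
  void AddCommitted(uint64_t prepare_seq, uint64_t commit_seq) {
    commit_map[prepare_seq] = commit_seq;
    prepared.erase(prepare_seq);
  }

  // True if the version written at prepare_seq is visible to the snapshot.
  bool IsInSnapshot(uint64_t prepare_seq, uint64_t snapshot_seq) const {
    if (prepare_seq > snapshot_seq) return false;  // written after snapshot
    auto it = commit_map.find(prepare_seq);
    if (it == commit_map.end()) return false;      // not committed yet
    return it->second <= snapshot_seq;             // committed in time?
  }
};
```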
  7. 08 Aug, 2017 1 commit
    • M
      Refactor PessimisticTransaction · bdc056f8
      Maysam Yabandeh committed
      Summary:
      This patch splits Commit and Prepare into lock-related logic and db-write-related logic. It moves the lock-related logic to PessimisticTransaction to be reused by all child classes and moves the existing impl of the db-write-related logic to PrepareInternal, CommitSingleInternal, and CommitInternal in WriteCommittedTxnImpl.
      Closes https://github.com/facebook/rocksdb/pull/2691
      
      Differential Revision: D5569464
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: d1b8698e69801a4126c7bc211745d05c636f5325
      bdc056f8
  8. 06 Aug, 2017 1 commit
  9. 03 Aug, 2017 1 commit
    • M
      Refactor TransactionImpl · c3d5c4d3
      Maysam Yabandeh committed
      Summary:
      This patch refactors TransactionImpl by separating the logic for pessimistic concurrency control from the implementation of how to write the data to rocksdb. The existing implementation is named WriteCommittedTxnImpl as it writes committed data to the db. A template named WritePreparedTxnImpl is also added, which will later be completed to provide an alternative implementation.
      Closes https://github.com/facebook/rocksdb/pull/2676
      
      Differential Revision: D5549998
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 16298e86b43ca4849324c1f35c731913c6d17bec
      c3d5c4d3
  10. 29 Jul, 2017 1 commit
    • S
      Replace dynamic_cast<> · 21696ba5
      Siying Dong committed
      Summary:
      Replace dynamic_cast<> so that users can choose to build with RTTI off, saving several bytes per object and making slightly more memory available.
      Some nontrivial changes:
      1. Add Comparator::GetRootComparator() to get around the internal comparator hack
      2. Add the two experimental functions to DB
      3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
      4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
      Closes https://github.com/facebook/rocksdb/pull/2645
      
      Differential Revision: D5502723
      
      Pulled By: siying
      
      fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
      21696ba5
  11. 22 Jul, 2017 2 commits
  12. 16 Jul, 2017 1 commit
  13. 25 Jun, 2017 1 commit
    • M
      Optimize for serial commits in 2PC · 499ebb3a
      Maysam Yabandeh committed
      Summary:
      Throughput: 46k tps in our sysbench settings (filling the details later)
      
      The idea is to have the simplest change that gives us a reasonable boost
      in 2PC throughput.
      
      Major design changes:
      1. The WAL file internal buffer is not flushed after each write. Instead
      it is flushed before critical operations (WAL copy via fs) or when
      FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
      via mutex_.
      2. Use two sequence numbers: last seq, and last seq for write. Last seq
      is the last visible sequence number for reads. Last seq for write is the
      next sequence number that should be used to write to WAL/memtable. This
      allows a memtable write to proceed in parallel with WAL writes.
      3. BatchGroup is not used for writes. This means that we can have
      parallel writers which changes a major assumption in the code base. To
      accommodate for that i) allow only 1 WriteImpl that intends to write to
      memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
      come via group commit phase which is serial anyway, ii) make all the
      parts in the code base that assumed to be the only writer (via
      EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
      protected via a stat_mutex_.
      
      Note: the first commit has the approach figured out but is not clean.
      Submitting the PR anyway to get early feedback on the approach. If
      we are ok with the approach I will go ahead with these updates:
      0) Rebase with Yi's pipelining changes
      1) Currently batching is disabled by default to make sure that it will be
      consistent with all unit tests. Will make this optional via a config.
      2) A couple of unit tests are disabled. They need to be updated with the
      serial commit of 2PC taken into account.
      3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
      releasing mutex_ beforehand (the same way EnterUnbatched does). This
      needs to be cleaned up.
      Closes https://github.com/facebook/rocksdb/pull/2345
      
      Differential Revision: D5210732
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
      499ebb3a
  14. 28 Apr, 2017 1 commit
  15. 11 Apr, 2017 2 commits
    • M
      Fix shared lock upgrades · 9300ef54
      Manuel Ung committed
      Summary:
      Upgrading a shared lock was silently succeeding because the actual locking code was skipped. This is because if the keys are tracked, it is assumed that they are already locked and do not require locking. Fix this by recording in tracked keys whether the key was locked exclusively or not.
      
      Note that lock downgrades are impossible, which is the behaviour we expect.
      
      This fixes facebook/mysql-5.6#587.
      Closes https://github.com/facebook/rocksdb/pull/2122
      
      Differential Revision: D4861489
      
      Pulled By: IslamAbdelRahman
      
      fbshipit-source-id: 58c7ebe7af098bf01b9774b666d3e9867747d8fd
      9300ef54
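The fix in 9300ef54 can be sketched as follows: tracked keys remember whether the lock is held exclusively, so a shared-to-exclusive upgrade is no longer skipped. `ToyTxn`, `TrackedKeyInfo`, and the call counter are hypothetical; the counter stands in for the real lock manager so the behavior can be observed.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical per-key tracking record: was the lock taken exclusively?
struct TrackedKeyInfo {
  bool exclusive;
};

class ToyTxn {
 public:
  // Returns true if the (simulated) lock manager had to be called.
  bool TryLock(const std::string& key, bool exclusive) {
    auto it = tracked_.find(key);
    // Skip locking only if the tracked lock is already strong enough:
    // an exclusive lock covers everything, a shared lock covers shared.
    if (it != tracked_.end() && (it->second.exclusive || !exclusive)) {
      return false;
    }
    ++lock_manager_calls_;  // stand-in for the real locking code
    tracked_[key] = TrackedKeyInfo{exclusive};
    return true;
  }

  int lock_manager_calls() const { return lock_manager_calls_; }

 private:
  std::map<std::string, TrackedKeyInfo> tracked_;
  int lock_manager_calls_ = 0;
};
```

A shared request on a key already held exclusively is a no-op here, which mirrors the note that downgrades are impossible.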
    • M
      Limit maximum memory used in the WriteBatch representation · 1f8b119e
      Manuel Ung committed
      Summary:
      Extend TransactionOptions to include max_write_batch_size, which determines the maximum size of the writebatch representation. If the memory limit is exceeded, the operation will abort with subcode kMemoryLimit.
      Closes https://github.com/facebook/rocksdb/pull/2124
      
      Differential Revision: D4861842
      
      Pulled By: lth
      
      fbshipit-source-id: 46fd172ea67cc90bbba829bf0d70cfab2261c161
      1f8b119e
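The size limit from 1f8b119e can be illustrated with a standalone sketch. `BoundedBatch` and `ToyStatus` are hypothetical stand-ins rather than RocksDB's WriteBatch or Status, and treating a limit of 0 as "unlimited" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class SubCode { kNone, kMemoryLimit };

// Hypothetical minimal status type for the sketch.
struct ToyStatus {
  bool aborted = false;
  SubCode subcode = SubCode::kNone;
  bool ok() const { return !aborted; }
};

class BoundedBatch {
 public:
  // max_bytes == 0 is treated as "no limit" (an assumption of this sketch).
  explicit BoundedBatch(std::size_t max_bytes) : max_bytes_(max_bytes) {}

  // Appends the pair unless doing so would exceed the size limit.
  ToyStatus Put(const std::string& key, const std::string& value) {
    ToyStatus s;
    std::size_t needed = key.size() + value.size();
    if (max_bytes_ != 0 && rep_.size() + needed > max_bytes_) {
      s.aborted = true;
      s.subcode = SubCode::kMemoryLimit;
      return s;  // the batch is left unchanged
    }
    rep_ += key;
    rep_ += value;
    return s;
  }

  std::size_t size() const { return rep_.size(); }

 private:
  std::size_t max_bytes_;
  std::string rep_;  // serialized representation of the batch
};
```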
  16. 06 Dec, 2016 1 commit
    • M
      Implement non-exclusive locks · 2005c88a
      Manuel Ung committed
      Summary:
      This is an implementation of non-exclusive locks for pessimistic transactions. It is relatively simple and does not prevent starvation (i.e. it's possible that a request for exclusive access will never be granted if there are always threads holding shared access). It is done by changing `KeyLockInfo` to hold a set of transaction ids, instead of just one, and adding a flag specifying whether this lock is currently held with exclusive access or not.
      
      Some implementation notes:
      - Some lock diagnostic functions had to be updated to return a set of transaction ids for a given lock, eg. `GetWaitingTxn` and `GetLockStatusData`.
      - Deadlock detection is a bit more complicated since a transaction can now wait on multiple other transactions. A BFS is done in this case, and deadlock detection depth is now just a limit on the number of transactions we visit.
      - Expirable transactions do not work efficiently with shared locks at the moment, but that's okay for now.
      Closes https://github.com/facebook/rocksdb/pull/1573
      
      Differential Revision: D4239097
      
      Pulled By: lth
      
      fbshipit-source-id: da7c074
      2005c88a
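The data-structure change in 2005c88a (a set of holders plus an exclusive flag) can be sketched as below. The `KeyLockInfo` fields and the `TryAcquire` function are simplified, hypothetical versions; waiting, expiration, and the upgrade rule for a sole holder are assumptions of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using TxnID = uint64_t;

// Simplified lock record: who holds the lock, and is it exclusive?
struct KeyLockInfo {
  std::set<TxnID> txn_ids;
  bool exclusive = false;
};

// Non-blocking acquire; returns false where a real lock manager would wait.
bool TryAcquire(KeyLockInfo& lock, TxnID txn, bool exclusive) {
  if (lock.txn_ids.empty()) {  // free lock: take it in the requested mode
    lock.txn_ids.insert(txn);
    lock.exclusive = exclusive;
    return true;
  }
  bool sole_holder = lock.txn_ids.size() == 1 && lock.txn_ids.count(txn) == 1;
  if (sole_holder) {  // re-acquire or upgrade; never downgrade
    lock.exclusive = lock.exclusive || exclusive;
    return true;
  }
  if (!lock.exclusive && !exclusive) {  // shared access can be shared
    lock.txn_ids.insert(txn);
    return true;
  }
  return false;  // conflicting request
}
```

Nothing in this sketch queues exclusive requests ahead of new shared ones, so it shows the same starvation possibility the commit message mentions.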
  17. 20 Oct, 2016 1 commit
    • M
      Implement deadlock detection · 4edd39fd
      Manuel Ung committed
      Summary: Implement deadlock detection. This is done by maintaining a TxnID -> TxnID map which represents the edges in the wait-for graph (this is named `wait_txn_map_`).
      
      Test Plan: transaction_test
      
      Reviewers: IslamAbdelRahman, sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D64491
      4edd39fd
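With a TxnID -> TxnID map as in 4edd39fd, each transaction waits on at most one other, so a cycle check is a walk along a chain. The function below is a hypothetical sketch; the `max_depth` parameter and the choice to report a deadlock conservatively once the depth limit is exhausted are assumptions of the sketch, not details taken from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using TxnID = uint64_t;

// Returns true if letting `waiter` wait on `holder` would close a cycle in
// the wait-for graph. Each edge in wait_txn_map means "key waits on value".
bool WouldDeadlock(const std::unordered_map<TxnID, TxnID>& wait_txn_map,
                   TxnID waiter, TxnID holder, int max_depth) {
  TxnID next = holder;
  for (int i = 0; i < max_depth; ++i) {
    if (next == waiter) return true;  // the chain leads back to the waiter
    auto it = wait_txn_map.find(next);
    if (it == wait_txn_map.end()) return false;  // chain ends: no cycle
    next = it->second;
  }
  return true;  // depth exhausted: conservatively assume a deadlock
}
```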
  18. 08 Oct, 2016 2 commits
    • R
      Expose Transaction State Publicly · 37737c3a
      Reid Horuff committed
      Summary:
      This exposes a transaction's state through a public api rather than through a public member variable. I also do some name refactoring.
      ExecutionStatus => TransactionState
      exec_status_ => trx_state_
      
      Test Plan: It compiles and transaction_test passes.
      
      Reviewers: IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: andrewkr, mung, dhruba, sdong
      
      Differential Revision: https://reviews.facebook.net/D64689
      37737c3a
    • R
      Add facility to write only a portion of WriteBatch to WAL · 2c1f9529
      Reid Horuff committed
      Summary:
      When constructing a write batch, a client may now call MarkWalTerminationPoint() on that batch. No batch operations after this call will be written to the WAL, but they will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement.
      
      ```
      RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off
      INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628
      
      RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off
      INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054
      ```
      
      Test Plan: Two unit tests added.
      
      Reviewers: sdong, yiwu, IslamAbdelRahman
      
      Reviewed By: yiwu
      
      Subscribers: hermanlee4, dhruba, andrewkr
      
      Differential Revision: https://reviews.facebook.net/D64599
      2c1f9529
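The effect of a WAL termination point, as described in 2c1f9529, can be sketched with a toy batch. `ToyWriteBatch` is hypothetical and only borrows the name `MarkWalTerminationPoint` from the commit message; operations are stored as plain strings for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical batch that records operations as "key=value" strings.
class ToyWriteBatch {
 public:
  void Put(const std::string& key, const std::string& value) {
    ops_.push_back(key + "=" + value);
  }

  // Operations added after this call are excluded from the WAL.
  void MarkWalTerminationPoint() {
    wal_end_ = ops_.size();
    marked_ = true;
  }

  // The prefix of the batch that would be written to the WAL.
  std::vector<std::string> WalOps() const {
    std::size_t end = marked_ ? wal_end_ : ops_.size();
    return std::vector<std::string>(
        ops_.begin(), ops_.begin() + static_cast<std::ptrdiff_t>(end));
  }

  // The memtable still receives every operation in the batch.
  const std::vector<std::string>& MemtableOps() const { return ops_; }

 private:
  std::vector<std::string> ops_;
  std::size_t wal_end_ = 0;
  bool marked_ = false;
};
```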
  19. 01 Oct, 2016 1 commit
    • M
      Expose transaction id, lock state information and transaction wait information · be1f1092
      Manuel Ung committed
      Summary:
      This diff does 3 things:
      
      Expose TransactionID so that we can identify transactions when we retrieve locking and lock wait information. This is exposed as `Transaction::GetID`.
      
      Expose lock state information by locking all stripes in all column families and copying their contents to a data structure. This is exposed as `TransactionDB::GetLockStatusData`.
      
      Adds support for tracking the transaction and the key being waited on, and exposes this as `Transaction::GetWaitingTxn`.
      
      Test Plan: unit tests
      
      Reviewers: horuff, sdong
      
      Reviewed By: sdong
      
      Subscribers: vasilep, hermanlee4, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D64413
      be1f1092
  20. 12 Aug, 2016 1 commit
  21. 18 May, 2016 1 commit
  22. 11 May, 2016 1 commit
  23. 08 Mar, 2016 1 commit
  24. 01 Mar, 2016 1 commit
    • A
      TransactionDB:ReinitializeTransaction · 5ea9aa3c
      agiardullo committed
      Summary: Add function to reinitialize a transaction object so that it can be reused.  This is an optimization so users can potentially avoid reallocating transaction objects.
      
      Test Plan: added tests
      
      Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
      
      Reviewed By: sdong
      
      Subscribers: jkedgar, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D53835
      5ea9aa3c
  25. 10 Feb, 2016 2 commits
    • B
      Updated all copyright headers to the new format. · 21e95811
      Baraa Hamodi committed
      21e95811
    • A
      Transaction::UndoGetForUpdate · fe93bf9b
      agiardullo committed
      Summary: MyRocks wants to be able to unlock a key that was just locked by GetForUpdate().  To do this safely, I am now keeping track of the number of reads (for update) and writes for each key in a transaction.  UndoGetForUpdate() will only unlock a key if it hasn't been written and the read count reaches 0.
      
      Test Plan: more unit tests
      
      Reviewers: igor, rven, yhchiang, spetrunia, sdong
      
      Reviewed By: spetrunia, sdong
      
      Subscribers: spetrunia, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D47043
      fe93bf9b
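The per-key counting described in fe93bf9b can be sketched as follows. `ToyTxn` and `KeyInfo` are hypothetical; a key counts as locked for as long as it is tracked, and the methods only model the bookkeeping, not actual reads or writes.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical per-key counters kept by a transaction.
struct KeyInfo {
  uint32_t num_reads = 0;   // GetForUpdate calls not yet undone
  uint32_t num_writes = 0;  // writes to the key in this transaction
};

class ToyTxn {
 public:
  void GetForUpdate(const std::string& key) { ++tracked_[key].num_reads; }
  void Put(const std::string& key) { ++tracked_[key].num_writes; }

  // Returns true if the key was unlocked by this call.
  bool UndoGetForUpdate(const std::string& key) {
    auto it = tracked_.find(key);
    if (it == tracked_.end() || it->second.num_reads == 0) return false;
    --it->second.num_reads;
    if (it->second.num_reads == 0 && it->second.num_writes == 0) {
      tracked_.erase(it);  // no reads or writes left: release the lock
      return true;
    }
    return false;
  }

  bool IsLocked(const std::string& key) const {
    return tracked_.count(key) == 1;
  }

 private:
  std::map<std::string, KeyInfo> tracked_;
};
```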
  26. 03 Feb, 2016 1 commit
  27. 29 Jan, 2016 1 commit
  28. 12 Dec, 2015 1 commit
    • A
      Use SST files for Transaction conflict detection · 3bfd3d39
      agiardullo committed
      Summary:
      Currently, transactions can fail even if there is no actual write conflict.  This is due to relying on only the memtables to check for write-conflicts.  Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.
      
      With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts.  This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot.  Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).
      
      Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread.  Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.
      
      Test Plan: unit tests, db bench
      
      Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb, yoshinorim
      
      Differential Revision: https://reviews.facebook.net/D50475
      3bfd3d39
  29. 10 Oct, 2015 1 commit
    • A
      Deferred snapshot creation in transactions · def74f87
      agiardullo committed
      Summary: Support for Transaction::CreateSnapshotOnNextOperation().  This is to fix a write-conflict race-condition that Yoshinori was running into when testing MyRocks with LinkBench.
      
      Test Plan: New tests
      
      Reviewers: yhchiang, spetrunia, rven, igor, yoshinorim, sdong
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D48099
      def74f87
  30. 30 Sep, 2015 1 commit
  31. 12 Sep, 2015 1 commit
  32. 10 Sep, 2015 1 commit
    • A
      Transaction stats · aa6eed0c
      agiardullo committed
      Summary: Added functions to fetch the number of locked keys in a transaction, the number of pending puts/merges/deletes, and the elapsed time.
      
      Test Plan: unit tests
      
      Reviewers: yoshinorim, jkedgar, rven, sdong, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D45417
      aa6eed0c
  33. 09 Sep, 2015 1 commit
    • A
      TransactionDB Custom Locking API · 5e94f68f
      agiardullo committed
      Summary:
      Prototype of API to allow MyRocks to override default Mutex/CondVar used by transactions with their own implementations.  They would simply need to pass their own implementations of Mutex/CondVar to the templated TransactionDB::Open().
      
      Default implementation of TransactionDBMutex/TransactionDBCondVar provided (but the code is not currently changed to use this).
      
      Let me know if this API makes sense or if it should be changed.
      
      Test Plan: n/a
      
      Reviewers: yhchiang, rven, igor, sdong, spetrunia
      
      Reviewed By: spetrunia
      
      Subscribers: maykov, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D43761
      5e94f68f
  34. 25 Aug, 2015 1 commit
    • A
      Common base class for transactions · 20d1e547
      agiardullo committed
      Summary:
      As I keep adding new features to transactions, I keep creating more duplicate code.  This diff cleans this up by creating a base implementation class for Transaction and OptimisticTransaction to inherit from.
      
      The code in TransactionBase.h/.cc is all just copied from elsewhere.  The only entertaining part of this class worth looking at is the virtual TryLock method which allows OptimisticTransactions and Transactions to share the same common code for Put/Get/etc.
      
      The rest of this diff is mostly red and easy on the eyes.
      
      Test Plan: No functionality change.  existing tests pass.
      
      Reviewers: sdong, jkedgar, rven, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D45135
      20d1e547
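The design in 20d1e547, where a shared base class implements Put/Get once and defers to a virtual TryLock, can be sketched like this. All class names are hypothetical, and a `std::set` of key names stands in for a real lock table.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Shared write path: lock (or track) the key first, then buffer the write.
class ToyTransactionBase {
 public:
  virtual ~ToyTransactionBase() = default;

  bool Put(const std::string& key, const std::string& value) {
    if (!TryLock(key)) return false;
    writes_[key] = value;
    return true;
  }

  const std::map<std::string, std::string>& writes() const { return writes_; }

 protected:
  virtual bool TryLock(const std::string& key) = 0;

 private:
  std::map<std::string, std::string> writes_;
};

// Pessimistic flavor: takes the lock up front and fails on conflict.
class ToyPessimisticTxn : public ToyTransactionBase {
 public:
  explicit ToyPessimisticTxn(std::set<std::string>* lock_table)
      : lock_table_(lock_table) {}

 protected:
  bool TryLock(const std::string& key) override {
    if (owned_.count(key) == 1) return true;             // already ours
    if (!lock_table_->insert(key).second) return false;  // held elsewhere
    owned_.insert(key);
    return true;
  }

 private:
  std::set<std::string>* lock_table_;  // shared between transactions
  std::set<std::string> owned_;
};

// Optimistic flavor: only records the key for validation at commit time.
class ToyOptimisticTxn : public ToyTransactionBase {
 public:
  const std::set<std::string>& tracked() const { return tracked_; }

 protected:
  bool TryLock(const std::string& key) override {
    tracked_.insert(key);
    return true;
  }

 private:
  std::set<std::string> tracked_;
};
```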
  35. 12 Aug, 2015 2 commits
    • A
      Have Transactions use WriteBatch::RollbackToSavePoint · c3466eab
      agiardullo committed
      Summary:
      Clean up transactions to use the new RollbackToSavePoint api in WriteBatchWithIndex.
      
      Note, this diff depends on Pessimistic Transactions diff and ManagedSnapshot diff (D40869 and D43293).
      
      Test Plan: unit tests
      
      Reviewers: rven, yhchiang, kradhakrishnan, spetrunia, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D43371
      c3466eab
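The save point mechanism that c3466eab relies on can be sketched with a toy batch. `ToySavePointBatch` is a hypothetical type: it keeps a stack of batch sizes and truncates back to the most recent one, returning false when no save point exists.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical batch with nested save points.
class ToySavePointBatch {
 public:
  void Put(const std::string& key, const std::string& value) {
    ops_.push_back(key + "=" + value);
  }

  // Remembers the current end of the batch.
  void SetSavePoint() { save_points_.push_back(ops_.size()); }

  // Drops every operation added since the most recent save point.
  bool RollbackToSavePoint() {
    if (save_points_.empty()) return false;  // nothing to roll back to
    ops_.resize(save_points_.back());
    save_points_.pop_back();
    return true;
  }

  std::size_t Count() const { return ops_.size(); }

 private:
  std::vector<std::string> ops_;
  std::vector<std::size_t> save_points_;  // used as a stack
};
```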
    • A
      Pessimistic Transactions · c2f2cb02
      agiardullo committed
      Summary:
      Initial implementation of Pessimistic Transactions.  This diff contains the api changes discussed in D38913.  This diff is pretty large, so let me know if people would prefer to meet up to discuss it.
      
      MyRocks folks:  please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.
      
      Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint().  After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex.  We can then decide which route is preferable.
      
      Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.
      
      Test Plan: Unit tests, db_bench parallel testing.
      
      Reviewers: igor, rven, sdong, yhchiang, yoshinorim
      
      Reviewed By: sdong
      
      Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D40869
      c2f2cb02