1. 13 12月, 2019 2 次提交
    • J
      Support concurrent CF iteration and drop (#6147) · c2029f97
      Jermy Li 提交于
      Summary:
      It's easy to cause coredump when closing ColumnFamilyHandle with unreleased iterators, especially iterators release is controlled by java GC when using JNI.
      
      This patch fixed concurrent CF iteration and drop, we let iterators(actually SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early.
      
      fixed https://github.com/facebook/rocksdb/issues/5982
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6147
      
      Differential Revision: D18926378
      
      fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15
      c2029f97
    • C
      wait pending memtable writes on file ingestion or compact range (#6113) · a8445912
      Connor 提交于
      Summary:
      **Summary:**
      This PR fixes two unordered_write related issues:
      - ingestion job may skip the necessary memtable flush https://github.com/facebook/rocksdb/issues/6026
      - compact range may cause memtable is flushed before pending unordered write finished
          1. `CompactRange` triggers memtable flush but doesn't wait for pending-writes
          2.  there are some pending writes but memtable is already flushed
          3.  the memtable related WAL is removed( note that the pending-writes were recorded in that WAL).
          4.  pending-writes write to newer created memtable
          5. there is a restart
          6. missing the previous pending-writes because WAL is removed but they aren't included in SST.
      
      **How to solve:**
      - Wait pending memtable writes before ingestion job check memtable key range
      - Wait pending memtable writes before flush memtable.
      **Note that: `CompactRange` calls `RangesOverlapWithMemtables` too without waiting for pending waits, but I'm not sure whether it affects the correctness.**
      
      **Test Plan:**
      make check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6113
      
      Differential Revision: D18895674
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: da22b4476fc7e06c176020e7cc171eb78189ecaf
      a8445912
  2. 12 12月, 2019 1 次提交
  3. 21 11月, 2019 2 次提交
    • Y
      Fix a data race between GetColumnFamilyMetaData and MarkFilesBeingCompacted (#6056) · 0ce0edbe
      Yanqin Jin 提交于
      Summary:
      Use db mutex to protect the execution of Version::GetColumnFamilyMetaData()
      called in DBImpl::GetColumnFamilyMetaData().
      Without mutex, GetColumnFamilyMetaData() races with MarkFilesBeingCompacted()
      for access to FileMetaData::being_compacted.
      Other than mutex, there are several more alternatives.
      
      - Make FileMetaData::being_compacted an atomic variable. This will make
        FileMetaData non-copy-able.
      
      - Separate being_compacted from FileMetaData. This requires re-organizing data
        structures that are already used in many places.
      
      Test Plan (dev server):
      ```
      make check
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6056
      
      Differential Revision: D18620488
      
      Pulled By: riversand963
      
      fbshipit-source-id: 87f89660b5d5e2ab4ef7962b7b2a7d00e346aa3b
      0ce0edbe
    • S
      Sanitize input in DB::MultiGet() API (#6054) · 27ec3b34
      sdong 提交于
      Summary:
      The new DB::MultiGet() doesn't validate input for num_keys > 1 and GCC-9 complains about it. Fix it by directly return when num_keys == 0
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6054
      
      Test Plan: Build with GCC-9 and see it passes.
      
      Differential Revision: D18608958
      
      fbshipit-source-id: 1c279aff3c7fe6e9d5a6d085ed02550ecea4fdb2
      27ec3b34
  4. 13 11月, 2019 1 次提交
    • A
      Batched MultiGet API for multiple column families (#5816) · 6c7b1a0c
      anand76 提交于
      Summary:
      Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.
      
      As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816
      
      Test Plan:
      make check
      make asan_check
      asan_crash_test
      
      Differential Revision: D18408676
      
      Pulled By: anand1976
      
      fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
      6c7b1a0c
  5. 26 10月, 2019 1 次提交
  6. 11 10月, 2019 1 次提交
    • V
      MultiGet batching in memtable (#5818) · 4c49e38f
      Vijay Nadimpalli 提交于
      Summary:
      RocksDB has a MultiGet() API that implements batched key lookup for higher performance (https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L468). Currently, batching is implemented in BlockBasedTableReader::MultiGet() for SST file lookups. One of the ways it improves performance is by pipelining bloom filter lookups (by prefetching required cachelines for all the keys in the batch, and then doing the probe) and thus hiding the cache miss latency. The same concept can be extended to the memtable as well. This PR involves implementing a pipelined bloom filter lookup in DynamicBloom, and implementing MemTable::MultiGet() that can leverage it.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5818
      
      Test Plan:
      Existing tests
      
      Performance Test:
      Ran the below command which fills up the memtable and makes sure there are no flushes and then call multiget. Ran it on master and on the new change and see atleast 1% performance improvement across all the test runs I did. Sometimes the improvement was upto 5%.
      
      TEST_TMPDIR=/data/users/$USER/benchmarks/feature/ numactl -C 10 ./db_bench -benchmarks="fillseq,multireadrandom" -num=600000 -compression_type="none" -level_compaction_dynamic_level_bytes -write_buffer_size=200000000 -target_file_size_base=200000000 -max_bytes_for_level_base=16777216 -reads=90000 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 -statistics -memtable_whole_key_filtering=true -memtable_bloom_size_ratio=10
      
      Differential Revision: D17578869
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: 23dc651d9bf49db11d22375bf435708875a1f192
      4c49e38f
  7. 09 10月, 2019 1 次提交
  8. 21 9月, 2019 1 次提交
  9. 18 9月, 2019 1 次提交
  10. 17 9月, 2019 3 次提交
    • A
      Allow users to stop manual compactions (#3971) · 62268300
      andrew 提交于
      Summary:
      Manual compaction may bring in very high load because sometime the amount of data involved in a compaction could be large, which may affect online service. So it would be good if the running compaction making the server busy can be stopped immediately. In this implementation, stopping manual compaction condition is only checked in slow process. We let deletion compaction and trivial move go through.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3971
      
      Test Plan: add tests at more spots.
      
      Differential Revision: D17369043
      
      fbshipit-source-id: 575a624fb992ce0bb07d9443eb209e547740043c
      62268300
    • M
      Charge block cache for cache internal usage (#5797) · 638d2395
      Maysam Yabandeh 提交于
      Summary:
      For our default block cache, each additional entry has extra memory overhead. It include LRUHandle (72 bytes currently) and the cache key (two varint64, file id and offset). The usage is not negligible. For example for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charging the cache for the extra usage, reducing untracked memory usage outside block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor.
      This PR builds up on https://github.com/facebook/rocksdb/issues/4258
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797
      
      Test Plan:
      - Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata.
      - The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata.
      
      Differential Revision: D17396833
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f
      638d2395
    • S
      Divide file_reader_writer.h and .cc (#5803) · b931f84e
      sdong 提交于
      Summary:
      file_reader_writer.h and .cc contain several files and helper function, and it's hard to navigate. Separate it to multiple files and put them under file/
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803
      
      Test Plan: Build whole project using make and cmake.
      
      Differential Revision: D17374550
      
      fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987
      b931f84e
  11. 14 9月, 2019 2 次提交
    • I
      Allow ingesting overlapping files (#5539) · 97631357
      Igor Canadi 提交于
      Summary:
      Currently IngestExternalFile() fails when its input files' ranges overlap. This condition doesn't need to hold for files that are to be ingested in L0, though.
      
      This commit allows overlapping files and forces their target level to L0.
      
      Additionally, ingest job's completion is logged to EventLogger, analogous to flush and compaction jobs.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5539
      
      Differential Revision: D17370660
      
      Pulled By: riversand963
      
      fbshipit-source-id: 749a3899b17d1be267a5afd5b0a99d96b38ab2f3
      97631357
    • A
      Refactor ArenaWrappedDBIter into separate files (#5801) · 83a6a614
      anand76 提交于
      Summary:
      Move definition and implementation for ArenaWrappedDBIter into its own .h/.cc files. Also, change inlining of functions to better comply with the Google C++ style guide.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5801
      
      Test Plan: make check
      
      Differential Revision: D17371012
      
      Pulled By: anand1976
      
      fbshipit-source-id: c1361abc2851575111e357a63d88be3b3d6cb341
      83a6a614
  12. 03 9月, 2019 1 次提交
    • V
      Persistent globally unique DB ID in manifest (#5725) · 979fbdc6
      Vijay Nadimpalli 提交于
      Summary:
      Each DB has a globally unique ID. A DB can be physically copied around, or backed-up and restored, and the users should be identify the same DB. This unique ID right now is stored as plain text in file IDENTITY under the DB directory. This approach introduces at least two problems: (1) the file is not checksumed; (2) the source of truth of a DB is the manifest file, which can be copied separately from IDENTITY file, causing the DB ID to be wrong.
      The goal of this PR is solve this problem by moving the  DB ID to manifest. To begin with we will write to both identity file and manifest. Write to Manifest is controlled via the flag write_dbid_to_manifest in Options and default is false.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5725
      
      Test Plan: Added unit tests.
      
      Differential Revision: D16963840
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: 8a86a4c8c82c716003c40fd6b9d2d758030d92e9
      979fbdc6
  13. 31 8月, 2019 1 次提交
    • Y
      Fix a bug in file ingestion (#5760) · 44eca41a
      Yanqin Jin 提交于
      Summary:
      Before this PR, when the number of column families involved in a file ingestion exceeds 2, a bug in the looping logic prevents correct file number being assigned to each ingestion job.
      Also skip deleting non-existing hard links during cleanup-after-failure.
      
      Test plan (devserver)
      ```
      $COMPILE_WITH_ASAN=1 make all
      $./external_sst_file_test --gtest_filter=ExternalSSTFileTest/ExternalSSTFileTest.IngestFilesIntoMultipleColumnFamilies_*/*
      $makke check
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5760
      
      Differential Revision: D17142982
      
      Pulled By: riversand963
      
      fbshipit-source-id: 06c1847a4e7a402647bcf28d124e70f2a0f9daf6
      44eca41a
  14. 24 8月, 2019 1 次提交
    • Z
      Refactor trimming logic for immutable memtables (#5022) · 2f41ecfe
      Zhongyi Xie 提交于
      Summary:
      MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
      We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
      The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
      In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022
      
      Differential Revision: D14394062
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
      2f41ecfe
  15. 17 8月, 2019 1 次提交
    • S
      Do readahead in VerifyChecksum() (#5713) · e1c468d1
      sdong 提交于
      Summary:
      Right now VerifyChecksum() doesn't do read-ahead. In some use cases, users won't be able to achieve good performance. With this change, by default, RocksDB will do a default readahead, and users will be able to overwrite the readahead size by passing in a ReadOptions.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5713
      
      Test Plan: Add a new unit test.
      
      Differential Revision: D16860874
      
      fbshipit-source-id: 0cff0fe79ac855d3d068e6ccd770770854a68413
      e1c468d1
  16. 16 8月, 2019 1 次提交
    • S
      Add command "list_file_range_deletes" in ldb (#5615) · bd2c753d
      sdong 提交于
      Summary:
      Add a command in ldb so that users can print out tombstones in SST files.
      In order to test the code, change the interface of LDBCommandRunner::RunCommand() so that it doesn't return from the program, but return the status code.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5615
      
      Test Plan: Add a new unit test
      
      Differential Revision: D16550326
      
      fbshipit-source-id: 88ddfe6984bdcbb3a528abdd115089df09eba52e
      bd2c753d
  17. 07 8月, 2019 1 次提交
    • V
      New API to get all merge operands for a Key (#5604) · d150e014
      Vijay Nadimpalli 提交于
      Summary:
      This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
      1. Update subset of columns and read subset of columns -
      Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
      2. Updating very few attributes in a value which is a JSON-like document -
      Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
      ----------------------------------------------------------------------------------------------------
      API :
      Status GetMergeOperands(
            const ReadOptions& options, ColumnFamilyHandle* column_family,
            const Slice& key, PinnableSlice* merge_operands,
            GetMergeOperandsOptions* get_merge_operands_options,
            int* number_of_operands)
      
      Example usage :
      int size = 100;
      int number_of_operands = 0;
      std::vector<PinnableSlice> values(size);
      GetMergeOperandsOptions merge_operands_info;
      db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);
      
      Description :
      Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
      merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604
      
      Test Plan:
      Added unit test and perf test in db_bench that can be run using the command:
      ./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist
      
      Differential Revision: D16657366
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
      d150e014
  18. 31 7月, 2019 1 次提交
    • E
      Improve CPU Efficiency of ApproximateSize (part 2) (#5609) · 4834dab5
      Eli Pozniansky 提交于
      Summary:
      In some cases, we don't have to get really accurate number. Something like 10% off is fine, we can create a new option for that use case. In this case, we can calculate size for full files first, and avoid estimation inside SST files if full files got us a huge number. For example, if we already covered 100GB of data, we should be able to skip partial dives into 10 SST files of 30MB.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5609
      
      Differential Revision: D16433481
      
      Pulled By: elipoz
      
      fbshipit-source-id: 5830b31e1c656d0fd3a00d7fd2678ddc8f6e601b
      4834dab5
  19. 30 7月, 2019 1 次提交
  20. 26 7月, 2019 2 次提交
    • E
      Added SizeApproximationOptions to DB::GetApproximateSizes (#5626) · 9625a2bc
      Eli Pozniansky 提交于
      Summary:
      The new DB::GetApproximateSizes with SizeApproximationOptions argument, which allows to add more options/knobs to the DB::GetApproximateSizes call (beyond only the include_flags)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5626
      
      Differential Revision: D16496913
      
      Pulled By: elipoz
      
      fbshipit-source-id: ee8c6c182330a285fa056ecfc3905a592b451720
      9625a2bc
    • Y
      Avoid user key copying for Get/Put/Write with user-timestamp (#5502) · ae152ee6
      Yanqin Jin 提交于
      Summary:
      In previous https://github.com/facebook/rocksdb/issues/5079, we added user-specified timestamp to `DB::Get()` and `DB::Put()`. Limitation is that these two functions may cause extra memory allocation and key copy. The reason is that `WriteBatch` does not allocate extra memory for timestamps because it is not aware of timestamp size, and we did not provide an API to assign/update timestamp of each key within a `WriteBatch`.
      We address these issues in this PR by doing the following.
      1. Add a `timestamp_size_` to `WriteBatch` so that `WriteBatch` can take timestamps into account when calling `WriteBatch::Put`, `WriteBatch::Delete`, etc.
      2. Add APIs `WriteBatch::AssignTimestamp` and `WriteBatch::AssignTimestamps` so that application can assign/update timestamps for each key in a `WriteBatch`.
      3. Avoid key copy in `GetImpl` by adding new constructor to `LookupKey`.
      
      Test plan (on devserver):
      ```
      $make clean && COMPILE_WITH_ASAN=1 make -j32 all
      $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
      $make check
      ```
      If the API extension looks good, I will add more unit tests.
      
      Some simple benchmark using db_bench.
      ```
      $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
      $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 -disable_wal=true
      ```
      Master is at a78503bd.
      ```
      |        | readrandom | fillrandom |
      | master | 15.53 MB/s | 25.97 MB/s |
      | PR5502 | 16.70 MB/s | 25.80 MB/s |
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5502
      
      Differential Revision: D16340894
      
      Pulled By: riversand963
      
      fbshipit-source-id: 51132cf792be07d1efc3ac33f5768c4ee2608bb8
      ae152ee6
  21. 23 7月, 2019 1 次提交
    • M
      WriteUnPrepared: improve read your own write functionality (#5573) · eae83274
      Manuel Ung 提交于
      Summary:
      There are a number of fixes in this PR (with most bugs found via the added stress tests):
      1. Re-enable reseek optimization. This was initially disabled to avoid infinite loops in https://github.com/facebook/rocksdb/pull/3955 but this can be resolved by remembering not to reseek after a reseek has already been done. This problem only affects forward iteration in `DBIter::FindNextUserEntryInternal`, as we already disable reseeking in `DBIter::FindValueForCurrentKeyUsingSeek`.
      2. Verify that ReadOption.snapshot can be safely used for iterator creation. Some snapshots would not give correct results because snaphsot validation would not be enforced, breaking some assumptions in Prev() iteration.
      3. In the non-snapshot Get() case, reads done at `LastPublishedSequence` may not be enough, because unprepared sequence numbers are not published. Use `std::max(published_seq, max_visible_seq)` to do lookups instead.
      4. Add stress test to test reading own writes.
      5. Minor bug in the allow_concurrent_memtable_write case where we forgot to pass in batch_per_txn_.
      6. Minor performance optimization in `CalcMaxUnpreparedSequenceNumber` by assigning by reference instead of value.
      7. Add some more comments everywhere.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5573
      
      Differential Revision: D16276089
      
      Pulled By: lth
      
      fbshipit-source-id: 18029c944eb427a90a87dee76ac1b23f37ec1ccb
      eae83274
  22. 18 7月, 2019 1 次提交
    • V
      Export Import sst files (#5495) · 22ce4624
      Venki Pallipadi 提交于
      Summary:
      Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135
      
      This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469
      "Add support for taking snapshot of a column family and creating column family from a given CF snapshot"
      
      We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.
      
      (1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
      // Exports all live SST files of a specified Column Family onto export_dir,
      // returning SST files information in metadata.
      // - SST files will be created as hard links when the directory specified
      //   is in the same partition as the db directory, copied otherwise.
      // - export_dir should not already exist and will be created by this API.
      // - Always triggers a flush.
      virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
                                        const std::string& export_dir,
                                        ExportImportFilesMetaData** metadata);
      
      Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
      metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
      returning the list of file metadata.
      
      (2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
      // CreateColumnFamilyWithImport() will create a new column family with
      // column_family_name and import external SST files specified in metadata into
      // this column family.
      // (1) External SST files can be created using SstFileWriter.
      // (2) External SST files can be exported from a particular column family in
      //     an existing DB.
      // Option in import_options specifies whether the external files are copied or
      // moved (default is copy). When option specifies copy, managing files at
      // external_file_path is caller's responsibility. When option specifies a
      // move, the call ensures that the specified files at external_file_path are
      // deleted on successful return and files are not modified on any error
      // return.
      // On error return, column family handle returned will be nullptr.
      // ColumnFamily will be present on successful return and will not be present
      // on error return. ColumnFamily may be present on any crash during this call.
      virtual Status CreateColumnFamilyWithImport(
          const ColumnFamilyOptions& options, const std::string& column_family_name,
          const ImportColumnFamilyOptions& import_options,
          const ExportImportFilesMetaData& metadata,
          ColumnFamilyHandle** handle);
      
      Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.
      
      If incoming sequence number is higher than current local sequence number, local sequence
      number is updated to reflect this.
      
      Note, as the sst files is are being moved across Column Families, Column Family name in sst file
      will no longer match the actual column family on destination DB. The API does not modify Column
      Family name or id in the sst files being imported.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495
      
      Differential Revision: D16018881
      
      fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
      22ce4624
  23. 16 7月, 2019 1 次提交
    • Z
      add more tracing for stats history (#5566) · b0259e45
      Zhongyi Xie 提交于
      Summary:
      Sample info log output from db_bench:
      In-memory:
      ```
      2019/07/12-21:39:19.478490 7fa01b3f5700 [_impl/db_impl.cc:702] ------- PERSISTING STATS -------
      2019/07/12-21:39:19.478633 7fa01b3f5700 [_impl/db_impl.cc:753] Storing 145 stats with timestamp 1562992759 to in-memory stats history
      2019/07/12-21:39:19.478670 7fa01b3f5700 [_impl/db_impl.cc:766] [Pre-GC] In-memory stats history size: 1051218 bytes, slice count: 103
      2019/07/12-21:39:19.478704 7fa01b3f5700 [_impl/db_impl.cc:775] [Post-GC] In-memory stats history size: 1051218 bytes, slice count: 102
      ```
      On-disk:
      ```
      2019/07/12-21:48:53.862548 7f24943f5700 [_impl/db_impl.cc:702] ------- PERSISTING STATS -------
      2019/07/12-21:48:53.862553 7f24943f5700 [_impl/db_impl.cc:709] Reading 145 stats from statistics
      2019/07/12-21:48:53.862852 7f24943f5700 [_impl/db_impl.cc:737] Writing 145 stats with timestamp 1562993333 to persistent stats CF succeeded
      ```
      ```
      2019/07/12-21:48:51.861711 7f24943f5700 [_impl/db_impl.cc:702] ------- PERSISTING STATS -------
      2019/07/12-21:48:51.861729 7f24943f5700 [_impl/db_impl.cc:709] Reading 145 stats from statistics
      2019/07/12-21:48:51.861921 7f24943f5700 [_impl/db_impl.cc:732] Writing to persistent stats CF failed -- Result incomplete: Write stall
      ...
      2019/07/12-21:48:51.873032 7f2494bf6700 [WARN] [lumn_family.cc:749] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5566
      
      Differential Revision: D16258187
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 292497099b941418590ed4312411bee36e244dc5
      b0259e45
  24. 07 7月, 2019 1 次提交
  25. 02 7月, 2019 1 次提交
  26. 22 6月, 2019 1 次提交
  27. 21 6月, 2019 1 次提交
    • H
      Add more callers for table reader. (#5454) · 705b8eec
      haoyuhuang 提交于
      Summary:
      This PR adds more callers for table readers. These information are only used for block cache analysis so that we can know which caller accesses a block.
      1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name.
      2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable.
      
      This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454
      
      Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32.
      
      Differential Revision: D15819451
      
      Pulled By: HaoyuHuang
      
      fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d
      705b8eec
  28. 18 6月, 2019 2 次提交
    • Y
      Override check consistency for DBImplSecondary (#5469) · 7d8d5641
      Yanqin Jin 提交于
      Summary:
      `DBImplSecondary` calls `CheckConsistency()` during open. In the past, `DBImplSecondary` did not override this function thus `DBImpl::CheckConsistency()` is called.
      The following can happen. The secondary instance is performing consistency check which calls `GetFileSize(file_path)` but the file at `file_path` is deleted by the primary instance. `DBImpl::CheckConsistency` does not account for this and fails the consistency check. This is undesirable. The solution is that, we call `DBImpl::CheckConsistency()` first. If it passes, then we are good. If not, we give it a second chance and handles the case of file(s) being deleted.
      
      Test plan (on dev server):
      ```
      $make clean && make -j20 all
      $./db_secondary_test
      ```
      All other existing unit tests must pass as well.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5469
      
      Differential Revision: D15861845
      
      Pulled By: riversand963
      
      fbshipit-source-id: 507d72392508caed3cd003bb2e2aa43f993dd597
      7d8d5641
    • Z
      Persistent Stats: persist stats history to disk (#5046) · 671d15cb
      Zhongyi Xie 提交于
      Summary:
      This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and  both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046
      
      Differential Revision: D15863138
      
      Pulled By: miasantreble
      
      fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b
      671d15cb
  29. 15 6月, 2019 1 次提交
    • S
      Validate CF Options when creating a new column family (#5453) · f1219644
      Sagar Vemuri 提交于
      Summary:
      It seems like CF Options are not properly validated  when creating a new column family with `CreateColumnFamily` API; only a selected few checks are done. Calling `ColumnFamilyData::ValidateOptions`, which is the single source for all CFOptions validations,  will help fix this. (`ColumnFamilyData::ValidateOptions` is already called at the time of `DB::Open`).
      
      **Test Plan:**
      Added a new test: `DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions`
      ```
      TEST_TMPDIR=/dev/shm ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions
      ```
      Also ran gtest-parallel to make sure the new test is not flaky.
      ```
      TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions --repeat=10000
      [10000/10000] DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions (15 ms)
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5453
      
      Differential Revision: D15816851
      
      Pulled By: sagar0
      
      fbshipit-source-id: 9e702b9850f5c4a7e0ef8d39e1e6f9b81e7fe1e5
      f1219644
  30. 14 6月, 2019 1 次提交
  31. 12 6月, 2019 1 次提交
  32. 11 6月, 2019 2 次提交
    • M
      Avoid deadlock between mutex_ and log_write_mutex_ (#5437) · c8c1a549
      Maysam Yabandeh 提交于
      Summary:
      To avoid deadlock mutex_ should never be acquired before log_write_mutex_. The patch documents that and also fixes one case in ::FlushWAL that acquires mutex_ through ::WriteStatusCheck when it already holds lock on log_write_mutex_.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5437
      
      Differential Revision: D15749722
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f57b69c44b4b80cc6d7ddf3d3fdf4a9eb5a5a45a
      c8c1a549
    • H
      Create a BlockCacheLookupContext to enable fine-grained block cache tracing. (#5421) · 5efa0d6b
      haoyuhuang 提交于
      Summary:
      BlockCacheLookupContext only contains the caller for now.
      We will trace block accesses at five places:
      1. BlockBasedTable::GetFilter.
      2. BlockBasedTable::GetUncompressedDict.
      3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.)
      4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
      5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
      
      We create the context at:
      1. BlockBasedTable::Get. (kUserGet)
      2. BlockBasedTable::MultiGet. (kUserMGet)
      3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.)
      4. BlockBasedTable::Open. (kPrefetch)
      5. Index/Filter::CacheDependencies. (kPrefetch)
      6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize).
      
      I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
      Throughput of this PR: 231334 ops/s.
      Throughput of the master branch: 238428 ops/s.
      
      Experiment setup:
      RocksDB:    version 6.2
      Date:       Mon Jun 10 10:42:51 2019
      CPU:        24 * Intel Core Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       20 bytes each
      Values:     100 bytes each (100 bytes after compression)
      Entries:    1000000
      Prefix:    20 bytes
      Keys per prefix:    0
      RawSize:    114.4 MB (estimated)
      FileSize:   114.4 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: NoCompression
      Compression sampling rate: 0
      Memtablerep: skip_list
      Perf Level: 1
      
      Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000
      
      Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120
      
      TODOs:
      1. Create a caller for external SST file ingestion and differentiate the callers for iterator.
      2. Integrate tracer to trace block cache accesses.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5421
      
      Differential Revision: D15704258
      
      Pulled By: HaoyuHuang
      
      fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a
      5efa0d6b