1. 07 Aug 2019 (1 commit)
    • New API to get all merge operands for a Key (#5604) · d150e014
      Vijay Nadimpalli authored
      Summary:
      This is a new API added to db.h to allow fetching all merge operands associated with a key. The main motivation for this API is to support use cases that are performance sensitive and where doing a full online merge is not necessary. Example use cases:
      1. Updating a subset of columns and reading a subset of columns -
      Imagine a SQL table where a row is encoded as a K/V pair (as is done in MyRocks). If there are many columns and users only updated one of them, we can use the merge operator to reduce write amplification. If users read only one or two columns in the read query, this feature can avoid a full merge of the whole row and save some CPU.
      2. Updating very few attributes in a value which is a JSON-like document -
      Updating one attribute can be done efficiently using the merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
      ----------------------------------------------------------------------------------------------------
      API:
      Status GetMergeOperands(
            const ReadOptions& options, ColumnFamilyHandle* column_family,
            const Slice& key, PinnableSlice* merge_operands,
            GetMergeOperandsOptions* get_merge_operands_options,
            int* number_of_operands)
      
      Example usage:
      int size = 100;
      int number_of_operands = 0;
      std::vector<PinnableSlice> values(size);
      GetMergeOperandsOptions merge_operands_info;
      merge_operands_info.expected_max_number_of_operands = size;
      db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), &merge_operands_info, &number_of_operands);
      
      Description:
      Returns all the merge operands corresponding to the key. If the number of merge operands in the DB is greater than merge_operands_options.expected_max_number_of_operands, no merge operands are returned and the status is Incomplete. Merge operands are returned in the order of insertion.
      merge_operands -> points to an array of at least merge_operands_options.expected_max_number_of_operands PinnableSlice entries; the caller is responsible for allocating it. If the status returned is Incomplete, number_of_operands will contain the total number of merge operands found in the DB for the key.
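
      Below is a minimal sketch of one way a caller might handle the Incomplete case by retrying with a larger buffer (the retry loop and variable names are illustrative, not part of the patch):
      ```
      #include <vector>
      #include "rocksdb/db.h"

      // Fetch all merge operands for `key`, growing the operand buffer when the
      // initial guess is smaller than the number of operands stored in the DB.
      std::vector<rocksdb::PinnableSlice> FetchAllOperands(rocksdb::DB* db,
                                                           const rocksdb::Slice& key) {
        int expected = 16;
        while (true) {
          std::vector<rocksdb::PinnableSlice> operands(expected);
          rocksdb::GetMergeOperandsOptions opts;
          opts.expected_max_number_of_operands = expected;
          int count = 0;
          rocksdb::Status s = db->GetMergeOperands(
              rocksdb::ReadOptions(), db->DefaultColumnFamily(), key,
              operands.data(), &opts, &count);
          if (s.IsIncomplete()) {
            expected = count;  // Incomplete: count holds the total found in the DB
            continue;
          }
          if (!s.ok()) {
            return {};
          }
          operands.resize(count);
          return operands;
        }
      }
      ```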
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604
      
      Test Plan:
      Added unit test and perf test in db_bench that can be run using the command:
      ./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist
      
      Differential Revision: D16657366
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
  2. 02 Aug 2019 (1 commit)
    • Fix duplicated file names in PurgeObsoleteFiles (#5603) · d1c9ede1
      Zhongyi Xie authored
      Summary:
      Currently in `DBImpl::PurgeObsoleteFiles`, the list of candidate files is created by combining the results of calling LogFileName on `log_delete_files` with `full_scan_candidate_files`.
      
      In full_scan_candidate_files, the filenames look like this:
      {file_name = "074715.log", file_path = "/txlogs/3306"},
      but LogFileName produces filenames that prepend a slash:
      {file_name = "/074715.log", file_path = "/txlogs/3306"},
      
      This confuses the dedup step here: https://github.com/facebook/rocksdb/blob/bb4178066dc4f18b9b7f1d371e641db027b3edbe/db/db_impl/db_impl_files.cc#L339-L345
      
      Because duplicates still exist, DeleteFile is called on the same file twice, and hits an error on the second try. Error message: Failed to mark /txlogs/3302/764418.log as trash.
      
      The root cause is the use of `kDumbDbName` when generating file names; it creates file names like /074715.log. This PR removes the use of `kDumbDbName` and creates paths without a leading '/' when the dbname can be ignored.
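
      For illustration, a hypothetical helper showing how an empty dbname plus unconditional path joining yields the stray leading slash that defeats the dedup (this is a sketch, not the actual LogFileName code):
      ```
      #include <cstdint>
      #include <cstdio>
      #include <string>

      // Hypothetical sketch: joining a (possibly empty) dbname with "/<number>.log"
      // produces "/074715.log" when dbname is "", which never matches the
      // "074715.log" entries collected by the full directory scan.
      std::string MakeLogFileName(const std::string& dbname, uint64_t number) {
        char buf[32];
        std::snprintf(buf, sizeof(buf), "%06llu.log",
                      static_cast<unsigned long long>(number));
        return dbname + "/" + buf;
      }
      ```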
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5603
      
      Test Plan: make check
      
      Differential Revision: D16413203
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 6ba8288382c55f7d5e3892d722fc94b57d2e4491
  3. 01 Aug 2019 (2 commits)
    • Test the various configurations in parallel in MergeOperatorPinningTest (#5659) · 1dfc5eaa
      Levi Tamasi authored
      Summary:
      MergeOperatorPinningTest.Randomized frequently times out under TSAN
      because it tests ~40 option configurations sequentially in a loop. The
      patch parallelizes the tests of the various configurations to make the
      test complete faster.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5659
      
      Test Plan: Tested using buck test mode/dev-tsan ...
      
      Differential Revision: D16587518
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 65bd25c0ad9a23587fed5592e69c1a0097fa27f6
    • WriteUnPrepared: savepoint support (#5627) · f622ca2c
      Manuel Ung authored
      Summary:
      Add savepoint support when the current transaction has flushed unprepared batches.
      
      Rolling back to a savepoint is similar to rolling back a transaction. It requires the set of keys that have changed since the savepoint, re-reading the keys at the snapshot at that savepoint, and then restoring the old keys by writing out another unprepared batch.
      
      For this strategy to work though, we must be capable of reading keys at a savepoint. This does not work if keys were written out using the same sequence number before and after a savepoint. Therefore, when we flush out unprepared batches, we must split the batch by savepoint if any savepoints exist.
      
      For example, if we have the following:
      ```
      Put(A)
      Put(B)
      Put(C)
      SetSavePoint()
      Put(D)
      Put(E)
      SetSavePoint()
      Put(F)
      ```
      
      Then we will write out 3 separate unprepared batches:
      ```
      Put(A) 1
      Put(B) 1
      Put(C) 1
      Put(D) 2
      Put(E) 2
      Put(F) 3
      ```
      
      This is so that when we roll back to, e.g., the first savepoint, we can just read keys at snapshot_seq = 1.
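
      A minimal usage sketch of savepoints under the WriteUnprepared policy, using the standard TransactionDB APIs (error handling omitted; the inline comments restate the batch-splitting behavior described above):
      ```
      #include <string>
      #include "rocksdb/utilities/transaction_db.h"

      void SavepointExample(const std::string& path) {
        rocksdb::Options options;
        options.create_if_missing = true;
        rocksdb::TransactionDBOptions txn_db_options;
        txn_db_options.write_policy = rocksdb::TxnDBWritePolicy::WRITE_UNPREPARED;

        rocksdb::TransactionDB* db = nullptr;
        rocksdb::TransactionDB::Open(options, txn_db_options, path, &db);

        rocksdb::Transaction* txn = db->BeginTransaction(rocksdb::WriteOptions());
        txn->Put("A", "a");
        txn->Put("B", "b");
        txn->Put("C", "c");
        txn->SetSavePoint();           // unprepared batches after this point are split off
        txn->Put("D", "d");
        txn->Put("E", "e");
        txn->RollbackToSavePoint();    // D and E are undone; A, B, C remain
        txn->Commit();

        delete txn;
        delete db;
      }
      ```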
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5627
      
      Differential Revision: D16584130
      
      Pulled By: lth
      
      fbshipit-source-id: 6d100dd548fb20c4b76661bd0f8a2647e64477fa
  4. 31 Jul 2019 (1 commit)
    • Improve CPU Efficiency of ApproximateSize (part 2) (#5609) · 4834dab5
      Eli Pozniansky authored
      Summary:
      In some cases, we don't have to get a really accurate number. Something like 10% off is fine, and we can create a new option for that use case. In this case, we can calculate the size of full files first and avoid estimation inside SST files if the full files already give us a huge number. For example, if we have already covered 100GB of data, we should be able to skip partial dives into 10 SST files of 30MB each.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5609
      
      Differential Revision: D16433481
      
      Pulled By: elipoz
      
      fbshipit-source-id: 5830b31e1c656d0fd3a00d7fd2678ddc8f6e601b
  5. 30 Jul 2019 (1 commit)
  6. 26 Jul 2019 (2 commits)
    • Added SizeApproximationOptions to DB::GetApproximateSizes (#5626) · 9625a2bc
      Eli Pozniansky authored
      Summary:
      Adds a new DB::GetApproximateSizes overload with a SizeApproximationOptions argument, which allows adding more options/knobs to the DB::GetApproximateSizes call (beyond only the include_flags).
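
      A sketch of how the new overload might be used; `include_files` and `files_size_error_margin` are my recollection of the knobs in SizeApproximationOptions rather than fields quoted from this commit:
      ```
      #include "rocksdb/db.h"

      // Approximate the on-disk size of a key range, trading a bit of accuracy
      // for a cheaper estimate.
      uint64_t ApproxRangeSize(rocksdb::DB* db, const rocksdb::Slice& start,
                               const rocksdb::Slice& limit) {
        rocksdb::SizeApproximationOptions opts;
        opts.include_files = true;
        opts.files_size_error_margin = 0.1;  // assumed knob: accept roughly 10% error
        rocksdb::Range range(start, limit);
        uint64_t size = 0;
        db->GetApproximateSizes(opts, db->DefaultColumnFamily(), &range, 1, &size);
        return size;
      }
      ```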
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5626
      
      Differential Revision: D16496913
      
      Pulled By: elipoz
      
      fbshipit-source-id: ee8c6c182330a285fa056ecfc3905a592b451720
    • Avoid user key copying for Get/Put/Write with user-timestamp (#5502) · ae152ee6
      Yanqin Jin authored
      Summary:
      In the earlier https://github.com/facebook/rocksdb/issues/5079, we added user-specified timestamps to `DB::Get()` and `DB::Put()`. The limitation is that these two functions may cause extra memory allocation and key copying. The reason is that `WriteBatch` does not allocate extra memory for timestamps because it is not aware of the timestamp size, and we did not provide an API to assign/update the timestamp of each key within a `WriteBatch`.
      We address these issues in this PR by doing the following.
      1. Add a `timestamp_size_` to `WriteBatch` so that `WriteBatch` can take timestamps into account when calling `WriteBatch::Put`, `WriteBatch::Delete`, etc.
      2. Add APIs `WriteBatch::AssignTimestamp` and `WriteBatch::AssignTimestamps` so that applications can assign/update timestamps for each key in a `WriteBatch`.
      3. Avoid key copy in `GetImpl` by adding new constructor to `LookupKey`.
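
      A rough sketch of how the pieces above might fit together; the three-argument WriteBatch constructor and the AssignTimestamp signature are assumptions based on the APIs named in this summary, not verbatim from the patch:
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/write_batch.h"

      // Hypothetical usage: size the batch for 8-byte timestamps, stage the writes,
      // then assign the actual timestamp right before handing the batch to Write().
      rocksdb::Status WriteWithTimestamp(rocksdb::DB* db, const rocksdb::Slice& ts) {
        rocksdb::WriteBatch batch(/*reserved_bytes=*/0, /*max_bytes=*/0,
                                  /*timestamp_size=*/8);  // assumed constructor
        batch.Put("key1", "value1");
        batch.Put("key2", "value2");
        rocksdb::Status s = batch.AssignTimestamp(ts);  // same ts applied to every key
        if (!s.ok()) {
          return s;
        }
        return db->Write(rocksdb::WriteOptions(), &batch);
      }
      ```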
      
      Test plan (on devserver):
      ```
      $make clean && COMPILE_WITH_ASAN=1 make -j32 all
      $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
      $make check
      ```
      If the API extension looks good, I will add more unit tests.
      
      Some simple benchmark using db_bench.
      ```
      $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
      $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 -disable_wal=true
      ```
      Master is at a78503bd.
      ```
      |        | readrandom | fillrandom |
      | master | 15.53 MB/s | 25.97 MB/s |
      | PR5502 | 16.70 MB/s | 25.80 MB/s |
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5502
      
      Differential Revision: D16340894
      
      Pulled By: riversand963
      
      fbshipit-source-id: 51132cf792be07d1efc3ac33f5768c4ee2608bb8
  7. 24 Jul 2019 (3 commits)
    • Fix wrong info log printing for num_range_deletions (#5617) · f5b951f7
      sdong authored
      Summary:
      num_range_deletions printing is wrong in this log line:
      
      2019/07/18-12:59:15.309271 7f869f9ff700 EVENT_LOG_v1 {"time_micros": 1563479955309228, "cf_name": "5", "job": 955, "event": "table_file_creation", "file_number": 34579, "file_size": 2239842, "table_properties": {"data_size": 1988792, "index_size": 3067, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 1, "filter_size": 170821, "raw_key_size": 1951792, "raw_average_key_size": 16, "raw_value_size": 1731720, "raw_average_value_size": 14, "num_data_blocks": 199, "num_entries": 121987, "num_deletions": 15184, "num_merge_operands": 86512, "num_range_deletions": 86512, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "5", "column_family_id": 5, "comparator": "leveldb.BytewiseComparator", "merge_operator": "PutOperator", "prefix_extractor_name": "rocksdb.FixedPrefix.7", "property_collectors": "[]", "compression": "ZSTD", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1563479951, "oldest_key_time": 0, "file_creation_time": 1563479954}}
      
      It actually prints the "num_merge_operands" value. Fix it.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5617
      
      Test Plan: Just build.
      
      Differential Revision: D16453110
      
      fbshipit-source-id: fc1024b3cd5650312ed47a1379f0d2cf8b2d8a8f
    • Move the uncompression dictionary object out of the block cache (#5584) · 092f4170
      Levi Tamasi authored
      Summary:
      RocksDB has historically stored uncompression dictionary objects in the block
      cache as opposed to storing just the block contents. This necessitated
      evicting the object upon table close. With the new code, only the raw blocks
      are stored in the cache, eliminating the need for eviction.
      
      In addition, the patch makes the following improvements:
      
      1) Compression dictionary blocks are now prefetched/pinned similarly to
      index/filter blocks.
      2) A copy operation got eliminated when the uncompression dictionary is
      retrieved.
      3) Errors related to retrieving the uncompression dictionary are propagated as
      opposed to silently ignored.
      
      Note: the patch temporarily breaks the compression dictionary eviction stats.
      They will be fixed in a separate phase.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5584
      
      Test Plan: make asan_check
      
      Differential Revision: D16344151
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 2962b295f5b19628f9da88a3fcebbce5a5017a7b
    • Improve CPU Efficiency of ApproximateSize (part 1) (#5613) · 6b7fcc0d
      Eli Pozniansky authored
      Summary:
      1. Avoid creating the iterator in order to call BlockBasedTable::ApproximateOffsetOf(). Instead, directly call into it.
      2. Optimize BlockBasedTable::ApproximateOffsetOf() to keep the index block iterator on the stack.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5613
      
      Differential Revision: D16442660
      
      Pulled By: elipoz
      
      fbshipit-source-id: 9320be3e918c139b10e758cbbb684706d172e516
  8. 23 Jul 2019 (2 commits)
    • WriteUnPrepared: improve read your own write functionality (#5573) · eae83274
      Manuel Ung authored
      Summary:
      There are a number of fixes in this PR (with most bugs found via the added stress tests):
      1. Re-enable reseek optimization. This was initially disabled to avoid infinite loops in https://github.com/facebook/rocksdb/pull/3955 but this can be resolved by remembering not to reseek after a reseek has already been done. This problem only affects forward iteration in `DBIter::FindNextUserEntryInternal`, as we already disable reseeking in `DBIter::FindValueForCurrentKeyUsingSeek`.
      2. Verify that ReadOptions.snapshot can be safely used for iterator creation. Some snapshots would not give correct results because snapshot validation would not be enforced, breaking some assumptions in Prev() iteration.
      3. In the non-snapshot Get() case, reads done at `LastPublishedSequence` may not be enough, because unprepared sequence numbers are not published. Use `std::max(published_seq, max_visible_seq)` to do lookups instead.
      4. Add stress test to test reading own writes.
      5. Minor bug in the allow_concurrent_memtable_write case where we forgot to pass in batch_per_txn_.
      6. Minor performance optimization in `CalcMaxUnpreparedSequenceNumber` by assigning by reference instead of value.
      7. Add some more comments everywhere.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5573
      
      Differential Revision: D16276089
      
      Pulled By: lth
      
      fbshipit-source-id: 18029c944eb427a90a87dee76ac1b23f37ec1ccb
    • row_cache to share entry for recent snapshots (#5600) · 66b5613d
      sdong authored
      Summary:
      Right now, users cannot take advantage of the row cache unless no snapshot is used, or Get() is repeated for the same snapshot. This limits the usage of the row cache.
      This change eliminates this restriction in some cases: if the snapshot used is newer than the largest sequence number in the file, and no write callback function is registered, the same row cache key is used as if no snapshot were given. We still need the callback-function restriction for now because the callback function may filter out different keys for different snapshots even if the snapshots are new.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5600
      
      Test Plan: Add a unit test.
      
      Differential Revision: D16386616
      
      fbshipit-source-id: 6b7d214bd215d191b03ccf55926ad4b703ec2e53
  9. 20 Jul 2019 (2 commits)
  10. 19 Jul 2019 (1 commit)
  11. 18 Jul 2019 (2 commits)
    • Fix LITE mode build failure · ec2b996b
      anand76 authored
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5588
      
      Test Plan: make LITE=1 all check
      
      Differential Revision: D16354543
      
      Pulled By: anand1976
      
      fbshipit-source-id: 327a171439e183ac3a5e5057c511d6bca445e97d
    • Export Import sst files (#5495) · 22ce4624
      Venki Pallipadi authored
      Summary:
      Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135
      
      This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469
      "Add support for taking snapshot of a column family and creating column family from a given CF snapshot"
      
      We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.
      
      (1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
      // Exports all live SST files of a specified Column Family onto export_dir,
      // returning SST files information in metadata.
      // - SST files will be created as hard links when the directory specified
      //   is in the same partition as the db directory, copied otherwise.
      // - export_dir should not already exist and will be created by this API.
      // - Always triggers a flush.
      virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
                                        const std::string& export_dir,
                                        ExportImportFilesMetaData** metadata);
      
      Internally, the API will call DisableFileDeletions() and GetColumnFamilyMetaData(), parse through the metadata, create links/copies of all the SST files, call EnableFileDeletions(), and complete the call by returning the list of file metadata.
      
      (2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
      // CreateColumnFamilyWithImport() will create a new column family with
      // column_family_name and import external SST files specified in metadata into
      // this column family.
      // (1) External SST files can be created using SstFileWriter.
      // (2) External SST files can be exported from a particular column family in
      //     an existing DB.
      // Option in import_options specifies whether the external files are copied or
      // moved (default is copy). When option specifies copy, managing files at
      // external_file_path is caller's responsibility. When option specifies a
      // move, the call ensures that the specified files at external_file_path are
      // deleted on successful return and files are not modified on any error
      // return.
      // On error return, column family handle returned will be nullptr.
      // ColumnFamily will be present on successful return and will not be present
      // on error return. ColumnFamily may be present on any crash during this call.
      virtual Status CreateColumnFamilyWithImport(
          const ColumnFamilyOptions& options, const std::string& column_family_name,
          const ImportColumnFamilyOptions& import_options,
          const ExportImportFilesMetaData& metadata,
          ColumnFamilyHandle** handle);
      
      Internally, this API creates a new CF, parses all the SST files, and adds them to the specified column family at the same level and with the same sequence numbers as in the metadata. It also performs safety checks with respect to overlaps between the SST files being imported.
      
      If the incoming sequence number is higher than the current local sequence number, the local sequence number is updated to reflect this.
      
      Note: as the SST files are being moved across column families, the column family name in an SST file will no longer match the actual column family in the destination DB. The API does not modify the column family name or ID in the SST files being imported.
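
      A hedged usage sketch tying the two APIs together. ExportColumnFamily is assumed to hang off the Checkpoint utility (the summary only says it is modelled after CreateCheckpoint()), and error handling is omitted:
      ```
      #include <string>
      #include "rocksdb/db.h"
      #include "rocksdb/metadata.h"
      #include "rocksdb/utilities/checkpoint.h"

      // Export one column family from source_db, then import it into dest_db
      // under a new name.
      void ExportThenImport(rocksdb::DB* source_db,
                            rocksdb::ColumnFamilyHandle* source_cf,
                            rocksdb::DB* dest_db, const std::string& export_dir) {
        rocksdb::Checkpoint* checkpoint = nullptr;
        rocksdb::Checkpoint::Create(source_db, &checkpoint);

        rocksdb::ExportImportFilesMetaData* metadata = nullptr;
        checkpoint->ExportColumnFamily(source_cf, export_dir, &metadata);

        rocksdb::ImportColumnFamilyOptions import_options;
        import_options.move_files = false;  // copy; caller keeps managing export_dir

        rocksdb::ColumnFamilyHandle* new_cf = nullptr;
        dest_db->CreateColumnFamilyWithImport(rocksdb::ColumnFamilyOptions(),
                                              "imported_cf", import_options,
                                              *metadata, &new_cf);
        delete metadata;
        delete checkpoint;
      }
      ```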
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495
      
      Differential Revision: D16018881
      
      fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
  12. 17 Jul 2019 (2 commits)
    • Remove RandomAccessFileReader.for_compaction_ (#5572) · 699a569c
      sdong authored
      Summary:
      RandomAccessFileReader.for_compaction_ doesn't seem to be used anymore. Remove it.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5572
      
      Test Plan: USE_CLANG=1 make all check -j
      
      Differential Revision: D16286178
      
      fbshipit-source-id: aa338049761033dfbe5e8b1707bbb0be2df5be7e
    • Move the filter readers out of the block cache (#5504) · 3bde41b5
      Levi Tamasi authored
      Summary:
      Currently, when the block cache is used for the filter block, it is not
      really the block itself that is stored in the cache but a FilterBlockReader
      object. Since this object is not pure data (it has, for instance, pointers that
      might dangle, including in one case a back pointer to the TableReader), it's not
      really sharable. To avoid the issues around this, the current code erases the
      cache entries when the TableReader is closed (which, BTW, is not sufficient
      since a concurrent TableReader might have picked up the object in the meantime).
      Instead of doing this, the patch moves the FilterBlockReader out of the cache
      altogether, and decouples the filter reader object from the filter block.
      In particular, instead of the TableReader owning, or caching/pinning the
      FilterBlockReader (based on the customer's settings), with the change the
      TableReader unconditionally owns the FilterBlockReader, which in turn
      owns/caches/pins the filter block. This change also enables us to reuse the code
      paths historically used for data blocks for filters as well.
      
      Note:
      Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
      separate phase.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504
      
      Test Plan: make asan_check
      
      Differential Revision: D16036974
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
  13. 16 Jul 2019 (2 commits)
    • Fix memory leak in `rocksdb_wal_iter_get_batch` function (#5515) · cd252036
      Jim Lin authored
      Summary:
      `wal_batch.writeBatchPtr.release()` gives up the ownership of the original `WriteBatch`, but there is no new owner, which causes memory leak.
      
      The patch is simple: removing `release()` prevents the ownership change, and `std::move` is there for speed.
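
      A generic illustration of the ownership pattern (not the actual C-API code): release() detaches the object from its unique_ptr with no new owner, while moving the contents leaves the unique_ptr to free what it still owns.
      ```
      #include <memory>
      #include <string>
      #include <utility>

      struct Batch {
        std::string rep;
      };

      void Leaky() {
        auto batch = std::make_unique<Batch>();
        batch->rep = "payload";
        // release() gives up ownership, but nothing deletes the object -> leak.
        Batch* raw = batch.release();
        std::string copy = raw->rep;
        (void)copy;
      }

      void Fixed() {
        auto batch = std::make_unique<Batch>();
        batch->rep = "payload";
        // Move the contents out instead; the unique_ptr still owns and frees *batch.
        std::string moved = std::move(batch->rep);
        (void)moved;
      }
      ```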
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5515
      
      Differential Revision: D16264281
      
      Pulled By: riversand963
      
      fbshipit-source-id: 51c556b7a1c977325c3aa24acb636303847151fa
    • add more tracing for stats history (#5566) · b0259e45
      Zhongyi Xie authored
      Summary:
      Sample info log output from db_bench:
      In-memory:
      ```
      2019/07/12-21:39:19.478490 7fa01b3f5700 [_impl/db_impl.cc:702] ------- PERSISTING STATS -------
      2019/07/12-21:39:19.478633 7fa01b3f5700 [_impl/db_impl.cc:753] Storing 145 stats with timestamp 1562992759 to in-memory stats history
      2019/07/12-21:39:19.478670 7fa01b3f5700 [_impl/db_impl.cc:766] [Pre-GC] In-memory stats history size: 1051218 bytes, slice count: 103
      2019/07/12-21:39:19.478704 7fa01b3f5700 [_impl/db_impl.cc:775] [Post-GC] In-memory stats history size: 1051218 bytes, slice count: 102
      ```
      On-disk:
      ```
      2019/07/12-21:48:53.862548 7f24943f5700 [_impl/db_impl.cc:702] ------- PERSISTING STATS -------
      2019/07/12-21:48:53.862553 7f24943f5700 [_impl/db_impl.cc:709] Reading 145 stats from statistics
      2019/07/12-21:48:53.862852 7f24943f5700 [_impl/db_impl.cc:737] Writing 145 stats with timestamp 1562993333 to persistent stats CF succeeded
      ```
      ```
      2019/07/12-21:48:51.861711 7f24943f5700 [_impl/db_impl.cc:702] ------- PERSISTING STATS -------
      2019/07/12-21:48:51.861729 7f24943f5700 [_impl/db_impl.cc:709] Reading 145 stats from statistics
      2019/07/12-21:48:51.861921 7f24943f5700 [_impl/db_impl.cc:732] Writing to persistent stats CF failed -- Result incomplete: Write stall
      ...
      2019/07/12-21:48:51.873032 7f2494bf6700 [WARN] [lumn_family.cc:749] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
      ```
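
      For context, a minimal sketch of the DBOptions that drive this stats-persistence path; the option names are from the public API as I recall it, not from this commit:
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/statistics.h"

      // Enable periodic stats persistence so the tracing shown above has work to log.
      rocksdb::Options MakeOptionsWithStatsHistory() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.statistics = rocksdb::CreateDBStatistics();
        options.stats_persist_period_sec = 600;           // persist every 10 minutes
        options.stats_history_buffer_size = 1024 * 1024;  // cap on in-memory history
        options.persist_stats_to_disk = false;            // true -> persistent stats CF
        return options;
      }
      ```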
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5566
      
      Differential Revision: D16258187
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 292497099b941418590ed4312411bee36e244dc5
  14. 13 Jul 2019 (1 commit)
  15. 10 Jul 2019 (1 commit)
  16. 08 Jul 2019 (2 commits)
    • Support GetAllKeyVersions() for non-default cf (#5544) · 7c76a7fb
      Yanqin Jin authored
      Summary:
      Previously, `GetAllKeyVersions()` supported the default column family only. This PR adds support for other column families.
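
      A sketch of calling the extended debug helper for a non-default column family; the parameter order is an assumption modelled on the existing default-CF overload:
      ```
      #include <vector>
      #include "rocksdb/utilities/debug.h"

      // Dump all internal versions of keys in ["a", "z") for a given column family.
      void DumpKeyVersions(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cfh) {
        std::vector<rocksdb::KeyVersion> versions;
        rocksdb::Status s = rocksdb::GetAllKeyVersions(
            db, cfh, /*begin_key=*/"a", /*end_key=*/"z",
            /*max_num_ikeys=*/1000, &versions);
        if (s.ok()) {
          for (const auto& kv : versions) {
            // Inspect kv.user_key, kv.sequence, kv.type, kv.value here.
            (void)kv;
          }
        }
      }
      ```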
      
      Test plan (devserver):
      ```
      $make clean && COMPILE_WITH_ASAN=1 make -j32 db_basic_test
      $./db_basic_test --gtest_filter=DBBasicTest.GetAllKeyVersions
      ```
      All other unit tests must pass.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5544
      
      Differential Revision: D16147551
      
      Pulled By: riversand963
      
      fbshipit-source-id: 5a61aece2a32d789e150226a9b8d53f4a5760168
    • setup wal_in_db_path_ for secondary instance (#5545) · 8d348069
      Zhongyi Xie authored
      Summary:
      PR https://github.com/facebook/rocksdb/pull/5520 adds DBImpl::wal_in_db_path_ and initializes it in DBImpl::Open; this PR fixes the valgrind error for the secondary instance:
      ```
      ==236417== Conditional jump or move depends on uninitialised value(s)
      ==236417==    at 0x62242A: rocksdb::DeleteDBFile(rocksdb::ImmutableDBOptions const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool) (file_util.cc:96)
      ==236417==    by 0x512432: rocksdb::DBImpl::DeleteObsoleteFileImpl(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileType, unsigned long) (db_impl_files.cc:261)
      ==236417==    by 0x515A7A: rocksdb::DBImpl::PurgeObsoleteFiles(rocksdb::JobContext&, bool) (db_impl_files.cc:492)
      ==236417==    by 0x499153: rocksdb::ColumnFamilyHandleImpl::~ColumnFamilyHandleImpl() (column_family.cc:75)
      ==236417==    by 0x499880: rocksdb::ColumnFamilyHandleImpl::~ColumnFamilyHandleImpl() (column_family.cc:84)
      ==236417==    by 0x4C9AF9: rocksdb::DB::DestroyColumnFamilyHandle(rocksdb::ColumnFamilyHandle*) (db_impl.cc:3105)
      ==236417==    by 0x44E853: CloseSecondary (db_secondary_test.cc:53)
      ==236417==    by 0x44E853: rocksdb::DBSecondaryTest::~DBSecondaryTest() (db_secondary_test.cc:31)
      ==236417==    by 0x44EC77: ~DBSecondaryTest_PrimaryDropColumnFamily_Test (db_secondary_test.cc:443)
      ==236417==    by 0x44EC77: rocksdb::DBSecondaryTest_PrimaryDropColumnFamily_Test::~DBSecondaryTest_PrimaryDropColumnFamily_Test() (db_secondary_test.cc:443)
      ==236417==    by 0x83D1D7: HandleSehExceptionsInMethodIfSupported<testing::Test, void> (gtest-all.cc:3824)
      ==236417==    by 0x83D1D7: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest-all.cc:3860)
      ==236417==    by 0x8346DB: testing::TestInfo::Run() [clone .part.486] (gtest-all.cc:4078)
      ==236417==    by 0x8348D4: Run (gtest-all.cc:4047)
      ==236417==    by 0x8348D4: testing::TestCase::Run() [clone .part.487] (gtest-all.cc:4190)
      ==236417==    by 0x834D14: Run (gtest-all.cc:6100)
      ==236417==    by 0x834D14: testing::internal::UnitTestImpl::RunAllTests() (gtest-all.cc:6062)
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5545
      
      Differential Revision: D16146224
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 184c90e451352951da4e955f054d4b1a1f29ea29
  17. 07 Jul 2019 (1 commit)
  18. 05 Jul 2019 (1 commit)
  19. 04 Jul 2019 (1 commit)
  20. 03 Jul 2019 (2 commits)
    • Support jemalloc compiled with `--with-jemalloc-prefix` (#5521) · 0d57d93a
      Andrew Kryczka authored
      Summary:
      Previously, if the jemalloc was built with nonempty string for
      `--with-jemalloc-prefix`, then `HasJemalloc()` would return false on
      Linux, so jemalloc would not be used at runtime. On Mac, it would cause
      a linker failure due to no definitions found for the weak functions
      declared in "port/jemalloc_helper.h". This should be a rare problem
      because (1) on Linux the default `--with-jemalloc-prefix` value is the
      empty string, and (2) Homebrew's build explicitly sets
      `--with-jemalloc-prefix` to the empty string.
      
      However, there are cases where `--with-jemalloc-prefix` is nonempty.
      For example, when building jemalloc from source on Mac, the default
      setting is `--with-jemalloc-prefix=je_`. Such jemalloc builds should be
      usable by RocksDB.
      
      The fix is simple. Defining `JEMALLOC_MANGLE` before including
      "jemalloc.h" causes it to define unprefixed symbols that are aliases for
      each of the prefixed symbols. Thanks to benesch for figuring this out
      and explaining it to me.
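
      A minimal sketch of the pattern, based on jemalloc's documented mangling behavior rather than the RocksDB header itself:
      ```
      #include <cstddef>

      // Defining JEMALLOC_MANGLE before including jemalloc.h makes the header expose
      // unprefixed aliases (malloc, mallocx, ...) for the prefixed symbols, e.g.
      // je_mallocx when jemalloc was built with --with-jemalloc-prefix=je_.
      #define JEMALLOC_MANGLE
      #include <jemalloc/jemalloc.h>

      void* AllocateAligned(std::size_t size) {
        // mallocx resolves to the prefixed implementation through the mangle aliases.
        return mallocx(size, MALLOCX_ALIGN(64));
      }
      ```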
      
      Fixes https://github.com/facebook/rocksdb/issues/1462.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5521
      
      Test Plan:
      build jemalloc with prefixed symbols:
      
      ```
      $ ./configure --with-jemalloc-prefix=lol
      $ make
      ```
      
      compile rocksdb against it:
      
      ```
      $ WITH_JEMALLOC_FLAG=1 JEMALLOC=1 EXTRA_LDFLAGS="-L/home/andrew/jemalloc/lib/" EXTRA_CXXFLAGS="-I/home/andrew/jemalloc/include/" make -j12 ./db_bench
      ```
      
      run db_bench and verify jemalloc actually used:
      
      ```
      $ ./db_bench -benchmarks=fillrandom -statistics=true -dump_malloc_stats=true -stats_dump_period_sec=1
      $ grep jemalloc /tmp/rocksdbtest-1000/dbbench/LOG
      2019/06/29-12:20:52.088658 7fc5fb7f6700 [_impl/db_impl.cc:837] ___ Begin jemalloc statistics ___
      ...
      ```
      
      Differential Revision: D16092758
      
      fbshipit-source-id: c2c358346190ed62ceb2a3547a6c4c180b12f7c4
    • Reduce iterator key comparison for upper/lower bound check (2nd attempt) (#5468) · 662ce620
      Yi Wu authored
      Summary:
      This is a second attempt at https://github.com/facebook/rocksdb/issues/5111, with a fix to redo the iterate-bounds check after `SeekXXX()`. This is because MyRocks may change the iterate bounds between seeks.
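
      For context, a generic sketch of the usage pattern that makes re-checking necessary: ReadOptions keeps a pointer to the upper-bound Slice, so the caller (e.g. MyRocks) can change the bound between seeks on the same iterator.
      ```
      #include <memory>
      #include <string>
      #include "rocksdb/db.h"

      void ScanTwoRanges(rocksdb::DB* db) {
        std::string bound = "b";
        rocksdb::Slice upper(bound);
        rocksdb::ReadOptions ro;
        ro.iterate_upper_bound = &upper;  // RocksDB stores the pointer, not a copy

        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
        for (it->Seek("a"); it->Valid(); it->Next()) {
          // keys in ["a", "b")
        }

        bound = "d";                      // change the bound in place...
        upper = rocksdb::Slice(bound);
        for (it->Seek("c"); it->Valid(); it->Next()) {
          // ...so this scan sees ["c", "d"); the bound check must be redone after Seek()
        }
      }
      ```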
      
      See https://github.com/facebook/rocksdb/issues/5111 for original benchmark result and discussion.
      
      Closes https://github.com/facebook/rocksdb/issues/5463.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5468
      
      Test Plan: Existing rocksdb tests, plus myrocks test `rocksdb.optimizer_loose_index_scans` and `rocksdb.group_min_max`.
      
      Differential Revision: D15863332
      
      fbshipit-source-id: ab4aba5899838591806b8673899bd465f3f53e18
  21. 02 Jul 2019 (3 commits)
    • Remove multiple declarations of kMicrosInSecond. · 66464d1f
      haoyuhuang authored
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5526
      
      Test Plan:
      OPT=-g V=1 make J=1 unity_test -j32
      make clean && make -j32
      
      Differential Revision: D16079315
      
      Pulled By: HaoyuHuang
      
      fbshipit-source-id: 294ab439cf0db8dd5da44e30eabf0cbb2bb8c4f6
    • Ref and unref cfd before and after calling WaitForFlushMemTables (#5513) · 1e87f2b6
      Yanqin Jin authored
      Summary:
      This is to prevent the bg flush thread from unrefing and deleting a cfd that has been dropped by a concurrent thread.
      Before RocksDB calls `DBImpl::WaitForFlushMemTables`, we should increase the refcount of each `ColumnFamilyData` so that its ref count will not drop to 0 even if the column family is dropped by another thread. Otherwise, the bg flush thread can deref the cfd and delete it, causing a segfault in `WaitForFlushMemtables` upon accessing `cfd`.
      
      Test plan (on devserver):
      ```
      $make clean && COMPILE_WITH_ASAN=1 make -j32
      $make check
      ```
      All unit tests must pass.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5513
      
      Differential Revision: D16062898
      
      Pulled By: riversand963
      
      fbshipit-source-id: 37dc511f1dc99f036d0201bbd7f0a8f5677c763d
    • force flushing stats CF to avoid holding old logs (#5509) · 3886dddc
      Zhongyi Xie authored
      Summary:
      WAL records RocksDB writes to all column families. When a user flushes a column family, the old WAL will not accept new writes but cannot be deleted yet because it may still contain live data for other column families. (See https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log#life-cycle-of-a-wal for a detailed explanation.)
      Because of this, if there is a column family that receives very infrequent writes and no manual flush is called for it, it could prevent a lot of WALs from being deleted. PR https://github.com/facebook/rocksdb/pull/5046 introduced the persistent stats column family, which is a good example of such a column family. Depending on the config, it may have long intervals between writes, and the user is unaware of it, which makes it difficult to call manual flush for it.
      This PR addresses the problem for the persistent stats column family by forcing a flush for it when 1) another column family is flushed and 2) the persistent stats column family's log number is the smallest among all column families. This way, the persistent stats column family will keep advancing its log number when necessary, allowing RocksDB to delete old WAL files.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5509
      
      Differential Revision: D16045896
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 286837b633e988417f0096ff38384742d3b40ef4
  22. 01 Jul 2019 (1 commit)
    • MultiGet parallel IO (#5464) · 7259e28d
      anand76 authored
      Summary:
      Enhancement to MultiGet batching to read the data blocks required for the keys in a batch in parallel from disk. It uses the Env::MultiRead() API to read multiple blocks and reduce latency.
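
      A sketch of the batched MultiGet call that benefits from this parallel-IO path; the overload shown is the batched form as I recall it, so treat the exact signature as an assumption:
      ```
      #include <array>
      #include "rocksdb/db.h"

      // Look up several keys in one batched call; the data blocks the batch needs
      // can then be fetched from disk in parallel via Env::MultiRead().
      void BatchedLookup(rocksdb::DB* db) {
        constexpr size_t kNumKeys = 3;
        std::array<rocksdb::Slice, kNumKeys> keys = {"k1", "k2", "k3"};
        std::array<rocksdb::PinnableSlice, kNumKeys> values;
        std::array<rocksdb::Status, kNumKeys> statuses;

        db->MultiGet(rocksdb::ReadOptions(), db->DefaultColumnFamily(), kNumKeys,
                     keys.data(), values.data(), statuses.data(),
                     /*sorted_input=*/false);

        for (size_t i = 0; i < kNumKeys; ++i) {
          if (statuses[i].ok()) {
            // values[i] holds the value for keys[i]
          }
        }
      }
      ```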
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5464
      
      Test Plan:
      1. make check
      2. make asan_check
      3. make asan_crash
      
      Differential Revision: D15911771
      
      Pulled By: anand1976
      
      fbshipit-source-id: 605036b9af0f90ca0020dc87c3a86b4da6e83394
  23. 27 Jun 2019 (1 commit)
  24. 25 Jun 2019 (1 commit)
    • Add an option to put first key of each sst block in the index (#5289) · b4d72094
      Mike Kolupaev authored
      Summary:
      The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.
      
      Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
      
      So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
      
      Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
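
      For reference, a sketch of enabling this behavior; kBinarySearchWithFirstKey is the index type I believe this patch adds to BlockBasedTableOptions, so treat the name as an assumption:
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/table.h"

      // Keep the first key of each block in the index so Seek() can defer reading a
      // data block until its file reaches the top of the merging iterator's heap.
      rocksdb::Options MakeOptionsWithFirstKeyIndex() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.index_type =
            rocksdb::BlockBasedTableOptions::IndexType::kBinarySearchWithFirstKey;

        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```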
      
      This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
      
      Differential Revision: D15256423
      
      Pulled By: al13n321
      
      fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
  25. 22 Jun 2019 (1 commit)
  26. 21 Jun 2019 (2 commits)
    • Add more callers for table reader. (#5454) · 705b8eec
      haoyuhuang authored
      Summary:
      This PR adds more callers for table readers. This information is only used for block cache analysis so that we can know which caller accesses a block.
      1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name.
      2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable.
      
      This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454
      
      Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32.
      
      Differential Revision: D15819451
      
      Pulled By: HaoyuHuang
      
      fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d
    • sanitize and limit block_size under 4GB (#5492) · 24f73436
      Zhongyi Xie authored
      Summary:
      `Block::restart_index_`, `Block::restarts_`, and `Block::current_` are defined as uint32_t, but `BlockBasedTableOptions::block_size` is defined as a size_t, so users might see corruption as in https://github.com/facebook/rocksdb/issues/5486.
      This PR adds a check in `BlockBasedTableFactory::SanitizeOptions` to disallow such configurations.
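
      A tiny illustration of the truncation that motivates the check (generic C++, not the actual sanitize code):
      ```
      #include <cstdint>
      #include <iostream>

      int main() {
        // block_size is a size_t, but block-internal offsets are uint32_t.
        std::size_t block_size = 5ULL << 30;                  // 5 GB
        uint32_t offset = static_cast<uint32_t>(block_size);  // wraps to 1 GB
        std::cout << block_size << " -> " << offset << "\n";  // 5368709120 -> 1073741824
        return 0;
      }
      ```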
      yiwu-arbug
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5492
      
      Differential Revision: D15914047
      
      Pulled By: miasantreble
      
      fbshipit-source-id: c943f153d967e15aee7f2795730ab8259e2be201