1. 02 Mar 2022, 2 commits
  2. 01 Mar 2022, 3 commits
    • Improve build detect for RISCV (#9366) · 7d7e88c7
      Adam Retter authored
      Summary:
      Related to: https://github.com/facebook/rocksdb/pull/9215
      
      * Adds build_detect_platform support for RISCV on Linux (at least on SiFive Unmatched platforms)
      
      This still leaves some linking issues on RISCV remaining (e.g. when building `db_test`):
      ```
      /usr/bin/ld: ./librocksdb_debug.a(memtable.o): in function `__gnu_cxx::new_allocator<char>::deallocate(char*, unsigned long)':
      /usr/include/c++/10/ext/new_allocator.h:133: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: ./librocksdb_debug.a(memtable.o): in function `std::__atomic_base<bool>::compare_exchange_weak(bool&, bool, std::memory_order, std::memory_order)':
      /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: ./librocksdb_debug.a(memtable.o):/usr/include/c++/10/bits/atomic_base.h:464: more undefined references to `__atomic_compare_exchange_1' follow
      /usr/bin/ld: ./librocksdb_debug.a(db_impl.o): in function `rocksdb::DBImpl::NewIteratorImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyData*, unsigned long, rocksdb::ReadCallback*, bool, bool)':
      /home/adamretter/rocksdb/db/db_impl/db_impl.cc:3019: undefined reference to `__atomic_exchange_1'
      /usr/bin/ld: ./librocksdb_debug.a(write_thread.o): in function `rocksdb::WriteThread::Writer::CreateMutex()':
      /home/adamretter/rocksdb/./db/write_thread.h:205: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: ./librocksdb_debug.a(write_thread.o): in function `rocksdb::WriteThread::SetState(rocksdb::WriteThread::Writer*, unsigned char)':
      /home/adamretter/rocksdb/db/write_thread.cc:222: undefined reference to `__atomic_compare_exchange_1'
      collect2: error: ld returned 1 exit status
      make: *** [Makefile:1449: db_test] Error 1
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9366
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D34377664
      
      Pulled By: mrambacher
      
      fbshipit-source-id: c86f9d0cd1cb0c18de72b06f1bf5847f23f51118
    • Handle failures in block-based table size/offset approximation (#9615) · 0a89cea5
      Andrew Kryczka authored
      Summary:
      In crash test with fault injection, we were seeing stack traces like the following:
      
      ```
      https://github.com/facebook/rocksdb/issues/3 0x00007f75f763c533 in __GI___assert_fail (assertion=assertion@entry=0x1c5b2a0 "end_offset >= start_offset", file=file@entry=0x1c580a0 "table/block_based/block_based_table_reader.cc", line=line@entry=3245,
      function=function@entry=0x1c60e60 "virtual uint64_t rocksdb::BlockBasedTable::ApproximateSize(const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::TableReaderCaller)") at assert.c:101
      https://github.com/facebook/rocksdb/issues/4 0x00000000010ea9b4 in rocksdb::BlockBasedTable::ApproximateSize (this=<optimized out>, start=..., end=..., caller=<optimized out>) at table/block_based/block_based_table_reader.cc:3224
      https://github.com/facebook/rocksdb/issues/5 0x0000000000be61fb in rocksdb::TableCache::ApproximateSize (this=0x60f0000161b0, start=..., end=..., fd=..., caller=caller@entry=rocksdb::kCompaction, internal_comparator=..., prefix_extractor=...) at db/table_cache.cc:719
      https://github.com/facebook/rocksdb/issues/6 0x0000000000c3eaec in rocksdb::VersionSet::ApproximateSize (this=<optimized out>, v=<optimized out>, f=..., start=..., end=..., caller=<optimized out>) at ./db/version_set.h:850
      https://github.com/facebook/rocksdb/issues/7 0x0000000000c6ebc3 in rocksdb::VersionSet::ApproximateSize (this=<optimized out>, options=..., v=v@entry=0x621000047500, start=..., end=..., start_level=start_level@entry=0, end_level=<optimized out>, caller=<optimized out>)
      at db/version_set.cc:5657
      https://github.com/facebook/rocksdb/issues/8 0x000000000166e894 in rocksdb::CompactionJob::GenSubcompactionBoundaries (this=<optimized out>) at ./include/rocksdb/options.h:1869
      https://github.com/facebook/rocksdb/issues/9 0x000000000168c526 in rocksdb::CompactionJob::Prepare (this=this@entry=0x7f75f3ffcf00) at db/compaction/compaction_job.cc:546
      ```
      
      The problem occurred in `ApproximateSize()` when the index `Seek()` for the first `ApproximateDataOffsetOf()` encountered an I/O error, while the second `Seek()` did not. In the old code that scenario caused `start_offset == data_size` , thus it was easy to trip the assertion that `end_offset >= start_offset`.
      
      The fix is to set `start_offset == 0` when the first index `Seek()` fails, and `end_offset == data_size` when the second index `Seek()` fails. I doubt these give an "on average correct" answer for how this function is used, but I/O errors in index seeks are hopefully rare; this behavior is consistent with what was already there and easy to calculate.
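      The fallback logic can be sketched as a standalone function (`ApproximateSizeWithFallback` is a made-up name, not RocksDB's code), modeling a failed index `Seek()` as an empty optional:

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <optional>

      // Illustrative sketch of the described fix: if the first index seek
      // fails, fall back to offset 0; if the second fails, fall back to the
      // total data size.
      uint64_t ApproximateSizeWithFallback(std::optional<uint64_t> start_seek,
                                           std::optional<uint64_t> end_seek,
                                           uint64_t data_size) {
        // First seek failed -> assume the range starts at the beginning.
        uint64_t start_offset = start_seek.value_or(0);
        // Second seek failed -> assume the range extends to the end of the data.
        uint64_t end_offset = end_seek.value_or(data_size);
        assert(end_offset >= start_offset);  // the invariant that was tripping
        return end_offset - start_offset;
      }
      ```

      With both seeks failing, the whole data size is returned, so the `end_offset >= start_offset` invariant always holds.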
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9615
      
      Test Plan:
      run the repro command for a while and stopped seeing coredumps -
      
      ```
      $ while !  ./db_stress --block_size=128 --cache_size=32768 --clear_column_family_one_in=0 --column_families=1 --continuous_verification_interval=0 --db=/dev/shm/rocksdb_crashtest --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --index_type=2 --iterpercent=10  --kill_random_test=18887 --max_key=1000000 --max_bytes_for_level_base=2048576 --nooverwritepercent=1 --open_files=-1 --open_read_fault_one_in=32 --ops_per_thread=1000000 --prefixpercent=5 --read_fault_one_in=0 --readpercent=45 --reopen=0 --skip_verifydb=1 --subcompactions=2 --target_file_size_base=524288 --test_batches_snapshots=0 --value_size_mult=32 --write_buffer_size=524288 --writepercent=35  ; do : ; done
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D34383069
      
      Pulled By: ajkr
      
      fbshipit-source-id: fac26c3b20ea962e75387515ba5f2724dc48719f
    • Fix trivial Javadoc omissions (#9534) · ddb7620a
      stefan-zobel authored
      Summary:
      - fix spelling of `valueSizeSofLimit` and add "param" description in ReadOptions
      - add 3 missing "return" in RocksDB
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9534
      
      Reviewed By: riversand963
      
      Differential Revision: D34131186
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 7eb7ec177906052837180b291d67fb1c838c49e1
  3. 28 Feb 2022, 1 commit
    • Dedicate cacheline for DB mutex (#9637) · 9983eecd
      Andrew Kryczka authored
      Summary:
      We found a case of cacheline bouncing due to writers locking/unlocking `mutex_` and readers accessing `block_cache_tracer_`. We discovered it only after the issue was fixed by https://github.com/facebook/rocksdb/issues/9462 shifting the `DBImpl` members such that `mutex_` and `block_cache_tracer_` were naturally placed in separate cachelines in our regression testing setup. This PR forces the cacheline alignment of `mutex_` so we don't accidentally reintroduce the problem.
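      The alignment technique can be illustrated with a generic sketch (not RocksDB's actual `DBImpl` layout; 64 bytes is an assumed cacheline size):

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <mutex>

      // Hypothetical struct showing cacheline alignment of a mutex. Aligning
      // the following member as well guarantees the mutex shares its line
      // with nothing else in the struct.
      constexpr std::size_t kCacheLineSize = 64;  // assumption; typical on x86-64

      struct DBImplSketch {
        alignas(kCacheLineSize) std::mutex mutex_;  // gets a cacheline to itself
        alignas(kCacheLineSize) const void* block_cache_tracer_ = nullptr;
      };

      static_assert(alignof(DBImplSketch) == kCacheLineSize,
                    "mutex_ starts on a cacheline boundary");
      ```

      Writers locking/unlocking `mutex_` then invalidate only its own line, not the one holding the read-mostly neighbor.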
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9637
      
      Reviewed By: riversand963
      
      Differential Revision: D34502233
      
      Pulled By: ajkr
      
      fbshipit-source-id: 46aa313b7fe83e80c3de254e332b6fb242434c07
  4. 26 Feb 2022, 2 commits
  5. 24 Feb 2022, 2 commits
    • Streaming Compression API for WAL compression. (#9619) · 21345d28
      Siddhartha Roychowdhury authored
      Summary:
      Implement a streaming compression API (compress/uncompress) to use for WAL compression. The log_writer would use the compress class/API to compress a record before writing it out in chunks. The log_reader would use the uncompress class/API to uncompress the chunks and combine them into a single record.
      
      Added unit test to verify the API for different sizes/compression types.
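      The chunked write/read flow can be sketched with a stand-in codec; the class names below are illustrative only, and an identity transform stands in for a real streaming compressor such as LZ4 or ZSTD:

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <string>

      // Toy "streaming codec": the writer compresses a record chunk by chunk,
      // the reader uncompresses the chunks and concatenates them back into one
      // record. Not RocksDB's actual classes.
      class StreamingCompress {
       public:
        // Returns the "compressed" form of one input chunk (identity here).
        std::string Compress(const std::string& chunk) { return chunk; }
      };

      class StreamingUncompress {
       public:
        // Feeds one compressed chunk; appends the uncompressed bytes to record.
        void Uncompress(const std::string& chunk, std::string* record) {
          record->append(chunk);
        }
      };

      // Round-trips a record through chunked compress/uncompress.
      std::string RoundTrip(const std::string& record, std::size_t chunk_size) {
        StreamingCompress compress;
        StreamingUncompress uncompress;
        std::string out;
        for (std::size_t i = 0; i < record.size(); i += chunk_size) {
          std::string chunk = record.substr(i, chunk_size);
          uncompress.Uncompress(compress.Compress(chunk), &out);
        }
        return out;
      }
      ```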
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9619
      
      Test Plan: make -j24 check
      
      Reviewed By: anand1976
      
      Differential Revision: D34437346
      
      Pulled By: sidroyc
      
      fbshipit-source-id: b180569ad2ddcf3106380f8758b556cc0ad18382
    • Add a secondary cache implementation based on LRUCache 1 (#9518) · f706a9c1
      Bo Wang authored
      Summary:
      RocksDB uses a block cache to reduce IO and make queries more efficient. The block cache is based on the LRU algorithm (LRUCache) and keeps objects containing uncompressed data, such as Block, ParsedFullFilterBlock etc. It allows the user to configure a second level cache (rocksdb::SecondaryCache) to extend the primary block cache by holding items evicted from it. Some of the major RocksDB users, like MyRocks, use direct IO and would like to use a primary block cache for uncompressed data and a secondary cache for compressed data. The latter allows us to mitigate the loss of the Linux page cache due to direct IO.
      
      This PR includes a concrete implementation of rocksdb::SecondaryCache that integrates with compression libraries such as LZ4 and implements an LRU cache to hold compressed blocks.
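      The two-tier flow can be modeled with a toy sketch (purely illustrative, not RocksDB's `SecondaryCache` interface; compression is stubbed out): entries evicted from the primary tier drop into the secondary tier, and a primary miss consults the secondary tier and promotes the entry back.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <optional>
      #include <string>
      #include <unordered_map>

      class TwoTierCache {
       public:
        explicit TwoTierCache(std::size_t primary_capacity)
            : primary_capacity_(primary_capacity) {}

        void Insert(const std::string& key, const std::string& value) {
          if (primary_.size() >= primary_capacity_ && !primary_.count(key)) {
            // Evict an arbitrary primary entry into the secondary tier (a real
            // implementation would evict the LRU entry and compress it).
            auto victim = primary_.begin();
            secondary_[victim->first] = victim->second;
            primary_.erase(victim);
          }
          primary_[key] = value;
        }

        std::optional<std::string> Lookup(const std::string& key) {
          if (auto it = primary_.find(key); it != primary_.end()) return it->second;
          if (auto it = secondary_.find(key); it != secondary_.end()) {
            std::string value = it->second;  // would uncompress here
            secondary_.erase(it);
            Insert(key, value);  // promote back into primary
            return value;
          }
          return std::nullopt;
        }

       private:
        std::size_t primary_capacity_;
        std::unordered_map<std::string, std::string> primary_;
        std::unordered_map<std::string, std::string> secondary_;
      };
      ```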
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9518
      
      Test Plan:
      In this PR, lru_secondary_cache_test.cc includes the following tests:
      1. Unit tests for the secondary cache, with and without compression, including basic tests and failure tests.
      2. Integration tests with both the primary cache and this secondary cache.
      
      **Follow Up:**
      
      1. Statistics (e.g. compression ratio) will be added in another PR.
      2. Once this implementation is ready, I will do some shadow testing and benchmarking with UDB to measure the impact.
      
      Reviewed By: anand1976
      
      Differential Revision: D34430930
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 218d78b672a2f914856d8a90ff32f2f5b5043ded
  6. 23 Feb 2022, 5 commits
    • Support WBWI for keys having timestamps (#9603) · 6f125998
      Yanqin Jin authored
      Summary:
      This PR supports inserting keys to a `WriteBatchWithIndex` for column families that enable user-defined timestamps
      and reading the keys back. **The index does not have timestamps.**
      
      Writing a key to WBWI is unchanged, because the underlying WriteBatch already supports it.
      When reading the keys back, we need to make sure to distinguish between keys with and without timestamps before
      comparison.
      
      When user calls `GetFromBatchAndDB()`, no timestamp is needed to query the batch, but a timestamp has to be
      provided to query the db. The assumption is that data in the batch must be newer than data from the db.
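      The lookup order can be modeled with a toy sketch (hypothetical names, not the WBWI API; the DB side is reduced to a map keyed by `key@timestamp`): the batch, which is keyed without a timestamp, is consulted first and wins because its data is assumed newer.

      ```cpp
      #include <cassert>
      #include <map>
      #include <optional>
      #include <string>

      std::optional<std::string> GetFromBatchAndDbSketch(
          const std::map<std::string, std::string>& batch,  // no timestamps
          const std::map<std::string, std::string>& db,     // key@ts -> value
          const std::string& key, const std::string& read_ts) {
        // The batch needs no timestamp and takes precedence over the DB.
        if (auto it = batch.find(key); it != batch.end()) return it->second;
        // The DB lookup needs the timestamp to select a version.
        if (auto it = db.find(key + "@" + read_ts); it != db.end()) return it->second;
        return std::nullopt;
      }
      ```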
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9603
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D34354849
      
      Pulled By: riversand963
      
      fbshipit-source-id: d25d1f84e2240ce543e521fa30595082fb8db9a0
    • Fix test race conditions with OnFlushCompleted() (#9617) · 8ca433f9
      Andrew Kryczka authored
      Summary:
      We often see flaky tests due to `DB::Flush()` or `DBImpl::TEST_WaitForFlushMemTable()` not waiting until event listeners complete. For example, https://github.com/facebook/rocksdb/issues/9084, https://github.com/facebook/rocksdb/issues/9400, https://github.com/facebook/rocksdb/issues/9528, plus two new ones this week: "EventListenerTest.OnSingleDBFlushTest" and "DBFlushTest.FireOnFlushCompletedAfterCommittedResult". I ran a `make check` with the below race condition-coercing patch and fixed  issues it found besides old BlobDB.
      
      ```
       diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
      index 0e1864788..aaba68c4a 100644
       --- a/db/db_impl/db_impl_compaction_flush.cc
      +++ b/db/db_impl/db_impl_compaction_flush.cc
      @@ -861,6 +861,8 @@ void DBImpl::NotifyOnFlushCompleted(
              mutable_cf_options.level0_stop_writes_trigger);
         // release lock while notifying events
         mutex_.Unlock();
      +  bg_cv_.SignalAll();
      +  sleep(1);
         {
           for (auto& info : *flush_jobs_info) {
             info->triggered_writes_slowdown = triggered_writes_slowdown;
      ```
      
      The reason I did not fix old BlobDB issues is because it appears to have a fundamental (non-test) issue. In particular, it uses an EventListener to keep track of the files. OnFlushCompleted() could be delayed until even after a compaction involving that flushed file completes, causing the compaction to unexpectedly delete an untracked file.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9617
      
      Test Plan: `make check` including the race condition coercing patch
      
      Reviewed By: hx235
      
      Differential Revision: D34384022
      
      Pulled By: ajkr
      
      fbshipit-source-id: 2652ded39b415277c5d6a628414345223930514e
    • Enable core dumps in TSAN/UBSAN crash tests (#9616) · 96978e4d
      Andrew Kryczka authored
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9616
      
      Reviewed By: hx235
      
      Differential Revision: D34383489
      
      Pulled By: ajkr
      
      fbshipit-source-id: e4299000ef38073ec57e6ab5836150fdf8ce43d4
    • Combine data members of IOStatus with Status (#9549) · d795a730
      anand76 authored
      Summary:
      Combine the data members retryable_, data_loss_ and scope_ of IOStatus
      with Status, as protected members. IOStatus is now defined as a derived class of Status with
      no new data, but additional methods. This will allow us to eventually
      track the result of FileSystem calls in RocksDB with one variable
      instead of two.
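      The resulting layout can be sketched as follows (a simplified model, not the real `Status`/`IOStatus` definitions):

      ```cpp
      #include <cassert>

      // Base class carries all the data, including the fields that used to
      // live in IOStatus, as protected members.
      class Status {
       public:
        bool ok() const { return code_ == 0; }

       protected:
        unsigned char code_ = 0;
        bool retryable_ = false;
        bool data_loss_ = false;
        unsigned char scope_ = 0;
      };

      // Derived class adds methods but no new data members.
      class IOStatus : public Status {
       public:
        bool IsRetryable() const { return retryable_; }
      };

      static_assert(sizeof(IOStatus) == sizeof(Status),
                    "IOStatus adds no data of its own");
      ```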
      
      Benchmark commands and results are below. The performance after changes seems slightly better.
      
      ```./db_bench -db=/data/mysql/rocksdb/prefix_scan -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216```
      
      ```./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,seekrandom,readseq" -key_size=32 -value_size=512 -num=5000000 -seek_nexts=10000 -use_direct_reads=true -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=false -threads=1 -cache_size=10485760000```
      
      Before -
      seekrandom   :    3715.432 micros/op 269 ops/sec; 1394.9 MB/s (16149 of 16149 found)
      seekrandom   :    3687.177 micros/op 271 ops/sec; 1405.6 MB/s (16273 of 16273 found)
      seekrandom   :    3709.646 micros/op 269 ops/sec; 1397.1 MB/s (16175 of 16175 found)
      
      readseq      :       0.369 micros/op 2711321 ops/sec; 1406.6 MB/s
      readseq      :       0.363 micros/op 2754092 ops/sec; 1428.8 MB/s
      readseq      :       0.372 micros/op 2688046 ops/sec; 1394.6 MB/s
      
      After -
      seekrandom   :    3606.830 micros/op 277 ops/sec; 1436.9 MB/s (16636 of 16636 found)
      seekrandom   :    3594.467 micros/op 278 ops/sec; 1441.9 MB/s (16693 of 16693 found)
      seekrandom   :    3597.919 micros/op 277 ops/sec; 1440.5 MB/s (16677 of 16677 found)
      
      readseq      :       0.354 micros/op 2822809 ops/sec; 1464.5 MB/s
      readseq      :       0.358 micros/op 2795080 ops/sec; 1450.1 MB/s
      readseq      :       0.354 micros/op 2822889 ops/sec; 1464.5 MB/s
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9549
      
      Reviewed By: pdillinger
      
      Differential Revision: D34310362
      
      Pulled By: anand1976
      
      fbshipit-source-id: 54b27756edf9c9ecfe730a2dce542a7a46743096
    • configure microbenchmarks, regenerate targets (#9599) · ba65cfff
      Patrick Somaru authored
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9599
      
      Reviewed By: jay-zhuang, hodgesds
      
      Differential Revision: D34214408
      
      fbshipit-source-id: 6932200772f52ce77e550646ee3d1a928295844a
  7. 22 Feb 2022, 1 commit
    • Fix DBTest2.BackupFileTemperature memory leak (#9610) · 3379d146
      Andrew Kryczka authored
      Summary:
      Valgrind was failing with the below error because we forgot to destroy
      the `BackupEngine` object:
      
      ```
      ==421173== Command: ./db_test2 --gtest_filter=DBTest2.BackupFileTemperature
      ==421173==
      Note: Google Test filter = DBTest2.BackupFileTemperature
      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from DBTest2
      [ RUN      ] DBTest2.BackupFileTemperature
      --421173-- WARNING: unhandled amd64-linux syscall: 425
      --421173-- You may be able to write your own handler.
      --421173-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
      --421173-- Nevertheless we consider this a bug.  Please report
      --421173-- it at http://valgrind.org/support/bug_reports.html.
      [       OK ] DBTest2.BackupFileTemperature (3366 ms)
      [----------] 1 test from DBTest2 (3371 ms total)
      
      [----------] Global test environment tear-down
      [==========] 1 test from 1 test case ran. (3413 ms total)
      [  PASSED  ] 1 test.
      ==421173==
      ==421173== HEAP SUMMARY:
      ==421173==     in use at exit: 13,042 bytes in 195 blocks
      ==421173==   total heap usage: 26,022 allocs, 25,827 frees, 27,555,265 bytes allocated
      ==421173==
      ==421173== 8 bytes in 1 blocks are possibly lost in loss record 6 of 167
      ==421173==    at 0x4838DBF: operator new(unsigned long) (vg_replace_malloc.c:344)
      ==421173==    by 0x8D4606: allocate (new_allocator.h:114)
      ==421173==    by 0x8D4606: allocate (alloc_traits.h:445)
      ==421173==    by 0x8D4606: _M_allocate (stl_vector.h:343)
      ==421173==    by 0x8D4606: reserve (vector.tcc:78)
      ==421173==    by 0x8D4606: rocksdb::BackupEngineImpl::Initialize() (backupable_db.cc:1174)
      ==421173==    by 0x8D5473: Initialize (backupable_db.cc:918)
      ==421173==    by 0x8D5473: rocksdb::BackupEngine::Open(rocksdb::BackupEngineOptions const&, rocksdb::Env*, rocksdb::BackupEngine**) (backupable_db.cc:937)
      ==421173==    by 0x50AC8F: Open (backup_engine.h:585)
      ==421173==    by 0x50AC8F: rocksdb::DBTest2_BackupFileTemperature_Test::TestBody() (db_test2.cc:6996)
      ...
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9610
      
      Test Plan:
      ```
      $ make -j24 ROCKSDBTESTS_SUBSET=db_test2 valgrind_check_some
      ```
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D34371210
      
      Pulled By: ajkr
      
      fbshipit-source-id: 68154fcb0c51b28222efa23fa4ee02df8d925a18
  8. 21 Feb 2022, 1 commit
  9. 19 Feb 2022, 9 commits
    • Change enum SizeApproximationFlags to enum class (#9604) · 3699b171
      Akanksha Mahajan authored
      Summary:
      Change enum SizeApproximationFlags to an enum class, and add
      overloaded operators for conversion between the enum class and uint8_t.
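      A minimal illustration of the pattern (the flag names here are illustrative, not necessarily RocksDB's exact enumerators):

      ```cpp
      #include <cassert>
      #include <cstdint>

      enum class SizeApproximationFlags : uint8_t {
        NONE = 0,
        INCLUDE_MEMTABLES = 1 << 0,
        INCLUDE_FILES = 1 << 1,
      };

      // enum class has no implicit integer conversions, so combining and
      // testing flag bits requires explicit overloaded operators.
      inline SizeApproximationFlags operator|(SizeApproximationFlags a,
                                              SizeApproximationFlags b) {
        return static_cast<SizeApproximationFlags>(static_cast<uint8_t>(a) |
                                                   static_cast<uint8_t>(b));
      }

      inline bool operator&(SizeApproximationFlags a, SizeApproximationFlags b) {
        return (static_cast<uint8_t>(a) & static_cast<uint8_t>(b)) != 0;
      }
      ```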
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9604
      
      Test Plan: Circle CI jobs
      
      Reviewed By: riversand963
      
      Differential Revision: D34360281
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 6351dfdb717ae3c4530d324c3d37a8ecb01dd1ef
    • Add Temperature info in `NewSequentialFile()` (#9499) · d3a2f284
      Jay Zhuang authored
      Summary:
      Add Temperature hint information from RocksDB to the
      `NewSequentialFile()` API. Backup and checkpoint operations need to open the
      source files with `NewSequentialFile()`, which will now receive the temperature
      hints. Other operations are not covered.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9499
      
      Test Plan: Added unittest
      
      Reviewed By: pdillinger
      
      Differential Revision: D34006115
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 568b34602b76520e53128672bd07e9d886786a2f
    • Add Async Read and Poll APIs in FileSystem (#9564) · 559525dc
      Akanksha Mahajan authored
      Summary:
      This PR adds support for new APIs Async Read that reads the data
      asynchronously and Poll API that checks if requested read request has
      completed or not.
      
      Usage: In RocksDB, we are currently planning to prefetch data
      asynchronously during sequential scanning, and RocksDB will call these
      APIs to prefetch more data in advance.
      
      Design:
      - ReadAsync API submits the read request to underlying FileSystem in
      order to read data asynchronously. When read request is completed,
      callback function will be called. cb_arg is used by RocksDB to track the
      original request submitted and IOHandle is used by FileSystem to keep track
      of IO requests at their level.
      
      - The Poll API  is added in FileSystem because the call could end up handling
      completions for multiple different files which is not specific to a
      FSRandomAccessFile instance. There could be multiple outstanding file reads
      from different files in future and they can complete in any order.
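      A single-threaded toy model of the submit/poll contract (all names hypothetical; a real implementation would issue the I/O in the background):

      ```cpp
      #include <cassert>
      #include <functional>
      #include <queue>
      #include <string>
      #include <utility>

      struct ReadResult {
        std::string data;
        void* cb_arg;  // lets the caller match the completion to its request
      };

      class FileSystemSketch {
       public:
        using Callback = std::function<void(const ReadResult&)>;

        // "Submits" a read; completion is deferred until Poll().
        void ReadAsync(const std::string& file, Callback cb, void* cb_arg) {
          pending_.push(Pending{file, std::move(cb), cb_arg});
        }

        // Drives completions for all outstanding reads, regardless of which
        // file they targeted, invoking each callback once.
        void Poll() {
          while (!pending_.empty()) {
            Pending p = std::move(pending_.front());
            pending_.pop();
            p.cb(ReadResult{"contents of " + p.file, p.cb_arg});
          }
        }

       private:
        struct Pending {
          std::string file;
          Callback cb;
          void* cb_arg;
        };
        std::queue<Pending> pending_;
      };
      ```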
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9564
      
      Test Plan: Test will be added in separate PR.
      
      Reviewed By: anand1976
      
      Differential Revision: D34226216
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 95e64edafb17f543f7232421d51e2665a3267f69
    • Fixes #9565 (#9586) · 67f071fa
      Bo Wang authored
      Summary:
      [Compaction::IsTrivialMove](https://github.com/facebook/rocksdb/blob/a2b9be42b6d5ac4d44bcc6a9451a825440000769/db/compaction/compaction.cc#L318) checks whether allow_trivial_move is set, and if so it returns the value of is_trivial_move_. The allow_trivial_move option is there for universal compaction. So when this is set and leveled compaction is enabled, then useful code that follows this block never gets a chance to run.
      
      A check that [compaction_style == kCompactionStyleUniversal](https://github.com/facebook/rocksdb/blob/320d9a8e8a1b6998f92934f87fc71ad8bd6d4596/db/db_impl/db_impl_compaction_flush.cc#L1030) should be added to avoid doing the wrong thing for leveled.
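      Condensed as a predicate, the corrected guard looks like this (a sketch, not the actual `Compaction::IsTrivialMove()` code):

      ```cpp
      #include <cassert>

      enum class CompactionStyle { kLevel, kUniversal };

      // The early return driven by allow_trivial_move should only apply under
      // universal compaction, so that leveled compaction still reaches the
      // trivial-move heuristics further down in IsTrivialMove().
      bool TakesUniversalEarlyReturn(CompactionStyle style,
                                     bool allow_trivial_move,
                                     int output_level) {
        return style == CompactionStyle::kUniversal && allow_trivial_move &&
               output_level != 0;
      }
      ```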
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9586
      
      Test Plan:
      To reproduce this:
      First edit db/compaction/compaction.cc with
      ```
       diff --git a/db/compaction/compaction.cc b/db/compaction/compaction.cc
      index 7ae50b91e..52dd489b1 100644
       --- a/db/compaction/compaction.cc
      +++ b/db/compaction/compaction.cc
      @@ -319,6 +319,8 @@ bool Compaction::IsTrivialMove() const {
         // input files are non overlapping
         if ((mutable_cf_options_.compaction_options_universal.allow_trivial_move) &&
             (output_level_ != 0)) {
      +    printf("IsTrivialMove:: return %d because universal allow_trivial_move\n", (int) is_trivial_move_);
      +    // abort();
           return is_trivial_move_;
         }
      ```
      
      And then run
      ```
      ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1641328309 --universal_allow_trivial_move=1
      ```
      Example output with the debug code added
      
      ```
      IsTrivialMove:: return 0 because universal allow_trivial_move
      IsTrivialMove:: return 0 because universal allow_trivial_move
      ```
      
      After this PR, the bug is fixed.
      
      Reviewed By: ajkr
      
      Differential Revision: D34350451
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 3232005cc47c40a7e75d316cfc7960beb5bdff3a
    • fix issue with buckifier update (#9602) · 736bc832
      pat somaru authored
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9602
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D34350406
      
      Pulled By: likewhatevs
      
      fbshipit-source-id: caa81f272a429fbf7293f0588ea24cc53b29ee98
    • Add last level and non-last level read statistics (#9519) · f4b2500e
      Jay Zhuang authored
      Summary:
      Add last level and non-last level read statistics:
      ```
      LAST_LEVEL_READ_BYTES,
      LAST_LEVEL_READ_COUNT,
      NON_LAST_LEVEL_READ_BYTES,
      NON_LAST_LEVEL_READ_COUNT,
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9519
      
      Test Plan: added unittest
      
      Reviewed By: siying
      
      Differential Revision: D34062539
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 908644c3050878b4234febdc72e3e19d89af38cd
    • Make FilterPolicy Customizable (#9590) · 30b08878
      mrambacher authored
      Summary:
      Make FilterPolicy into a Customizable class.  Allow new FilterPolicy to be discovered through the ObjectRegistry
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9590
      
      Reviewed By: pdillinger
      
      Differential Revision: D34327367
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 37e7edac90ec9457422b72f359ab8ef48829c190
      30b08878
    • P
      update buckifier, add support for microbenchmarks (#9598) · f066b5ce
      Patrick Somaru 提交于
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9598
      
      Reviewed By: jay-zhuang, hodgesds
      
      Differential Revision: D34130191
      
      fbshipit-source-id: e5413f7d6af70a66940022d153b64a3383eccff1
    • Add temperature information to the event listener callbacks (#9591) · 2fbc6727
      Jay Zhuang authored
      Summary:
      RocksDB tries to provide temperature information in the event
      listener callbacks. The information is not guaranteed, as some operations,
      like backup, won't have this information.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9591
      
      Test Plan: Added unittest
      
      Reviewed By: siying, pdillinger
      
      Differential Revision: D34309339
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 4aca4f270f99fa49186d85d300da42594663d6d7
  10. 18 Feb 2022, 12 commits
    • Change type of cache buffer passed to `Cache::CreateCallback()` to `const void*` (#9595) · 54fb2a89
      Andrew Kryczka authored
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9595
      
      Reviewed By: riversand963
      
      Differential Revision: D34329906
      
      Pulled By: ajkr
      
      fbshipit-source-id: 508601856fa9bee4d40f4a68d14d333ef2143d40
    • Mark more OldDefaults as deprecated (#9594) · 48b9de4a
      Peter Dillinger authored
      Summary:
      `ColumnFamilyOptions::OldDefaults` and `DBOptions::OldDefaults` are
      now deprecated. They were previously overlooked alongside `Options::OldDefaults` in https://github.com/facebook/rocksdb/issues/9363
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9594
      
      Test Plan: comments only
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D34318592
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 773c97a61e2a8290ae154f363dd61c1f35a9dd16
    • Plugin java jni support (#9575) · ce84e502
      Alan Paxton authored
      Summary:
      Extend the plugin architecture to allow for the inclusion, building and testing of Java and JNI components of a plugin. This will cause the JAR built by `$ make rocksdbjava` to include the extra functionality provided by the plugin, and will cause `$ make jtest` to add the java tests provided by the plugin to the tests built and run by Java testing.
      
      The plugin's `<plugin>.mk` file can define:
      ```
      <plugin>_JNI_NATIVE_SOURCES
      <plugin>_NATIVE_JAVA_CLASSES
      <plugin>_JAVA_TESTS
      ```
      The plugin should provide java/src, java/test and java/rocksjni directories. When a plugin is required to be built, it must be named in the ROCKSDB_PLUGINS environment variable (as per the plugin architecture). This now has the effect of adding the files specified by the above definitions to the appropriate parts of the build.
      
      An example of a plugin with a Java component can be found as part of the hdfs plugin in https://github.com/riversand963/rocksdb-hdfs-env - at the time of writing the Java part of this fails tests, and needs a little work to complete, but it builds correctly under the plugin model.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9575
      
      Reviewed By: hx235
      
      Differential Revision: D34253948
      
      Pulled By: riversand963
      
      fbshipit-source-id: b3dde5da06f3d3c25c54246892097ae2a369b42d
    • Some better API and other comments (#9533) · 561be005
      Peter Dillinger authored
      Summary:
      Various comments, mostly about SliceTransform + prefix extractors.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9533
      
      Test Plan: comments only
      
      Reviewed By: ajkr
      
      Differential Revision: D34094367
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 9742ce3b89ef7fd5c5e748fec862e6361ed44e95
    • Remove previously deprecated Java where RocksDB also removed it, or where no direct equivalent existed. (#9576) · 8d9c203f
      Alan Paxton authored
      
      Summary:
      For RocksDB v7 major release. Remove previously deprecated Java API methods and associated tests
      - where equivalent/alternative functionality exists and is already tested AND
      - where the core RocksDB function/feature has also been removed
      - OR the functionality exists only in Java so the previous deprecation only affected Java methods
      
      RETAIN deprecated Java APIs that reflect functionality which is deprecated by, but still supported by, the core of RocksDB.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9576
      
      Reviewed By: ajkr
      
      Differential Revision: D34314983
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 7cf9c17e3e07be9d289beb99f81b71e8e09ac403
      8d9c203f
    • P
      Hide FilterBits{Builder,Reader} from public API (#9592) · 725833a4
      Peter Dillinger authored
      Summary:
      We don't have any evidence of people using these to build custom
      filters. The recommended way of customizing filter handling is to
      defer to various built-in policies based on FilterBuildingContext
      (e.g. to build a Monkey filtering policy). With the old API, we have
      evidence of people modifying keys going into the filter, but most cases
      of that can be handled with prefix_extractor.
      
      Having FilterBitsBuilder+Reader in the public API is an ongoing
      hindrance to code evolution (e.g. recent new Finish and
      MaybePostVerify), so this change removes them from the public API
      for 7.0. Maybe they will come back in some form later, but lacking
      evidence of them providing value in the public API, we want to take back
      more freedom to evolve them.
      
      With this moved to internal-only, there is no rush to clean up the
      complex Finish signatures, or add memory allocator support, but doing so
      is much easier with them out of the public API, for example to use
      CacheAllocationPtr without exposing it in the public API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9592
      
      Test Plan: cosmetic changes only
      
      Reviewed By: hx235
      
      Differential Revision: D34315470
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 03e03bb66a72c73df2c464d2dbbbae906dd8f99b
      725833a4
    • A
      Fix some MultiGet batching stats (#9583) · 627deb7c
      anand76 authored
      Summary:
      The NUM_INDEX_AND_FILTER_BLOCKS_READ_PER_LEVEL, NUM_DATA_BLOCKS_READ_PER_LEVEL, and NUM_SST_READ_PER_LEVEL stats were being recorded only when the last file in a level happened to have hits. They are supposed to be updated for every level. Also, there was some overcounting of GetContextStats. This PR fixes both the problems.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9583
      
      Test Plan: Update the unit test in db_basic_test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D34308044
      
      Pulled By: anand1976
      
      fbshipit-source-id: b3b36020fda26ba91bc6e0e47d52d58f4d7f656e
      627deb7c
    • S
      Add record to set WAL compression type if enabled (#9556) · 39b0d921
      Siddhartha Roychowdhury authored
      Summary:
      When WAL compression is enabled, add a record (a new record type) storing the compression type, to indicate that all subsequent records are compressed. The log reader stores the compression type when this record is encountered and uses it to decompress the subsequent records. Compression and decompression will be implemented in subsequent diffs.
      Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled.
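      As a rough sketch of the mechanism described above (the record-type values, payload layout, and the choice of zlib are invented for illustration; the actual RocksDB log record format differs):
      ```python
      import zlib

      # Hypothetical record types -- the real RocksDB record type values differ.
      SET_COMPRESSION_TYPE = 0  # announces the algorithm for all subsequent records
      DATA = 1

      def write_log(records):
          """Emit a compression-type record first, then compressed data records."""
          log = [(SET_COMPRESSION_TYPE, b"zlib")]
          for payload in records:
              log.append((DATA, zlib.compress(payload)))
          return log

      def read_log(log):
          """Store the compression type when its record is seen; use it to
          decompress every subsequent data record."""
          compression = None
          out = []
          for rec_type, payload in log:
              if rec_type == SET_COMPRESSION_TYPE:
                  compression = payload.decode()
              elif rec_type == DATA:
                  assert compression == "zlib"  # only type in this sketch
                  out.append(zlib.decompress(payload))
          return out
      ```
      
      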
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556
      
      Reviewed By: anand1976
      
      Differential Revision: D34308216
      
      Pulled By: sidroyc
      
      fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705
      39b0d921
    • J
      Add subcompaction event API (#9311) · f092f0fa
      Jay Zhuang authored
      Summary:
      Add an event callback for subcompactions, along with a sub_job_id to identify each one.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9311
      
      Reviewed By: ajkr
      
      Differential Revision: D33892707
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 57b5e5e594d61b2112d480c18a79a36751f65a4e
      f092f0fa
    • P
      Clarify compiler support release note (#9593) · a86ee02d
      Peter Dillinger authored
      Summary:
      in HISTORY.md
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9593
      
      Test Plan: release note only
      
      Reviewed By: siying
      
      Differential Revision: D34318189
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ba2eca8bede2d42a3fefd10b954b92cb54f831f2
      a86ee02d
    • A
      Update build files for java8 build (#9541) · 36ce2e2a
      Alan Paxton authored
      Summary:
      For RocksJava 7 we will move from requiring Java 7 to Java 8.
      
      * This simplifies the `Makefile` as we no longer need to deal with Java 7, so we no longer use `javah`.
      * Added a `java-version` target, invoked by the `java` target, which exits if the version of Java in use is not 8 or greater.
      * Enforces Java 8 as a minimum.
      * Fixed CMake build.
      
      * Fixed the broken Java event listener test: the assertions in the callbacks were not causing assertions in the tests. The callbacks now queue up assertion errors for the main thread of the tests to check.
      * Fixed C++ dangling pointers in the test code.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9541
      
      Reviewed By: pdillinger
      
      Differential Revision: D34214929
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: fdff348758d0a23a742e83c87d5f54073ce16ca6
      36ce2e2a
    • A
      Support C++17 Docker build environments for RocksJava (#9500) · 5e644079
      Adam Retter authored
      Summary:
      See https://github.com/facebook/rocksdb/issues/9388#issuecomment-1029583789
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9500
      
      Reviewed By: pdillinger
      
      Differential Revision: D34114687
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 22129d99ccd0dba7e8f1b263ddc5520d939641bf
      5e644079
  11. 17 February 2022, 2 commits
    • A
      Add rate limiter priority to ReadOptions (#9424) · babe56dd
      Andrew Kryczka authored
      Summary:
      Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.
      
      `RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.
      
      There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases where I believe it would be good to let the user control the priority some day (e.g., file footer reads), and no TODO in cases where I believe it doesn't matter (e.g., trace file reads).
      
      The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.
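      A minimal sketch of how such priority-aware, chunked reads can work (the limiter and all names below are invented stand-ins for illustration, not the RocksDB implementation):
      ```python
      # Stand-ins for Env::IOPriority; IO_TOTAL means "do not rate limit".
      IO_TOTAL, IO_LOW, IO_USER = "total", "low", "user"

      class FakeRateLimiter:
          """Grants at most burst_bytes per request and records each grant."""
          def __init__(self, burst_bytes):
              self.burst_bytes = burst_bytes
              self.grants = []

          def request(self, n, priority):
              granted = min(n, self.burst_bytes)
              self.grants.append((granted, priority))
              return granted

      def rate_limited_read(data, offset, n, limiter, priority=IO_TOTAL):
          """Read n bytes starting at offset, chunked by the limiter's quota.
          With priority == IO_TOTAL the read bypasses the limiter entirely."""
          if priority == IO_TOTAL or limiter is None:
              return data[offset:offset + n]
          out = bytearray()
          while n > 0:
              granted = limiter.request(n, priority)
              out += data[offset:offset + granted]
              offset += granted
              n -= granted
          return bytes(out)
      ```
      This mirrors the chunking observed in the benchmark above: a large read is split into limiter-sized pieces, each of which waits for quota at the caller's priority.
      
      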
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424
      
      Test Plan:
      - new unit tests
      - new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
        - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
        - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
      - crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`
      
      Reviewed By: hx235
      
      Differential Revision: D33747386
      
      Pulled By: ajkr
      
      fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
      babe56dd
    • Y
      Fix a silent data loss for write-committed txn (#9571) · 1cda273d
      Yanqin Jin authored
      Summary:
      The following sequence of events can cause silent data loss for write-committed
      transactions.
      ```
      Time    thread 1                                       bg flush
       |   db->Put("a")
       |   txn = NewTxn()
       |   txn->Put("b", "v")
       |   txn->Prepare()       // writes only to 5.log
       |   db->SwitchMemtable() // memtable 1 has "a"
       |                        // close 5.log,
       |                        // creates 8.log
       |   trigger flush
       |                                                  pick memtable 1
       |                                                  unlock db mutex
       |                                                  write new sst
       |   txn->ctwb->Put("gtid", "1") // writes 8.log
       |   txn->Commit() // writes to 8.log
       |                 // writes to memtable 2
       |                                               compute min_log_number_to_keep_2pc, this
       |                                               will be 8 (incorrect).
       |
       |                                             Purge obsolete wals, including 5.log
       |
       V
      ```
      
      At this point, the writes of the txn exist only in the memtable. The db is closed without flush because
      the db thinks the data in the memtable are backed by the log. On reopen, the writes are lost except for the
      key-value pair {"gtid"->"1"}; only the commit marker of the txn is in 8.log.
      
      The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`.
      In the above example, when the bg flush thread tries to find obsolete WALs, it uses the information
      computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()`
      depends on three components
      - `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
      - `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section but the txn hasn't committed.
      - `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.
      
      The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skipped the column families
      that are being flushed, but the unit test added in this PR shows that they should not be skipped. In this unit test, there is
      only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
      We should return this information via `FindMinPrepLogReferencedByMemTable()`.
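      The interplay of the three components can be sketched as a simple minimum (a simplified stand-in: the function and parameter names below only mirror the RocksDB internals named above, this is not the actual code):
      ```python
      def precompute_min_log_number_to_keep_2pc(
              min_log_with_unflushed_data,           # PrecomputeMinLogNumberToKeepNon2PC()
              min_log_with_outstanding_prep,         # FindMinLogContainingOutstandingPrep()
              min_prep_log_referenced_by_memtable):  # FindMinPrepLogReferencedByMemTable()
          """WALs numbered below the returned value are considered safe to purge.
          A None component means 'no constraint from this source'."""
          candidates = [min_log_with_unflushed_data]
          if min_log_with_outstanding_prep is not None:
              candidates.append(min_log_with_outstanding_prep)
          if min_prep_log_referenced_by_memtable is not None:
              candidates.append(min_prep_log_referenced_by_memtable)
          return min(candidates)
      ```
      In the scenario above, skipping the flushing column family makes the third component empty, so the minimum comes out as 8 and 5.log is purged; once the flushing memtable's dependence on the prepare section in 5.log is reported, the minimum is 5 and 5.log is retained.
      
      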
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571
      
      Test Plan:
      ```
      ./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
      make check
      ```
      
      Reviewed By: siying
      
      Differential Revision: D34235236
      
      Pulled By: riversand963
      
      fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1
      1cda273d