1. 01 Feb 2019, 2 commits
    • Use correct FileMeta for atomic flush result install (#4932) · 842cdc11
      Committed by Yanqin Jin
      Summary:
      1. This commit fixes our handling of a combination of two separate edge
      cases. If a flush job does not pick any memtable to flush (because another
      flush job has already picked the same memtables), and the column family
      assigned to the flush job is dropped right before RocksDB calls
      rocksdb::InstallMemtableAtomicFlushResults, the original code passed
      a FileMetaData object whose file number is 0, failing the assertion in
      rocksdb::InstallMemtableAtomicFlushResults (assert(m->GetFileNumber() > 0)).
      2. Also piggyback a small change: since we already create a local copy of the column family's mutable CF options to eliminate a potential race condition with `SetOptions` calls, we might as well use the local copy in the other function calls in the same scope.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4932
      
      Differential Revision: D13901322
      
      Pulled By: riversand963
      
      fbshipit-source-id: b936580af7c127ea0c6c19ea10cd5fcede9fb0f9
    • Take snapshots once for all cf flushes (#4934) · 35e5689e
      Committed by Maysam Yabandeh
      Summary:
      FlushMemTablesToOutputFiles calls FlushMemTableToOutputFile for each column family. The patch moves the take-snapshot logic outside FlushMemTableToOutputFile so that it is done once for all the flushes. This also addresses a deadlock issue when resetting the managed snapshot of job_snapshot in the 2nd call to FlushMemTableToOutputFile.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4934
      
      Differential Revision: D13900747
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f3cd650c5fff24cf95c1aaf8a10c149d42bf042c
  2. 16 Jan 2019, 1 commit
    • WritePrepared: Fix visible key compacted out by compaction (#4883) · 5d4fddfa
      Committed by Yi Wu
      Summary:
      With WritePrepared transactions, a flush/compaction can contain uncommitted keys, and those keys can get committed during the compaction. If a snapshot is taken before a key is committed, it should not see the key. On the other hand, compaction grabs the list of snapshots at its beginning, and only considers those snapshots to dedup keys. Consider the case:
      ```
      seq = 1: put "foo" = "bar"
      seq = 2: transaction T: delete "foo", prepare
      seq = 3: compaction start
      seq = 4: take snapshot S
      seq = 5: transaction T: commit.
      ...
      seq = N: compaction iterator reached key "foo".
      ```
      When the compaction starts, the list of snapshots is empty, so the compaction doesn't take snapshot S into account. By the time it reaches "foo", transaction T has committed. The compaction may conclude that the value "foo=bar" is not visible to any snapshot (which is wrong), and compact the value out.
      
      The fix is to explicitly take a snapshot before the compaction grabs the list of snapshots. The compaction then has to keep keys visible to this snapshot.
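      A minimal self-contained model of the fix (the types and function below are illustrative, not RocksDB internals): the compaction registers its own snapshot in the live-snapshot set before copying the list, so every snapshot taken afterwards is bounded from below by the compaction's snapshot.
      ```
      #include <cstdint>
      #include <set>
      
      using SeqNo = uint64_t;
      
      // Hypothetical sketch: insert the compaction's own snapshot at the current
      // sequence number *before* copying the list of live snapshots.
      std::set<SeqNo> BeginCompaction(std::set<SeqNo>& live_snapshots,
                                      SeqNo current_seq) {
        live_snapshots.insert(current_seq);  // compaction's own snapshot
        return live_snapshots;               // copy handed to the compaction
      }
      // Any snapshot S created later satisfies S >= current_seq, so a key version
      // that must stay visible to current_seq also stays visible to S.
      ```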
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4883
      
      Differential Revision: D13668775
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 1cab9615f94b7d3e8522cc3d44c3a14c7d4720e4
  3. 12 Jan 2019, 1 commit
    • Make a copy of MutableCFOptions to avoid race condition (#4876) · 301da345
      Committed by Yanqin Jin
      Summary:
      If we do not do this, then reading MutableCFOptions may have a race condition
      with SetOptions which modifies MutableCFOptions.
      
      Also reserve space in advance for vectors to avoid reallocation changing the
      address of their elements.
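      A minimal sketch of the pattern with hypothetical stand-in types (not the actual DBImpl code): take the copy under the same mutex that `SetOptions` acquires, then read only the local copy afterwards.
      ```
      #include <mutex>
      
      struct MutableCFOptions { size_t write_buffer_size = 64 << 20; };
      
      struct ColumnFamily {
        std::mutex mu;           // the mutex SetOptions() takes before mutating opts
        MutableCFOptions opts;   // modified in place by SetOptions()
      
        MutableCFOptions CopyOpts() {
          std::lock_guard<std::mutex> lock(mu);
          return opts;           // copied by value; safe to read without the lock
        }
      };
      ```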
      
      Test plan
      ```
      $make clean && make -j32 all check
      $make clean && COMPILE_WITH_TSAN=1 make -j32 all check
      $make clean && COMPILE_WITH_ASAN=1 make -j32 all check
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4876
      
      Differential Revision: D13644500
      
      Pulled By: riversand963
      
      fbshipit-source-id: 4b8112c5c819d5a2922bb61ad1521b3d2fb2fd47
  4. 04 Jan 2019, 1 commit
    • Refactor atomic flush result installation to MANIFEST (#4791) · a07175af
      Committed by Yanqin Jin
      Summary:
      as titled.
      Since different bg flush threads can flush different sets of column families
      (due to column family creation and drop), we decided not to let one thread
      perform atomic flush result installation for other threads. Bg flush threads
      will install their atomic flush results sequentially to the MANIFEST, using
      a condition variable, i.e. atomic_flush_install_cv_, to coordinate.
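      A self-contained model of the coordination (the queue and function are hypothetical; only the condition-variable name comes from the patch): each thread waits until it is at the head of the install queue, writes to the MANIFEST, then wakes the others.
      ```
      #include <condition_variable>
      #include <deque>
      #include <functional>
      #include <mutex>
      #include <thread>
      
      std::mutex mu;
      std::condition_variable atomic_flush_install_cv;
      std::deque<std::thread::id> install_queue;  // FIFO of waiting flush threads
      
      void InstallAtomicFlushResultInOrder(const std::function<void()>& write_manifest) {
        std::unique_lock<std::mutex> lock(mu);
        install_queue.push_back(std::this_thread::get_id());
        atomic_flush_install_cv.wait(lock, [] {
          return install_queue.front() == std::this_thread::get_id();
        });
        write_manifest();  // MANIFEST writes are strictly serialized
        install_queue.pop_front();
        atomic_flush_install_cv.notify_all();
      }
      ```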
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4791
      
      Differential Revision: D13498930
      
      Pulled By: riversand963
      
      fbshipit-source-id: dd7482fc41f4bd22dad1e1ef7d4764ef424688d7
  5. 03 Jan 2019, 1 commit
  6. 19 Dec 2018, 1 commit
    • Avoid switching empty memtable in certain cases (#4792) · 671a7eb3
      Committed by Yanqin Jin
      Summary:
      In certain cases, we do not perform memtable switching if the active
      memtable of the column family is empty. Two exceptions:
      1. In a manual flush, if cached_recoverable_state_empty_ is false, then we need
         to switch the memtable due to a transaction requirement.
      2. When switching the WAL, we need to switch the memtable anyway because we
         have to seal the memtable if the WAL on which it depends will be closed.
      
      This change can potentially delay the occurrence of write stalls because the
      number of memtables increases more slowly.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4792
      
      Differential Revision: D13499501
      
      Pulled By: riversand963
      
      fbshipit-source-id: 91c9b17ae753578578039f3851667d93610005e1
  7. 14 Dec 2018, 3 commits
    • Improve flushing multiple column families (#4708) · 4fce44fc
      Committed by Yanqin Jin
      Summary:
      If one column family is dropped, we should simply skip it and continue to flush
      other active ones.
      Currently we use Status::ShutdownInProgress to notify the caller of column
      families being dropped. In the future, we should consider using a different
      Status code. A model of the skipping behavior follows below.
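      A minimal model of the intended behavior (types hypothetical, not RocksDB internals): a concurrently dropped column family is skipped rather than failing the whole multi-CF flush.
      ```
      #include <string>
      #include <vector>
      
      struct CF { std::string name; bool dropped = false; };
      
      // Flush every still-active CF; a dropped CF is skipped, not treated as an error.
      std::vector<std::string> FlushActive(const std::vector<CF>& cfs) {
        std::vector<std::string> flushed;
        for (const CF& cf : cfs) {
          if (cf.dropped) continue;    // skip it and keep flushing the others
          flushed.push_back(cf.name);  // stand-in for the real flush work
        }
        return flushed;
      }
      ```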
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708
      
      Differential Revision: D13378954
      
      Pulled By: riversand963
      
      fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca
    • Get `CompactionJobInfo` from CompactFiles · 2670fe8c
      Committed by DorianZheng
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4716
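      The PR has no written summary; it adds an optional CompactionJobInfo output parameter to DB::CompactFiles. A hedged usage sketch against the public API (parameter values illustrative):
      ```
      rocksdb::CompactionJobInfo job_info;
      rocksdb::Status s = db->CompactFiles(
          rocksdb::CompactionOptions(), input_file_names, /*output_level=*/1,
          /*output_path_id=*/-1, /*output_file_names=*/nullptr, &job_info);
      // On success, job_info describes the finished job (cf_name, input/output
      // files, stats, ...).
      ```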
      
      Differential Revision: D13207677
      
      Pulled By: ajkr
      
      fbshipit-source-id: d0ccf5a66df6cbb07288b0c5ebad81fd9df3926b
    • Concurrent task limiter for compaction thread control (#4332) · a8b9891f
      Committed by Burton Li
      Summary:
      This PR targets the issue reported in:
      https://github.com/facebook/rocksdb/issues/3972#issue-330771918
      
      We have a rocksdb instance using leveled compaction with multiple column families (CFs); some CFs use HDD to store big, less frequently accessed data, and others use SSD.
      When there is continuous write traffic to all CFs, the compaction thread pool is mostly occupied by the slow HDD compactions, which prevents the SSD bandwidth from being fully utilized.
      Since atomic writes and transactions are needed across CFs, splitting into multiple rocksdb instances is not an option for us.
      
      With the compaction thread control, we got a 30%+ HDD write throughput gain, and also much smoother SSD writes since fewer write stalls happen.
      
      A ConcurrentTaskLimiter can be shared by multiple CFs across rocksdb instances, so the feature works not only for multi-CF scenarios, but also for multi-rocksdb scenarios that need per-tenant disk IO resource control.
      
      The usage is straightforward, e.g.:
      ```
      // Enable the compaction thread limiter via ColumnFamilyOptions.
      std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
      Options options;
      ColumnFamilyOptions cf_opt(options);
      cf_opt.compaction_thread_limiter = ctl;
      ...
      
      // The compaction thread limiter can be tuned or disabled on-the-fly.
      ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
      ...
      ctl->ResetMaxOutstandingTask(); // disable (bypass) the thread limiter
      ctl->SetMaxOutstandingTask(-1); // same as above
      ...
      ctl->SetMaxOutstandingTask(0);  // full throttle (0 tasks)
      
      // Sharing a compaction thread limiter among CFs (to resolve the mixed-storage perf issue).
      std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
      std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
      Options options;
      ColumnFamilyOptions cf_opt_ssd1(options);
      ColumnFamilyOptions cf_opt_ssd2(options);
      ColumnFamilyOptions cf_opt_hdd1(options);
      ColumnFamilyOptions cf_opt_hdd2(options);
      ColumnFamilyOptions cf_opt_hdd3(options);
      
      // SSD CFs
      cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
      cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
      
      // HDD CFs
      cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
      cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
      cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
      
      ...
      
      // The limiter is disabled by default (or when set to nullptr explicitly).
      Options options;
      ColumnFamilyOptions cf_opt(options);
      cf_opt.compaction_thread_limiter = nullptr;
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
      
      Differential Revision: D13226590
      
      Pulled By: siying
      
      fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
  8. 12 Dec 2018, 1 commit
  9. 06 Dec 2018, 1 commit
  10. 30 Nov 2018, 1 commit
    • Fix a flaky test DBFlushTest.SyncFail (#4633) · 8d7bc76f
      Committed by Yanqin Jin
      Summary:
      There is a race condition in DBFlushTest.SyncFail, as illustrated below.
      ```
      time         thread1                             bg_flush_thread
        |     Flush(wait=false, cfd)
        |     refs_before=cfd->current()->TEST_refs()   PickMemtable calls cfd->current()->Ref()
        V
      ```
      The race between thread1 reading the ref count of cfd's current version and
      bg_flush_thread incrementing that ref count makes it possible for the later
      assertion on refs_before to fail. Therefore, we add test sync points to
      enforce the order, and assert on the ref count before and after PickMemtable
      is called in bg_flush_thread.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4633
      
      Differential Revision: D12967131
      
      Pulled By: riversand963
      
      fbshipit-source-id: a99d2bacb7869ec5d8d03b24ef2babc0e6ae1a3b
  11. 15 Nov 2018, 1 commit
    • Rollback memtable flush upon atomic flush fail (#4641) · 14769742
      Committed by Yanqin Jin
      Summary:
      This fixes an assertion failure.
      
      An atomic flush can have multiple flush jobs, some of which may fail. If any of
      them fails, we need to roll back all of them.
      For the flush jobs that do fail, we already call `RollbackMemTableFlush` in
      `FlushJob::Run`. The tricky part is the flush jobs that have completed
      successfully: we need to call `RollbackMemTableFlush` for them as well.
      
      The newly added DBAtomicFlushTest.AtomicFlushRollbackSomeJobs will SigAbort
      without the corresponding change in AtomicFlushMemTablesToOutputFiles.
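      A minimal model of the all-or-nothing rule (types hypothetical, not RocksDB's):
      ```
      #include <vector>
      
      struct FlushJob { bool ran_ok = false; bool rolled_back = false; };
      
      void FinishAtomicFlush(std::vector<FlushJob>& jobs) {
        bool all_ok = true;
        for (const FlushJob& j : jobs) all_ok = all_ok && j.ran_ok;
        if (!all_ok) {
          // Roll back every job, including the ones that completed successfully,
          // so that no subset of the atomic batch becomes visible.
          for (FlushJob& j : jobs) j.rolled_back = true;
        }
      }
      ```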
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4641
      
      Differential Revision: D12943649
      
      Pulled By: riversand963
      
      fbshipit-source-id: c66a4a664a1e0938e938fd41edc5a70c34cdd868
  12. 14 Nov 2018, 1 commit
  13. 13 Nov 2018, 2 commits
    • Fix `CompactFiles` bug (#4665) · 0f88160f
      Committed by DorianZheng
      Summary:
      `CompactFiles` gets the `SuperVersion` before `WaitForIngestFile`, while `IngestExternalFile` may add files that overlap with `input_file_names`.
      
      The timeline of the execution flow is as follows:
      
      Let's say that level N has two files, [1,2] and [5,6]:
      ```
      timeline              user_thread1                             user_thread2
      t0   |      CompactFiles([1, 2], [5, 6]) begin
      t1   |         GetReferencedSuperVersion()
      t2   |                                              IngestExternalFile([3,4]) to level N begin
      t3   |             CompactFiles resume
           V
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4665
      
      Differential Revision: D13030674
      
      Pulled By: ajkr
      
      fbshipit-source-id: 8be19477fd6e505032267a979d32f3097cc3be51
    • Remove redundant member var and set options (#4631) · 05dec0c7
      Committed by Yanqin Jin
      Summary:
      In the past, both `DBImpl::atomic_flush_` and
      `DBImpl::immutable_db_options_.atomic_flush` existed. However, we failed to set
      `immutable_db_options_.atomic_flush`, and used `DBImpl::atomic_flush_`, which is
      set correctly. This does not lead to incorrect behavior, but duplicates the
      information.
      
      Since `immutable_db_options_` is always there and has `atomic_flush`, we should
      use it as the source of truth and remove `DBImpl::atomic_flush_`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4631
      
      Differential Revision: D12928371
      
      Pulled By: riversand963
      
      fbshipit-source-id: f85a811959d3828aad4a3a1b05f71facf19c636d
  14. 10 Nov 2018, 2 commits
    • Fix DBTest.SoftLimit flakyness (#4658) · 859dbda6
      Committed by Yi Wu
      Summary:
      The flakiness can be reproduced with the following patch:
      ```
       --- a/db/db_impl_compaction_flush.cc
      +++ b/db/db_impl_compaction_flush.cc
      @@ -2013,6 +2013,9 @@ void DBImpl::BackgroundCallFlush() {
             if (job_context.HaveSomethingToDelete()) {
               PurgeObsoleteFiles(job_context);
             }
      +      static int f_count = 0;
      +      printf("clean flush job context %d\n", ++f_count);
      +      env_->SleepForMicroseconds(1000000);
             job_context.Clean();
             mutex_.Lock();
           }
      ```
      The issue is that FlushMemtable with opt.wait=true does not wait for `OnStallConditionsChanged` to be called. The event listener is triggered on `JobContext::Clean`, which happens after the flush result is installed. At the time we check for the stall condition after flushing the memtable, the job context cleanup may not have finished.
      
      To fix the flakiness, we use a sync point to create a custom WaitForFlush that waits for the context cleanup.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4658
      
      Differential Revision: D13007301
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: d98395ee7b0ad4c62e83e8d0e9b6028058c61712
    • Update all unique/shared_ptr instances to be qualified with namespace std (#4638) · dc352807
      Committed by Sagar Vemuri
      Summary:
      Ran the following commands to recursively change all the files under RocksDB:
      ```
      find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
      ```
      Running `make format` updated some formatting on the files touched.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
      
      Differential Revision: D12934992
      
      Pulled By: sagar0
      
      fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
  15. 02 Nov 2018, 1 commit
    • Prevent manual flush hanging in read-only mode (#4615) · 5c794d94
      Committed by Andrew Kryczka
      Summary:
      The logic to wait for stall conditions to clear before beginning a manual flush didn't take into account whether the DB was in read-only mode. In read-only mode the stall conditions would never clear since no background work is happening, so the wait would be never-ending. It's probably better to return an error to the user.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4615
      
      Differential Revision: D12888008
      
      Pulled By: ajkr
      
      fbshipit-source-id: 1c474b42a7ac38d9fd0d0e2340ff1d53e684d83c
  16. 01 Nov 2018, 1 commit
    • Prevent manual compaction hanging in read-only mode (#4611) · b8f68bac
      Committed by Andrew Kryczka
      Summary:
      A background compaction with pre-picked files (i.e., either a manual compaction or a bottom-pri compaction) fails when the DB is in read-only mode. In the failure handling, we forgot to unregister the compaction and the files it covered. Then subsequent manual compactions could conflict with this zombie compaction (possibly Halloween related) and wait forever for it to finish.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4611
      
      Differential Revision: D12871217
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9d24e921d5bbd2ee8c2c9536a30abfa42a220c6e
  17. 27 Oct 2018, 1 commit
  18. 16 Oct 2018, 1 commit
  19. 11 Oct 2018, 1 commit
    • support OnCompactionBegin (#4431) · 09814f2c
      Committed by Peter Pei
      Summary:
      fix #4288
      
      Add `OnCompactionBegin` support to `rocksdb::EventListener`.
      
      Currently, we only have these three callbacks:
      
      - OnFlushBegin
      - OnFlushCompleted
      - OnCompactionCompleted
      
      As paolococchi requested in #4288, and ajkr agreed, we should also support `OnCompactionBegin`.
      
      This PR implements support for `OnCompactionBegin`.
      
      Hope it is useful to you.
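      A short usage sketch of the new callback (the listener class and its body are illustrative; the override signature matches the public EventListener API):
      ```
      #include <iostream>
      #include <memory>
      #include <rocksdb/db.h>
      #include <rocksdb/listener.h>
      #include <rocksdb/options.h>
      
      class CompactionLogger : public rocksdb::EventListener {
       public:
        void OnCompactionBegin(rocksdb::DB* /*db*/,
                               const rocksdb::CompactionJobInfo& info) override {
          std::cout << "compaction begin: cf=" << info.cf_name
                    << ", inputs=" << info.input_files.size() << std::endl;
        }
      };
      
      // Registration:
      //   rocksdb::Options options;
      //   options.listeners.emplace_back(std::make_shared<CompactionLogger>());
      ```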
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431
      
      Differential Revision: D10055515
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334
  20. 10 Oct 2018, 1 commit
  21. 09 Oct 2018, 2 commits
    • move dump stats to a separate thread (#4382) · cac87fcf
      Committed by Zhongyi Xie
      Summary:
      Currently, statistics are supposed to be dumped to the info log at intervals of `options.stats_dump_period_sec`. However, the implementation bound this to the compaction thread, meaning that if the database has been serving very light traffic, the stats may not get dumped at all.
      We decided to move stats dumping into a new timed thread using `TimerQueue`, which is already used in blob_db. This allows us to schedule new timed tasks with more deterministic behavior.
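      The knob itself is unchanged; only its scheduling moved. For reference, a minimal setup sketch (the path and interval are illustrative):
      ```
      rocksdb::Options options;
      options.create_if_missing = true;
      options.stats_dump_period_sec = 20;  // dump stats to the info LOG every ~20s
      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/statsdb", &db);
      ```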
      
      Tested with db_bench using `--stats_dump_period_sec=20` on the command line:
      > LOG:2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      LOG:2018/09/17-14:08:05.643286 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      LOG:2018/09/17-14:08:25.691325 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      LOG:2018/09/17-14:08:45.740989 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      
      LOG content:
      > 2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      2018/09/17-14:07:45.575080 7fe99fbfe700 [WARN] [db/db_impl.cc:606]
      ** DB Stats **
      Uptime(secs): 20.0 total, 20.0 interval
      Cumulative writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5.57 GB, 285.01 MB/s
      Cumulative WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 GB, 285.01 MB/s
      Cumulative stall: 00:00:0.012 H:M:S, 0.1 percent
      Interval writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5700.71 MB, 285.01 MB/s
      Interval WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 MB, 285.01 MB/s
      Interval stall: 00:00:0.012 H:M:S, 0.1 percent
      ** Compaction Stats [default] **
      Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4382
      
      Differential Revision: D9933051
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 6d12bb1e4977674eea4bf2d2ac6d486b814bb2fa
    • Expose column family id to OnCompactionCompleted (#4466) · e0f05754
      Committed by DorianZheng
      Summary:
      The controller you requested could not be found. PTAL
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4466
      
      Differential Revision: D10241358
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 99664eb286860a6c8844d50efeb0ef6f0e10dd1e
  22. 16 Sep 2018, 1 commit
    • Auto recovery from out of space errors (#4164) · a27fce40
      Committed by Anand Ananthabhotla
      Summary:
      This commit implements automatic recovery from a Status::NoSpace() error
      during background operations such as write callback, flush and
      compaction. The broad design is as follows:
      1. Compaction errors are treated as soft errors and don't put the
      database in read-only mode. A compaction is delayed until enough free
      disk space is available to accommodate the compaction outputs, which is
      estimated based on the input size. This means that users can continue to
      write, and we rely on the WriteController to delay or stop writes if the
      compaction debt becomes too high due to a persistent low-disk-space
      condition.
      2. Errors during write callback and flush are treated as hard errors,
      i.e. the database is put in read-only mode and goes back to read-write
      mode only after certain recovery actions are taken.
      3. Both types of recovery rely on the SstFileManagerImpl to poll for
      sufficient disk space. We assume that there is a 1-1 mapping between an
      SFM and the underlying OS storage container. For cases where multiple
      DBs are hosted on a single storage container, the user is expected to
      allocate a single SFM instance and use the same one for all the DBs
      (see the sketch after this list). If no SFM is specified by the user,
      DBImpl::Open() will allocate one, but this will be one per DB and each
      DB will recover independently. The recovery implemented by SFM is as
      follows:
        a) On the first occurrence of an out-of-space error during compaction,
        subsequent compactions will be delayed until the disk free space check
        indicates enough available space. The required space is computed as
        the sum of input sizes.
        b) The free space check requirement will be removed once the amount of
        free space is greater than the size reserved by in-progress
        compactions when the first error occurred.
        c) If the out-of-space error is a hard error, a background thread in
        SFM will poll for sufficient headroom before triggering the recovery
        of the database and putting it back in read-write mode. The headroom
        is calculated as the sum of the write_buffer_size of all the DB
        instances associated with the SFM.
      4. EventListener callbacks will be called at the start and completion of
      automatic recovery. Users can disable the auto recovery in the start
      callback, and later initiate it manually by calling DB::Resume()
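      A hedged setup sketch of the shared-SFM recommendation (the options objects are illustrative):
      ```
      #include <memory>
      #include <rocksdb/env.h>
      #include <rocksdb/options.h>
      #include <rocksdb/sst_file_manager.h>
      
      // One SstFileManager per storage device, shared by every DB hosted on it,
      // so the free-space polling sees the device as a whole.
      std::shared_ptr<rocksdb::SstFileManager> sfm(
          rocksdb::NewSstFileManager(rocksdb::Env::Default()));
      
      rocksdb::Options opts1, opts2;
      opts1.sst_file_manager = sfm;  // DB 1 on this device
      opts2.sst_file_manager = sfm;  // DB 2 on the same device
      ```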
      
      Todo:
      1. More extensive testing
      2. Add disk full condition to db_stress (follow-on PR)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
      
      Differential Revision: D9846378
      
      Pulled By: anand1976
      
      fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
  23. 30 Aug 2018, 1 commit
    • Avoiding write stall caused by manual flushes (#4297) · 927f2749
      Committed by Mikhail Antonov
      Summary:
      Basically, at the moment it is possible to cause a write stall by calling flush (either manually via DB::Flush(), or from the Backup Engine directly calling FlushMemTable()) while a background flush is already happening.
      
      One way to fix it: in DBImpl::CompactRange() we already check for a possible stall and delay the flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic into a separate method and call it from FlushMemTable.
      
      This is a draft patch, for a first look; we need to check tests/update SyncPoints, and would most certainly need to add an allow_write_stall option to FlushOptions().
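      For reference, the knob that exists on today's FlushOptions is a field rather than a method; a hedged usage sketch:
      ```
      rocksdb::FlushOptions fo;
      fo.wait = true;
      fo.allow_write_stall = false;  // wait for stall conditions to clear first
      rocksdb::Status s = db->Flush(fo);
      ```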
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297
      
      Differential Revision: D9420705
      
      Pulled By: mikhail-antonov
      
      fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc
  24. 25 Aug 2018, 1 commit
    • Refactor flush request queueing and processing (#3952) · 7daae512
      Committed by Yanqin Jin
      Summary:
      RocksDB currently queues individual column families for flushing. This is not sufficient to support the needs of some applications that want to enforce order/dependency between column families, given that multiple foreground and background activities can trigger flushing in RocksDB.
      
      This PR aims to address this limitation. Each flush request is described as a `FlushRequest` that can contain multiple column families. A background flushing thread pops one flush request from the queue at a time and processes it.
      
      This PR does not enable atomic_flush yet, but is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
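      A self-contained sketch of the queue shape (all names hypothetical; the real type is internal to DBImpl): one request groups several column families, and a bg thread pops whole requests.
      ```
      #include <cstdint>
      #include <deque>
      #include <utility>
      #include <vector>
      
      struct ColumnFamilyData;  // stand-in for the real CFD
      
      // One queued request: the CFs to flush together, each with the max
      // memtable ID to flush up to.
      using FlushRequest = std::vector<std::pair<ColumnFamilyData*, uint64_t>>;
      std::deque<FlushRequest> flush_queue;
      
      FlushRequest PopFirstFlushRequest() {
        FlushRequest req = std::move(flush_queue.front());
        flush_queue.pop_front();
        return req;  // every CF in req is processed by this one bg thread
      }
      ```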
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3952
      
      Differential Revision: D8529933
      
      Pulled By: riversand963
      
      fbshipit-source-id: 78908a21e389a3a3f7de2a79bae0cd13af5f3539
  25. 04 Aug 2018, 1 commit
    • Update JobContext. (#3949) · 1f802773
      Committed by Yanqin Jin
      Summary:
      In the past, we assumed that a job modifies a single column family. Therefore, a job could create at most one superversion, since each superversion corresponds to one column family. This assumption led to `JobContext` having only one member variable called `superversion_context`.
      Now we want to support group flush of column families, meaning that each job can create multiple superversions. Therefore, we need the following change to accommodate this new feature.
      
      Add a vector of `SuperVersionContext` to `JobContext` to support installing
      superversions for multiple column families in one job context.
      
      This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3949
      
      Differential Revision: D8864895
      
      Pulled By: riversand963
      
      fbshipit-source-id: 5937a48817276370d3c8172db9c8aafc826d97ca
  26. 28 Jul 2018, 2 commits
    • Remove random writes from SST file ingestion (#4172) · 54de5684
      Committed by Yanqin Jin
      Summary:
      RocksDB used to store global_seqno in external SST files written by
      SstFileWriter. During file ingestion, RocksDB used `pwrite` to update the
      `global_seqno`. Since random writes are not supported by some non-POSIX-compliant
      file systems, external SST file ingestion was not supported on these file
      systems. To address this limitation, we no longer update `global_seqno` during
      file ingestion. Later, RocksDB uses the MANIFEST and other information in table
      properties to deduce the global seqno for externally-ingested SST files.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172
      
      Differential Revision: D8961465
      
      Pulled By: riversand963
      
      fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
    • Protect external file when ingesting (#4099) · f5e46354
      Committed by DorianZheng
      Summary:
      If a crash happens after a hard link is established, the Recover function may reuse a file number that has already been assigned to the internal file, and this would overwrite the external file. To protect the external file, we have to make sure its file number will never be reused.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4099
      
      Differential Revision: D9034092
      
      Pulled By: riversand963
      
      fbshipit-source-id: 3f1a737440b86aa2ef01673e5013aacbb7c33e28
  27. 29 Jun 2018, 1 commit
    • Allow DB resume after background errors (#3997) · 52d4c9b7
      Committed by Anand Ananthabhotla
      Summary:
      Currently, if RocksDB encounters errors during a write operation (user-requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts:
      1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
      2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
      3. Provide an API for the user to clear the error and resume the DB instance
      
      This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
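      The user-facing piece is part 3; a minimal usage sketch of the new API:
      ```
      // After a recoverable background error (e.g., ENOSPC during flush or
      // compaction) has put the DB in read-only mode and the cause has been
      // addressed:
      rocksdb::Status s = db->Resume();
      if (s.ok()) {
        // bg_error_ is cleared; the DB accepts writes again.
      }
      ```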
      Closes https://github.com/facebook/rocksdb/pull/3997
      
      Differential Revision: D8653831
      
      Pulled By: anand1976
      
      fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
  28. 15 May 2018, 1 commit
    • Bottommost level-based compactions in bottom-pri pool · 3d7dc75b
      Committed by Andrew Kryczka
      Summary:
      This feature was introduced for universal compaction in cc01985d. At that point we thought it'd be used only to prevent long-running universal full compactions from blocking short-lived upper-level compactions. Now we have a level compaction user who could benefit from it, since they use a more expensive compression algorithm in the bottom level. So enable it for level compaction too.
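      A hedged reminder of how the bottom-pri pool is sized (the thread count is illustrative; with zero bottom-pri threads, bottommost compactions stay in the low-pri pool):
      ```
      #include <rocksdb/env.h>
      
      // Dedicate two threads of the shared Env to bottommost compactions.
      rocksdb::Env::Default()->SetBackgroundThreads(2, rocksdb::Env::Priority::BOTTOM);
      ```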
      Closes https://github.com/facebook/rocksdb/pull/3835
      
      Differential Revision: D7957179
      
      Pulled By: ajkr
      
      fbshipit-source-id: 177285d2cef3b650b6a4d81dc5db84bc441c9fe4
  29. 04 May 2018, 1 commit
    • Skip deleted WALs during recovery · d5954929
      Committed by Siying Dong
      Summary:
      This patch records the min log number to keep in the manifest while flushing SST files, so that recovery can ignore that WAL and any WAL older than it. This avoids scenarios where there is a gap in the WAL files fed to the recovery procedure. The gap could happen, for example, due to out-of-order WAL deletion. Such a gap could cause problems in 2PC recovery, where the prepare and commit entries are placed in two separate WALs: a gap in the WALs could result in not processing the WAL with the commit entry, breaking the 2PC recovery logic.
      
      Before this commit, for the 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest log with outstanding prepare entries, or prepare entries whose respective commit or abort is in the memtable. With this commit, the same calculation is done while we apply the SST flush. Just before installing the flushed file, we precompute the earliest log file to keep after the flush finishes, using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for the new flushed SST file. This precomputed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until the next flush because the commit entry will stay in the memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. That is not done yet anyway. Even if we did it, the only thing we would lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway because the obsolete-file cleanup function is only executed after a flush or compaction.)
      
      This min log number to keep is stored in the manifest using the safely-ignorable customized field of the AddFile entry, in order to guarantee that a DB generated by a newer release can still be opened by previous releases no older than 4.2.
      Closes https://github.com/facebook/rocksdb/pull/3765
      
      Differential Revision: D7747618
      
      Pulled By: siying
      
      fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
  30. 28 Apr 2018, 2 commits
    • Add max_subcompactions as a compaction option · ed7a95b2
      Committed by Huachao Huang
      Summary:
      Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default.
      I added a `max_subcompactions` option to `CompactionOptions` so that we can choose a proper concurrency dynamically.
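      A hedged per-call usage sketch (the output level and count are illustrative):
      ```
      rocksdb::CompactionOptions copt;
      copt.max_subcompactions = 4;  // concurrency for this CompactFiles call only
      rocksdb::Status s =
          db->CompactFiles(copt, input_file_names, /*output_level=*/1);
      ```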
      Closes https://github.com/facebook/rocksdb/pull/3775
      
      Differential Revision: D7792357
      
      Pulled By: ajkr
      
      fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c
    • Rename pending_compaction_ to queued_for_compaction_. · 7dfbe335
      Committed by Yanqin Jin
      Summary:
      We use `queued_for_flush_` to indicate that a column family has been added to the
      flush queue. Similarly, and to be consistent in our naming, we need `queued_for_compaction_` to indicate that a column family has been added to the compaction queue. In the past we used
      `pending_compaction_`, which can be ambiguous.
      Closes https://github.com/facebook/rocksdb/pull/3781
      
      Differential Revision: D7790063
      
      Pulled By: riversand963
      
      fbshipit-source-id: 6786b11a4fcaea36dc9b4672233dbe042f921804
  31. 27 Apr 2018, 2 commits