db4o 8.0 is faster than any previous db4o version. By how much depends on the size of the database and on the fragmentation level. Testing with the Poleposition benchmarking framework we found that the new BTreeIdSystem completes our "Silverstone" circuit three times faster than the old PointerBasedIdSystem. This circuit repeatedly stores and deletes 3 million objects.
You are invited to download the 8.0 development release and to try out how it works with your existing databases. db4o 8.0 will read old databases as is but you will have to defragment once to switch to the new BTreeIdSystem.
We do not yet recommend 8.0 for production use. You will be the first users to try out the new functionality. It would be great if you could report back to our forums how the first 8.0 builds work for your usecases.
We expect that you will experience the following benefits:
- Fragmentation of databases has been dramatically reduced. Space reuse will work better by an order of magnitude.
- You do not have to run Defragment as often as before but you should do it occasionally if you want to reclaim IDs of deleted objects.
- The FreespaceManager runs with far less load now. If your systems are often turned off without a correct shutdown we recommend to use the BTreeFreespaceManager. The performance penalty in comparison to the RamFreespaceManager is no longer as high as it used to be.
- BTree caching has become a lot more efficient and query performance also benefits from the change, if field indexes can be used and especially if similar queries that use the same indexes are run multiple times.
- The file system cache works faster than before, especially if you increase the number of pages of the cache.
Whatever you do, db4o 8.0 should perform better for you than previous db4o versions. If not, please tell us!
In case you are interested how we built this:
The IdSystem is one of the most central and important components of db4o. It is responsible for:
- mapping in-memory IDs of objects to physical addresses of slots in the database file
- maintaining changes to these addresses
- performing ACID commit operations on this mapping
For a long time db4o has worked very well with an implementation where IDs were physical addresses in the database file without any intermediate mapping. The simplicity of this approach made it quite scalable because it guaranteed only one read operation to find the slot for an ID. However it had some drawbacks:
- Pointers can never be reused, once a new object is stored and committed (until you run defragment).
- The database file becomes fragmented over time because pointer slots can not be moved.
- Because of fragmentation the freespace manager has to maintain a large number of entries and it becomes a bottleneck.
- Defragment is frequently necessary, it takes quite a long time and online defragment is hard to implement.
- Commit needs to rewrite many pointers that may be scattered over the database file and it requires 4 (slow) calls to flush the file to the hard drive to make commits ACID.
It seemed obvious that our existing class and field index BTree implementation would work very good for the IdSystem as well. Our first Jira entry to attempt to use it goes back to 2006:
http://tracker.db4o.com/browse/COR-233
Finally we had the opportunity to tackle this task and I think it is a very nice first feature of our coming db4o 8.0 version.
As a first step we tried to factor out a clean interface. Doing that we found that there are actually two interfaces, a transactional one and a global one. If you are interested in the technical details, please see the following two classes in our sources:
com.db4o.internal.ids.IdSystem
com.db4o.internal.ids.TransactionalIdSystem
As a next step to test our design we wrote an InMemoryIdSystem. This InMemoryIdSysem also was necessary for the implementation of our BTreeIdSystem: BTree nodes again have IDs and slots and obviously the mapping for the main BTree can't be stored recursively in itself.
Using our existing BTrees to implement a new BTreeIdSystem on top of the InMemoryIdSystem went very smooth. Everything we needed was already there and worked great for us.
When we were "done" it was really exciting to try things out. How fast would the new system be? We used the Poleposition benchmarking framework to write a couple of new "circuits" that directly check the performance of the IdSystem. Our first results were quite disappointing. We saw a drop of 80x for single lookups between the new BTreeIdSystem and the old PointerBasedIdSystem.
With extensive profiling we identified the key bottlenecks and did quite a lot of tuning:
- We added a new LRU cache to BTree nodes.
- BTree node state transitioning (empty, read, write) was improved to reuse past information where possible.
- We identified our existing Cache implementation as a bottleneck when using a larger cache. We improved the implementation very much by using a doubly linked list for the CircularBuffer and by storing the list elements in a Hashtable.
- With larger amounts of objects (starting at about 1 million) we found that our underlying InMemoryIdSystem would have to write a lot of bytes on commit for the mappings of all the BTree nodes. Thinking about how to make this more scalable we came up with a tricky solution. We tested stacking three IdSystems on top of eachother:
new BTreeIdSystem(new BTreeIdSystem(new InMemoryIdSystem()))
Since the results turned out really excellent with this setup we decided to stay with this strategy.
Now the synthetic benchmarks between the old PointerBasedIdSystem (which is of course still there to read old database files) and the new BTreeIdSystem looked like a close match. At this point we were very happy since we knew that we would improve performance in real life scenarios by a lot: Our tests did not take fragmentation into account (which always occurs with real objects).
Now came the hard part: We wanted to make absolutely sure that the new system is perfectly robust and transactional. We also wanted to reduce the number of necessary flush calls during commit from four to two since single pointers no longer have to be overwritten as a separate step. The abstract algorithm of the new approach looks like this:
(1) Write all modified objects and all components of one big tree of ID information (consisting of multiple IdSystems)
(2) Write one version of the new address of this tree to the header
(3) Flush
(4) Write another version of the new address of the tree to the header
(5) Flush again
If we end up with both addresses of the tree being the same, the commit operation must have completed successfully. If we end up with different addresses, we check the checksum of both addresses and preferably use the second version if it's not corrupted. If the second version is corrupted, we use the first version.
This way we account for all cases:
If the system goes down before (4) the old address is still in place and it will be used. If something goes wrong during (4) or (5), the broken checksum will tell us and we will use the first address which will correctly point to all the new data.
This algorithm works because we keep both versions (before and after commit) of all objects readable in the database file and only reuse freed space after comitting successfully.
ACID transaction on systems that may be turned off at any point in time (mobile phones!) are a little tricky. The operating system may change the order of writes around and it may write data partially. Basically any write operation of an integer to a file may end up writing the wrong value if only some of the four bytes are written.
To simulate this problem we enhanced our CrashSimulatingTestCase to write complete trash on any write between two flush operations. At first this test corrupted database files with our new "only two flushes" commit algorithm. There were a couple of hen-egg problems that we had to solve to get the BTreeFreespaceManager to work. It also made sense to review the FileHeader system completely. We made the header supersafe by actually writing four versions(2 old, 2 new) of the header information. This way even accidental checksum equality can't hurt.
Now all tests with the CrashSimulatingTestCase run GREEN and we are quite sure that the new system is just as solid as the old one.
The combination of all of the above makes db4o 8.0 the fastest db4o version ever.
Enjoy db4o!
...hopefully as much as we did working on this cool fast new stuff.
Your db4o team here at Versant.